Detection of Key Residues Involving Functional Divergence into the Translation Elongation Factor Tu/1A Family Using Quantitative Measurements for Specific Conservation of Protein Subfamilies

Delivery of an aminoacyl-tRNA to the ribosomal A-site during protein biosynthesis is mediated by elongation factor Tu/1A (EF-Tu/1A). This function is inferred as a common function of the EF-Tu/1A family. Moonlighting functions and several functional divergences are speculated in the EF-Tu/1A molecules such as actin and fibronectin binding functions. Two variant eEF1A forms, referred to as eEF1A1 and eEF1A2, are surmised to have different actin binding affinities. Mycoplasma pneumoniae EF-Tu has higher fibronectin binding affinity than M. genitalium EF-Tu. Incidentally; quantitative description for specific conservation of protein subfamilies could be helpful for assessment of the functional differences. Our paper defines two types of variability measurements of a multiple sequence alignment site. One is based upon a substitution matrix and sequence weights. The other is based upon information entropy. These variabilities are converted into a specific conservation score by the comparison of different two groups in the evolutionary branches including a target protein. Our paper describes whether the conservation score can divide different residues between two sequences with functional differences into the actin or fibronectin binding residues and the others. The result shows that the functional divergence involving the actin and fibronectin binding functions of the EF-Tu/1A molecules highly correlates with the evolutionary branches supposedly dividing their sequences. This implies that an inherent property of amino acids is an essential factor for the functional differences of the actin and fibronectin binding residues. Our paper describes one possible story for identification of key residues involving functional divergences of the human eEF1A1 and eEF1A2 molecules under the conjecture that the property of amino acids is critical. We expect that such quantitative approach is effective for further assessment of functional differences of the EF-Tu/1A subfamilies and helpful for detection of key residues involving functional divergence of protein families.


Introduction
The well-known function of the translation elongation factor Tu (EF-Tu) and eukaryotic translation elongation factor 1A (eEF1A) molecules is a carrier of aminoacyl-tRNAs to the A site of the ribosome in bacteria and eukaryotes, respectively [1]. Then, the EF-1A·GTP·tRNA complex elongates a new polypeptide chain upon the ribosome. Such an elongation event triggers the ribosome to induce GTP hydrolysis. The inactive GDP forms of the eEF1A molecule are recycled to the active GTP form by guanine nucleotide exchange factors including eukaryotic translation elongation factor 1Bα (eEF1Bα) [2]. These functions are inferred as common functions of the EF-Tu/1A family. Some species have several eEF1A encoding genes. There are two variant human eEF1A forms, referred to as eEF1A1 and eEF1A2, shared with ~96% similarity between their protein sequences [3]. Both eEF1A proteins seem to exhibit similar translation activities although they have different binding affinities for the GTP/GDP and eEF1Bα molecules [4,5]. The expression patterns of the two isoforms are exclusive in the human tissues [6,7]. eEF1A2 is highly expressed in skeletal muscle, heart muscle and brain whereas eEF1A1 is lowly expressed in these tissues but highly expressed in other tissues like lung, liver, and placenta. Loss of expression of eEF1A2 in mice has an effect on motor neuron degradation [8,9].
Moonlighting functions have been observed in the eEF1A proteins [10,11]. In around 1990, Dictyostelium discoideum eEF1A was identified as an actin-binding protein [12]. Gross et al. [13] have identified actin binding residues by site-directed mutagenesis studies of the yeast eEF1A molecule. Saccharomyces cerevisiae has two eEF1A encoding genes [14]. Overexpression of these genes resulted in defective budding and enlarged cells [15]. eEF1A2 proteins induce filopodia production in rodent and human cell lines [16]. This indicates an important role for eEF1A2 in controlling actin remodeling and cell motility. eEF1A2 is likely to be an important human oncogene [17]. Overexpression of the eEF1A2 protein causes ovarian cancer probably triggered by inhibition of apoptosis [18]. The protection mechanism exerted by the eEF1A2 protein may correlate with regulation of caspase-3 activation whereas the increase in eEF1A1 protein levels may facilitate rapid death of cells [19]. Elucidation of the complete mechanism how the eEF1A proteins regulate these cellular processes would be a painstaking task because they have numerous moonlighting functions.
Some bacteria seem to capitalize the membrane localized EF-Tu to infect their host cell. Mycoplasma species can be found in many different hosts but individual species have a strict host, organ and tissue specificity [20]. Mycoplasma pneumoniae is an established pathogen of the human respiratory tract [21] but has also been isolated from the urogenital tract [22]. M. genitalium, an emeging sexually transmitted disease pathogen [23], has also been associated with the human respiratory tract [24]. The tissue specific tropisms could be determined by genetic distinctions between them. Balasubramanian et al. [25] have shown that M. pneumoniae EF-Tu (MpEF-Tu) has higher fibronectin binding affinity than M. genitalium EF-Tu (MgEF-Tu). They have identified fibronectin binding residues by site-directed mutagenesis studies of the MpEF-Tu molecule.
Currently, many sequence data are accumulated into protein sequence databases. This makes more attractive to capture functional information from their sequences. There are many scoring methodologies for a multiple sequence alignment site of protein families [26]. Lichtarge et al. [27] have proposed an evolutionary trace method. This method can detect common functional surfaces of a protein family. Landgraf et al. [28] have introduced a specific conservation score based upon the variability consisting of a substitution matrix and sequence weights. The conservation score is effective not to detect common functional sites but to estimate specific sites of a protein subfamily. Mihalek et al. [29] have proposed the variability based upon information entropy. They have also proposed integrating methodology for quan-titative measurements using all evolutionary branches of a protein family or only evolutionary branches including a target protein.
Our paper describes whether the specific conservation score can detect both actin and fibronectin binding residues which have been already identified by wet laboratory experiments of the EF-Tu/1A molecules. We define a novel scoring methodology, which is developed in the process of the verification, for multiple sequence alignment sites. Then, we discuss the inherent property of the actin and fibronectin binding residues and propose one possible story for assessment of key residues involving functional divergences between the human eEF1A1 and eEF1A2 molecules.

Data collection
1,024 entries were taken from UniProtKB/Swiss-Prot release 2013 11 (541,762 entries) [30] by searching for "GTP-binding elongation factor family. EF-Tu/EF-1A subfamily" in the annotation of protein simi-larities. 41 entries were excluded because 39 entries were annotated as a fragment and 2 entries included "X" in their sequences. Consequently, 983 entries were retained as a protein set of the EF-Tu/1A family.

Multiple sequence alignment and phylogenetic tree
Multiple sequence alignment of the protein set was performed by using the MAFFT 7 program [31]. Construction of a distance matrix was performed by the PHYLIP protdist program [32] with the Jones-Taylor-Thornton model [33] as an amino acid substitution model. A phylogenetic tree was written by the unweighted pair group method with arithmetic mean (UPGMA) using the PHYLIP neighbor program. The sequence alignment and phylogenetic tree were visualized by the Discovery Studio 3.1 software package [34] and the MEGA 5.2 software package [35], respectively. The 983 sequences were divided into groups at each node of the phylogenetic tree.

Variability based upon a substitution matrix and sequence weights
This paper employed the variability measure which was proposed by Landgraf et al. [28]. The nth alignment site of a group g of variability , sw g n V was defined as , , , S a a is a substitution score between amino acids , g i a and , g j a and , g i W is a weight of the ith sequence of the group g. The sequence weight was estimated by the Sibbald and Argos methodology [36]. Then, random sequences were created by 100,000 iterations. The Gonnet substitution matrix was employed as a mutation model of amino acids [37].

Variability based upon information entropy
This paper employed another variability measure whose basis was proposed by Mihalek et al. [29]. The nth alignment site of a group g of variability , ie g n V was defined as: 21 , , , g n a f is frequency of an amino acid of the nth alignment site of the group g and N g is the number of sequences in the group g. Here, a gap site is regarded as an extra amino acid.

Conservation score
The nth alignment site of a base group b g and a target group t g of a conservation score C n (b g , t g ) was defined as , , 1 ( , ) 1 1 This paper dealt with the case that the base group includes the target group and employed both variabilities described above.

ROC curve
Construction of a receiver operating characteristic (ROC) curve and calculation of an area under the curve (AUC) were conducted by the pROC R package [38]. A heat map of the AUC was created by the matplotlib Python package [39].

Alignment and grouping of the EF-Tu/1A family
There were 831 sites into the multiple sequence alignment of the 983 EF-Tu/1A sequences. After the phylogenetic tree of the EF-Tu/1A family was written by using all alignment sites, division of each node created groups containing sequences of EF-Tu/1A subfamilies. Figure 1 shows that the human eEF1A1 and eEF1A2 nodes are divided into the 21st node when numbers 1 to 28 is assigned to each node from the root to the human eEF1A1 node. Figure 2 shows that there are 36 different sites between the human eEF1A1 and eEF1A2 sequences. Figure 3A shows that the conservation score set between the 21st and 22nd nodes is the second highest score set. The highest score set is the score set between the 6th and 8th nodes. Figure 3B shows that the conservation score set between the 21st and 22nd nodes is the highest score set. The second highest score set is the score set between the 6th and 8th nodes. Table 1 show that each value between the 21st and 22nd nodes is higher than each value between the 6th and 8th nodes in both variabilities. Figure 4 shows that the MpEF-Tu and MgEF-Tu nodes are divided into the 23rd node when numbers 1 to 24 are assigned to each node from the root to the MpEF-Tu node. Figure 5 shows that there are 13 different sites between the MpEF-Tu and MgEF-Tu sequences. Figure  6A shows that the conservation score set between the 23rd and 24th nodes is the second highest score set. The highest score set is the score set between the 7th and 9th nodes. Figure 6B shows that the AUC value of the conservation score set between the 23rd and 24th nodes is 0.5. The highest score set is the score set between the 6th and 10th nodes. Table 2 shows that each value between the 23rd and 24th nodes of both variabilities is higher than each value between the 7th and 9th nodes and the 6th and 10th nodes.

Conservation scores based upon two types of variabilities
Landgraf et al. [28] have proposed the variability based upon a substitution matrix and sequence weights for quantitative description of a multiple sequence alignment site. The range of the variability is 0.0 to 1.0. The higher the value implies, the more the alignment site consists of hardly substituting amino acids.
Mihalek et al. [29] have proposed the variability based upon information entropy. The range of the variability is 0.0 to ln 20. The higher the value implies, the more the alignment site consists of various letters. In our paper, two changes have been added to this variability. One is that a gap site is regarded as an extra amino acid. The other is that the base of the logarithm is 21. Hereby, the range of this variability is 0.0 to 1.0.
Landgraf et al. [28] have also proposed a specific conservation score using their variability. The score is obtained from two variabilities. One is the variability obtained from all sequences of a multiple sequence alignment site. The other is the variability obtained from a group, which includes a target protein, defined by a node of the phylogenetic tree. The range of the score is -0.5 to 1.0. The higher the score implies, the more the variability of the group relatively decreases than the variability of all sequences. They have discussed influences of grouping by setting to a threshold of the specific conservation score using random sequence clusters and differences of the residues with above the threshold using several groups. However, the conservation score obtained from their grouping method strongly depends upon how to construct a protein sequence set. Our study have employed comprehensive grouping of the evolutionary branches of a target protein. Then, we have analyzed whether the conservation score set can divide different residues between two sequences with functional divergence into the residues involving the functional difference and the others. Hereby, we could determine an evolutionary divergence highly correlating with the functional divergence.

Actin binding residues
Our study explores two EF-Tu/1A subfamilies, which make both specific conservation scores of 2 actin binding residues relative higher values than other 34 different residues between the human eEF1A1 and eEF1A2 molecules, in the evolutionary branches including the human eEF1A1 node. Conservation score sets of both variabilities with the first and second highest AUCs have included the score sets between the 21st and 22nd nodes of Figure 1. This implies that the functional divergence involving actin binding between the human eEF1A1 and eEF1A2 molecules highly correlates with the node dividing the human eEF1A1 and eEF1A2 sequences. Figure 2 shows that there are 29 sites consisting of unique letters in the sequences of the 22nd node of Figure  1. This implies that the value of the conservation score depends upon the substituting score of amino acids. This suggests that the property of amino acids is an essential factor for the functional difference of the actin binding function.
Conservation score sets of both variablities with the first and second highest AUCs have also included the score sets between 6th and 8th nodes. Comparisons of the conservation score sets of 6th and 8th nodes and 21st and 22nd nodes show that the former values are considerable lower than the latter values. This shows that the conservation score sets between 6th and 8th nodes are interesting score sets although the high correlations might happen to obtain. Table 1 shows that the first and second highest values of 36 conservation scores based upon information entropy are 0.25547 and 0.19407, respectively. The human eEF1A1 and eEF1A2 residues with these conservation scores include the 2 actin binding residues. The values of other 27 residues are also 0.19407. This shows that we cannot describe differences of the 27 residues by using the conservation scores based upon information entropy.

Key residues involving functional divergence
Conversely, there are various values in the conservation scores based upon a substitution matrix and sequence weights. We propose one possible story for assessment of functional differences of the human eEF1A1 and eEF1A2 molecules under the conjecture that the inherent property of amino acids is critical for their divergence. Here, the higher conservation score implies, the more the property of amino   Table 1: Four conservation score sets with the first and second highest AUC in 378 (= 28 C 2 ) conservation score sets of both variabilities. The "X" letters denote the human eEF1A1 and eEF1A2 residues which are equivalent to the N329 and K333 residues involving actin binding of the yeast eEF1A molecule [13]. The threshold shows the arithmetic mean between two conservation scores whose sum of the sensitivity and specificity becomes maximum when the threshold is used as a cutoff value of the conservation score set. acids of the site is changed. Soares et al. [40] have created structure models of the human eEF1A1 and eEF1A2 proteins by homology modeling using the yeast eEF1A as a template. They have compared interaction surfaces and different residues between the human eEF1A1 and eEF1A2 proteins based upon location of the three-dimensional structures. Table 3 shows hypothetical functional residues described by Soares et al. [40]. The N197H, A326C, and F393S residues have the highest values in the hypothetical residues involving guanosine binding, actin binding, and phosphorylation, respectively. This shows that quantitative measurement by the conservation score allows us to determine key residues involving functional divergences of protein subfamilies. However, since the story is only a possibility, further assessment is necessary for identification of the key residues beyond the actin binding function.

Fibronectin binding residues
Our study explores two EF-Tu/1A subfamilies, which make specific conservation scores of 6 fibronectin binding residues relative higher values than other 7 different residues between the MpEF-Tu and MgEF-Tu molecules, in the evolutionary branches including the MpEF-Tu node. The conservation score set with the second highest AUC is the score set between the node dividing the MpEF-Tu and MgEF-Tu sequences and the MpEF-Tu node in the variability based upon a substitution matrix and sequence weights. This implies that the functional divergence involving fibronectin binding between the MpEF-Tu and MgEF-Tu molecules highly correlates with the node dividing the MpEF-Tu and MgEF-Tu sequences. The value of the conservation score depends upon the substituting scores of amino acids because the variability of a leaf node is 0. This suggests that the property of amino acids is an essential factor for the functional difference of the fibronectin binding function.
Balasubramanian et al. [25] have created the structure model of the MpEF-Tu protein by homology modeling using the Thermus thermophilus EF-Tu as a template. The fibronectin binding residues, Figure 3: Two heat maps of areas under the ROC curve of 378 (= 28 C2) conservation score sets using variabilities based upon (A) a substitution matrix and sequence weights and (B) information entropy. The AUC of the conservation score set of 36 different residues between the human eEF1A1 and eEF1A2 sequences was calculated by using the N331S and M335Q residues, which are equivalent to the N329 and K333 residues involving actin binding of the yeast eEF1A molecule [13], as learning data. The positive direction of the learning data defined as ascending order of the 13 conservation scores. which are divided into binding regions 1 and 2, are surface accessible residues in the three-dimensional structure. Our result shows that these surface exposed residues have very high correlation with the conservation score sets because Table 2 shows the large AUC values.
The conservation score sets with the highest AUCs are the  The AUC of the conservation score set of 13 different residues between the MpEF-Tu and MgEF-Tu sequences was calculated by using the M193, N194, E204, S343, P345, and T357 residues involving fibronectin binding of the MpEF-Tu molecule [25] as learning data. The positive direction of the learning data defined as ascending order of the 13 conservation scores. score sets between 7th and 9th nodes and 6th and 10th nodes in the variabilities based upon a substitution matrix and sequence weights and information entropy, respectively. Comparisons of the conservation score sets of these nodes and 23rd and 24th nodes show that the former values are considerable lower than the latter values. This shows that the Residues C n sw (7, 9) C n sw (23, 24) C n ie (6, 10) C n ie ( Table 2: Two conservation score sets with the first and second highest AUC of the variability based upon a substitution matrix and sequence weights and two conservation score sets with the highest AUC and between the 23rd and 24th nodes of the variability based upon information entropy in 276 (= 24 C 2 ) conservation score sets, respectively. The "X" letter denotes the MpEF-Tu residue involving fibronectin binding [25]. The threshold shows the arithmetic mean between two conservation scores whose sum of the sensitivity and specificity becomes maximum when the threshold is used as a cutoff value of the conservation score set.
conservation score sets between former nodes are interesting score sets although the high correlations might occur by chance.

Conclusion
Our paper has described quantitative measurements for specific conservation of protein subfamilies. We have developed an exhaustive grouping approach, which is subject to all evolutionary branches involving the target nodes, to calculate the conservation scores. This approach has shown that the functional divergences involving actin and fibronectin binding of the EF-Tu/1A molecules highly correlate with the evolutionary branches supposedly dividing their sequences. Such quantitative descriptions, which are based upon a substitution matrix and sequence weights and information entropy, could be effective for objective evaluation of amino acid residues. We expect that such scoring methodology for multiple alignment sites prompts identification of key residues involving functional divergence of protein subfamilies or subtypes.  Table 3: Hypothetical residues estimated from the human eEF1A1 and eEF1A2 structure models and potential phosphorylatable residues [40].