Protein Functional Site Prediction Using a Conservative Grade and a Proximate Grade

Many computational methods have been developed to predict the important sites of a protein. In the era of big data, existing methods need to be improved and refined by integrating sequence data with structural data. In this paper, we aim at two things: improving sequence-based methods and developing a new method that uses both sequence and structural data. To this end, we developed a modified evolutionary trace method in which we define conservative grades, calculated from a given multiple sequence alignment, and a proximate grade, which uses three-dimensional structures to evaluate predicted active sites from the viewpoint of protein-ion, protein-ligand, protein-nucleic acid and protein-protein interactions. In other words, the proximate grade can also evaluate an amino acid residue. When we applied our method to translation elongation factor Tu/1A proteins, the results showed that the conservative grades are evaluated accurately by the proximate grade. Our approach therefore has two advantages. One is that various co-crystal structures can be taken into account for evaluation. The other is that, by calculating the fitness between a given conservative grade and the proximate grade, we can select the best conservative grade.

Journal of Data Mining in Genomics & Proteomics, ISSN: 2153-0602. Citation: Kondo Y, Miyazaki S (2015) Protein Functional Site Prediction Using a Conservative Grade and a Proximate Grade. J Data Mining Genomics Proteomics 6: 175. doi:10.4172/2153-0602.1000175


Introduction
When a protein functions, it may have a specific site that binds an ion or a molecule. Identifying binding sites is important for investigating how the protein works and how it binds ions or molecules. To identify such a site experimentally, a mutant of the protein is prepared in which an amino acid residue is replaced by another one, and the difference in binding affinity between the mutant and the wild type is then investigated. However, mutating amino acid residues one by one is time-consuming and costly. It is therefore useful to develop a method that narrows down the candidate amino acid residues.
Many computational methods exist for selecting candidate sites; they are based on (i) sequence, (ii) structure, or (iii) both sequence and structure [1][2][3][4][5][6]. Sequence-based methods usually assume that an important site is conserved against mutation, so that important sites and other sites should have mutated in different patterns. Various methods have been developed to detect such patterns [7]. One sequence-based method is based on Shannon entropy (SE) [8,9]. However, the SE-based method has three potential problems. First, the SE-based method, in which the twenty standard amino acids are regarded as characters, does not consider the properties of amino acids. Therefore, methods based on the SE of residue properties [10] or on a sum of pairs [11] were proposed. Second, the SE-based method does not consider the background distribution of amino acids. Therefore, other information-theoretic methods such as relative entropy [12] or Jensen-Shannon divergence [13] were proposed. Third, the SE-based method, in which the frequency of an amino acid is calculated, cannot take into account which amino acid occurs in which sequence. Therefore, methods based on windowing [13], weighting [14] or phylogenetic analysis were proposed. One method based on a phylogenetic tree is the evolutionary trace (ET) method [15], which has been extended to the weighted ET (WET) [16], integer-valued ET (iv-ET) and real-valued ET (rv-ET) methods [17]. Other methods based on phylogenetic trees include the ConSurf [18] and Rate4Site [19,20] algorithms.
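As a minimal sketch of the SE-based scoring discussed above, the following computes the Shannon entropy of each alignment column, optionally after mapping residues to property classes. The function names, the toy alignment and the property grouping are illustrative assumptions, not the definitions used in [8-10].

```python
import math
from collections import Counter

def column_entropy(column, groups=None):
    """Shannon entropy (in bits) of one alignment column.

    groups: optional dict mapping a residue to a property class
    (hypothetical grouping; unmapped residues are kept as they are).
    """
    symbols = [groups.get(c, c) for c in column] if groups else list(column)
    n = len(symbols)
    counts = Counter(symbols)
    # sum over symbols of p * log2(1/p), with p = k / n
    return sum((k / n) * math.log2(n / k) for k in counts.values())

def alignment_entropies(msa, groups=None):
    """Entropy of every column of an MSA given as equal-length strings."""
    return [column_entropy(col, groups) for col in zip(*msa)]

# Toy alignment of four sequences and three columns (assumed data).
msa = ["GKT", "GRS", "GKA", "GEA"]
plain = alignment_entropies(msa)

# Assumed property classes: K and R are positively charged, E negatively.
charge = {"K": "+", "R": "+", "E": "-"}
grouped = alignment_entropies(msa, charge)

print(plain)
print(grouped)
```

A fully conserved column scores zero, and grouping residues by property lowers the entropy of a column whose substitutions stay within one class, which is the motivation for the property-based variant.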
Although a variety of sequence-based methods have already been compared with each other [13,21], it is difficult to understand which differences actually matter, because these methods are not explained by a single unifying idea. Therefore, we consider a map, that is, a mathematical formula, on a multiple sequence alignment (MSA) and aim to construct an exhaustive method. As part of this effort, we propose a method that currently includes several existing methods: the method based on SE or the SE of residue properties, the method based on a sum of pairs with or without weighting, and the iv-ET and rv-ET methods.
Even if a variety of methods can be executed, how can they be evaluated? There are two possible approaches: confirmation by site-directed mutagenesis and visualization on a three-dimensional structure. The former is more consistent with the identification of binding sites, because the latter can only verify that a site is proximate to ions or molecules. Nevertheless, the latter is still used because protein functional sites are difficult to define. Accordingly, predictive ability has been evaluated on benchmark sets such as catalytic sites, ligand-binding sites or protein-protein interfaces [13]. However, the latter approach is immature because it usually uses only a single structure [15,22]. This causes two main problems. First, it neglects proteins that bind various ions or molecules, because an entry in the Protein Data Bank (PDB) [23] does not always include all states of the protein structure. Second, it cannot take account of proteins that are derived from a common ancestor, so protein structures derived from different organisms cannot be compared with each other. To solve these problems, we consider another map, which measures the proximity between amino acid residues and ions or molecules, and then integrate the two maps.

[...] where τ is a threshold [...] and where each sequence l is assigned a weight.

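Let $\mathbb{R}$ denote the set of real numbers and let there be a map $\mathbb{R}^3 \times \mathbb{R}^3 \to \mathbb{R}$, $(x, y) \mapsto \lVert x - y \rVert_2$, where $\lVert \cdot \rVert_2$ is the Euclidean norm.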
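The Euclidean distance introduced above is the ingredient needed to measure how close a residue lies to a bound ion or molecule. The sketch below takes the minimum atom-to-atom distance between a residue and a ligand; the minimum-distance rule, the 4.0 Å cutoff, the function names and the coordinates are all assumptions for illustration and are not the definition of the proximate grade itself.

```python
import math

def min_distance(residue_atoms, ligand_atoms):
    """Smallest Euclidean distance between any residue atom and any ligand atom."""
    return min(math.dist(a, b) for a in residue_atoms for b in ligand_atoms)

def proximate_residues(residues, ligand_atoms, cutoff=4.0):
    """Return the residue ids whose minimum distance to the ligand is <= cutoff.

    residues: dict mapping a residue id to a list of (x, y, z) coordinates.
    cutoff:   assumed distance threshold in angstroms (illustrative value).
    """
    return [rid for rid, atoms in residues.items()
            if min_distance(atoms, ligand_atoms) <= cutoff]

# Toy coordinates (assumed data): two residues and a one-atom ligand.
residues = {
    "R1": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "R2": [(10.0, 0.0, 0.0)],
}
ligand = [(4.0, 0.0, 0.0)]

d1 = min_distance(residues["R1"], ligand)
d2 = min_distance(residues["R2"], ligand)
near = proximate_residues(residues, ligand)
print(d1, d2, near)
```

Repeating this over several co-crystal structures of the same protein family would let a residue count as proximate if it is close to a ligand in any one of them, which is the kind of multi-structure evaluation the text motivates.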
Let us consider structure k, which contains a protein and ions or molecules. Let

Data collection
In UniProtKB/Swiss-Prot release 2015_01 [24], there were 984 entries that are annotated as 'Classic translation factor GTPase family. EF-Tu/EF-1A subfamily', do not include 'X' in the sequence and are not fragments. In the PDB, there were 68 entries that are referenced from these 984 entries and were determined by X-ray crystallography. Of these, 14 entries were excluded because they bind an immunoprotein [25] or form a chimeric protein [26][27][28][29]. Consequently, as shown in Table 1, 54 entries including 103 chains were retained.
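The three selection criteria applied to the sequence entries can be sketched as a simple filter. The record layout (the keys `family`, `sequence` and `is_fragment`) and the example records are hypothetical; they do not reproduce the UniProtKB file format or any real entry.

```python
FAMILY = "Classic translation factor GTPase family. EF-Tu/EF-1A subfamily"

def keep_entry(entry):
    """Apply the three criteria: family annotation, no 'X', not a fragment."""
    return (FAMILY in entry["family"]
            and "X" not in entry["sequence"]
            and not entry["is_fragment"])

# Hypothetical, already parsed records (assumed data).
entries = [
    {"id": "E1", "family": FAMILY, "sequence": "MGKEKTH", "is_fragment": False},
    {"id": "E2", "family": FAMILY, "sequence": "MGKXKTH", "is_fragment": False},
    {"id": "E3", "family": FAMILY, "sequence": "MGKEK", "is_fragment": True},
    {"id": "E4", "family": "Other family", "sequence": "MAAA", "is_fragment": False},
]

selected = [e["id"] for e in entries if keep_entry(e)]
print(selected)
```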

Computations of $f_1$ and $f_2$
With N = 984 and K = 103 in Figure 2, the sequences were aligned with the MAFFT 7 program [30]. Of the alignment columns $M_i$, 477 were extracted because they include residues that have coordinate data.
The difference between two sequences was computed by the maximum likelihood method [31], using the Jones-Taylor-Thornton model [32] as the substitution matrix and the Dayhoff method [33] for computing equilibrium frequencies. From the differences between all pairs of sequences, a phylogenetic tree was constructed by the unweighted pair group method with arithmetic mean (UPGMA) [34]. [...] For each entry, representative ions or molecules are shown in Table 1. However, because their functions are uncertain, we excluded the following ions or molecules: sodium ion, acetate ion, sulfate ion, ammonium ion, sugar (sucrose), di(hydroxyethyl)ether, glyoxylic acid, 5-bromofuran-2-carboxylic acid, β-mercaptoethanol and water [37][38][39][40][41][42][43].
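The tree-building step can be illustrated with a compact UPGMA sketch. It assumes a precomputed symmetric distance matrix; the toy distances below are invented and are not maximum likelihood estimates under the Jones-Taylor-Thornton model, and the tuple-based tree representation is an illustrative choice.

```python
from itertools import combinations

def upgma(labels, dist):
    """Build a UPGMA tree from a symmetric distance matrix.

    A tree is either a leaf label or a tuple (left, right, height),
    where height is half the distance at which the two clusters merged.
    """
    clusters = {i: (label, 1) for i, label in enumerate(labels)}  # id -> (tree, size)
    d = {(i, j): dist[i][j] for i, j in combinations(range(len(labels)), 2)}
    next_id = len(labels)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        # Distance to each remaining cluster: size-weighted arithmetic mean.
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((ta, tb, dab / 2), na + nb)
        next_id += 1
    return next(iter(clusters.values()))[0]

# Toy distances between three sequences (assumed data).
labels = ["A", "B", "C"]
dist = [
    [0.0, 2.0, 6.0],
    [2.0, 0.0, 6.0],
    [6.0, 6.0, 0.0],
]
tree = upgma(labels, dist)
print(tree)
```

Because UPGMA yields an ultrametric tree, cutting it at successive heights gives a nested series of sequence groupings, which is what tree-based conservation methods such as ET rely on.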

Correlations between $f_1$ and $f_2$
[...] A cutoff is applied to obtain a true positive rate, and an area under the curve (AUC) is computed. [...] Spearman's ρ is computed with a correction for ties, where $N_1$ and $N_2$ are numbers of tied ranks in $F_1$ and $F_2$, respectively.
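The two agreement measures named above can be sketched as follows. Since the exact formulas are not reproduced here, this sketch makes two assumptions: the AUC is obtained from the rank-sum (Mann-Whitney) identity, and ties in Spearman's ρ are handled by taking the Pearson correlation of average ranks, which may differ in form from the tie correction with $N_1$ and $N_2$. The function names and data are illustrative.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity (ties count half)."""
    ranks = average_ranks(scores)
    pos = [r for r, lab in zip(ranks, labels) if lab]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    return (sum(pos) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy data (assumed): a conservation score per column and a binary
# label saying whether that column is proximate to a ligand.
scores = [0.9, 0.8, 0.8, 0.2, 0.1]
is_proximate = [1, 1, 0, 0, 0]
area = auc(scores, is_proximate)
rho = spearman_rho(scores, [5.0, 4.0, 3.0, 2.0, 1.0])
print(area, rho)
```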

Visualization
The values of $f(M)$, the AUC and Spearman's ρ were visualized with the matplotlib Python package [45]. Three-dimensional structures were visualized with the VMD program [46].

Fitness between $f_1$ and $f_2$
If $x_g$, $x_h$, τ and A are the same but $G_i$ is different, Table 2 shows that [...]. In the latter case, Figure 3 shows that the AUC or Spearman's ρ tends to increase as the time point increases. Figure 4A shows that $f(M)$ [...] using a receiver operating characteristic (ROC) curve [47] in Figure 4B. Figures 4C and 4D show that the left sides tend to have small [...].

[...] the method is the method based on SE of residue properties [10]. If T = 1 is changed to T = N in the former and the latter, Figure 3 shows that the AUC changes from 0.5779 to 0.6147 and Spearman's ρ from 0.0757 to 0.1241, and that the AUC changes from 0.5709 to 0.5992 and Spearman's ρ from 0.1152 to 0.1405, respectively. Therefore, in both cases, distinguishing characters by means of the phylogenetic tree is effective for improving the AUC and Spearman's ρ.

[...] [11]. If T = 1 is changed to T = N in the former and the latter, Figure 3 shows that the AUC changes from 0.6083 to 0.6276 and Spearman's ρ from 0.1982 to 0.1653, and that the AUC changes from 0.6093 to 0.6211 and Spearman's ρ from 0.2263 to 0.1502, respectively. Therefore, in both cases, distinguishing characters by means of the phylogenetic tree is effective for improving the AUC but not Spearman's ρ. However, in the above case, if

Evaluation of predicted functional amino acid residues by $f_2$
in the former and the latter, Figure 3 shows that the AUC changes from 0.6941 to 0.7349 and Spearman's ρ from 0.4981 to 0.5650, and that the AUC changes from 0.6846 to 0.7335 and Spearman's ρ from 0.4749 to 0.5637, respectively. Therefore, in both cases, distinguishing characters by means of the phylogenetic tree and treating each gap as different are effective for improving the AUC and Spearman's ρ.
[...] the method is the iv-ET method [17]. If T = N, [...], the method is equivalent to the rv-ET method [17]. If [...] in the former and the latter, Table 2 shows that the AUC changes from 0.5896 to 0.6242 and Spearman's ρ from 0.1221 to 0.3650, and that the AUC changes from 0.6180 to 0.7417 and Spearman's ρ from 0.1308 to 0.5722, respectively. Therefore, in the former and the