Department of Biological Sciences, Birla Institute of Technology & Science, Pilani, India
Received date: April 13, 2015; Accepted date: May 09, 2015; Published date: May 11, 2015
Citation: Runthala A (2015) Non-Linear and Misleading Template Scoring Criteria: Root Cause of Protein Modelling Inaccuracies. Curr Synthetic Sys Biol 3:121. doi:10.4172/2332-0737.1000121
Copyright: © 2015 Runthala A. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Template-based protein modelling is currently the most accurate and trustworthy method for predicting protein conformations, and it helps bridge the steadily widening gap between the number of experimentally solved protein structures and the number of known protein sequences. Even our best knowledge-based, template-driven prediction algorithms cannot consistently select the best-scoring template(s) needed to construct a highly accurate protein model. The mutually contradictory nature of the generic template assessment and selection scores in current use makes this essential modelling step tricky and prone to chance. This article briefly investigates how the degrees of freedom inherent in a template selection measure affect the accuracy of the constructed protein models. Several logical guidelines that are normally overlooked in a protein modelling task are analyzed and should be routinely considered. A more reliable and robust scoring measure is therefore required to select the best available template for constructing the most accurate target conformation.
Protein; CASP; Modelling; Template; Assessment
Functional study of proteins relies on accurate knowledge of their structural details. Structure determination methodologies, aimed at constructing accurate conformations, face several technical and monetary limitations. Protein modelling algorithms therefore come to the rescue by quickly predicting accurate structures. The modelling accuracy for a protein sequence is measured by how close the predicted model is to the native structure. A Template Based Modelling (TBM) algorithm employs the structural information of solved protein structures (templates), available in the Protein Data Bank (PDB), to span as much as possible of the target (the protein sequence considered for modelling). Gapped or unaligned segments in such target-template alignments are probably the result of insertions and deletions (INDELs), primarily caused by evolutionary pressure, and are modelled in one of two ways. Such segments are normally modelled through the physical principles of protein folding to build the lowest-energy conformation. Alternatively, distantly related or dissimilar templates are employed to model the target chunk that is not spanned by the selected templates, so that an overall model can be constructed. Algorithms employing correct and biologically significant templates have been shown to construct fairly accurate models. A generic TBM algorithm comprises several steps, among which template identification and selection is of paramount importance. The selected template(s) are then aligned with the target sequence to construct a model, and such predictions are normally employed for several cellular applications. Many tools also employ PDB culling [8-19] at a sequence identity threshold. Because culling compares the templates with one another rather than with the target, it may yield a high-scoring hit that is nevertheless structurally too distant from the topology of the target sequence.
To avoid the modelling errors caused by an incorrect structure that is not functionally or biologically related to the target, a reliable set of template(s) is normally selected through the scoring measures described below. However, these measures are often antagonistic and do not consistently agree on the best template as the top-ranked hit. Faced with this multi-dimensional scoring problem, modellers fall back on a single preferred scheme, which is mostly the E-value score. As a result, false positive and spurious templates that share significant homoplasic sequence similarity with the target are not reliably distinguished and filtered out from the set of actually relevant templates. This is illustrated in Tables 1 and 2, which list the modelling accuracy of the randomly chosen CASP8 targets T0423 and T0428 for several significant templates found by HHPred. Targets T0423 and T0428 are 110 and 267 residues long, respectively. Following the structural domain definitions employed by CASP, the near-native accuracy of the target models was assessed over 97 residues (2-98) and 229 residues (20-248), respectively.
| Template | Resolution | Length | Sequence Identity | Average gap length | BLOSUM score | Mismatch residues | Coverage span | TM_Score | GDT-TS | RMSD |
Table 1: Inconsistent template selection scoring results, showing their non-linear relationship with the most credible measures, GDT-TS and TM_Score, for the 110-residue CASP8 target T0423.
| Template | Resolution | Length | Sequence Identity | Average gap length | BLOSUM score | Mismatch residues | Coverage span | TM_Score | GDT-TS | RMSD |
Table 2: Inconsistent template selection scoring results, showing their non-linear relationship with the most credible measures, GDT-TS and TM_Score, for the 267-residue CASP8 target T0428.
Target-template length difference
An ideal template is expected to encode the same number of residues as the target sequence. This assumption works fairly well for a single-domain template and a short target sequence. However, such cases are rare because of domain insertion, duplication or deletion in the templates, and so a stretch of a single structural domain of a template may sometimes provide the best structural information for a target sequence. Whether a given hit is actually the best template for a target sequence thus becomes a dilemma.
Substitution matrices, which score the feasibility of the aligned residue substitutions in a target-template alignment, are considered credible. However, the score becomes misleading when, for example, a template residue has mutated from an ancestral Alanine (A) to Glycine (G): aligned against a target residue G at that locus, it is counted as a sequence identity even though the match is a false positive. Such homoplasy, together with the context-specific localization of residues at particular sites in a protein, makes the reliability of a selected template questionable. Such algorithms have therefore been improved several times to consider structural information along with the sequence context information [20-23]. Sequence similarity is well understood to be a good score for selecting reliable templates, but it is not always correct and the most similar hit is not always the best one. It is still employed in most modelling algorithms, although it remains far from being the best scoring measure for a target sequence.
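As a minimal sketch of how a sequence identity score is typically derived from a gapped pairwise alignment, the Python snippet below counts identical residues over the aligned (non-gap) columns. The function name and the toy alignment are hypothetical illustrations and are not taken from any CASP pipeline. Note that the count cannot tell a homoplasic G/G match from a genuinely conserved one, which is the weakness described above.

```python
def sequence_identity(target_aln: str, template_aln: str) -> float:
    """Fraction of aligned (non-gap) columns holding identical residues."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns where both sequences carry a residue.
    aligned = [(a, b) for a, b in zip(target_aln, template_aln)
               if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return matches / len(aligned)


# Hypothetical toy alignment ('-' marks a gap).
target = "MKG-LAV"
template = "MRGSLA-"
identity = sequence_identity(target, template)
print(f"sequence identity: {identity:.2f}")
```

Normalising by the number of aligned columns is only one convention; dividing by the target length or the shorter sequence length gives different values for the same alignment.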
Differences in the proportions of encoded amino acids between the target and a template sequence form the basis of this scoring scheme. The smaller the difference between the residue compositions of the target and template sequences, irrespective of their alignment, the higher the presumed reliability of the template as a phylogenetically close and credible hit for the target. CASP8 algorithms including CADCMLAB and COMA, CASP9 algorithms including DCLAB and FLYPRED, and CASP10 algorithms including samcha-server and TSAILAB employed such a scoring measure to rank the templates against the considered target sequence.
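An alignment-free composition comparison of this kind can be sketched in Python as follows. The specific distance used here (half the summed absolute difference of residue frequencies, which ranges from 0 to 1) is an assumption chosen for illustration; the text does not state which formula the cited CASP groups used.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq: str) -> dict:
    """Relative frequency of each of the 20 standard residues in seq."""
    counts = Counter(r for r in seq.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def composition_distance(seq_a: str, seq_b: str) -> float:
    """Assumed metric: 0.0 for identical compositions, 1.0 for disjoint ones."""
    comp_a, comp_b = composition(seq_a), composition(seq_b)
    return 0.5 * sum(abs(comp_a[aa] - comp_b[aa]) for aa in AMINO_ACIDS)


# Hypothetical toy sequences: same composition, different residue order.
print(composition_distance("AAGG", "GGAA"))
```

Because the residue order is ignored, two unrelated sequences with the same composition receive a distance of zero, so a measure of this type can only support, not replace, alignment-based scores.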
This measure, although considered a highly reliable scoring criterion, is also impaired by homoplasy. Structural and functional similarity of templates is therefore normally used to select the best template from a set of redundant hits for a target sequence. Such structural similarity is also commonly employed in phylogenetic studies of protein structures. However, when several hits share an almost equal sequence identity with the target, their credibility becomes doubtful and it may be difficult to select the most similar one. This was well recognized by Jones-UCL in CASP9. The template culling step, which keeps only the most significant template among similar structures for a target, complicates the process further: template selection is meant to screen the hits against the target, whereas culling compares them with one another and may exclude the hit that is actually the closest and most reliable. This scoring scheme was used by several groups including CaspIta and COMA in CASP8, ATOME2_CBS and FAMSD in CASP9, and ATOME2_CBS and CASPita in CASP10.
Coverage span, seemingly a reliable measure, depends on the alignment constraints and the employed scoring scheme. A shorter and a longer sequence, if aligned over an almost complete coverage span, make the corresponding template an attractive choice. However, a second template with the same coverage span and a higher sequence identity makes the latter hit the more favorable one. In yet another case, if the actually correct template is evolutionarily closer to the target but has a smaller coverage span, its selection again becomes a dilemma for the modeller. This measure has been used by several algorithms including BAKER-GINZU and PRO-SP3-TASSER in CASP8, Firestar and GSmetadisorder in CASP9, and CASPita and HHPred in CASP10.
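The coverage span of a template can be sketched in Python as the fraction of target residues that are aligned to a template residue. The function name and the toy alignments below are hypothetical; actual tools may define coverage over the first-to-last aligned position instead of counting aligned residues.

```python
def coverage_span(target_aln: str, template_aln: str) -> float:
    """Fraction of target residues aligned to a template residue."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    target_len = sum(1 for a in target_aln if a != "-")
    if target_len == 0:
        raise ValueError("target contains no residues")
    covered = sum(1 for a, b in zip(target_aln, template_aln)
                  if a != "-" and b != "-")
    return covered / target_len


# Hypothetical hits: one spans the whole target with few identities,
# the other matches perfectly but spans only half of it.
target_aln = "MKGLAVTE"
full_hit = "MRSLTVSD"
partial_hit = "--GLAV--"
cov_full = coverage_span(target_aln, full_hit)
cov_partial = coverage_span(target_aln, partial_hit)
print(cov_full, cov_partial)
```

The two toy hits reproduce the dilemma described above: ranking by coverage alone prefers the first hit, whereas ranking by identity alone prefers the second.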
Alignment score, E-value
These scores depend on the quality of the alignment. An alignment score and an E-value, the number of hits with an equal or better score expected by chance in a search of the database, are good factors for discriminating between highly correct and poor templates. The alignment score of a template hit is also computed through target-template HMM profile comparison. However, these measures fail to select the best template precisely from a set of similar, almost equally scoring hits, and so other significant assessment measures are still needed to model a highly accurate protein conformation for a target. In CASP8, FAMSD, FAMSSEC, sbtJ, Yuan-Chen-Kihara and ZHOUSPARKS-X relied mainly on this scoring scheme. It has been used by many algorithms including Distill in CASP8, Distill and TASSER in CASP9, and TASSER and Distill in CASP10.
For a protein structure, the resolution reflects the experimental quality of the data obtained from the crystal. A crystal of regularly packed protein molecules yields a diffraction pattern, and a highly ordered structure shows a resolution of around 1 Å. High-resolution structures solved by X-ray crystallography are normally believed to be the best ones. However, resolution is not always a suitable criterion for selecting the best set of templates. A template protein, even one with a poorer resolution, may still provide structural information for several target residues that are not spanned by the already selected templates, and may thus be a fairly reliable conformation.
These measures, listed in Tables 1 and 2 in order of TM_Score accuracy, are quite heterogeneous and their reliability varies considerably. Errors caused by selecting a seemingly reliable but actually evolutionarily distant hit should therefore be tackled properly, as several modelling algorithms have recently attempted. The servers that try to solve this issue by computing multiple sequence profiles are laborious and time-consuming, and still cannot predict highly accurate models consistently [41-43].
The template search and selection step of a routinely employed protein modelling algorithm should be carefully scrutinized. Several knowledge-based expert guidelines, listed and explained in logical order below, should be routinely considered when selecting the best template(s) for a target sequence.
Maximum informative MSA profile
Iterative template search rounds are employed by several algorithms, including HHPred, to find the set of templates that is significantly relevant for a target sequence. Running the maximum allowed number of iterations, although computationally expensive, brings even distantly related hits into consideration and should therefore normally be employed. It evaluates the evolutionary consensus probability of residue substitutions across the target sequence in the screened list of hits, which helps to prioritize and accurately rank the correctly related templates.
A hit with a considerably low E-value is normally considered a good template for a target sequence, and this criterion quite reasonably selects the hit with the lowest E-value as the best one. However, the same relationship cannot be extrapolated to the other meaningful templates, because a hit with a very poor E-value might be sequentially divergent and yet structurally and functionally relevant for the target sequence. Hence, discarding templates solely because of their poorer (numerically higher) E-values is not normally advised.
Score secondary structure of target
A target sequence may have diverged considerably from functionally similar templates that are nevertheless excellent structural resources. Hence, given a reasonably correct alignment, the similarity between the predicted secondary structure of the target sequence chunks and the secondary structure of the aligned template segments provides reliable homology information. This constraint also assists the construction of a reasonably accurate target-template alignment.
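The secondary structure comparison can be sketched in Python as the fraction of aligned positions whose secondary structure states agree. The three-state strings (H for helix, E for strand, C for coil), the function name and the toy data are assumptions made for illustration; in practice the target states would come from a prediction tool and the template states from its solved structure.

```python
def ss_agreement(target_aln: str, template_aln: str,
                 target_ss: str, template_ss: str) -> float:
    """Fraction of aligned columns with matching secondary structure states.

    target_ss and template_ss hold one state per residue of the ungapped
    sequences, so their lengths must match the ungapped lengths.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    if len(target_ss) != len(target_aln.replace("-", "")):
        raise ValueError("target_ss length must match ungapped target")
    if len(template_ss) != len(template_aln.replace("-", "")):
        raise ValueError("template_ss length must match ungapped template")
    i = j = 0                  # residue indices into the ungapped sequences
    aligned = matches = 0
    for a, b in zip(target_aln, template_aln):
        if a != "-" and b != "-":
            aligned += 1
            if target_ss[i] == template_ss[j]:
                matches += 1
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return matches / aligned if aligned else 0.0


# Hypothetical toy alignment and three-state strings.
agreement = ss_agreement("MKGL-AVT", "MRG-SAVT", "CHHHEEC", "CHCCEEC")
print(f"secondary structure agreement: {agreement:.2f}")
```

A high agreement on a low-identity alignment is the situation described above, in which a divergent template is still a credible structural resource.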
Local as well as global alignment consideration
Protein sequence information is significantly lost when gaps are placed inaccurately during the construction of an optimally scoring and biologically meaningful alignment. Hence, gaps must be carefully cross-checked in the target-template alignment. Gap segments longer than five positions should be avoided, especially if they are not at the periphery, because ab-initio modelling of such chunks might disturb the orientation and topology of the adjoining residues, particularly if these encode a secondary structure element. An alignment that places the residues optimally in terms of both their local and global functional significance should therefore be employed to model the target sequence correctly. The most biologically meaningful alignment is not always the one with the mathematically best score; it may instead be a sub-optimal alignment with a comparatively inferior score.
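The gap check can be sketched in Python as a scan for gap runs that exceed a length threshold and do not touch either end of the aligned sequence. The function name, the default threshold parameter and the toy alignment row are hypothetical; the threshold of five follows the guideline stated above.

```python
import re


def long_internal_gaps(aligned_seq: str, max_gap: int = 5) -> list:
    """Return (start, length) for each internal gap run longer than max_gap."""
    flagged = []
    for match in re.finditer(r"-+", aligned_seq):
        length = match.end() - match.start()
        # Runs touching either end of the row are peripheral, not internal.
        at_periphery = match.start() == 0 or match.end() == len(aligned_seq)
        if length > max_gap and not at_periphery:
            flagged.append((match.start(), length))
    return flagged


# Hypothetical aligned row: a 6-gap overhang, a 7-gap internal run,
# a short 2-gap run and a 3-gap overhang.
row = "-" * 6 + "MKG" + "-" * 7 + "LAV" + "-" * 2 + "TE" + "-" * 3
flagged = long_internal_gaps(row)
print(flagged)
```

Running the check on both rows of an alignment flags long insertions as well as long deletions relative to the target.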
Functional significance of templates
A target sequence normally encodes at least one functional domain, which may be sequentially and structurally continuous or discontinuous. Hence, a target sequence must be screened for the presence and location of such domains through databases including PFAM and CDD, and the functionally similar homologues and orthologues must then be assessed as reliable hits through the other scoring measures. Considering structurally and functionally significant sequence information in this way makes the best use of the evolutionarily reliable sequence information of the templates, and thus helps to predict highly accurate near-native protein models for both the conserved local structural segments and the complete model.
Employing all culled PDBs
All the culled PDB structures should be considered along with the selected functionally similar and reliable representative hits when selecting the best possible set of templates for a target sequence. A culled PDB structure might actually be the one that is evolutionarily and structurally closest to the target sequence, and hence the complete set of related PDB structures must be considered.
Fixing the best set of templates
The best possible set of mutually and structurally complementary templates is essential to model a highly accurate protein structure. The scoring measures and template selection constraints discussed above must therefore be carefully applied, through correctly computed pairwise and multiple sequence alignments, to fix the best possible templates for maximally spanning the target sequence. The best hit, scoring significantly on the majority of the aforementioned measures, must then be employed to seed the construction of a highly accurate MSA. This MSA should in turn be employed for screening the hits to maximally span the target.
The template search and selection criterion, being the backbone of building highly reliable models, needs a well-developed template ranking system. Selecting reliable templates is thus the foremost prerequisite for constructing highly accurate protein models. For this reason, CASP server models are often poorer topology predictions and less accurate than the carefully justified human models. A robust template selection algorithm, encompassing the best of these scoring measures, is required to distinguish the actually relevant templates from the spurious hits and thus avoid the modelling errors caused by incorrect template(s).
Most of the template search and selection criteria seem to be parallel or mutually convergent once their benchmarked weights are considered. A robust algorithm with an optimally weighted consideration of most of these measures is thus required to rank the credibility of a template reliably. However, any such template ranking algorithm will fail when an assigned weight results in a false positive ranking of templates. Weighting raises the credibility of one selection measure well above the others, and a marginal change in the weighted factor significantly suppresses the contribution of the other template scoring measures. The template scoring results of the significant hits found by HHPred for the CASP8 targets T0423 and T0428, listed in Tables 1 and 2, illustrate this problem. All the aforementioned template selection measures are listed in these tables, and their scores are not always parallel to the TM_Score computed against the actual native conformation. A template selected solely on the basis of a single scoring measure may not always be the best structural hit, and so a more reliable and statistically robust template scoring measure is required to pave the way for a consistently successful modelling algorithm.
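The sensitivity of a weighted ranking to its weights can be sketched in Python as follows. The template names, the two measures and all numeric values are hypothetical and are not taken from Tables 1 and 2; the scores are assumed to be already normalised to the range 0 to 1, with higher values being better.

```python
def rank_templates(scores: dict, weights: dict) -> list:
    """Rank template names by the weighted sum of their normalised scores."""
    totals = {
        name: sum(weights[measure] * values[measure] for measure in weights)
        for name, values in scores.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


# Hypothetical normalised scores for two candidate templates.
scores = {
    "tmplA": {"identity": 0.30, "coverage": 0.95},
    "tmplB": {"identity": 0.45, "coverage": 0.70},
}

equal_weights = {"identity": 0.5, "coverage": 0.5}
identity_heavy = {"identity": 0.7, "coverage": 0.3}

print(rank_templates(scores, equal_weights))
print(rank_templates(scores, identity_heavy))
```

With these illustrative numbers, shifting part of the weight from coverage to identity is enough to swap the top-ranked template, which is the kind of instability a benchmarked weighting scheme would need to guard against.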