ISSN: 2329-9002
Journal of Phylogenetics & Evolutionary Biology
  • Research Article   
  • J Phylogen Evolution Biol 2013, Vol 1(3): 116
  • DOI: 10.4172/2329-9002.1000116

Genome-Scale Approach and the Performance of Phylogenetic Methods

Anup Som*
Center of Bioinformatics, Institute of Interdisciplinary Studies, University of Allahabad, Allahabad-211002, India
*Corresponding Author: Anup Som, Center of Bioinformatics, Institute of Interdisciplinary Studies, University of Allahabad, Allahabad-211002, India, Tel: +91 532 2460027, Fax: +91 532 2461009, Email: [email protected]

Received Date: Apr 11, 2013 / Accepted Date: Jul 27, 2013 / Published Date: Jul 29, 2013

Abstract

A genome-scale approach to phylogenetic analysis is needed to resolve evolutionary relationships over large taxon sets and deep phylogenetic divergences. However, it is not yet clear what the strengths and weaknesses of the various phylogenetic methods are, or which method should be preferred under a genome-scale approach. In this article, the performance of five major phylogenetic methods is evaluated under a genome-scale approach using biologically realistic simulated data. The following phylogenetic methods are considered: Bayesian, maximum likelihood (ML), neighbor joining (NJ), NJ maximum composite likelihood (NJ-MCL), and maximum parsimony (MP). Simulation results show that the probabilistic methods (i.e., Bayesian and ML) are much more accurate than the NJ-MCL, MP and NJ methods. Concerning consistency, ML is more consistent than the other methods. The NJ-MCL, MP, and NJ methods are fast (i.e., computationally efficient), but their accuracy and consistency are very poor compared to the Bayesian and ML methods. The Bayesian method, on the other hand, is accurate but less consistent than the ML method, and it takes a much longer execution time. Therefore, based on accuracy, consistency and computational efficiency, the ML method is the preferred algorithm under a genome-scale approach. In addition to method performance, this study investigates several important aspects of genome-scale phylogeny, such as how concatenating genes from longest to shortest or from shortest to longest affects the methods' performance, how many datasets are needed to recover the true tree (i.e., the true evolutionary history of a group of species or genes), and whether more genes or more characters are more important. These are explained in the Results section.

Keywords: Genome-scale approach; Phylogenetic methods; Computer simulation; Topological distance; Accuracy; Consistency

Introduction

Reconstruction of evolutionary history from multiple genes is routinely conducted using a genome-scale approach in which individual gene sequences are concatenated head-to-tail to form a supergene alignment. Improved accuracy of phylogenetic inference through the concatenation of multiple sequences from the same taxon is expected on theoretical grounds [1,2] and has been found in many studies [3-9]. The genome-scale approach is widely applied in phylogenetic analysis in order to resolve evolutionary relationships over large taxon sets and deep phylogenetic divergences with greater resolution [10,11]. It has been proposed that a well-resolved Tree of Life can be achieved through concatenation of genes [12]. Judging by recent phylogenetic analyses using concatenated genes, the tendency is to combine data by default, in the hope that the weight of corroborative evidence will resolve any conflicts [4-6,10]. However, multigene datasets suffer from systematic errors such as within-site rate variation [13] and long-branch attraction artifacts [14], and statistical methods for extracting information from such data remain limited [15-18]. In such cases, analysis of individual partitions (phylogenies are inferred separately for each dataset and a consensus tree is determined from these separate trees) is also necessary, in addition to the combined analysis [19].

A number of studies have investigated the strengths and weaknesses of each method, or which method should be preferred in a given situation [20-25]. However, despite the extensive use of the genome-scale approach, studies comparing the performance of phylogenetic methods under a genome-scale approach are lacking. Advances in both computer and algorithm speed now make it possible to simulate and analyze several thousand datasets and so to take a thorough look at the performance of the various methods.

The study reported here has several purposes: first, to evaluate the performance of the Bayesian, ML, NJ-MCL, MP, and NJ methods under the genome-scale approach and to identify the most accurate method; second, to examine the effect of adding a single gene to an existing concatenation on the methods' performance; third, to investigate how many datasets are needed to recover the true tree (i.e., the model tree); fourth, to check whether the number of genes or the number of characters is more important.

Here, the performance of the methods is measured on three criteria: accuracy (footnote 1), consistency (footnote 2), and computational efficiency [26,27]. The problem is studied using biologically realistic simulated DNA datasets.

Materials and Methods

Phylogeny reconstruction methods

The performance of five phylogenetic tree reconstruction methods was examined under a genome-scale approach: Bayesian, ML, NJ-MCL, MP, and NJ. Apart from NJ-MCL, these methods are well known and need no further introduction [28]. In brief, NJ-MCL is a more recent method that balances elements of the maximum likelihood (ML) and neighbor joining (NJ) algorithms. All pairwise distances are estimated simultaneously by maximizing a likelihood function, and the NJ method is then used to infer the phylogeny [29]. This simultaneous estimation of pairwise distances is called the maximum composite likelihood (MCL) method, and NJ tree reconstruction using MCL distance estimates is referred to as the NJ-MCL method [30].
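The NJ half of NJ-MCL can be illustrated with a single pair-selection step. The sketch below is not MEGA's implementation: it applies the standard NJ Q-criterion to a hypothetical five-taxon distance matrix, and the names `nj_q_matrix`, `nj_next_pair` and `TOY_DISTANCES` are invented for illustration. In NJ-MCL the distances would come from the MCL estimation rather than being typed in by hand.

```python
from itertools import combinations

# Hypothetical pairwise distances for five taxa (illustration only).
# In NJ-MCL these numbers would come from the MCL distance estimation.
TOY_DISTANCES = {
    frozenset("ab"): 5, frozenset("ac"): 9, frozenset("ad"): 9, frozenset("ae"): 8,
    frozenset("bc"): 10, frozenset("bd"): 10, frozenset("be"): 9,
    frozenset("cd"): 8, frozenset("ce"): 7,
    frozenset("de"): 3,
}


def nj_q_matrix(dist):
    """Return the NJ Q-criterion for every pair of taxa.

    dist maps frozenset({a, b}) to the distance between taxa a and b.
    Q(a, b) = (n - 2) * d(a, b) - sum_k d(a, k) - sum_k d(b, k)
    """
    taxa = sorted({t for pair in dist for t in pair})
    n = len(taxa)
    # Total distance from each taxon to all the others.
    total = {
        t: sum(dist[frozenset((t, u))] for u in taxa if u != t)
        for t in taxa
    }
    return {
        (a, b): (n - 2) * dist[frozenset((a, b))] - total[a] - total[b]
        for a, b in combinations(taxa, 2)
    }


def nj_next_pair(dist):
    """Return the pair of taxa that NJ would join first (smallest Q)."""
    q = nj_q_matrix(dist)
    return min(q, key=q.get)
```

With the toy matrix, `nj_next_pair(TOY_DISTANCES)` selects the pair `("a", "b")`. A full NJ run would merge that pair into a new node, recompute distances, and repeat until the tree is resolved.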

To search for MP trees, the Subtree-Pruning-Regrafting (SPR) search algorithm was used. In an extensive computer simulation, Takahashi and Nei [31] showed that the SPR search algorithm is as efficient as more extensive search algorithms such as the max-mini branch-and-bound search, the min-mini heuristic search, and the close-neighbor-interchange heuristic search. Only the branch-and-bound search is guaranteed to find all the MP trees, but it takes a prohibitive amount of time if the number of sequences is large (>15) [32].

Computer simulations to generate datasets

Figure 1 shows the model tree topology selected for the computer simulations. This phylogeny is based on an independent analysis of 16,397 aligned nucleotide positions from 19 nuclear and three mitochondrial genes (treated as 20 individual genes, with the three mitochondrial genes forming one alignment) for 42 placental mammals [33]. To conduct a biologically realistic simulation, gene-specific evolutionary parameters were estimated from each gene alignment of the 42 mammals using the TN93 model [34] of sequence evolution, allowing for a gamma distribution of rates. The TN93 model was selected as the best-fit model of nucleotide substitution using the MODELTEST program [35,36]. The evolutionary parameters (i.e., branch lengths, transition/transversion ratio and gamma parameter) of each gene dataset were calculated using PhyML [37,38]. Gene sequences were simulated on the model topology (Figure 1) with the gene-specific evolutionary parameters extracted from the real data. For each set of gene-specific evolutionary parameters, 100 replicate datasets were generated using the Dawg program [39] under the TN93 model of nucleotide substitution with a gamma distribution of rates.
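To make the simulation model concrete, the following sketch builds a TN93 rate matrix and draws gamma-distributed site rates in plain Python. It is not Dawg's or PhyML's code. The function names, the parameter names (`kappa_r` for purine transitions, `kappa_y` for pyrimidine transitions, `beta` for transversions) and the example values are assumptions made for illustration; the real parameters in the study were estimated per gene.

```python
import random

BASES = "ACGT"
PURINES = {"A", "G"}


def tn93_rate_matrix(freqs, kappa_r, kappa_y, beta=1.0):
    """Build an (unnormalized) TN93 rate matrix as a dict of dicts.

    freqs   : equilibrium base frequencies, e.g. {"A": 0.3, ...}
    kappa_r : rate multiplier for purine transitions (A <-> G)
    kappa_y : rate multiplier for pyrimidine transitions (C <-> T)
    beta    : transversion rate
    """
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x == y:
                continue
            if x in PURINES and y in PURINES:
                rate = kappa_r * beta      # purine transition
            elif x not in PURINES and y not in PURINES:
                rate = kappa_y * beta      # pyrimidine transition
            else:
                rate = beta                # transversion
            q[x][y] = rate * freqs[y]
        # Diagonal entry makes each row sum to zero.
        q[x][x] = -sum(q[x].values())
    return q


def gamma_site_rates(n_sites, alpha, seed=None):
    """Draw per-site relative rates from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]
```

For example, `tn93_rate_matrix({"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}, kappa_r=4.0, kappa_y=6.0)` returns a matrix whose rows sum to zero and which satisfies detailed balance with respect to the supplied frequencies, as expected for a time-reversible model.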

Figure 1: The model topology used in the computer simulations based on the 42-taxa tree from Springer et al. [33].

Phylogenetic analysis

Five methods of tree reconstruction were used in this study. Phylogenetic analyses were carried out using MEGA5 [30] for the NJ-MCL, MP, and NJ methods, PhyML [38] for the ML method, and BAMBE [40] for the Bayesian method. The TN93 model of sequence evolution with gamma rate heterogeneity was used for the Bayesian, ML and NJ methods. The MP trees were reconstructed using the SPR search algorithm [32]. The NJ-MCL algorithm was developed on the basis of the TN93 model of sequence evolution, so for NJ-MCL the TN93 model (the default choice) with gamma rate heterogeneity was also used. For a given simulated dataset (whether an alignment of one simulated gene or a concatenation of multiple genes), Bayesian, ML, NJ-MCL, MP, and NJ trees were reconstructed, and in each case the topological distance between the reconstructed tree and the model tree was estimated.

Accuracy of the inferred phylogeny

The accuracy of each method was measured as the percentage of clades reconstructed correctly (PC), obtained as PC = 100[1 - dT/(2m - 6)], where dT is the topological distance between the reconstructed and model trees and m is the number of sequences in the phylogeny [41,42]. All comparisons were made between the reconstructed trees and the model tree (Figure 1). For example, for a given simulated dataset, the Bayesian tree was reconstructed, dT was then estimated between the reconstructed Bayesian tree and the model tree, and finally PC was calculated from the dT value. The same analysis was done for all five methods and for each multigene dataset.
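The PC formula can be worked through on a small example. The sketch below assumes that dT is the symmetric-difference (Robinson-Foulds-style) count of bipartitions, which is consistent with the 2m - 6 normalization for fully resolved unrooted trees, and it represents trees as nested tuples of taxon names. The representation and the function names are choices made for illustration, not the tooling used in the study.

```python
def leaf_set(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of a tree.

    Each bipartition is stored as the side that does NOT contain a
    fixed reference taxon, so rooted and unrooted views agree.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        # Skip trivial splits (single taxon vs. the rest) and the root.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def topological_distance(tree_a, tree_b):
    """dT: number of bipartitions present in one tree but not the other."""
    return len(splits(tree_a) ^ splits(tree_b))


def percent_correct(reconstructed, model):
    """PC = 100 * [1 - dT / (2m - 6)] for m taxa."""
    if leaf_set(reconstructed) != leaf_set(model):
        raise ValueError("trees must contain the same taxa")
    m = len(leaf_set(model))
    d_t = topological_distance(reconstructed, model)
    return 100.0 * (1.0 - d_t / (2 * m - 6))
```

For the five-taxon model tree `((("A", "B"), "C"), ("D", "E"))`, a reconstruction `((("A", "C"), "B"), ("D", "E"))` differs in one bipartition on each side, so dT = 2 and PC = 100[1 - 2/4] = 50. For the 42-taxon model tree in this study the denominator is 2(42) - 6 = 78.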

Construction of multigene datasets

To compare the performance of the five phylogenetic methods, 100 simulated datasets were generated for each of the 20 genes. To construct a multigene dataset, one replicate has to be selected out of the 100 replicates for each gene. To keep the replicate selection unbiased and realistic, the distribution of dT over the 100 replicates of each simulated gene was plotted, and it was found that the topological distances of the trees based on the replicates varied widely (Figure 2). Given this distribution of topological distances, the multigene datasets were constructed in three ways: (i) the best-replicate scenario (BS), in which the replicate selected for a given gene is the one that produced a phylogeny with the highest PC (i.e., with the lowest topological difference from the model tree); (ii) the worst-replicate scenario (WS), in which the replicate selected for a given gene is the one that produced a phylogeny with the lowest PC; and (iii) the random-replicate scenario (RS), in which all analyses were conducted using randomly chosen replicates to represent individual genes.
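The three replicate-selection scenarios reduce to a simple rule over the per-replicate dT values of a gene. The sketch below is a minimal illustration with a hypothetical function name and hypothetical dT values; how ties between equally good replicates were broken in the study is not stated, so taking the first one is an assumption.

```python
import random


def select_replicate(dt_by_replicate, scenario, rng=None):
    """Pick one replicate index for a gene.

    dt_by_replicate : list of dT values, one per simulated replicate
    scenario        : "best" (lowest dT, i.e. highest PC),
                      "worst" (highest dT, i.e. lowest PC), or "random"
    rng             : optional random.Random instance for reproducibility
    """
    indices = range(len(dt_by_replicate))
    if scenario == "best":
        # Assumption: ties go to the first replicate with the lowest dT.
        return min(indices, key=lambda i: dt_by_replicate[i])
    if scenario == "worst":
        return max(indices, key=lambda i: dt_by_replicate[i])
    if scenario == "random":
        return (rng or random).choice(list(indices))
    raise ValueError(f"unknown scenario: {scenario!r}")
```

With hypothetical distances `[12, 4, 30, 8]`, the best-replicate scenario returns index 1 and the worst-replicate scenario returns index 2.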

Figure 2: Distribution of the topological distances (dT) from the 100 replicates of the EDG1 gene (978 bp). The topological distances vary widely among replicates even though a single set of evolutionary parameters was used in the computer simulation to generate the replicates of the EDG1 gene.

Progressive concatenation of genes

In this study, the relative performance of five phylogenetic methods is investigated under the genome-scale approach. It is well known that adding a single gene to an existing set of genes improves the accuracy of phylogenetic reconstruction [9,43], and a more accurate method is expected to converge first (i.e., to need a smaller number of genes to recover the true tree). The question is in which order the genes should be concatenated, because gene lengths vary widely and the genes contain different levels of phylogenetic signal. Theoretically, concatenations of longer genes are expected to converge first, because longer genes accumulate enough nucleotide substitutions to be inferred with greater accuracy. Moreover, an overall increase in sequence length should reduce stochastic errors in evolutionary distances and other parameters in model-based methods (i.e., Bayesian, ML, NJ-MCL, and NJ) [32]. Therefore, given the variable gene lengths and the different levels of phylogenetic signal, three concatenation scenarios are considered: (i) longest-to-shortest (LS) concatenation, where genes are concatenated in descending order of length; (ii) shortest-to-longest (SL) concatenation, where genes are concatenated in ascending order of length; and (iii) random concatenation (RC), where genes are selected randomly. These three scenarios were chosen to examine the performance of the methods over a wide range of conditions, to determine which method needs the fewest datasets to recover the model tree, to investigate whether concatenating from the longest or from the shortest genes affects the methods' performance, and to check whether all three scenarios need roughly the same number of characters.
This study therefore included all three concatenation scenarios (i.e., LS, SL and RC) under each of the three replicate-selection scenarios (i.e., best-replicate, worst-replicate and random-replicate scenarios) [Supplementary data].
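The three concatenation orders can be sketched as a sort over gene lengths followed by a running total of characters. In the example, only the EDG1 length (978 bp) comes from the text; `geneA` and `geneB` and their lengths are hypothetical placeholders, and the function names are invented for illustration.

```python
import random
from itertools import accumulate


def concatenation_order(gene_lengths, scenario, seed=0):
    """Return gene names in the order they are added to the concatenation.

    gene_lengths : dict mapping gene name to alignment length in bp
    scenario     : "LS" (longest to shortest), "SL" (shortest to longest),
                   or "RC" (random concatenation)
    """
    names = list(gene_lengths)
    if scenario == "LS":
        return sorted(names, key=lambda g: gene_lengths[g], reverse=True)
    if scenario == "SL":
        return sorted(names, key=lambda g: gene_lengths[g])
    if scenario == "RC":
        rng = random.Random(seed)  # fixed seed keeps the example reproducible
        rng.shuffle(names)
        return names
    raise ValueError(f"unknown scenario: {scenario!r}")


def cumulative_lengths(gene_lengths, order):
    """Total characters after adding each gene in the given order."""
    return list(accumulate(gene_lengths[g] for g in order))


# EDG1 length is from Figure 2; the other two genes are hypothetical.
EXAMPLE_LENGTHS = {"EDG1": 978, "geneA": 1500, "geneB": 420}
```

Here the LS order is `["geneA", "EDG1", "geneB"]` with cumulative lengths `[1500, 2478, 2898]`, while the SL order reaches the same 2898 characters through `[420, 1398, 2898]`. Each prefix of an order corresponds to one progressively larger dataset that is analyzed by all five methods.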

Results

Comparison of the efficiencies of phylogenetic methods

In this study, a genome-scale approach was used in a biologically realistic computer simulation to examine the performance of five major phylogenetic methods. Simulation results indicate that the probabilistic methods (i.e., Bayesian and ML) are much more accurate than the NJ-MCL, MP and NJ methods, and these results hold for each replicate scenario and for all three concatenation scenarios (Table 1). In general, the two probabilistic methods perform similarly, but in a fine comparison ML is better than the Bayesian method (ML is more accurate and more consistent). The NJ-MCL, MP, and NJ methods are fast (i.e., computationally efficient), but their accuracy and consistency are very poor compared to the Bayesian and ML methods. Among the NJ-MCL, MP and NJ methods, NJ-MCL is more accurate than MP and NJ, with one exception: for the best-replicate scenario, MP outperforms the NJ-MCL and NJ methods (Figure 3a). Further investigation showed that, for all three replicate scenarios, the Bayesian and ML methods need a concatenation of only a few genes to recover the true tree, whereas the other three methods need a considerably larger number of genes (three to six times more genes were required by the NJ-MCL, MP and NJ methods). These results indicate that the probabilistic methods (i.e., ML and Bayesian) are much more consistent and efficient than the NJ-MCL, MP and NJ methods. For example, under the random-replicate scenario with LS concatenation, the Bayesian and ML methods need 4 and 2 genes respectively to recover the true tree, whereas NJ-MCL, MP and NJ need 7, 20, and 12 genes respectively. These results are shown in Table 1.

Figure 3: Percentage of branches inferred correctly (PC) plotted against the number of concatenated genes for the three concatenation approaches (i.e., longest to shortest (LS), shortest to longest (SL), and random concatenation (RC)) and for the (a) best-replicate and (b) random-replicate scenarios. The plots show that the probabilistic methods (i.e., Bayesian and ML) converge (i.e., PC = 100%) much faster than the other methods (i.e., NJ-MCL, MP, and NJ). Furthermore, for the ML method the PC value increases (or remains unchanged) when a single gene is added to an existing concatenation, whereas for the other methods, in several cases, adding a gene produces a less correct tree than the initial dataset did (see text).

Method   | Best replicate scenario                 | Random replicate scenario               | Worst replicate scenario
         | LS          | SL          | RC          | LS          | SL          | RC          | LS          | SL          | RC
         | Gene   Char | Gene   Char | Gene   Char | Gene   Char | Gene   Char | Gene   Char | Gene   Char | Gene   Char | Gene   Char
Bayesian |    3   5812 |   10   4042 |    9   8440 |    4   6988 |   14   7318 |   15  13301 |    5   8077 |   17  10585 |   12  11002
ML       |    3   5812 |    6   1934 |    8   7438 |    2   4564 |   14   7318 |   13  12649 |    2   4564 |   17  10585 |   12  11002
NJ-MCL   |   17  15548 |   17  10585 |   18  14945 |    7  10057 |   20  16397 |   18  14945 |    9  11665 |   19  13480 |   13  12649
MP       |   11  13032 |   16   9409 |   14  12986 |   20  16397 |   16   9409 |   20  16397 |  >20     -- |  >20     -- |  >20     --
NJ       |   17  15548 |   17  10585 |   19  16193 |   12  13593 |   20  16397 |   20  16397 |    7  10057 |   19  13480 |   19  16397

Table 1: The number of genes and characters required to recover the true tree for each concatenation scenario and for the Bayesian, ML, MP, NJ-MCL, and NJ methods. Columns “Gene” and “Char” stand for total number of genes and corresponding total number of characters used in a single concatenation to infer the true tree. LS, SL and RC concatenation scenarios represent longest to shortest, shortest to longest and random concatenation of genes respectively.

More genes or more characters

This study also investigated how many characters are needed to recover the true tree and whether that number is correlated with the number of concatenated genes (i.e., whether more genes also means more characters, or vice versa). The results in Table 1 show that, for the best-replicate scenario, shortest-to-longest (SL) concatenation needs the fewest characters (with more genes) to recover the true tree, generally followed by the LS and RC scenarios; SL needs the fewest characters for all five methods. For the ML method under the best-replicate scenario, LS concatenation takes 5812 characters (concatenation of 3 genes) whereas SL concatenation takes only 1934 characters (concatenation of 6 genes) to recover the true tree. Similarly, for the Bayesian method LS concatenation takes 5812 characters and SL concatenation takes 4042 (concatenation of 3 and 10 genes respectively). This result indicates that it is not always useful to favor longer genes; rather, concatenating shorter gene sequences (possibly more genes with fewer characters) can be more effective for reconstructing multigene phylogenies. In particular, it should reduce the computational time. Moreover, this finding runs against the theoretical expectation that concatenations of longer genes will converge first because longer genes accumulate enough nucleotide substitutions to be inferred with greater accuracy. Nevertheless, no one can guarantee that concatenating shorter genes will produce a better phylogeny, because the level of phylogenetic signal present in the sequences is the most important factor in reconstructing the true evolutionary history of species or genes. For the other replicate scenarios (i.e., WS and RS), SL concatenation takes more genes and also more characters than LS concatenation. This contradiction is due to the quality of the gene replicates.
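The best-replicate comparison can be checked directly against the numbers in Table 1. The sketch below transcribes the best-replicate columns as (genes, characters) pairs and reports, for each method, which concatenation scenario needs the least of either quantity. The data structure and the function name are illustrative choices; the values are those given in Table 1.

```python
# Best-replicate columns of Table 1: scenario -> (genes, characters).
BEST_REPLICATE = {
    "Bayesian": {"LS": (3, 5812), "SL": (10, 4042), "RC": (9, 8440)},
    "ML": {"LS": (3, 5812), "SL": (6, 1934), "RC": (8, 7438)},
    "NJ-MCL": {"LS": (17, 15548), "SL": (17, 10585), "RC": (18, 14945)},
    "MP": {"LS": (11, 13032), "SL": (16, 9409), "RC": (14, 12986)},
    "NJ": {"LS": (17, 15548), "SL": (17, 10585), "RC": (19, 16193)},
}

GENES, CHARACTERS = 0, 1


def cheapest_scenario(table, column):
    """For each method, return the scenario with the smallest value
    in the chosen column (GENES or CHARACTERS).

    Ties are resolved in favor of the scenario listed first.
    """
    return {
        method: min(row, key=lambda scenario: row[scenario][column])
        for method, row in table.items()
    }
```

`cheapest_scenario(BEST_REPLICATE, CHARACTERS)` returns `"SL"` for all five methods, while no method needs fewer genes under SL than under LS. This is the trade-off described above: under the best-replicate scenario, SL recovers the true tree from fewer characters but not from fewer genes.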

Effect of addition of a single gene to an existing concatenation

It was also investigated how adding a single gene to an existing concatenation (which generates a new concatenated dataset) affects the accuracy of the methods. Figure 3 shows the plots of PC against the number of concatenated genes for all three concatenation scenarios (i.e., LS, SL and RC) and for the best-replicate and random-replicate scenarios. Overall, progressive addition of genes improved the accuracy of phylogenetic reconstruction for all five methods. However, for the NJ-MCL, MP, and NJ methods, in several cases (Figure 3) adding a single gene to an existing concatenation decreases the PC value obtained from the initial dataset (i.e., adding a gene produces a less correct tree than the initial dataset did). This is due to the different phylogenetic signals of the individual genes, arising either from real differences in their evolutionary history or from different statistical biases, and the NJ-MCL, MP, and NJ methods fail to accommodate such properties. In this situation concatenation may obscure the underlying species tree [44]. Interestingly, in spite of its rigorous statistical properties, the Bayesian method also suffers from a similar problem, although its performance is comparatively better than that of the NJ-MCL, MP, and NJ methods. In the case of the ML method, on the other hand, adding a single gene to an existing concatenation mostly improves the phylogenetic reconstruction or keeps its accuracy (PC) at the level obtained from the previous data (i.e., the PC value increases with the addition of a gene or remains unchanged). This result indicates that the ML method is more consistent than all the other methods.
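The behavior described here, where adding a gene sometimes lowers PC, can be detected mechanically from a PC series. The sketch below uses hypothetical PC values (chosen to be of the form 100[1 - dT/78] for a 42-taxon tree, rounded to one decimal) and a hypothetical function name; it is not data from Figure 3.

```python
def accuracy_drops(pc_by_gene_count):
    """Return the gene counts at which PC fell below the previous value.

    pc_by_gene_count[0] is the PC for 1 gene, [1] for 2 genes, and so on.
    """
    return [
        n_genes
        for n_genes in range(2, len(pc_by_gene_count) + 1)
        if pc_by_gene_count[n_genes - 1] < pc_by_gene_count[n_genes - 2]
    ]


# Hypothetical PC series for two methods (illustration only).
STEADY = [71.8, 84.6, 92.3, 92.3, 100.0]        # never decreases
ERRATIC = [71.8, 84.6, 82.1, 92.3, 89.7, 100.0]  # drops at 3 and 5 genes
```

An empty result, as for `STEADY`, corresponds to the pattern reported for ML, where PC increases or stays the same with each added gene. A non-empty result, as for `ERRATIC`, corresponds to the pattern reported in several cases for the NJ-MCL, MP and NJ methods.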

How many datasets are needed to recover the true tree?

Another investigation was performed to find out how many datasets are needed to recover the true tree. The results showed that each method needs a different number of genes, depending on its statistical power to resolve branches. Even for a particular method, the number of genes varies among the replicate scenarios. Table 1 shows these variations. For example, for the Bayesian method under LS concatenation, the best, random, and worst replicates need three, four, and five genes respectively. These results imply that the quality of the replicates is a primary factor; in a simulation study it is possible to distinguish the best and worst replicates, but with real data it is not. These simulation experiments show that the number of genes sufficient to recover the true tree ranged from a minimum of 4 to 20. This result is in full agreement with Rokas et al. [43].
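Counting the genes needed to recover the true tree amounts to finding the first concatenation whose tree matches the model tree (PC = 100, i.e. dT = 0). The sketch below assumes that "recover" means the first time PC reaches 100, which the text does not state explicitly; the function name and the PC series are hypothetical.

```python
def genes_to_recover(pc_by_gene_count, target=100.0):
    """Return the smallest number of concatenated genes whose tree
    reaches the target PC, or None if it is never reached.

    pc_by_gene_count[0] is the PC for 1 gene, [1] for 2 genes, and so on.
    """
    for n_genes, pc in enumerate(pc_by_gene_count, start=1):
        if pc >= target:
            return n_genes
    return None


# Hypothetical PC series (illustration only).
FAST_METHOD = [92.3, 97.4, 100.0, 100.0]
SLOW_METHOD = [71.8, 79.5, 84.6, 89.7]
```

For `FAST_METHOD` the function returns 3, and for `SLOW_METHOD` it returns `None`, which corresponds to the ">20" entries for MP under the worst-replicate scenario in Table 1.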

Discussion

In this article, the relative performance of five major phylogenetic methods was evaluated under the genome-scale approach using biologically realistic simulated nucleotide data, and the simulation results show that the Bayesian and ML methods are much more accurate than the NJ-MCL, MP and NJ methods. These results agree with other studies [20-25], with one exception: in Hall's study [22], the Bayesian method is more accurate than the ML method. By contrast, this study shows that the ML method is slightly better than the Bayesian method. This is apparently due to differences in simulation strategy and experimental methodology. Besides comparing the performance of the methods, this study has revealed several important aspects of the genome-scale approach, such as how concatenating from the longest or from the shortest genes affects the methods' performance, how many datasets are needed to recover the true tree, and whether more genes or more characters are more important. These have been explained in the Results section.

Concerning the accuracy of the methods, the results showed that the probabilistic methods (i.e., Bayesian and ML) are much more accurate than the NJ-MCL, MP and NJ methods. Overall, ML is the most accurate, followed by the Bayesian, NJ-MCL, NJ, and MP methods. Furthermore, the ML method was shown to be much more consistent than the other methods (even superior to the Bayesian method). An accurate algorithm may be useless if it is too slow. Therefore, for comparison purposes, the run time of each algorithm was measured; the results are shown in Table 2. Although the NJ-MCL, MP, and NJ methods are computationally very efficient, their accuracy and consistency are very poor compared to the Bayesian and ML methods. The Bayesian method, on the other hand, is very accurate but less consistent and takes a much longer execution time, whereas ML is very accurate, consistent, and computationally efficient. In conclusion, the ML method is recommended when a genome-scale approach is used for phylogenetic reconstruction.

Phylogenetic method | Computer program | Time required
Bayesian            | BAMBE            | 4 hr, 28 min, 48 sec
ML                  | PhyML            | 10 min, 40 sec
MCL                 | MEGA4            | 2 sec
MP                  | MEGA4            | 10 sec
NJ                  | MEGA4            | 1.5 sec

Table 2: Average run times for the various methods. The computing times were measured on a Pentium IV 3.0 GHz PC (2 GB RAM) running Windows XP. Datasets of 42 taxa with 16,397 bp were used to estimate the average computation time.
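To compare the run times in Table 2 on a common scale, they can be converted to seconds. The sketch below parses the time strings exactly as printed in the table; the helper name `to_seconds` and the dictionary are illustrative.

```python
import re

UNIT_SECONDS = {"hr": 3600, "min": 60, "sec": 1}

# Run times as printed in Table 2.
RUN_TIMES = {
    "Bayesian": "4 hr, 28 min, 48 sec",
    "ML": "10 min, 40 sec",
    "NJ-MCL": "2 sec",
    "MP": "10 sec",
    "NJ": "1.5 sec",
}


def to_seconds(text):
    """Convert a string such as '4 hr, 28 min, 48 sec' to seconds."""
    total = 0.0
    for value, unit in re.findall(r"([\d.]+)\s*(hr|min|sec)", text):
        total += float(value) * UNIT_SECONDS[unit]
    return total


# Run time of each method relative to ML.
RELATIVE_TO_ML = {
    method: to_seconds(text) / to_seconds(RUN_TIMES["ML"])
    for method, text in RUN_TIMES.items()
}
```

On these figures the Bayesian analysis takes 16,128 seconds against 640 seconds for ML, a factor of about 25, while ML is in turn 64 times slower than MP and 320 times slower than NJ-MCL.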

Acknowledgements

I acknowledge useful discussions with Dr. Dan Graur and Dr. Giddy Landan of University of Houston. I thank Dr. Georg Fuellen of University of Rostock for his helpful comments. I also thank Priyanka Sengupta for editorial support.

Supplementary Data

Supplementary data associated with this article can be found in the online version.

1 Accuracy: a phylogenetic method has high accuracy if it quickly converges on the true tree as more data are applied to the problem.

2 Consistency: a phylogenetic method is consistent for an evolutionary model, if the method converges on the true tree as the data becomes infinite.

Citation: Som A (2013) Genome-Scale Approach and the Performance of Phylogenetic Methods. J Phylogen Evolution Biol 1:116. Doi: 10.4172/2329-9002.1000116

Copyright: © 2013 Som A. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
