Error in Phylogenetic Estimation for Bushes in the Tree of Life

Many rapid radiations, or bushes, throughout the Tree of Life remain unresolved. Here, we investigated how the shape of a bush interacts with two key processes - coalescence and mutation - that can lead to errors in phylogenetic [...] species and sampling more loci as well as the utility of a species tree method based upon gene tree reconciliation and the concatenation of multiple loci for resolving bushes. We examined different bush shapes, varying both the speciation rate during the radiation and the depth of the radiation, to encompass a broad range of situations. Using simulations based upon parameters derived from empirical studies, we investigated the performance of phylogenetic analyses [...] a single individual for more loci outperformed sampling multiple individuals for one locus in all cases except the most recent radiations. We found that error due to homoplastic mutations increased with depth, while error due to the coalescent process remained unchanged. These simulations also revealed that, for certain ancient bushes, analyses of concatenated data matrices surprisingly resulted in more accurate phylogenies than gene tree reconciliation. The [...] the superiority of concatenation per se. Our results suggest concatenation remains a useful approximate method for species tree estimation, even for rapid evolutionary radiations. However, improved estimation of gene trees combined with use of gene tree reconciliation has the greatest potential for resolving the remaining bushes of the Tree of Life.
*Corresponding author: Swati Patel, Department of Biology, 223 Bartram Hall, P.O. Box 118525, University of Florida, Gainesville, Florida 32611-8525, USA, Tel: 224-558-9786; E-mail: swpatel@ucdavis.edu
Received March 26, 2013; Accepted May 31, 2013; Published June 10, 2013
Citation: Patel S, Kimball RT, Braun EL (2013) Error in Phylogenetic Estimation for Bushes in the Tree of Life. J Phylogen Evolution Biol 1: 110. doi:10.4172/jpgeb.1000110
Copyright: © 2013 Patel S, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Introduction
Despite the progress made on assembling the Tree of Life, many clades remain unresolved even after substantial effort. These difficult clades have been called "bushes in the Tree of Life" [1] and they are thought to reflect rapid evolutionary radiations. Empirical examples of bushes are ubiquitous throughout the Tree of Life [1-6] and they are especially difficult to resolve when they are ancient [6], leaving large gaps in our knowledge about the Tree of Life.
Bushes in the Tree of Life can be characterized based on the rate of speciation during the evolutionary radiation and the overall depth of that radiation (Figure 1). Rate captures the times between speciation events whereas depth captures the time since the radiation, and these two parameters can be viewed as defining a "bush shape". Depth shows especially striking variation; even if we restrict consideration to animals, there are bushes as deep as 550 million years ago (Ma) when the Cambrian explosion occurred [7], to as recent as 100,000 years ago when the Lake Victoria cichlids radiated [8]. Although it has long been recognized that trees with long terminal branches (large depth) and short internal branches (high rate) are often difficult to resolve [6,9], the relationship between rate, depth, and phylogenetic error remains unclear. Understanding the ways that these characteristics (rate and depth) influence phylogenetic estimation will be beneficial to making further progress in resolving bushes in the Tree of Life.
To explore the best approaches for resolving bushes in the Tree of Life, it is important to consider the processes that lead to genetic differences among taxa. Ultimately, the differences among taxa reflect a complex set of stochastic processes that include lineage sorting due to the coalescent, patterns of mutation, recombination, horizontal gene transfer and introgression due to hybridization, and the duplication and loss of genes [10-12]. If we restrict our attention to vertebrates, the first two processes are likely to have the greatest impact upon the resolution of bushes, although all of those processes make important contributions to the differences among genomes. The coalescent describes the history of alleles in populations [13,14] and the random sorting of alleles into different lineages due to the coalescent can result in discordance between gene trees and species trees [10,12]. Mutational processes have an impact both upon the probability of finding synapomorphic mutations that unite taxa [15] and the likelihood that homoplasy will obscure phylogenetic signal [6,16]. The bush shape (i.e., the speciation rate and depth of the radiation) is likely to influence whether one or both of these factors, called "coalescent error" and "mutational error" hereafter, will obscure the true phylogeny.
The coalescent process, which can lead to discordance between gene trees and species trees, has been extensively studied from both a theoretical [10,17-19] and empirical [20-22] standpoint. It is known that there are extreme situations (the "anomaly zone") where the most common gene tree is discordant with the species tree [23]. However, these studies have typically focused on relatively recent, or shallow, radiations. There has been limited study of the problem for ancient rapid radiations, but the coalescent process should have just as much potential to result in gene tree discordance for ancient radiations [6,24].
Indeed, recent analyses of deep mammalian phylogeny using methods that accommodate discordance among gene trees due to the coalescent do yield different results than concatenation [25,26]. However, other empirical studies have not revealed clear evidence that gene tree discordance due to the coalescent has had an impact upon species tree estimation for ancient radiations [27-29]. Moreover, methods that accommodate gene tree discordance do appear to improve the efficiency and accuracy of phylogenetic estimation [30], at least for some problems. Regardless, the fact that gene tree-species tree discordance is as likely for ancient radiations as it is in recent radiations suggests that analyses of ancient rapid radiations should consider the impact of coalescent error.
The stochastic nature of the mutational process can also result in difficulties for the resolution of bushes in the Tree of Life. Even if gene tree-species tree discordance were ignored it would be difficult to obtain accurate estimates of gene trees for ancient rapid radiations [6,16]. This reflects a combination of three factors that we view as aspects of mutational error (we use the term "mutational error" to conform to the terminology used by Huang et al. [31] but note that this source of error includes both the origin of novel alleles by mutation and all other factors that influence the fixation of these alleles in lineages). First, present-day sequences are likely to exhibit substantial homoplasy that can obscure the branching pattern during an ancient radiation. In the extreme this homoplasy could lead to substitutional saturation [32,33]. Second, a limited number of informative characters are expected to be present in finite sequences that were generated by evolution on trees with short internodes [15]. This can lead to a requirement for very long sequences to accurately estimate gene trees [34]. Finally, there are cases where the expected site pattern spectra support a topology that conflicts with the true tree due to factors like long-branch attraction and base compositional convergence [32,35-37].
There has been substantial effort to develop phylogenetic methods that can detect and overcome bias, but it is important to recognize that the first two phenomena can be problematic for the estimation of gene trees even if long-branch attraction, base compositional convergence, and other sources of bias are absent. Specifically, the impact of homoplastic changes after the radiation is related to depth whereas the expected number of informative characters that define clades in gene trees is related to the rate of speciation during the radiation. Thus, both of the parameters we use to describe bushes are expected to affect the impact of mutational error upon phylogenetic estimation.
For each bush in the Tree of Life, it can be difficult to identify whether the lack of resolution reflects coalescent or mutational error (or both). Indeed, identifying the likely source of error may allow researchers to modify their sampling strategies and analytical methods to better address the specific source of phylogenetic error for a given problem. The relative contribution of coalescent and mutational error may depend on the shape of the bush. Several studies have explicitly examined how specific characteristics of a tree can affect phylogenetic inference [31,38-41]. While providing important insights into phylogenetic inference, these studies did not examine the impact of both coalescent and mutational variance upon radiations that are both ancient and rapid. However, it will be necessary to understand the impact of both processes and the ways that they interact to then identify methods with the greatest potential to resolve difficult bushes.
Here we use simulations to explore the impact of both coalescent and mutational processes upon phylogenetic estimation given different bush shapes (i.e., different rates of speciation and depths). We base our simulation parameters on empirical observations obtained from studies of tetrapods, specifically birds and mammals (see Supplementary Material), to better link theory with practice and provide useful recommendations for empirical studies. We use these simulations to examine the impact of depth and rate upon two methodological questions likely to be important. First, we examine the tradeoff between sampling more individuals per species and sampling more loci. Second, we compare the performance of a species tree method based upon gene tree reconciliation in a coalescent framework and simple phylogenetic analyses that use a concatenated alignment of multiple loci for resolving bushes. The goal of these analyses was to address the larger question of how to best resolve bushes in the Tree of Life and to gain insights into whether full resolution of the Tree of Life from sequence data is possible.

Methods
In this study, we generated random species trees, then we simulated gene trees based on these species trees, and finally we simulated nucleotide sequences using the gene trees. Phylogenetic trees obtained from analyses of the simulated data were compared with the true species trees that were used for the simulation (Figure 2). Details of these analyses are provided below.

Simulations
To examine the contribution of the coalescent and mutational processes to error in phylogenetic estimation, we simulated 20-taxon species trees assuming a Yule process [42] for various speciation rate and depth combinations. The simulations used code modified from the LASER package [43] in R. Speciation rate refers to the expected number of speciation events during each unit of time, while depth refers to the time since the last speciation event (Figure 1A). All times were measured in coalescent time units (2N_e generations for diploids, where N_e is the effective population size). We used a broad range of speciation rates, with the full set of speciation rates corresponding to 0.01, 0.1, 0.5, 1, 2.5, 5, 7.5, and 10, which resulted in a range of tree shapes (Figure 1B). After simulating the species tree, the terminal branches were then extended to place the radiation at various depths. The depth values ranged from 0.5 coalescent units to 200 coalescent units (the full set was 0.5, 1, 5, 25, 50, 100, and 200). We then simulated gene trees on these species trees under the coalescent process using the Phybase package [44] in R.
Figure 1 (partial caption): The shade of gray indicates how "bushlike" the phylogeny is, with lighter areas corresponding to "treelike" phylogenies and darker areas to more "bushlike" phylogenies.
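The species-tree step described above can be illustrated with a minimal Python sketch. This is not the modified LASER code used in the study; it only draws the waiting times between successive speciation events under a pure-birth (Yule) process and then adds a fixed depth, on the assumption that depth is measured from the last speciation event to the present. The function names `yule_waiting_times` and `bush_summary` are hypothetical.

```python
import random

def yule_waiting_times(n_taxa, rate, rng):
    """Waiting times between successive speciation events in a pure-birth
    (Yule) process that grows from 2 lineages to n_taxa lineages."""
    # With k lineages present, the time to the next speciation event is
    # exponentially distributed with total rate k * rate.
    return [rng.expovariate(k * rate) for k in range(2, n_taxa)]

def bush_summary(n_taxa, rate, depth, seed=0):
    """Summarize one simulated bush; all times are in coalescent units."""
    rng = random.Random(seed)
    waits = yule_waiting_times(n_taxa, rate, rng)
    radiation = sum(waits)  # time from the root split to the last split
    return {
        "radiation_duration": radiation,
        "mean_internode": radiation / len(waits),
        # Extending every terminal branch by `depth` pushes the whole
        # radiation deeper into the past without changing its shape.
        "root_age": radiation + depth,
    }

if __name__ == "__main__":
    for rate in (0.1, 1, 10):
        print(rate, bush_summary(n_taxa=20, rate=rate, depth=5.0, seed=1))
```

Higher rates compress the internodes while the depth term is left untouched, which is the sense in which rate and depth act as independent axes of bush shape.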
We simulated 1000 base pair (bp) sequences (a typical length for many markers [45-47]) using each gene tree in SeqGen [48], assuming the HKY model of evolution with a range of parameters that are typical of vertebrate introns. We established ranges of parameter values from empirical studies (Supplementary Material) and then chose values for our simulated regions randomly from these distributions. We used these parameters since introns have been used in many empirical studies (e.g., [49-52]), and their faster rates of evolution appear to make them more suitable for resolving rapid radiations than nuclear coding regions [34]. Although introns may be unsuitable for very ancient radiations, they do appear useful for the resolution of clades at a range of depths within some vertebrate classes [34,52,53]. Rates of molecular evolution obtained from empirical studies are typically expressed as substitutions per site per year instead of substitutions per site per coalescent unit. To convert coalescent units to years we assumed a constant N_e of 200,000 diploid individuals and generations of one year, so coalescent time units can be converted to years by multiplying by 400,000. These values are likely to be reasonable for a number of vertebrates based upon empirical studies (e.g., [20,54]). Although the most appropriate mutational and population genetic parameters are likely to differ among taxonomic groups, our approach could be applied to other types of loci or organisms by adjusting these parameters.
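The unit conversion stated above is simple arithmetic, and a short Python sketch may make it easier to adapt to other taxa. The defaults follow the values given in the text (N_e of 200,000 diploid individuals, one-year generations); the function names and the example substitution rate are illustrative assumptions, not values from the study.

```python
def coalescent_units_to_years(t, ne=200_000, generation_years=1.0, ploidy=2):
    """Convert time in coalescent units (ploidy * Ne generations) to years."""
    return t * ploidy * ne * generation_years

def rate_per_year_to_per_coalescent_unit(rate_per_year, ne=200_000,
                                         generation_years=1.0, ploidy=2):
    """Rescale a substitution rate (per site per year) to per coalescent unit."""
    return rate_per_year * ploidy * ne * generation_years

if __name__ == "__main__":
    # Shallowest and deepest radiations in the main simulations.
    print(coalescent_units_to_years(0.5))   # 200000.0 years
    print(coalescent_units_to_years(200))   # 80000000.0 years
    # Hypothetical rate of 1e-9 substitutions per site per year.
    print(rate_per_year_to_per_coalescent_unit(1e-9))
```

Under these assumptions the simulated depths span roughly 200,000 years to 80 million years.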

Sampling strategies
To examine the impact of sampling upon error due to the coalescent process we also simulated different numbers of individuals, or alleles in the diploid case, per species in each gene tree. We use the word "individuals" to represent individual intraspecific lineages [18]. For this part of the study, we added smaller depths and omitted depths greater than 5 because theory suggests that most genes will coalesce within 5 coalescent units [55]. To examine coalescent error specifically, we compared the true species tree with the estimated species tree from true gene trees. The set of depths we used was 0.01, 0.25, 0.5, 0.75, 1, 1.5, 2, and 5. We tested the following sampling strategies: [...] Finally, we added simulations with 50 loci and one individual per species for rates of 1 and 5 to address whether this increased amount of data would provide accurate estimates of phylogeny in the most problematic part of parameter space.

Phylogenetic analyses
We used STEM v1.1a [56] to find maximum likelihood (ML) estimates of species trees by reconciling gene trees. The approach implemented in STEM represents a practical and commonly used ML method for species tree estimation, making it suitable for our simulation study. The gene trees were either the true gene trees obtained directly from simulations or estimates of gene trees obtained by analyzing simulated sequence data. ML estimates of gene trees were obtained from the simulated sequence data using RAxML version 7.2.8a [57] and the GTR+Γ (-m GTRGAMMA) model of evolution and converted into ultrametric trees using penalized likelihood [58] as implemented in the ape package [59] in R. RAxML was also used to obtain the ML trees for concatenated data matrices generated when a single individual per species was sampled.
To measure accuracy of phylogenetic inference, we used the Robinson-Foulds (RF) distance, a metric based upon the number of clades that differ between two given trees [60]. Depending upon the specific analysis conducted, as many as four pairwise comparisons were conducted: 1) the true gene tree and the ML estimate (from RAxML) of the tree from the sequence data; 2) the true species tree with the ML estimate (from STEM) of the species tree based upon the true gene trees; 3) the true species tree and the ML estimate of the species tree obtained by analysis of the sequence data (using RAxML) followed by analysis of the gene trees (using STEM); and 4) the true species tree and the ML estimate of the species tree obtained by analysis of concatenated data (using RAxML). These comparisons provided information about error due to the mutational process, coalescent process, both together, and concatenation, respectively.
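As an illustration of the accuracy measure described above, the following Python sketch computes a clade-based RF distance for rooted trees written as nested tuples. It is a simplified rooted variant that counts clades present in one tree but not the other; it is not the software used in the study, and the tree encoding and function names are assumptions made for the example.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaf
    names. Leaves are strings; internal nodes are tuples of children."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found

def rf_distance(tree_a, tree_b):
    """Number of clades found in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))

if __name__ == "__main__":
    true_tree = ((("A", "B"), "C"), "D")
    estimate = ((("A", "C"), "B"), "D")
    print(rf_distance(true_tree, true_tree))  # 0
    print(rf_distance(true_tree, estimate))   # 2
```

In the second comparison the trees share the clade {A, B, C} but disagree on {A, B} versus {A, C}, so the distance is 2.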

Relative contributions of coalescent and mutational error
Error due to coalescence increased only with the speciation rate and was not affected by depth whereas error due to the mutational process increased with both rate and depth (Figure 3). The increase in coalescent error with rate was expected, given that high speciation rates result in less time between speciation events for alleles to sort into lineages. The independence of coalescent error and depth demonstrates that the problem of coalescent error can impact the phylogeny estimation for ancient radiations [25,26] just as it can for recent radiations. The increase in mutational error with rate likely reflects the low probability that a sufficient number of mutations will accumulate along short internal branches to provide an accurate estimate of the phylogeny. On the other hand, the increase in mutational error with depth probably reflects the tendency of homoplasy to confound phylogenetic estimation.
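The dependence of coalescent error on rate but not depth can be illustrated with the standard three-taxon result from coalescent theory, in which the probability that a gene tree disagrees with the species tree is (2/3)exp(-T), where T is the internal branch length in coalescent units. The Python sketch below is a simplification rather than an analysis from this study: it pairs that formula with the expected Yule waiting time 1/(k × rate) when k lineages are present, and the choice of k = 10 is an arbitrary assumption for illustration.

```python
import math

def discordance_probability(internal_branch):
    """Probability that a three-taxon gene tree is discordant with the
    species tree, given the internal branch length in coalescent units."""
    return (2.0 / 3.0) * math.exp(-internal_branch)

def expected_internode(rate, n_lineages):
    """Expected waiting time to the next speciation event under a Yule
    process with n_lineages lineages present."""
    return 1.0 / (rate * n_lineages)

if __name__ == "__main__":
    for rate in (0.1, 1.0, 10.0):
        t = expected_internode(rate, n_lineages=10)
        p = discordance_probability(t)
        print(f"rate={rate:5.1f}  internode={t:.3f}  P(discordant)={p:.3f}")
```

Terminal branch length does not appear in the formula, which is consistent with the observation that coalescent error did not change with depth, while shorter internodes at higher rates push the discordance probability toward its maximum of 2/3.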
This differential dependence on rate and depth of the mutational and coalescent error led to a difference in how they contribute to the overall accuracy of phylogenetic estimation. For bushes shaped by slow speciation rates (e.g., rate=0.1) and shallow depths (e.g., depth=0.5), neither mutational nor coalescent error was large and relatively accurate estimates of phylogeny were obtained (Figure 3A). For the same rate but greater depths (e.g., depth=200), coalescent error remained negligible but mutational error increased (Figure 3A), suggesting that focusing on the latter problem could help resolution.
Figure 2: Schematic of our simulation procedure: Coalescent error was measured using the Robinson-Foulds (RF) distance between true species trees and species trees estimated using the true gene trees. Mutational error was measured using the RF distance between true gene trees and gene trees estimated using sequence data. Total discordance between the true species trees and species trees was estimated using gene trees that were estimated using sequence data. Solid lines represent simulated data, while dashed lines indicate estimations from the simulated data.
Regardless of depth, high speciation rates resulted in substantial error (e.g., Figure 3C; rate=10). Thus, the worst situation was a combination of a fast speciation rate with high depth (Figure 3C; e.g., rate=10, depth=200), where both coalescent and mutational error caused substantial incongruence between estimates of the species tree and the true species tree. Thus, it will be necessary to address both problems to correctly resolve relationships, and this may prove to be very difficult at the highest rates. Under these conditions, accurate gene trees were not reconstructed from sequences. Even if the gene trees had been accurate, estimating the true species tree would have remained problematic (Figure 3C). This result may reflect the limited number of genes analyzed here (see below for simulations using a larger number of genes). Overall, our results corroborated the idea that it is important to consider the depth and speciation rate when determining whether coalescent error, mutational error, or both are likely to affect phylogenetic reconstruction.
Some studies have considered the coalescent and mutational errors (as measured by RF distances) to be additive [31], but our results suggest that this is not always true. The total error was not equal to the sum of the error due to the coalescent and mutational processes in some cases (Figure 3). This means that reducing either mutational or coalescent error by a certain amount does not necessarily mean that the total error will also decrease. Thus, it is critical to consider both processes and the errors they can cause to accurately resolve phylogenies in difficult parts of parameter space.

Sampling strategy to alleviate coalescent variance
Data collection strategies are important for solving difficult phylogenetic problems. The tradeoff between sampling many loci and sampling multiple individuals has been debated for many years [38,39,61] and the best strategy appears to depend upon the bush shape [31,38]. To explore this, we focused on coalescent error for the region of bush shape space where this tradeoff is likely to be particularly important. The optimal sampling strategy to overcome coalescent error depended on the rate and depth of the bush being considered. For shallow depths and fast rates, a sampling strategy that included multiple individuals rather than multiple loci was more beneficial to resolving relationships (Figure 4). In contrast, a transition occurred in the optimal sampling strategy as simulated radiations became slower and more ancient (Figure 4B; e.g., depth=0.25, rate between 2.5 and 5.0). For bushes characterized by either high depths or low speciation rates (or both), sampling additional loci rather than additional individuals per species resulted in greater accuracy (Figure 4). Thus, it appears that when resources for data collection are limited, empirical studies are likely to benefit from sampling more individuals rather than sampling multiple loci only when they are focused on a bush with a high rate and shallow depth.
We found that sampling more than one individual did not improve phylogenetic estimation at greater depths. Even at a depth of two coalescent units, increasing the number of sampled individuals resulted in limited improvement in accuracy. This was even more pronounced at five coalescent units (Supplementary Figure S2). Indeed, the RF distances for simulations that sampled either two or five individuals approach that of a single individual at these depths. This is in agreement with theory, which indicates that the expected time for two individuals within a population to coalesce is two coalescent units and that one can be 95% confident that any number of individuals within a population will have coalesced by five coalescent units [55,62,63]. The absolute time frame for the depths we are considering is surprisingly recent. Given the parameter space we examined, which we believe to be reasonable for many bushes in the vertebrate Tree of Life, two coalescent time units may reflect 800,000 years or less. Thus, sampling multiple individuals may only be beneficial with respect to overcoming coalescent error for Plio-Pleistocene radiations even if one considers the time from the beginning of the radiation. Although there may be additional reasons to sample multiple individuals (e.g., to limit the potential impact of errors in species identification, cf. [64], or to improve estimates of demographic parameters for the extant species), sampling multiple individuals does not appear to have a direct benefit for phylogenetic estimation with more ancient rapid radiations.

The utility of concatenation for rapid radiations
Two distinct approaches have been used for phylogenetic analyses of data from multiple independent loci. A common practice is to concatenate the data into a large supermatrix for a combined analysis. This approach implicitly assumes that all loci have a single underlying tree topology. However, recent studies argue that concatenation can result in inaccurate estimates of relationships, sometimes with deceptively high support, when there is incongruence among gene trees [30,40,68,69]. Thus, methods of species tree estimation that allow topological differences among gene trees, such as gene tree reconciliation in a coalescent framework, are becoming more common.
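To make the supermatrix idea concrete, here is a minimal Python sketch of concatenation, assuming each locus is already aligned and stored as a dictionary mapping taxon names to sequences. The data structure and the function name `concatenate_loci` are illustrative assumptions; the sketch only builds the combined matrix and does not perform any tree estimation.

```python
def concatenate_loci(loci):
    """Join per-locus alignments (dicts of taxon -> aligned sequence) into a
    single supermatrix with one long sequence per taxon."""
    taxa = sorted(loci[0])
    for locus in loci:
        if sorted(locus) != taxa:
            raise ValueError("every locus must contain the same taxa")
        if len({len(seq) for seq in locus.values()}) != 1:
            raise ValueError("sequences within a locus must be aligned")
    # Locus boundaries are discarded here, which is what makes a
    # concatenated analysis assume one topology for all sites.
    return {taxon: "".join(locus[taxon] for locus in loci) for taxon in taxa}

if __name__ == "__main__":
    locus1 = {"sp1": "ACGT", "sp2": "ACGA", "sp3": "ACTA"}
    locus2 = {"sp1": "GG", "sp2": "GC", "sp3": "CC"}
    print(concatenate_loci([locus1, locus2]))
```

With five 1000 bp loci, as in the simulations, the same operation would yield a single 5000 bp matrix per taxon.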
The method used to analyze multiple loci had the greatest impact on the accuracy of phylogenetic estimation at slow (e.g., rate=0.01) to intermediate rates (e.g., rate=1). At shallow depths, both gene tree reconciliation (STEM with estimated gene trees) and concatenation resulted in comparable amounts of error. However, as depth increased, concatenation performed better than the gene tree reconciliation approach with gene trees estimated from sequence data (Figure 5A and 5B). Since coalescent error remained constant across depths (Figure 3), the relative advantage of concatenation at high depths likely reflects the impact of mutational error. In an empirical study of a deep radiation in iguanian lizards, Townsend et al. [70] also assert that concatenation will probably outperform estimation of a species tree using gene trees when a large number of the gene trees are poorly resolved. To address the hypothesis that the inferior performance of gene tree reconciliation in our simulations was due to mutational error at greater depths, we examined the performance of gene tree reconciliation using the true gene trees since these should exhibit no mutational error. The use of true gene trees resulted in substantially more accurate phylogenetic estimates than either the STEM analyses with estimated gene trees or concatenation (Figure 5). Moreover, the accuracy of the species tree using true gene trees remained constant across depths, as expected if the differences observed above reflect mutational error.
On the other hand, at high speciation rates (e.g., rate=5), the method for analyzing multiple loci was inconsequential. Both approaches resulted in substantial error, often estimating trees that were maximally different from the true species tree (Figure 5C). At this speciation rate the evolutionary radiation could only be described as explosive, although the speciation rate was not outside the likely range of speciation rates for some known adaptive radiations (e.g., African cichlids; [71]). It is not surprising that resolving such a rapid radiation would prove extremely difficult. However, even for such a rapid radiation gene tree reconciliation using the true gene trees provided more accurate estimates of the species tree than concatenation or use of estimated gene trees (Figure 5C). As depth increased, mutational error overwhelmed the coalescent error (Figure 3). In gene tree reconciliation, high mutational error led to inaccurate estimates of individual gene trees and thus of the species tree. The increased power that came from including a large number of sites in concatenated analyses appeared to reduce the impact of mutational error and compensated (at least to some degree) for the incorrect assumption that all loci have the same topology. Thus, for certain bush shapes and with the type of data simulated here, concatenation may perform better than gene tree reconciliation using estimated gene trees.
To further test the hypothesis that concatenation can be beneficial due to the increased power of sampling more sites, we analyzed 50 loci using concatenation and gene tree reconciliation. With 50 loci, concatenation performed better than gene tree reconciliation (using estimated gene trees) at all depths (Figure 6). As expected, the best performance came from use of the true gene trees, where even at very high speciation rates (Figure 6B) some estimated species trees matched the true species tree (RF distance = 0).
Comparing the RF distances from the analyses of 5 loci (Figure 5) and those with 50 loci (Figure 6) revealed that there is improved species tree estimation using concatenation when the number of loci is increased. In contrast, results using gene tree reconciliation were very similar when comparing 5 loci (Figure 5) and 50 loci (Figure 6), even though the amount of data analyzed is 10-fold greater. These results suggest that gene tree reconciliation depends heavily upon the quality of gene trees used to estimate the species tree. When there is a lot of error in the gene tree estimates, the number of gene trees becomes inconsequential. The substantial mutational error likely hindered the gene tree reconciliation approach. Thus, in the absence of accurate estimates of gene trees, our analyses indicate that concatenated analyses of an increased number of loci have excellent potential to improve the resolution of difficult clades because of the increase in the overall power of these analyses.
Despite the evidence that concatenation can be inconsistent [40,68], the ML tree from a concatenated data matrix may still represent a good estimate of the phylogeny under some circumstances. Concatenation can be viewed as a type of model violation [12] and it is known that many phylogenetic methods are relatively robust to model violations (cf. [36,72,73]). Thus, it is reasonable to postulate that analyses of concatenated data provide a useful, albeit biased, estimator of the species tree in many parts of parameter space. Although there is, as yet, no formal proof that ML analyses of concatenated data are inconsistent, simulations [68] are suggestive that analyses of concatenated data are inconsistent in the anomaly zone and further suggest that there may even be parts of parameter space where the tree favored by analyses of concatenated data is very different from the species tree. However, even in those parts of parameter space where concatenation is inconsistent, we expect the tree recovered by these analyses to be fairly close to the species tree under many circumstances. Thus, it may still be desirable to use concatenation to obtain an initial tree that can then be further rearranged to identify the optimal tree using a method that considers both the coalescent and mutational processes.
Figure 5 (partial caption): The "STEM with estimated gene trees" boxes represent the discordance between true species trees and a tree estimated using 5 gene trees that were estimated each from 1000 base pair (bp) simulated sequence data matrices. Concatenation analyses involved joining all 5 sequences into a single 5000 bp data matrix from which a tree was estimated and compared to the true gene tree. The "STEM with true gene trees" box represents the discordance between the true species tree and the estimated species tree from the true gene trees (i.e. the coalescent variance).
This would allow the use of a computationally efficient approach (i.e., ML analysis of concatenated data) to obtain a tree topology that is fairly close to the species tree before refining that topology using a computationally difficult but consistent approach (e.g., the ML approach proposed by Maddison [10]).
The excellent performance of gene tree reconciliation when true gene trees are used suggests that it will be important to focus on ways to improve gene tree estimation. The simplest way to obtain better gene tree estimates may be to increase the sequence length of the regions analyzed [34]. Empirical studies are consistent with this hypothesis; STAR (a species tree method) did not appear to perform as well as concatenation in an analysis of avian phylogeny [74] based upon a large number of short (<600 bp) loci but it did perform well in an analysis of mammalian phylogeny [25] based upon longer (>1000 bp) loci. However, the maximum length of regions that can be used to estimate gene trees is unclear since different parts of very long sequences may actually have distinct gene trees due to recombination or gene conversion within the individual regions. At this time, it is not clear how problematic this will be for vertebrates in practice.
An alternative to sequencing longer regions might be to focus on rare genomic changes (RGCs) to identify gene trees. Transposable element insertions are the most commonly used RGC in phylogenetics [5,75-77], although some studies have focused on other classes of RGCs such as microinversions [78,79], the presence/absence of microRNAs [80], and a subset of amino acid changes ("RGC_CAMs" [81]). The slow rate of accumulation for RGCs means they will not provide enough information to resolve gene trees completely. Instead, they are used to define specific bipartitions within gene trees. Indeed, conflict among transposable element insertions has been interpreted as prima facie evidence of conflict among gene trees due to lineage sorting [76,82]. However, there is also evidence that some RGCs exhibit homoplasy [79,83,84]. Indeed, several analyses [82,85] of these RGCs, in isolation from nucleotide or amino acid sequence data, have led to conclusions that conflict with careful analyses of very large sequence datasets [52,86-88]. Nonetheless, the limited homoplasy associated with RGCs suggests that they will be useful, especially if they are combined with analyses of sequence data. Since it is clear that very accurate gene tree estimates will be very useful, even if they are challenging to obtain, identifying the best ways to obtain accurate estimates of gene trees seems critical.

Confronting theory with data
Our simulations examined an especially difficult phylogenetic problem: a rapid radiation followed by a period of time with no speciation (Figure 1A). This situation may seem extreme, but it is relevant to many known biological radiations and is of general interest for assembling the Tree of Life. Even when there is post-radiation speciation, the situation is expected to be similar to our model tree if there are long branches between the initial radiation and later speciation events. Thus, excellent examples of bushes might be found in the divergence among the three major supergroups of eutherian mammals (Boreoeutheria, Afrotheria, and Xenarthra), where analyses of transposable element insertions [5] suggest a polytomy but species tree analyses suggest an Afrotheria-Xenarthra clade [25,26]. Indeed, the Afrotheria-Xenarthra clade has been suggested to reflect an empirical example of a case where estimation of phylogeny using concatenated data is inconsistent [25,26]. However, we believe that conclusion should be approached with caution since the Afrotheria-Xenarthra clade was recovered both in some concatenated analyses [89] and in analyses using a model that accommodates gene duplication and loss but not lineage sorting [90]. Regardless, it seems reasonable to view these early divergences among eutherian supergroups as a radiation that is both relatively ancient and similar in rate to the bushes we simulated, albeit with fewer taxa.
Additional bushes that are similar to our model tree can be found in the birds. These include Notopalaeognathae (the clade comprising all extant paleognathous birds except the ostrich; Yuri et al. [91]) and Neoaves (the clade comprising the majority of extant birds; reviewed by Cracraft et al. [92]). Both of these groups include a number of highly divergent taxa characterized by long periods with no net speciation after the initial radiation, especially if we consider the subset of Neoaves designated "Metaves" by Fain and Houde [93]. Although these examples include some subsequent speciation, they are unified by the origin of a relatively large number of lineages during a short period of time followed by limited cladogenesis (and/or substantial extinction) afterward, at least in some lineages. The existence of these examples, along with examples in other lineages (e.g., iguanian lizards [70]), emphasizes the fact that the parts of parameter space we explored are relevant to important biological problems.

Future directions
Our study illuminates the ways that specific characteristics of bush shape can influence the phylogenetic error due to the coalescent and mutational processes. We used empirical data to guide our simulations, so the assumptions we made have both theoretical and empirical justification. However, as in all simulation studies, there are limitations in our choices and our results reflect the parts of parameter space that we chose to explore. We chose our bush shape (Figure 1) to reduce the tree characteristics to a set of two parameters. Although we have highlighted examples of empirical situations that are similar, all of our examples included some subsequent speciation. Given that subsequent speciation (and extinction) events do occur post-radiation, it would be interesting to pursue the effect of these events in future studies. Methods to characterize shifts in the rate of speciation have been developed [94], and it might be possible to use these methods to parameterize realistic model trees with shifting rates of speciation and extinction rather than our simple two-parameter bush model. Nonetheless, our bush models are likely to be informative since the most problematic parts of trees are likely to be the rapid radiations.
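To give a sense of why short internodes during a radiation generate coalescent error, the following Python sketch evaluates the standard three-taxon result from coalescent theory: the probability that a gene tree differs in topology from the species tree is (2/3)e^(-T), where T is the internal branch length in coalescent units. This is a textbook relationship rather than a calculation from our simulations. The function name, the assumption of a diploid population (T = generations / 2Ne), and the numerical values are illustrative only.

```python
import math


def discordance_probability(internal_branch_generations, effective_population_size):
    """Probability that a three-taxon gene tree mismatches the species tree.

    Assumes a diploid population, so the internal branch length in
    coalescent units is T = generations / (2 * Ne).
    """
    t = internal_branch_generations / (2.0 * effective_population_size)
    # With probability exp(-T) the two sister lineages fail to coalesce
    # on the internal branch; 2 of the 3 equally likely resolutions in the
    # ancestral population then disagree with the species tree.
    return (2.0 / 3.0) * math.exp(-t)


# Hypothetical values: Ne = 100,000 and internodes of increasing length.
for generations in (0, 50_000, 200_000, 1_000_000):
    p = discordance_probability(generations, 100_000)
    print(f"{generations:>9} generations: P(discordant) = {p:.3f}")
```

The expression depends only on the length of the internal branch and not on how long ago the radiation occurred, which is consistent with our observation that coalescent error did not change with the depth of the bush.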
Here we focused on simulations of up to 50 loci that evolve under patterns similar to the nuclear introns of birds and mammals, in part because studies using these types of markers in this number are becoming more common [29,95,96]. We recognize that the patterns of molecular evolution may differ among groups of organisms (e.g., turtles have evolved more slowly [87] and exhibit less GC-content heterogeneity [97] than birds or mammals) and types of markers (e.g., ultraconserved elements [25,74] exhibit different patterns of sequence evolution from the introns simulated herein). Furthermore, we avoided including rate variation among lineages in order to limit the impact of bias upon our simulations. This allowed us to use a simple tree model (Figure 1) and focus on the other aspects of mutational variance. However, simulations that include rate variation among lineages (in addition to the variation among loci and among sites that we simulated) could be very interesting.
We also restricted our analyses to computationally tractable approaches that are commonly used in empirical studies. There are a number of these methods that rely upon a two-step process, first estimating gene trees and later combining them to generate an estimate of the species tree (Liu et al. [98]). However, other methods exist and those methods may provide better estimates of species trees under certain conditions [31,40]. Specifically, the Bayesian MCMC approaches BEST [99] and *BEAST [100] simultaneously estimate gene trees and species trees from sequence data and avoid the two-step approach. Bayesian methods provide a straightforward means to assess parameter identifiability [101], although these methods are computationally demanding for studies with large numbers of species. A third Bayesian approach, BUCKy [102,103], does use a two-step procedure that has the potential to better accommodate uncertainty in the estimates of gene trees. Although it would be interesting to test the effectiveness of these and other methods for phylogenetic inference using the types of evolutionary radiations we explored here, we note that some simulation studies [104] have found that these much more computationally intensive methods have relatively limited increases in accuracy, at least in some parts of parameter space. Moreover, we note that STEM is a consistent estimator of the species tree when gene trees and their branch lengths are known [56]. Our focal question for this study was whether mutational error, given patterns of molecular evolution based upon empirical studies, was sufficient to degrade the performance of a representative coalescent-based gene tree reconciliation method, and this does appear to be the case in specific parts of our "bush parameter space".
This result leads us to suggest that the greatest benefit to improving these methods of phylogenetic estimation may come from identifying methods to improve gene tree estimates, whether those improvements reflect the joint estimation of gene trees and the species tree or two-step methods combined with other approaches to improve gene tree estimates.
Finally, it will be interesting to examine how processes such as recombination, selection, and hybridization can influence phylogenetic estimation for the bushes we considered. It seems reasonable to speculate that these processes will also exacerbate the difficulties associated with the phylogenetic inference problem. Few methods exist for making inferences when these problems are present. We note, however, that BUCKy [103] does not make assumptions regarding the source of discordance among the estimates of gene trees and that Kubatko [105] recently proposed a species tree approach that incorporates hybridization. Adding these types of complexities to simulations would provide further information about the best approaches for phylogenetic reconstruction and allow simulations of this type to be expanded beyond the focus of this study to many other parts of the Tree of Life.
The performance of phylogenetic methods has been evaluated in many ways, including mathematical analyses, simulations, and studies of "known" phylogenies (see Yuri et al. [91] for a discussion regarding the limitations of the last approach). These approaches are complementary; for example, the development of modern species tree methods was motivated in part by the proof that the anomaly zone exists [23]. Although proofs of consistency are important, phylogenetic methods should also be evaluated based upon their performance in the parts of parameter space that are most relevant to practicing systematists. These evaluations will help systematists determine appropriate approaches for phylogenetic estimation for their specific problem. Overall, we feel that it is possible to explore many of the parts of parameter space that are most relevant to practicing systematists to assess methods as they are being developed.

Conclusion
As we expand data collection to assemble the Tree of Life, it is important to examine the performance of phylogenetic methods given realistic models (and parameter values) that describe the process of evolution. We demonstrated the ways coalescent and mutational error impact phylogenetic inference given bushes of different shapes and highlighted approaches that may reduce these errors and improve accuracy of phylogenetic inference. Surprisingly, we found that concatenation performed better than gene tree reconciliation for deep bushes when mutational error overwhelmed coalescent error. However, the poor performance of gene tree reconciliation appeared to be due to the use of poor gene tree estimates; using true gene trees with gene tree reconciliation always resulted in the best estimates of the species tree. Unless it is possible to obtain accurate gene trees, concatenation of many loci may provide a tractable approach to resolve difficult phylogenetic problems (albeit one that may exhibit biases for a subset of nodes under specific conditions). Regardless, the relatively good performance of concatenation in this study suggests that concatenation should continue to be compared to species tree methods in both empirical and simulation studies, at least for the time being. In the long term, however, identifying the best ways to improve gene tree estimation along with the continued development of improved approaches for species tree estimation will improve the resolution of the bushes in the Tree of Life.