Using Ancestral Information to Inform Analyses of Complex Data Sets

Over the last decade, improvements in sequencing technologies coupled with active development of association mapping methods have made it possible to link genotypes and quantitative traits in humans. Despite substantial progress in the ability to generate and analyze large data sets, however, genotype-phenotype associations are often difficult to find, even in studies with large numbers of individuals and genetic markers. This is due, in part, to the fact that effects of individual loci can be small and/or dependent on genetic variation at other loci or the environment. Tree-based mapping, which uses the evolutionary relatedness of sampled individuals to gain information during association mapping, has the potential to significantly improve our ability to detect loci impacting human traits. However, current tree-based methods are too computationally intensive and inflexible to be of practical use. Here, we compare tree-based methods with more classical approaches for association mapping and discuss how the limitations of these newer methods might be addressed. Ultimately, these advances have the potential to improve our understanding of the molecular mechanisms underlying complex diseases.


Introduction
A central goal in the biological and biomedical sciences is to identify the genetic basis of morphological, physiological, behavioral, and disease traits. Over the last decade, improvements in deoxyribonucleic acid (DNA) sequencing technologies coupled with active development of genome-wide association (GWA) methods have made it possible to link genetic variation and quantitative traits in a wide range of organisms, including humans. However, despite substantial progress in our ability to generate and analyze large data sets, important statistical and bioinformatic challenges remain [1,2]. For example, while GWA studies have identified a large number of loci contributing to human disease, these loci rarely map to individual genes, let alone individual mutations [3-5]. Moreover, identified loci typically account for only a fraction of the total heritable variation in quantitative traits. To date, multiple overlapping explanations have been proposed to account for this "missing heritability" [6,7]. These explanations, some of which are described in more detail below, suggest several strategies for improving on current GWA methodology, including increased sampling (of genetic regions and individuals), better measurements of traits and environmental variables, and improvements to existing statistical methodology. Here, we focus on the potential of a novel statistical framework, tree-based association mapping, to improve our ability to map complex traits (i.e., those that are due to multiple genes and are influenced by environment, genotype-by-environment, and genotype-by-genotype effects).
One of the leading explanations for missing heritability in human GWA studies is that many common diseases (e.g., cancer, diabetes, and heart disease) likely stem from the combined action of a large number of rare variants with individually small impacts on disease susceptibility. For example, despite the hundreds of GWA studies that have been performed to date, large-effect variants (e.g., APOE4 in Alzheimer's disease and CFH in age-related macular degeneration) remain the exception rather than the rule [3]. With currently available mapping methods, small-effect loci will be extremely difficult to detect without massive sample sizes.
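To give a rough sense of why small-effect loci demand large samples, the sketch below uses a standard normal approximation for a single-locus, one-degree-of-freedom test. It is an illustration under stated assumptions, not a calculation taken from the studies cited here: the function name `required_sample_size`, the genome-wide significance threshold of 5e-8, and the 80% power target are all assumptions introduced for this example.

```python
import math
from statistics import NormalDist

def required_sample_size(variance_explained, alpha=5e-8, power=0.8):
    """Approximate N needed to detect one locus with a 1-df test.

    Uses the normal approximation N ~ (z_alpha + z_power)^2 * (1 - q) / q,
    where q is the fraction of trait variance explained by the locus.
    The defaults for alpha and power are illustrative assumptions.
    """
    if not 0.0 < variance_explained < 1.0:
        raise ValueError("variance_explained must lie strictly between 0 and 1")
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided significance threshold
    z_power = z.inv_cdf(power)              # quantile for the desired power
    q = variance_explained
    return math.ceil((z_alpha + z_power) ** 2 * (1.0 - q) / q)

# Hypothetical effect sizes, chosen only to show the trend.
for q in (0.01, 0.001, 0.0001):
    print(f"variance explained {q:.4f}: N ~ {required_sample_size(q)}")
```

Under this approximation the required sample size grows roughly in proportion to 1/q, so a locus explaining one tenth as much trait variance needs about ten times as many individuals.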
Another potential explanation for missing heritability is that the effects of many genetic variants could depend largely on the environmental and genetic contexts in which they occur. For example, variation at the monoamine oxidase A (MAOA) gene is associated with violent behavior in humans, but only if the individual was abused as a child [8]. Also, gene-gene interactions (epistasis) are well-documented in controlled laboratory crosses in model organisms such as fruit flies and mice. While identification of epistasis remains elusive in humans [9,10], likely due to limited statistical power to detect gene-gene interactions in GWA studies of genetically diverse human populations, it is suspected to be widespread [3,11,12]. Disease susceptibility variants can also depend on sex [13] or on the parent from which the allele was inherited [14]. In short, the impact of a given genetic variant on a disease trait is often highly context-dependent. Such variants may be very difficult to detect when traits are measured in multiple genetic backgrounds and/or multiple environments, as is often the case in GWA studies. Ignoring such information during analyses may reduce the power to identify associated genetic loci when multiple factors (genetic or otherwise) influence a quantitative trait. Developing methods that can adapt to the different contexts of GWA study data sets may increase the power to detect associated loci using quantitative trait mapping.
Many of the current limitations of association mapping methods ultimately stem from limitations on the power to detect and localize causal variants (because they have small effect sizes, are context-dependent, or both). While increasing sample size is one way to approach this problem, another strategy is to develop more powerful statistical methods that can take greater advantage of the information contained within the data.
In particular, in contrast to most commonly used methods for association mapping, tree-based methods use the evolutionary relatedness of sampled individuals to gain information during analysis. Thus, these methods have the potential to show increased power in detecting small-effect and/or context-dependent loci. However, most currently available tree-based methods [15-19] are computationally inefficient and/or cannot take into account external covariates that may influence variation in quantitative traits. Extending tree-based methods to the wider variety of contexts provided by GWA study data sets may improve the power of association mapping compared to existing non-tree-based approaches. Before describing tree-based methods in more detail, we discuss the rationale for genetic mapping and the advantages and limitations of existing association mapping methods.

Background
The conceptual basis of quantitative trait mapping (QTM), in which statistical correlations are sought between quantitative traits and polymorphic DNA variants ("markers"), dates back to the early part of the 20th century [20]. However, only relatively recently has the widespread availability of variable markers (e.g., single nucleotide polymorphisms (SNPs), insertion/deletion mutations (indels), or simple sequence repeats (microsatellites)) made QTM feasible in humans [3]. In some cases, QTM is performed when the relationships among sampled individuals are known (linkage mapping), while in other cases, the relationships among individuals are unknown (association mapping). The present editorial focuses on the techniques developed for association mapping; the reader is referred to the existing literature for information about the analysis of data with known familial relationships among sampled individuals [21-24]. Thus far, methodology proposed for association mapping either uses information in the evolutionary history present among genes and excludes other available information (including but not limited to experimental information and external covariate information) or uses external covariates to inform the analysis through the direct application of classical techniques that ignore the evolutionary relationships present within a particular SNP.
Classical statistical techniques applied in association mapping include the t-test, analysis of variance (ANOVA), and generalized linear model approaches that can be applied either marginally at each SNP or jointly on small neighboring sets of SNPs. Using generalized linear models allows straightforward adjustment for covariates during analyses [24-28]. More generally, classical statistical approaches are simple, readily available, and computationally efficient to implement on large GWA study data sets, making them popular approaches to association mapping. However, the precise localization of associated SNPs is not readily addressed by current techniques that are flexible enough to allow for covariates [25,29]. Additionally, these approaches assume independence among sampled individuals at each SNP, while the evolutionary relationships among these sampled individuals could be a potential source of covariation. By failing to consider shared evolutionary history among sampled individuals, classical statistical techniques could lose power to identify causal locations compared to methods that utilize this information.
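As an illustration of a marginal, covariate-adjusted test at a single SNP, the sketch below fits an ordinary least-squares model of the form trait = intercept + genotype + covariate and returns the genotype coefficient and its t-statistic. It is a minimal sketch, not the implementation used in any cited method: the function names `invert` and `snp_test`, the additive 0/1/2 genotype coding, and the toy data are all hypothetical. Note that the model treats individuals as independent, which is exactly the assumption discussed above.

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def snp_test(genotypes, covariate, trait):
    """Regress trait on intercept, genotype, and one covariate.

    Returns (beta, t) for the genotype term, assuming independent individuals.
    """
    X = [[1.0, float(g), float(c)] for g, c in zip(genotypes, covariate)]
    n, p = len(X), 3
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(X, trait)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    resid = [y - sum(b * x for b, x in zip(beta, row)) for row, y in zip(X, trait)]
    sigma2 = sum(r * r for r in resid) / (n - p)  # residual variance estimate
    se = math.sqrt(sigma2 * inv[1][1])            # standard error of the genotype effect
    return beta[1], beta[1] / se

# Toy data (hypothetical): additive genotype coding, age as the covariate.
genotypes = [0, 1, 2, 0, 1, 2, 0, 1, 2, 1]
age = [30, 45, 50, 60, 35, 40, 55, 25, 65, 48]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, -0.05, 0.2, -0.15, 0.0]
trait = [2.0 + 0.5 * g + 0.02 * a + e for g, a, e in zip(genotypes, age, noise)]

beta, t_stat = snp_test(genotypes, age, trait)
print(f"genotype effect = {beta:.3f}, t = {t_stat:.2f}")
```

In a GWA setting the same fit would be repeated at every SNP, which is what makes this family of methods cheap to run on large data sets.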
Information about the evolutionary history for sampled observations can be represented by a bifurcating phylogenetic tree, as in Figure 1. The tips of the tree represent the sampled individuals at the present time, and the leftmost point on the tree represents the most recent common ancestor of the genetic variant under study. The lengths of the branches represent time, so that, if two observations share a branch, they share that part of their evolutionary history. Observations evolve independently after a split in their evolutionary history (represented by a split in a branch on the phylogenetic tree when viewed from left to right). If two observations (such as those denoted by red squares in Figure 1) share a large part of their evolutionary history, they are expected to have greater similarity in their trait(s) than two observations that share only a small portion of their evolutionary history (such as those denoted by blue circles). In fact, the technique in Thompson and Kubatko [17] suggests that, at a causal SNP, the covariance between two sampled observations could be approximated by the length of shared evolutionary history for that particular SNP. Phylogenetic methods provide an avenue to use the evolutionary relatedness among sampled individuals in the analysis of GWA study data, which may also be beneficial to association mapping [30].
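The idea of approximating covariance by shared evolutionary history can be sketched in a few lines. The example below is an illustration only and does not reproduce the method of Thompson and Kubatko [17]: the tuple-based tree encoding, the function names `tip_paths` and `shared_history_matrix`, and the branch lengths are all hypothetical, chosen so that two tips share most of their history and two others share little. Node names are assumed to be unique.

```python
def tip_paths(node, prefix=()):
    """Map each tip name to the edges (node_name, length) on its path from the root.

    A node is a tuple (name, branch_length_to_parent, children); tips have no children.
    """
    name, length, children = node
    path = prefix + ((name, length),)
    if not children:
        return {name: path}
    paths = {}
    for child in children:
        paths.update(tip_paths(child, path))
    return paths

def shared_history_matrix(tree):
    """Return (tips, cov), where cov[i][j] is the branch length shared by tips i and j."""
    paths = tip_paths(tree)
    tips = sorted(paths)
    cov = [[sum(length for _, length in set(paths[a]) & set(paths[b]))
            for b in tips] for a in tips]
    return tips, cov

# Hypothetical tree: the "red" tips diverge late, the "blue" tips diverge early.
tree = ("root", 0.0, [
    ("x", 4.0, [("red1", 1.0, []), ("red2", 1.0, [])]),
    ("y", 0.5, [("blue1", 4.5, []), ("blue2", 4.5, [])]),
])

tips, cov = shared_history_matrix(tree)
for tip, row in zip(tips, cov):
    print(tip, row)
```

In this toy tree the two "red" tips share 4.0 units of history and the two "blue" tips share 0.5, while a red tip and a blue tip share none. Each diagonal entry is the full root-to-tip length. A matrix of this kind could, in principle, stand in for the independence assumption made by classical per-SNP tests.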
Tree-based methods use estimated phylogenetic trees to gain information about the evolutionary history of a set of randomly sampled outbred individuals, and these methods show increased power compared to classical statistical techniques that ignore this information. Previous tree-based methods include those of Zöllner and Pritchard [19], which are not computationally feasible for large data sets, and those of Besenbacher et al., Pan et al., and Zhang et al. [15,16,18], which consider all possible groups of observations compatible with the estimated phylogenies during association analysis. Because these methods use estimated phylogenies for each sampled SNP, they are especially computationally intensive. The method in Thompson and Kubatko [17] limits the required number of computations at the expense of considering only groups of observations defined by the earliest evolutionary events (edges) along an estimated phylogeny. In addition, current tree-based methods for data from randomly sampled individuals are limited by their inability to incorporate covariate information or any other existing information during association mapping. By extending these methods while remaining cognizant of the computational difficulties often associated with them, phylogenetic tools may provide an avenue for researchers to use external information during association mapping and achieve superior power over classical statistical techniques.

Future Directions
While the immediate goal of QTM is to identify loci that are statistically associated with complex trait variation, the ultimate goal is to use this information to uncover the biological underpinnings of quantitative trait variation and human disease. To these ends, finding genomic regions harboring causal variants is not enough: it is only through finding the causal mutations themselves that we can dissect the molecular mechanisms that connect changes at the DNA level to traits expressed at the organism level. Moreover, many longstanding evolutionary questions are best informed by the identification of mutations rather than genomic regions [31], such as: Does adaptation proceed via a few large mutational steps or many small ones? Do individual mutations tend to impact few or many traits? How often do populations adapting to similar conditions utilize the same mutational solutions? Unfortunately, few GWA studies have achieved gene-level resolution, and even fewer have achieved mutation-level resolution. Detection and localization are especially challenging when individual effects are small and/or context-dependent. By extracting more information from the data, tree-based methods have the potential to significantly improve our ability to find causal mutations. In particular, the performance of association mapping methods may be improved by estimating covariance structures using ancestral information within genes, which can be done using phylogenetic techniques. If successful, these methods may help recover some of the "missing heritability" that has plagued GWA studies of complex diseases to date.

Figure 1: Example of a phylogeny at a particular SNP. In the phylogenetic tree, time moves from past (left) to present (right) across the tree, and the tips of the tree represent observations from the present time. The amount of shared evolutionary history between the two observations with red squares is large, so that a large covariance is expected between their trait values. In contrast, the two observations denoted by blue circles share a smaller portion of their evolutionary history, so that little covariance in their trait values is expected from shared evolutionary history.