Received Date: November 01, 2013; Accepted Date: November 02, 2013; Published Date: November 05, 2013
Citation: Thompson KL, Charnigo R, Linnen CR (2013) Using Ancestral Information to Inform Analyses of Complex Data Sets. J Biomet Biostat 4:e126. doi:10.4172/2155-6180.1000e126
Copyright: © 2013 Thompson KL, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Over the last decade, improvements in sequencing technologies coupled with active development of association mapping methods have made it possible to link genotypes and quantitative traits in humans. Despite substantial progress in the ability to generate and analyze large data sets, however, genotype-phenotype associations are often difficult to find, even in studies with large numbers of individuals and genetic markers. This is due, in part, to the fact that effects of individual loci can be small and/or dependent on genetic variation at other loci or the environment. Tree-based mapping, which uses the evolutionary relatedness of sampled individuals to gain information during association mapping, has the potential to significantly improve our ability to detect loci impacting human traits. However, current tree-based methods are too computationally intensive and inflexible to be of practical use. Here, we compare tree-based methods with more classical approaches for association mapping and discuss how the limitations of these newer methods might be addressed. Ultimately, addressing these limitations could deepen our understanding of the molecular mechanisms underlying complex diseases.
A central goal in the biological and biomedical sciences is to identify the genetic basis of morphological, physiological, behavioral, and disease traits. Over the last decade, improvements in deoxyribonucleic acid (DNA) sequencing technologies coupled with active development of genome-wide association (GWA) methods have made it possible to link genetic variation and quantitative traits in a wide range of organisms, including humans. However, despite substantial progress in our ability to generate and analyze large data sets, important statistical and bioinformatic challenges remain [1,2]. For example, while GWA studies have identified a large number of loci contributing to human disease, these loci rarely map to individual genes, let alone individual mutations [3-5]. Moreover, identified loci typically account for only a fraction of the total heritable variation in quantitative traits. To date, multiple overlapping explanations have been proposed to account for this “missing heritability” [6,7]. These explanations, some of which are described in more detail below, suggest several strategies for improving on current GWA methodology, including: increased sampling (of genetic regions and individuals), better measurements of traits and environmental variables, and improvements of existing statistical methodology. Here, we focus on the potential for using a novel statistical framework, tree-based association mapping, to improve our ability to map complex traits (i.e., those due to multiple genes and that are influenced by environment, genotype-by-environment, and genotype-by-genotype effects).
One of the leading explanations for missing heritability in human GWA studies is that many common diseases (e.g., cancer, diabetes, and heart disease) likely stem from the combined action of a large number of rare variants with individually small impacts on disease susceptibility. For example, despite the hundreds of GWA studies that have been performed to date, large-effect variants (e.g., APOE4 in Alzheimer’s disease and CFH in age-related macular degeneration) remain the exception rather than the rule. Using currently available mapping methods, small-effect loci will be extremely difficult to detect without massive sample sizes.
Another potential explanation for missing heritability is that the effects of many genetic variants could depend largely on the environmental and genetic contexts in which they occur. For example, variation at the monoamine oxidase A (MAOA) gene is associated with violent behavior in humans, but only if the individual was abused as a child. Also, gene-gene interactions (epistasis) are well-documented in controlled laboratory crosses in model organisms such as fruit flies and mice. While identification of epistasis remains elusive in humans [9,10], likely due to limited statistical power to detect gene-gene interactions in GWA studies of genetically diverse human populations, it is suspected to be widespread [3,11,12]. Disease susceptibility variants can also depend on sex or on the parent from which the allele was inherited. In short, the impact of a given genetic variant on a disease trait is often highly context-dependent. Such variants may be very difficult to detect when traits are measured in multiple genetic backgrounds and/or multiple environments, as is often the case in GWA studies. Ignoring such information during analyses may reduce the power to identify associated genetic loci when multiple factors (genetic or otherwise) influence a quantitative trait. Developing methods that can adapt to the different contexts of GWA study data sets may increase the power to detect associated loci in quantitative trait mapping.
Many of the current limitations of association mapping methods ultimately stem from limitations on the power to detect and localize causal variants (because they have small effect sizes, are context-dependent, or both). While increasing sample size is one way to approach this problem, another strategy is to develop more powerful statistical methods that can take greater advantage of the information contained within the data. In particular, in contrast to most commonly used methods for association mapping, tree-based methods use the evolutionary relatedness of sampled individuals to gain information during analysis. Thus, there is potential for these methods to show increased power in detecting small-effect and/or context-dependent loci. However, most currently available tree-based methods [15-19] are computationally inefficient and/or cannot take into account external covariates that may influence variation in quantitative traits. Extending tree-based methods to the wider variety of contexts provided by GWA study data sets may improve the power in association mapping compared to existing non-tree-based approaches. Before describing tree-based methods in more detail, we discuss the rationale for genetic mapping and the advantages and limitations of existing association mapping methods.
The conceptual basis of quantitative trait mapping (QTM), in which statistical correlations are sought between quantitative traits and polymorphic DNA variants (“markers”), dates back to the early part of the 20th century. However, only relatively recently has the widespread availability of variable markers (e.g., single nucleotide polymorphisms (SNPs), insertion/deletion mutations (indels), or simple sequence repeats (microsatellites)) made QTM feasible in humans. In some cases, QTM is performed when the relationships among sampled individuals are known (linkage mapping), while in other cases, the relationships among individuals are unknown (association mapping). The present editorial focuses on the techniques developed for association mapping, although the reader is referred to the existing literature for information about the analysis of data with known familial relationships among sampled individuals [21-24]. Thus far, methodology proposed for association mapping falls into two groups. The first uses information in the evolutionary history present among genes and excludes other available information (including but not limited to experimental information and external covariate information). The second uses external covariates to inform the analysis through the direct application of classical techniques that ignore the evolutionary relationships present within a particular SNP.
Classical statistical techniques applied in association mapping include the t-test, analysis of variance (ANOVA), and generalized linear model approaches that can be applied either marginally at each SNP or jointly on small neighboring sets of SNPs. Using generalized linear models allows straightforward adjustment for covariates during analyses [24-28]. More generally, classical statistical approaches are simple, readily available, and computationally efficient to implement on large GWA study data sets, making them popular approaches to association mapping. However, the precise localization of associated SNPs is not readily addressed by current techniques that are flexible enough to allow for covariates [15,25,29]. Additionally, these approaches assume independence among sampled individuals at each SNP, while the evolutionary relationships among these sampled individuals could be a potential source of covariation. By failing to consider shared evolutionary history among sampled individuals, classical statistical techniques could lose power to identify causal locations compared to methods that utilize this information.
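To make the marginal, covariate-adjusted analysis described above concrete, the following is a minimal sketch of an ordinary least squares fit at each SNP, one SNP at a time. It is an illustration under stated assumptions rather than the procedure of any cited study: the function name `marginal_snp_tstats`, the additive 0/1/2 genotype coding, and the simulated data in the usage note are all hypothetical, and individuals are treated as independent, which is exactly the assumption questioned in the paragraph above.

```python
import numpy as np

def marginal_snp_tstats(genotypes, trait, covariates):
    """Per-SNP t-statistics from a linear model trait ~ covariates + SNP.

    genotypes  : (n, m) array of allele counts (0/1/2), assumed polymorphic
    trait      : (n,) array of quantitative trait values
    covariates : (n, k) array of external covariates (e.g., age)
    Individuals are treated as independent (no shared-ancestry covariance).
    """
    n, m = genotypes.shape
    base = np.column_stack([np.ones(n), covariates])  # intercept + covariates
    tstats = np.empty(m)
    for j in range(m):
        X = np.column_stack([base, genotypes[:, j]])  # SNP is the last column
        beta, _, _, _ = np.linalg.lstsq(X, trait, rcond=None)
        resid = trait - X @ beta
        dof = n - X.shape[1]
        sigma2 = resid @ resid / dof                  # residual variance
        cov_beta = sigma2 * np.linalg.inv(X.T @ X)    # OLS coefficient covariance
        tstats[j] = beta[-1] / np.sqrt(cov_beta[-1, -1])
    return tstats
```

As a usage illustration, one could simulate genotypes with `np.random.default_rng(0).binomial(2, 0.3, size=(500, 10))`, build a trait that depends on one SNP and one covariate, and check that the largest absolute t-statistic falls at the simulated causal SNP. Each fit is cheap, which is why this style of analysis scales to large GWA data sets, but nothing in it uses the evolutionary relationships among individuals.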
Information about the evolutionary history for sampled observations can be represented by a bifurcating phylogenetic tree, as in Figure 1. The tips of the tree represent the sampled individuals at the present time, and the leftmost point on the tree represents the most recent common ancestor of the genetic variant under study. The lengths of the branches represent time, so that, if two observations share a branch, they share that part of their evolutionary history. Observations evolve independently after a split in their evolutionary history (represented by a split in a branch on the phylogenetic tree when viewed from left to right). If two observations (such as those denoted by red squares in Figure 1) share a large part of their evolutionary history, they are expected to have greater similarity in their trait(s) than two observations that share only a small portion of their evolutionary history (such as those denoted by blue circles). In fact, the technique in Thompson and Kubatko suggests that, at a causal SNP, the covariance between two sampled observations could be approximated by the length of shared evolutionary history for that particular SNP. Phylogenetic methods provide an avenue to use the evolutionary relatedness among sampled individuals in the analysis of GWA study data, which may also be beneficial to association mapping.
Figure 1: Example of a phylogeny at a particular SNP. In the phylogenetic tree, time moves from past (left) to present (right) across the tree, and the tips of the tree represent observations from the present time. The amount of shared evolutionary history between the two observations with red squares is large, so that a large covariance is expected between their trait values. In contrast, the two observations denoted by blue circles share a smaller portion of their evolutionary history, so that little covariance in their trait values is expected from shared evolutionary history.
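The idea that covariance between two observations can be approximated by the length of their shared evolutionary history can be sketched in a few lines. The code below is only an illustration of that idea, not the implementation of any cited method: the function name `shared_history_covariance`, the parent-map representation of the tree, and the four-tip example tree in the usage note are hypothetical choices made here.

```python
import numpy as np

def shared_history_covariance(parent, length, tips):
    """Matrix of shared root-to-tip branch length for each pair of tips.

    parent : dict mapping each node to its parent (the root maps to None)
    length : dict mapping each non-root node to the length of the branch above it
    tips   : list of tip names, fixing the row/column order of the result
    """
    def edges_to_root(node):
        # Each edge is identified by the node at its lower (child) end.
        edges = set()
        while parent[node] is not None:
            edges.add(node)
            node = parent[node]
        return edges

    paths = [edges_to_root(t) for t in tips]
    n = len(tips)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Branches on both root-to-tip paths are the shared history.
            C[i, j] = sum(length[e] for e in paths[i] & paths[j])
    return C

# Hypothetical four-tip tree: t1 and t2 split late, t3 and t4 split early.
parent = {"root": None, "A": "root", "B": "root",
          "t1": "A", "t2": "A", "t3": "B", "t4": "B"}
length = {"A": 3.0, "B": 1.0, "t1": 1.0, "t2": 1.0, "t3": 3.0, "t4": 3.0}
C = shared_history_covariance(parent, length, ["t1", "t2", "t3", "t4"])
```

In this hypothetical tree, `t1` and `t2` play the role of the red squares in Figure 1 (three of four time units shared), while `t3` and `t4` play the role of the blue circles (one of four time units shared). Tips on opposite sides of the root share no history, so their entry is zero, and each diagonal entry equals the full root-to-tip length.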
Tree-based methods use estimated phylogenetic trees to gain information about the evolutionary history of a set of randomly sampled outbred individuals, and these methods show increased power compared to classical statistical techniques that ignore this information. Previous tree-based methods include those in Zöllner and Pritchard, which are not computationally feasible for large data sets, and Besenbacher et al., Pan et al., and Zhang et al. [15,16,18], which consider all possible groups of observations compatible with the estimated phylogenies during association analysis. Because these methods use estimated phylogenies for each sampled SNP, they are especially computationally intensive. The method in Thompson and Kubatko limits the required number of computations at the expense of considering only groups of observations defined by the earliest evolutionary events (edges) along an estimated phylogeny. In addition, current tree-based methods for data from randomly sampled individuals are limited by their inability to incorporate covariate information or any other existing information during association mapping. By extending these methods and remaining cognizant of the computational difficulties often associated with them, phylogenetic tools may provide an avenue for researchers to use external information during association mapping and achieve superior power over classical statistical techniques.
While the immediate goal of QTM is to identify loci that are statistically associated with complex trait variation, the ultimate goal is to use this information to uncover the biological underpinnings of quantitative trait variation and human disease. To these ends, finding genomic regions harboring causal variants is not enough: it is only through finding the causal mutations themselves that we can dissect the molecular mechanisms that connect changes at the DNA level to traits expressed at the organism level. Moreover, many longstanding evolutionary questions are best informed by the identification of mutations rather than genomic regions, such as: Does adaptation proceed via a few large mutational steps or many small ones? Do individual mutations tend to impact few or many traits? How often do populations adapting to similar conditions utilize the same mutational solutions? Unfortunately, few GWA studies have achieved gene-level resolution, and even fewer have achieved mutation-level resolution. Detection and localization are especially challenging when individual effects are small and/or context-dependent. By extracting more information from the data, tree-based methods have the potential to significantly improve our ability to find causal mutations. In particular, the performance of association mapping methods may be improved by estimating covariance structures using ancestral information within genes, which can be done using phylogenetic techniques. If successful, these methods may help recover some of the “missing heritability” that has plagued GWA studies of complex diseases to date.
However, two significant challenges in the development of tree-based association methods remain. First, existing methods are too computationally intensive to be of practical use for large GWA studies. The methods in Besenbacher et al. and Thompson and Kubatko [15,17] propose the use of broad-scale evolutionary relationships to address this limitation. Second, while context-dependence is pervasive in quantitative traits, current tree-based methods are not flexible enough to take into account environmental or gender-specific covariates. Importantly, context-dependent effects are more than just nuisance parameters in association mapping: gene-environment and gene-gene interactions may provide essential clues to the molecular pathways underlying complex traits. Thus, methods that show an improved ability to detect and quantify epistatic effects and genotype-by-environment interactions would represent a significant advance in GWA methodology. Together, these improvements have the potential to yield novel insights into the genetics of complex diseases that may better inform disease prediction and treatment strategies.
This material is based, in part, upon work supported by the National Science Foundation under Grant No. DEB-1257739. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.