Hierarchical Models for Genetic Association Studies

It is well known that complex diseases are caused by a network of interacting factors and therefore statistically it makes more sense to consider multiple predictors (genetic variants and environmental covariates) simultaneously than to consider a single SNP (singlenucleotide polymorphism) [1,2]. Such joint analyses not only improve the power for detecting causal effects by explaining a substantial fraction of trait variation but also potentially lead to increased understanding about the genetics of the underlying traits or diseases [3]. The dramatic increase in genetic discoveries involving complex diseases has provided great opportunities for statisticians to contribute critical concepts and methods to the field [2,3]. However, analysis and interpretation of genetic association studies for jointly handling multiple genetic and environmental variables involves interesting statistical challenges [4-6]. First, with multiple SNPs and environmental factors, there are many possible main effects and interactions, most of which are likely to be null or at least negligible, leading to high-dimensional sparse models [7]. In addition, there are many more possible interactions than main effects, requiring different modeling and parameterization for main effects and interactions [2]. Second, genetic association studies usually genotype SNPs with strong linkage disequilibrium (LD), introducing highly correlated variables [7]. Third, SNP data often include genotypes with low frequencies that create predictors with near-zero variation [7]. Finally, separation, which arises when a predictor or a linear combination of predictors is completely aligned with the outcome, is a common phenomenon in case–control genetic association studies [7-9]. These complications result in challenges in terms of statistical modeling and computation and thus sophisticated techniques are required to handle them.

It is well known that complex diseases are caused by a network of interacting factors and therefore statistically it makes more sense to consider multiple predictors (genetic variants and environmental covariates) simultaneously than to consider a single SNP (singlenucleotide polymorphism) [1,2]. Such joint analyses not only improve the power for detecting causal effects by explaining a substantial fraction of trait variation but also potentially lead to increased understanding about the genetics of the underlying traits or diseases [3]. The dramatic increase in genetic discoveries involving complex diseases has provided great opportunities for statisticians to contribute critical concepts and methods to the field [2,3]. However, analysis and interpretation of genetic association studies for jointly handling multiple genetic and environmental variables involves interesting statistical challenges [4][5][6]. First, with multiple SNPs and environmental factors, there are many possible main effects and interactions, most of which are likely to be null or at least negligible, leading to high-dimensional sparse models [7]. In addition, there are many more possible interactions than main effects, requiring different modeling and parameterization for main effects and interactions [2]. Second, genetic association studies usually genotype SNPs with strong linkage disequilibrium (LD), introducing highly correlated variables [7]. Third, SNP data often include genotypes with low frequencies that create predictors with near-zero variation [7]. Finally, separation, which arises when a predictor or a linear combination of predictors is completely aligned with the outcome, is a common phenomenon in case-control genetic association studies [7][8][9]. These complications result in challenges in terms of statistical modeling and computation and thus sophisticated techniques are required to handle them.

Why Hierarchical Models?
Recent years have seen an affluence of new methods for genetic and genomic research. As Allison et al. [10] states, "Many of these are wonderfully creative and useful". Among them, a potentially attractive approach, which has revolutionized modern genetic research, is Bayesian hierarchical modeling. Many papers have shown the practical and theoretical advantages of using Bayesian methods for genetic association studies [6]. Hierarchical models, which are more easily interpreted and handled in the Bayesian framework, are no exceptions. They provide a rational and quantitative way to incorporate biological information facilitating fitting of a large number of variables and a range of possible genetic models in a single analysis [6]. Thus, using appropriate prior information on the coefficients hierarchical models attempt to solve the aforementioned problems by providing stable, regularized estimates unlike non-hierarchical models, which generally cannot handle many variables simultaneously and often tend to overfit [2].

Continuous shrinkage priors
For genetic models with a large number of potential genetic variants and environmental covariates, it is reasonable to assume that most of the variables have null or weak effects on the phenotype, whereas only a few have noticeable effects [11]. Bayesian hierarchical models incorporate this idea by setting up shrinkage prior distributions on the coefficients that give each effect a high probability of being near zero. The shrinkage prior distributions usually have very heavy tails, which enable strong shrinkage of small coefficients while minimally shrinking large coefficients. A variety of continuous shrinkage priors have been proposed in literature and many of them have been adopted to QTL mapping and genetic association analysis [7,12]. Two most commonly used continuous shrinkage priors are Student's t-distribution and the double exponential distribution. With these shrinkage priors, the posterior mode estimates of the coefficients are the ridge-penalized estimate [13] and the lasso-penalized estimate [14] respectively.

Scale mixtures of normals
Both the double exponential distribution and the Student's t-distribution can be presented as a two level hierarchical model [12,15]. The first level assumes that the coefficients follow independent normal distributions with mean zero and unknown variances whereas the second level assigns independent prior distributions on the variances which themselves depend on the hyperparameters. The twolevel hierarchical formulation has several advantages. First, it allows easy and efficient computation facilitating MCMC and EM algorithms; conditional on the variances the coefficients can be easily estimated [2]. Secondly, it offers easy interpretation of the model; the coefficientspecific variances result in different shrinkage amounts for different coefficients [7]. Thirdly, it is flexible enough to encompass most popular penalized regression procedures and new hierarchical models can be defined by using new priors for the variances or further modeling the hyperparameters [16].

Sampling from continuous posterior distribution
On the other hand, taking advantage of the scale mixture representation of the shrinkage priors, various MCMC algorithms have been developed, many of which have recently been adapted to multiple QTL mapping and genetic association analysis [1,12,16,[21][22][23][24][25][26][27]. Since all the priors for regression coefficients are conditionally Normal, a simple and unified scheme can be developed to update the coefficients regardless of the specific prior distributions on the variances [2]. For the Student-t and double exponential priors, the conditional posterior distributions of the variances have standard forms and thus can be easily sampled using MCMC [2,12]. Also, the hyperparameters can be assigned appropriate hyperpriors so that they can be updated along with other parameters or estimated based on empirical Bayes using marginal maximum likelihood [16]. These methods can simultaneously fit many correlated variables and can distinguish important effects from a large number of correlated variables [2,7].

Software Implementation
The above-mentioned algorithms for fitting Bayesian hierarchical models for genetic association studies can be carried out in a freely available R package BhGLM, which can be downloaded at http:// www.ssg.uab.edu/bhglm/. The package provides a unified framework for setting up and fitting Bayesian hierarchical GLMs, for numerically and graphically displaying the results, with special emphasis on genetic association studies and QTL mapping. The functions in BhGLM are particularly useful for complicated genetic data analysis, e.g., QTL mapping in experimental crosses, genetic association studies for rare and common variants, prediction of complex diseases and traits, geneset and pathway analysis, haplotype association analysis, gene-gene (GXG) and gene-environment (GXE) interactions, etc. Moreover, the methods can be used for general data analysis as well as for analyzing high-dimensional and correlated data arising from other disciplines. Some important functions in the package include but not limited to bglm (for fitting Bayesian hierarchical GLMs), bglm.ex (extensions of Bayesian hierarchical GLMs), bglm.selection (variable selection for Bayesian hierarchical GLMs), bpolr (Bayesian hierarchical ordered logistic or probit regressions for ordinal response), make.haplo (for creating a design matrix for haplotypes), make.inter (for making design matrix of interactions (GXG and GXE)), make. main (to make design matrix of main effects from genotypic data of genetic markers), plot. bglm (for graphically summarizing Bayesian hierarchical GLMs fits), predict.bglm (to make predictions for Bayesian hierarchical GLMs), summary.bglm (to summarize Bayesian hierarchical GLMs fits), etc. Other functions and additional details can be found at http://www.ssg. uab.edu/bhglm/.

Concluding Remarks
Although we have discussed continuous shrinkage priors due to their natural computational advantages, other shrinkage priors such as the spike and slab priors (discrete, two-component mixture distributions) [28] and Bayesian non-parametric hierarchical methods based on Dirichlet and other priors [29][30][31][32] are also common. Other continuous shrinkage priors include: the horseshoe prior [33]; the generalized double-Pareto model [34]; the orthant normal prior [22]; the mixture of uniform prior [35]; the Laplace-Poisson model [36] and the hypergeometric inverted-beta model [37]. The Bayesian hierarchical models discussed above can simultaneously analyse many covariates, main effects of numerous loci, epistatic and GXE interactions. These methods can return not only point estimates but also interval estimates of all the parameters, and offers a natural means of assessing model uncertainty [2]. For large-scale genetic data, a preliminary analysis is recommended to weed out unnecessary variables, or a variable selection procedure should be assessed to build a parsimonious model that only includes the most important predictors [21]. The only disadvantage of the Bayesian hierarchical approach is the intensive computation, which becomes more rigorous for high dimensional genetic data. Nevertheless, Bayesian hierarchical models attempt to overcome the associated challenges in high dimensional genetic data analysis by simultaneous variable selection and reliable coefficient estimation [16,26]. As noted by Zhou [3], statistical methods must evolve naturally to meet the challenges of sequence data and to resolve the missing dark matter of genetic epidemiology. Therefore, more coherent and sophisticated approaches based on hierarchical models that can combine external biological information and results across studies (meta analysis), across SNPs in genes and/or across gene pathways will further improve our understanding. standard iterative weighted least squares (IWLS) algorithm for fitting classical generalized linear models [8,9]. This computational strategy takes advantage of the standard algorithm and leads to a stable, flexible and easily implementable computational tool [2,7]. The hierarchical generalized linear models with Student-t priors on the coefficients include various methods as special cases that have been designed to handle problems encountered in interacting QTL and association studies and therefore, can flexibly deal with various types of continuous and discrete phenotypes [2].