Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
Received date: April 14, 2013; Accepted date: May 28, 2013; Published date: June 01, 2013
Citation: Mallick H, Yi N (2013) Bayesian Methods for High Dimensional Linear Models. J Biomet Biostat S1:005. doi: 10.4172/2155-6180.S1-005
Copyright: © 2013 Mallick H, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
In this article, we present a selective overview of some recent developments in Bayesian model and variable selection methods for high dimensional linear models. While most of the reviews in literature are based on conventional methods, we focus on recently developed methods, which have proven to be successful in dealing with high dimensional variable selection. First, we give a brief overview of the traditional model selection methods (viz. Mallow’s Cp, AIC, BIC, DIC), followed by a discussion on some recently developed methods (viz. EBIC, regularization), which have occupied the minds of many statisticians. Then, we review high dimensional Bayesian methods with a particular emphasis on Bayesian regularization methods, which have been used extensively in recent years. We conclude by briefly addressing the asymptotic behaviors of Bayesian variable selection methods for high dimensional linear models under different regularity conditions.
Bayesian hierarchical models; Penalized regression; Regularization; Shrinkage methods; Bayesian variable selection; High dimensional linear models; Bayesian model selection; MCMC; Posterior consistency; Nonlocal priors; Bayesian subset regression
Linear models are probably the most widely used statistical models to investigate the influence of a set of predictors on a response variable. In practice, only a small subset of potential covariates actually has an influence on the response variable, whereas the effect of most predictors is very small or even zero. Since model misspecification can have a significant impact on a scientific result, correctly identifying relevant variables is an important issue in any scientific research. If more predictors are included in the model, a higher proportion of the response variability can be explained. On the other hand, overfitting (inclusion of predictors with null effects) results in a less reliable model with poor predictive performance. The problem of variable selection becomes more challenging for high dimensional problems, particularly when the number of predictors greatly exceeds the number of observations. High dimensional problems arise in a variety of scientific fields, such as bioinformatics, medicine and genetics. High dimensionality can lead to very complex models, which pose serious challenges in estimation, prediction and interpretation. Therefore, many classical approaches to variable selection cease to be useful, due to computational infeasibility, model non-identifiability, or both.
Various methods have been developed over the years for dealing with variable selection in high dimensional linear models. Very recently, much work has been done within the Bayesian framework. Unlike non-Bayesian methods, Bayesian analysis enables one to deal with model uncertainty by averaging over all possible models. Moreover, Bayesian methods can significantly reduce the actual complexity involved in the estimation procedure by incorporating prior information into the model estimation technique. With ever-increasing computing power, these methods are becoming increasingly popular and are receiving more and more consideration in high dimensional analysis.
In this review, we present a selective overview of some recent developments in Bayesian model and variable selection methods for high dimensional linear models. While most of the reviews in the literature are based on conventional methods, we focus on recently developed methods, which have been used extensively over the years. The review is structured as follows. First, we give a brief overview of the traditional, commonly used model selection methods, followed by a discussion of some recently developed approaches for linear models, which have proven to be successful in dealing with high dimensional variable selection. Then, we mention some conventional Bayesian variable and model selection methods, along with some recently developed Bayesian approaches, with a particular emphasis on Bayesian regularization methods. We conclude by briefly addressing the asymptotic behaviors of Bayesian high dimensional variable selection methods for linear models under different regularity conditions.
In statistical tradition, commonly used methods for model selection are backward, forward and stepwise selection, where in every step predictors are added to or eliminated from the model according to a precisely defined testing rule. Besides accurate prediction, a primary goal in model selection is to arrive at a meaningful, interpretable and parsimonious model. Traditional methods such as stepwise regression fall short in one or more of these criteria. To overcome these shortcomings, several information-type methods have been developed, which aim to provide a trade-off between model complexity and goodness-of-fit of a model. We describe the standard normal linear model and the Bayesian hierarchical normal linear model in later subsections. Then, we briefly review four commonly used methods, viz. Mallow’s Cp [1], AIC [2] and BIC [3] for standard normal linear models, and DIC [4] for Bayesian hierarchical normal linear models, which remain the most popular model selection methods for linear models and hierarchical linear models, respectively.
Standard normal linear model
In the simplest case of a normal linear regression model, it is assumed that the mean of the response variable can be described as a linear function of a set of predictors. Mathematically, in a normal linear regression setup, we have the following model
y = Xβ + ε    (1)
where y is the n×1 vector of centered responses; X is the n× p matrix of standardized regressors; β is the p×1 vector of coefficients to be estimated and ε is the n×1 vector of independent and identically distributed normal errors with mean 0 and variance σ^{2} . The classical estimator in linear regression is the Ordinary Least Squares (OLS) estimator which is obtained by minimizing the residual sum of squares (RSS) given by
RSS(β) = (y − Xβ)^{T}(y − Xβ)    (2)
The OLS estimator has some attractive statistical properties, in the sense that it is the best linear unbiased estimator (BLUE), provided the classical assumptions hold [5]. Also, it coincides with the maximum likelihood estimator (MLE) for normal linear models.
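As a quick illustration, the OLS estimator can be computed directly by minimizing the RSS; the data below are simulated purely for illustration (the variable names are not from the paper):

```python
import numpy as np

# Simulated data for illustration: y = X beta + eps with two nonzero effects.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, 0.0, -1.0])
y = X @ beta_true + rng.standard_normal(n)

# OLS minimizes the residual sum of squares ||y - X beta||^2.
beta_ols, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
```

With n well above p and independent predictors, the OLS estimate recovers the true coefficients up to sampling noise.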
Bayesian hierarchical normal linear model
In the Bayesian hierarchical normal linear model, we assume that the distribution of the dependent variable y is specified conditional on the parameters β and σ^{2} as
y | β, σ^{2} ~ N_{n}(Xβ, σ^{2}I_{n})    (3)
Then, any prior information on (β, σ^{2}) is incorporated by specifying a suitable prior distribution p(β, σ^{2}). This second-level model has its own parameters, known as hyperparameters, which are usually estimated from the data. After observing the data, the prior distribution p(β, σ^{2}) is updated to the corresponding posterior distribution, which is obtained as
p(β, σ^{2} | y) = p(y | β, σ^{2}) p(β, σ^{2}) / ∫ p(y | β, σ^{2}) p(β, σ^{2}) dβ dσ^{2} ∝ p(y | β, σ^{2}) p(β, σ^{2})    (4)
The posterior distribution contains all the current information about the parameters. Ideally one might fully explore the entire posterior distribution by sampling from the distribution, using Markov chain Monte Carlo (MCMC) algorithms [6]. Due to its ability to incorporate specific hierarchical structure of the data (correlation among the predictors), hierarchical modeling is often more efficient than traditional approaches [7].
Mallow’s C_{p}
For a subset model with k ≤ p explanatory variables, the C_{p} statistic proposed by Mallows [1], is defined as
C_{p} = RSS(k)/s^{2} − n + 2k    (5)
where s^{2} is the MSE for the full model containing p explanatory variables and RSS(k) is the residual sum of squares for the subset model containing k explanatory variables. In practice, C_{p} is usually plotted against k for a collection of subset models under consideration, and models with C_{p} approximately equal to k are taken as acceptable models, in the sense of minimizing the total bias of the predicted values [5]. Woodroofe [8] showed that C_{p} is a conservative model selector, which tends to overfit. Nishii [9] showed that C_{p} is not consistent in selecting the true model, and often tends to select a larger model as n → ∞ .
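A sketch of how C_p can be computed over all candidate subsets; the data and variable names below are illustrative, with only the first two predictors carrying real effects:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p = 60, 4
X = rng.standard_normal((n, p))
y = 1.5 * X[:, 0] - X[:, 1] + rng.standard_normal(n)  # two real effects

def rss(cols):
    # Residual sum of squares of the OLS fit on the given columns
    Xs = X[:, list(cols)]
    b, _, _, _ = np.linalg.lstsq(Xs, y, rcond=None)
    return float(np.sum((y - Xs @ b) ** 2))

s2 = rss(range(p)) / (n - p)  # MSE of the full model

# C_p = RSS(k)/s^2 - n + 2k for every nonempty subset of the p predictors
cp = {}
for k in range(1, p + 1):
    for idx in combinations(range(p), k):
        cp[idx] = rss(idx) / s2 - n + 2 * k
```

Note that the full model always has C_p exactly equal to p, so the statistic is informative only for comparing smaller subsets against it.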
Akaike’s Information Criterion (AIC)
The AIC of Akaike [2] is defined as
AIC = −2 log L + 2k    (6)
where L is the likelihood function evaluated at the MLE and k is the number of estimated parameters. Given a set of candidate models, the ‘best’ model is the one with the minimum AIC value. Similar to Mallow’s C_{p}, AIC is not model selection consistent [9]. Here, the consistency of a model selection criterion means that the probability of the selected model being equal to the true model converges to 1. More information about AIC can be found in Burnham and Anderson [10]. The asymptotic approximation on which the AIC is based is rather poor when n is small [11]. Therefore, Hurvich and Tsai [12] proposed a small-sample correction, leading to the AIC_{c} statistic defined by
AIC_{c} = AIC + 2k(k+1)/(n − k − 1)    (7)
AIC_{c} converges to AIC as n grows, and therefore AIC_{c} can be used in place of AIC regardless of the sample size [11].
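For a Gaussian linear model, −2 log L at the MLE reduces to n log(RSS/n) plus a constant that is the same for all candidate models, so AIC and AIC_c can be computed directly from the residual sum of squares; a minimal sketch (the numeric inputs are made up):

```python
import numpy as np

def aic_linear(rss, n, k):
    # -2 log L = n log(RSS/n) + const for a Gaussian linear model; the
    # constant does not depend on the model and is dropped here.
    return n * np.log(rss / n) + 2 * k

def aicc_linear(rss, n, k):
    # AICc = AIC + 2k(k+1)/(n - k - 1), the small-sample correction (eq. 7)
    return aic_linear(rss, n, k) + 2 * k * (k + 1) / (n - k - 1)

# The correction matters for small n and vanishes as n grows.
gap_small = aicc_linear(5.0, 20, 3) - aic_linear(5.0, 20, 3)
gap_large = aicc_linear(5.0, 10000, 3) - aic_linear(5.0, 10000, 3)
```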
Bayesian Information Criterion (BIC)
While AIC is motivated by the Kullback-Leibler discrepancy of the fitted model from the true model, Schwarz [3] derived BIC from a Bayesian perspective by evaluating the leading terms of the asymptotic expansion of the Bayes factor. The BIC is defined as
BIC = −2 log L + k log n    (8)
where L is the likelihood function evaluated at the MLE and k is the number of estimated parameters. Similar to AIC, the model with minimum BIC is chosen as the preferred model from a set of candidate models. It is well known that neither AIC nor BIC performs better all the time. However, unlike AIC, BIC is a consistent model selection technique, which means that, as the sample size n grows, the probability that the lowest BIC model is the true model converges to 1 [10,11]. For a comparison of AIC and BIC, refer to Kundu and Murali [13] or Yang [14].
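A small simulated comparison of AIC and BIC over all subsets; the data are illustrative, with only the first predictor carrying a real effect, and BIC's heavier log(n) penalty typically favors the smaller model:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.standard_normal((n, p))
y = 1.5 * X[:, 0] + rng.standard_normal(n)  # one real effect

def criterion(cols, penalty):
    # n log(RSS/n) + penalty * k: penalty = 2 gives AIC, log(n) gives BIC
    Xs = X[:, list(cols)]
    b, _, _, _ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ b) ** 2)
    return n * np.log(rss / n) + penalty * len(cols)

subsets = [s for k in range(1, p + 1) for s in combinations(range(p), k)]
best_aic = min(subsets, key=lambda s: criterion(s, 2.0))
best_bic = min(subsets, key=lambda s: criterion(s, np.log(n)))
```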
Deviance Information Criterion (DIC)
For model selection in Bayesian hierarchical normal linear models, Spiegelhalter et al. [4] proposed DIC, a generalization of AIC and BIC, defined as
DIC = D(θ̄) + 2p_{D}    (9)
where D(θ̄) = −2 log L is the deviance evaluated at the posterior mean of the parameters, and p_{D} is the effective number of parameters, calculated as the difference between the posterior mean deviance and the deviance at the posterior mean. As with AIC and BIC, models with smaller DIC are better supported by the data. DIC is particularly useful when MCMC samples are easily available, and is valid only when the joint distribution of the parameters is approximately multivariate normal [4]. DIC tends to select over-fitted models, which has been addressed by Ando [15], although very little is known about its performance in high dimensional models. As noted by Gelman et al. [16], various other difficulties (apart from overfitting) have been noted with DIC, but there has been no consensus on an alternative.
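Given MCMC output, DIC is a one-line computation from the sampled deviances; the deviance values below are made-up numbers used only to exercise the formula:

```python
import numpy as np

def dic(deviance_samples, deviance_at_posterior_mean):
    # p_D = (posterior mean deviance) - (deviance at the posterior mean);
    # DIC = D(theta_bar) + 2 p_D, equivalently Dbar + p_D.
    d_bar = float(np.mean(deviance_samples))
    p_d = d_bar - deviance_at_posterior_mean
    return deviance_at_posterior_mean + 2.0 * p_d, p_d

# Illustrative numbers: Dbar = 12 and D(theta_bar) = 9, so p_D = 3, DIC = 15.
dic_value, p_d = dic([10.0, 12.0, 14.0], 9.0)
```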
In the setting of a linear regression model, if the number of covariates p is of polynomial or exponential order of the sample size n, i.e., p = O(n^{κ}) or p = O(exp(n^{κ})) for some κ > 0, then the problem is called high dimensional [17]. Despite their popularity, classical model selection criteria such as Mallow’s C_{p}, AIC and BIC tend to select more variables than necessary for high dimensional linear models, especially when the number of regressors increases with the sample size [18,19]. Also, Yang and Barron [20] argue that in some cases the overfitting problem can be substantial, resulting in severe selection bias, which damages predictive performance for high dimensional models. To overcome this, various model selection criteria for high dimensional models have been introduced recently. Wang et al. [21] proposed a modified BIC (mBIC), which is consistent when p diverges slower than n. Chen and Chen [22,23] developed a family of extended Bayesian information criteria (EBIC) for variable selection in high dimensional problems. On the other hand, a large amount of effort has gone into the development of regularization methods for simultaneous variable selection and coefficient estimation. Regularization methods mitigate modeling biases and achieve higher prediction accuracy in high dimensional linear models by shrinking the coefficients and providing meaningful estimates, even if the model includes a large number of, and/or highly correlated, predictors. We describe the extended Bayesian information criteria and regularization methods in the following subsections, before gradually moving to Bayesian methods for high dimensional linear models.
Extended Bayesian Information Criteria (EBIC)
The family of EBIC is indexed by a parameter γ in the range [0,1]. The extended BIC (EBIC) is defined as
EBIC_{γ} = −2 log L + k log n + 2γ log C(p, k)    (10)
where L is the likelihood function evaluated at the MLE, C(p, k) is the binomial coefficient (the number of candidate models with k of the p covariates), and γ ∈ [0,1] is a tuning parameter. The original BIC is a special case of EBIC with γ = 0. The mBIC is also a special case of EBIC in an asymptotic sense, i.e. it is asymptotically equivalent to EBIC with γ = 1. Chen and Chen [23] established the model selection consistency of EBIC when p = O(n^{κ}) for any κ > 0, where consistency implies that, as n → ∞, the minimum EBIC model converges in probability to the ‘true’ model. Among other developments, the Generalized Information Criterion (GIC) proposed by Shao [24] is known to be consistent in high dimensions. Kim et al. [25] showed that EBIC is asymptotically equivalent to GIC.
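A sketch of EBIC for a Gaussian linear model, again using the identity −2 log L = n log(RSS/n) up to a model-independent constant; the log binomial coefficient is computed via log-gamma so it stays stable for very large p:

```python
import numpy as np
from math import lgamma

def log_binom(p, k):
    # log C(p, k) via log-gamma, stable even when p is huge
    return lgamma(p + 1) - lgamma(k + 1) - lgamma(p - k + 1)

def ebic(rss, n, p, k, gamma):
    # EBIC = n log(RSS/n) + k log(n) + 2 gamma log C(p, k); gamma = 0 is BIC
    return n * np.log(rss / n) + k * np.log(n) + 2.0 * gamma * log_binom(p, k)
```

Setting gamma to 0 recovers the ordinary BIC value, while larger gamma increasingly penalizes models drawn from a large pool of candidates.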
Regularization methods
Similar to information-type methods, various regularization methods have been developed to overcome the problem of overfitting in high dimensional linear models. It is well known that OLS often does poorly in both prediction and interpretation in high dimensions. Despite its nice statistical properties, it is highly unstable in the presence of multicollinearity. Also, if p >> n, it produces a non-unique estimator, since X is not of full column rank (non-identifiability). Motivated by this, regularization methods (also known as penalized likelihood or shrinkage methods) with various penalties have been developed, which have proven to be successful and model selection consistent for high dimensional linear models [26,27].
The problem of interest involves estimating a sparse vector of regression coefficients by minimizing an objective function Q composed of a loss function plus a penalty function P. Without loss of generality, we consider the most commonly used least squares loss (RSS), although least absolute deviation and the negative log-likelihood are also common, i.e.
Q(β) = (y − Xβ)^{T}(y − Xβ) + P_{λ}(β)    (11)
where P is a function of the coefficients indexed by a parameter λ > 0 , which controls the degree of penalization. Typically, the penalty function P has the following properties [28]:
1. It is symmetric about the origin, with P_{λ}(0) = 0;
2. It is non-decreasing in |β_{j}| on (0, ∞).
This approach produces a spectrum of solutions depending on the value of λ. Such methods are often referred to as regularization methods, and λ is called the regularization parameter (or tuning parameter). The penalty function serves to control the complexity of the model and provides criteria for variable selection and model comparison by imposing constraints on the parameters. The form of P_{λ}(β) determines the general behavior of regularization methods. Small penalties lead to large models with limited bias but potentially high variance; large penalties lead to the selection of models with fewer predictors but less variance. A variety of penalty terms have been proposed, among which the most popular are ridge regression, the lasso and the elastic net. We summarize these methods in Table 1. For a more comprehensive review of regularization methods, refer to Bickel and Li [29].
Method | Tuning parameter(s) | Penalty function
---|---|---
Lasso | λ | λ Σ_{j=1}^{p} |β_{j}|
Ridge | λ | λ Σ_{j=1}^{p} β_{j}^{2}
Bridge | λ | λ Σ_{j=1}^{p} |β_{j}|^{α}
Adaptive Lasso | λ_{1}, …, λ_{p} | Σ_{j=1}^{p} λ_{j}|β_{j}|
Elastic Net | λ_{1}, λ_{2} | λ_{1} Σ_{j=1}^{p} |β_{j}| + λ_{2} Σ_{j=1}^{p} β_{j}^{2}
Group Lasso | λ | λ Σ_{k=1}^{K} ‖β_{k}‖_{2}
Table 1: Different penalty functions.
The bridge family: Regularization methods date back to the proposal of ridge regression by Hoerl and Kennard [30], who suggested minimizing the following objective function:
‖y − Xβ‖^{2} + λ Σ_{j=1}^{p} β_{j}^{2}    (12)
As a continuous shrinkage method, ridge regression achieves better prediction performance than OLS through a bias-variance tradeoff (biased estimates with lower variance). However, ridge regression cannot produce a parsimonious model, as it always keeps all the predictors in the model.
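Unlike OLS, ridge regression has a unique closed-form solution even when p > n; a minimal sketch on illustrative random data:

```python
import numpy as np

def ridge(X, y, lam):
    # Closed form: beta = (X'X + lam I)^(-1) X'y; adding lam I makes the
    # system nonsingular even when X has more columns than rows.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(3)
X = rng.standard_normal((10, 20))  # p = 20 > n = 10: OLS is not unique here
y = rng.standard_normal(10)
b_small_lam = ridge(X, y, 0.1)
b_large_lam = ridge(X, y, 10.0)
```

Increasing λ shrinks the coefficient vector toward zero, but no coefficient is set exactly to zero, which is why ridge keeps all predictors in the model.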
Frank and Friedman [31] introduced bridge regression, a broad class of penalized regression, which is obtained by minimizing
‖y − Xβ‖^{2} + λ Σ_{j=1}^{p} |β_{j}|^{α}    (13)
where λ is the tuning parameter and α is the concavity parameter, so called because it controls the concavity of the objective function (13). It includes the lasso and ridge as special cases (corresponding to α = 1 and α = 2, respectively). Although Frank and Friedman [31] did not solve for the bridge estimator for any given α ≥ 0, they indicated that an optimal choice of α would yield a reasonable predictor. The bridge estimator performs variable selection when α ≤ 1 and shrinks the coefficients when α > 1 [32]. These three estimators, viz. ridge, lasso and bridge, are together referred to as the bridge family.
Among penalized regression techniques, the most popular and widely used method in statistical literature is the Least Absolute Shrinkage and Selection Operator (LASSO). The lasso of Tibshirani [33] is obtained by minimizing
‖y − Xβ‖^{2} + λ Σ_{j=1}^{p} |β_{j}|    (14)
Compared to ridge regression, a remarkable property of the lasso is that it can shrink some coefficients exactly to zero, and therefore can automatically achieve variable selection. Intuitively, this can be explained by the fact that |β_{j}| is much larger than β_{j}^{2} for small β_{j}, and thus the L_{1}-penalty forces some β_{j}'s exactly to zero. Figure 1 shows the behavior of these three penalty functions in a two-parameter case, β_{1} and β_{2}. To obtain the regularized estimators, we essentially seek the points at which the objective function contour first “hits” the constraint. The lasso, ridge and bridge penalty functions have constraints shaped like a diamond, circle and star, respectively. As a consequence of the different shapes, lasso is likely to involve variable selection (β_{1} = 0 or β_{2} = 0) as well as parameter estimate shrinkage, while ridge yields mainly parameter estimate shrinkage; in contrast, bridge induces an even higher chance of variable selection than lasso, because the star shape makes the contour even more likely to hit one of the points (β_{1} = 0 or β_{2} = 0) than does the diamond shape of lasso.
Figure 1: A graphical illustration of the properties of three different penalty functions. The ellipses represent the contours of the objective functions (A–C). The diamond, round, and star shapes represent the lasso, ridge, and bridge constraint, respectively. The dots are the points where the contours are “tangent” to the constraints, i.e., the regularized estimates. Note that, in lasso (A) or bridge (C), the constraint boundary is non-differentiable at the axes. If the contour first touches the constraint at such a point, the corresponding parameter estimate is zero and variable selection is achieved [34].
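The zero-producing behavior pictured above comes from the soft-thresholding operator inside the lasso's coordinate-wise update; a naive coordinate descent sketch on simulated data (only the first predictor is active, and all names are illustrative):

```python
import numpy as np

def soft_threshold(z, t):
    # Shrinks z toward zero, and returns exactly 0 whenever |z| <= t
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Naive coordinate descent for (1/2)||y - Xb||^2 + lam * sum_j |b_j|
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]   # partial residual excluding j
            b[j] = soft_threshold(X[:, j] @ r, lam) / (X[:, j] @ X[:, j])
    return b

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] + rng.standard_normal(n)
b_lasso = lasso_cd(X, y, lam=60.0)
```

With a sufficiently large λ the irrelevant coefficients come out exactly zero, whereas a ridge fit on the same data would merely make them small.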
Some generalizations: The lasso has demonstrated excellent performance in many situations. As a consequence, most of the developments in recent years have focused on the lasso and related problems. However, despite its promising nature, there are three inherent drawbacks of the lasso [35]. Firstly, due to the nature of the convex optimization problem, the lasso method cannot select more predictors than the sample size. But, in practice, there are often studies that involve many more predictors than the sample size, e.g. microarray gene expression data analysis, cloud detection through analysis of satellite images, classification of spam emails, and many others. Secondly, when there is some group structure among the predictors, the lasso estimator usually selects only one predictor from a group, while ignoring the others. Thirdly, when the predictors are highly correlated, the lasso performs unsatisfactorily. We discuss some of the alternatives and generalizations that have been proposed to overcome the above limitations of the lasso.
The lasso uses a single tuning parameter λ to penalize all the coefficients equally. In practice, due to the single tuning parameter, the lasso can either include irrelevant variables or over-shrink large coefficients [36], which is critical for high dimensional problems. To address the issue, Zou [37] introduced the adaptive lasso that uses a weighted L_{1}-penalty
P(β) = Σ_{j=1}^{p} λ_{j}|β_{j}|    (15)
where λ_{j} is the tuning parameter corresponding to the j^{th} coefficient β_{j}, j = 1, …, p. The intuition behind the adaptive lasso is to shrink coefficients differently, shrinking important variables slightly and unimportant variables heavily.
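A two-step sketch of this idea: build the weights from an initial OLS fit (one common choice), then run coordinate descent with coefficient-specific thresholds λ_j = λ/|β̂_j|^γ; the data are simulated for illustration:

```python
import numpy as np

def adaptive_lasso_cd(X, y, lam, gamma=1.0, n_iter=200):
    # Weighted L1: minimize (1/2)||y - Xb||^2 + sum_j lam_j |b_j|, where
    # lam_j = lam / |b_init_j|^gamma; large initial estimates are penalized
    # lightly and small ones heavily.
    n, p = X.shape
    b_init, _, _, _ = np.linalg.lstsq(X, y, rcond=None)  # initial OLS fit
    lam_j = lam / (np.abs(b_init) ** gamma + 1e-8)
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]
            z = X[:, j] @ r
            b[j] = np.sign(z) * max(abs(z) - lam_j[j], 0.0) / (X[:, j] @ X[:, j])
    return b

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] + rng.standard_normal(n)
b_adapt = adaptive_lasso_cd(X, y, lam=20.0)
```

Because the large coefficient gets only a small penalty, it is barely shrunk, while the noise coefficients face huge thresholds and are driven to zero.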
To address the issue of variable selection for grouped variables, Yuan and Lin [38] proposed the group lasso estimator, in which the penalty is given by
P(β) = λ Σ_{k=1}^{K} ‖β_{k}‖_{2}    (16)
where K is the number of groups, β_{k} = (β_{k1}, …, β_{km_{k}})^{T} is the coefficient vector of the k^{th} group of size m_{k}, and β_{kj} is the coefficient corresponding to the j^{th} predictor in the k^{th} group, j = 1, …, m_{k}, k = 1, …, K. With an appropriately chosen tuning parameter, the group lasso can shrink all coefficients in a group to zero, or keep all coefficients in a group in the model.
Zou and Hastie [35] proposed the elastic net estimator to achieve improved performance in situations when there is multicollinearity and grouping among predictors. The penalty term in elastic net is a convex combination of the lasso penalty and the ridge penalty, i.e.
P(β) = λ_{1} Σ_{j=1}^{p} |β_{j}| + λ_{2} Σ_{j=1}^{p} β_{j}^{2}    (17)
where λ_{1} > 0 , λ_{2} > 0 are two tuning parameters.
The elastic net estimator can be interpreted as a stabilized version of the lasso, and it often outperforms the lasso, especially when there are groups of highly correlated predictors.
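The grouping behavior can be seen in a small sketch: the coordinate update simply gains a ridge term in the denominator, and with two nearly identical predictors the fit splits the signal between them instead of arbitrarily picking one (data and parameter values are illustrative):

```python
import numpy as np

def enet_cd(X, y, lam1, lam2, n_iter=300):
    # Naive coordinate descent for
    # (1/2)||y - Xb||^2 + lam1 * sum_j |b_j| + (lam2/2) * sum_j b_j^2
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]
            z = X[:, j] @ r
            b[j] = np.sign(z) * max(abs(z) - lam1, 0.0) / (X[:, j] @ X[:, j] + lam2)
    return b

# Columns 0 and 1 are nearly identical; column 2 is pure noise.
rng = np.random.default_rng(5)
n = 100
x = rng.standard_normal(n)
X = np.column_stack([x, x + 0.01 * rng.standard_normal(n), rng.standard_normal(n)])
y = 2.0 * x + rng.standard_normal(n)
b_enet = enet_cd(X, y, lam1=5.0, lam2=50.0)
```

Note the sketch absorbs a factor of 1/2 into the ridge term, a common implementation convention; a plain lasso on the same data would typically assign the whole effect to one of the two twins.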
Other related regularization methods include the smoothly clipped absolute deviation (SCAD) method [28], the fused lasso [39], the adaptive elastic net [40], the minimax concave penalty (MCP) [41], the adaptive bridge estimator [32], the Dantzig selector [42], and the group bridge estimator [43].
Optimization algorithms: Various optimization algorithms have been proposed to obtain the lasso and related estimators [44]. Notably, least angle regression (LARS) [45] and the coordinate descent algorithm [46,47] are the most computationally efficient ones. Given the tuning parameters, these algorithms are extremely fast (e.g. the computational load of LARS is the same as that of a single OLS fit), making penalized regression approaches extremely popular in high dimensional data analysis.
Limitations of regularization methods: In spite of being theoretically attractive, there are at least three serious disadvantages of frequentist penalized regression approaches:
1. Penalized regression is essentially an optimization problem that only provides a point estimate of β. However, one usually also needs a measure of uncertainty for the estimates, such as a confidence interval (or credible interval) and a p-value. This problem can be addressed by applying bootstrap sampling [33], but it is computationally intensive. Moreover, Kyung et al. [48] showed that the bootstrap estimates of the standard errors of the lasso estimates can be unstable, and are not consistent when the true β_{j} = 0.
2. The penalized regression approaches need to preset the tuning parameter(s). The commonly used method is cross-validation [49], to choose ‘optimal’ values of the tuning parameters. However, cross-validation can be computationally costly; for the adaptive lasso, for example, it is very challenging to choose multiple tuning parameters. Moreover, cross-validation assesses the predictive accuracy of the model, so choosing tuning parameters by cross-validating the prediction error can perform unsatisfactorily when the main goal is to identify the relevant variables.
3. In the frequentist framework, it can be challenging to deal with complicated multilevel/hierarchical structures in the data, and to incorporate external information (for example, data with specific group-level information) into the model.
In the Bayesian framework, the model selection problem is transformed to the form of parameter estimation: rather than searching for the single optimal model, a Bayesian will attempt to estimate the posterior probability of all models, within the considered class of models (or in practice, of all models with non-negligible probability) [50]. In many cases, this question is asked in variable-specific form (variable selection): i.e. the task is to estimate the marginal posterior probability that a variable should be in the model [50]. Thus, variable selection can be considered a special case of model selection.
In this section, we review some recently developed Bayesian methods for high dimensional variable selection. There is an enormous amount of literature on Bayesian variable selection methods [50-55]. Some conventional Bayesian variable selection methods are Gibbs Variable Selection (GVS) [50,53,56], Bayesian Model Averaging (BMA) [50,54,56], Stochastic Search Variable Selection (SSVS) [52], Unconditional Priors for Variable Selection [57], Product Space Search [58], and Reversible Jump MCMC (RJMCMC) [59]. Other methods include approaches based on Zellner’s g-prior [60-62], Bayes factors [63], fractional Bayes factors [64], objective Bayes [65,66], etc. Due to space constraints, it is impossible to discuss all the available methods in this review. Since detailed and comprehensive reviews of the state-of-the-art Bayesian methods have previously appeared in the literature [50-55], here we only mention some of those approaches. Instead, we focus on popular recently developed methods, particularly Bayesian regularization methods, which have proven successful for high dimensional variable selection.
Most of the conventional Bayesian variable selection methods rely on MCMC algorithms that specify spike and slab priors on the coefficients subject to selection [50-52], requiring computation of the marginal likelihood, which is computationally intensive for high dimensional models. Also, when the marginal likelihood is not analytically tractable, posterior sampling for these methods often requires a stochastic search over an enormous space of complicated models, resulting in slow convergence and mixing [67]. On the other hand, Bayesian regularization methods specify both the spike and the slab as continuous distributions, which can be written as scale mixtures, leading to simpler MCMC algorithms with no marginal likelihood computation. Also, unlike conventional Bayesian methods, Bayesian regularization methods specify shrinkage priors, which enable simultaneous variable selection and coefficient estimation. We also discuss two new model selection approaches, viz. Bayesian model selection based on nonlocal prior densities proposed by Johnson and Rossell [67], and Bayesian subset regression (BSR) proposed by Liang et al. [68], which are shown to be model selection consistent for high dimensional linear models.
Bayesian regularization methods
Regularization methods were originally developed in the frequentist framework, where obtaining statistical inference on the regression coefficients is usually difficult and often requires various kinds of asymptotic approximations. In contrast, a Bayesian approach enables exact inference, even when the sample size is small. Regularization methods naturally lend themselves to a Bayesian interpretation, in which the penalty term in penalized regression is the negative log prior of the coefficients. Apart from their easy interpretability, Bayesian methods have some advantages over frequentist methods. Firstly, in MCMC-based Bayesian regularization methods, we have a valid measure of standard error obtained from the posterior distribution, and thus we can easily obtain interval estimates of the parameters, along with other quantities. Secondly, the Bayesian approach is more flexible, in the sense that we can estimate the tuning parameter jointly with the other parameters of interest. Thirdly, unlike in the frequentist framework, it is fairly straightforward to extend a Bayesian model to incorporate multilevel information inherent in the data. Lastly, using MCMC and the Gibbs sampler to search the model space for the most probable models, without fitting all possible models, is efficient and avoids time-consuming computation [56].
Regularization methods as hierarchical models: Recent years have seen a resurgence of interest in Bayesian hierarchical modeling techniques. This increasing popularity can be attributed to the fact that hierarchical models are more easily interpreted and handled in the Bayesian framework. Hierarchical models can significantly reduce the ‘effective’ number of parameters in a model by linking the coefficients together, or shrinking some of them. In Bayesian hierarchical models, the hyperparameters include shrinkage parameters, which control the complexity of the model, similar to the tuning parameter λ in penalized regression approaches. With the given prior distributions, the log posterior density for linear models becomes
log p(β, σ^{2}, λ | y) = log p(y | β, σ^{2}) + log p(β | λ, σ^{2}) + log p(σ^{2}) + log p(λ) + constant    (18)
where p(σ^{2}) and p(λ) are the prior distributions of σ^{2} and λ respectively, p(y | β, σ^{2}) is the likelihood function, and p(β | λ, σ^{2}) is the prior distribution of β. Note that the posterior mode (MAP) estimate can be obtained by maximizing the posterior density in equation 18. Therefore, the posterior mode estimate is equivalent to the penalized regression estimate with −log p(β | λ, σ^{2}) as the penalty. Thus, with particular priors, hierarchical models can lead to similar results as penalized regression approaches.
It is evident that the prior distribution plays an important role in Bayesian regularization methods. For models with a large number of potential variables, it is reasonable to assume that most of the variables have no or very weak effects, whereas only some have noticeable effects. Therefore, we should set up a prior distribution that gives each coefficient a high probability of being near zero. Such prior distributions are often referred to as shrinkage prior distributions.
A shrinkage prior distribution should have an infinite spike at zero and very heavy tails, thereby strongly shrinking small coefficients to zero while minimally shrinking large coefficients, and should also make it possible to incorporate the hierarchical structure of the data. The resulting hierarchical models can then effectively remove unimportant variables and reliably estimate the coefficients of important variables simultaneously [6].
Bayesian lasso: Tibshirani [33] suggested that the lasso estimates can be interpreted as posterior mode estimates when the regression parameters are assigned independent and identical Laplace priors. A remarkable feature of the double exponential distribution is that it can be represented as a two-level hierarchical model [69], namely as a scale mixture of normal distributions with independent, exponentially distributed variances τ_{j}^{2}. The two-level formulation of the double exponential distribution offers the advantages of easily interpreting the model and developing computational algorithms. The latent variables τ_{j}^{2} directly control the amount of shrinkage in the coefficient estimates: as τ_{j}^{2} → ∞, there is no shrinkage, while as τ_{j}^{2} → 0, the j^{th} coefficient is shrunk to zero. Although these latent variables are not the parameters of interest, they are useful quantities that allow easy computation [70].
Park and Casella [71] introduced Gibbs sampling for Bayesian lasso, using a conditional Laplace prior specification of the form
π(β | σ^{2}) = ∏_{j=1}^{p} (λ/(2√σ^{2})) exp(−λ|β_{j}|/√σ^{2})    (19)
and a non-informative scale-invariant marginal prior on σ^{2}, i.e. π(σ^{2}) ∝ 1/σ^{2}. They pointed out that conditioning on σ^{2} is important, as it ensures a unimodal full posterior. Lack of unimodality might slow down the convergence of the Gibbs sampler and make the point estimates less meaningful [48]. The Bayesian formulation of the original lasso, as given in Park and Casella [71], is the following hierarchical model:
y | X, β, σ^{2} ~ N_{n}(Xβ, σ^{2}I_{n}),
β | σ^{2}, τ_{1}^{2}, …, τ_{p}^{2} ~ N_{p}(0, σ^{2}D_{τ}),  D_{τ} = diag(τ_{1}^{2}, …, τ_{p}^{2}),
τ_{j}^{2} ~ Exp(λ^{2}/2), j = 1, …, p, independently    (20)
With this formulation, the posterior distribution of β is normal, while the reciprocals of the latent variables are distributed as inverse Gaussian distributions (Table 2), and the posterior distribution of σ^{2} is an inverse gamma distribution. Based on this, Park and Casella [71] formulated the Gibbs sampler for the Bayesian lasso and achieved variable selection through interval estimation. The adaptive version of the Bayesian lasso can be obtained similarly by specifying variable-specific tuning parameters (Table 2).
Table 2 summarizes, for each method (Bayesian lasso, Bayesian ridge, Bayesian adaptive lasso, Bayesian elastic net, and Bayesian group lasso with K groups), the scale mixture representation of the shrinkage prior and the corresponding posterior distributions of the latent parameters. For the Bayesian group lasso, the shrinkage prior is a multivariate scale mixture of normals with a corresponding mixing density, and the posterior distributions of the latent parameters involve diagonal matrices of latent variances; IG refers to the inverse Gaussian distribution.
Table 2: Scale mixture representation of different shrinkage priors and corresponding posterior distributions.
Bayesian ridge: The hierarchical representation of the Bayesian ridge estimator is obtained as follows:
(21)
Under this hierarchical prior, the posterior distribution of β is normal, while the reciprocal of the latent variable τ^{2} follows a χ^{2} distribution. The above Bayesian ridge regression can be extended to include variable-specific latent variables, in which case the prior distributions become
(22)
and the conditional posterior distribution of each variable-specific latent variable becomes
(23)
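As a small numerical illustration of the ridge case (a simplified setting with σ^{2} = 1 and a single prior precision k, both illustrative assumptions), the closed-form posterior mean under a N(0, σ^{2}/k) prior on each coefficient coincides with the ridge estimator: the gradient of the penalized least-squares criterion vanishes at the posterior mean.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 100, 4, 3.0   # k plays the role of the ridge tuning parameter
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + rng.normal(size=n)

# Posterior mean of beta under a N(0, (1/k) I) prior with sigma^2 = 1:
post_mean = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# It solves the ridge problem: the gradient of ||y - Xb||^2 + k||b||^2 vanishes there.
grad = -2 * X.T @ (y - X @ post_mean) + 2 * k * post_mean
print(np.allclose(grad, 0.0))  # True
```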
Bayesian group lasso: The hierarchical representation of the Bayesian group lasso estimator is obtained as follows:
(24)
Under this hierarchical prior, the posterior distribution of β_{k} is normal, while the reciprocals of the latent variables are distributed as inverse Gaussian distributions (Table 2). A non-informative scale-invariant marginal prior on σ^{2} results in an inverse gamma posterior distribution of σ^{2}.
Bayesian elastic net: Similarly, the hierarchical representation of the Bayesian elastic net estimator is obtained as follows:
(25)
Under this hierarchical prior, the posterior distribution of β is normal, while the reciprocals of the latent variables are distributed as inverse Gaussian distributions (Table 2). Here also, a non-informative scale-invariant marginal prior on σ^{2} results in an inverse gamma posterior distribution of σ^{2}. For a slightly different version of the Bayesian elastic net estimator, refer to Li and Lin [72]. Other related developments are the Bayesian bridge [73] and a different version of the Bayesian adaptive lasso [74].
Estimation of hyperparameters: In the Bayesian framework, typical approaches to estimating the tuning parameters incorporate them into the Gibbs sampler with an appropriate hyperprior [48]. Park and Casella [71] suggested using a gamma prior G(a,b), which ensures a proper posterior. The prior, placed on λ^{2} for convenience because of the way λ enters the posterior [48], is given by
π(λ^{2}) = (b^{a}/Γ(a)) (λ^{2})^{a−1} exp(−bλ^{2}),  λ^{2} > 0    (26)
For the elastic net, we have two tuning parameters, and we assign G(a_{1}, b_{1}) and G(a_{2}, b_{2}) priors for λ_{1}^{2} and λ_{2}^{2}, respectively. When the prior (Equation 26) is used in the hierarchy, the full conditional distributions of all the tuning parameters are gamma distributions, and are listed in Table 3. For Bayesian ridge, a gamma prior G(a,b) is put on s^{2}, which again results in a gamma posterior (Table 3). The tuning parameters can also be estimated through the marginal likelihood of λ, which can be implemented with an EM/Gibbs algorithm [48].
Table 3 lists, for each method (Bayesian lasso, Bayesian ridge, Bayesian adaptive lasso, Bayesian elastic net, and Bayesian group lasso with K groups), the posterior distributions of the tuning parameters and their expectations, together with the corresponding E-step and M-step updates.
Table 3: EM algorithms for various Bayesian regularization methods.
Algorithms for fitting Bayesian regularization methods: In this section, we briefly describe two popular model-fitting algorithms (viz. MCMC and EM) for estimating the parameters of Bayesian regularization methods in linear regression. For the MCMC algorithms, we only describe the hierarchical formulations and the corresponding posterior distributions. Any feature of the posterior distribution is a legitimate candidate for Bayesian inference: moments, quantiles, highest posterior density regions, etc. [19]. The posterior median is an attractive choice as a robust estimator. Hans [75] emphasized using the posterior mean as a point estimate, as it facilitates prediction of future observations via the posterior predictive distribution. However, implementation of MCMC algorithms for high dimensional problems may require excessive computing time. For practical and computational purposes, it is often desirable to have a fast algorithm that returns point estimates of the parameters and their standard errors. The EM algorithm [76] addresses this issue by providing MAP estimates with fast computation.
It must be emphasized that both non-Bayesian and Bayesian regularization methods are essentially optimization methods, with the common goal of determining the model parameters that maximize some objective function; nevertheless, a Bayesian approach can often lead to very different results than a traditional penalized regression approach [76]. Although the Gibbs samplers discussed in this paper are extremely fast [48], one should be aware of problems that can affect MCMC algorithms, e.g. slow convergence, poor mixing, etc. Therefore, once the simulation algorithm has been implemented and the simulations are drawn, it is absolutely necessary to check the convergence of the simulated sequences [6]. Inference based on the output of a slowly converging algorithm may require a very long run (thousands, millions, or more iterations). If convergence is painfully slow or not attained, one should resort to an alternative algorithm or other remedies, e.g. a longer burn-in period, thinning, etc. [6], for the inference to be valid.
MCMC algorithm
As shown above, the conditional posterior distribution of each parameter has a standard form and thus can be sampled directly. The MCMC algorithm can therefore fully explore the joint posterior distribution by sampling each parameter from its conditional posterior distribution. In summary, the MCMC algorithm proceeds as follows:
1. Initialize all the parameters with some plausible values.
2. Update β by sampling from its conditional posterior.
3. Update the latent variables and the variance parameter by sampling from their conditional posterior distributions.
4. If the hyperparameters are not prefixed, update them by sampling from their conditional posterior distributions.
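The steps above can be sketched for the Bayesian lasso, using the full conditionals summarized earlier (β normal, the reciprocals of the latent variances inverse Gaussian, σ^{2} inverse gamma, and λ^{2} gamma under a G(a,b) hyperprior). This is a minimal toy implementation on simulated data, assuming no intercept and an improper 1/σ^{2} prior; the hyperparameter values, seeds, and iteration counts are illustrative choices, not taken from the papers cited.

```python
import numpy as np

def bayesian_lasso_gibbs(X, y, n_iter=2000, a=1.0, b=1.0, seed=0):
    # Toy Gibbs sampler for the Bayesian lasso (no intercept, improper 1/sigma^2
    # prior on sigma^2, Gamma(a, b) hyperprior on lam^2); values are illustrative.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, sigma2, lam2, tau2 = np.zeros(p), 1.0, 1.0, np.ones(p)
    draws = []
    for _ in range(n_iter):
        # beta | rest ~ N(A^{-1} X'y, sigma2 A^{-1}),  A = X'X + D_tau^{-1}
        A_inv = np.linalg.inv(X.T @ X + np.diag(1.0 / tau2))
        cov = sigma2 * (A_inv + A_inv.T) / 2          # enforce exact symmetry
        beta = rng.multivariate_normal(A_inv @ X.T @ y, cov)
        # 1/tau_j^2 | rest ~ inverse Gaussian(sqrt(lam2*sigma2/beta_j^2), lam2)
        mu = np.sqrt(lam2 * sigma2 / np.maximum(beta ** 2, 1e-12))
        tau2 = 1.0 / rng.wald(mu, lam2)
        # sigma2 | rest ~ inverse gamma((n+p)/2, (||y - Xb||^2 + b'D^{-1}b)/2)
        resid = y - X @ beta
        sigma2 = (0.5 * (resid @ resid + beta @ (beta / tau2))) / rng.gamma((n + p) / 2.0)
        # lam2 | rest ~ gamma(p + a, rate = sum(tau2)/2 + b)
        lam2 = rng.gamma(p + a, 1.0 / (tau2.sum() / 2.0 + b))
        draws.append(beta.copy())
    return np.array(draws)

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 5))
true_beta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_beta + rng.normal(size=80)
post = bayesian_lasso_gibbs(X, y)[500:]   # drop burn-in draws
print(np.round(post.mean(axis=0), 1))
```

The posterior means recover the large coefficients while pulling the null coefficients toward zero.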
EM algorithm
Given the latent variables τ_{j}^{2}, the prior information on β can be incorporated in the linear model as p 'additional data-points' with value 0 (the prior means) and corresponding 'explanatory variables' equal to 0, except x_{k}, which equals 1, with residual variance depending on τ_{k}^{2} [6]. Therefore, given the latent variables, the posterior mode of (β, σ^{2}) can be obtained by performing a weighted linear regression on the augmented response vector y*, augmented design matrix X* and augmented variance-covariance matrix Σ, where
y* = (y′, 0_{p}′)′,  X* = (X′, I_{p}′)′,  Σ = diag(σ^{2}I_{n}, Σ_{β}),
and Σ_{β} is a diagonal matrix containing the prior variances. Therefore, by treating the latent variables as 'missing data' and averaging over them, we can estimate the posterior mode of (β, σ^{2}) by the EM algorithm [76]. The algorithm proceeds as follows:
1. Initialize all the parameters with some plausible values.
2. E-Step: Update the latent variables by replacing them with their conditional posterior expectations.
3. M-Step: Update (β, σ^{2}) by weighted linear regression on the augmented data as follows:
β̂ = (X*′Σ^{−1}X*)^{−1}X*′Σ^{−1}y*    (27)
4. Repeat steps 2 and 3 until convergence.
Table 3 summarizes the E and M steps for different Bayesian regularization methods. In some cases, the hyperparameters also need to be updated in the E-step, which can easily be done by plugging in their conditional posterior expectations. Other variants of the EM algorithm are proposed by Figueiredo [77] and Xu [78].
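The E and M steps can be made concrete for the Bayesian lasso. The sketch below is a deliberately simplified version with σ^{2} and λ held fixed (the full algorithm in Table 3 also updates them): the E-step replaces 1/τ_{j}^{2} by its conditional expectation λ/|β_{j}| (the mean of its inverse Gaussian conditional, with σ^{2} = 1), and the M-step is the weighted ridge-type regression on the augmented data described above. The iterations converge to the lasso posterior mode.

```python
import numpy as np

def lasso_map_em(X, y, lam, n_iter=200, eps=1e-8):
    # EM for the lasso posterior mode with sigma^2 fixed at 1 and lam fixed.
    # E-step: w_j = E[1/tau_j^2 | beta_j] = lam / |beta_j|.
    # M-step: weighted ridge-type solve, i.e. the augmented weighted regression.
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)   # ridge start
    for _ in range(n_iter):
        w = lam / np.maximum(np.abs(beta), eps)   # eps guards divide-by-zero
        beta = np.linalg.solve(X.T @ X + np.diag(w), X.T @ y)
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
true_beta = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0])
y = X @ true_beta + rng.normal(size=100)
beta_hat = lasso_map_em(X, y, lam=20.0)
print(np.round(beta_hat, 1))   # null coefficients collapse to (near) zero
```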
Extension to GLM: A generalized linear model (GLM) consists of three components: the linear predictor, the link function, and the distribution of the outcome variable [6,79]. The linear predictor can be expressed as η = Xβ; the link function g(·) relates the linear predictor η to the mean of the outcome variable y as g(E(y_{i})) = η_{i}; and the distribution of y depends on the linear predictor Xβ and generally also on a dispersion (or variance) parameter φ, i.e. y_{i} ~ p(y_{i} | η_{i}, φ), where X_{i}β = η_{i} is the linear predictor for the i^{th} observation.
The standard IRLS algorithm proceeds by approximating the generalized linear model by a weighted normal linear regression, estimating the maximum likelihood estimates of the parameters (β, φ) by weighted least squares, and iterating the process [6]. At each iteration, the algorithm calculates pseudo-data z_{i} and pseudo-variances σ_{i}^{2} for each observation based on the current parameter estimates, approximates the generalized linear model likelihood by the normal likelihood, and then updates the parameters by weighted linear regression. The iteration proceeds until convergence. The pseudo-data z_{i} and pseudo-variances σ_{i}^{2} are calculated by
z_{i} = η̂_{i} − L′(η̂_{i})/L″(η̂_{i}),  σ_{i}^{2} = −1/L″(η̂_{i})    (28)
and the estimates are obtained by weighted least squares as
β̂ = (X′WX)^{−1}X′Wz,  W = diag(1/σ_{1}^{2}, …, 1/σ_{n}^{2})    (29)
where η̂_{i} = X_{i}β̂, L is the log-likelihood, L′ and L″ are the first and second order derivatives of L with respect to η_{i}, and (β̂, φ̂) are the current estimates of (β, φ). Thus, all the methods described above can be easily extended to generalized linear models by approximating the GLM likelihood by a normal likelihood [80,81].
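As a concrete instance, the sketch below applies the pseudo-data/pseudo-variance updates to logistic regression (canonical logit link, dispersion φ = 1; the simulated data and iteration count are illustrative). For the logistic log-likelihood, L′(η_{i}) = y_{i} − μ_{i} and L″(η_{i}) = −μ_{i}(1 − μ_{i}), where μ_{i} is the fitted probability.

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    # IRLS for logistic regression via the pseudo-data trick:
    # z_i = eta_i - L'/L'' and pseudo-variance sigma_i^2 = -1/L''.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)                 # -L'' ; inverse pseudo-variance
        z = eta + (y - mu) / w              # pseudo-data
        beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
    return beta

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
eta = X @ np.array([-0.5, 1.0, 2.0])
y = (rng.random(500) < 1 / (1 + np.exp(-eta))).astype(float)
beta_hat = logistic_irls(X, y)
mu_hat = 1 / (1 + np.exp(-X @ beta_hat))
print(np.allclose(X.T @ (y - mu_hat), 0.0, atol=1e-6))  # True: score ~ 0 at the MLE
```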
Related Bayesian shrinkage methods: Bayesian regularization methods are not merely the Bayesian versions of different regularization methods; in fact, they fall into the broad class of Bayesian shrinkage methods. Apart from the methods discussed above, many other Bayesian shrinkage methods are available in the literature, including the normal/exponential-gamma model of Griffin and Brown [82]; the normal/gamma and the normal/inverse-gamma models [70,83]; the horseshoe prior of Carvalho et al. [84]; the generalized double Pareto model of Armagan et al. [85]; the orthant normal prior of Hans [86]; the mixture of uniforms prior of Knürr et al. [87]; the Laplace-Poisson model of Chen et al. [88]; and the hypergeometric inverted-beta model of Polson and Scott [89]. For the literature on EM algorithms in Bayesian shrinkage methods, refer to Gelman et al. [6], Green [90], and Polson and Scott [89].
Bayesian model selection using nonlocal priors
Regularization methods identify only one model, which maximizes a penalized likelihood function or minimizes the RSS subject to a penalty. On the other hand, most conventional Bayesian methods based on local prior densities [61-64] provide poor and unrealistic estimates of posterior model probabilities, and therefore posterior model probability estimates are often left unreported for high dimensional linear models [67]. To overcome this deficiency, Johnson and Rossell [67] recently proposed Bayesian model selection methods that specify nonlocal prior densities on the regression coefficients, which provide accurate estimates of the posterior probability that each identified model is correct. Unlike local prior densities, which are positive at the null parameter value, nonlocal prior densities are identically zero whenever a model parameter is equal to its null value [91]. Johnson and Rossell [67] introduced two classes of nonlocal priors on the coefficients, along with frequentist analogues of both, which we describe below.
Product moment (pMOM) densities: The first class of nonlocal densities is called product moment (pMOM) densities, which are defined as
π(β | τ, σ^{2}) = d_{p}(2π)^{−p/2}(τσ^{2})^{−rp−p/2}|A_{p}|^{1/2} exp(−β′A_{p}β/(2τσ^{2})) ∏_{j=1}^{p} β_{j}^{2r}    (30)
for τ > 0, with A_{p} a p × p nonsingular matrix; r = 1, 2, … is called the order of the density, and d_{p} is a normalizing constant independent of σ^{2} and τ. Similar to BIC, an objective function that might be associated with this prior can be expressed as
(31)
where L is the likelihood function and d > 0. As with BIC, the model with the minimum value of the objective function (Equation 31) should be selected as the best model.
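A quick one-dimensional sanity check of the pMOM form can be done numerically. Taking r = 1, A_{p} = 1 and p = 1 (an illustrative special case), the density reduces to β^{2}/(τσ^{2}) times a N(0, τσ^{2}) density; the τ value below is arbitrary. The check confirms the two defining features: the density vanishes at the null value and integrates to one.

```python
import numpy as np

tau, sigma2 = 0.5, 1.0      # illustrative scale values
s2 = tau * sigma2

def pmom_1d(beta):
    # first-order (r = 1) pMOM density in one dimension with A_p = 1:
    # the N(0, tau*sigma^2) density multiplied by beta^2 / (tau*sigma^2)
    normal = np.exp(-beta ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return (beta ** 2 / s2) * normal

grid = np.linspace(-10, 10, 200_001)
dx = grid[1] - grid[0]
print(pmom_1d(0.0))                                  # 0.0 -- vanishes at the null
print(round(float((pmom_1d(grid) * dx).sum()), 3))   # 1.0 -- a proper density
```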
Product inverse moment (piMOM) densities: The second class of nonlocal densities is called product inverse moment (piMOM) densities, which are defined as
π(β | τ, σ^{2}) = ((τσ^{2})^{r/2}/Γ(r/2))^{p} ∏_{j=1}^{p} |β_{j}|^{−(r+1)} exp(−τσ^{2}/β_{j}^{2})    (32)
for τ > 0 and r = 1, 2, …. The parameter τ in both the pMOM and piMOM prior densities is a scale parameter that determines the dispersion of the prior density of β around the null vector, and it should be estimated carefully for efficient computation [67]. The objective function associated with this prior can be expressed as
(33)
where L is the likelihood function and d > 0. As above, the model with the minimum value of the objective function (Equation 33) should be selected as the best model.
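An analogous one-dimensional check can be run for the piMOM case with r = 1, using per-coordinate density |β|^{−(r+1)} exp(−τσ^{2}/β^{2}) with normalizing constant (τσ^{2})^{r/2}/Γ(r/2) (our reading of the piMOM form; Γ(1/2) = √π, and τ is an arbitrary illustrative value). The density is proper and essentially zero near the null value.

```python
import numpy as np
from scipy.integrate import quad

tau, sigma2, r = 2.0, 1.0, 1   # illustrative values
a = tau * sigma2

def pimom_1d(beta):
    # one-dimensional piMOM density for r = 1, with normalizing constant
    # (tau*sigma^2)^{r/2} / Gamma(r/2) and Gamma(1/2) = sqrt(pi)
    return a ** (r / 2) / np.sqrt(np.pi) * np.abs(beta) ** (-(r + 1)) * np.exp(-a / beta ** 2)

half, _ = quad(pimom_1d, 1e-8, np.inf)   # one symmetric half of the real line
print(round(2 * half, 3))                # 1.0 -- a proper density
print(pimom_1d(1e-6) == 0.0)             # True -- vanishes near the null value
```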
Based on these prior densities, Johnson and Rossell [67] derived marginal densities in analytical form, leading to analytical posterior model probabilities, and proposed an MCMC scheme to obtain posterior samples. They demonstrated that the proposed method is model selection consistent (the posterior probability of selecting the 'true' model approaches 1) as n → ∞, as long as p ≤ n, under certain regularity conditions on the design matrix X. The proposed method also performed impressively when compared with Bayesian methods based on local prior specifications [61-64]. Moreover, it performs as well as, or better than, various regularization methods, as evident from their simulation studies. Although they provided frequentist versions of the nonlocal prior specifications, Johnson and Rossell [67] recommended the Bayesian methods, as they facilitate inference on the posterior probability that each model is true, along with easy computation.
Bayesian Subset Regression (BSR)
Another recently developed method, particularly interesting to us and within the scope of the current review, is the Bayesian Subset Regression (BSR) method proposed by Liang et al. [92] for high dimensional generalized linear models. They propose a new prior specification under which the negative log-posterior distribution approximately reduces to EBIC when the sample size is large. In addition, they propose a variable screening procedure based on the marginal inclusion probability, which is shown to possess the same sure screening property and consistency as the SIS procedure of Fan and Song [93]; however, both SIS [93,94] and its iterative extension ISIS [95] are outperformed by the proposed method in terms of prediction accuracy. The proposed method also outperforms several popular regularization methods, including the lasso and the elastic net, suggesting that BSR is more suitable than regularization methods in high dimensional models. Here, we describe the method for high dimensional linear models.
Prior specification: To formulate the prior specification of their method, denote the number of explanatory variables by P_{n}. Following Liang et al. [92], let ξ_{n} denote a subset model (of size ≤ P_{n}) of the full model with P_{n} predictors, and let β_{ξ_{n}} denote the vector of true regression coefficients of the model ξ_{n}. With this formulation, Liang et al. [92] set up the following priors on β_{ξ_{n}} and ξ_{n}:
(34)
(35)
where the prior variance is pre-specified, chosen such that
(36)
which ensures that the prior information on the regression coefficients can be ignored for sufficiently large n; v_{n} denotes the prior probability, independent across variables, that each variable is selected for the subset model, and K_{n} is an upper bound on the model size, facilitating calculation of the MLE.
Posterior distribution: With the above prior specifications and the following prior probability specification
(37)
the log-posterior distribution is approximated as follows whenever the model size |ξ_{n}| does not exceed K_{n}:
(38)
and equals −∞ otherwise. Thus, the negative of the log-posterior approximately reduces to EBIC. This leads to Bayesian subset regression, with the MAP model approximately equivalent to the minimum-EBIC model. To simulate from the posterior, they propose an adaptive MCMC algorithm. Through extensive numerical studies, they establish that BSR significantly outperforms penalized likelihood approaches, especially as the dimension P_{n} increases. Under mild conditions, they establish the consistency of the resulting posterior; in addition, they show that the posterior probability of the true model converges to 1 as n → ∞.
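Since the MAP model under BSR approximately coincides with the minimum-EBIC model, a small best-subset search using EBIC conveys the flavor of the procedure. The sketch below (simulated data; the EBIC parameter γ = 1 and the size cap of 3 are illustrative choices) uses the Gaussian-linear-model form of Chen and Chen's EBIC, n log(RSS/n) + k log n + 2γ log C(P_{n}, k).

```python
import numpy as np
from itertools import combinations
from math import comb, log

rng = np.random.default_rng(6)
n, p, gamma_ = 100, 8, 1.0                 # gamma_ is the EBIC tuning parameter
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)   # true support {0, 3}

def ebic(subset):
    # Gaussian-linear-model EBIC: n log(RSS/n) + k log n + 2 gamma log C(p, k)
    k = len(subset)
    if k == 0:
        rss = float(y @ y)
    else:
        Xs = X[:, list(subset)]
        beta = np.linalg.lstsq(Xs, y, rcond=None)[0]
        rss = float(np.sum((y - Xs @ beta) ** 2))
    return n * log(rss / n) + k * log(n) + 2 * gamma_ * log(comb(p, k))

# exhaustive search over all models of size at most 3
best = min((s for k in range(4) for s in combinations(range(p), k)), key=ebic)
print(best)   # the selected model should contain the true support {0, 3}
```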
In this section, we consider the asymptotic behaviors of Bayesian variable selection methods in high dimensional linear models. Here also, we denote the number of explanatory variables by P_{n}, which is possibly much larger than the sample size n, and it is assumed that the regression coefficients satisfy a sparseness condition, where β_{j}^{*} is the j^{th} 'true' regression coefficient. The sparseness condition describes a general situation in which all explanatory variables are relevant, but most of them have very small effects [96]. Several authors have studied the asymptotic properties of Bayesian variable selection methods under different regularity conditions [68,96-99]. Wang et al. [96], Jiang [97] and Jiang [98] used a framework developed by Wasserman [100] for showing consistency in the context of density estimation: given a proper prior that proposes joint densities f(x, y) of the response y and the vector of explanatory variables x, 'density consistency' holds if the posterior-proposed densities are close to the true density for large n. Under such priors, these three works showed that the posterior-proposed joint densities are consistent for the true joint density for large n. Jiang [98] defined a proper prior over different sets of explanatory variables to prove that the posterior-proposed densities under different parameterizations are consistent for the true density. Wang et al. [96] further proved 'regression function consistency', which ensures good performance of high dimensional Bayesian variable selection methods, especially when some regression coefficients are bounded away from zero while the rest are exactly zero. Jiang [97] also studied the convergence rates of the fitted densities for generalized linear models and established consistency under realistic assumptions.
In summary, the asymptotic results reveal that, with appropriate prior specifications, high dimensional Bayesian variable selection methods can not only identify the 'true' model with probability tending to 1 but also give consistent parameter estimates [96], facilitating applications in a variety of fields. For other asymptotic results, refer to Casella et al. [19] or Moreno et al. [101].
On the other hand, unlike for frequentist methods, the asymptotic behaviors of Bayesian regularization and shrinkage methods are less studied and poorly understood [68]. Armagan et al. [99] provided sufficient conditions on prior concentration for strong posterior consistency when P_{n} = o(n) as n → ∞ for various Bayesian shrinkage methods, including the Bayesian lasso [71], the generalized double Pareto [85], and the horseshoe estimator [84]. Posterior consistency involves examining the posterior probability of a set of parameter values as n → ∞, where the set is any neighborhood of the true parameter value [102]. It is mainly verified by checking the sufficient conditions of a general posterior consistency theorem, after defining a suitable topology on the parameter space and a neighborhood of the true value of the parameter [102]. Bhattacharya et al. [92] studied the prior concentrations of various Bayesian shrinkage priors to investigate whether the entire posterior distributions of these methods concentrate at the optimal rate, i.e. whether the posterior probability assigned to a shrinking neighborhood (proportional to the optimal rate) of the true value of the parameter converges to 1. They argued that most Bayesian shrinkage methods are sub-optimal in terms of posterior contraction, which is a stronger optimality condition than posterior consistency alone. Due to this lack of theoretical justification, Bayesian shrinkage methods are not easily adopted by subjective Bayesians and frequentists [68]. Therefore, much work remains to be done to better understand the asymptotic behaviors of these approaches in high dimensional linear models.
Over the years, Bayesian methods have evolved immensely with the growth in modern computing power. Based on modern statistical techniques, these approaches are powerful tools for handling the challenges of high dimensional data analysis. This paper attempts to provide a selective overview of these methods for high dimensional linear models. Our discussion is limited to linear models; we have not discussed nonparametric Bayesian methods [103-107], various machine learning techniques [66,95,108], and several other approaches, which have their own advantages and are beyond the scope of the current review. We particularly focus on Bayesian regularization methods, which have enjoyed a great deal of applicability in recent years, and we have summarized the model-fitting algorithms for these methods in high dimensional linear models. For a more comprehensive review of Bayesian formulations of various regularization methods as hierarchical models, we refer readers to Kyung et al. [48]. We have also described two new Bayesian model selection methods that have shown excellent performance in high dimensional model selection, and we have addressed the asymptotic behaviors of high dimensional Bayesian methods for linear models under certain regularity conditions. The paper should be particularly helpful in understanding the basic methodology used in high dimensional Bayesian methods. It is to be noted that Bayesian variable or model selection is a much broader topic than described here, and in practice involves several data preprocessing steps, such as variable transformations, coding of variables, and removal of outliers. Therefore, in real-life applications, the general framework of Bayesian variable and model selection [109] should be applied to these methods to ensure accurate results and easy implementation.
This work was supported in part by the research grant NIH 5R01GM069430-08. Himel Mallick was supported in part by a cooperative agreement grant U01 NS041588 from the National Institute of Neurological Disorders and Stroke, National Institutes of Health and Department of Health and Human Services. We thank the editor for several thoughtful suggestions on our earlier version of the manuscript, which greatly improved the quality of the paper.