Bayesian Methods for High Dimensional Linear Models

In this article, we present a selective overview of some recent developments in Bayesian model and variable selection methods for high dimensional linear models. While most reviews in the literature cover conventional methods, we focus on recently developed methods, which have proven successful in dealing with high dimensional variable selection. First, we give a brief overview of traditional model selection methods (viz. Mallow's Cp, AIC, BIC, DIC), followed by a discussion of some recently developed methods (viz. EBIC, regularization) that have attracted considerable attention from statisticians. Then, we review high dimensional Bayesian methods with a particular emphasis on Bayesian regularization methods, which have been used extensively in recent years. We conclude by briefly addressing the asymptotic behaviors of Bayesian variable selection methods for high dimensional linear models under different regularity conditions.

Citation: Mallick H, Yi N (2013) Bayesian Methods for High Dimensional Linear Models. J Biomet Biostat S1: 005. doi:10.4172/2155-6180.S1-005 (special issue: Advances in Markov Chain Monte Carlo Methods and Survival Analysis)


Introduction
Linear models are probably the most widely used statistical models for investigating the influence of a set of predictors on a response variable. In practice, only a small subset of potential covariates actually influences the response, whereas the effect of most predictors is very small or even zero. Since model misspecification can have a significant impact on scientific results, correctly identifying relevant variables is an important issue in any scientific research. If more predictors are included in the model, a higher proportion of the response variability can be explained. On the other hand, overfitting (inclusion of predictors with null effect) results in a less reliable model with poor predictive performance. The problem of variable selection becomes more challenging in high dimensional settings, particularly when the number of predictors greatly exceeds the number of observations. High dimensional problems arise in a variety of scientific fields, such as bioinformatics, medicine and genetics. High dimensionality can lead to very complex models, which pose serious challenges in estimation, prediction and interpretation. Therefore, many classical approaches to variable selection cease to be useful due to computational infeasibility, model non-identifiability, or both.
Various methods have been developed over the years for variable selection in high dimensional linear models. Very recently, much work has been done within the Bayesian framework. Unlike non-Bayesian methods, Bayesian analysis enables one to deal with model uncertainty by averaging over all possible models. Moreover, Bayesian methods can significantly reduce the actual complexity of the estimation procedure by incorporating prior information into the model estimation technique. With ever-increasing computing power, these methods are becoming increasingly popular and are receiving more and more attention in high dimensional analysis.
In this review, we present a selective overview of some recent developments in Bayesian model and variable selection methods for high dimensional linear models. While most reviews in the literature cover conventional methods, we focus on recently developed methods, which have been used extensively in recent years. The review is structured as follows. First, we give a brief overview of the traditional, commonly used model selection methods, followed by a discussion of some recently developed approaches for linear models, which have proven successful in dealing with high dimensional variable selection. Then, we mention some conventional Bayesian variable and model selection methods, along with some recently developed Bayesian approaches, with a particular emphasis on Bayesian regularization methods. We conclude by briefly addressing the asymptotic behaviors of Bayesian high dimensional variable selection methods for linear models under different regularity conditions.

Classical Model Selection Methods for Linear Models
In statistical tradition, commonly used methods for model selection are backward, forward and stepwise selection, where at every step predictors are added to or eliminated from the model according to a precisely defined testing rule. Besides accurate prediction, a primary goal of model selection is to arrive at a meaningful, interpretable and parsimonious model. Traditional methods such as stepwise regression fall short on one or more of these criteria. To overcome these shortcomings, several information-type methods have been developed, which aim to provide a trade-off between model complexity and goodness-of-fit. We describe the standard normal linear model and the Bayesian hierarchical normal linear model in the following subsections. Then, we briefly review four commonly used methods, viz. Mallow's Cp [1], AIC [2] and BIC [3] for standard normal models, and DIC [4] for Bayesian hierarchical normal linear models, which remain the most popular model selection methods for linear models and hierarchical linear models, respectively.

Standard normal linear model
In a standard normal linear regression, it is assumed that the mean of the response variable can be described as a linear function of a set of predictors. Mathematically, we have the model

y = Xβ + ε,

where y is the n × 1 vector of centered responses; X is the n × p matrix of standardized regressors; β is the p × 1 vector of coefficients to be estimated; and ε is the n × 1 vector of independent N(0, σ²) errors. The ordinary least squares (OLS) estimator minimizes the residual sum of squares RSS(β) = (y − Xβ)'(y − Xβ), giving β̂ = (X'X)⁻¹X'y when X has full column rank. The OLS estimator has some attractive statistical properties, in the sense that it is the best linear unbiased estimator (BLUE), provided the classical assumptions hold [5]. Also, it coincides with the maximum likelihood estimator (MLE) for normal linear models.
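As a concrete illustration, the OLS fit for a centered, standardized design can be computed in a few lines of NumPy; the data below are simulated and the variable names are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)            # centered regressors, as in the text
beta_true = np.array([2.0, 0.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)
y = y - y.mean()                  # centered response

# OLS: minimize the residual sum of squares ||y - X beta||^2
beta_ols, rss_full, rank, _ = np.linalg.lstsq(X, y, rcond=None)
```

With n much larger than p, `beta_ols` recovers the true coefficients up to sampling noise.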

Bayesian hierarchical normal linear model
In the Bayesian hierarchical normal linear model, the distribution of the dependent variable y is specified conditional on the parameters β and σ² as

y | β, σ² ~ N(Xβ, σ²Iₙ).

Then, any prior information on (β, σ²) is incorporated by specifying a suitable prior distribution p(β, σ²). This second-level model has its own parameters, known as hyperparameters, which are usually estimated from the data. After observing the data, the prior distribution p(β, σ²) is updated to the corresponding posterior distribution, which is obtained as

p(β, σ² | y) ∝ p(y | β, σ²) p(β, σ²).

The posterior distribution contains all the current information about the parameters. Ideally, one might fully explore the entire posterior distribution by sampling from it using Markov chain Monte Carlo (MCMC) algorithms [6]. Due to its ability to incorporate the specific hierarchical structure of the data (e.g. correlation among the predictors), hierarchical modeling is often more efficient than traditional approaches [7].

Mallow's Cp
For a subset model with k ≤ p explanatory variables, the Cp statistic proposed by Mallows [1] is defined as

Cp = RSS(k)/s² − n + 2k,

where s² is the MSE of the full model containing all p explanatory variables and RSS(k) is the residual sum of squares of the subset model containing k explanatory variables. In practice, Cp is usually plotted against k for a collection of subset models under consideration, and models with Cp approximately equal to k are taken as acceptable models, in the sense of minimizing the total bias of the predicted values [5]. Woodroofe [8] showed that Cp is a conservative model selector, which tends to overfit. Nishii [9] showed that Cp is not consistent in selecting the true model, and often tends to select a larger model as n → ∞.
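A minimal sketch of computing Cp over candidate subsets on simulated data (variable names are ours); note that, by construction, the full model always has Cp exactly equal to p:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p = 60, 4
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -1.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

def rss(Xs, y):
    # residual sum of squares of the least squares fit
    b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ b
    return r @ r

s2 = rss(X, y) / (n - p)          # MSE of the full model

def mallows_cp(idx):
    # Cp = RSS(k)/s^2 - n + 2k for the subset indexed by idx
    k = len(idx)
    return rss(X[:, list(idx)], y) / s2 - n + 2 * k

cp_full = mallows_cp(range(p))    # equals p exactly, by algebra
```

In practice one would evaluate `mallows_cp` over all subsets (e.g. via `combinations`) and plot the values against k.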

Akaike's Information Criteria (AIC)
The AIC of Akaike [2] is defined as

AIC = −2 log L + 2k,

where L is the likelihood function evaluated at the MLE and k is the number of parameters in the model. Given a set of candidate models, the 'best' model is the one with the minimum AIC value. Similar to Mallow's Cp, AIC is not model selection consistent [9]. Here, consistency of a model selection criterion means that the probability of the selected model being equal to the true model converges to 1. More information about AIC can be found in Burnham and Anderson [10]. The asymptotic approximation on which AIC is based is rather poor when n is small [11]. Therefore, Hurvich and Tsai [12] proposed a small-sample correction, leading to the corrected AIC statistic

AICc = AIC + 2k(k + 1)/(n − k − 1).

AICc converges to AIC as n gets larger, and it is therefore preferred to AIC regardless of the sample size [11].
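For a normal linear model, −2 log L at the MLE equals n log(RSS/n) up to an additive constant common to all candidate models, so AIC and AICc can be sketched as follows (function names are ours):

```python
import numpy as np

def gaussian_aic(rss_val, n, k):
    # -2 log L at the MLE for a normal linear model, up to an additive
    # constant that is identical across candidate models
    neg2loglik = n * np.log(rss_val / n)
    return neg2loglik + 2 * k

def gaussian_aicc(rss_val, n, k):
    # small-sample corrected AIC (Hurvich and Tsai)
    return gaussian_aic(rss_val, n, k) + 2 * k * (k + 1) / (n - k - 1)
```

The correction term 2k(k+1)/(n−k−1) is large for small n and vanishes as n grows, so AICc and AIC agree asymptotically.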

Bayesian Information Criteria (BIC)
While AIC is motivated by the Kullback-Leibler discrepancy of the fitted model from the true model, Schwarz [3] derived BIC from a Bayesian perspective, by evaluating the leading terms of the asymptotic expansion of the Bayes factor. The BIC is defined as

BIC = −2 log L + k log n,

where L is the likelihood function evaluated at the MLE and k is the number of parameters. As with AIC, the model with minimum BIC is chosen as the preferred model from a set of candidates. It is well known that neither AIC nor BIC performs better all the time. However, unlike AIC, BIC is a consistent model selection criterion: as the sample size n gets large, the lowest-BIC model will be the true model with probability approaching 1 [10,11]. For a comparison of AIC and BIC, refer to Kundu and Murali [13] or Yang [14].
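The same normal-model simplification, −2 log L = n log(RSS/n) plus a model-independent constant, gives a computable BIC. The small simulation below (simulated data, our variable names) illustrates BIC's preference for the smaller true model when the extra predictors are null:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 2000, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0])   # last three predictors are null
y = X @ beta_true + rng.normal(size=n)

def bic(cols):
    # BIC = n log(RSS/n) + k log(n), up to a model-independent constant
    Xs = X[:, cols]
    b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss_val = np.sum((y - Xs @ b) ** 2)
    return n * np.log(rss_val / n) + len(cols) * np.log(n)

bic_true = bic([0, 1])             # the true two-variable model
bic_full = bic([0, 1, 2, 3, 4])    # the full model with three null predictors
```

With n this large, the k log n penalty dominates the small drop in RSS from fitting the null predictors, so the true submodel wins.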

Deviance Information Criteria (DIC)
For model selection in Bayesian hierarchical normal linear models, Spiegelhalter et al. [4] proposed a generalization of AIC and BIC, defined as

DIC = D(θ̄) + 2p_D,

where D(θ) = −2 log L(θ) is the deviance, D(θ̄) is the deviance evaluated at the posterior mean θ̄ of the parameters, and p_D is the effective number of parameters, calculated as the difference between the posterior mean deviance and the deviance at the posterior means, i.e. p_D = D̄ − D(θ̄). Like AIC and BIC, models with smaller DIC are better supported by the data. DIC is particularly useful when MCMC samples are easily available, but it is valid only when the joint distribution of the parameters is approximately multivariate normal [4]. DIC tends to select over-fitted models, an issue addressed by Ando [15], although very little is known about its performance in high dimensional models. As noted by Gelman et al. [16], various other difficulties (apart from overfitting) have been noted with DIC, but there has been no consensus on an alternative.
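A minimal sketch of computing DIC from posterior draws, for a toy normal-mean model where the posterior is available in closed form (so we sample it directly rather than running MCMC; the names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
y = rng.normal(loc=3.0, scale=1.0, size=n)

# Toy model: y_i ~ N(mu, 1) with a flat prior on mu, so the posterior of
# mu is N(ybar, 1/n); we draw posterior samples directly
mu_draws = rng.normal(loc=y.mean(), scale=np.sqrt(1.0 / n), size=20000)

def deviance(mu):
    # -2 log L up to a model-independent constant (sigma = 1 known)
    return np.sum((y - mu) ** 2)

d_bar = np.mean([deviance(m) for m in mu_draws])   # posterior mean deviance
d_at_mean = deviance(mu_draws.mean())              # deviance at posterior mean
p_d = d_bar - d_at_mean                            # effective number of parameters
dic = d_at_mean + 2 * p_d                          # equivalently d_bar + p_d
```

For this one-parameter model, p_D estimates 1, as expected.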

High Dimensional Methods
In the setting of a linear regression model, if the number of covariates p is of polynomial order or exponential order of the sample size n, i.e., p = O(n^κ) for some κ > 0, or log p = O(n^κ) for some 0 < κ < 1, then it is called a high dimensional problem [17]. Despite their popularity, classical model selection criteria such as Mallow's Cp, AIC and BIC tend to select more variables than necessary for high dimensional linear models, especially when the number of regressors increases with the sample size [18,19]. Also, Yang and Barron [20] argue that in some cases the overfitting problem can be substantial, resulting in severe selection bias, which damages predictive performance for high dimensional models. To overcome this, various model selection criteria for high dimensional models have been introduced recently. Wang et al. [21] proposed a modified BIC (mBIC), which is consistent when p diverges more slowly than n. Chen and Chen [22,23] developed a family of extended Bayesian information criteria (EBIC) for variable selection in high dimensional problems. On the other hand, a large amount of effort has gone into the development of regularization methods for simultaneous variable selection and coefficient estimation. Regularization methods mitigate modeling biases and achieve higher prediction accuracy in high dimensional linear models by shrinking the coefficients and providing meaningful estimates, even when the model includes a large number of predictors and/or highly correlated predictors. We describe the extended Bayesian information criteria and regularization methods in the following subsections, before gradually moving to Bayesian methods for high dimensional linear models.

Extended Bayesian Information Criteria (EBIC)
The family of EBIC is indexed by a parameter γ in the range [0,1]. For a candidate model with k parameters, the extended BIC is defined as

EBIC = −2 log L + k log n + 2γ log C(p, k),

where L is the likelihood function evaluated at the MLE, C(p, k) is the number of candidate models of size k, and γ ≥ 0 is a tuning parameter. The original BIC is a special case of EBIC with γ = 0. The mBIC is also a special case of EBIC in an asymptotic sense, i.e., it is asymptotically equivalent to an EBIC criterion at the 'true' model. Among other developments, the General Information Criterion (GIC) proposed by Shao [24] is known to be consistent in high dimensions. Kim et al. [25] showed that EBIC is asymptotically equivalent to GIC.

Regularization methods
Similar to information-type methods, various regularization methods have been developed to overcome the problem of overfitting in high dimensional linear models. It is well known that OLS often does poorly in both prediction and interpretation in high dimensions. Despite its nice statistical properties, it is highly unstable in the presence of multicollinearity. Also, if p ≫ n, it produces a non-unique estimator, since X is less than full rank (non-identifiability). Motivated by this, regularization methods (also known as penalized likelihood or shrinkage methods) with various penalties have been developed, which have proven to be successful and model selection consistent for high dimensional linear models [26,27].

The problem of interest involves estimating a sparse vector of regression coefficients by minimizing an objective function Q composed of a loss function (without loss of generality, the most commonly used least squares loss (RSS) is considered here, although least absolute deviation and the negative log-likelihood are also common) plus a penalty function, i.e.

Q(β) = RSS(β) + P_λ(β),

where P is a function of the coefficients indexed by a parameter λ > 0, which controls the degree of penalization. Typically, the penalty function P has the following properties [28]:

1. It is symmetric about the origin, i.e. P_λ(β) = P_λ(−β), with P_λ(0) = 0.

This approach produces a spectrum of solutions depending on the value of λ. Such methods are often referred to as regularization methods, and λ is called the regularization parameter (or tuning parameter). The penalty function serves to control the complexity of the model and provides criteria for variable selection and model comparison, by imposing constraints on the parameters. The form of P_λ(β) determines the general behavior of regularization methods. Small penalties lead to large models with limited bias but potentially high variance; large penalties lead to the selection of models with fewer predictors but less variance. A variety of penalty terms have been proposed, among which the most popular are ridge regression, the lasso and the elastic net. We summarize these methods in Table 1. For a more comprehensive review of regularization methods, refer to Bickel and Li [29].

The bridge family: Regularization methods date back to the proposal of ridge regression by Hoerl and Kennard [30], who suggested minimizing the objective function

Q(β) = RSS(β) + λ Σ_j β_j².

As a continuous shrinkage method, ridge regression achieves better prediction performance than OLS through a bias-variance tradeoff (biased estimates with lower variance). However, ridge regression cannot produce a parsimonious model, as it always keeps all the predictors in the model. Frank and Friedman [31] introduced bridge regression, a broad class of penalized regression obtained by minimizing

Q(β) = RSS(β) + λ Σ_j |β_j|^α,

where λ is the tuning parameter and α is the concavity parameter, as it controls the concavity of the objective function. Bridge regression includes the lasso and ridge as special cases (corresponding to α = 1 and α = 2, respectively). Although Frank and Friedman [31] did not solve for the bridge estimator for any given α ≥ 0, they indicated that an optimal choice of α would yield a reasonable predictor. The bridge estimator performs variable selection when α ≤ 1 and shrinks the coefficients when α > 1 [32]. These three estimators, viz. ridge, lasso and bridge, are together referred to as the bridge family.

Among penalized regression techniques, the most popular and widely used method in the statistical literature is the Least Absolute Shrinkage and Selection Operator (lasso). The lasso of Tibshirani [33] is obtained by minimizing

Q(β) = RSS(β) + λ Σ_j |β_j|.

Compared to ridge regression, a remarkable property of the lasso is that it can shrink some coefficients exactly to zero, and therefore can automatically achieve variable selection. Intuitively, this can be explained by the geometry of the constraint regions (Figure 1). To obtain the regularized estimators, we essentially seek the points at which the objective function contour first "hits" the constraint. The lasso, ridge and bridge penalty functions have constraints shaped like a diamond, circle and star, respectively. As a consequence of the different shapes, the lasso is likely to involve variable selection (β1 = 0 or β2 = 0) as well as parameter shrinkage, while ridge yields mainly parameter shrinkage; in contrast, bridge induces an even higher chance of variable selection than the lasso, because the star shape of bridge makes the contour even more likely to hit one of the points (β1 = 0 or β2 = 0) than does the diamond shape of the lasso.
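To make the lasso concrete, here is a minimal cyclic coordinate descent solver built on the soft-thresholding operator; the data are simulated and the function names are ours:

```python
import numpy as np

def soft_threshold(z, t):
    # scalar lasso solution: sign(z) * max(|z| - t, 0)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Cyclic coordinate descent for (1/2)||y - Xb||^2 + lam * ||b||_1
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual excluding j
            b[j] = soft_threshold(X[:, j] @ r_j, lam) / col_ss[j]
    return b

rng = np.random.default_rng(3)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_lasso = lasso_cd(X, y, lam=50.0)
```

With a sufficiently large λ, the eight null coefficients are set exactly to zero, while the two signal coefficients are retained (and shrunk toward zero).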

Some generalizations:
The lasso has demonstrated excellent performance in many situations, and as a consequence, most developments in recent years have focused on the lasso and related problems. However, despite its promising nature, the lasso has three inherent drawbacks [35]. Firstly, due to the nature of its convex optimization problem, the lasso cannot select more predictors than the sample size. In practice, there are often studies that involve many more predictors than observations, e.g. microarray gene expression analysis, cloud detection from satellite images, classification of spam emails, and many others. Secondly, when there is group structure among the predictors, the lasso usually selects only one predictor from a group while ignoring the others. Thirdly, when the predictors are highly correlated, the lasso performs unsatisfactorily. We discuss some of the alternatives and generalizations that have been proposed to overcome these limitations.
The lasso uses a single tuning parameter λ to penalize all the coefficients equally. In practice, due to this single tuning parameter, the lasso can either include irrelevant variables or over-shrink large coefficients [36], which is critical for high dimensional problems. To address this issue, Zou [37] introduced the adaptive lasso, which uses a weighted L1 penalty

P_λ(β) = Σ_j λ_j |β_j|,

where λ_j is the tuning parameter corresponding to the j-th coefficient. The intuition of the adaptive lasso is to shrink coefficients differently, shrinking important variables slightly and unimportant variables heavily.

To address the issue of variable selection for grouped variables, Yuan and Lin [38] proposed the group lasso estimator, in which the penalty is given by

P_λ(β) = λ Σ_{k=1}^{K} ||β_k||,

where K is the number of groups and β_k is the vector of coefficients in the k-th group. With an appropriately chosen tuning parameter, the group lasso can shrink all coefficients in a group to zero, or keep all coefficients in a group in the model.

Zou and Hastie [35] proposed the elastic net estimator to achieve improved performance in situations where there is multicollinearity and grouping among predictors. The penalty term in the elastic net is a convex combination of the lasso penalty and the ridge penalty, i.e.

P(β) = λ1 Σ_j |β_j| + λ2 Σ_j β_j².

The elastic net estimator can be interpreted as a stabilized version of the lasso, and it often outperforms the lasso, especially when there are groups of highly correlated predictors.

Optimization algorithms: Various optimization algorithms have been proposed to obtain the lasso and related estimators [44]. Notably, the least angle regression (LARS) algorithm [45] and the coordinate descent algorithm [46,47] are the most computationally efficient ones. Given the tuning parameters, these algorithms are extremely fast (e.g. the computational load of LARS is the same as that of a single OLS fit), thus making penalized regression approaches extremely popular in high dimensional data analysis.

Figure 1: A graphical illustration of the properties of three different penalty functions. The ellipses represent the contours of the objective functions (A-C). The square, round, and star shapes represent the lasso, ridge, and bridge constraint, respectively. The dots are the points where contours are "tangent" to the constraints, i.e., the regularized estimates. Note that, in lasso (A) or bridge (C), the constraint is non-differentiable at zero. If the contour first touches the constraint at zero, the corresponding parameter estimate is zero and variable selection is achieved [34].
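The elastic net's grouping effect can be illustrated with a small sketch; for simplicity we solve it with a generic proximal gradient (ISTA) iteration rather than LARS or coordinate descent, and all names below are ours:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 100
x0 = rng.normal(size=n)
x1 = x0 + 0.01 * rng.normal(size=n)     # nearly identical to x0 (correlated pair)
X = np.column_stack([x0, x1] + [rng.normal(size=n) for _ in range(3)])
y = 2.0 * x0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

def enet_ista(X, y, lam1, lam2, n_iter=3000):
    # Proximal gradient (ISTA) for
    # (1/2)||y - Xb||^2 + lam1*||b||_1 + (lam2/2)*||b||^2
    p = X.shape[1]
    L = np.linalg.norm(X, 2) ** 2 + lam2   # Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) + lam2 * b
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam1 / L, 0.0)   # prox of L1 term
    return b

b_enet = enet_ista(X, y, lam1=5.0, lam2=10.0)
```

The ridge component splits the shared signal between the two near-duplicate predictors, so both enter the model with nearly equal coefficients, whereas a pure lasso would tend to pick only one of them.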

Although penalized regression approaches are theoretically attractive, there are at least three serious disadvantages of the frequentist framework:

1. Penalized regression is essentially an optimization problem that only provides a point estimate of β. Nevertheless, one usually also needs a measure of confidence in the estimates, such as a confidence interval (or credible interval) and a p-value. This problem can be addressed by applying bootstrap sampling [33], but doing so is computationally intensive. Moreover, Kyung et al. [48] showed that the bootstrap estimates of the standard errors of the lasso estimates can be unstable, and are not consistent when the true β_j = 0.
2. The penalized regression approaches need to preset the tuning parameter(s). The commonly used approach is cross-validation [49] to choose 'optimal' values of the tuning parameters. However, cross-validation can be computationally costly; e.g., for the adaptive lasso, it is very challenging to choose multiple tuning parameters. Moreover, cross-validation targets the predictive accuracy of the model: choosing tuning parameters by cross-validating the prediction error can perform unsatisfactorily when the main goal is to identify relevant variables.
3. In the frequentist framework, it may be challenging to deal with complicated multilevel/hierarchical data structures, and to incorporate external information (for example, specific group-level information) into the model.
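Point 2 above can be made concrete with a small sketch of K-fold cross-validation for a single tuning parameter, here for ridge regression, whose fit has a closed form (simulated data, our names):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 80, 10
X = rng.normal(size=(n, p))
beta_true = np.concatenate([np.array([2.0, -1.5]), np.zeros(p - 2)])
y = X @ beta_true + rng.normal(size=n)

def ridge_fit(Xtr, ytr, lam):
    # closed-form ridge solution (X'X + lam I)^{-1} X'y
    k = Xtr.shape[1]
    return np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(k), Xtr.T @ ytr)

def cv_error(lam, n_folds=5):
    # K-fold cross-validation estimate of prediction error for one lambda
    idx = np.arange(n)
    folds = np.array_split(idx, n_folds)
    err = 0.0
    for f in folds:
        train = np.setdiff1d(idx, f)
        b = ridge_fit(X[train], y[train], lam)
        err += np.sum((y[f] - X[f] @ b) ** 2)
    return err / n

grid = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
best_lam = min(grid, key=cv_error)
```

Each candidate λ requires n_folds refits, which is what makes cross-validation expensive when the grid is fine or when several tuning parameters must be searched jointly.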

High Dimensional Bayesian Variable and Model Selection Methods
In the Bayesian framework, the model selection problem is transformed into one of parameter estimation: rather than searching for the single optimal model, a Bayesian attempts to estimate the posterior probability of all models within the considered class (or, in practice, of all models with non-negligible probability) [50]. In many cases, this question is asked in variable-specific form (variable selection), i.e., the task is to estimate the marginal posterior probability that a variable should be in the model [50]. Thus, variable selection can be considered a special case of model selection.
Most conventional Bayesian variable selection methods rely on MCMC algorithms that specify spike-and-slab priors on the coefficients subject to selection [50-52], requiring computation of the marginal likelihood, which is computationally intensive for high dimensional models. Also, when the marginal likelihood is not analytically tractable, posterior sampling for these methods often requires a stochastic search over an enormous space of complicated models, resulting in slow convergence and poor mixing [67]. On the other hand, Bayesian regularization methods specify both the spike and the slab as continuous distributions, which can be written as scale mixtures of normals, leading to simpler MCMC algorithms in which no marginal likelihood needs to be computed. Also, unlike conventional Bayesian methods, Bayesian regularization methods specify shrinkage priors, which enable simultaneous variable selection and coefficient estimation. We also discuss two new model selection approaches, viz. Bayesian model selection based on nonlocal prior densities, proposed by Johnson and Rossell [67], and Bayesian subset regression (BSR), proposed by Liang et al. [68], which are shown to be model selection consistent for high dimensional linear models.

Bayesian regularization methods
Regularization methods were originally developed in the frequentist framework, where statistical inference on the regression coefficients is usually difficult and often requires various kinds of asymptotic approximations. In contrast, a Bayesian approach enables exact inference, even when the sample size is small. Regularization methods naturally lend themselves to a Bayesian interpretation, in which the penalization term in penalized regression is the negative log prior of the coefficients. Apart from their easy interpretability, Bayesian methods have some advantages over frequentist methods. Firstly, in MCMC-based Bayesian regularization methods, we have a valid measure of standard error obtained from the posterior distribution, and thus we can easily obtain interval estimates of the parameters, along with other quantities. Secondly, the Bayesian approach is more flexible, in the sense that the tuning parameter can be estimated jointly with the other parameters of interest. Thirdly, unlike in the frequentist framework, it is fairly straightforward to extend a Bayesian model to incorporate multilevel information inherent in the data. Lastly, using MCMC and the Gibbs sampler to search the model space for the most probable models, without fitting all possible models, is efficient and avoids time-consuming computation [56].
Regularization methods as hierarchical models: Recent years have seen a resurgence of interest in Bayesian hierarchical modeling techniques. This increasing popularity can be attributed to the fact that hierarchical models are more easily interpreted and handled in the Bayesian framework. Hierarchical models can significantly reduce the 'effective' number of parameters in a model by linking the coefficients together, or shrinking some of them. In Bayesian hierarchical models, the hyperparameters include shrinkage parameters, which control the complexity of the model, similar to the tuning parameter λ in penalized regression approaches. With a given prior distribution, the log posterior density for linear models becomes

log p(β, σ² | y) = log p(y | β, σ²) + log p(β, σ²) + constant,

which has the same form as a penalized log-likelihood with penalty equal to the negative log prior. When there is a large number of potential variables, it is reasonable to assume that most of the variables have no or very weak effects, whereas only some have noticeable effects. Therefore, we should set up a prior distribution that gives each coefficient a high probability of being near zero. Such prior distributions are often referred to as shrinkage prior distributions.
A shrinkage prior distribution should have an infinite spike at zero and very heavy tails, thereby strongly shrinking small coefficients toward zero while minimally shrinking large coefficients, and it should also allow the hierarchical structure of the data to be incorporated. The resulting hierarchical models can then effectively remove unimportant variables and reliably estimate the coefficients of important variables simultaneously [6].

Different Hierarchical Formulations
Bayesian lasso: Tibshirani [33] suggested that the lasso estimates can be interpreted as posterior mode estimates when the regression parameters are assigned independent and identical Laplace priors. A remarkable feature of the double exponential distribution is that it can be represented as a two-level hierarchical model [69]: a scale mixture of normal distributions with independent, exponentially distributed variances. This two-level formulation offers the advantages of easy interpretation of the model and convenient development of computational algorithms. Park and Casella [71] introduced Gibbs sampling for the Bayesian lasso, using a conditional Laplace prior specification of the form

p(β | σ²) = ∏_{j=1}^{p} (λ / (2√σ²)) exp(−λ |β_j| / √σ²),

and a non-informative scale-invariant marginal prior on σ², i.e. p(σ²) ∝ 1/σ² [48]. The Bayesian formulation of the original lasso, as given in Park and Casella [71], is the following hierarchical model:

y | X, β, σ² ~ N_n(Xβ, σ² I_n),
β | σ², τ1², …, τp² ~ N_p(0, σ² D_τ), with D_τ = diag(τ1², …, τp²),
τj² ~ Exponential(λ²/2), independently for j = 1, …, p.

With this formulation, the conditional posterior distribution of β is normal, while the reciprocals of the latent variables 1/τj² follow inverse Gaussian distributions (Table 2; IG refers to the inverse Gaussian distribution), and the conditional posterior distribution of σ² is inverse gamma. Based on this, Park and Casella [71] formulated the Gibbs sampler for the Bayesian lasso and achieved variable selection by interval estimation. The adaptive version of the Bayesian lasso can be obtained similarly, by specifying variable-specific tuning parameters λ_j (Table 2).

Bayesian ridge: The ridge estimator corresponds to assigning the coefficients independent normal priors with a common latent variance parameter τ². Under this hierarchical prior, the conditional posterior distribution of β is normal, while the reciprocal of the latent variable τ² follows a χ² distribution. The above Bayesian ridge regression can be extended to include variable-specific latent variables τj², in which case each coefficient receives its own prior variance.

Bayesian group lasso: The group lasso also admits a hierarchical representation, with one latent variance parameter per group of coefficients. Under this hierarchical prior, the conditional posterior distribution of each group coefficient vector β_k is normal, while the reciprocals of the latent variables follow inverse Gaussian distributions (Table 2). A non-informative scale-invariant marginal prior on σ² results in an inverse gamma posterior distribution for σ².

Bayesian elastic net: Similarly, the elastic net estimator corresponds to a prior proportional to exp{−λ1 Σ_j |β_j| − λ2 Σ_j β_j²}, which also admits a hierarchical scale-mixture representation with latent variance parameters. Under this hierarchical prior, the conditional posterior distribution of β is normal, while the reciprocals of the latent variables follow inverse Gaussian distributions (Table 2). Here also, a non-informative scale-invariant marginal prior on σ² results in an inverse gamma posterior distribution for σ². For a slightly different version of the Bayesian elastic net estimator, refer to Li and Lin [72]. Other related developments include the Bayesian bridge [73] and a different version of the Bayesian adaptive lasso [74].
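The scale-mixture representation underlying these hierarchies can be checked by simulation: drawing τ² from an exponential distribution and then β | τ² from N(0, τ²) yields, marginally, a double exponential draw. A small sketch (our variable names):

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 1.0
m = 50000

# Two-level representation: tau2 ~ Exp(lam^2/2), beta | tau2 ~ N(0, tau2);
# note numpy's exponential takes scale = 1/rate = 2/lam^2
tau2 = rng.exponential(scale=2.0 / lam**2, size=m)
beta = rng.normal(loc=0.0, scale=np.sqrt(tau2))

# The marginal of beta is double exponential with density (lam/2) exp(-lam|b|),
# whose variance is 2/lam^2 and whose mean absolute value is 1/lam
var_hat = beta.var()
mean_abs = np.abs(beta).mean()
```

With λ = 1, the sample variance is close to 2 and the mean absolute value is close to 1, matching the Laplace marginal.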

Estimation of hyperparameters:
In the Bayesian framework, typical approaches to estimating the tuning parameters incorporate them into the Gibbs sampler with an appropriate hyperprior [48]. Park and Casella [71] suggested using a gamma prior G(a, b) to ensure a proper posterior. The prior, which is placed on λ² for convenience because of the way λ enters the posterior [48], is given by

p(λ²) = (b^a / Γ(a)) (λ²)^{a−1} exp(−b λ²), λ² > 0, a > 0, b > 0,

which results in a gamma full conditional distribution for λ² (Table 3). For the elastic net, there are two tuning parameters, and priors of this type can be assigned to each (Table 3). For Bayesian ridge, a gamma prior G(a, b) is placed on the shrinkage parameter s², which again results in a gamma posterior (Table 3). The tuning parameters can also be estimated through the marginal likelihood of λ, which can be implemented with an EM/Gibbs algorithm [48].
In this section, we briefly describe two popular model-fitting algorithms (viz. MCMC and EM) for estimating parameters in Bayesian regularization methods for linear regression. For the MCMC algorithms, we only describe the hierarchical formulations and the corresponding posterior distributions. Any feature of the posterior distribution is a legitimate candidate for Bayesian inference: moments, quantiles, highest posterior density regions, etc. [19]. The posterior median is one attractive measure, as a robust estimator. Hans [75] emphasized using the posterior mean as the point estimate, as it facilitates prediction of future observations via the posterior predictive distribution. However, implementation of MCMC algorithms can require substantial computing time. For practical and computational purposes, it is often desirable to quickly obtain point estimates of the parameters and their standard errors. The EM algorithm [76] aims to address this issue by providing MAP estimates along with speedy inference and fast computation.
It must be emphasized that both non-Bayesian and Bayesian regularization methods share the common goal of determining model parameters that optimize some objective function, yet a Bayesian approach can often lead to very different results than a traditional penalized regression approach [76]. Although the Gibbs samplers discussed in this paper are extremely fast [48], one should be aware of problems common to MCMC algorithms, e.g. slow convergence and poor mixing. Therefore, once the simulation algorithm has been implemented and the simulations are drawn, it is necessary to check the convergence of the simulated sequences [6]. Inference based on the output of a slowly converging algorithm may require a long runtime (many thousands or millions of iterations). If convergence is painfully slow or not attained, one should resort to an alternative algorithm or other remedies, e.g. a longer burn-in period, thinning, etc. [6], for the inference to be valid.

MCMC algorithm
As shown above, the conditional posterior distribution of each parameter has a standard form and thus can be sampled directly. The MCMC algorithm can therefore fully explore the joint posterior distribution by sampling each parameter in turn from its conditional posterior distribution. In summary, the MCMC algorithm proceeds as follows: 1. Initialize all the parameters with some plausible values. 2. Sample each parameter from its conditional posterior distribution, given the current values of all the other parameters. 3. Repeat step 2 until the desired number of posterior draws is obtained, discarding an initial burn-in period.
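To make these steps concrete, the following sketch implements the Gibbs sampler for the Bayesian lasso of Park and Casella [71] with the tuning parameter λ held fixed and a scale-invariant 1/σ² prior on the error variance. It is a minimal illustration under those assumptions (function and variable names are ours), not a production sampler.

```python
import numpy as np

def bayesian_lasso_gibbs(X, y, lam=1.0, n_iter=2000, seed=0):
    """Gibbs sampler sketch for the Bayesian lasso (lambda fixed).

    Full conditionals:
      beta     | . ~ N(A^{-1} X'y, sigma2 * A^{-1}),  A = X'X + D_tau^{-1}
      1/tau2_j | . ~ InvGaussian(sqrt(lam^2 sigma2 / beta_j^2), lam^2)
      sigma2   | . ~ InvGamma((n - 1 + p)/2, ||y - X beta||^2/2 + beta' D_tau^{-1} beta / 2)
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, tau2, sigma2 = np.zeros(p), np.ones(p), 1.0
    draws = np.empty((n_iter, p))
    XtX, Xty = X.T @ X, X.T @ y
    for t in range(n_iter):
        # Step 2a: beta from its multivariate normal full conditional
        A_inv = np.linalg.inv(XtX + np.diag(1.0 / tau2))
        beta = rng.multivariate_normal(A_inv @ Xty, sigma2 * A_inv)
        # Step 2b: latent scales via inverse-Gaussian draws (numpy's wald)
        mu = np.sqrt(lam**2 * sigma2 / np.maximum(beta**2, 1e-12))
        tau2 = 1.0 / rng.wald(mu, lam**2)
        # Step 2c: sigma2 from its inverse-gamma full conditional
        resid = y - X @ beta
        shape = (n - 1 + p) / 2.0
        rate = resid @ resid / 2.0 + (beta**2 / tau2).sum() / 2.0
        sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)
        draws[t] = beta
    return draws
```

In practice λ would itself be updated each iteration from its gamma full conditional (or estimated by marginal maximum likelihood), as discussed above.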

EM algorithm
Given the latent variables, MAP estimates of (β, σ) can be obtained by the EM algorithm [76]. The algorithm proceeds as follows: 1. Initialize all the parameters with some plausible values. 2. E-step: compute the conditional expectations of the latent variables, given the current parameter estimates. 3. M-step: update (β, σ) by weighted linear regression on the augmented data. 4. Repeat steps 2 and 3 until convergence. Table 3 gives a summary of the E and M steps for different Bayesian regularization methods. In some cases, the hyperparameters also need to be updated in the E-step, which can easily be done by plugging in their conditional posterior expectations. Other variants of EM algorithms have been proposed by Figueiredo [77] and Xu [78].
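As an illustration of these steps for the lasso prior with σ² fixed at 1, the E-step reduces to computing the conditional expectation E[1/τ²_j] = λ/|β_j|, and the M-step is a ridge-type weighted regression; a fixed point satisfies the lasso stationarity condition. This is a hedged sketch in the spirit of Figueiredo [77], not the exact algorithm of [76].

```python
import numpy as np

def lasso_em_map(X, y, lam=1.0, n_iter=500, tol=1e-8):
    """EM sketch for the MAP estimate under a Laplace (lasso) prior, sigma^2 = 1.

    Treats the mixing scales tau_j^2 as missing data: the E-step computes
    w_j = E[1/tau2_j] = lam / |beta_j|, and the M-step solves the weighted
    ridge system (X'X + diag(w)) beta = X'y on the augmented data.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # initialize via least squares
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        # E-step: expected inverse scales (floored to avoid division by zero)
        w = lam / np.maximum(np.abs(beta), 1e-10)
        # M-step: weighted ridge regression update
        beta_new = np.linalg.solve(XtX + np.diag(w), Xty)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

At convergence the update satisfies X'Xβ + λ·β/|β| = X'y, i.e. the subgradient condition of the lasso away from zero, which is why this EM scheme drives small coefficients toward (but never exactly to) zero.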
Extension to GLM: A generalized linear model (GLM) consists of three components: the linear predictor, the link function, and the distribution of the outcome variable [6,79]. The linear predictor can be expressed as η = Xβ. The standard IRLS algorithm proceeds by approximating the generalized linear model by a weighted normal linear regression, estimating the maximum likelihood estimates of the parameters (β, φ) by weighted least squares, and then iterating the process [6]. At each iteration, the algorithm calculates pseudo-data z_i and pseudo-variances from the current estimates of (β, φ), and the updated estimates are obtained by weighted least squares on the pseudo-data. Thus, all the methods described above can easily be extended to generalized linear models by approximating the GLM likelihood with a normal likelihood [80,81].
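For instance, for logistic regression the pseudo-data are z_i = η_i + (y_i − μ_i)/(μ_i(1 − μ_i)), with iteration weights w_i = μ_i(1 − μ_i). The sketch below is our own minimal IRLS for the canonical logit link (no dispersion parameter), illustrating the weighted normal approximation that the Bayesian extensions build on.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """IRLS for logistic regression: each step fits a weighted normal
    linear model to the pseudo-data z, then iterates to convergence."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))       # fitted probabilities
        w = np.maximum(mu * (1.0 - mu), 1e-10)  # iteration weights
        z = eta + (y - mu) / w                # pseudo-data (working response)
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)  # weighted least squares
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

Replacing the weighted least squares step with any of the penalized updates above yields the corresponding regularized GLM estimate.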
Related Bayesian shrinkage methods: Bayesian regularization methods are not merely the Bayesian versions of different regularization methods; in fact, they fall into the broad class of Bayesian shrinkage methods. Apart from the methods discussed above, many other Bayesian shrinkage methods are available in the literature, including the normal/exponential-gamma model of Griffin and Brown [82]; the normal/gamma and normal/inverse-gamma models [70,83]; the horseshoe prior of Carvalho et al. [84]; the generalized double-Pareto model of Armagan et al. [85]; the orthant normal prior of Hans [86]; the mixture of uniforms prior of Knürr et al. [87]; the Laplace-Poisson model of Chen et al. [88]; and the hypergeometric inverted-beta model of Polson and Scott [89]. For the literature on EM algorithms in Bayesian shrinkage methods, refer to Gelman et al. [6], Green [90], and Polson and Scott [89].

Bayesian model selection using nonlocal priors
Regularization methods identify only one model, the one that maximizes a penalized likelihood function or minimizes the RSS subject to a penalty. On the other hand, most conventional Bayesian methods based on local prior densities [61][62][63][64] provide estimates of posterior model probabilities that are poor and unrealistic; therefore, posterior model probability estimates are often left unreported for high dimensional linear models [67]. To overcome this deficiency, Johnson and Rossell [67] recently proposed Bayesian model selection methods that specify nonlocal prior densities on the regression coefficients, which provide accurate estimates of the posterior probability that each identified model is correct. Unlike local prior densities, which are positive at the null parameter value, nonlocal prior densities are identically zero whenever a model parameter equals its null value [91]. Johnson and Rossell [67] introduced two classes of nonlocal priors on the coefficients, along with frequentist analogues for both densities, which we describe below.
Product moment (pMOM) densities: The first class of nonlocal densities, the product moment (pMOM) densities, multiplies a multivariate normal kernel by the even moments Π_j β_j^{2r}, so that the density vanishes whenever any coefficient equals zero. The associated objective function takes the form of a penalized likelihood (Equation 31), where L is the likelihood function and d > 0 is a constant. Similar to BIC, the model with the minimum value of the objective function (Equation 31) should be selected as the best model.
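To see the defining property in a simple case, the first-order (r = 1) univariate pMOM density is the normal kernel multiplied by β²/(τσ²); since E[β²] = τσ² under the normal kernel, this factor leaves the density properly normalized while forcing it to zero at β = 0. The function below is our own illustrative sketch, not Johnson and Rossell's code.

```python
import numpy as np

def pmom_density(beta, tau=1.0, sigma2=1.0):
    """First-order (r = 1) univariate pMOM density:
    beta^2 / (tau * sigma2) times the N(0, tau * sigma2) density.
    Integrates to 1 and vanishes exactly at beta = 0 (a nonlocal prior)."""
    v = tau * sigma2
    normal = np.exp(-beta**2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)
    return (beta**2 / v) * normal
```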

Product inverse moment (piMOM) densities:
The second class of nonlocal densities, the product inverse moment (piMOM) densities, assigns each coefficient a density proportional to |β_j|^{−(r+1)} exp(−τσ²/β_j²); the scale parameter τ should be estimated carefully for efficient computation [67]. The objective function associated with this prior can again be expressed as a penalized likelihood (Equation 33), where L is the likelihood function and d > 0. As above, the model with the minimum value of the objective function (Equation 33) should be selected as the best model. Based on these prior densities, Johnson and Rossell [67] derived the marginal densities in analytical form, leading to analytical posterior probabilities. Using these expressions, they proposed an MCMC scheme to obtain posterior samples. They demonstrated that the proposed method is model selection consistent (the posterior probability of selecting the 'true' model approaches 1) as n → ∞, as long as p ≤ n, under certain regularity conditions on the design matrix X. The proposed method also performed impressively when compared with Bayesian methods based on local prior specifications [61][62][63][64]. Moreover, it performs as well as, or better than, various regularization methods, as evident from their simulation studies. Although they provided frequentist versions of the nonlocal prior specifications, Johnson and Rossell [67] recommended the Bayesian methods, as they facilitate inference on the posterior probability that each model is true, along with easy computation.
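Analogously, the univariate piMOM density has a closed form with normalizing constant (τσ²)^{r/2}/Γ(r/2): it keeps heavy, inverse-moment tails, while the exp(−τσ²/β²) factor drives the density to zero at the origin. Again a hedged illustrative sketch of ours, not the authors' code.

```python
import numpy as np
from math import gamma as gamma_fn

def pimom_density(beta, tau=1.0, sigma2=1.0, r=1):
    """Univariate piMOM density:
    (tau*sigma2)^{r/2} / Gamma(r/2) * |beta|^{-(r+1)} * exp(-tau*sigma2 / beta^2).
    Heavy tails, but identically zero at beta = 0 (a nonlocal prior)."""
    v = tau * sigma2
    const = v ** (r / 2.0) / gamma_fn(r / 2.0)
    return const * np.abs(beta) ** (-(r + 1)) * np.exp(-v / beta**2)
```

The contrast with the pMOM form is in the tails: piMOM decays polynomially (like an inverse moment), which Johnson and Rossell [67] exploit to retain power against large effects while still penalizing coefficients near zero.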

Bayesian Subset Regression (BSR)
Another recently developed method, which is particularly interesting to us and within the scope of the current review, is the Bayesian Subset Regression (BSR) method proposed by Liang et al. [92] for high dimensional generalized linear models. They propose a new prior specification that results in a Bayesian subset regression whose negative log-posterior distribution approximately reduces to EBIC when the sample size is large. In addition, they propose a variable screening procedure based on the marginal inclusion probability, which is shown to share the sure screening and consistency properties of the SIS procedure of Fan and Song [93]; moreover, both SIS [93,94] and its iterative extension ISIS [95] are outperformed by the proposed method in terms of prediction accuracy. The proposed method also outperforms several popular regularization methods, including the lasso and elastic net, suggesting that BSR is more suitable than regularization methods in high dimensional models. Here, we describe the method for high dimensional linear models.
Prior specification: To formulate the prior specification of their method, let us denote the number of explanatory variables by P_n. Following Liang et al. [92], the prior assigns each candidate subset model a finite log-prior probability when the model satisfies a prescribed size constraint, and a log-prior probability of −∞ otherwise. With this choice, the negative log-posterior approximately reduces to the EBIC. This leads to Bayesian subset regression, with the MAP model approximately equivalent to the minimum-EBIC model. For the simulation of the posterior, they propose an adaptive MCMC algorithm. With extensive numerical studies, they establish that BSR significantly outperforms penalized likelihood approaches, especially as P_n increases. Under mild conditions, they establish the consistency of the resulting posterior; in addition, they show that the posterior probability of the true model converges to 1 as n → ∞.
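Since the MAP model under this prior approximately minimizes EBIC, it is useful to recall how EBIC scores a candidate subset in the linear model. The helper below is our own sketch using the Chen-Chen form with parameter γ; the function name and the default γ = 0.5 are illustrative choices.

```python
import numpy as np
from math import lgamma, log

def ebic_linear(X, y, subset, gamma=0.5):
    """EBIC of an OLS fit on the columns in `subset`:
    n*log(RSS/n) + k*log(n) + 2*gamma*log(binom(P, k))  (Chen and Chen, 2008).
    Smaller is better; the extra log-binomial term penalizes model-space size."""
    n, P = X.shape
    k = len(subset)
    Xs = X[:, list(subset)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = float(np.sum((y - Xs @ beta) ** 2))
    log_binom = lgamma(P + 1) - lgamma(k + 1) - lgamma(P - k + 1)
    return n * log(rss / n) + k * log(n) + 2.0 * gamma * log_binom
```

Scanning candidate subsets with this score mimics, in miniature, what the adaptive MCMC of Liang et al. [92] does over the full model space.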

Asymptotic Behaviors of High Dimensional Bayesian Methods
In this section, we consider the asymptotic behaviors of Bayesian variable selection methods in high dimensional linear models. Here also we denote the number of explanatory variables by P_n, which is possibly much larger than the sample size n, and we assume that the regression coefficients satisfy the sparseness condition Σ_j |β_j⁰| < ∞, where β_j⁰ is the j-th 'true' regression coefficient. The sparseness condition describes a general situation in which all explanatory variables are relevant, but most of them have very small effects [96]. Several authors have studied the asymptotic properties of Bayesian variable selection methods under different regularity conditions [68,96,97,98,99]. Wang et al. [96], Jiang [97], and Jiang [98] used a framework developed by Wasserman [100] for showing consistency in the context of density estimation ('density consistency'): given a proper prior used to propose joint densities f(x, y) of the response y and the vector of explanatory variables x, these three works showed that the posterior-proposed joint densities are close to the true joint density for large n. Jiang [98] defined a proper prior for different sets of explanatory variables to prove that the posterior-proposed densities under different parameterizations are consistent to the true density. Wang et al. [96] further proved 'regression function consistency', which ensures good performance of high dimensional Bayesian variable selection methods, especially when some regression coefficients are bounded away from zero while the rest are exactly zero. Jiang [97] also studied the convergence rates of the fitted densities for generalized linear models and established consistency under some realistic assumptions.
In summary, the asymptotic results reveal that, with appropriate prior specification, high dimensional Bayesian variable selection methods not only can identify the 'true' model with probability tending to 1, but can also give consistent parameter estimates [96], facilitating applications in a variety of fields. For other asymptotic results, refer to Casella et al. [19] or Moreno et al. [101].
On the other hand, unlike frequentist methods, the asymptotic behaviors of Bayesian regularization or shrinkage methods are less studied and poorly understood [68]. Armagan et al. [99] provided sufficient conditions on prior concentration for strong posterior consistency when P_n = o(n) as n → ∞, for various Bayesian shrinkage methods, including the Bayesian lasso [71], the generalized double Pareto [85], and the horseshoe estimator [84]. Posterior consistency involves examining the posterior probability of a set of parameter values as n → ∞, where the set is any neighborhood of the true parameter value [102]. Posterior consistency is mainly verified by checking the sufficient conditions of a general posterior consistency theorem, after defining a suitable topology on the parameter and the neighborhood of the true value of the parameter [102]. Bhattacharya et al. [92] studied the prior concentrations of various Bayesian shrinkage priors to investigate whether the entire posterior distributions of these methods concentrate at the optimal rate, i.e. whether the posterior probability assigned to a shrinking neighborhood (proportional to the optimal rate) of the true value of the parameter converges to 1. They argued that most Bayesian shrinkage methods are sub-optimal in terms of posterior contraction, which is a stronger optimality condition than posterior consistency alone. Due to this lack of theoretical justification, Bayesian shrinkage methods are not easily adopted by subjective Bayesians and frequentists [68]. Therefore, much work remains to be done to better understand the asymptotic behaviors of these approaches in high dimensional linear models.

Discussion
Over the years, Bayesian methods have evolved immensely with the growth of modern computing power. These approaches, based on modern statistical techniques, are powerful tools for handling the challenges of high dimensional data analysis. This paper attempts to provide a selective overview of these methods for high dimensional linear models. Our discussion is limited to linear models; we have not discussed nonparametric Bayesian methods [103][104][105][106][107], various machine learning techniques [66,95,108], or several other approaches, which have their own advantages and are beyond the scope of the current review. We particularly focus on the Bayesian regularization methods, which have enjoyed a great deal of applicability in recent years, and we have summarized the model-fitting algorithms for these methods in high dimensional linear models. For a more comprehensive review of Bayesian formulations of various regularization methods as hierarchical models, we refer the readers to Kyung et al. [48]. We have also described two new Bayesian model selection methods that have shown excellent performance in high dimensional model selection, and we have addressed the asymptotic behaviors of high dimensional Bayesian methods for linear models under certain regularity conditions. This paper should be particularly helpful in understanding the basic methodology used in high dimensional Bayesian methods. It is to be noted that Bayesian variable or model selection is a much broader topic than what we have described here; in practice it includes several data preprocessing steps, such as variable transformations, coding of variables, and removal of outliers. Therefore, in real-life applications, the general framework of Bayesian variable and model selection [109] should be applied to these methods to ensure accurate results and easy implementation.