Reflecting about Selecting Noninformative Priors

Following the critical review of Seaman et al. (2012), we reflect on what is presumably the most essential aspect of Bayesian statistics, namely the selection of a prior density. In some cases, Bayesian inference remains fairly stable under a large range of noninformative prior distributions. However, as discussed by Seaman et al. (2012), the choice of a noninformative prior may also have unintended consequences, and these authors consider that this problem is ignored in Bayesian studies. As they base their argument on four examples, we reassess these examples and their Bayesian processing under different prior choices. Our conclusion is to lower the degree of worry about the impact of the prior, since we exhibit an overall stability of the posterior distributions. We thus consider that the warnings of Seaman et al. (2012), while commendable, do not jeopardize the use of most noninformative priors.


Introduction
The choice of a particular prior for a model can be thought of as a science in itself, or even as an art. When we want to use Bayesian modeling but fail to gather useful information about the prior distribution, the solution is to resort to a class of distributions called noninformative priors. Such a choice may be purely mathematical and used as such; even when the posterior distribution is proper, and hence a correct density function, the choice remains open to criticism. In particular, and this will be the focus of this note, Seaman et al. [1] claimed that using a particular noninformative distribution is a problem in itself, often ignored by users of these priors. The argument goes as follows: "if parameters with diffuse proper priors are subsequently transformed, the resulting induced priors can, of course, be far from diffuse, possibly resulting in unintended influence on the posterior of the transformed parameters". Also, "applications typically employ Markov chain Monte Carlo (MCMC) methods to obtain posterior features, resulting in the need for proper priors, even when the modeler prefers that priors be relatively noninformative", a statement which confuses proper priors with proper posteriors and is used to restrict the focus solely (and inappropriately in our opinion) to proper priors.
More precisely, Seaman et al. [1] investigated side effects of some particular prior choices through examples. This note aims at reexamining this investigation and at giving a brief discussion of these topics in the following sections. First, note that a prior is considered as informative by Seaman et al. [1] "to the degree it renders some values of the quantity of interest more likely than others", and with this definition, when comparing two priors, the more informative prior is deemed preferable. In contrast to this definition, we stress that an informative prior expresses specific, definite information about the parameter, providing quantitative information that is crucial to the estimation of a model. As pointed out by Robert [2], if there is information about the parameters, the prior distributions need to incorporate this information. However, in most practical cases, the parameter has no reality of its own but rather corresponds to a parameterization of the law describing the random phenomenon observed therein. The prior is a tool employed to summarize the information available on this phenomenon, as well as the uncertainty within the Bayesian structure. There are many discussions of how insight and guidance into appropriate choices between prior distributions might be obtained.
In this respect, robustness considerations also have an interesting role to play [3,4]. This point of view will be apparent in this paper through our Bayesian processing of a logistic model under three different noninformative priors. Bayesian robustness modeling provides a flexible approach to resolving problems and conflicts between the data and the prior distributions [5]. We can also model uncertainty in the prior by specifying a class of possible prior distributions on the parameters [6,7].
For the examples processed in Seaman et al. [1], we exhibit stability in the posterior distributions through various noninformative priors. We first provide a brief review of noninformative priors in Section 2. In Section 3, we run a Bayesian analysis of a logistic model [1] by choosing the normal distribution N(0, σ²) as the prior on the regression coefficients. We then compare it with a g-prior, as well as with flat and Jeffreys' priors, and conclude that our results are stable. The next sections cover the second to fourth examples of Seaman et al. [1], which model covariance matrices, a treatment effect in biomedical studies, and a multinomial distribution index. When modeling covariance matrices, we compare two default priors for the standard deviations of the model coefficients. In the multinomial setting, we discuss the hyperparameters of a Dirichlet prior. Finally, we conclude with the argument that the use of noninformative priors is reasonable within a fair range and that they provide efficient Bayesian estimations when the information about the parameter is vague or very poor.

Noninformative Priors
As mentioned above, if prior information is not available and if we stick to Bayesian modeling, we need to resort to the so-called noninformative priors. Since we want a prior with minimal impact on the final inference, we define a noninformative prior as a statistical distribution that expresses vague or general information about the parameter in which we are interested. In constructive terms, historically, the first rule for determining a noninformative prior is the principle of indifference, which uses uniform distributions that assign equal probabilities to all possibilities [8]. This distribution, however, is not invariant under reparametrization, and invariant noninformative priors were later defined [2,9]. If the problem does not have an invariance structure, Jeffreys' priors, then reference priors, exploit the structure of the problem under study in a more formalized way. Other methods are available, like the little-known data-translated likelihood of Box et al. [10], maxent priors and probability matching priors [11]. Bernardo et al. [12] regard the noninformative prior as a mathematical tool, introduced as a category of priors that minimize the impact of the prior selection on inference: "Put bluntly, data cannot ever speak entirely for themselves, every prior specification has some informative posterior or predictive implications and vague is itself much too vague an idea to be useful. There is no 'objective' prior that represents ignorance". It is obvious that prior distributions can never be quantified or elicited exactly, especially when there is no information on those parameters. So, the concept of a true prior is meaningless and the quantification of prior beliefs is done with uncertainty.
As Berger et al. [7] have noted, noninformative priors have the advantage that they can be considered to provide robust solutions to problems, and "the user of these priors should be concerned with robustness with respect to the class of reasonable noninformative priors".
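The lack of invariance of the uniform prior under reparametrization, mentioned above, can be checked numerically. The following is a minimal sketch, assuming a uniform prior on a probability p (an illustrative choice, not one of the examples in this paper): on the log-odds scale the induced prior is a logistic distribution, which is far from flat.

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform ("indifference") prior on a probability p (illustrative assumption)
p = rng.uniform(size=200_000)

# The same prior expressed on the log-odds scale
log_odds = np.log(p) - np.log1p(-p)

# Change of variables: the induced density is logistic, f(t) = e^t / (1 + e^t)^2,
# hence P(|log-odds| < 1) = tanh(1/2), whereas a flat density on the real line
# would put negligible mass on any bounded interval.
share_within_one = float(np.mean(np.abs(log_odds) < 1.0))
theoretical_share = float(np.tanh(0.5))

print(f"simulated P(|log-odds| < 1): {share_within_one:.3f}")
print(f"logistic  P(|log-odds| < 1): {theoretical_share:.3f}")
```

Almost half of the prior mass sits on log-odds between -1 and 1, so "uniform" on one scale is clearly informative on another.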

Example 1: Bayesian Analysis of the Logistic Model
The first example in Seaman et al. [1] is a simple logistic regression, with the probability ρ(x) of coronary heart disease depending on the age x through

$$\rho(x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}. \qquad (1)$$

First we review the original analysis of Seaman et al. [1] and then run our own analysis, selecting a normal prior as well as the g-prior, the flat prior and Jeffreys' prior.

The original analysis
For both parameters of model (1), Seaman et al. [1] chose a normal prior N(0, σ²). The first surprising feature of this choice is to see an identical prior on both the intercept and the slope coefficient, instead of, e.g., a g-prior (discussed in the following) that would rescale the coefficients according to the variation of the corresponding covariate. Since x corresponds to age, the second term βx in the regression varies 50 times more than the intercept. When plotting the resulting logistic cdf across a few thousand simulations from the prior, the cumulative functions mostly end up as constant functions with values 0 or 1. This is obviously not particularly realistic since the predicted phenomenon is the occurrence of coronary heart disease. The prior is thus using the wrong scale: the simulated cdfs should have a reasonable behavior over the range (20, 100) of the covariate x. For instance, the prior should focus on a log-odds ratio of -5 at age 20 and of +5 at age 100, leading to the comparison pictured in Figure 1 (left versus right). Furthermore, allowing the coefficient of x to be negative ignores a basic feature of the model, and this explains the later self-criticism in Seaman et al. [1] that the prior probability that the ED50 is negative is 0.5. Using a flat prior instead would answer the authors' criticisms about the prior behavior, as we now demonstrate. We stress that Seaman et al. [1] advance no explanation for the choice of the prior variance σ² = 25², other than the absence of information about the model parameters. This is a completely arbitrary choice of prior, which does have a considerable impact on the inference that follows, as shown in Figure 1 (left). Seaman et al. [1] further criticized the chosen prior by comparing both the posterior mode and the posterior mean derived from the normal prior assumption with the MLE.
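The scale issue described in this paragraph can be reproduced with a few lines of simulation. The sketch below draws (α, β) from independent N(0, 25²) priors and counts how many logistic curves are essentially constant at 0 or 1 over ages 20 to 100. The second prior in the code is a hypothetical rescaling of our own for illustration (normal priors centred at -5 and +5 on the log-odds at ages 20 and 100, with unit standard deviation); it is not claimed to be the prior behind Figure 1, and the 0.01 threshold is likewise an arbitrary choice.

```python
import numpy as np

def saturated_fraction(alpha, beta, ages, eps=0.01):
    """Share of prior draws whose logistic curve stays below eps
    or above 1 - eps at every age in `ages`."""
    logits = alpha[:, None] + beta[:, None] * ages[None, :]
    # Work on the logit scale to avoid overflow in exp()
    cut = np.log((1.0 - eps) / eps)
    all_high = (logits > cut).all(axis=1)
    all_low = (logits < -cut).all(axis=1)
    return float(np.mean(all_high | all_low))

rng = np.random.default_rng(0)
ages = np.linspace(20.0, 100.0, 81)
n_draws = 5000

# Independent N(0, 25^2) priors on intercept and slope
alpha = rng.normal(0.0, 25.0, n_draws)
beta = rng.normal(0.0, 25.0, n_draws)
frac_diffuse = saturated_fraction(alpha, beta, ages)

# Hypothetical rescaled prior (illustrative assumption): log-odds near -5
# at age 20 and near +5 at age 100, then solved for (alpha, beta)
logit20 = rng.normal(-5.0, 1.0, n_draws)
logit100 = rng.normal(5.0, 1.0, n_draws)
beta_r = (logit100 - logit20) / 80.0
alpha_r = logit20 - 20.0 * beta_r
frac_rescaled = saturated_fraction(alpha_r, beta_r, ages)

print(f"constant curves, N(0, 25^2) prior: {frac_diffuse:.3f}")
print(f"constant curves, rescaled prior:   {frac_rescaled:.3f}")
```

Under the N(0, 25²) prior nearly all simulated curves are flat at 0 or 1, whereas a prior set on the scale of the covariate yields curves that actually move across the age range.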
If the MLE is the gold standard in this comparison, then one may wonder about the relevance of a Bayesian analysis! We recall that, when the sample size N gets large, many simple Bayesian analyses based on noninformative prior distributions give results similar to standard non-Bayesian approaches [13]. From a Bayesian data analysis perspective, we can often interpret classical point estimates as exact or approximate posterior summaries based on some implicit full probability model. For example, Lopes et al. [3] have shown that a Bayesian posterior mean under a conjugate prior and the frequentist MLE are asymptotically equivalent for exponential families. Therefore, as the sample size increases, the influence of the prior on posterior inferences decreases and, when N tends to infinity, most priors lead to the same inference. However, for smaller sample sizes, it is inappropriate to summarize inference about the parameter by a single value like the mode or the mean, especially when the posterior distribution of the parameter is more variable or even asymmetric. The data set used here to infer on (α, β) is the Swiss banknote benchmark (available in R). The response variable y indicates the status of the banknote, genuine or counterfeit. The explanatory variable is the bill length.
This data set yields the maximum likelihood estimates α̂ = 233.26 and β̂ = -1.09. To check the impact of the normal prior variance, we used a random walk Metropolis-Hastings algorithm as in Marin et al. [14] and derived the estimators reproduced in Table 1. We can spot definite changes in the results that are caused by changes in the coefficient σ, hence concluding that the posterior is clearly sensitive to the choice of the hyperparameter σ (Figure 2).
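To make the sensitivity check concrete, here is a minimal sketch of a random walk Metropolis-Hastings sampler for model (1) under independent N(0, σ²) priors. It is not the implementation of Marin et al. [14]: the data are synthetic (a standardized covariate with assumed true values α = 0.5 and β = -1, not the Swiss banknote data), and the step size, chain length and values of σ are illustrative assumptions.

```python
import numpy as np

def log_post(theta, x, y, sigma):
    """Log-posterior of (alpha, beta) under independent N(0, sigma^2) priors."""
    eta = theta[0] + theta[1] * x
    # Bernoulli log-likelihood, written with logaddexp for numerical stability
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    logprior = -0.5 * np.sum(theta ** 2) / sigma ** 2
    return loglik + logprior

def rw_metropolis(x, y, sigma, n_iter=5000, step=0.15, seed=1):
    """Random walk Metropolis-Hastings with a spherical normal proposal."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    current = log_post(theta, x, y, sigma)
    chain = np.empty((n_iter, 2))
    for t in range(n_iter):
        proposal = theta + step * rng.normal(size=2)
        candidate = log_post(proposal, x, y, sigma)
        # Accept with probability min(1, posterior ratio)
        if np.log(rng.uniform()) < candidate - current:
            theta, current = proposal, candidate
        chain[t] = theta
    return chain

# Synthetic data (assumption): 200 observations, true (alpha, beta) = (0.5, -1)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 - 1.0 * x))))

# Posterior means after discarding 1000 burn-in iterations
posterior_means = {
    sigma: rw_metropolis(x, y, sigma)[1000:].mean(axis=0)
    for sigma in (0.1, 1.0, 10.0)
}
for sigma, mean in posterior_means.items():
    print(f"sigma = {sigma:5.1f}: alpha = {mean[0]:.2f}, beta = {mean[1]:.2f}")
```

A tight prior (σ = 0.1) pulls both coefficients towards zero, while larger values of σ leave the estimates close to what the likelihood alone supports, which is the kind of dependence on σ reported in Table 1.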

Larger classes of priors
Since normal priors are far from robust [7], we can limit variations in the posteriors by using the g-priors of [15], where the prior variance-covariance matrix is a scalar multiple of the inverse of the information matrix for the linear regression. This coefficient g plays a decisive role in the analysis; however, large values of g imply a more diffuse prior and, as shown e.g. in Marin et al. [14], if the value of g is large enough, the Bayes estimate stabilizes. We select g equal to the sample size, 200, following Liang et al. [16], as it means that the amount of information about the parameter carried by the prior is equal to the amount of information contained in one single observation. Our second proposed prior is the flat prior π(α, β) = 1. Jeffreys' prior constitutes our third prior, as in Marin et al. [14]. In the logistic case, Fisher's information matrix is

$$I(\alpha, \beta, X) = X^{T} W X,$$

where X = {x_ir} is the design matrix, W = diag{m_i π_i (1 - π_i)}, π_i = exp(α + βx_i)/(1 + exp(α + βx_i)) is the success probability for the ith count and m_i is the binomial index for the ith count [17]. This leads to Jeffreys' prior {det(I(α, β, X))}^{1/2}, which, for a design matrix made of an intercept column and the covariate x, is proportional to

$$\Big[\sum_i w_i \sum_i w_i x_i^2 - \Big(\sum_i w_i x_i\Big)^2\Big]^{1/2}, \qquad w_i = m_i \pi_i (1 - \pi_i).$$

This is a nonstandard distribution on (α, β) but it can easily be approximated by a Metropolis-Hastings algorithm whose proposal is the normal Fisher approximation of the likelihood, as in Marin et al. [14]. All point estimates in Table 2 are averages of posterior samples of 10^4 simulations.
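As a sketch of how the Jeffreys prior above can be evaluated in practice, the code below computes its logarithm from the Fisher information X^T W X, once through a generic log-determinant and once through the 2 × 2 closed form. The function names and the synthetic covariate are our own illustrative assumptions, with m_i = 1 by default.

```python
import numpy as np

def logistic_weights(alpha, beta, x, m=1.0):
    """Diagonal of W: m_i * pi_i * (1 - pi_i)."""
    pi = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))
    return m * pi * (1.0 - pi)

def jeffreys_log_prior(alpha, beta, x, m=1.0):
    """0.5 * log det(X^T W X) for a design with an intercept and one covariate."""
    X = np.column_stack([np.ones_like(x), x])
    w = logistic_weights(alpha, beta, x, m)
    info = X.T @ (w[:, None] * X)
    _, logdet = np.linalg.slogdet(info)
    return 0.5 * logdet

def jeffreys_log_prior_closed_form(alpha, beta, x, m=1.0):
    """Same quantity from the explicit 2 x 2 determinant."""
    w = logistic_weights(alpha, beta, x, m)
    det = w.sum() * (w * x ** 2).sum() - (w * x).sum() ** 2
    return 0.5 * np.log(det)

# Synthetic standardized covariate (assumption, not the banknote data)
rng = np.random.default_rng(0)
x = rng.normal(size=50)

generic = jeffreys_log_prior(0.3, -1.0, x)
closed = jeffreys_log_prior_closed_form(0.3, -1.0, x)
print(generic, closed)

# The prior density decreases when the logistic curve saturates (large |beta|)
at_origin = jeffreys_log_prior(0.0, 0.0, x)
far_out = jeffreys_log_prior(0.0, 5.0, x)
print(at_origin, far_out)
```

Such a function can serve as the log-prior term inside a Metropolis-Hastings sampler; the unnormalized log-posterior is then the log-likelihood plus this value.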

Range of estimates
Bayesian estimates of the regression coefficients associated with the three noninformative priors above are summarized in Table 2. Those estimates vary only moderately from one choice to the next, as well as relative to the MLEs and to the results shown in Table 1 when σ = 900. Figure 3 is even more conclusive in this respect. There is no significant difference between them and we conclude that Bayesian inferences are stable under these different prior choices.

Example 2: Modeling Covariance Matrices
The second choice of prior criticized by Seaman et al. [1] was proposed by Barnard et al. [18] for the modeling of covariance matrices. However, the paper falls short of demonstrating a clear impact of this prior modeling on posterior inference. Furthermore, the solution of using another proper prior resulting in a "wider" dispersion requires prior knowledge of how wide is wide enough. We thus review here the regression model under evaluation and then run Bayesian analyses considering both prior beliefs specified by Seaman et al. and Barnard et al. [1,18].

Setting
The multivariate regression model of interest consists of m normal linear regressions of Y_j on X_j, j = 1, …, m, where Y_j is a vector of n_j dependent variables, X_j is an n_j × k matrix of covariate variables, and β_j is a k-dimensional parameter vector. The covariance matrix Σ of the β_j determines how strongly the posterior of each individual β_j is shrunk towards a common target. This matrix is decomposed as Σ = diag(S) R diag(S), where diag(S) is the diagonal matrix with diagonal elements S and R is a k × k correlation matrix. Note that S = (S_1, …, S_k) is the k × 1 vector of standard deviations of the β_j's. Barnard et al. [18] propose lognormal distributions as priors on the S_j, while the correlation matrix could have either (1) a joint uniform prior, which means p(R) ∝ 1, or (2) a marginal prior obtained from the inverse-Wishart distribution for Σ, which means that p(R) is derived by integrating S_1, …, S_k out of a standard inverse-Wishart distribution. In the second case, all the marginal densities of r_ij are uniform when i ≠ j (see Barnard et al. [18]). Seaman et al. [1] chose a different prior structure, with a prior on the correlations and a lognormal prior with means 1, -1 and standard deviations 1, 0.5 on the standard deviations of the intercept and slope, respectively. Simulating from this prior, they found a high concentration near zero. They then concluded that the lognormal distribution should be replaced by a gamma distribution G(4, 1), as it implied a more diffuse prior. The main question here is whether the fact that the induced prior is more diffuse should make us prefer the gamma to the lognormal as a prior for S_j, as discussed below.
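The separation Σ = diag(S) R diag(S) is easy to simulate from. The sketch below is an illustration under assumptions, not the authors' code: it takes k = 2 (intercept and slope), reads the lognormal "means 1, -1 and standard deviations 1, 0.5" as log-scale parameters, and uses a uniform correlation on (-1, 1), which is what a jointly uniform prior on R reduces to for a 2 × 2 matrix.

```python
import numpy as np

def covariance_from_sd_and_corr(s, r):
    """Return Sigma = diag(S) R diag(S)."""
    d = np.diag(s)
    return d @ r @ d

rng = np.random.default_rng(0)
n_draws = 10_000

# Lognormal priors on the standard deviations (log-scale parameters assumed)
s_intercept = rng.lognormal(mean=1.0, sigma=1.0, size=n_draws)
s_slope = rng.lognormal(mean=-1.0, sigma=0.5, size=n_draws)
# For k = 2, a uniform prior on R is a uniform prior on the single correlation
rho = rng.uniform(-1.0, 1.0, size=n_draws)

sigmas = np.empty((n_draws, 2, 2))
for i in range(n_draws):
    r = np.array([[1.0, rho[i]], [rho[i], 1.0]])
    s = np.array([s_intercept[i], s_slope[i]])
    sigmas[i] = covariance_from_sd_and_corr(s, r)

# Induced prior on the slope variance, Sigma[1, 1] = S_2^2
slope_var = sigmas[:, 1, 1]
median_slope_var = float(np.median(slope_var))
print(f"prior median of the slope variance: {median_slope_var:.3f}")
```

The induced prior on the slope variance has its median well below 1, which is the concentration near zero noted by Seaman et al. [1].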

Prior beliefs
First, the basic modeling intuition of Barnard et al. [18] is "that each regression is a particular instance of the same type of relationship". This means an exchangeable prior belief on the regression parameters. As an example, they suppose that the m regressions are similar models where each regression corresponds to a different firm in the same industry. Exploiting this assumption, when β_j has a normal prior with covariance matrix Σ, Barnard et al. [18] chose a prior concentrated close to zero for the standard deviation of the slope, so that the posteriors of this coefficient would be shrunk together across the regressions. Based on this basic idea and taking tight priors on Σ for β_j, j = 1, …, m, they investigated the shrinkage of the posterior on β_j as well as the degree of similarity of the slopes. Their analysis showed that a standard deviation prior that is more concentrated on small values results in substantial shrinkage in the coefficients relative to other prior choices. Consider for instance the variation between the choices of lognormal and gamma distributions as priors of S_2, the standard deviation of the regression slopes. Figure 4 compares the lognormal prior with log-scale mean -1 and standard deviation 0.5 and the gamma distribution G(4, 1). In this case, most of the mass of the lognormal prior is concentrated on values close to zero whereas the gamma prior is more diffuse. The 10, 50, 90 percentiles of LN(-1, 0.5) and G(4, 1) are 0.19, 0.37, 0.7 and 1.74, 3.67, 6.68, respectively. Thus, choosing LN(-1, 0.5) as the prior of S_2 is equivalent to believing that the values of β_2 in the m regressions are much closer together than when we assume S_2 ~ G(4, 1).
To assess the difference between these two prior choices on S_2 and their impact on the degree of similarity of the regression coefficients, we resort to a simulated example. In short, our example is similar to that defined in Barnard et al. [18], except for different values of k = 1, the number of regression coefficients, m = 4, the number of normal regressions, and n_j = 36, the number of observations. The explanatory variables are simulated from the standard normal distribution. We also take τ_j ~ IG(3, 1). The prior for Σ is such that π(R) ∝ 1 and we run the analyses of Seaman et al. [1] under S_2 ~ LN(-1, 0.5) and S_2 ~ G(4, 1). Using 10^4 Gibbs sampling simulations, we produce the estimates and standard deviations of Tables 3 and 4, respectively. The difference between the regression estimates is quite limited from one prior to the next, while the estimates of the standard deviations vary much more. In the lognormal case, the posterior of S_i is concentrated on smaller values relative to the gamma prior. Figure 5 displays the posterior estimates of the regression intercept, slope and S_i, i = 1, 2, simulated from a Gibbs sampler based on 10^4 iterations. The impact of the prior choice is quite clear on the standard deviations. Since the posteriors of the intercepts and slopes for all four regressions are centered in (16.5, 17) and (-10, -9), respectively, we can conclude that Bayesian inferences on β_j are stable when selecting two different prior distributions on S_j.
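The percentiles quoted above for LN(-1, 0.5) and G(4, 1) can be recomputed with the standard library alone. This is a small sketch under two stated assumptions: LN(-1, 0.5) is parameterized by the mean and standard deviation of the logarithm, and G(4, 1) has shape 4 and rate 1, whose cdf has a closed form that is inverted here by bisection.

```python
import math
from statistics import NormalDist

def lognormal_quantile(p, mu, sigma):
    """Quantile of a lognormal with log-scale mean mu and standard deviation sigma."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))

def gamma4_cdf(x):
    """CDF of Gamma(shape=4, rate=1), closed form for an integer shape."""
    return 1.0 - math.exp(-x) * (1.0 + x + x ** 2 / 2.0 + x ** 3 / 6.0)

def gamma4_quantile(p, lo=0.0, hi=50.0, tol=1e-10):
    """Invert the Gamma(4, 1) cdf by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma4_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

levels = (0.10, 0.50, 0.90)
ln_q = [lognormal_quantile(p, -1.0, 0.5) for p in levels]
g_q = [gamma4_quantile(p) for p in levels]

print("LN(-1, 0.5):", [round(q, 2) for q in ln_q])
print("G(4, 1):    ", [round(q, 2) for q in g_q])
```

The 90th percentile of the lognormal prior lies below the 10th percentile of the gamma prior, which quantifies how differently the two priors constrain the spread of the slopes.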

Examples 3 and 4: Prior Choices for a Proportion and the Multinomial Coefficients
This section considers more briefly the third and fourth examples of Seaman et al. [1]. The third example relates to a treatment effect analyzed by Cowles [19] and the fourth one covers a standard multinomial setting.

Proportion of treatment effect captured
In Cowles [19], two models are compared for surrogate endpoints, using a link function g that either includes the surrogate marker or not. The quantity of interest is the proportion of treatment effect captured (PTE), which is defined as

$$\mathrm{PTE} = 1 - \frac{\beta_1}{\beta_{R,1}},$$

where β_1 and β_{R,1} are the coefficients of an indicator variable for treatment in the first and second regression models, respectively. Seaman et al. [1] restricted this proportion to the interval (0, 1) and, under this assumption, proposed to use a kind of beta distribution (a conditional beta distribution) on β_1, β_{R,1} so that the PTE stays within (0, 1). We find this example intriguing in that, even if the PTE could be turned into a meaningful quantity (given that it depends on parameters from different models), the criticism that it may take values outside (0, 1) is rather moot, since it suffices to impose a joint prior that ensures the ratio stays within (0, 1). This actually is the solution eventually proposed by the authors. If we have prior beliefs about the parameter space (which depends on β_1, β_{R,1} in this example), the prior specified on the quantity of interest should integrate these beliefs. In the current setting, there is seemingly no prior information about (β_1, β_{R,1}) and hence imposing a prior restriction to (0, 1) is not a logical specification. For instance, using normal priors on β_1 and β_{R,1} leads to a Cauchy prior on β_1/β_{R,1}, whose support is not limited to (0, 1). We will not discuss this example any further.
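For readers who wish to check the Cauchy remark numerically, here is a minimal sketch, assuming independent standard normal priors on β_1 and β_{R,1} (zero means and unit variances are our illustrative assumption). The ratio is then a standard Cauchy variable, whose quartiles are -1 and +1, and only about a quarter of the prior mass falls inside (0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000

# Independent N(0, 1) priors on both treatment coefficients (assumption)
beta_1 = rng.normal(size=n_draws)
beta_r1 = rng.normal(size=n_draws)
ratio = beta_1 / beta_r1

# Standard Cauchy: P(0 < ratio < 1) = arctan(1) / pi = 1/4, upper quartile = 1
share_in_unit_interval = float(np.mean((ratio > 0.0) & (ratio < 1.0)))
upper_quartile = float(np.quantile(ratio, 0.75))

print(f"P(0 < ratio < 1) = {share_in_unit_interval:.3f}")
print(f"upper quartile   = {upper_quartile:.3f}")
```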

Multinomial model and evenness index
The final example in Seaman et al. [1] and in this paper deals with a measure called the evenness index, H(θ), which is a function of a vector θ of proportions θ_i, i = 1, …, K. The authors assume that these are associated with a Dirichlet prior with parameters first equal to 1, then to 0.25. For the function H, the first prior concentrates on (0.5, 1) whereas the second does not. Since there is nothing special about the uniform, re-running the evaluation with a Jeffreys prior reduces this feature, which anyway is a characteristic of the prior distribution, not of the posterior distribution, which accounts for the data. The authors actually propose to use the Dir(1/4, 1/4, …, 1/4) prior, presumably on the basis that the induced prior on the evenness is then centered close to 0.5. Figure 6 shows the corresponding priors on H(θ): the concentration of the density of the evenness index on (0.5, 1) decreases as γ_i is reduced. For γ_i = 0.1, it is concentrated on (0, 0.7) while for γ_i = 0.1 most of the mass is on values between 0 and 0.5. To further the comparison, we generated datasets of sizes N = 50, 100, 250, 1000, 10,000. Figure 7 shows the posteriors associated with each of the four Dirichlet priors for these sample sizes, including their modes, which are all close to 0.4 when N = 10^4. Even for moderate sample sizes like 50, the induced posteriors are quite similar. The posterior means of H(θ) are reproduced in Table 5. When the sample size is 50, there is still some variation between the posterior means, but this difference decreases to zero as the sample size increases.
[Table 5: posterior means of H(θ) for the priors of Figure 6 and Jeffreys' prior on θ, for sample sizes 50, 100, 250, 1000, 10,000.]
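The behavior of the induced prior on the evenness index can be explored by simulation. The sketch below assumes that H(θ) is the Shannon entropy of θ divided by log K, a common definition of evenness that is not spelled out in the text above, and takes K = 8 categories.

```python
import numpy as np

def evenness(theta):
    """Assumed definition: Shannon entropy of each row divided by log(K)."""
    theta = np.clip(theta, 1e-300, 1.0)  # guard against log(0) when a draw underflows
    k = theta.shape[-1]
    return -(theta * np.log(theta)).sum(axis=-1) / np.log(k)

rng = np.random.default_rng(0)
K = 8

prior_means = {}
share_above_half = {}
for gamma in (1.0, 0.25):
    theta = rng.dirichlet(np.full(K, gamma), size=20_000)
    h = evenness(theta)
    prior_means[gamma] = float(h.mean())
    share_above_half[gamma] = float((h > 0.5).mean())
    print(f"Dir({gamma}): prior mean of H = {prior_means[gamma]:.3f}, "
          f"P(H > 0.5) = {share_above_half[gamma]:.3f}")
```

Under these assumptions, the uniform Dirichlet puts most of its mass on H above 0.5, while the Dir(1/4, …, 1/4) prior is centered much closer to 0.5.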
Since the Dirichlet distributions are conjugate priors, hence possibly lacking in robustness, we propose to set γ_i = 1/K (here K is equal to 8), which turns the Dirichlet distribution into a Jeffreys' prior. This noninformative prior works well and could minimize the influence of the prior input on the inferential output for small sample sizes [2]. Figure 8 reproduces the transform of Jeffreys' prior for the evenness index (left) and the induced posterior densities for N = 50, 100, 250, 1000, 10,000 (right). Once again, the posteriors concentrate around 0.4 even though Jeffreys' prior is more diffuse than the other proposed priors of Figure 6. The last two rows of Table 5 display the means and standard deviations of the simulated posterior distributions of H(θ) for Jeffreys' prior. The same stability occurs.
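Thanks to conjugacy, the posterior of θ under a Dir(γ, …, γ) prior and multinomial counts n is Dir(γ + n_1, …, γ + n_K), so the stability of the posterior on H(θ) can be checked by direct simulation. The sketch below relies on assumptions of ours: the vector theta_true is hypothetical, H is taken as the Shannon entropy divided by log K, and only three prior weights and three sample sizes are compared.

```python
import numpy as np

def evenness(theta):
    """Assumed definition: Shannon entropy of each row divided by log(K)."""
    theta = np.clip(theta, 1e-300, 1.0)
    return -(theta * np.log(theta)).sum(axis=-1) / np.log(theta.shape[-1])

rng = np.random.default_rng(0)
K = 8
# Hypothetical true proportions, chosen so that the evenness is near 0.4
theta_true = np.array([0.80, 0.10, 0.04, 0.02, 0.02, 0.01, 0.005, 0.005])
prior_weights = {"Dir(1)": 1.0, "Dir(1/4)": 0.25, "Dir(1/K)": 1.0 / K}

spread = {}
for n in (50, 1000, 10_000):
    counts = rng.multinomial(n, theta_true)
    post_means = {}
    for name, gamma in prior_weights.items():
        # Conjugate update: posterior is Dirichlet(gamma + counts)
        draws = rng.dirichlet(counts + gamma, size=4000)
        post_means[name] = float(evenness(draws).mean())
    # Largest gap between posterior means across the three priors
    spread[n] = max(post_means.values()) - min(post_means.values())
    print(n, {k: round(v, 3) for k, v in post_means.items()}, round(spread[n], 4))
```

The gap between the posterior means obtained under the different priors shrinks as N grows, in line with the pattern described for Table 5.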

Conclusion
In this note, we reassessed the examples of the critical review of Seaman et al. [1]. Our own Bayesian modeling was based on noninformative priors such as g-priors, flat and Jeffreys' priors, as well as weakly informative priors [13]. According to the outcomes produced therein, the use of noninformative distributions as priors results in stable posterior inferences and also gives reasonable Bayesian estimates of the parameters at hand. We thus consider the level of criticism found in the original paper rather superficial, as it relies either on a highly specific choice of a proper prior distribution or on ignoring basic prior information. The paper of Seaman et al. [1] concludes with recommendations for prior checks. The recommendations are mostly sensible, if only as an expression of the fact that some prior information is almost always available on some quantities of interest. Our only point of contention is the repeated and recommended reference to the MLE, since it implies assessing or building the prior from the data. The most specific (if related to the above) recommendation is to use conditional priors as exposed by Christensen et al. [20]. For instance, in the first (logistic) example, this meant putting a prior on the cdfs at age 40 and age 60. The authors picked a uniform in both cases, which sounds inconsistent with the presupposed shape of the probability function. In conclusion, we find there is nothing pathologically wrong with either the paper of Seaman et al. [1] or the use of "noninformative" priors! Looking at induced priors on more intuitive transforms of the original parameters is a commendable suggestion, provided some intuition or prior information is already available on those. Using a collection of priors, including reference or invariant priors, helps as well to build a feeling for the appropriate choice or range of priors, and looking at the induced dataset by simulating from the corresponding predictive cannot hurt.