
^{1}Department of Biostatistics, Johns Hopkins School of Public Health, USA

^{2}Department of Biostatistics and Computational Biology, Dana Farber Cancer Institute, Boston, USA

**Corresponding Author:** Giovanni Parmigiani

Department of Biostatistics and Computational Biology

Dana Farber Cancer Institute, 450 Brookline Avenue

Boston MA 02215, USA

**Tel:** (617) 632-5323

**Fax:** (617) 632-2444

**E-mail:** [email protected]

**Received date:** March 10, 2014; **Accepted date:** April 09, 2014; **Published date:** April 15, 2014

**Citation:** Liu D, Parmigiani G, Caffo B (2014) Screening for Differentially Expressed Genes: Are Multilevel Models Helpful? J Biomet Biostat 5:192. doi: 10.4172/2155-6180.1000192

**Copyright:** © 2014 Liu D, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Screening for changes in gene expression across biological conditions using high throughput technologies is now common in biology. In this paper we present a broad Bayesian multilevel framework for developing computationally fast shrinkage-based screening tools for this purpose. Our scheme makes it easy to adapt the choice of statistics to the goals of the analysis and to the genomic distributions of signal and noise. We empirically investigate the extent to which these shrinkage-based statistics improve performance, and the situations in which such improvements are larger. Our evaluation uses both extensive simulations and controlled biological experiments. The experimental data include a so-called spike-in experiment, in which the target biological signal is known, and a two-sample experiment, which illustrates the typical conditions in which the methods are applied. Our results emphasize two important practical concerns that are not receiving sufficient attention in applied work in this area. First, while shrinkage strategies based on multilevel models are able to improve selection performance, they require careful verification of the assumptions on the relationship between signal and noise. Incorrect specification of this relationship can negatively affect a selection procedure. Because this inter-gene relationship is generally identifiable in genomic experiments, we suggest a simple diagnostic plot to assist model checking. Second, no statistic performs optimally across two common categories of experimental goals: selecting genes with large changes, and selecting genes with reliably measured changes. Therefore, careful consideration of analysis goals is critical in the choice of the approach taken.

Many genomics investigations using expression arrays take the form of searching for genes whose expression level is different across experimental conditions or phenotypes. The list of gene transcripts produced by a microarray analysis is usually the starting point for extensive additional biological work, including independent validation, and both in-silico and laboratory work on sequences and proteins related to the transcripts selected. In this context, microarray experiments are screening, not testing, experiments. Because of the wide range of important questions that can be explored using these arrays, and the costs involved, comparisons across conditions are often made using a limited number of replications. Efficient use of data is critical in improving a laboratory’s ability to correctly identify important biological hypotheses and proceed to test them by appropriate further experimentation.

Specific screening goals vary with the study. Two simple but representative situations are the selection of genes that are changed by a large amount, and the selection of genes that are changed by a reliably measured amount. In either case, the comparison of gene expression across two conditions based on replicated experiments requires a tradeoff of signal, the variation of expression across the two conditions, versus noise, the variation of expression within each condition. Therefore the problem is statistical in nature [1-3]. In this paper we discuss a Bayesian multilevel framework for developing screening tools that adapt to the goals of the analysis and to the genomic distributions of signal and noise. We evaluate a representative set of these tools using both extensive simulations and controlled biological experiments in which the set of altered genes is known.

A variety of approaches for selecting differentially expressed genes have been proposed; see [4] for a review and Murie et al. [5]. The simplest, and still the most widely used, is to set a threshold on a measure of signal alone, for example an estimated fold–change. This can be motivated by the desire to identify large changes, although often it is used by simple analogy with other gene expression assays that have much less noise. Upper and lower thresholds of two and one-half are often seen in applications.

One limitation of this approach is that it does not consider how reliably gene-specific changes are measured. That is, it implicitly assumes that all genes are subject to the same level of noise. This may not be the case because even after appropriate preprocessing of the data, the within–gene variation in expression can be highly gene–dependent.

A straightforward way to account for both signal and noise is to select genes based on statistics motivated by two–sample testing, such as the T–ratio or the Wilcoxon statistic. For each gene, the T–ratio is an estimate of the signal–to–noise ratio. Because it requires estimating two or three parameters instead of one, when the number of replicates is small the T–ratio does not necessarily perform better than fold–change, even when the goal is point-null-like. Gains in efficiency over both fold–change and T–ratios can be obtained by considering the ensemble of gene expression measures at once, rather than each gene in isolation. This occurs for at least two reasons. First, genes measured on the same array type in the same laboratory are all affected by a number of common sources of noise. Second, many changes in expression are part of common biological mechanisms.

A widely used approach that uses genome–wide information is “significance analysis of microarrays” or SAM [6]. SAM involves transforming the signal–to–noise ratios so that they are approximately independent of noise across genes. The type of transformation used by SAM is designed to protect against false discoveries generated by very small denominators in the T–ratios. The denominators that are really small are highly likely to be so by chance, because genes share many sources of variability, and a certain amount of variation is to be expected from all of them.

More broadly, joint estimation of many related quantities is often approached by multilevel modeling, and the associated Empirical Bayesian [7,8] and Hierarchical Bayesian [9] estimation techniques. See Carlin and Louis [10] for a detailed discussion. In genomics, these may represent variation in two stages. The first stage defines summaries at the gene level, for example test statistics, or estimates of fold change and noise. These describe variability of samples within each gene.

The second stage posits a “genomic” distribution for these gene–level summaries. Such multilevel modeling provides tools for borrowing strength from other genes when making inference on each gene. Some examples of implementations in microarrays are provided by Baldi and Long [11], Newton et al. [12], Efron et al. [13], Lönnstedt and Speed [14], Ibrahim et al. [15], Parmigiani et al. [16], and Wright and Simon [17], among others.

In practice, a question often raised by genomics practitioners is the extent to which simple, real-time, shrinkage statistics motivated by multilevel models would outperform single-gene-at-a-time analysis or SAM. In this paper we set out to systematically address this question. To this end, we found it necessary to develop a general framework for developing and evaluating these fast shrinkage statistics. Our answer will turn out to be that shrinkage can furnish substantial improvement over single-gene-at-a-time analysis or SAM, provided that the statistics chosen (a) take into account the goals of the screening experiment and (b) are chosen based on examination of the properties of the genomic distribution of signal and noise. Even though we considered an extensive collection of statistics, the goal was not to provide an exhaustive comparison of all approaches that have been proposed, but rather to highlight the critical role of the signal–to–noise trade–off and to provide tools to choose among alternative approaches based on the genome–wide behavior of signal and noise. We compared the performance of these statistics using both extensive simulations and real data sets in which fold changes were known.

**Multilevel models for two-group comparisons**

We consider a design in which two biological types are compared on a microarray that probes G genes. Each type is measured on n arrays using either technical or biological replicates. Here technical replicates refer to experiments that have multiple aliquots of the same RNA, while biological replicates refer to experiments that have multiple subjects from a population. Each situation requires a different interpretation of the array-to-array variability, but the formal structure is the same. We do not consider both levels of replication at the same time here. We denote by X_{1gj} the expression for gene g in sample j in the first group, and by X_{2gj} the expression for gene g in sample j in the second group. Expression levels are assumed to be centered around an overall experiment-wise mean.

Recall our interest lies in studying approaches for selecting genes that are differentially expressed between groups. We begin by describing an additive group effect and independent Gaussian errors. That is, we assume that the observed expressions are conditionally independent draws from

X_{1gj} ~ N(μ_{g} − δ_{g}/2, σ_{g}^{2}), X_{2gj} ~ N(μ_{g} + δ_{g}/2, σ_{g}^{2}).

Here, δ_{g} is the difference in expression level for gene g across groups, μ_{g} is an overall expression level for gene g, also referred to as abundance, or intensity, and σ_{g}^{2} is the variance of expression level for gene g in both groups. We refer to δ_{g} as true signal, and to σ_{g} as true noise. In a multilevel setting, our parameterization is different from that assuming E{X_{1gj}} = μ_{g}, E{X_{2gj}} = μ_{g} + δ_{g}, which would lead to two different marginal variances in the two groups.

The fit of the normal distribution can often be improved by a suitable transformation of the data. Departures from normality, equal variances, and additivity may occur but are not considered in this manuscript.
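As a concrete sketch, this two-group Gaussian data-level model can be simulated in a few lines. The gene count, replicate number, and parameter distributions below are illustrative choices, not the paper's; the group effect is split symmetrically around μ_g (an assumption on our part, consistent with equal marginal variances in the two groups).

```python
import numpy as np

rng = np.random.default_rng(0)
G, n = 1000, 3  # number of genes and replicates per group (illustrative)

# Gene-level parameters: abundance mu_g, signal delta_g, noise sigma_g.
# The distributions below are placeholders, not the paper's hyperparameters.
mu = rng.normal(0.0, 1.0, size=G)
delta = rng.normal(0.0, 1.0, size=G)
sigma = np.sqrt(1.0 / rng.gamma(3.0, 1.0, size=G))

# Data level: equal variance in both groups, group effect split
# symmetrically around mu_g so the marginal variances match.
X1 = rng.normal(mu[:, None] - delta[:, None] / 2, sigma[:, None], (G, n))
X2 = rng.normal(mu[:, None] + delta[:, None] / 2, sigma[:, None], (G, n))
```

The arrays `X1` and `X2` play the role of the replicated measurements X_{1gj} and X_{2gj} in what follows.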

Multilevel models postulate a distribution for the abundance, signal, and noise parameters across genes. A common assumption to many of the multilevel models used in microarray analysis is that of conjugate distributions for the second stage of the statistical model, in short a “conjugate model”. In the case of Gaussian data, the conjugate model implies that the gene–specific signal–to–noise ratios and abundance–to–noise ratios are independent of the corresponding gene–specific noise [18,19]. This assumption leads to convenient mathematical representations for many of the steps required by the data analysis, and is sometimes adopted solely for this reason. In practice, however, some microarray experiments follow this independence pattern closely, while others depart from it substantially. The loss of efficiency of screening based on the conjugate model in the latter case can be large.

Here we broaden the conjugate scheme and investigate four model varieties that result from the combination of two factors: (i) whether the gene-specific signal is independent of the gene-specific noise, and (ii) whether the gene-specific abundance is independent of the gene-specific noise. Formally, for (i) the independence model assumes that δ_{g} and σ_{g} are independent, while the conjugate model assumes that δ_{g}/σ_{g} and σ_{g} are independent.
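The distinction can be made concrete with a small simulation (normal signal and inverse-gamma noise are our illustrative choices, not the paper's specification): under the conjugate model the magnitude of δ_g tracks σ_g, while under the independence model it does not, which is exactly the pattern a diagnostic plot of signal against noise should reveal.

```python
import numpy as np

rng = np.random.default_rng(1)
G = 20000
sigma = np.sqrt(1.0 / rng.gamma(3.0, 1.0, size=G))  # gene-specific noise

# Conjugate model: delta_g / sigma_g is independent of sigma_g,
# so the scale of the signal grows with the noise.
delta_conjugate = sigma * rng.normal(0.0, 1.0, size=G)

# Independence model: delta_g itself is independent of sigma_g.
delta_independent = rng.normal(0.0, 1.0, size=G)

# A crude numeric stand-in for the diagnostic plot: the correlation
# between |signal| and noise is positive only in the conjugate case.
r_conjugate = np.corrcoef(np.abs(delta_conjugate), sigma)[0, 1]
r_independent = np.corrcoef(np.abs(delta_independent), sigma)[0, 1]
```

In practice one would plot the observed d_g (or d_g/s_g) against s_g rather than compute a single correlation, but the qualitative check is the same.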

The remainder of our distributional assumptions are standard for normal multilevel models [9,20]. The models are summarized in **Table 1**.

Data Level

- II. Independence
- CI. Independence of Signal and Noise
- IC. Independence of Abundance and Noise
- CC. Complete Conjugacy

**Table 1:** The four classes of multilevel models investigated. The array-to-array variation is modeled in the same way in all four cases. In all cases, j = 1, 2, . . . , n and g = 1, 2, . . . , G. All quantities denoted by Greek letters are unknown. A further set of prior distributions for the hyperparameters is described in the text.

We use the notation d_{g} for the mean difference of expression across the two groups, a_{g} for the overall mean expression, and s_{g} for the pooled estimate of the standard deviation. Notationally, writing x̄_{ig} = Σ_{j} X_{igj}/n for the group means, d_{g} = x̄_{2g} − x̄_{1g}, a_{g} = (x̄_{1g} + x̄_{2g})/2, and s_{g}^{2} = [Σ_{j}(X_{1gj} − x̄_{1g})^{2} + Σ_{j}(X_{2gj} − x̄_{2g})^{2}]/(2n − 2).

In all four models of **Table 1**, these statistics are independent conditional on gene-specific parameters, with distributions

d_{g} | δ_{g}, σ_{g} ~ N(δ_{g}, 2σ_{g}^{2}/n), a_{g} | μ_{g}, σ_{g} ~ N(μ_{g}, σ_{g}^{2}/(2n)), (2n − 2)s_{g}^{2}/σ_{g}^{2} | σ_{g} ~ χ^{2}_{2n−2}.

All four combinations of **Table 1** occur commonly in practice. For example, our two experimental data sets show two markedly different relationships between signal and noise.

The vector of unknown genome-wide parameters will be denoted by ξ=(ν,β,λ,τ). There are several estimation approaches available for models of this kind. State-of-the-art, computationally intensive approaches are usually based on MCMC [21]. Instead we focus on a faster and simpler empirical Bayesian approach based on estimating ξ by the method of moments from the empirical distributions of the s_{g}^{2}'s and d_{g}'s. The resulting estimators are computationally cheap and may include shrinkage of the signal, of the noise, or both. Several method-of-moments alternatives are available, and results can be strongly affected by this choice. For example, in our experience, the method of moments applied to the distribution of σ_{g}^{2}, which is inverse gamma, performs poorly, while the same applied to the precision 1/σ_{g}^{2} performs well.
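To illustrate how cheap such moment estimators are, the sketch below moment-matches a gamma distribution to gene-specific precisions (so that the variances themselves are inverse gamma). The hyperparameter values are illustrative, and the paper's actual estimating equations, which must also account for the sampling variability of s_g², are not reproduced here.

```python
import numpy as np

def gamma_mom(x):
    """Method-of-moments fit of a Gamma(shape, scale) distribution:
    match the sample mean m and variance v, since m = shape * scale
    and v = shape * scale**2."""
    m, v = x.mean(), x.var()
    return m * m / v, v / m  # (shape, scale)

rng = np.random.default_rng(2)
# Illustrative truth: precisions 1/sigma_g^2 ~ Gamma(4, scale=2),
# equivalently sigma_g^2 is inverse gamma.
precisions = rng.gamma(shape=4.0, scale=2.0, size=50000)
shape_hat, scale_hat = gamma_mom(precisions)
```

With tens of thousands of genes the moment estimates are quite stable, which is what makes this a "real-time" alternative to MCMC.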

We approach the task of generating a list of candidate genes by ranking genes according to a one-dimensional statistic, and then selecting all genes whose statistic is above a certain cutoff. This is the norm in practice. While more general decision-theoretic approaches evaluating the trade-off between false and missed discoveries are available [22,23], these are complex, and would have been prohibitive in our extensive simulation study. The cutoff is often determined by the ability of a laboratory to perform validatory analyses, or, more inferentially, by false discovery rates [24-27]. In our presentation, to simplify the comparison of approaches, we focus on the ranking of genes implied by the statistics, and the ability of each statistic to identify the top genes.

Throughout, we draw a distinction between the selection of genes that are changed by a large amount, and genes that are changed by a reliably measured amount. Accordingly we consider two broad families of statistics, ones that estimate the signal, δ_{g}, and ones that estimate the signal-to-noise ratio, δ_{g}/σ_{g}. Because we use statistics as ranking devices and compare them based on ROC curves, we only need to define statistics up to constants that are not gene-specific. A proportionality sign will indicate omission of such constants. **Table 2** summarizes the statistics we examined, organizing them by goal and motivating model structure. **Table 3** summarizes the expressions of statistics motivated by multilevel models.

Motivating Model | Large change | Reliably measured change
---|---|---
Independence/Normality of Genes | Difference in Expression (F) | T–statistic (T)
Exchangeability of Genes | Significance Analysis of Microarrays (SAM) |
Complete Conjugacy | Signal (CC.F); Tail probability (CC.TP) | Standardized Signal (CC.T); Bayes factor (CC.BF)
Independence of Abundance and Noise | Signal (IC.F); Tail probability (IC.TP) | Standardized Signal (IC.T); Bayes factor (IC.BF)
Independence of Signal and Noise | Signal (CI.F); Tail probability (CI.TP) | Standardized Signal (CI.T); Bayes factor (CI.BF)
Independence | Signal (II.F); Tail probability (II.TP) | Standardized Signal (II.T); Bayes factor (II.BF)

**Table 2:** Summary of statistics examined, by goal and motivating model structure.

Model | F | T | BF | TP
---|---|---|---|---
CC.m | d_{g} | | |
CC.c | d_{g} | | |
IC | d_{g} | | |
CI | | | |
II | | | |

**Table 3:** Summary of functional forms of all the statistics motivated by multilevel models. For the complete conjugacy case (CC) we consider both statistics that use analytic integration with respect to σ_{g} (labeled CC.m) and statistics that use plug-in estimates of σ_{g} (labeled CC.c).

**Statistics**

In this subsection we enumerate and briefly comment on each of the statistics we considered. The remainder of this section is provided as a reference for future sections. Details of the derivations are given in the Appendix.

**Difference in Expression (F):** This is the observed average difference d_{g}. Usually expression data are analyzed in the logarithmic scale, in which case F corresponds to an estimate of the log fold change across conditions.

**T–statistic (T):** This is the common statistic T ∝ d_{g}/s_{g} used for testing the null hypothesis δ_{g}=0 one gene at a time.

**Significance Analysis of Microarrays (SAM):** This was proposed by Tusher et al. [6] and is based on the change of gene expression relative to an adjusted standard deviation. For the two-group case considered here, the SAM statistic for gene g is

SAM_{g} = d_{g}/(s_{g} + s_{0}),

where s_{0} is the so-called “exchangeability factor”. This factor is estimated using information from the entire set of genes to transform the values of SAM so that noise and SAM are approximately independent.
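A sketch of this statistic in code follows; for illustration only, s_0 is set to the median of the gene-wise standard errors, which is a common rough choice but not the coefficient-of-variation criterion used by Tusher et al.

```python
import numpy as np

def sam_scores(X1, X2, s0=None):
    """SAM-style relative difference d_g / (s_g + s0) for two groups,
    where s_g is the standard error of d_g. s0 defaults here to the
    median of the s_g -- an illustrative choice only."""
    n1, n2 = X1.shape[1], X2.shape[1]
    d = X2.mean(axis=1) - X1.mean(axis=1)
    pooled = ((n1 - 1) * X1.var(axis=1, ddof=1) +
              (n2 - 1) * X2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    s = np.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    if s0 is None:
        s0 = float(np.median(s))
    return d / (s + s0)
```

Adding s_0 to the denominator keeps a tiny s_g, likely small by chance, from inflating the statistic — the point made in the preceding paragraphs.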

**Statistics for the Complete Conjugacy (CC) Model:** In the Complete Conjugacy model the conditional posterior distribution of δ_{g} given hyperparameters ξ can be written as

(1)

A set of computationally cheap statistics is derived by considering empirical Bayes estimates of this distribution, obtained by replacing the hyperparameters ξ with an estimate ξ̂. A regularized estimate of the signal δ_{g} is

Regularization is independent of the gene, so for any given experiment CC.F will be proportional to F. For this reason we only consider F, although we keep this correspondence in mind when interpreting the results.

A standardized estimate of signal is derived as the ratio of the conditional posterior mean and standard deviation of δ_{g} from expression (1),

The denominator incorporates a linear shrinkage estimate of the gene-specific variance with gene-varying coefficients, penalizing more heavily genes whose signal or abundance are outlying. For this reason, it is critical that the conjugacy assumption be checked, or very valuable information may be lost. On the other hand, when the assumption is met, an increase in efficiency is gained from estimating the denominator.
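Since the closed forms of Table 3 are not reproduced here, the following moderated-t-style sketch conveys the idea: the gene-specific variance is linearly shrunk toward a genome-wide target before standardizing. The prior weight `nu0` and target `s2_0` are placeholders for quantities derived from the fitted hyperparameters, and the gene-varying shrinkage coefficients described above are deliberately omitted.

```python
import numpy as np

def shrunken_t(d, s2, n, nu0, s2_0):
    """Moderated t-style statistic: shrink each gene's pooled variance
    s2 toward a genome-wide target s2_0 with prior weight nu0, then
    standardize the mean difference d. A sketch only; the paper's CC.T
    uses gene-varying weights from the multilevel fit."""
    df = 2 * n - 2                                   # pooled degrees of freedom
    s2_shrunk = (nu0 * s2_0 + df * s2) / (nu0 + df)  # linear shrinkage of variance
    return d / np.sqrt(s2_shrunk * (2.0 / n))        # se(d) = sqrt(2 * s2 / n)

# With no data-based variance at all, the statistic falls back on the prior:
t0 = shrunken_t(np.array([1.0]), np.array([0.0]), n=3, nu0=4.0, s2_0=1.0)
```

The fallback behavior is the practical payoff: genes with s_g ≈ 0 are judged against the genomic variance rather than their own unstable estimate.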

An empirical Bayes estimate of the Bayes factor [28] for the null hypothesis of no gene-specific differential expression, δ_{g} = 0, is:

Finally, we consider the empirical Bayes approximation of the probability that the true change δ_{g} exceeds D [29]. Here, D represents a target change across conditions. This tail probability reflects the observed change, its variability, and the likely magnitude of biologically significant changes.
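Under a normal approximation to the empirical Bayes posterior of δ_g, this tail probability has a one-line form; the posterior mean and standard deviation arguments below stand in for whichever model's expressions are plugged in.

```python
import math

def tail_probability(post_mean, post_sd, D):
    """P(delta_g > D | data) under a normal approximation to the
    posterior of delta_g. post_mean and post_sd are the posterior mean
    and standard deviation; D is the target change across conditions."""
    z = (D - post_mean) / post_sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # equals 1 - Phi(z)
```

A two-sided version for |δ_g| > D adds the symmetric lower tail in the same way.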

**Statistics for the Independence of Abundance and Noise (IC) Model:** In this model the posterior distribution of δ_{g} given σ_{g} and hyperparameters ξ can be written as

(2)

Unlike in the complete conjugate case, a closed-form marginalization with respect to σ_{g} is not possible. Therefore we derive results assuming σ_{g}^{2} is known. In the actual calculations, to obtain a real-time statistic, σ_{g}^{2} is estimated by the posterior mode of its distribution given the data and ξ. Then our estimate of the normalized signal is

As with CC.F, IC.F is proportional to F, so we only consider F in our results section.

A standardized estimate of signal based on regularized estimates of signal is the ratio of the marginal posterior mean and standard deviation of δ_{g} from expression (2), that is

The Empirical Bayes estimate of the Bayes factor, conditional on gene specific variance is:

Finally, we consider the empirical Bayes tail probability

**Statistics for the Independence of Signal and Noise (CI) Model:** In this model the posterior distribution of δ_{g} given σ_{g} and hyperparameters ξ can be written as

(3)

Again, to obtain a real-time statistic we develop results conditional on σ_{g}^{2} and estimate it with its posterior mode in the actual calculations. The estimate of the signal is

A standardized estimate of signal based on regularized estimates of signal is the ratio of the marginal posterior mean and standard deviation of δg from expression (3), that is

The Empirical Bayes estimate of the Bayes factor is:

while the tail probability approximation is

**Statistics for the Complete Independence (II) Model:** In this model the posterior distribution of δ_{g} given σ_{g} and hyperparameters ξ can be written again as

(4)

A regularized estimate of δ_{g}, motivated by the independence model, is obtained by replacing ξ with ξ̂ and σ_{g}^{2} with its conditional posterior mode evaluated at ξ̂, and approximating the posterior mean by

Unlike IC.F and CC.F, both II.F and CI.F imply a linear shrinkage which depends on the genomic variability of the signal. Dividing II.F by the square root of the variance of δ_{g}, and approximating as before, we obtain

The Empirical Bayes estimate of the Bayes factor is:

while the empirical Bayes approximation to the tail probability is

Notice that the definitions of statistics for the CI and II cases would be the same if σ_{g}^{2} were known. Hence the only difference in practice is the posterior mode used for σ_{g}^{2}. It should not then be a surprise that the performance of these two is very close. For the same reason, this is also true for statistics in the CC and IC settings conditional on σ_{g}^{2}.

The empirical Bayes estimators, both standardized and not, have functional similarities to the SAM score, although the shrinkage of the noise in the denominators is determined differently. In empirical Bayes analyses, the shrinkage is driven by the parameters of the genomic distributions of signal and noise, in a form that depends on whether or not conjugacy is assumed. In SAM one applies linear shrinkage to the standard deviation rather than the variance, and the shrinkage intercept s_{0} is chosen to approximate independence of SAM ratios from noise.

**Overview**

In this section we study the performance of the real-time shrinkage statistics on a large number of data sets simulated from each of the four models in **Table 1**. We evaluate each statistic on the basis of the implied ranking of genes and its ability to identify the top genes. In the analyses presented here we considered two alternative goals: in one, the genes of interest are the top genes by absolute change δ_{g}; in the other, the genes of interest are the top genes by signal-to-noise ratio δ_{g}/σ_{g}. We considered the top 1%, 2% and 10% for each goal. For each cutoff we create a binary indicator of whether the true parameter is in the top list, and use this indicator as the true class assignment to be predicted. We evaluate performance by an ROC curve [13,30], which is the graph of the true positive fraction versus the false positive fraction for varying thresholds.

As summaries, we consider the overall area under the ROC curve [31] and the partial area under the ROC corresponding to false positive fractions smaller than 20%. We prefer these measures to others incorporating explicitly δ_{g} and δ_{g}/σ_{g} for two reasons: the goal of the microarray experiments we are focusing on is screening rather than estimation; interest usually lies in a relatively small fraction of important findings.
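For completeness, the AUC computation on a ranked gene list reduces to a few lines. This is a sketch: ties are broken arbitrarily, and segments of the curve crossing `max_fpr` are dropped rather than interpolated, so the partial area is slightly conservative.

```python
import numpy as np

def roc_auc(scores, labels, max_fpr=1.0):
    """Trapezoidal area under the ROC curve of a ranking statistic,
    optionally restricted to false positive fractions <= max_fpr.
    labels is a 0/1 vector marking the truly top-ranked genes."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # decreasing score
    y = np.asarray(labels)[order]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (len(y) - y.sum())])
    keep = fpr <= max_fpr
    f, t = fpr[keep], tpr[keep]
    return float(np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0))
```

Calling `roc_auc(stat, truth)` gives the overall area, and `roc_auc(stat, truth, max_fpr=0.2)` an approximation to the partial area used as the second summary.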

Based on these criteria, our simulations suggest three general conclusions about the alternative approaches for identifying differential genes: i) simple, real-time, shrinkage statistics motivated by multilevel models can outperform alternatives based on analyzing each gene separately, in some cases by a large margin; ii) the same statistics can perform better than the commonly used SAM [6] statistic, provided that careful checking of the multilevel modeling assumptions is carried out; and iii) no statistic is optimal for both the identification of large δ_{g} and large δ_{g}/σ_{g}.

**Design of simulation study**

Our goal was to generate a large and diverse number of scenarios, depending on sample size, conjugacy assumptions, and hyperparameter choices. We considered three sample sizes: 3, 10 and 100 per group. Use of samples as small as 3 is a common scenario in the gene screening experiments taking place during the routine activities of many laboratories, while 10 per group is a common scenario in comparisons across conditions for population genomic studies. Sample sizes as large as 100 per group are rare and are considered here mostly as a check.

For each combination of conjugacy assumption (CC, IC, CI and II) and sample size, we simulated data from 2009 hyperparameter combinations, resulting in a total of 24108 datasets. The 2009 combinations of hyperparameters are based on a grid over the hyperparameters of the genomic distributions of signal, abundance, and noise. For each combination we derive ν and β from the specified mean and variance of the noise distribution. Here the total number of combinations is 2009 rather than the full 2401 (= 7^{4}) because some expectation/variance combinations lead to unrealistic settings yielding numerically unstable results.

**Simulation results**

Mining the massive information generated by these thousands of datasets required drastic summarization. Here, we present one dataset in detail, and then provide the following summaries: scatterplots in which each dataset/statistic combination is represented by a single point, summaries of pairwise comparisons of statistics by model/goal/sample size, and summaries of best performing statistics by model/goal/sample size.

**Figure 1** shows the ROC curves for a single simulation. Data are generated from the II model and the ROC is based on identifying genes with large signal. In this data set, we see a clear separation in the performance of the statistics, both within and across conjugacy structures. The tail probability statistics perform best irrespective of the motivating model, stressing the importance of correctly specifying the analysis goal.

**Figure 1:** ROC curves for all the statistics under study in a single simulation with 3 replicates and 1000 genes. The hyperparameter values used to simulate the data were ν=2, β=1, γ=1 and τ=1. The ROC curves are grouped by statistical model to unclutter the displays. The curves for T and SAM are repeated in each panel.

To summarize this type of comparison for all 2009 simulated data sets, we display pairwise scatterplots of areas under the ROC curves (AUC). **Figure 2** summarizes results for II data with three replicates. We focus on T, SAM and on the three shrinkage-based statistics that perform best in II data. Whiskers at the top of the graphs for the T statistic indicate that, for a subset of simulations in which other statistics achieve a perfect separation, the T can still miss a fraction of differentiated genes. The reciprocal situation does not occur, suggesting a substantial inefficiency in the use of the T statistic. For both goals, the best performing model is based on CI, suggesting that the model chosen to represent the abundance/noise relationship is less critical than that chosen for the signal/noise relationship. A complete set of such comparisons including IC, CI and CC is available in the supplementary materials on our website.

**Figure 2:** Comparison of performance of T, SAM and the shrinkage statistics that perform best in independent (II) data. Each point represents the areas under the curve for two statistics for a particular simulated data set. In the plots above the diagonal, we report the areas when identifying genes with high signal-to-noise ratio, while in plots below the diagonal we report areas when identifying genes with large signal. The shrinkage statistics that perform best change across the two goals, and therefore different shrinkage statistics are shown above and below the diagonal. Because over-plotting points may lead to visually misleading results in some cases, we also print the percentages PU and PL of points lying above and below the diagonal, and the averages MX and MY of the horizontal and vertical variables.

Rather than reproducing these plots for each of the simulation scenarios, we present, in **Figure 3**, “heat maps” synthesizing pairwise comparisons of estimators for the four simulation settings. For each pair of statistics, the maps encode how often one statistic's AUC is better than the other's, allowing for rapid comparison of any two statistics on data generated under each of the four model families. To further summarize, the best performing statistics from the heat maps for sample sizes 3, 10 and 100 replicates are reported in **Tables 4 and 5** (for partial AUC).

**Figure 3:** Heat maps of the results for data simulated from the four models, with sample size 3. The color encodes the sign of the difference between the percentages PU and PL of points lying above and below the diagonal in Figure 2 for all pairs of statistics. Green indicates PU<PL, red indicates PU>PL, while yellow indicates those cases where the difference was less than 0.05. The best statistics to identify genes with large signal are labeled in purple, while the best statistics to identify reliably measured differentially expressed genes are labeled in blue. These results are further summarized in Table 4.

**Table 4:** Summary of best performing statistics by AUC. Here “model” is the true model used for simulation. Statistics were evaluated by their ability to identify the top 1%, 2% and 10% of genes with large signals (labeled Signal) and large signal-to-noise ratios (S/N). The rows labeled MM correspond to parameter estimation using the method of moments, while those labeled TT correspond to using the true hyperparameters. Instances where the best performing statistic did not match the appropriate true model and goal are highlighted in red. Instances where the best performing statistic did not match the model but was consistent with the goal are highlighted in blue. A wildcard * indicates the statistic from either the conjugate or independence model was the best performer. For example, “*C.F” indicates that the CC.F and IC.F statistics were roughly equivalent best performers.

**Table 5:** Best statistics for each simulation scenario with three replicates. Here “model” corresponds to the true model used for simulation. Statistics were differentiated in their ability to identify the top 1%, 2% and 10% of genes with large signals (labeled Signal) and large signal-to-noise ratios (S/N). The rows labeled “MM” correspond to parameter estimation using the method of moments. Results using the actual true parameter values (labeled “Truth”) are also given. Instances where the best statistic did not match the appropriate true model and goal are highlighted in red. Instances where the best statistic did not match the model but was consistent with the goal are highlighted in blue. The results highlight the importance of matching the statistic to the goal; in only one case did a T statistic outperform other statistics in identifying high signal changes, while in no case did a fold change outperform other statistics in identifying high signal-to-noise ratios.

Overall, these results emphasize the importance of matching the statistic to the analysis goal: fold changes and tail probabilities appear to be the best statistics for identifying large signal changes. In contrast, the signal-to-noise ratios and Bayes factors appear to be optimal for identifying reliably measured differential expression. Also, these results confirm the importance of shrinkage: in only one case did a T statistic outperform other statistics in identifying high signal changes, while in no case did a fold change outperform other statistics in identifying high signal-to-noise ratios. When shrinkage is used, correctly identifying the appropriate modeling assumptions becomes increasingly important as the number of replicates increases. The SAM statistic performs well across models, though often worse than the best shrinkage statistics within a model.

To bound the improvement in performance that can be achieved by shrinkage, in **Tables 4 and 5** we also report results obtained by plugging the true values of the parameters of the genomic distributions, rather than their estimates, into the calculation of the statistics.

When there is no closed form, we have used conditional estimation, plugging in the posterior mode of σ_{g}^{2}. To verify how reasonable these real-time statistics are, we considered the Complete Conjugacy model, where closed forms are available for all the statistics. We performed another set of simulations with exactly the same hyperparameters, comparing results based on CC.TP and CC.BF in two cases: one based on the conditional posterior of δ_{g} with the posterior mode of σ_{g}^{2} plugged in, the other based on the marginal posterior of δ_{g} integrating out σ_{g}^{2}. Results are shown in the supplementary materials. Real-time statistics generally perform as well as the exact statistics, although there is a minority of cases in which the exact statistics do far better than the real-time approximation, especially for tail probabilities.
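As an illustration of this plug-in strategy, the sketch below computes a gene's posterior-mode variance under a conjugate scaled-inverse-chi-square prior and plugs it into a T-like statistic. The function names, the prior parameterization (nu0, s2_0), and the two-group design with n replicates per group are our own illustrative assumptions, not the paper's exact formulas:

```python
import math

def shrunken_variance(s2_g, n, nu0, s2_0):
    """Posterior-mode plug-in for a gene's variance, assuming a
    scaled-inverse-chi-square prior with df nu0 and scale s2_0.

    The posterior is scaled-inv-chi-square with df nu0 + (n - 1) and
    scale (nu0*s2_0 + (n - 1)*s2_g) / (nu0 + n - 1); its mode is that
    scale times df / (df + 2).
    """
    df = nu0 + (n - 1)
    scale = (nu0 * s2_0 + (n - 1) * s2_g) / df
    return df * scale / (df + 2)

def moderated_t(diff, s2_g, n, nu0, s2_0):
    """T-like statistic with the posterior-mode variance plugged in,
    assuming `diff` is a difference of two group means with n replicates
    each, so its sampling variance is 2*sigma^2/n."""
    return diff / math.sqrt(2 * shrunken_variance(s2_g, n, nu0, s2_0) / n)
```

Note how the plug-in variance is pulled from the observed `s2_g` toward the prior scale `s2_0`, which is what damps the denominator of the T-like statistic for genes with unstable variance estimates.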

Finally, we further investigated the seemingly counterintuitive result that the best performing statistics for data simulated from the Complete Conjugacy model are IC.TP and IC.BF, for genes ranked both by signal alone and by signal-to-noise ratio. This behavior persists at larger sample sizes. The reason is that some of the hyperparameter combinations lead to simulated datasets whose diagnostic plots are consistent with an IC model, in which case the IC statistics perform well, while the abundance-based shrinkage applied by the CC statistics leads to loss of some of the signal. Additional details are provided in the supplementary materials.

**Datasets**

We analyzed two data sets. The first was reported by Tusher et al. [6] in the context of comparing irradiated and unirradiated cell lines. A subset of the genes' changes, identified based on the SAM statistic, were subsequently validated by independent assays. While the experiment includes some blocking, we analyze it here as though it were a two-class comparison with 4 replicates.

The second data set is from an experiment reported by Dudley et al. [32]. They performed a so-called "spike-in" experiment in which they selected 9 genes with very low natural expression and spiked in Cy3-labeled gene-specific oligonucleotides in increments from 0.5-fold to 200-fold. Their experiment used cDNA microarrays including a total of 6307 genes, and had two replicates. We work from ratios of the Cy3 to Cy5 channels, after normalization [33]. While spike-in experiments are useful in that the true fold changes are known, both the magnitudes of the changes and the sparsity of changes in the genome are unlikely to be realistic.

**Graphical diagnostics for conjugacy structure**

We begin by investigating the relationship between signal, abundance and noise by displaying boxplots of signal and abundance by noise level. These elaborate on ideas of Dudoit et al. [34] and Tusher et al. [6]. **Figure 4** considers the Tusher data. An SN plot in which the location and dispersion of signal are stable across noise levels suggests the use of an independence model, while one in which the location and dispersion of signal increase with noise level suggests the use of a conjugate model. Thus, a constant box size indicates independence while an increasing box size indicates conjugacy. In simulated data (see Supplementary materials), the diagnostic plot clearly distinguishes conjugacy with respect to signal and noise as well as conjugacy with respect to abundance and noise.
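A minimal sketch of such a diagnostic can be put together by binning genes into equal-count groups of increasing gene-specific variance and boxplotting the signal within each bin. The function names and binning scheme here are illustrative, not the paper's implementation:

```python
import numpy as np

def sn_bins(signal, noise, n_bins=10):
    """Split genes into equal-count bins of increasing noise (the
    gene-specific variance) and collect each bin's signal values."""
    order = np.argsort(noise)
    return [signal[idx] for idx in np.array_split(order, n_bins)]

def sn_plot(signal, noise, n_bins=10, fname="sn_plot.png"):
    """Boxplots of signal by noise bin: stable boxes across bins suggest
    an independence model; boxes that widen with noise suggest conjugacy.
    The same plot with abundance on the vertical axis diagnoses the
    abundance-noise relationship."""
    import matplotlib
    matplotlib.use("Agg")  # render without a display
    import matplotlib.pyplot as plt
    plt.boxplot(sn_bins(signal, noise, n_bins))
    plt.xlabel("binned gene-specific variance (noise)")
    plt.ylabel("average difference in expression (signal)")
    plt.savefig(fname)
```

On data simulated under a conjugate model (signal variance proportional to noise), the box heights grow visibly from the first bin to the last, while independence-model data produce boxes of roughly constant height.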

**Figure 4:** Signal-to-noise diagnostic plot for the Tusher data under two transformations. Genes identified by SAM are highlighted in red. The horizontal axis is the binned gene-specific variances. The vertical axis for the plots in the left column is the average expression across the two groups (abundance), while for the right column it is the average difference in expression (signal). Conjugacy appears appropriate for the raw data, and an independence relationship appears appropriate for the transformed data.

Because results are sensitive to the type of transformation applied to the expression measurements, we display both the original scale and the cube root. Untransformed data show a pattern consistent with the conjugate model, while data transformed using the cube root appear consistent with the independence model. An alternative visualization to the SN plot is a simple scatterplot of signal versus noise. A limitation of this approach is that it can be difficult to establish whether increased variation in signal at different levels of noise is due to a true relationship or simply to a higher number of genes at that noise level.

**Figure 5** shows the spike-in data using three transformations. The original scale shows a marked positive relationship between estimated signal and noise, the cube root scale a mild positive relationship, and the logarithm an almost stable relationship, with some indication of larger variation in the signal at lower noise levels. These figures do not inform us about absolute intensity, so the larger variation of signal at the low end after the log transformation is not the same as the well-known "fishtail" effect observed in MVA plots.

**Figure 5:** Signal-to-noise plot for the spike-in data under three transformations. Spiked-in genes are highlighted in red. The horizontal axis is the binned gene-specific variances; the vertical axis for the plots in the left column is the average expression across the two groups (abundance), while for the right column it is the average difference in expression (signal). The apparent relationships between abundance and noise, and between signal and noise, clearly change depending on the transformation used. Conjugacy appears appropriate for the raw data, while an independence relationship appears more appropriate for the log-transformed data.

In this data set the cube root transformation is the most effective in helping identify the truly differentially expressed genes, performing better than the commonly used log. In practical applications one does not have the advantage of knowing the true changes when choosing a transformation. The important lesson here, however, is that choosing transformations based on convenient statistical properties such as variance stabilization does not necessarily improve, and could prejudice, our ability to detect signal.

These two datasets stress that both the independence of signal and noise and independence of signal-to-noise ratio and noise may need to be tackled in real applications. While transformation of the measured intensities may allow one to achieve independence, it is not clear that such transformations would be optimal in terms of gene screening.

**Comparison of real-time statistics**

**Figure 6** shows pairwise scatterplots of the statistics CC.TP, II.F, and SAM for the two transformations in **Figure 4**. CC.TP and II.F are the two best performing real-time shrinkage statistics for identifying genes with large signal. The two best statistics for selecting reliably measured genes, CC.BF and II.T, are given in the supplementary materials. All the genes originally identified by Tusher et al. [6] receive high tail probability using both transformed and untransformed data, though additional genes also receive tail probability close to one. On the other hand, the correspondence between SAM and II.F is good after cube root transformation but not in the raw scale. In evaluating these results, one must keep in mind that only genes that exceeded a certain SAM threshold were validated independently in the study. Therefore, direct performance comparisons with SAM are not reliable here.

**Figure 6:** Comparisons of the II.F, CC.TP and SAM statistics on the Tusher data. Genes selected by SAM are highlighted in red for reference. Here the CC.TP and II.F statistics were identified from Tables 4, 5 and 6 as the optimal statistics for detecting large signal changes for the CC and II models. For this data set, the II model is supported under the cube root transformation while the CC model is supported for the raw data.

**Figure 7** compares statistics in the spike-in data. For the log-transformed data, we would expect a better performance from II.F than CC.TP based on the SN plot. In fact, the II.F statistic shrinks the effects excessively and gives a less efficient ranking. For the cube root transformed spike-in data we would expect, and in fact see, a better performance from CC.TP than II.F in **Figure 7**. For the untransformed spike-in data, results shown in the supplementary materials confirm the intuition from the exploratory plots that the conjugacy model should outperform the independence model; **Figure 7** shows this result. In a closer comparison between CC.TP and SAM on the raw data (see Supplementary materials), CC.TP also clearly picks up all the spiked genes, while SAM does not. Spiked genes have large signals, so the poor performance of II.T and CC.BF is no surprise (see Supplementary materials).

**Figure 7:** Comparisons of the II.F and CC.TP statistics with SAM in the spike-in data. Spiked-in genes are highlighted in red. Here the CC.TP and II.F statistics were chosen from Tables 4, 5 and 6 as the optimal statistics for detecting large signal changes for the CC and II models. For this data set, the II model is supported under the log transformation while the CC model is supported for the raw and cube root transformed data. The II.F statistic shrinks conservatively even for the log-transformed data, while SAM and CC.TP perform well for both transformed data sets. No statistic performs well on the original scale.

This analysis warns of a potential danger in the use of multilevel modeling. Assuming that the signals follow a Gaussian law amounts to assuming that all genes are differentially expressed to some extent, and that the goal is either to detect the largest signals or the largest reliably measured signals. This is realistic in case-control comparisons and in experiments in which the experimental intervention changes a large portion of the expression, as during cell division. In contrast, in this spike-in experiment, the distribution of signals is in fact degenerate at 0 for all but the spiked-in genes. Therefore, the statistics motivated by the Gaussian law on the signal are not validated by the data. While extreme, the spike-in situation may be relevant in practice when an experimental intervention modifies a small set of genes involved in a very specialized pathway.

Diagnosing empirically whether the signal distribution is a mixture is difficult. Appropriate weight should be given to the biological circumstances of the experiment. For example, here an independence relationship between signal and noise is suggested for the cube root and log transformed data. However, as the majority of the genes are biologically known to have no signal, these plots do not inform us on the question of interest. Furthermore, **Figure 7** shows that the best complete independence statistics for identifying large signal and reliably measured signal changes, II.F and II.T, perform poorly at detecting the spiked-in genes. In summary, aggressively modelling the distribution of signals when the overwhelming majority of genes have no signal can produce poor results.

In this article we present a framework for interpreting, selecting, and estimating shrinkage based screening statistics used in the identification of differentially expressed genes. We also evaluated a representative set of these tools using both extensive simulations and controlled biological experiments in which the set of altered genes is known or partially known.

Our results emphasize two important practical concerns that are not receiving sufficient attention in applied work in this area. First, while shrinkage strategies based on multilevel models are able to improve selection performance, they require careful verification of the assumptions on the relationship between signal and noise. Incorrect specification of this relationship can negatively affect a selection procedure. Because this inter-gene relationship is generally identifiable in genomic experiments, we suggest a simple diagnostic plot to assist model checking. Secondly, no statistic performs optimally across two common categories of experimental goals: selecting genes with large changes, and selecting genes with reliably measured changes. Therefore, careful consideration of analysis goals is critical in the choice of the approach taken.

The commonly used SAM statistic emerges as a reasonable compromise between the two goals above and is, to some extent, automatically adaptive to different relationships between signal and noise. Improving on SAM is possible but requires careful validation of the assumptions about the upper-level distribution. The assumption of conjugacy in the abundance dimension requires careful attention as it is not robust. In particular, estimators based on the CC assumption can be outperformed even on data generated under CC.
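For reference, the SAM d-statistic can be sketched as below. SAM proper tunes the exchangeability constant s0 to stabilize the coefficient of variation of d across genes; this simplified version instead fixes s0 at a quantile of the gene-wise standard errors, which is an assumption of this sketch rather than SAM's actual tuning rule:

```python
import numpy as np

def sam_statistic(a, b, s0_quantile=0.05):
    """Simplified SAM d-statistic for a two-group comparison.

    a, b : (genes, replicates) arrays of expression values.
    A small constant s0, added to each gene's standard error, damps the
    statistic for genes whose tiny variance estimates would otherwise
    produce spuriously large t-like values.
    """
    na, nb = a.shape[1], b.shape[1]
    diff = a.mean(axis=1) - b.mean(axis=1)
    pooled = ((a.var(axis=1, ddof=1) * (na - 1) +
               b.var(axis=1, ddof=1) * (nb - 1)) / (na + nb - 2))
    se = np.sqrt(pooled * (1.0 / na + 1.0 / nb))
    s0 = np.quantile(se, s0_quantile)  # fixed-quantile stand-in for SAM's tuned s0
    return diff / (se + s0)
```

The added s0 is what makes SAM intermediate between a fold change (which ignores the standard error entirely) and a T statistic (which divides by it fully), consistent with its compromise behavior observed in the simulations.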

Our simulation analysis relies on the assumed normality of the data. In practice, two aspects of it are critical. At the lower stage, in small samples, the functional form of the error distribution across samples is hard to assess. At the upper stage, normality can be checked, and transformations may help, although the caveats discussed in Section 4.3 should be considered. Alternative multilevel models have been studied, for example by Newton and Kendziorski [35], who consider gamma models, and Müller et al. [22], who extend those to mixtures of gamma models. While these alternatives are worth serious consideration, here we focus on Gaussian models and statistics motivated by the Gaussian setting, primarily because in this way we can practically investigate a variety of relevant statistics on a massive number of simulation scenarios.

Other interesting multilevel approaches have been proposed to analyze designs that are more complex than the two-group comparisons considered here. We refer the reader to Kerr et al. [36], Wolfinger et al. [37], Kooperberg et al. [38], Tai and Speed [39] and Wang et al. [40] for further details. Meta-analysis of multiple microarray studies is another area where multilevel models have proven helpful, as illustrated by Conlon [41] and Scharpf et al. [42].

In recent years gene expression is increasingly measured using technologies based on sequencing short reads (RNA-seq), which coexist with the hybridization-based approaches that motivated this work. These technologies generate count data. Though the normal model is sometimes used and can perform reasonably after variance stabilization, differential expression is better analyzed statistically using Poisson and negative binomial models, as described in Kvam et al. [43] or Rapaport et al. [44]. An effective approach using multilevel models is included in the Bioconductor package described by Love et al. [45].
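To illustrate the count-based alternative, here is a minimal negative binomial likelihood-ratio test for a single gene, with the dispersion treated as known. This is a standalone sketch, not the implementation in DESeq2 or any other package: those tools additionally estimate dispersions by sharing information across genes and handle library-size normalization, and the dispersion value and function names below are our own assumptions.

```python
import math

def nb_loglik(counts, mean, alpha):
    """Negative binomial log-likelihood with mean `mean` and dispersion
    `alpha` (variance = mean + alpha * mean**2); assumes mean > 0."""
    r = 1.0 / alpha  # size parameter: p = r / (r + mean)
    ll = 0.0
    for k in counts:
        ll += (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(r / (r + mean)) + k * math.log(mean / (r + mean)))
    return ll

def nb_lrt_pvalue(counts_a, counts_b, alpha=0.1):
    """Likelihood-ratio test of equal means across two groups of counts,
    with the dispersion alpha treated as known. With fixed dispersion the
    NB mean MLE is the sample mean, so the fits are closed form."""
    mean_a = sum(counts_a) / len(counts_a)
    mean_b = sum(counts_b) / len(counts_b)
    mean_0 = (sum(counts_a) + sum(counts_b)) / (len(counts_a) + len(counts_b))
    lr = 2 * (nb_loglik(counts_a, mean_a, alpha) + nb_loglik(counts_b, mean_b, alpha)
              - nb_loglik(counts_a, mean_0, alpha) - nb_loglik(counts_b, mean_0, alpha))
    # chi-square(1) survival function via the complementary error function
    return math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
```

The extra-Poisson variance term `alpha * mean**2` is what distinguishes this from a Poisson test; setting alpha near zero recovers Poisson-like behavior.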

In our analysis we assumed that all genes on the array are potentially changed. This is realistic in case-control designs across populations or comparisons of cells at different stages of the cell cycle, where regulation can be expected in a large number of genes, although differences will vary randomly and many genes will be changed by amounts that are smaller than noise. In tightly controlled experiments, such as a comparison of wild-type versus mutant species, or treated versus untreated cell lines, differential expression may only involve a small number of pathways and genes. While this situation can be reasonably handled in the framework considered here, it would be more accurately modeled by assuming that only a fraction of the genes are differentially expressed across groups. A multilevel model could assume that a fraction of the δ_{g} are identically zero, while the rest are normally distributed [14].
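Such a point-mass mixture yields a simple per-gene posterior probability of differential expression. The sketch below assumes the sampling variance of the observed mean difference is known, and treats the mixing weight p and the signal variance tau2 as given hyperparameters (in practice they would be estimated from the genomic distribution):

```python
import math

def norm_pdf(x, var):
    """Density of a mean-zero normal with variance `var` at x."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_prob_de(d, s2, tau2, p):
    """Posterior probability that delta_g is nonzero, given the observed
    mean difference d with sampling variance s2, when a priori
    delta_g = 0 with probability 1 - p and delta_g ~ N(0, tau2) otherwise."""
    like_de = norm_pdf(d, tau2 + s2)   # marginal likelihood if differentially expressed
    like_null = norm_pdf(d, s2)        # marginal likelihood if delta_g = 0
    return p * like_de / (p * like_de + (1 - p) * like_null)
```

A small observed difference pulls the posterior probability below the prior weight p, while a difference large relative to both s2 and tau2 pushes it toward one, which is the behavior the spike-in setting calls for.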

Our focus here has been on simple and easy-to-compute statistics for gene selection. Multilevel models were used only to provide motivation and a conceptual framework for the derivation of the shrinkage statistics. More generally, multilevel models give rise to potentially more efficient strategies than those considered here, at the price of increased computational expense, for example via MCMC or MCEM. Systematic exploration of those in the thousands of data sets considered here would have been impractical. However, our results suggest that appropriate shrinkage is a critical part of gene selection, and this will hopefully encourage practitioners to consider these more computing-intensive approaches as well.

Data from Dudley, Aach, Steffen and Church (2002) was downloaded from http://arep.med.harvard.edu/masliner/supplement.htm. Data from Tusher, Tibshirani and Chu (2001) was downloaded from http://www-stat-class.stanford.edu/SAM/SAMServlet. The work of Giovanni Parmigiani was supported by grant NCI 5P30CA006516-49.

- Kohane IS, Kho A, Butte AJ (2002) Microarrays for an Integrative Genomics. MIT Press, Cambridge, MA, USA.
- Speed TP (2003) Statistical Analysis of Gene Expression Microarray Data. Chapman and Hall, London.
- Parmigiani G, Garrett ES, Irizarry RA, Zeger SL (2003) The analysis of gene expression data: an overview of methods and software. Springer, New York.
- Pan W (2002) A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18: 546–554.
- Murie C, Woody O, Lee AY, Nadon R (2009) Comparison of small n statistical tests of differential expression applied to microarrays. BMC Bioinformatics 10: 45.
- Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 98: 5116–5121.
- Robbins H (1956) An empirical Bayes approach to statistics. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability 1: 157–163.
- Efron B, Morris C (1973) Combining possibly related estimation problems (with discussion). Journal of the Royal Statistical Society, Series B 35: 379–421.
- Lindley DV, Smith AFM (1972) Bayes estimates for the linear model (with discussion). Journal of the Royal Statistical Society, Series B 34: 1–41.
- Bhattacharya S, Ho MH, Purkayastha S (2006) A Bayesian approach to modeling dynamic effective connectivity with fMRI data. Neuroimage 30: 794–812.
- Baldi P, Long AD (2001) A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics 17: 509–519.
- Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW (2001) On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology 8: 37–52.
- Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96: 1151–1160.
- Lonnstedt I, Speed T (2002) Replicated microarray data. Statistica Sinica 12: 31–46.
- Ibrahim JG, Chen MH, Gray RJ (2002) Bayesian models for gene expression with DNA microarray data. Journal of the American Statistical Association 97: 88–99.
- Parmigiani G, Garrett ES, Anbazhagan R, Gabrielson E (2002) A statistical framework for expression-based molecular classification in cancer. Journal of the Royal Statistical Society, Series B 64: 717–736.
- Wright GW, Simon RM (2003) A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics 19: 2448–2455.
- Raiffa H, Schlaifer R (1961) Applied Statistical Decision Theory. Harvard University Press, Boston.
- Ando A, Kaufman GM (1965) Bayesian analysis of the independent multinormal process – neither mean nor precision known. Journal of the American Statistical Association 60: 347–358.
- Gelman A, Carlin J, Stern H, Rubin D (1995) Bayesian Data Analysis. Chapman and Hall, London.
- Gilks WR, Richardson S, Spiegelhalter DJ (1996) Markov Chain Monte Carlo in Practice. Chapman and Hall, London.
- Muller P, Parmigiani G, Robert C, Rousseau J (2005) Optimal sample size for multiple testing: the case of gene expression microarrays. Journal of the American Statistical Association 99: 990–1001.
- Lin R, Louis TA, Paddock SM, Ridgeway G (2003) Loss function based ranking in two-stage, hierarchical models. Technical report 2003-6, Department of Biostatistics, Johns Hopkins University, USA.
- Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Statist Soc B 57: 289–300.
- Genovese C, Wasserman L (2001) Bayesian and frequentist multiple testing. Discussion paper, Department of Statistics, Carnegie Mellon University, USA.
- Storey JD (2002) A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B 64: 479–498.
- Storey JD, Tibshirani R (2003) SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In: Parmigiani G, Garrett ES, Irizarry R, Zeger SL (eds) The analysis of gene expression data: methods and software. Springer, New York.
- Kass RE, Raftery AE (1995) Bayes factors. Journal of the American Statistical Association 90: 773–795.
- Normand SLT, Glickman ME, Gatsonis CA (1997) Statistical methods for profiling providers of medical care: Issues and applications. Journal of the American Statistical Association 92: 803–814.
- National Research Council, Panel on Discriminant Analysis, Classification and Clustering (1988) Discriminant Analysis and Clustering. National Academy Press, Washington, DC.
- Pepe MS, Longton G, Anderson GL, Schummer M (2003) Selecting differentially expressed genes from microarray experiments. Biometrics 59: 133–142.
- Dudley AM, Aach J, Steffen MA, Church GM (2002) Measuring absolute expression with microarrays with a calibrated reference sample and an extended signal intensity range. Proc Natl Acad Sci U S A 99: 7554–7559.
- Dudoit S, Yang J (2003) Bioconductor R packages for exploratory analysis and normalization of cDNA microarray data. In: Parmigiani G, Garrett E, Irizarry R, Zeger S (eds) The Analysis of Gene Expression Data: Methods and Software. Springer Verlag, New York.
- Dudoit S, Yang YH, Callow MJ, Speed TP (2002) Statistical methods for identifying genes with differential expression in replicated cDNA microarray experiments. Statistica Sinica 12: 111–139.
- Newton MA, Kendziorski CM (2003) Parametric empirical Bayes methods for microarrays. In: The analysis of gene expression data: methods and software. Springer, New York.
- Kerr MK, Martin M, Churchill GA (2000) Analysis of variance for gene expression microarray data. J Comput Biol 7: 819–837.
- Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H, et al. (2001) Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol 8: 625–637.
- Kooperberg C, Sipione S, LeBlanc M, Strand AD, Cattaneo E, et al. (2002) Evaluating test statistics to select interesting genes in microarray experiments. Hum Mol Genet 11: 2223–2232.
- Tai YC, Speed TP (2009) On gene ranking using replicated microarray time course data. Biometrics 65: 40–51.
- Wang XV, Verhaak RG, Purdom E, Spellman PT, Speed TP (2011) Unifying gene expression measures from multiple platforms using factor analysis. PLoS One 6: e17691.
- Conlon EM (2008) A Bayesian mixture model for meta-analysis of microarray studies. Funct Integr Genomics 8: 43–53.
- Scharpf R, Tjelmeland H, Parmigiani G, Nobel AB (2009) A Bayesian model for cross-study differential gene expression. J Am Stat Assoc 104: 1295–1310.
- Kvam VM, Liu P, Si Y (2012) A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am J Bot 99: 248–256.
- Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, et al. (2013) Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol 14: R95.
- Love M, Anders S, Huber W (2014) Differential analysis of count data: the DESeq2 package.
