Department of Mathematics and Statistics, University of the West Indies, St. Augustine Campus, Trinidad and Tobago
Received Date: April 1, 2017; Accepted Date: April 24, 2017; Published Date: April 28, 2017
Citation: Dialsingh I, Cedeno SP (2017) Comparison of Methods for Estimating the Proportion of Null Hypotheses π0 in High Dimensional Data When the Test Statistics is Continuous. J Biom Biostat 8: 343. doi: 10.4172/2155-6180.1000343
Copyright: © 2017 Dialsingh I, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Advances in genomics have re-energized interest in multiple hypothesis testing procedures but have simultaneously created new methodological and computational challenges. In genomics, for instance, it is now commonplace for experiments to measure expression levels of thousands of genes, creating large multiplicity problems when thousands of hypotheses are tested simultaneously. Within this context we seek to identify differentially expressed genes, that is, genes whose expression levels are associated with a particular response or covariate of interest. The False Discovery Rate (FDR) is the preferred error measure since Family-Wise Error Rates (FWERs) are usually overly restrictive. FDR methods require estimation of the proportion of true null hypotheses (π0), an important parameter. In this paper, we compare the effectiveness of 12 methods for estimating π0 when the test statistics are continuous, using simulated data with independence, weak dependence, and moderate dependence structures.
Family-wise error rate; False discovery rate; High dimensional data; Hypothesis testing; Multiple hypothesis testing; Proportion of null hypotheses
A hypothesis is a claim or assertion about a population parameter. In the case of a single hypothesis, we typically test the null hypothesis H0 versus an alternative hypothesis Ha based on some test statistic. We reject H0 in favor of Ha whenever the test statistic lies in the rejection region specified by some rejection rule. The test statistic is a function of the sample data on which the decision to reject or not to reject H0 is based. Within the realm of hypothesis testing, there are two possible types of errors that can be committed: a Type I error is committed if we reject H0 when H0 is true, while a Type II error occurs when we fail to reject H0 when H0 is false. Table 1 summarizes the error possibilities.
|         | Declared True          | Declared False         |
|H0 True  | Correct Decision (1-α) | Type I Error (α)       |
|H0 False | Type II Error (β)      | Correct Decision (1-β) |
Table 1: Possible outcomes for a single hypothesis test.
The problem associated with single hypothesis testing is compounded in the realm of multiple hypothesis testing. Multiple hypothesis testing is concerned with controlling the rate of false positives when we are testing multiple hypotheses simultaneously. When conducting multiple hypothesis tests, the probability of making at least one Type I error is substantially higher than the nominal level used for each test, particularly when the number of total tests, m, is large.
It is not unusual for thousands of tests to be performed simultaneously, in which case the probability of obtaining at least one significant result by chance alone approaches 1. Methods of multiple hypothesis testing therefore adjust α in some way so that the probability of observing at least one significant event, due largely to chance, remains below the chosen level of statistical significance.
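The inflation described above can be made concrete with a short calculation. As an illustrative sketch (not from the paper), for independent tests each at level α the probability of at least one false positive is 1 − (1 − α)^m:

```python
def fwer(alpha: float, m: int) -> float:
    """Probability of at least one Type I error among m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

# With alpha = 0.05 the FWER climbs toward 1 as m grows:
for m in (1, 10, 100, 1000):
    print(m, round(fwer(0.05, m), 4))
```

Already at m = 100 tests the chance of at least one false positive exceeds 0.99, which is why the per-test level must be adjusted.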
Tests are typically assumed to be independent, or that the test statistics are independent and identically distributed, but there are methods and procedures that deal with cases of dependence. We assume that we are testing m independent null hypotheses, H01, H02, … ,H0m with corresponding p-values p1, p2,.., pm, and we call the ith hypothesis “significant'' if we reject the null hypothesis H0i. Table 2 summarizes the possible configurations when testing m hypotheses simultaneously. In this table, V is the number of false rejections (or false discoveries), U is the number of true non-rejections (or true acceptances), S is the number of true rejections, and T is the number of false non-rejections. Here m0, the total number of true null hypotheses, is fixed but unknown. Though random variables V, S, U, and T are not observable, the random variables R=S+V and W=U+T, the number of significant and insignificant tests, respectively, are observable. The proportion of false rejections is V/R when R>0 and the proportion of false acceptances is T/W when W>0.
|         | Not Significant | Significant | Total  |
|H0 True  | U               | V           | m0     |
|H0 False | T               | S           | m - m0 |
|Total    | W               | R           | m      |
Table 2: Possible outcomes for m hypothesis tests.
Using this notation, the FWER rate which is the Probability of making at least one type I error among the m tests is given by:
FWER = Prob(V ≥ 1) = 1 - Prob(V = 0) (1)
The False Discovery Rate (FDR), defined as the expected proportion of false rejections (taken to be zero when no rejections are made), is given by:

FDR = E[V/R | R > 0] Prob(R > 0) (2)
Many methods have been created by modifying the FWER concepts; a review of these techniques is found in [1,2]. The majority of these methods involve the use of step-up or step-down procedures [3-5]. Controlling the FWER results in tests with low power. The FDR is less restrictive and allows a certain number of false positives. A number of methods have been established for estimating the proportion of null hypotheses when the test statistics are discrete. In this paper, we compare 12 methods for estimating the proportion of null hypotheses π0 when the test statistics are continuous, under varying dependence structures.
Methods for controlling false discovery rate
Modern approaches in multiple hypothesis testing focus on controlling the False Discovery Rate (FDR), most commonly through a simple step-up procedure applied to the ordered p-values of the tests. The false discovery rate originated in two papers that dealt with multiple hypothesis testing. Schweder and Spjotvoll first suggested that the ranked p-values be plotted. The assessment of the number of true null hypotheses m0 would then be done via an eye-fitted straight line starting from the largest p-values. The p-values that deviate (outliers) from this straight line would then correspond to the false null hypotheses. The density of the p-values can be expressed as:
f(p)=π0 f0(p) + (1- π0) f1(p) (4)
where π0 and π1 = 1 − π0 represent the proportions of p-values under the null density f0 and the alternative density f1, respectively. For continuous test statistics, p-values are uniformly distributed on the interval (0,1) under the null hypothesis, so f0(p) = 1. The distribution of the p-values under the alternative hypothesis is unknown. Methods for estimating π0 when the test statistics are continuous have been developed by coupling the mixture model with the assumption that either the density of the marginal p-values, f(p), or the density of the p-values under the alternative, f1(p), is non-increasing.
It has been shown that when the test statistics are continuous and independent, the linear step-up procedure controls the FDR at level π0 q, where π0 is the proportion of true null hypotheses. The procedure treats each of the m hypotheses as null, so it implicitly takes π0 = 1. Many other methods have since been created; we briefly describe the 12 methods on which we base our comparisons.
Storey’s Smoother Method
The proposed estimator is:

π̂0(λ) = #{pi > λ} / (m(1 − λ)) (6)

where λ ∈ [0,1) is called a tuning parameter. For λ beyond about 0.5 the histogram of the p-values is approximately flat, which suggests that the p-values there are mostly null. Taking λ=0 gives π̂0 = 1, which is overly conservative, and as λ → 1 equation (6) indicates that the variance of π̂0(λ) increases, which makes the estimated q-values unreliable.
The general algorithm for estimating q-values from the p-values is:

1. Order the m p-values according to their evidence against the null hypotheses, i.e. p(1) ≤ p(2) ≤ … ≤ p(m).
2. Compute π̂0(λ) for a range of values of λ (e.g. λ = 0, 0.01, …, 0.95) using equation (6).
3. Let f̂ be a natural cubic spline with 3 degrees of freedom of π̂0(λ) on λ, and set the estimate of π0 to be π̂0 = f̂(1).
4. For a threshold t, 0 < t ≤ 1, calculate q̂(p(m)) = min over t ≥ p(m) of π̂0 m t / #{pj ≤ t} = π̂0 p(m). (7)
5. For i = m−1, m−2, …, 1 calculate q̂(p(i)) = min(π̂0 m p(i)/i, q̂(p(i+1))). (8)
6. Equation (8) gives the estimated q-value for the ith most significant feature.
This method is referred to as “smoother” throughout this paper.
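The estimator and the q-value recursion above can be sketched briefly. This is a hedged illustration, not the full published procedure: the spline-smoothing step is replaced by evaluating π̂0 at a single λ for brevity.

```python
import numpy as np

def pi0_lambda(pvals, lam):
    """Equation (6): pi0_hat(lambda) = #{p_i > lambda} / (m * (1 - lambda))."""
    pvals = np.asarray(pvals, dtype=float)
    return float(np.mean(pvals > lam)) / (1.0 - lam)

def qvalues(pvals, pi0):
    """Step-down recursion: q(p_(m)) = pi0 * p_(m), then
    q(p_(i)) = min(pi0 * m * p_(i) / i, q(p_(i+1)))."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = len(p)
    q = np.empty(m)
    q[-1] = pi0 * p[-1]
    for i in range(m - 2, -1, -1):      # 0-based index i is rank i + 1
        q[i] = min(pi0 * m * p[i] / (i + 1), q[i + 1])
    return q
```

The recursion guarantees the q-values are non-decreasing in the p-values, so rejecting all tests with q̂ below a cutoff is well defined.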
Storey’s bootstrap method
Storey’s bootstrap method resamples the m ordered p-values p(1) ≤ p(2) ≤ … ≤ p(m) with replacement to create B pseudo-data sets. The p-values are then calculated for each pseudo-data set and a bootstrap estimate of the FDR is formed. For each base test, a counter records whether the minimum p-value from the pseudo-data set is less than or equal to the actual p-value; for m tests there are m such counters. Over the B resamples, the proportion of pseudo-data sets whose minimum p-value is less than or equal to an actual p-value is that test's adjusted p-value. This method is able to incorporate all sources of correlation from both the multiple contrasts and the multivariate structure, and the resulting adjusted p-values reflect all correlations and distributional characteristics. The bootstrap method provides strong control when the joint distribution of the p-values for any subset of the null hypotheses is identical to that under the complete null hypothesis, that is, when the subset pivotality condition holds. It always provides weak control of the FWER. This method is referred to as “st.boot” throughout this paper.
Two-step procedure of Jiang and Doerge
This two-step procedure proposed by Jiang and Doerge increases the power of detecting differentially expressed genes in microarray data. The null hypothesis of equality of mean expression levels across all treatments is tested in step one. In the second step, pairwise comparisons are done on genes for which the treatment means were shown to be statistically significant in step one. This approach apportions the FDR used in the two steps in such a way that the overall FDR is controlled below a pre-specified significance level. The two-step approach has increased power over a one-step procedure and it controls the FDR at the desired level of significance. The algorithm for this two-step procedure is as follows:
1. Test the null hypothesis that a gene is not differentially expressed across the treatment conditions. This can be done using the global F-test from an ANOVA model, for instance. For the m tests corresponding to the m genes, an FDR control procedure is applied to control FDR at the α1 level; tests that produce p-values ≤ some specified c1 are considered statistically significant. If L tests are significant, the corresponding collection M of genes is declared to have statistically significant treatment effects and we go to step 2. If L=0, we stop and conclude that there are no statistically significant pairwise comparisons and therefore no differentially expressed genes.
2. For each gene in the collection M, perform N pairwise comparisons. Apply an FDR control procedure at the α2 level to these L*N tests. Any comparisons that produce p-values ≤ some specified c2 are considered statistically significant.
It should be noted that this procedure assumes that genes that show no significant treatment effect (step 1) will also show no statistically significant pairwise comparisons (step 2). However, a gene having a significant treatment effect may or may not have statistical significance in pairwise comparisons. This method is referred to as “jiang” throughout this paper.
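The two-step idea can be sketched with a minimal interface. This is an illustrative reconstruction, not the authors' code; the linear step-up (Benjamini-Hochberg) rule is assumed as the FDR control procedure in both steps, and `pairwise_pvals` is a hypothetical mapping from gene index to its pairwise-comparison p-values.

```python
import numpy as np

def bh_reject(pvals, alpha):
    """Indices of hypotheses rejected by the linear step-up rule at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = int(np.max(np.nonzero(below)[0]))   # largest i with p_(i) <= i*alpha/m
    return order[: k + 1]

def two_step(global_pvals, pairwise_pvals, alpha1, alpha2):
    """Step 1: screen genes via their global-test p-values at level alpha1.
    Step 2: FDR control on the pairwise p-values of the surviving genes."""
    keep = bh_reject(global_pvals, alpha1)
    if keep.size == 0:
        return keep, {}
    results = {int(g): bh_reject(pairwise_pvals[int(g)], alpha2) for g in keep}
    return keep, results
```

Genes screened out in step 1 never enter step 2, which is exactly the assumption noted above: no significant treatment effect implies no significant pairwise comparison.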
Nettleton’s Histogram method
Nettleton’s Method  estimates π0 by estimating the proportion of the observed p-values that follow a uniform distribution. The steps in the algorithm are as follows:
i. Partition the interval [0,1] into B bins of equal width.
ii. Assume all null hypotheses are true, and set the initial estimate to m̂0 = m.
iii. Calculate the expected number of p-values in each bin given the current estimate of the number of true null hypotheses (m̂0/B per bin).
iv. Beginning with the leftmost bin, sum the number of p-values in excess of the expected count until a bin with no excess is reached.
v. Use the excess sum as an updated estimate of m1, and then use that to update the estimate m̂0 = m − m1.
vi. Return to step (iii) and repeat the procedure until convergence is reached.
The number of bins is a tuning parameter and the authors suggested using B=20. In their simulation study, the histogram-based estimator with 20 bins tended to be conservative for the independent and autoregressive correlation structures considered, in that it tended to overestimate the number of true null hypotheses. This method is referred to as “histo” throughout this paper.
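The iteration above is short enough to sketch directly. This is an illustrative re-implementation of the described steps (B = 20 bins as suggested), not the authors' code:

```python
import numpy as np

def histogram_pi0(pvals, bins=20, tol=1e-8, max_iter=1000):
    """Iterative histogram estimator of pi0 as described in the text."""
    m = len(pvals)
    counts, _ = np.histogram(pvals, bins=bins, range=(0.0, 1.0))
    m0 = float(m)                        # start: all hypotheses null
    for _ in range(max_iter):
        expected = m0 / bins             # expected null count per bin
        excess = 0.0
        for c in counts:                 # left to right while bins overflow
            if c <= expected:
                break
            excess += c - expected
        new_m0 = m - excess              # excess ~ m1, so m0 = m - m1
        if abs(new_m0 - m0) < tol:
            break
        m0 = new_m0
    return m0 / m
```

With purely uniform p-values no bin overflows, so the estimate stays at 1; a spike of small p-values in the leftmost bins drives the estimate down.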
Convex decreasing p-value estimate (Langaas)
Langaas and Lindqvist’s estimator is based on the non-parametric MLE of the p-value density. The authors restricted their attention to decreasing and convex decreasing densities because of their many mathematically attractive properties. The derived estimator of π0 assumes independence of the test statistics, but was found to be robust to this assumption and also worked well for test statistics that showed moderate levels of dependence. The aim is to derive an NPMLE (non-parametric maximum likelihood estimate) for a convex decreasing density on [0, 1). The mixture density given in (4) is modified by requiring that the alternative density f1 be a convex decreasing function with f1(1)=0. The algorithm is as follows:

i. Specify a convex decreasing initial function f̂(0).
ii. For j=0, 1, 2, …, given the current iterate f̂(j), determine the mixing density giving the most negative directional derivative.
iii. If no direction of decrease exists, the current iterate is optimal and we are done.
iv. If this optimality is not found, take as the next iterate f̂(j+1) the optimal convex combination of f̂(j) and that mixing density.
The procedure employed by this algorithm is similar to the “steepest descent” algorithm used to optimize a function on the Euclidean n-space . The next iterate in each step of the algorithm is the optimal convex combination of the current iterate and the mixing density that corresponds to the most negative directional derivative. This method is referred to as “langaas” throughout this paper.
Robust estimation (Pounds)
Pounds and Cheng’s estimator of π0 when the test statistics are continuous is given by:

π̂0 = min(1, 2p̄),

where p̄ is the average p-value for all m hypotheses and min(a,b) is the minimum of a and b. This estimator is biased upward, but the bias is small when the average p-value under the alternative is small or when π0 is close to 1. This method is referred to as “pounds” throughout this paper.
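The estimator reduces to a single line; a quick sketch:

```python
import numpy as np

def pounds_pi0(pvals):
    """Pounds-Cheng estimator: pi0_hat = min(1, 2 * mean(p)).
    Conservative (biased upward), since null p-values average 1/2."""
    return min(1.0, 2.0 * float(np.mean(pvals)))
```

The intuition is that uniform null p-values have mean 1/2, so twice the overall average p-value approximates the null proportion when alternative p-values sit near zero.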
Lowest slope method
This stepwise procedure, developed in ref., first estimates the value of m0 from the data, and this estimate is then used in the adaptive procedure itemized in the algorithm that follows:

i. Order the p-values p(1) ≤ p(2) ≤ … ≤ p(m) corresponding to the null hypotheses H(1), H(2), …, H(m).
ii. For level q and i=1, 2, …, m, compare each p(i) with iq/m. If no p(i) is found to be smaller, we do not reject any null hypotheses and stop.
iii. Compute the slopes Si = (1 − p(i))/(m + 1 − i).
iv. Starting with i=2, proceed as long as Si ≥ Si−1. When the first Sj < Sj−1 is found, stop and set m̂0 = min(⌈1/Sj⌉ + 1, m).
v. Then, starting with the largest p-value p(m), compare each p(i) with iq/m̂0 until we arrive at the first p-value satisfying p(i) ≤ iq/m̂0. We would reject all k hypotheses whose p-values are smaller than or equal to p(k).
This method is referred to as “abh” throughout this paper.
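The lowest-slope estimate of m0 can be sketched as follows. This is illustrative only: published descriptions vary slightly in the rounding convention for the final step, so the exact constant here is an assumption.

```python
import math
import numpy as np

def lowest_slope_m0(pvals):
    """Estimate m0 from the slopes S_i = (1 - p_(i)) / (m + 1 - i):
    stop at the first decrease S_j < S_(j-1) and
    set m0_hat = min(ceil(1/S_j) + 1, m)."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = len(p)
    slopes = (1.0 - p) / (m + 1 - np.arange(1, m + 1))
    for j in range(1, m):
        if slopes[j] < slopes[j - 1]:
            return min(m, math.ceil(1.0 / slopes[j]) + 1)
    return m  # slopes never decreased: estimate all hypotheses as null
```

When the p-values look uniform the slopes keep rising toward the right, no decrease is found, and the estimate stays at m, which matches the conservative behavior of "abh" reported in the results below.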
Sliding linear method
A sliding linear model (SLIM) has been proposed to estimate π0 in datasets with dependence structures. Whenever undetected dependence structures exist, they distort the distribution of the p-values, making any estimate of π0 less effective. The SLIM method of FDR estimation is based on a linear model transformed from the non-linear λ estimator of Storey. The method functions by partitioning the data into local dependence blocks. This data partitioning and optimization allows SLIM to utilize information from a broader range of p-value distributions for π0 estimation. The employed optimization scheme exploits the non-static relationship between the p-values and the q-values by minimizing the difference between the fractions of tests called significant by the p-value and q-value methods. SLIM handles the hidden dependence in the data without the need to empirically adjust the null p-value distributions. This method is referred to as “SLIM” throughout this paper.
Spacings LOESS histogram method

The Spacings LOESS histogram (SPLOSH) estimates the conditional false discovery rate (cFDR), the expected proportion of false positives conditioned on having k "significant" findings. SPLOSH is designed to be more stable than Storey’s approach, whose q-value is based on an estimator of the positive false discovery rate (pFDR) that can be unstable, and SPLOSH is applicable in a wider variety of settings than BUM. The SPLOSH algorithm is as follows:

i. Let p(1) ≤ p(2) ≤ … ≤ p(k) be the k ordered p-values, with adjusted ranks chosen to control the type I error rate.
ii. Compute the midpoint of each interval between adjacent order statistics.
iii. Apply the arc-sine transform to the ranked p-values for i=1, 2, …, k.
iv. Apply LOWESS (LOcally WEighted Scatter-plot Smoother) to the transformed points for j=1, …, u−1 to obtain an estimated curve.
v. Differentiate the fitted curve for j=1, …, u to obtain the estimated derivative of the CDF up to a unitizing constant c.
vi. Form an estimate of the PDF at pi for i=1, 2, …, k, using the trapezoidal rule on the sub-interval widths to estimate c.
vii. Form the estimated CDF, and for k=l+1, …, u obtain estimates using the trapezoidal rule in approximate integration.
viii. Take the minimum of the PDF estimates.
ix. For i=1, …, k, obtain the cFDR estimate by substitution.
x. For p-values that are equal to zero, use L'Hôpital's rule to justify the limiting value as an estimate of the cFDR.
xi. Define a monotone quantity based on the cFDR estimates r(i) for i=1, …, k.
This method is referred to as “splosh” throughout this paper.
Beta-uniform mixture (BUM) model

Mixture modeling of the p-value distribution was first proposed in ref. This approach models the p-value distribution as a mixture of a Uniform(0,1) distribution, corresponding to the true null hypotheses, and a Beta(α,β) distribution, corresponding to the false null hypotheses. The Beta-Uniform Mixture (BUM) model proposed by Pounds and Morris is a mixture of a Uniform(0,1) distribution and a Beta(α,1) distribution. The Beta distribution is chosen because of its flexibility in modeling distributions on the interval [0,1].
This mixture model assumes independence of gene expression levels across genes. Thus, under the null hypothesis the p-values are uniformly distributed on the interval [0,1] regardless of the statistical test being used, as long as the test is valid, and regardless of the size of the sample. Under the alternative hypothesis the distribution of the p-values tends to cluster closer to zero than one. By referring to the entire distribution of the p-values obtained in the sample we can then address whether there is statistically significant evidence that any of the genes under study exhibits a difference in expression across the groups. This is usually done by performing an omnibus test of whether the observed distribution of the p-values is significantly different from a uniform distribution. This method is referred to as “BUM” throughout this paper.
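The BUM density follows directly from the mixture in equation (4); a small sketch (parameter names are illustrative):

```python
def bum_density(p, pi0, a):
    """f(p) = pi0 * 1 + (1 - pi0) * a * p**(a - 1):
    a Uniform(0,1) null component plus a Beta(a, 1) alternative component.
    With 0 < a < 1 the Beta part piles up near p = 0, as expected for
    p-values from false null hypotheses."""
    return pi0 + (1.0 - pi0) * a * p ** (a - 1.0)
```

At p = 1 the Beta(a,1) component contributes a, so the fitted density at 1 is π0 + (1 − π0)a, which is commonly used as a conservative upper bound for π0.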
Grenander density estimator
The Grenander density estimator is a non-parametric maximum likelihood estimator (NPMLE) related to the empirical cumulative distribution function (ECDF), but with the added constraint that it corresponds to a decreasing underlying density. The Grenander estimator uses the ECDF as an estimator of the distribution function F(p); the density estimator is the decreasing piecewise-constant function equal to the slopes of the least concave majorant (LCM) of the ECDF. Weighted monotone regression is used to compute the LCM of the ECDF. If xi and γi denote the coordinates of the ECDF, with Δxi = xi+1 − xi and Δγi = γi+1 − γi, then the slopes of the LCM are obtained by antitonic regression of the raw slopes Δγi/Δxi with weights Δxi. This method is referred to as “strimmer” throughout this paper.
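The construction just described can be sketched with a weighted pool-adjacent-violators pass. This is illustrative, not the reference implementation:

```python
import numpy as np

def grenander_density(pvals):
    """Return (sorted p-values, slope of the LCM of the ECDF on each
    inter-point interval). The slopes form a decreasing step-function
    density estimate."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = len(p)
    dx = np.diff(np.concatenate(([0.0], p)))   # interval widths
    dy = np.full(m, 1.0 / m)                   # ECDF rises 1/m per point
    # pool adjacent violators: merge intervals while slopes increase
    blocks = []                                # each block: [sum_dx, sum_dy, count]
    for wx, wy in zip(dx, dy):
        blocks.append([wx, wy, 1])
        # cross-multiplied slope comparison avoids dividing by zero widths
        while len(blocks) > 1 and (blocks[-2][1] * blocks[-1][0]
                                   < blocks[-1][1] * blocks[-2][0]):
            wx2, wy2, c2 = blocks.pop()
            blocks[-1][0] += wx2
            blocks[-1][1] += wy2
            blocks[-1][2] += c2
    slopes = np.concatenate([[wy / wx] * c for wx, wy, c in blocks])
    return p, slopes
```

The last (smallest) slope is the density estimate near p = 1, which is the natural π0 estimate from this construction.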
Two stage adaptive procedure
Adaptive procedures first estimate the number of null hypotheses m0 and then use this estimate to revise a multiple test procedure, since knowledge of m0 can be used to improve the performance of an FDR controlling procedure. The underlying linear step-up procedure works with the m ordered p-values p(1) ≤ p(2) ≤ … ≤ p(m): let k be the largest i for which p(i) ≤ iq/m. If this k exists then we reject all k hypotheses associated with p(1) ≤ p(2) ≤ … ≤ p(k); otherwise we do not reject any of the hypotheses. If m0 were known, then the linear step-up procedure with q* = qm/m0 would control the FDR at precisely the desired level q in the independent and continuous case, and would then be more powerful in rejecting hypotheses for which the alternative holds. The adaptive Benjamini-Hochberg procedure at the level q is as follows:
1. Use the linear step-up procedure at level q, and if no hypothesis is rejected stop; otherwise, proceed.
2. Estimate m0(k) using m0(k) = (m + 1 − k)/(1 − p(k)).
3. Starting with k=2, stop when for the first time m0(k) > m0(k−1).
4. Estimate m̂0 = min(m0(k), m), rounding up to the next highest integer.
5. Use the linear step-up procedure with q* = qm/m̂0.

In the two-stage version, the linear step-up procedure is first applied at level q′ = q/(1 + q). If no hypotheses are rejected, stop; if all m hypotheses are rejected, reject all hypotheses; otherwise, set m̂0 = m − r1, where r1 is the number of first-stage rejections, and test the hypotheses using the linear step-up procedure at level q′m/m̂0.
This method is referred to as “TSBKY” throughout this paper.
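The two-stage procedure can be sketched compactly. This is a hedged, illustrative re-implementation of the two-stage idea described above, not the authors' code:

```python
import numpy as np

def bh_rejections(pvals, q):
    """Number of rejections of the linear step-up procedure at level q."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = len(p)
    below = p <= q * np.arange(1, m + 1) / m
    return int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0

def tsbky_rejections(pvals, q):
    """Stage one: step-up at q' = q / (1 + q); estimate m0_hat = m - r1;
    stage two: step-up at q' * m / m0_hat."""
    m = len(pvals)
    q1 = q / (1.0 + q)
    r1 = bh_rejections(pvals, q1)
    if r1 == 0:
        return 0            # nothing rejected at stage one: stop
    if r1 == m:
        return m            # everything rejected at stage one
    m0_hat = m - r1         # estimated number of true nulls
    return bh_rejections(pvals, q1 * m / m0_hat)
```

The stage-one rejections play the role of discovered alternatives, so m − r1 serves as the estimate of the number of true nulls feeding the stage-two level.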
The authors conducted a simulation study to compare the twelve (12) methods for estimating the proportion of null hypotheses π0 under varying levels of dependence. Simulations were drawn from a MVN (μ, Σ) distribution where μ is the mean vector and Σ is the covariance matrix using the R library MASS. The study involved simulating the expression values of 1000 “genes” from 20 samples. The first 10 samples for each “simulated expression value” came from the case subjects and the last 10 samples came from the control subjects. The simulations were done for π0 values of 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, and 1.00. The data was simulated for three cases of dependency within the genes:
i. Independent Genes - No dependence structure within or between the genes. In this case the leading diagonal of Σ contained all 1’s and the off-diagonal entries, denoted as ρ, were all zero. A 1000 x 12 matrix of p-values was generated, each column containing the 1000 p-values for one of the 12 values of π0. This procedure was replicated 1000 times. The p-value matrices were then used as input to each π0 estimation method being compared, and row means and standard deviations were computed. These results were then bound into one matrix per estimation method for plotting.
ii. Weak Dependence - In this case the leading diagonal of Σ contained all 1’s and the off-diagonal entries, denoted as ρ, were all 0.1. The simulation was done in the same manner as for independence.
iii. Moderate Dependence - In this case the leading diagonal of Σ contained all 1’s and the off-diagonal entries, denoted as ρ, were all 0.5.
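One draw of the simulation just described can be sketched as follows. This is an illustrative reconstruction in Python rather than the authors' R/MASS code, with the effect size an assumed parameter; the compound-symmetric correlation ρ across genes is generated via a shared standard normal factor rather than an explicit 1000 x 1000 covariance matrix.

```python
import numpy as np

def simulate_expression(m=1000, n_per_group=10, pi0=0.5, rho=0.1,
                        effect=1.0, seed=0):
    """One simulated data set: m genes by 2*n_per_group samples, the first
    n_per_group columns being cases.  corr(gene_i, gene_j) = rho for i != j
    (compound symmetry), achieved through a shared per-sample factor."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_group
    shared = rng.standard_normal(n)            # common factor per sample
    noise = rng.standard_normal((m, n))        # gene-specific noise
    x = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise
    m1 = int(round(m * (1.0 - pi0)))           # number of false nulls
    x[:m1, :n_per_group] += effect             # mean shift for case samples
    return x
```

Per-gene p-values would then come from a two-sample test on the case and control columns; those p-value matrices, one column per value of π0, feed the twelve estimators.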
All of the simulations were done using R version 3.2.3. Source codes for most of the estimation methods were available freely online.
In Figure 1 the solid black line that runs from bottom left to top right represents “the truth”, meaning that if the true value of π0 is 0.10 then the estimated value should correspond to 0.10. Departures either above or below this 45° line allow one to make graphical inferences about the performance of the proposed estimator. This graphical approach has the advantage of allowing easy visualization of the data and identification of trends and patterns that might not be readily obvious with tabular approaches. All of the methods compared tended to overestimate for low values of π0, and all of the estimators performed better as π0 increased; this is well documented in the literature. “SLIM” performed very well for small values of π0 but tended to underestimate as π0 increased. It was found that “smoother”, “st.boot”, “Langaas” and “histo” all performed very well for most values of π0. “TSBKY” and “abh” overestimated for every value of π0. For example, Table 3 shows that when π0 was 0.30, “TSBKY” gave an estimate of π̂0 = 0.805 with an estimated standard deviation of 0.022, while “abh” gave an estimate of π̂0 = 0.607 with an estimated standard deviation of 0.047.
Table 3: Mean (a) and Standard Deviation (b) for compared estimation methods using independent test statistics.
Figure 2 shows the comparison of all twelve methods for weakly dependent test statistics. Most of the estimators overestimated for small values of π0; however, “TSBKY” and “abh” significantly overestimated for most values of π0. For example, Table 4 shows that when π0 was 0.30, “TSBKY” gave an estimate of π̂0 = 0.814 with an estimated standard deviation of 0.125 while “abh” gave an estimate of π̂0 = 0.619 with an estimated standard deviation of 0.116. For weakly dependent test statistics “BUM” initially overestimated but performed better for π0 ≥ 0.60. With weak dependence in the test statistics, “smoother”, “st.boot”, “Langaas” and “histo” all performed very well for most values of π0.
Table 4: Mean (a) and Standard Deviation (b) for compared estimation methods using weak dependent test statistics.
Figure 3 shows the comparison of all twelve methods for moderately dependent test statistics. Most of the estimators overestimated for small values of π0; however, “TSBKY” and “abh” significantly overestimated for most values of π0. For example, Table 5 shows that when π0 was 0.30, “TSBKY” gave an estimate of π̂0 = 0.837 with an estimated standard deviation of 0.228 and “abh” gave an estimate of π̂0 = 0.668 with an estimated standard deviation of 0.249. In this case “smoother”, “st.boot”, and “jiang” performed better than most of the other estimators, even though they all had high standard deviations. A small standard deviation is desirable because it indicates that the estimates cluster tightly around their mean, with little spread. The table shows that all of the methods appear to perform very badly for π0=0.40, 0.60, 0.90, 0.99, and 1.00. This might be due to the methods ignoring the correlation effect among the test statistics.
Table 5: Mean (a) and Standard Deviation (b) for compared estimation methods using moderately dependent test statistics.
We conducted a simulation study to compare 12 methods for estimating the proportion of null hypotheses under different configurations of dependence. Dependence among genes exists in real data, and this paper shows which of the methods would be most appropriate under various dependence configurations. The estimation of the proportion of null hypotheses is important, especially for estimation of the false discovery rate. There is no doubt that this field will continue to evolve as datasets become larger, the demands on statistics become greater, and computing hardware and software improve.