Received date: November 02, 2012; Accepted date: November 05, 2012; Published date: November 12, 2012
Citation: Dai H, Charnigo R, Srivastava T, Talebizadeh Z, Ye SQ (2012) Integrating P-values for Genetic and Genomic Data Analysis. J Biom Biostat 3:e117. doi:10.4172/2155-6180.1000e117
Copyright: ©2012 Dai H, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Rapid developments in molecular technology have driven the evolution of methods in biostatistics and bioinformatics for identifying genetic variations associated with complex traits. A large amount of information has become accessible to investigators through Genome Wide Association Studies (GWAS), gene expression arrays, whole genome sequencing and other technologies.
The growing number of variants requires more statistical tests to be conducted in each analysis, which poses a “curse of dimensionality” for multiple testing correction methods. For instance, the false discovery rate (FDR) and its extensions are commonly used to adjust multiple individual tests, in order to control the family-wise Type I error [1,2]. Unfortunately, in large-scale hypothesis testing, these methods tend to yield low power to detect risk factors.
Global testing (also named omnibus testing) of p-values from numerous individual tests may combine evidence, and turn dimensionality from a curse into rich information. From a systems biology perspective, genes, cells, tissues and organs function as a system through metabolic networks and cell signal networks. In non-Mendelian inheritance such as complex disorders, a subset of variants may jointly confer moderate effects in mediating molecular activities. As a result, signals may not be significant in single marker-single trait analysis, but many such values from related genes might provide valuable information on gene function and regulation.
The global test is designed to evaluate the pattern (distribution) of p-values, instead of choosing p-values less than an arbitrary threshold. Therefore, this method has the potential to identify multiple genes with small effects. Assuming that all individual tests are independent and arise from genes with no effects, p-values are identically and independently distributed as Uniform(0,1). Taking this as a null hypothesis for the pattern of p-values in the global test, one can assess whether p-values, especially small p-values, are generated by chance. The global test of p-values is robust and can be applied to p-values from a t-test, an ANOVA, a linear mixed model, and so forth. Multiple simulation studies and case studies have demonstrated that the approach usually has sufficient power to detect signals of genetic association from a group of genes.
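As a minimal sketch of testing the pattern of p-values against the Uniform(0,1) null, the block below computes a one-sided Kolmogorov-Smirnov distance that grows when small p-values are in excess. The function names, the Monte Carlo calibration, and the default of 2000 simulations are illustrative assumptions rather than a procedure taken from the cited work, and the sketch assumes independent p-values.

```python
import random


def ks_plus(pvalues):
    """One-sided KS distance D+ = max_i (i/m - p_(i)).

    D+ is large when the empirical distribution of the p-values lies
    above the Uniform(0,1) CDF, i.e. when small p-values are in excess.
    """
    ordered = sorted(pvalues)
    m = len(ordered)
    return max((i + 1) / m - p for i, p in enumerate(ordered))


def global_pvalue_ks(pvalues, n_sim=2000, seed=0):
    """Monte Carlo p-value of D+ under the global null.

    Assumption: the m p-values are independent, so the null is simulated
    by drawing m i.i.d. Uniform(0,1) values per replicate.
    """
    rng = random.Random(seed)
    m = len(pvalues)
    observed = ks_plus(pvalues)
    exceed = sum(
        ks_plus([rng.random() for _ in range(m)]) >= observed
        for _ in range(n_sim)
    )
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_sim + 1)
```

A small returned value suggests that the observed excess of small p-values is unlikely to have arisen by chance under the global null.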
Combination of p-values into a sum or product has long been used by evolutionary biologists in meta-analysis. Many methods can be expressed in the form $T = \sum_{i} H(p_i)$, where the p-values $p_i$ might first be transformed by a function H. Early researchers explored a raw sum of p-values and sums with various transformations, including the log transformation, inverse normal transformation, inverse gamma transformation, logit transformation, and a count of p-values less than a threshold. Some classic methods include Fisher’s method, the Z-test, and Lancaster’s procedures. Extensive Monte Carlo comparisons have been conducted for independent and correlated p-values. The classic methods yield simple limiting distributions when p-values follow the identical and independent uniform distribution under the global null hypothesis. One can also combine p-values using the product method [9,10]. By taking the log transformation of the product of p-values, the product method becomes a special case of the sum of log-transformed p-values.
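Two of the classic sum-type combinations can be sketched as follows, assuming independent p-values strictly between 0 and 1. Fisher’s statistic, minus twice the sum of log p-values, is referred to a chi-square distribution with 2m degrees of freedom, and the unweighted Z-test sums inverse-normal transformed p-values. The function names are illustrative; the chi-square tail uses the closed form available for even degrees of freedom so that only the standard library is needed.

```python
import math
from statistics import NormalDist


def chi2_sf_even_df(x, df):
    """Upper tail of a chi-square distribution with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k          # term = (x/2)^k / k!
        total += term
    return math.exp(-half) * total


def fisher_combine(pvalues):
    """Fisher's method: T = -2 * sum(log p_i) ~ chi-square(2m) under the null."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return stat, chi2_sf_even_df(stat, 2 * len(pvalues))


def stouffer_combine(pvalues):
    """Unweighted Z-test: Z = sum(Phi^{-1}(1 - p_i)) / sqrt(m) ~ N(0,1)."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in pvalues) / math.sqrt(len(pvalues))
    return z, 1.0 - nd.cdf(z)
```

With a single p-value, both functions return that p-value unchanged, which is a quick sanity check on the transformations.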
Order-based approaches are another category of global testing for p-values. Tippett’s procedure assesses the minimal p-value. Simulation studies show that this approach has well controlled Type I error for both independent and correlated data, but will reduce power to identify multiple genes with small effects. Wilkinson extended Tippett’s procedure to the k smallest p-values. By expanding $(\alpha + (1 - \alpha))^m$, where m is the total number of individual tests, tables of the incomplete beta function can be used to obtain the probability of tests with p-values less than $\alpha$. Furthermore, empirical distributions of p-values can be calculated and compared to the uniform distribution. These tests include the positive-side Kolmogorov-Smirnov test, the positive-side Cramer-von Mises test, the newly developed order-based approach that accounts for ordering of p-values under the alternative hypothesis, and the higher criticism method to detect sparse signals.
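The first two order-based procedures reduce to short formulas under independence, sketched below. For Tippett’s procedure the combined p-value is one minus the probability that all m null p-values exceed the observed minimum. For the threshold-based reading of Wilkinson’s procedure described above, the combined p-value is the upper tail of the binomial expansion: the probability that at least k of m null p-values fall at or below a fixed threshold. The function names and the default threshold of 0.05 are illustrative assumptions.

```python
import math


def tippett(pvalues):
    """Tippett's procedure: P(min of m uniform p-values <= observed min)."""
    m = len(pvalues)
    return 1.0 - (1.0 - min(pvalues)) ** m


def wilkinson(pvalues, alpha=0.05):
    """Binomial tail: P(at least k of m null p-values <= alpha).

    k is the observed count of p-values at or below alpha; each term is
    one summand of the expansion of (alpha + (1 - alpha))^m.
    """
    m = len(pvalues)
    k = sum(p <= alpha for p in pvalues)
    return sum(
        math.comb(m, j) * alpha ** j * (1.0 - alpha) ** (m - j)
        for j in range(k, m + 1)
    )
```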
Recent developments have focused on introducing weight functions and truncation to increase power, as well as on developing global tests for genetic analysis. For instance, a rank truncated method that combines the first k ordered p-values, and a truncated product method that combines p-values smaller than a specified threshold, have recently been developed and applied in large scale genomics experiments. Later, an adaptive rank truncated product method was proposed and applied in GWAS. In Yu et al., permutation testing was used to determine the optimal number k of smallest p-values for a product test. In Hess and Iyer, Fisher’s method was extended to Affymetrix gene expression arrays and shown to be a suitable diagnostic tool for exploratory analysis of microarray data. The combined p-value method was shown to compare favorably with competing methods in validated microarray data analyses.
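The two truncated statistics can be sketched on the log scale as below. Note the substitution: the published methods calibrate the statistic analytically or by permuting the original data, whereas this sketch draws independent Uniform(0,1) p-values, which is only appropriate when the individual tests are independent. The function names, the threshold tau, the rank cutoff k, and the number of simulations are illustrative assumptions.

```python
import math
import random


def log_truncated_product(pvalues, tau=0.05):
    """Log of the product of all p-values at or below the threshold tau."""
    return sum(math.log(p) for p in pvalues if p <= tau)


def log_rank_truncated_product(pvalues, k=5):
    """Log of the product of the k smallest p-values."""
    return sum(math.log(p) for p in sorted(pvalues)[:k])


def monte_carlo_pvalue(pvalues, statistic, n_sim=2000, seed=0):
    """Estimate P(statistic <= observed) under independent uniform p-values.

    Smaller (more negative) log-products are more extreme.
    """
    rng = random.Random(seed)
    m = len(pvalues)
    observed = statistic(pvalues)
    hits = 0
    for _ in range(n_sim):
        # 1 - random() lies in (0, 1], which avoids log(0).
        null = [1.0 - rng.random() for _ in range(m)]
        if statistic(null) <= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)
```

A different threshold or rank cutoff can be supplied by wrapping the statistic, for example `monte_carlo_pvalue(pvals, lambda ps: log_rank_truncated_product(ps, k=10))`.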
Efforts have also been made to cope with complex correlations among p-values. In expression quantitative trait loci (eQTL) analysis to identify genotype and phenotype associations, researchers have observed strong correlations among multiple tests due to linkage disequilibrium and functional interactions among single nucleotide polymorphisms (SNPs). To address this issue, Fisher’s method was modified to incorporate correlations among p-values, and then Satterthwaite’s approximation was used to derive the limiting distribution of the test statistic under the global null hypothesis. Similarly, the weighted Z-test has been modified to include correlations and has been applied in shared controls designs in GWAS.
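A minimal sketch of a weighted Z-test that accounts for correlation is given below. It assumes that the correlation matrix among the individual z-scores is known or has been estimated elsewhere; under the global null the weighted sum of z-scores then has variance equal to the sum of w_i * w_j * r_ij over all pairs, which replaces the usual sum of squared weights. The function name and argument layout are illustrative and not taken from the cited work.

```python
import math
from statistics import NormalDist


def correlated_weighted_z(pvalues, weights, corr):
    """Weighted Z-test with a correlation-adjusted variance.

    corr[i][j] is the assumed correlation between z-scores i and j
    (1.0 on the diagonal). With an identity matrix this reduces to the
    ordinary weighted Z-test for independent p-values.
    """
    m = len(pvalues)
    if len(weights) != m or len(corr) != m:
        raise ValueError("pvalues, weights and corr must have matching sizes")
    nd = NormalDist()
    z = [nd.inv_cdf(1.0 - p) for p in pvalues]
    numerator = sum(w * zi for w, zi in zip(weights, z))
    variance = sum(
        weights[i] * weights[j] * corr[i][j]
        for i in range(m)
        for j in range(m)
    )
    z_combined = numerator / math.sqrt(variance)
    return z_combined, 1.0 - nd.cdf(z_combined)
```

When two tests are perfectly correlated, the combined p-value equals the individual one, so no evidence is double counted; ignoring the correlation would instead give a smaller, anti-conservative p-value.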
Modeling p-values using analytic distributions has also started to show promise. A beta mixture model has been proposed to model p-values that might come from a combination of null and alternative hypotheses for individual genes. A modified likelihood ratio test and a D-test were then proposed to test homogeneity in the mixture model. In Dudbridge and Koeleman, extreme-value distributions for fixed numbers of combined evidence and a beta distribution for the most significant evidence are shown to be accurate and efficient for large exploratory studies. Analytic modeling may provide a deeper level of insight into properties of p-values. For instance, a mixture model of p-values may not only suggest the existence of overall signals, but also measure the proportion of variants associated with a phenotype, as well as the strength of association effects.
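To illustrate how a mixture model can measure both the proportion of null variants and the strength of the signal, the sketch below fits a deliberately simplified two-component model: a Uniform(0,1) null component with weight pi0 and a single Beta(a, 1) alternative component with 0 < a < 1, whose density concentrates near zero. This simplification, the grid search, and the function names are assumptions made for illustration; the sketch does not implement the modified likelihood ratio test or the D-test mentioned above.

```python
import math


def mixture_loglik(pvalues, pi0, a):
    """Log-likelihood of pi0 * Uniform(0,1) + (1 - pi0) * Beta(a, 1).

    The Beta(a, 1) density is a * p^(a - 1); p-values must be > 0.
    """
    return sum(
        math.log(pi0 + (1.0 - pi0) * a * p ** (a - 1.0)) for p in pvalues
    )


def fit_mixture(pvalues, n_grid=20):
    """Grid-search maximum likelihood over pi0 in [0, 1] and a in (0, 1).

    Returns (pi0, a). pi0 estimates the proportion of null p-values;
    smaller a means a stronger shift of the alternative toward zero.
    When pi0 is 1.0 the returned a carries no information.
    """
    best_ll, best_pi0, best_a = float("-inf"), 1.0, 1.0
    for i in range(n_grid, -1, -1):       # pi0 = 1.0 is tried first
        pi0 = i / n_grid
        for j in range(1, n_grid):
            a = j / n_grid
            ll = mixture_loglik(pvalues, pi0, a)
            if ll > best_ll:
                best_ll, best_pi0, best_a = ll, pi0, a
    return best_pi0, best_a
```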
Below we describe two major trends in application of the combined evidence approach to complex genetic data analysis.
Global tests can filter out genes with no association and direct researchers to a smaller part of the genome. Filtration is a critical process in current genetic data analyses to remove noise, irrelevant variants and weak signals. Removing genes using arbitrary cutoff values (such as fold change > 1.5 or p-value < 0.05) might increase bias in gene selection. We advocate incorporating global tests into the gene filtration process. Essentially, one can group genes into gene sets based on biological information, such as pathways or functional networks. Global tests of p-values are then performed in the various gene sets to detect whether overall signals exist. Gene sets with no overall signals are removed, which greatly reduces the dimensionality.
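The filtration step can be sketched as follows, using Fisher’s method as the global test within each gene set. The choice of Fisher’s method, the significance level, and all gene and set names in the usage example are hypothetical; any global test that returns a p-value could be substituted, and the sketch assumes independent p-values within a set.

```python
import math


def fisher_pvalue(pvalues):
    """Fisher's combined p-value via the closed-form chi-square(2m) tail."""
    half = -sum(math.log(p) for p in pvalues)   # half of the Fisher statistic
    term = 1.0
    total = 1.0
    for k in range(1, len(pvalues)):
        term *= half / k
        total += term
    return math.exp(-half) * total


def filter_gene_sets(gene_pvalues, gene_sets, alpha=0.05):
    """Keep only the gene sets whose global test suggests an overall signal.

    gene_pvalues maps gene name -> individual p-value;
    gene_sets maps set name -> list of gene names.
    Genes without a p-value are skipped.
    """
    kept = {}
    for name, genes in gene_sets.items():
        pvals = [gene_pvalues[g] for g in genes if g in gene_pvalues]
        if pvals and fisher_pvalue(pvals) <= alpha:
            kept[name] = genes
    return kept


# Hypothetical usage: only set_a shows an overall signal.
example_pvalues = {"g1": 0.001, "g2": 0.01, "g3": 0.6, "g4": 0.8}
example_sets = {"set_a": ["g1", "g2"], "set_b": ["g3", "g4"]}
retained = filter_gene_sets(example_pvalues, example_sets)
```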
A global test of p-values can also be used to select the optimal number of genes for a final analysis. For instance, if an auxiliary measure can be used to rank the genetic variants and this auxiliary measure is independent of the global test, then the global test can be used to find a cutoff for the auxiliary measure and select the optimal number of genetic variants for the final analysis. In MDR analysis (the method for gene-gene interaction), several filtration algorithms (such as SURF, TuRF, and ReliefF) have been developed to rank SNPs based on efficiency and redundancy. Then, global tests can be used to determine the optimal cutoff points for these measures and select the optimal number of genes. The global test and ReliefF combined filtration approach has been applied to a candidate gene study of drug response in Juvenile Idiopathic Arthritis, and has identified gene-gene interaction in the folate pathway.
Pathway analysis is a field of study to detect a wide range of molecular entities which regulate specific cell functions, metabolic processes and biosynthesis. In Traditional Pathway Analysis (TPA), adjusted cutoffs of fold changes or p-values are used to select significant individual genes (step 1). Next, one tests whether the significant individual genes are over-represented in pathways (step 2). However, the bias and random error in individual gene selections may severely impact subsequent steps of TPA. We suggest incorporating global testing into pathway analysis and reversing the aforementioned two steps by first detecting significant pathways, and then detecting significant genes in the significant pathways, as illustrated in Figure 1. By switching to this omnibus testing based pathway analysis (OPA), the number of multiple tests is dramatically reduced from $\sim 10^5$ to $\sim 10^2$.
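The reversed two-step order can be sketched as below. The sketch makes several assumptions not specified in the text: Tippett’s procedure is used as the default global test, a Bonferroni correction is applied over the pathways in the first stage and over the genes of the retained pathways in the second stage, and every gene in a pathway has a p-value. The function and variable names are hypothetical.

```python
def tippett(pvalues):
    """Global p-value based on the smallest of m independent p-values."""
    return 1.0 - (1.0 - min(pvalues)) ** len(pvalues)


def omnibus_pathway_analysis(gene_pvalues, pathways, alpha=0.05,
                             global_test=tippett):
    """Two-stage sketch: significant pathways first, then genes within them.

    gene_pvalues maps gene name -> p-value; pathways maps pathway name
    -> list of gene names. Returns (significant pathways, significant genes).
    """
    # Stage 1: one global test per pathway, Bonferroni over pathways.
    pathway_p = {
        name: global_test([gene_pvalues[g] for g in genes])
        for name, genes in pathways.items()
    }
    cutoff = alpha / len(pathways)
    sig_pathways = [name for name, p in pathway_p.items() if p <= cutoff]

    # Stage 2: individual tests only for genes in significant pathways,
    # Bonferroni over that (much smaller) set of candidate genes.
    candidates = sorted({g for name in sig_pathways for g in pathways[name]})
    sig_genes = [
        g for g in candidates if gene_pvalues[g] <= alpha / len(candidates)
    ]
    return sig_pathways, sig_genes
```

Because the second stage corrects only for genes in the retained pathways, the number of tests that must be adjusted for is the number of pathways plus the number of candidate genes, rather than the number of all genes on the platform.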
Fisher’s method was shown to be asymptotically Bahadur optimal and efficient, assuming p-values are independent. However, there is no uniformly most powerful method of combining p-values. Moreover, accounting for correlations among p-values represents a major challenge to applying global methods that were originally designed based on independence assumptions. Using methods that are designed for correlated data will effectively prevent inflation of Type I error due to complex correlation structures. More ground-breaking theoretical work is needed to develop global tests of p-values that account for such correlation structures.