Methods for Identifying Differentially Expressed Genes: An Empirical Comparison

Microarray technology, which observes thousands of gene expressions at once, is one of the popular topics in recent decades. When it comes to the analysis of microarray data to identify differentially expressed (DE) genes, many methods have been proposed and modified for improvement. However, the most popular methods such as Significance Analysis of Microarrays (SAM), samroc, fold change, and rank product are far from perfect. In order to determine which method is most powerful, it comes down to the characteristics of the sample and distribution of the gene expressions. The most practiced method is usually SAM or samroc but when the data tends to be skewed, the power of these methods decreases. With the concept that the median becomes a better measure of central tendency than the mean when the data is skewed, the test statistics of the SAM and fold change methods are modified in this paper. This study shows that the median modified fold change method improves the power for many cases when identifying DE genes if the data follows a lognormal distribution.


Introduction
Microarray technology has allowed researchers to observe thousands of gene expressions all at once. Gene expression in cells is of relevance because it allows a way to pinpoint disease markers that are related to medical treatments [1]. A job that many researchers may want to perform would be to identify which genes in a cell are differentially expressed. For example, a researcher may need to conduct an experiment to discover differentially expressed genes between two experimental conditions. For explanation purposes this could be between healthy patients and patients who have a condition of interest such as cancer. Microarray analysis will allow the researcher to find which genes are expressed differently between these two groups of patients. The researchers will then be able to develop a treatment that targets these specific genes and create a more effective type of therapy. Further information on microarray technology can be found in Majtan et al. [2].
Over the years many methods have been studied to perform the analysis of microarray data. These methods can be categorized into two types, parametric methods and nonparametric methods. Examples of parametric methods are the t-test, Bayes t-test [3], an analysis of variance approach, and the B-statistic method. Nonparametric methods, on the other hand, have become very attractive in this field of research because of the previous costs of microarray experiments and the availability of replicated data has made it difficult to obtain large samples. Nonparametric methods include Significance Analysis of Microarrays (SAM) proposed by Tusher et al. [4], samroc, which uses a very similar test statistic to SAM's in addition to the use of a receiver operating characteristic (ROC) curve [5], the mixture model method (MMM) [6], nonparametric empirical Bayes method [4] and the Zhao-Pan method.
A variety of comparisons between methods have been performed in the past to find which method is most reliable in discovering true differentially expressed genes. The main purpose in these comparisons is to find the method that correctly identifies the highest proportion of the true differentially expressed (DE) genes as DE while maintaining a small proportion of equivalently expressed (EE) genes being falsely identified as DE.
One of the most widely used methods for microarray analysis is the previously mentioned SAM. However, SAM is not a completely robust method and some shortcomings arise. Many researchers have attempted to modify the method in order to make it more reliable. When the number of significant genes is fairly large in a data set, the estimated number of significant genes by SAM is affected and the test is less powerful. As a solution, Pan et al. [6] suggested the use of MMM to estimate the distribution of the null and test statistic. The MMM allows for identifications of a rejection region for any type 1 error rate. In another attempt to fix this bias, Van de Wiel proposes a method using rank scores within SAM. Just by replacing the data with rank scores, the tendency of SAM to produce a biased estimate of DE genes is eliminated. The results are only valid though when the number of samples, N, is not "too small". On the basis of the test statistic used in SAM, Broberg's [5] created the samroc method. Broberg found that when the number of DE genes is large, then the samroc method is likely to work better than SAM. However, in most of the tests performed, the two methods worked just as well as each other when samroc did not outperform SAM.
Breitling et al. [7] adopted another approach to identify differentially expressed genes called rank product in an attempt to exceed SAM. The results showed that, while being a simpler method than SAM, rank product outperformed SAM in identifying DE genes, even with very small data sets. It is also seen that the rank product method performed very similarly to fold change. Fold change (FC) is a popular method often used because of its simplicity and easy understanding.
was chosen to be 5000 based on the work of Schwender et al. research.

SAM
The test statistic in SAM is very similar to the test statistic from the simple t-test. The difference lies on the introduction of a small constant, s 0 , in the denominator. The test statistic for SAM is as follows: where X i is the expression of the i th gene under experimental condition 1 and Y i is the expression of the i th gene under experimental condition 2 (i = 1,…,n). Further, X and Y are the mean expression levels under conditions 1 and 2 respectively for gene i.
The "gene-specific scatter" or standard deviation s(i) is defined: where J is the number of replicates in experimental condition 1 and K is the number of replicates in experimental condition 2.
The constant, s 0 , is added in order to correct the issue that the traditional t-test faces. The problem with the t-test occurs when genes have low expression levels and yield a small sample variance. The combination of those two factors lead to producing a large test statistic making it very likely that the gene will be identified as DE. The value of s 0 represents a percentile of the standard deviation values of all the genes. The method to compute this value can be found on Page 30 of the SAM user guide [1].
In order to find which genes are DE, SAM calls an algorithm to create the null scores by pooling the data together across the two treatments per gene B times, where B is the total number of permutations. For each permutation, SAM finds the null statistic by using the same formula as the original test statistic, resulting in a total of B null statistics for each gene. The mean of the null statistic is then found for each gene and plotted against the ordered test statistic. The absolute differences between the two values are then found and compared against a cutoff value to determine whether or not there is a significant difference [4]. The cutoff value can be obtained by following the method explained on Page 29 of the SAM user guide [1].

Samroc
Broberg's [5] approach to identifying lists of significant genes while minimizing the rate of false positives and false negatives consists of ranking genes in order of likelihood of being differentially expressed. The test statistic is similar to that of SAM, however the constant s 0 , is chosen in a different manner [9].

Fold change
According to McCarthy and Smyth [11], the earliest publications in analyzing microarray data to identify differentially expressed genes used the fold change rule. The fold change rule is defined as follows [9]: where X and Y are the mean expression levels under conditions 1 and 2 respectively for gene i. The typical accepted cutoff value for the fold change rule is FC i > 2 [11]. McCarthy and Smyth also mention that a disadvantage of the fold change rule is that it does not take variability Comparisons across methods are interesting because each method usually results in outcomes without much agreement. In Jeffery et al. [8] it is found that only 8 to 21% of the genes are commonly identified between the ten different methods being compared including SAM, samroc, fold change, and rank product. The study shows that many factors such as number of genes and number of samples influences which method will obtain the best result. It is concluded that rank product works well under settings with low number of samples and the ROC curve performed well under data sets with large sample sizes. The conclusion by Kim et al. [9] is similar to that of Jeffery et al. [8], noting that the sample size, distribution, and equal variance assumptions of each test greatly impact which test performs better. Our study shows that SAM outperformed samroc when the data follows a lognormal distribution.
Despite the advancement of next generation sequencing (NGS) as an alternative to microarrays, research in analysis of microarrays is still very relevant. Researchers in labs are more comfortable and confident with using microarrays as the technology has been around for a long time and it is less complicated than NGS [10]. Figuring out the most efficient method to identify differentially expressed genes under particular data settings can help master the data analysis step in microarray research.
The focus of the present study is a comparison of the top performing and popular methods SAM, samroc, rank product, and fold change along with modified versions of the SAM method and the fold change rule. As it is evident in Kim et al. [9] and Jeffery et al. [8], sample size and distributional assumption of the data largely impacts the decision of which is the superior method to choose when identifying differentially expressed genes. The aim of this paper was found after evaluating previous research and understanding the biggest drawbacks in this area. Several settings of lognormal cases with various sample sizes will be tested under each of the methods. For the first time, a modification that uses median in place of the mean in the test statistics of SAM and the fold change rule will be made in this paper. The modifications follow from the concept that the median is a better measure of central tendency than the mean when describing skewed data. The expectation is that using the median will better represent the average gene expressions when the microarray data follows a skewed distribution. The modification to fold change will be shown to improve results in identifying differentially expressed genes under skewed data settings. A table of cutoff values for fold change and its modified version is also included in the present study.
The organization of this paper is as follows. In section 2, the statistical techniques are given. A simulation study under the different settings of sample size and skewness is performed on each of the methods in section 3, section 4 will include the application and analysis of a real data set. Finally, conclusions will be made along with a statement of some concerns and future possible research in section 5.

Statistical Methods
This section is a review of several favored statistical methods for identifying differentially expressed genes in microarray datasets. The performance of the methods on data that follow a lognormal distribution are of interest. Let the i th gene expression level of the j th sample under condition 1 be represented by X ij and the i th gene expression level of the k th sample under condition 2 be represented by Y ik , where j = 1,…,J, k = 1,…,K, which represents replicates under condition 1 and 2 respectively. The gene number is represented by i, where i = 1,…,n. For this study n = 5000 genes. The number of genes, n, to follow a lognormal distribution. The simulations of several combinations of sample sizes have been done while also using three different levels of skewness, slight, moderate, and high.

Simulation techniques
The simulation is performed by generating 5000 genes where 500 of them are knowingly differentially expressed. A matrix, W, is generated of size (5000 x (J + K)), J is the number of samples from condition 1 and K represents the number of samples from condition 2. As stated earlier, each data point in the matrix represents a gene expression, X ij and Y ik . The i th gene expression level under condition 1 is represented by X ij and the i th gene expression level under condition 2 is represented by Y ik .
The comparison between SAM, samroc, fold change, rank product, and the proposed modifications using median are performed under cases of randomly generated data from the lognormal distribution. Different levels of skewness are considered: slightly, moderately, and highly skewed. The levels of skewness will be implemented by setting σ = 1,1.2,1.5 respectively. The data follows the model: The choice of the sample sizes under condition 1 and 2, values of J and K, were chosen in order to cover a variety of situations that an experimenter may face when using real data and to be consistent with previous studies on microarray data. Sample sizes of (4,4) and (10,26) were chosen as in Kim et al. and Zhang's study where the latter is also the sample size of the Leukemia data from Baldi et al. [3]. The sample size (8,8) was also chosen since it is of same size as the apolipoprotein AI (Apo AI) dataset from Callow et al. [15]. For a thorough analysis covering more possibilities, sample sizes on a scale of 5 from 10 to 25 were also chosen for J and K. All of the sample sizes can be seen in Table 2. For the purpose of this study, the process of simulating a data set and running the methods under each setting was 500 times, while the previously mentioned studies of Zhang and Schwender et al. used 100 simulations for such comparisons.

Results and discussion
An advantage of simulating gene expression data is that the exact genes that are differentially expressed are known. After each method is performed on the simulated data sets, the total number of genes that were correctly identified as DE, true positives (TP), and the total number of genes that were incorrectly identified as DE, false positives (FP), were recorded. With the number of TP and FP known, then the type 1 error rate and the power were calculated to perform the comparison of methods. The null hypothesis for microarray analysis is that the i th gene under condition 1 is the same as under condition 2 i.e., it is not DE, versus the alternative where the i th gene under condition 1 is significantly different from the i th gene under condition 2 i.e., the i th gene is DE. The hypotheses are important to note in order to find the type 1 error rate, the probability of rejecting the null hypothesis given that it is in fact true, and the power, the probability of correctly rejecting a false null hypothesis. In terms of the microarray analysis done here the type 1 error rate reduces to the number of genes incorrectly identified as differentially expressed, FP, divided by the total number of equivalently expressed genes, 4500, and power reduces to the number into consideration. Since it does not account for variability, it makes it difficult to make sense of a set cutoff value. The shortfalls of the fold change rule led to the development of more sophisticated tests such as SAM, however they also have their flaws and do not have the intuitive appeal which the fold change rule has [7].

Rank product
The rank product method was created with overcoming the problems of fold change in mind, while being statistically rigorous and simple at the same time [7]. After the rank product method gained popularity as a method to detect differentially expressed genes in microarray data, Koziol [12] extended the process to a two sample setting. Koziol defines the test statistic as follows: where J is the number of replicates in experimental condition 1, K is the number of replicates in experimental condition 2, and the rank is taken among the expressions in a single sample, across the n genes, for each sample. R ij represents those ranks assigned to the i th gene under condition 1 and R ik will be those ranks assigned to the i th gene under condition 2. Further, the monotone log transformation is taken on the test statistic to obtain a better approximation of the null distribution and the resulting statistic is: According to Koziol [12], "the exact distribution of log(RP i ) can be tedious" so a normal approximation of the distribution should be adequate, especially for large samples. If there is skewness in the data, then this approximation may not be adequate.

Median fold change
It has been shown that microarray data is consistent and well approximated by the lognormal distribution [13]. The lognormal distribution is known to be a skewed distribution and the best measure of central tendency for this type of distribution is the median [13,14].
With the prevailing use among biologists as seen in [13] because of its attractive nature and simplicity, we are proposing the following modification to the fold change rule: Instead of using the average expression levels of the i th gene under condition 1 and 2, i X and i Y , when calculating the fold change, the median expression levels for the i th gene, i

Simulation Study
Since a theoretical comparison among the test statistics is not possible, a simulation study has been conducted to compare the performance of the test statistics in this chapter. In this section, the performance of SAM, samroc, fold change, rank product and the proposed modifications of fold change using median are compared by applying the methods to simulated gene expression data sets. The methods are compared under the case where the data is simulated  The simulations carried out under the lognormal distribution revealed settings where the SAM method turns out to be the weakest of the methods. SAM worked rather poorly for all sample size combinations where at least one of the conditions had sample size less than 15. For settings where both conditions had 15 or more samples, SAM worked decently with a power most of the time above 0.70 except for few situations where the data was moderately skewed and in all cases that were highly skewed. In highly skewed settings, SAM was rather poor. The samroc method followed similar trends as SAM, however, samroc was more robust in respect to sample size. The performance of samroc was much better than SAM under settings where both conditions had sample sizes of 10 or higher. The values of power and type 1 error rate for each setting under a lognormal distribution are given in Table 1. Levels of skewness are indicated by L=slightly skewed, M=moderately skewed, and H=highly skewed ( Table 1).
As Table 1 shows, the fold change method and the modified fold change method using median were consistently the top two methods across all sample sizes and all skewness settings for the lognormal data. The modified version of fold change with median worked better than the original fold change for all of the simulated sample sizes, obtaining higher levels of power while maintaining a type 1 error rate of 0.05 or smaller. It can also be seen in Table 1 that as the level of skewness rises, the modified version of the fold change method with median further improves over the original fold change. For each sample size simulated, as skewness increases, the difference in power between the original fold change and median fold change increases, with the latter having the higher power. This relationship is illustrated in Figure 1. The improvement in the fold change method was anticipated because the modified version replaced the mean with the median and for the lognormal data, which is a skewed distribution, the median is a more accurate measurement of the central tendency as Manikandan [14] stated.
Even though the median fold change method constantly had the better power as the sample size increased, it is evident that when there are at least 15 samples of each condition and the skewness is not too heavy, all the methods work very similarly, producing about the same power and type 1 error rate. The similar performance between methods toward the higher number of sample sizes leaves the decision of which method to use for analysis of microarray data to the researcher depending on which assumptions best match the data and the method of choice. SAM, samroc, and fold change all have the assumption that the genes share equal variance while rank product assumption is more relaxed allowing the variance to be about equal. The cutoff values for fold change and median fold change were chosen in order to obtain a type 1 error rate of no more than 0.05 and are given in Table 2.

Application
To illustrate the findings of this paper, a real data set, Apo AI data from Callow et al. [15] are analyzed in this section. The Apo AI dataset consists of 5548 genes and 16 samples. Out of the 16 samples, 8 were from control mice and the other 8 samples were from mice with the Apo AI gene knocked out. The 8 mice that had the Apo AI gene knocked out will have a very low high-density lipoprotein cholesterol level and the delivery of the cholesterol to the liver will be affected [15]. The data were preprocessed in the similar way as the leukemia data (4.1), as was done by Kim et al. The difficulty when attempting to analyze this dataset is that there has not been reference genes adopted as biologically significant from previous studies as there was with the leukemia data. Table 3 shows the number of genes that were commonly found between each pair of the five methods. The idea expressed in Jeffery et al. [8] that only a very low percentage of genes will be found significant between multiple methods is supported by the results. The conflicting result between methods is one of the drawbacks of microarray analysis. There is a large inconsistency between the different methods to identify which genes are identified as significantly different between two groups [16] (Table 3).

Conclusion
A comparison of the performance of popular testing procedures for identifying differentially expressed genes from microarray data   such as SAM, samroc, fold change and rank product was conducted. On the basis of the assumption that microarray data are related to the lognormal distribution from Hoyle et al. [13] and the familiar idea that the median is a better measurement of central tendency than the mean when describing skewed data as expressed in Manikandan [14], modifications were attempted on fold change, replacing the mean gene expression values with the median. It has been observed from Simulation results, fold change and the modified median fold change were consistently the top performing methods for lognormal data.
An analysis on a real microarray datasets was also performed to evaluate how the methods and the proposed modification would perform in a real situation. While the analysis on the Apo AI dataset showed that the median fold change method was an improvement to the original fold change, it also gave a nice visualization of how the different methods are inconsistent with each other when identifying differentially expressed genes. Hope findings of the paper will be useful for the practitioners in the area of health sciences and related area.