Keywords 
Gene set enrichment analysis; The receiver operating
characteristic (roc) curve; The area under the roc curve (Auc);
Discrimination ability; Crossvalidation; Linear combination
coefficients; Random forests classification method 
Introduction 
Biological phenomena often occur through the interactions of
multiple genes via signalling pathways, networks, or other functional
relationships. Based on that information, a set of genes with related
functions is grouped together and referred to as a "gene set"; among
them, if the expression level of such a gene set is significantly associates
with the clinical outcomes/phenotypes, then we say that this gene set
is "differentially expressed". Thousands of genes that share common
functional annotation are organized into groups (possibly overlapping),
and the information for grouping genes as gene sets can be obtained
through publicly available annotation databases such as the Kyoto
Encyclopaedia of Genes and Genomes (KEGG), Gene Ontology (GO),
and GenMAPP. 
Many statistical approaches, such as gene set analysis (GSA)
methods, are used to determine whether such functionally related gene
sets express differentially (enrichment and/or deletion) in variations
of phenotypes, and a common approach of GSA methods is first to
identify a list of genes that express differently among two groups of
samples using a statistical test, and this list of differentially expressed
genes is then examined with biologically predefined gene sets to
determine whether any set in the list is overrepresented compared
with the whole list of gene sets [13]. These types of approaches do not
consider the correlation structure and the order of genes in a gene set.
Therefore, Mootha et al. [4] and Subramanian et al. [5], in order to
improve on GSA, proposed the Gene Set Enrichment Analysis (GSEA),
in which they consider the distributions of entire genes in a gene set,
rather than a subset from the list of differential expression genes, and
use some statistic to assess the significance of predefined gene sets. 
Following the idea of GSEA, many statistical methods have
been proposed, such as the global test [6], the twosample ttest like
approach [7], the ANCOVA test [8], the Hotelling's T2 test [9], the
MaxMean approach [10], the SAMGS test [11], the global statistics
approach [12], the Randomsets method [13], the Logistic Regression
(LRpath) approach [14], and the MANOVA test [15] amongst others.
These approaches rely on different statistical assumptions and data
structures, which then usually lead to different results even when they
are applied to the same data set. For the underlying assumptions and
a comprehensive review of these methodologies can be found in, for
example, Goeman and Buhlmann [16] and Nam and Kim [17]. From their papers, we note that none of the methods have addressed the
feasibility of discriminating different phenotypes via a priori defined
gene sets. 
On the other hand, there various machine learning type algorithms
have been developed from a classification prospective. For example, Lin
et al. [18] demonstrated that the classification accuracy and robustness
of classification in analyzing microarray data can be improved by
considering the existing biological annotations, Pang et al. [19,20]
used the random forests classification and regression, Wei and Li [21]
applied a boostingbased method for nonparametric pathwaysbased
regression (NPR) analysis, and Tai and Pan [22,23] proposed a group
penalization method that incorporates biological information to build
a penalized classifier to improve prediction accuracy. In particular,
Lottaz and Spang [24] provided a biologically focused classifier, say
StAM, based on the GO hierarchical structure, and only the genes
annotated in the leaf nodes of the GO tree can be used as predictors,
and other genes (relevant, but not annotated yet), cannot be used in
their method. Since the biological information of gene sets may come
from different databases, such as KEGG or BioCarta, and not limited
to the GO annotation only, therefore, when the GO terms of genes
in gene sets are unavailable, this type of method may result in losing
information about the predictive model. Although, NPR aims to
improve on the predictive accuracy via incorporation of biological
knowledge, there is no selection criterion with respect to identification
of differential gene sets. 
In this paper, we propose a GSA method that can not only to
identify gene sets that have only a subset of genes with expression
profiles, which are strongly associated with the class distinction, but
also to increase the discrimination ability such that subjects with
different phenotypes or clinical outcomes are appropriately classified
by integrating the selected gene sets together. During the analysis, the
proposed method treats each gene set as a whole, while retaining ability
to interpret impacts of individual genes on a prediction model. Here we
assume the data set contains the a priori biological information of gene
sets. For data sets without gene sets information, the proposed method
can still be applied in the same manner, if an appropriate clustering
algorithm can be applied to obtain gene clusters first. However, the
focus of this type of analysis is usually different from that of analyzing
data with intrinsic gene set information. The details are given in the
websupplement. 
It is wellknown that the AUC, as a summary index, shares the
threshold independent characteristic of ROC curves. Hence, in this
paper, we propose this AUCbased method for identifying differential
gene sets and study the detailed procedure of selecting gene sets with
discrimination ability. In addition, AUCbased statistics are proposed
and can be used to assess and rank the significance of gene sets. The
remainder of this paper is organized as follows: the background of
proposed methods and the two new selection procedures are given in
Section 2. In Section 3, the performances of our proposed approach
are compared to the random forestsbased pathway analysis approach
(pathwayRF) [19] based on the synthesized data sets and a series of
real expression data. In addition, our method is also compared to other
hypotheses testing based approaches such as global ANCOVA gene set
testing [8] and the rotation gene set testing [16]. A discussion of the
proposed approach is presented in Section 4. For other properties of
AUC and its applications, please refer to Metz et al. [25], Su and Liu
[26], Zhou et al. [27], Pepe [28], Liu et al. [29], Ma and Huang [30] and
Wang et al. [31], and the references therein. 
Methods 
Suppose that each subject has p observed continuousvalued
outputs of genes in a binary classification problem, say diseased and
normal classes. Assume further that these p genes belong to one of K
predefined gene sets based on, for example, GO category. The proposed
method consists of the following three steps: 
(i) constructing a classifier with continuous decision output based
on genes within each gene set; 
(ii) evaluating the discrimination ability of all gene sets to detect
diseased and normal groups based on the classifiers obtained
in (i); 
(iii) identifying the sets with high discrimination ability through
construction of a classifier ensemble using the classifiers
obtained in (i). 
After step (i), each gene set is represented by a function of the
classifier constructed using the genes of this gene set. That is, we treat
each gene set as a unit simply by taking the output values of the function
value of the corresponding classifier as a new feature, and subjects are
represented using these new features. Therefore, a linear ensemble of
these new features that maximizes the AUC is constructed, which has
the best classification performance in terms of AUC. In step (iii), we
construct a linear ensemble of classifiers based on these new features.
We then identify gene sets according to their contributions to this
final ensemble, which can usually be indexed using the corresponding
coefficients of the linear combination. Furthermore, another crossvalidation
schemes are also recommended to assess the identification
stability of differential gene sets. 
Please note that if we are only interested in gene sets identification,
then the only requirement of the baseclassifiers used in step (i)
is being able to provide continuous classification function values,
otherwise there is no limitation on the usages of classification methods
in the proposed method. Thus, many classification methods, such as
linear discrimination, logistic regression, support vector machines
and others, are applicable in this case. Since different classification
methods will usually select/utilize genes in different ways such that the
maximum classification performance can be achieved, then it makes the
assessment of the impact of individual genes very difficult if different
classification methods are used within different gene sets. In this paper,
in order to fully take advantage of the thresholdindependence of the
ROC curve and retain the interpretation ability of each gene, we adopt
the best linear combination of genes that maximizes AUC within each
gene set for constructing the base classifiers in (i). Thus, our method
not only identifies the differential gene sets, but also provides the
measure of importance of genes within each gene set. Moreover, in the
web supplement we also illustrate how to apply our method when there
is no intrinsic grouping information available, in which the Kmeans
and a hierarchical clustering methods are applied to obtain gene sets. 
The area under the ROC curve 
Let W be a continuousvalued random variable, and suppose
that for a prespecified threshold c, a subject is classified as diseased
(positive) if W>c, or normal (negative) otherwise. Then the ROC
curve for such a simple classifier W is , where and are functions of the true positive and false positive probabilities,
respectively. The is defined as an integration of true positive rate over the whole range of the false positive rate. Let function
ψ(u)=1 if u>0; =0.5 if u=0 and =0 if u<0 and {W_{D,1},...,W_{D,n},W_{N,1},...,W_{N,m}}
be observations of n diseased and m normal subjects. Then it is wellknown
that the AUC can be estimated using the WilcoxonMann
Whitney Ustatistic 
(1) 
It is shown in Kowalski and Tu [32] that 
(2) 
where , ρ1 and ρ2 are the limits of (n + m) / n and
(n + m) / m , respectively, with 



In addition, it is shown that asymptotic standard deviation σ can
be estimated by 
(3) 
where 


The detailed arguments are given in the web supplement 
Linear combination of genes within a gene set 
Suppose that each gene set has pk genes, and X_{k}'s and Y_{k}'s are pk
dimensional random vectors of gene expressions data of the kth gene
set from two (normal and diseased) groups, k=1,...,K. As mentioned
in (i), for each k, we first construct a linear classifier to discriminate
between these two different groups by using the p_{k} genes within kth
gene set. Let ℓ_{k} be a vector of constants with length p_{k}, then the linear
combinations ℓ'_{k}X_{k} andℓ'_{k}Y_{k} are called risk scores. There is usually
no distribution information available for X_{k} and Y_{k}, so the best ℓ_{k},for
each k, that achieves the maximum AUC among all possible p_{k}
dimensional vectors must be calculated through observations. Let
{X_{kj},Y_{ki}: j=1,...,m;i=1,...,n}be observed expression values of genes in the
kth gene set of the two groups. From (1), the empirical AUC estimate
can be rewritten as 
(4) 

Since step function ψ(∙) is not continuously differentiable, to
compute such a linear combination coefficient by directly maximizing
AÛC(ℓ_{k}) is difficult. Thus, we follow Ma and Huang [30] and Wang et
al. [31], to use a sigmoid function S(t)=1/(1+exp(t)) to approximate
ψ(∙). Consequently, the smoothed AUC estimate is defined as, 
(5) 
It has been shown in Wang et al. [31] that for sufficiently small h,
S((yx)/h)≈ψ((yx) and AÛC_{s}(ℓ_{k}) is a strongly consistent estimator of
AUC. So, we obtain a vector ˆ
ℓ_{k} of the ''optimal'' linear combination coefficients for the kth gene set through maximizing the smoothed
AUC estimate (5) with respect to ℓ_{k}; that is, 

Since AUC is scale invariant, AÛC(ℓ_{k}) has the same value as
AÛC(cℓ_{k}) for any positive constant c. Hence, an anchor gene should be
determined before finding the solution that maximizes AÛC_{s}(ℓ_{k}) such
that ℓ_{k} is identifiable. However, when the number of genes in one set is
more than the total sample size, e.g. p_{k}>>(n+m), obtaining vector _{k} by
direct maximization of (5) could lead to overfitting. Therefore, we use
the PTIFS method proposed in Wang et al. [31] to find the ``optimal''
ˆ
ℓ_{k} for each _{k}. We describe a summary of the PTIFS algorithm based
on the kth gene set with p_{k} genes as following, details seen in Wang et
al. [31], where we assume that gene 1 is the anchor gene and denote sets
of active genes and inactive genes at the iteration procedure by A_{k} and
A_{kc} , respectively. 
Algorithm 
(i) Find the anchor gene which is A_{k}= {1} as assumed and set G_{k} =φ,
the empty set. 
(ii) For the current active set A_{k}, calculate the corresponding
coefficients,_{k}^{A}
, of the linear classifier by the criterion function
(6). For each
j∈ A_{kc} , compute R_{j}=(ℓ_{k} ) , the empirical
AUC for the jth gene based on all the subjects of the two
groups; and compute its empirical AUC,
R_{j(0)} , based on
subjects that are misclassified by A _{k}^{A} , where l_{k}=β_{k}_{j} is a vector
of length p_{k} with the jth component being 1 and the others
being 0. 
(iii) If for all or , then stop. Otherwise
choose ,and update the active set
A_{k} by adding the feature j_{0} and excluding it from the inactive
set , where λ>0 is a prespecified constant weighting on the
misclassified subjects. 
(iv) Update _{k}^{A}
l by using objective function (6) with respect
to the updated active set Ak in step 
(iii). Remove the set from A_{k} and add it to inactive
set If j_{0}∈B , then exclude j_{0} from and add j_{0} to G_{k},
otherwise add the elements of G_{k} to and let G_{k}=φ. 
(v) Repeat (ii)(iv) until AÛC(_{k}^{A})≥τ , where 0.5 ≤τ < 1 . 
In many biological studies, such as treatment/drug developments,
the assessment of individual genes is highly valued. When a nonlinear
classifier is used, it is usually difficult to distinguish the impacts of
individual genes due to the complicated classification function, which
makes designing the followup studies on individual genes very
difficult. That is another reason, in addition to the technical ones, why
the linear classifier is preferred in the proposed method. It is clear that
in this case, the impact of both gene sets and individual genes can be
identified and followup experiments can be easily planned based on the
findings. PTIFS is adopted when the followup biological confirmation
is of concern. Because of its parsimonious property, PTIFS may also
benefit biological researchers designing the necessary and expensive
experiments. AUC is chosen as an indicator of performance here
due to its threshold independent property. Moreover, the AUC can
be calculated easily such that a test statistic can be founded on that.
However, the AUC can be replaced by other performance measures
without any difficulty as long as a method of calculating coefficients of
a linear combination vector under such a measure is available. 
Assessment of gene set significance 
According to the methods used to maximize AUC, we use 3
measures to assess the significance of a gene set, which are AUC
statistic of gene set, crossvalidation AUC of gene set, and coefficients
of linear combination of gene sets. Once step (i) is completed for all
K gene sets, we have
_{k} for each k =1,...,K. The AÛC_{k}= AÛC_{k}
is then calculated using (4). For each k, define statistic z_{k}=AÛC_{k}–0.5,
which will be used as an index for identifying differential gene sets. It
is known that if gene set k has no discrimination ability to distinguish
the diseased subjects from the normal ones, then its corresponding
ROC curve is usually close to the diagonal line of the unit square
and the AUC approximately equals 0.5. The larger the zk, the better
the classification performance of gene set k. Thus, we can select the
topranking gene sets according to z_{k} until a prescribed threshold is
reached. For a given linear combination ℓ_{k} of genes in gene set k, hence,
from (2), AÛC(ℓ_{k}) is asymptotically normally distributed. However,
AÛC_{k} is a minimizer of AÛC(ℓk) with respect to ℓk and the asymptotic
distribution of AÛC_{k} is difficult to be computed. Therefore, if a p value
of gene set must be needed, then a permutationbased approach can
be employed to calculate the empirical pvalue for each gene set by
randomly drawing gene sets with the same size as gene set k. 
In practice, the size of genes in a gene set may be much larger than
the number of subjects available for analysis. Thus, to prevent overfitting
so that the corresponding AÛC will not be overoptimistically
large, the crossvalidation method should be used. Following the usual
crossvalidation scheme and applying the constructed classifier to the
testing sample, the predicting AUC, say AUC_{cv,k}, is calculated for each
k. Since the larger the AUC_{cv,k}, the higher the discriminant potential of
the kth gene set, we can now select gene sets according to AUC_{cv,k} as
in the z_{k} cases. 
To construct the final classification ensemble in (iii), we first
represent all individual subjects by their corresponding function values
after applying the classifiers obtained from individual gene sets to all
subjects. So, those classification scores of each subject are treated as
a set of new variables. The final ensemble is an integration of all gene
sets using PTIFS, which provides us a linear combination of these new
variables such that the final AUC is maximized. Consequently, the
gene sets are ranked according to absolute values of linear combination
coefficients, and the topranked ones are selected. The interpretation
ability of individual genes is retained due to a linear combination is
used in each stage. Note that once we represent subjects by function
values of the gene set based classifiers, the dimensionality of these new
variables is reduced to K  the total number of gene sets considered,
which is usually much smaller than the size of genes considered. 

Results 
Simulation study 
We evaluate the performance of the proposed method using an
extensive simulation study. For the evaluating purposes, N= 10000
genes are generated from a multivariate normal (MVN) distribution
with a mean vector μ and a diagonal variancecovariance matrix Σ for
both diseased and normal groups, and only a fraction out of N genes,
say γ ∈ (0,1), is differentially expressed with a mean difference δ. The
sets S_{0} and S_{1} respectively contain the nondifferential and differential
genes. Thus, there are 10000γ genes in set S_{1} and 10000(1γ) genes
in set S_{0}. The diagonal elements of Σ are equal to 1. Genes in set S_{1}
are correlated with a correlation coefficient ρ ^{ij} between gene i and
gene j. In this study, ρ is set to 0 (i.e., independent model) and 0.5. The 
informative gene set is composed of 100 genes, where 100γ genes are
randomly selected from set S_{1} and the other 100(1γ) genes are from set
S_{0}. This procedure is then repeated 10 times. Additionally, we generate
10 noninformative gene sets in which 100 genes are randomly selected
from set S_{0}. In summary, we generate 20 gene sets in each (diseased and
normal) groups, and there were 100 genes in each gene set. In the 20
simulated gene sets, only the first 10 sets has discriminant ability, and
only the first 100γ genes in these gene sets are differently expressed. In
our simulation studies, γ is set to be 0.05 and 0.10, as well as δ is 0.0, 1.0,
and 1.5. We note that, when δ=0, the Type I error rates of competing
methods is investigated. The training sample sizes are n=m=50 and
testing sample sizes are n=m=20 for both diseased and normal groups.
All simulations are repeated 200 times. 

We investigate, from a classification prospect, the performance in
identifying differential gene sets of the proposed method, and compare
with those in pathway analysis using random forests classification
(pathwayRF) [19] the least square support vector machine (lssvm),
global gene set testing (Global ANCOVA) [8] and the rotation gene set
testing (Roast) [16]. We use the outofbag (OOB) error estimate, cross
validation error rate, Ftest statistic and p value to assess the discriminant
abilities of gene sets for methods pathwayRF, lssvm, Global ANCOVA
and Roast, respectively. The OOB estimate is based on recording the
votes of classification on the test set left out of the bootstrap sample.
For each gene set, based on training data sets, we calculate the fivefold
crossvalidation AUC, AUC_{cv} using parsimonious threshold
independent feature selection (PTIFS), OOB as that in the pathwayRF
method, cross validation error rate for the lssvm, F test statistic for
the Global ANCOVA method and p value in the Roast method. The
PTIFS, proposed in Wang et al. [31], is a LARStype algorithm that
helps to find the linear combination of markers that maximizes AUC.
Note that AUC is not available for pathwayRF due to the complicated
voting scheme of the random forests process. Since Global ANCOVA
and Roast are based on views of regression, they can not be used to
make prediction for twogroup classification data. We then compare
the accuracies of detecting gene sets of these five approaches, where
the accuracy is defined as the percentage of correctly identified gene
sets. In Figures 1 and 2 with ρ =0.0 and 0.5, the upper three panels and
the first two panels in the lower respectively present mean accuracy of
AUC_{cv}, pathwayRF, lssvm, Global ANCOVA and Roast for γ= 0.05
or 0.10 and δ= 1.0 or 1.5. We can see that the accuracy is greater than
0.95 for AUC_{cv} with cutpoint in interval (0.60, 0.75), for pathwayRF
with cutpoint in interval (0.30, 0.40), for Global ANCOVA with cutpoint
in (1.0, 2.0) and for Roast with cutpoint in (0.01, 0.06) except the
parameter combination (γ =0.05, δ =1.0). Method lssvm has the least
accuracy smaller than 0.52 which suggests that lssvm is not suitable to
identify differentially expressed gene set. Hence, the proposed AUC_{cv} is
a promising and reliable measure to identify differential gene sets. At
the same γ, AUC_{cv} with larger δ =1.5 has higher accuracy than those
of the cases with smaller δ( =1.0). For the same δ, AUC_{cv} with larger
γ(= 0.10) has greater accuracy than those with smaller γ(= 0.05). This
is because AUC_{cv} is sensitive to the discrimination ability of gene set.
Similarly, the larger δ for given γ or the larger γ for given δ, the greater
accuracy; and pathwayRF, Global ANCOVA and Roast also perform
well in these cases. In addition, the k AÛC for the kth gene set is
computed and the statistic zk = AÛC_{k} 0.5 is calculated. The third panel
in the lower of Figures 1 and 2 presents the mean accuracy values of
based on z_{k} with γ= 0.05 or 0.10, and δ= 1.0 or 1.5. It is found that,
except situation of γ= 0.05 and δ= 1.0, all zk have accuracies greater than
0.90 for any cutpoint value in the interval (0.43, 0.45), which indicates that the z_{k} performs well for identifying differential gene sets. Table 1
presents the average fitting and prediction error rates of the topten
gene sets selected by AUC_{cv}, pathwayRF and lssvm. The results again
confirms that our proposed AUC_{cv} provides competitive classification
results, in terms of the classification errors, to that of pathwayRF, and
much better than that of lssvm. Table 2 shows the empirical type I
errors of three methods, Global ANCOVA, Roast and AUC statistic z_{k},
using the nominal level of 0.05. Note that the p value of z_{k} is computed
by using permutation method. As expected, the type I error rates of all
three methods are reasonably close to or below the nominal level in all
settings. 
When ρ = 0.0 and 0.5, as an example, Figures 3 and 4 respectively
describe the average values of withingeneset coefficients for the first
gene set (gene set 1) obtained using PTIFS with γ= 0.05 and 0.10 and
δ= 1.0 and 1.5. In these figures, the vertical lines represent the positive
(up) and negative (down) values of the coefficient, respectively, and the
length of lines indicates the magnitude of the absolution value of the
coefficient within each gene set. When γ= 0.05 (0.10), the magnitudes
of coefficients of the first 5 (10) genes are the largest since only the
first 5 (10) genes have discrimination ability. For the same δ, the
magnitude of the coefficient of genes selected at γ= 0.05 is bigger than
that at γ= 0.10. This is because when the fraction γ is large, the number
of differentially expressed genes selected increases and the number of
candidate genes also increases. So, the selection frequencies become
smaller than those of the case with small γ, due to the parsimonious
property of PTIFS. By applying PTIFS to the newly extracted variables
using the baseclassifiers of individual gene sets, we are able to select
the differentially expressed gene sets. Table 3 lists the number of gene
sets selected (selnum) based on PTIFS, AUC and the misclassification
rate of the linear classifier using both training and testing data sets. It
shows that our method performs well in terms of classification error
rate. 
Real examples 
We apply our method to data sets obtained from six microarray
gene expression studies with intrinsic gene sets, which are Gender,
p53, Diabetes, Leukemia, lung cancer from the Boston study and lung
cancer from the Michigan study. All these data sets are frequently
used in gene set analysis [4,5,11,15]. In each study, the gene sets are
clustered into several sets based on two catalogs: (C1) chromosomes
and cytogenetic catalog, and (C2) functional catalog. The gender data
set consists of 15 males and 17 females, and each sample has 15,056
mRNA expression profiles. The p53 data set is from a study identifying
targets of the transcription factor p53 from 10,100 gene expressions
with 17 normal and 33 mutation samples. In the diabetes study, the DNA microarrays are used to profile expressions of 15,056 genes from
34 skeletal muscle biopsy samples (17 normals and 17 patients). The
leukemia data set includes 10,056 expression profiles from 24 acute
lymphoblastic leukemia (ALL) patients and 24 acute myeloid leukemia
(AML) patients. The lung cancer data set is obtained from studies
conducted by the Boston and Michigan groups. The Boston group
studied 5,217 gene expression levels from 31 dead and 31 live subjects,
and the Michigan group studied lung tumors by comparing 5,217 gene
expression profiles derived from 24 dead subjects with those of 62 live
subjects. 
We apply both the AUC_{cv} and pathwayRF methods to the seven
data sets for identification of differential gene sets: Gender (C1), Gender (C2), p53 (C2), diabetes (C2), acute leukemia (C1), and lung cancer
(C2) from the Boston study, and lung cancer (C2) from the Michigan
study. The number of gene sets are 212, 318, 308, 318, 182, 258 and 258,
respectively. For each gene set, we use the PTIFS algorithm to obtain
an optimal linear combination coefficients of genes. We obtain a 5fold
crossvalidation estimate of AUC, AUC_{cv}, and apply pathwayRF with
OOB values. We then rank all gene sets via AUC_{cv} and OOB. Tables 4 and 5 show the topten ranking gene sets based on AUC_{cv} and the
smallest OOB for these seven gene expression data sets, respectively. 
Figures 5 and 6 describe the withinset total nonzero coefficients of
the top five ranking gene sets, where the yaxis labels stand for gene set
labels and the xaxis labels consist of the total genes where coefficients
are not equal to 0 in these top five gene sets. In these two figures, as
before, the up and downdirections of the short vertical lines represent
whether the coefficients are positive or negative, respectively. The
lengths of lines stand for the magnitudes of the absolute values of the
coefficients. From Tables 4 and 5, we found that there are some gene
sets that are simultaneously identified using AUC_{cv} and pathwayRF
with OOB, such as Gender (C1), Gender (C2) and P53. However,
many gene sets chosen by these two methods are different, such as
in the diabetes, and lung cancer data sets from the Boston study and
Michigan study. Note that for leukemia data set, there are many gene
sets that have excellent discrimination ability; that is, no classification
error. In fact, there are many gene sets selected by both of AUC_{cv} and
OOB in this case. Table 6 lists the average fivefold crossvalidations
error rates of the topten gene sets identified by AUC_{cv} and pathwayRF,
which suggests that the proposed AUC_{cv} is comparable to pathwayRF.
The PTIFS method chooses only one gene set in most of
the data sets, except the lung cancer data set from the Boston study.
The longer the vertical line, the greater the discrimination ability of
genes within the gene set. Hence, it is shown in Figures 5 and 6 which
genes have the greatest ability to separate the two groups (e.g. male and
female, ALL and AML) among those genes under consideration. From
a statistical perspective, when the number of variables is much larger
than the number of subjects, it is difficult to have a firmly, satisfactory
variable selection scheme which is overwhelmingly better than others.
Thus, further study on the selected variables is essential. However,
in the gene set selection problem discussed here, the selected gene
sets have to be reconfirmed through intensive biological laboratory
work. Hence, the parsimoniousness of the proposed method is usually
preferable in this respect. 
Discussion and Conclusions 
We propose a gene set selection method based on the discrimination
abilities of gene sets with AUC as the classification performance
criterion. This kind of a discriminationbased method is rarely
discussed in gene set selection research and is different from methods
that are based on clinical outcomes or phenotypes. Our algorithm is founded on the PTIFS algorithm [31], which can select features from
high dimensional data sets with small number of subjects. We also
introduce two AUCbased statistics to assess the discriminant abilities
of gene sets for binary class distinctions. In addition to selecting the
gene set, our algorithm can further quantify the impact of individual
genes within the gene set when the gene setbased classification of
phenotypes is conducted. 
From the numerical results of the synthesized data set, we found
that the proposed method successfully selects the targeted gene sets. In
the numerical results from analyzing real data, the selected gene sets
are somewhat different from those selected in other papers that analyse
the same data sets based on other association analysis. This difference
is primarily due to the fact that the classic gene setbased tests aim to
detect significant gene sets in which genes are differentially expressed
between phenotypic conditions. However, our method is to identify
gene sets with regard to their ability to predict phenotypic conditions.
As we know, the amount of enrichment will influence the ability of
identifying differential gene sets. The GSEA is conservative, especially
when the percentage of differentially expressed genes is relatively small
within the gene set. This has been clearly illustrated by applying GSEA
to two lung cancer data sets published as supporting information on
the GSEA web site, since most gene sets are not statistically significant
in terms of pvalues. Nonetheless, the simulation studies reveal that
our method is able to identify gene sets with small alterations between
two phenotypes, even when only 5% of genes in the gene set are
differentially expressed. Specifically, we found two nuclear factorKB
(NFKB)related sets with higher AUC in the Boston Lung cancer data.
These gene sets are thought to be major transcription factors regulating
many important signaling pathways involved in the tumor promotion.
In contrast, there is a large overlap among the significant gene sets
between the two methods in the Gender data sets. This result is due
to a large proportion of differentially expressed genes within the sets.
In summary, our method provides a powerful alternative to gene sets
information methods currently available in the literature. The gene sets
selected by our approach may reveal distinct prospects of expression
profiles, which are useful for biologists when discrimination ability is
of concern. 
Concerning the framework of multipletesting issue, the false discovery rate (FDR) approach can be used to adjust for multiple
comparisons using pvalues derived from two AUCbased statistics.
However, the FDR and the estimated qvalues depend on the number
of gene sets. Since the construction of gene sets is based on the
biologically relevant information retrieved from public databases, the
number of gene sets may be different across different databases and
different gene sets may share common genes. Thus, it is critical to select
the thresholds to control for FDR's across different experiments and
count for the possible complex correlation structure among pvalues.
Further work is required in order to estimate error rate and therefore it
is not discussed in this article. 
There are several possible extensions of the proposed method;
for example, we can replace AUC with other performance measures
such as partial AUC. In addition, the linear combination within gene
sets can be replaced by other methods, even apply a highly nonlinear
classification algorithm, if our only concern is to identify gene sets
and not the impact of individual genes. In fact, from the prospective
of gene sets selection, we do not even require that the classification methods used within gene sets be homogeneous. We can simply allow
the classification method used for each gene set to be the best for that
particular gene set among a group of classifiers under consideration,
and then the rest of the steps can still be easily applied. 
Acknowledgements 
This work is partially supported via NSC972118M001004MY2 and NSC98
2118M039002 funded by the National Science Council, Taipei, Taiwan, ROC, and National Natural Science Foundation of China (Grant No. 11101396). 
References 
 Draghici S, Khatri P, Martins R, Ostermeier G, Krawetzb S (2003) Global functional profiling of gene expression. Genomics 81: 98104.
 Khatri P, Draghici S (2005) Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21: 35873595.
 Rivals I, Personnaz L, Taing L, Potier M (2007) Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics 23: 401407.
 Mootha V, Lindgren C, Eriksson K, Subramanian A, Sihag S, et al. (2003) PGC1 aresponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34: 267273.
 Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, et al. (2005) Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles. Proc Natl Acad Sci USA 102: 1554515550.
 Goeman J, van de Geer S, de Kort F, van Houwelingen H (2004) A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20: 9399.
 Tian L, Greenberg S, Kong S, Altschuler J, Kohane IS, et al. (2005) Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci USA 102: 1354413549.
 Mansmann U, Meister R (2005) Testing differential gene expression in functional groups: Goeman's global test versus an ANCOVA approach. Method Inf Med 44: 449453.
 Kong SW, Pu WT, Park PJ (2006) A multivariate approach for integrating genome wide expression data and biological knowledge. Bioinformatics 22: 23732380.
 Efron B, Tibshirani R (2006) On testing the significance of sets of genes. Ann Appl Stat 1: 107129.
 Dinu I, Potter J, Mueller T, Liu, Q, Adewale A, et al. (2007) Improving gene set analysis of microarray data by SAMGS. BMC Bioinformatics 8: 242.
 Chen JJ, Lee T, Delongchamp RR, Chen T, Tsai CA (2007) Significance analysis of groups of genes in expression profiling studies. Bioinformatics 23: 21042112.
 Newton M, Fernando A, Johan A, Srikumar S, Paul A (2007) Randomset methods identify distinct aspects of the enrichment signal in geneset analysis. Ann Appl Stat 1: 85106.
 Sartor M, Leikauf G, Medvedovic M (2009) LRpath: a logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics 25: 211217.
 Tsai CA, Chen J (2009) Multivariate analysis of variance test for gene set analysis. Bioinformatics 25: 897903.
 Goeman J, Buhlmann P (2007) Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23: 980987.
 Nam D, Kim S (2008) Geneset approach for expression pattern analysis. Brief Bioinform 9: 189197.
 Lin SM, Devakumar J, Kibbe WA (2006) Improved prediction of treatment response using microarrays and existing biological knowledge. Pharmacogenomics 7: 495501.
 Pang H, Lin A, Holford M, Enerson B, Lu B, et al. (2006) Pathway analysis using random forests classification and regression. Bioinformatics 22: 20282036.
 Pang H, Zhao H (2008) Building pathway clusters from Random Forests classification using class votes. BMC Bioinformatics 9: 87.
 Wei Z, Li H (2007) Nonparametric pathwaybased regression models for analysis of genomic data. Biostatistics 8: 265284.
 Tai F, Pan W (2007) Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms. Bioinformatics 23: 17751782.
 Tai F, Pan W (2007) Incorporating prior knowledge of gene functional groups into regularized discriminant analysis of microarray data. Bioinformatics 23: 31703177.
 Lottaz C, Spang R (2005) Molecular decomposition of complex clinical phenotypes using biologically structured analysis of microarray data. Bioinformatics 21: 19711978.
 Metz C, Wang PL, Kronman HB (1984) A new approach for testing the significance of differences between the roc curves measured from correlated data. In Information Processing in Medical imaging VIII F Deconick (ed.) 432445.
 Su J, Liu J (1993) Linear combinations of multiple diagnostic markers. J Am Statist Ass 88: 13501355.
 Zhou C, Obuchowski N, McClish D (2002) Statistical Methods in Diagnostic Medicine. New York: Wiley.
 Pepe M (2003) The statistical evaluation of medical tests for classification and prediction. New York, Oxford University Press.
 Liu A, Schisterman E, Zhu Y (2005) On linear combinations of biomarkers to improve diagnostic accuracy. Stat Med 24: 3747.
 Ma S, Huang J (2005) Regularized roc method for disease classification and biomarker selection with microarray data. Bioinformatics 21: 43564362.
 Wang Z, Chang YI, Ying Z, Zhu L, Yang Y (2007) A parsimonious thresholdindependent protein feature selection method through the area under receiver operating characteristic curve. Bioinformatics 23: 27882794.
 Kowalski J, Tu X (2007) Modern applied Ustatistics. John Wiley and Sons Inc.
