Feature Selection using Bootstrapped ROC Curves

Background: In modeling a N by m data matrix, i.e. N samples on a m dimensional space, the issue arises when m is bigger than N. The sample size cannot be increased, especially in medical research, due to the limited number of diseased subjects. Feature selection is often used to select a subset of relevant m variables, often lower than N, for use in model construction. Method: A multiple step bootstrap method is proposed to quantify relevance of candidate predictors with the outcome based on their areas under the Receiver Operating Characteristic curve (ROCAUCs) from bootstrap resamples and then select only significant variables, which meet pre-specified criteria, as a feature selection process. Results: Extensive simulation was conducted using thousands of predictor variables and 5 levels of prediction ability between the true predictor and the outcome. The results from the simulation data indicate that the mean of ROCAUCs from bootstrap samples is close to the true ROCAUC. Even with only 30 cases and 30 controls, 25 out of 25 listed predictor variables provide the correct level of classification ability by using mean of bootstrapped ROCAUCs. The proposed bootstrapped ROCAUCs method outperforms the single ROCAUC. The standard error of mean of bootstrapped ROCAUCs was 20% to 50% smaller than the standard error of the single ROCAUC estimate from the original sample. An illustrative example is presented to apply the proposed methodology to identify the gene expressions that could predict clinical survival in breast cancer patients, using the Van’t Veer study’s breast cancer data. Conclusion: We conclude that the bootstrapped ROCAUCs methodology is intuitive and attractive for use in feature selection problems when the goals of the study are to identify important predictors and to provide insight regarding the discriminative or predictive ability of individual predictor variables. Such goals are common among microarray studies and new biomarker discovery.


Background
Information technology advancement has brought an explosive growth of data in recent decades. We collect data on a diverse and numerous assortments of variables, not knowing which ones will be relevant to the outcome of interest. The sample size of study subjects cannot be increased, especially in medical research, due to the limited number of diseased subjects. In modeling an N by m data matrix, i.e N subjects on an m dimensional space, the challenge is to identify relevant variables to be included in modeling the outcome. Thus, variable filtering plays an important role in reducing the number of variables before the formal model building. Thereafter, a subset of size k variables (usually k<N) remains as the result of the feature selection process and can be accomplished by a well-understood method for modeling low dimensional data.
The goal of variable filtering is to eliminate the majority of irrelevant variables, while keeping as many of the true predictors as possible, by reducing the size of candidate variables to a smaller number, k. The value of a feature selection procedure largely depends on the accurate assessment of variables in terms of importance to the outcome (variable importance) and the criteria for selecting the candidate variables to be included in the model building (variable selection). When both outcome and predictors are continuous, Fan and Lv [1] proposed Sure Independence Screening (SIS), which ranks all the variables by the absolute value of empirical Pearson's correlation coefficient between the outcome and each predictor. The method is not applicable when the outcome is dichotomized or survival time, such as case-control or time to disease onset data. Genuer et al. [2] proposed a method for variable selection using random forest score of importance, which is limits the predictors as the classifiers. In medical research, the researchers are often interested in the search of biomarkers that could be used to differentiate the disease population from non-diseased population or to predict time to event. Biomarkers are often measured on a continuous scale. In this setting, Pepe et al. [3] proposed to evaluate the predictors based on their individual empirical ROC value, and applied to an ovarian cancer dataset to select a subset of genes that could distinguish between cancerous and normal organ tissues. Jeffrey et al. [4] compared and evaluated a ROC method with other traditional methods, such as t-statistics. They concluded that Pepe's method performed well with datasets that had low levels of noise and large sample size. However, the method cannot be used when the outcome is survival time. Moreover, it did not take into consideration the variability of quantification process due to the finite number of study subjects and large number of predictors in high dimensional data. Boulesteix and Slawski [5] discussed the variability of gene ranking methods and showed ranked gene lists are highly unstable in the sense that a small change of the data set usually affects the obtained gene list. Taking into account the above observations, and considering the importance of variable filtering in high dimensional data, we propose a method using bootstrapped ROCAUCs to quantify each variable's discriminative or predictive ability. By selecting only relevant variables as a filtering process, which meets pre-specified criteria, the number of variables for model constructions is reduced.
This remaining paper is organized as follows. In Section "Methods: procedure for bootstrapped ROCAUCs", we describe the procedure to generate bootstrapped ROCAUC estimates for quantifying variable importance and we recommend the various criteria for variable selection. In Section "Application results: breast cancer gene expression and clinical survival", we evaluate the performance of proposed method, based on simulations of normal models for the case-control study (section "Simulation results"). In Section "Simulation study I", we present simulations for a prospective follow-up study where the outcome is survival time. In section "Simulation study II", an illustrative example is presented to apply the proposed methodology to the Van't Veer dataset, which was screened for gene expression variables that could predict clinical survival in breast cancer patients. We then follow with a discussion on the applicability of the proposed feature selection method and draw our conclusions in Section "Conclusion".

Methods: Procedure for Bootstrapped Rocaucs
The quantification of variable importance is crucial not only for ranking the candidate variables in the screening process but also to interpret and understand the data. It is the initial step of variable screening. When the predictor variables are measured on a continuous scale, the Receiver Operating Characteristic (ROC) curve is one of the best statistical techniques used to characterize their ability in classifying or predicting the disease outcome. The area under ROC curve (ROCAUC) is the summary index of ROC curve and can be interpreted as a measure of distance or, equally, a measure of stochastic dominance.

Area under Receiver Operating Curve (ROCAUC)
We review ROCAUC in this subsection. For a continuous variable Y and a binary outcome D, let D=1 if diseased and D=0 if non-diseased. Using a threshold c to define a binary test from a continuous variable Y as Let the corresponding true and false positive rate at the threshold c be TPR (c) and FPR (c) respectively, we define: and the ROC curve is plotting the entire set of possible true and false positive fractions obtained by dichotomizing Y with different thresholds. That is: ROC( . )={(FPR(c), TPR(c)), c € (-∞, +∞)}.
The area under the entire ROC curve (ROCAUC) is a global summary statistic of ROC curve, based on all possible cut-off values of a variable. It is defined as: In a prospective cohort study, a binary outcome can change over time. Suppose we have a time-dependent outcome along with continuous biomarkers and we want to see how well marker Y predicts the survival time for the subjects. Let Ti and Ci denote survival and censoring times for ith subject, We observe (Zi, δi ) where Zi=min(Ti, Ci) and δi=I (Ti ≤ Ci ). Denote Di (t) the time-dependent outcome status for subject i. at time t. For any threshold c, the true positive and false positive rates are time-dependent functions, defined as The time-dependent ROC curve plots TPR(c, t ) vs. FPR(c, t ) for any threshold c, so that, the area is a time-dependent function: The ROCAUC can take on values between 0 and 1. It is a monotonic increasing function. The variable with AUCROC of 1 is a perfect predictor because the true positive is 100% and the false positive is 0%. In contrast, the variable with an area under 50 is useless for classification/prediction. ROC curves are invariant to monotone transformations of the raw data. This property makes them appealing for comparisons across variables and hence for ranking. We suggest using non-parametric estimate of ROCAUC in the sample as it is more robust and do not depend on the distributions of the raw predictor values. Non-parametric ROCAUC estimates will be utilized to quantify the variable importance. More details on the non-parametric ROCAUC estimates can be found in the paper by Heagerty et al. [6] for survival outcome or the paper for binary outcome by Hanley and McNeil [7].

ROCAUC estimates from bootstrap resampling
Let S={(D i , Y i ), i=1, 2….n} denote a sample of N independent subjects, who have been measured with m continuous independent variables y i =(y i1 , y i2 ,….y im ) and the binary disease outcome (D=0 nondiseased, D=1 diseased). We assume that the sampled subjects are independent identical distributed (i.i.d) random variables with the distribution function F, i.e,  Once we have quantified the variable importance, we can rank the predictor variables either by the mean or certain frequency of ROCAUC values, that is, we will have a rank for all m variable based on bootstrapped ROCAUCs ( Figure 1):

Variable selection criteria
It is important to recognize that an appropriate statistical approach depends on the scientific objectives of the study. The decision for selecting candidate variables should be flexible depending on the objective of study and the information you have already known on the disease. We present the various criteria for variable selection.
If the known risk factors are in the data, it is recommended to keep any variable that has the higher rank order than the known risk factors, for example, a variable with mean (ROCAUCs)>-maximum (mean (ROCAUCs) of known risk factors. Another possibility is that we don't have known risk factors, but we would like to include a fixed size of p candidate variables. We may keep the top p variables ranked by the frequency or mean ROCAUCs, that is, . When we have little information on the disease, it may be appropriate to simply keep any variable which has fair discrimination or prediction ability, such as a frequency of ROCAUC values above 0.70 in over 80% of the re-samplings. The selected candidate variables can then be included in the traditional multivariate modeling for low dimensional data for model selection process. As bootstrapping also provide the variance of ROCAUC estimates, we may consider selecting the variable based on its ROCAUC estimate's confidence interval, such as the lower limit of confidence interval above 0.60.

Application Results: Breast Cancer Gene Expression and Clinical Survival
We apply our method to the public available dataset of gene expression profiling in predicting clinical survival outcome among breast cancer patients reported by Van de Vijver et al. [8]. The data can be downloaded through R package 'breastCancerNKI' (http:// www.bioconductor.org/packages/2.13/data/experiment/html/ breastCancerNKI.html). Total 295 breast cancer patients who were treated by modified radical mastectomy or breast-conserving surgery, followed by radiotherapy between 1984 and 1995 at the hospital of the Netherlands Cancer Institute were included. In this data set, approximately 25,000 human gene expressions were recorded for each patient. The endpoint of interest was the clinical survival time during the 10-years follow-up. The median follow up time was 7.2 years and the median survival time was 3.8 years. We evaluated the ROCAUCs from 1000 bootstrapped samples for all 24,496 gene expression markers. The results for top 15 substances/genes based on mean ROCAUCs are summarized in Table 1. The best individual gene only had a fair prediction (0.70 ± 0.06) on 10 year's survival. Thus, the combination of gene expressions may be furthered investigated to improve the overall predicative ability in the multivariate models.

Simulation results
We conduct extensive simulation studies to evaluate the finite sample performance of the proposed bootstrapped ROCAUCs method under a variety of settings obtained by controlling several critical factors such as the sample size and the type of outcome. By doing so, we can compare the performance across the setting and identify the settings favorable to the method. The advantage of using simulated data is that we know the truth underlying the data and therefore we have a gold standard against which to compare results. When the setting satisfies all the assumptions for bi-normal ROC model, that is, data is normally distributed for both diseased population and non-diseased population, it can be shown that true ROCAUC is: where Φ is the standard normal cumulative distribution function,  The results from the simulation data indicate that the mean of ROCAUCs from 1000 bootstrap samples is close to the true ROCAUC. Even with only 30 cases and 30 controls, 25 out of 25 listed predictor variables provide the correct level of classification ability by using mean of ROCAUCs from bootstrap samples. Using the single ROCAUC estimate from the original sample, 3 (more than 10%) predictor variables did not provide the correct level of classification ability. ROCAUC estimate varies across the bootstrap samples, even when the sample size is moderate (Figure 2). The quantification of variable importance based on a single ROCAUC estimate from the original sample thus is not stable and sensitive to the small change of data. Figure 3 illustrates the comparison of variability of a single ROCAUC and variability of mean of bootstrapped ROCAUCs. The standard error of mean of bootstrapped ROCAUCs was 20% to 50% smaller than the standard error of the single ROCAUC estimate from the original sample.

Simulation study II
In simulation study II, we look at the scenarios from the prospective cohort study when the disease outcome is time-dependent. Assuming a cohort of subjects at risk for certain disease, we collect biomarker values at baseline then follow the subjects for 5-year or until disease onset. The outcome of interest here is time to disease onset. We simulate the outcome variable following the exponential distribution with rate parameter λ=0.10, 0.20 or 0.25. With this set up, approximately 40%, 60% and 70% of subjects will have disease onset by the end of study    The shortcoming of the proposed method is the high computational cost, which is a problem for other current feature selection methods as well. However, with the advancement in computer technology, computational time should not constitute a substantial drawback relative to other approaches to feature selections.
The above method can be readily used to many applications such as new biomarker discovery when the outcome is time dependent, and/or microarray studies that are aimed to explore a large pool of genes and select a subset of genes that are differentially expressed. This information may help provide insight regarding the discriminative ability and/or predictive ability of individual predictor variables and develop more specific treatment strategies for patients. It is particular useful for screening the variables in the bioinformatics studies when the amount of variables is extremely large and the number of subjects is comparatively very small.
at year 5. For predictor variables, we use the settings that are similar to those in simulation study I for diseased and non-diseased subjects. A time-dependent ROCAUC was obtained for each predictor and repeated for 1000 bootstrap samples. Tables 3.1

Conclusion
We have demonstrated the procedure of feature selection based on the discriminative or predictive ability of variables via bootstrapped ROCAUCs. Filtering the variables can eliminate noise and alleviate the effect of the curse of dimensionality. The pre-selected variables can then be entered in the traditional low dimensional multivariate model for model building.
The proposed method has the unique advantage of quantifying the importance of candidate variables and is computationally much faster than any penalized or stepwise regression methods for variable selections. The evaluation of variables is based on the whole ROC curve rather than a single accuracy index. ROC curves have become ubiquitous in many application areas and the various advances have been discussed across published articles. Moreover, the uncertainty of variable importance is taken into consideration by computing the frequencies or mean of bootstrapped ROCAUC estimate values. Comparing to a single estimate of ROCAUC from the actual sample, the bootstrapped ROCAUC estimate is more robust as the sample variance or sample standard deviation, which are non-robust, can be greatly influenced by outliers. This type of variable selection is flexible and the criteria for selection may depend on the scientific objectives of the study. Another advantage of the proposed method is that we can select the variables by evaluating their predicting power