Feature Selection using Bootstrapped ROC Curves
*Corresponding Author:
Ping Xu
Department of Pediatrics, College of Medicine
University of South Florida, 3650 Spectrum Blvd
Suite 100, Tampa, Florida, USA
Tel: (813) 3969552
Fax: (813) 9105952
Received date: September 10, 2014; Accepted date: October 21, 2014; Published date: October 24, 2014
Citation: Xu P, Liu X, Hadley D, Huang S, Krischer J, et al. (2014) Feature Selection using Bootstrapped ROC Curves. J Proteomics Bioinform S9:006. doi:10.4172/jpb.S9-006
Copyright: © 2014 Xu P, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Background: When modeling an N by m data matrix, i.e. N samples in an m-dimensional space, problems arise when m is larger than N. In medical research the sample size often cannot be increased because the number of diseased subjects is limited. Feature selection is therefore used to choose a subset of relevant variables, typically fewer than N, for use in model construction.
Method: A multi-step bootstrap method is proposed for feature selection. It quantifies the relevance of each candidate predictor to the outcome by its area under the Receiver Operating Characteristic curve (ROCAUC), computed on bootstrap resamples, and then selects only the significant variables, i.e. those that meet pre-specified criteria.
Results: Extensive simulations were conducted using thousands of predictor variables and five levels of prediction ability between the true predictor and the outcome. The simulation results indicate that the mean of the ROCAUCs from bootstrap samples is close to the true ROCAUC. Even with only 30 cases and 30 controls, all 25 of the 25 listed predictor variables were assigned the correct level of classification ability when the mean of bootstrapped ROCAUCs was used. The proposed bootstrapped ROCAUC method outperforms the single ROCAUC: the standard error of the mean of bootstrapped ROCAUCs was 20% to 50% smaller than the standard error of the single ROCAUC estimate from the original sample. An illustrative example applies the proposed methodology to the Van’t Veer study’s breast cancer data to identify gene expressions that could predict clinical survival in breast cancer patients.
Conclusion: We conclude that the bootstrapped ROCAUC methodology is intuitive and attractive for feature selection problems in which the goals of the study are to identify important predictors and to provide insight into the discriminative or predictive ability of individual predictor variables. Such goals are common in microarray studies and in the discovery of new biomarkers.
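To make the idea in the Method paragraph concrete, the following Python sketch computes a ROCAUC for each candidate predictor on bootstrap resamples and keeps the predictors that pass a selection rule. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the resampling scheme or the pre-specified criteria, so the stratified resampling of cases and controls, the function names (`roc_auc`, `bootstrap_aucs`, `select_features`), the threshold `min_auc`, and the requirement that the 2.5th percentile of the bootstrapped ROCAUCs exceed 0.5 are all hypothetical choices. The sketch also assumes that larger predictor values indicate cases.

```python
import numpy as np

def roc_auc(x, y):
    """ROCAUC as the probability that a case scores above a control (ties count 1/2)."""
    pos, neg = x[y == 1], x[y == 0]
    diff = pos[:, None] - neg[None, :]          # all case/control pairs
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def bootstrap_aucs(x, y, n_boot=200, rng=None):
    """ROCAUCs of one predictor over bootstrap resamples.

    Assumption: cases and controls are resampled separately (stratified),
    so every resample keeps the original group sizes.
    """
    rng = np.random.default_rng(rng)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    aucs = np.empty(n_boot)
    for b in range(n_boot):
        idx = np.concatenate([
            rng.choice(pos_idx, pos_idx.size, replace=True),
            rng.choice(neg_idx, neg_idx.size, replace=True),
        ])
        aucs[b] = roc_auc(x[idx], y[idx])
    return aucs

def select_features(X, y, min_auc=0.7, n_boot=200, seed=None):
    """Keep columns of X that pass a hypothetical selection rule.

    Hypothetical criteria: mean bootstrapped ROCAUC >= min_auc and the
    2.5th percentile of the bootstrapped ROCAUCs above 0.5 (chance level).
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    mean_auc = np.empty(m)
    lower = np.empty(m)
    for j in range(m):
        aucs = bootstrap_aucs(X[:, j], y, n_boot, rng)
        mean_auc[j] = aucs.mean()
        lower[j] = np.percentile(aucs, 2.5)
    keep = (mean_auc >= min_auc) & (lower > 0.5)
    return np.flatnonzero(keep), mean_auc

# Toy data (simulated here for illustration only): 30 cases, 30 controls,
# 20 predictors, of which only column 0 is related to the outcome.
rng = np.random.default_rng(0)
y = np.repeat([1, 0], 30)
X = rng.normal(size=(60, 20))
X[y == 1, 0] += 2.0
selected, mean_auc = select_features(X, y, seed=1)
print("selected columns:", selected)
print("mean bootstrapped ROCAUC of column 0:", round(mean_auc[0], 3))
```

The pairwise ROCAUC computation is quadratic in the group sizes, which is acceptable for samples of the size discussed here; a rank-based formula would be preferable for large N. With thousands of predictors, some unrelated variables can pass any fixed threshold by chance, so the criteria should be set with that in mind.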