
Department of Mathematical Science, New Jersey Institute of Technology, USA

- *Corresponding Author:
- Fang Y

Department of Mathematical Science

New Jersey Institute of Technology, USA

**Tel:** +1 973-596-3000

**E-mail:** [email protected]

**Received Date**: September 06, 2016; **Accepted Date:** September 30, 2016; **Published Date**: October 07, 2016

**Citation:** Fang Y (2016) A Pass to Variable Selection. J Biom Biostat 7: 318. doi:10.4172/2155-6180.1000318

**Copyright:** © 2016 Fang Y, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Many regularized procedures produce sparse solutions and are therefore sometimes used for variable selection in linear regression. It has been shown that regularized procedures are more stable than subset selection. Such procedures include LASSO, SCAD, and adaptive LASSO, to name just a few. However, their performance depends crucially on the selection of the tuning parameter. For the purpose of prediction, popular methods for tuning parameter selection include $C_p$, cross-validation, and generalized cross-validation. For the purpose of variable selection, the most popular method is BIC, whose selection consistency has been shown for some regularized procedures. (Here selection consistency means that the probability of selecting the data-generating model tends to one as the sample size goes to infinity, assuming that the data-generating model is a submodel of the full model.) However, using BIC requires knowing the degrees of freedom. For many regularized procedures, such as those for graphical models and clustering algorithms, formulae for the degrees of freedom do not exist.

Recently, stability selection has become another popular method for variable selection [1,2]. However, most stability-based methods depend explicitly on some hyper-tuning parameter. For example, the method in [1] depends on a threshold (pre-set to 0.8 in [1]), and the method in [2] also depends on a threshold (pre-set to 0.9 in [2]). It is therefore desirable to have a method that avoids such hyper-tuning parameters in stability selection. One suggestion is to combine the strengths of stability selection and cross-validation. Since cross-validation is a selection method based on prediction, the new method is referred to as prediction and stability selection (PASS).
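To make the role of such a threshold concrete, the following Python sketch may help. It is not the procedure of [1] or [2]; it only assumes a hypothetical 0/1 matrix `selected` recording which variables some selector picked on each of 20 subsamples (the counts are made up for illustration), and shows that the reported model changes when the threshold moves from 0.8 to 0.9.

```python
import numpy as np

# Hypothetical data: how many of 20 subsamples selected each of 4 variables.
counts = np.array([20, 17, 4, 12])
n_subsamples = 20

# Boolean matrix (subsamples x variables); column j is True in counts[j] rows.
selected = np.arange(n_subsamples)[:, None] < counts

def stable_set(selected, threshold):
    """Indices of variables whose selection frequency reaches the threshold."""
    freq = selected.mean(axis=0)
    return np.flatnonzero(freq >= threshold), freq

idx_08, freq = stable_set(selected, 0.8)
idx_09, _ = stable_set(selected, 0.9)

print("selection frequencies:", freq)
print("stable set at 0.8:", idx_08)
print("stable set at 0.9:", idx_09)
```

In this toy example the second variable, with frequency 0.85, is kept at one threshold and dropped at the other, which is the kind of sensitivity the text refers to.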

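Consider variable selection in linear regression, $y_i = x_i\beta + \varepsilon_i$, $i = 1,\ldots,n$. Assume $\beta = (\beta_1,\ldots,\beta_p)'$ is sparse in the sense that $q < p$, where $q = |\mathcal{A}|$ and $\mathcal{A} = \{j : \beta_j \neq 0\}$. Without loss of generality, assume $\mathcal{A} = \{1,\ldots,q\}$. A general framework for regularized regression is $\hat\beta(\lambda) = \arg\min_{\beta} \{ n^{-1} \sum_{i=1}^{n} (y_i - x_i\beta)^2 + \sum_{j=1}^{p} p_\lambda(|\beta_j|) \}$, where $p_\lambda(\cdot)$ is a penalty function indexed by the tuning parameter $\lambda$. This framework includes LASSO, SCAD, and adaptive LASSO. If $\hat{\mathcal{A}}(\lambda) = \{j : \hat\beta_j(\lambda) \neq 0\}$ is used to estimate $\mathcal{A}$, most regularized procedures have been shown to be selection consistent with an appropriate $\lambda = \lambda_n$, where the subscript emphasizes its dependence on the data. In general, as shown in [3], there are five cases: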

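**Case 1:** If $\lambda_n \to \infty$, then $\hat\beta(\lambda_n) = 0$ with probability tending to one.

**Case 2:** If $\lambda_n \to \lambda_0 \in (0,\infty)$, then $\hat\beta(\lambda_n) \to \gamma_0$ in probability, where $\gamma_0$ is fixed and its sign pattern may or may not be the same as that of $\beta$.

**Case 3:** If $\lambda_n \to 0$ more slowly than $n^{-1/2}$, then $\hat\beta(\lambda_n) \to \beta$ in probability and the sign pattern of $\hat\beta(\lambda_n)$ is consistent with that of $\beta$ with probability tending to one.

**Case 4:** If $\lambda_n = \lambda_0 n^{-1/2}$ for some $\lambda_0 \in (0,\infty)$, then the sign pattern of $\hat\beta(\lambda_n)$ is consistent with that of $\beta$ on $\mathcal{A}$ with probability tending to one, while for every sign pattern consistent with that of $\beta$ on $\mathcal{A}$, the probability of obtaining this pattern tends to a limit in (0,1).

**Case 5:** If $\lambda_n \to 0$ faster than $n^{-1/2}$, then $\hat\beta(\lambda_n) \to \beta$ in probability and $\hat{\mathcal{A}}(\lambda_n) = \{1,\ldots,p\}$ with probability tending to one.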
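A small finite-sample illustration of these regimes, written in Python, is given below. It is only a sketch under simplifying assumptions that are not in the article: the penalty is the LASSO, the design is orthonormal (so the LASSO solution is plain soft-thresholding), and the coefficients, sample size, and the three values of λ are hypothetical. A very large λ returns the empty model, a moderate λ recovers the true support, and a tiny λ returns the full model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 6
beta = np.array([3.0, 1.5, 2.0, 0.0, 0.0, 0.0])   # hypothetical truth, support {0, 1, 2}

# Assumption for simplicity: orthonormal design scaled so that X'X / n = I.
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q
y = X @ beta + rng.standard_normal(n)

def lasso_orthonormal(X, y, lam):
    """Minimizer of (1/(2n))||y - Xb||^2 + lam * ||b||_1 when X'X / n = I."""
    z = X.T @ y / len(y)                            # least-squares estimate
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Selected support for a large, a moderate, and a tiny tuning parameter.
supports = {lam: np.flatnonzero(lasso_orthonormal(X, y, lam))
            for lam in (10.0, 0.5, 1e-9)}

for lam, support in supports.items():
    print(f"lambda = {lam:g}: selected variables {support}")
```

The example is a single simulated dataset rather than an asymptotic statement, so it only mirrors the qualitative behaviour of the degenerate cases and the consistent case.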

A good criterion should select $\lambda_n$ from Case 3; selecting $\lambda_n$ from Cases 1 or 2 might lead to under-fitting, while selecting it from Cases 4 or 5 might lead to over-fitting. If the two degenerate cases (1 and 5) are excluded in advance, the criterion, referred to as PASS, incorporates cross-validation, which avoids under-fitting, and the Kappa selection proposed in [2], which avoids over-fitting. To describe this criterion, consider any regularized procedure with tuning parameter $\lambda$ and randomly partition the dataset $\{(y_1, x_1),\ldots,(y_n, x_n)\}$ into two halves, $Z_1$ and $Z_2$. Based on $Z_1$ and $Z_2$ respectively, $\hat\beta^{(k)}(\lambda)$ is obtained and then the submodel $\hat{\mathcal{A}}_k(\lambda)$ is selected, $k = 1,2$.

If $\lambda$ is from Case 4, both submodels, $\hat{\mathcal{A}}_1(\lambda)$ and $\hat{\mathcal{A}}_2(\lambda)$, would include non-informative variables at random. The agreement of these two submodels can be measured by Cohen's Kappa coefficient, $\kappa(\hat{\mathcal{A}}_1(\lambda), \hat{\mathcal{A}}_2(\lambda))$. On the other hand, if $\lambda$ is from Case 2, either submodel might exclude some informative variable. To avoid such under-fitting, consider cross-validation, $\mathrm{CV}(Z_1, Z_2; \lambda)$. We are now ready to describe the PASS algorithm, which runs the following five steps.

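**Step 1:** Randomly partition the original dataset into two halves, $Z_1^{b}$ and $Z_2^{b}$.

**Step 2:** Based on $Z_1^{b}$ and $Z_2^{b}$ respectively, two submodels, $\hat{\mathcal{A}}_1^{b}(\lambda)$ and $\hat{\mathcal{A}}_2^{b}(\lambda)$, are selected.

**Step 3:** Calculate $\kappa(\hat{\mathcal{A}}_1^{b}(\lambda), \hat{\mathcal{A}}_2^{b}(\lambda))$ and $\mathrm{CV}(Z_1^{b}, Z_2^{b}; \lambda)$.

**Step 4:** Repeat Steps 1-3 $B$ times, indexing the repetitions by $b = 1,\ldots,B$, and obtain the following ratio,

$$\mathrm{PASS}(\lambda) = \frac{\sum_{b=1}^{B} \kappa(\hat{\mathcal{A}}_1^{b}(\lambda), \hat{\mathcal{A}}_2^{b}(\lambda))}{\sum_{b=1}^{B} \mathrm{CV}(Z_1^{b}, Z_2^{b}; \lambda)}.$$

**Step 5:** Compute $\mathrm{PASS}(\lambda)$ on a grid of $\lambda$ and select $\hat\lambda = \arg\max_{\lambda} \mathrm{PASS}(\lambda)$.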
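The five steps above can be sketched in Python as follows. This is an illustrative sketch rather than the author's implementation, and it rests on several assumptions: the regularized procedure is a LASSO fitted by a simple coordinate-descent routine (`lasso_cd`, written here for self-containment), the ratio in Step 4 is taken as average Kappa over average cross-validation error and is maximized, CV is the two-fold squared prediction error between the halves, Kappa is set to -1 when both submodels are empty or both are full, and the simulated data, grid, and `B` are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/(2n))||y - Xb||^2 + lam * ||b||_1 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)              # copy of y; equals y - X @ beta at beta = 0
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]   # partial residual without variable j
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta

def kappa(s1, s2):
    """Cohen's Kappa between two boolean selection vectors of length p."""
    p = s1.size
    n11, n12 = np.sum(s1 & s2), np.sum(s1 & ~s2)
    n21, n22 = np.sum(~s1 & s2), np.sum(~s1 & ~s2)
    pa = (n11 + n22) / p
    pe = ((n11 + n12) * (n11 + n21) + (n12 + n22) * (n21 + n22)) / p ** 2
    if pe == 1:                          # both empty or both full (assumed convention)
        return -1.0
    return (pa - pe) / (1 - pe)

def pass_score(X, y, lam, B=10, seed=0):
    """Steps 1-4: average Kappa divided by average two-fold CV error."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    kappas, cvs = [], []
    for _ in range(B):
        perm = rng.permutation(n)                          # Step 1: random halves
        i1, i2 = perm[: n // 2], perm[n // 2:]
        b1 = lasso_cd(X[i1], y[i1], lam)                   # Step 2: fit on each half
        b2 = lasso_cd(X[i2], y[i2], lam)
        kappas.append(kappa(b1 != 0, b2 != 0))             # Step 3: agreement
        cvs.append(0.5 * (np.mean((y[i2] - X[i2] @ b1) ** 2)
                          + np.mean((y[i1] - X[i1] @ b2) ** 2)))
    return np.mean(kappas) / np.mean(cvs)                  # Step 4: ratio

# Hypothetical simulated data with three informative variables.
rng = np.random.default_rng(1)
n, p = 100, 8
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(n)

grid = np.geomspace(0.01, 10.0, 8)                         # Step 5: grid of lambda
scores = np.array([pass_score(X, y, lam) for lam in grid])
lam_hat = grid[np.argmax(scores)]
print("selected lambda:", lam_hat)
print("selected variables:", np.flatnonzero(lasso_cd(X, y, lam_hat)))
```

Reusing the same seed for every λ means all grid points are compared on the same random splits, which is a design choice of this sketch and not something the article specifies.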

Following the five cases above, we can show that the proposed PASS criterion is selection consistent under some regularity conditions. The new criterion has several advantages. First, it does not depend on any hyper-tuning parameter. Second, its implementation is straightforward. Third, it can be applied to variable selection in any model, such as the linear model, the generalized linear model, and Cox's proportional hazards model. Fourth, it can be applied to variable selection in both supervised and unsupervised learning.

1. Meinshausen N, Bühlmann P (2010) Stability selection (with discussion). Journal of the Royal Statistical Society: Series B 72: 417-473.
2. Sun W, Wang J, Fang Y (2013) Consistent selection of tuning parameters via variable selection stability. Journal of Machine Learning Research 14: 3419-3440.
3. Bach F (2008) Bolasso: model consistent lasso estimation through the bootstrap. Proceedings of the 25th International Conference on Machine Learning: 33-40.
