Abilities of Statistical Models to Identify Subjects with Ghost Prognosis Factors
Nguyen JM (1,2,3)*, Gaultier A (1) and Antonioli D (3)
(1) SEB, CHU NANTES, 85, Rue Saint Jacques 44093 Nantes Cedex 01, France
(2) INSERM, UMR892, 8 quai Moncousu - BP 70721, 44007 Nantes Cedex 01, France
(3) HWRS, Atlanpôle, Route de Gachet, 44300 Nantes, France
Corresponding Author: Nguyen JM, SEB, CHU NANTES, 85, Rue Saint Jacques 44093 Nantes Cedex 01, France
E-mail: [email protected]
Received: October 22, 2015 Accepted: November 04, 2015 Published: November 06, 2015
Citation: Nguyen JM, Gaultier A, Antonioli D (2015) Abilities of Statistical Models to Identify Subjects with Ghost Prognosis Factors. J Health Edu Res Dev 3:141. doi:10.4172/2380-5439.1000141
Copyright: © 2015 Nguyen JM, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Many tools are available to estimate prediction quality, but none assesses the ability of a predictive model to identify completely missing or unknown prognostic factors, designated here as ghost factors (GFs). It may nevertheless be possible to predict whether a subject carries a GF.
Public datasets and simulated data were used. To simulate the presence of a GF, a significant prognostic factor and all variables correlated with it were removed before model fitting. A predictive statistical model was then developed to assess the relationship between the presence of a GF and the predictive capacity of a given model, based on the correlation between the predicted outcome and GF presence. Five statistical models were compared using this procedure.
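The removal step described above can be sketched on a toy cohort. This is only an illustration under stated assumptions, not the authors' procedure. The variable names (`gf`, `x1`, `x2`, `y`), the coefficients, the 0.3 correlation cut-off and the single-predictor least-squares fit are all hypothetical. The correlation of the residuals with GF presence is an extra quantity added here for comparison; the text above only names the correlation between the predicted outcome and GF presence.

```python
import random


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def simulate(n=500, seed=0):
    """Toy cohort; every coefficient below is an illustrative assumption."""
    rng = random.Random(seed)
    data = {"gf": [], "x1": [], "x2": [], "y": []}
    for _ in range(n):
        gf = 1 if rng.random() < 0.3 else 0       # factor to be hidden
        x1 = rng.gauss(0, 1)                      # generated independently of gf
        x2 = 2.0 * gf + rng.gauss(0, 0.5)         # strongly correlated with gf
        y = 1.0 + 0.8 * x1 + 2.0 * gf + rng.gauss(0, 0.5)
        for name, value in (("gf", gf), ("x1", x1), ("x2", x2), ("y", y)):
            data[name].append(value)
    return data


def remaining_predictors(data, ghost="gf", outcome="y", threshold=0.3):
    """Drop the ghost factor and every variable correlated with it."""
    return [
        name for name in data
        if name not in (ghost, outcome)
        and abs(pearson(data[name], data[ghost])) < threshold
    ]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope


def evaluate(data):
    """Fit on the remaining predictor, then compare against the hidden gf."""
    kept = remaining_predictors(data)
    x = data[kept[0]]  # the toy cohort leaves a single predictor
    intercept, slope = fit_line(x, data["y"])
    predicted = [intercept + slope * v for v in x]
    residual = [obs - pred for obs, pred in zip(data["y"], predicted)]
    return {
        "kept": kept,
        "corr_predicted_gf": pearson(predicted, data["gf"]),
        "corr_residual_gf": pearson(residual, data["gf"]),
    }


result = evaluate(simulate())
```

The hidden `gf` column is used only after fitting, to score how well a model-derived quantity tracks GF presence. The same scoring could be applied to any other model fitted on the remaining predictors.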
Across the six real databases evaluated, the only statistical method consistently able to identify subjects with GFs was the use of optimized regression models. Using simulated, linearly correlated data, optimized regression models exhibited a success rate of up to 92%, whereas conventional linear models had a success rate below 53%. Random forest and classification tree models had higher success rates than the other evaluated models.
Model-based outcome prediction was assessed with respect to the presence of GFs. Because GFs are by definition unknown, the factors themselves cannot be identified; only the subjects who carry significant unknown prognostic factors can be. As complex models outperformed linear models in identifying GF presence, we assume that the associations between GFs and outcome-predictive factors are also complex rather than linear.