Figure 4: Parameter estimates for all selected variables at varying sample size n. Each of the four panels corresponds to one sample size, n ∈ {100, 200, 400, 1000}. Every panel shows two distributions: (1) the estimates for the ΩR sets over the 200 identification sets (histogram with horizontal hatching); (2) the estimates for the ΩR sets over the 200 × 50 validation datasets (histogram with diagonal hatching). The vertical dotted line indicates the mean of the latter distribution. The vertical continuous line indicates 0.2.
The R selected variables are a mixture of False Positives (V) and True Positives (S), which correspond to the left and right modes, respectively, of the distributions of ΩR estimates. As the sample size increases, S increases at the expense of V, which changes the shape of the distribution of the estimates for the ΩR variables. When n=1000, there are fewer False Positives and the estimates for the True Positives are no longer overestimated. This illustrates how the regression to the mean phenomenon affects the confirmation of the selected candidate biomarkers.
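
The mechanism described in this caption can be reproduced with a minimal simulation sketch. It is not the design used for Figure 4: the number of variables, the number of true biomarkers, the value of R, the true effect of 0.2, and the normal model with standard error 1/sqrt(n) are all hypothetical choices made for illustration. The sketch selects the R variables with the largest estimates in an identification set, then re-estimates the same variables in an independent validation set.

```python
import numpy as np


def simulate(n, n_vars=10_000, n_true=500, effect=0.2, r=500, seed=0):
    """Toy selection-then-validation experiment (all settings hypothetical)."""
    rng = np.random.default_rng(seed)

    # Hypothetical truth: the first n_true variables carry a real effect.
    true_effects = np.zeros(n_vars)
    true_effects[:n_true] = effect
    is_true = true_effects > 0

    # Assumption: each estimate is normal around its true effect, with a
    # standard error that shrinks as 1/sqrt(n).
    se = 1.0 / np.sqrt(n)
    ident = true_effects + rng.normal(0.0, se, n_vars)  # identification set
    valid = true_effects + rng.normal(0.0, se, n_vars)  # validation set

    # Select the r variables with the largest identification estimates.
    selected = np.zeros(n_vars, dtype=bool)
    selected[np.argsort(ident)[-r:]] = True

    tp = selected & is_true
    return {
        "S": int(tp.sum()),                          # true positives
        "V": int((selected & ~is_true).sum()),       # false positives
        "mean_ident_tp": float(ident[tp].mean()),    # inflated by selection
        "mean_valid_tp": float(valid[tp].mean()),    # free of selection bias
    }


results = {n: simulate(n) for n in (100, 200, 400, 1000)}
for n, res in results.items():
    print(n, res)
```

Under these assumptions, the selected True Positives at small n should show a mean identification estimate above their mean validation estimate, because only the variables whose estimates happened to be large were selected. At larger n, S should approach R and the gap between the two means should shrink.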