Author(s): Schaubel D, Hanley J, Collet JP, Boivin JF, Sharpe C,
Abstract Preexisting computerized databases are potentially valuable sources of epidemiologic data. Since such databases are infrequently created specifically for etiologic research, data may be available for the exposure of interest and, through record linkage, for the endpoint of interest, but lacking for potential confounders. Because of the size of these databases, two-stage sampling is an efficient alternative to surveying the entire study population for confounder data. At stage 1, information on exposure and disease status is obtained for the entire study population. Confounder data are collected for probability-selected subsamples at stage 2. Logistic regression is performed on the stage 2 samples, with the parameter estimates and variances appropriately corrected to account for the stage 1 data. In this paper, the authors present methods for determining the required stage 2 sample size in the case of categorical exposure and confounding variables. Sample size tables, power curves, and a computer program have been produced to accommodate a binary exposure and a single binary confounder. With the increasing availability of preexisting yet incomplete databases, the potential for use of two-stage sampling will greatly increase in the future. This investigation provides a basis for estimating the number of participants to sample for the collection of confounder data at the second stage.
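The two-stage design described above can be illustrated with a small simulation. This is only a rough sketch under stated assumptions, not the authors' method or their computer program: the population, prevalences, and risk values are invented for illustration, all function names are hypothetical, and the corrected logistic regression is replaced by a simpler stand-in (inverse-probability weighting of the stage 2 subjects followed by a Mantel-Haenszel odds ratio stratified on the confounder).

```python
import random
from collections import Counter

def simulate_population(n, seed=0):
    """Hypothetical study population of (exposure, disease, confounder) triples.
    The confounder is generated here but treated as unobserved at stage 1."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        c = 1 if rng.random() < 0.3 else 0
        # Exposure is more common when the confounder is present (illustrative values).
        e = 1 if rng.random() < (0.5 if c else 0.2) else 0
        # Disease risk depends on both exposure and confounder (illustrative values).
        risk = 0.05 * (2.0 if e else 1.0) * (3.0 if c else 1.0)
        d = 1 if rng.random() < risk else 0
        people.append((e, d, c))
    return people

def stage2_sample(people, per_cell, seed=1):
    """Stage 2: draw up to per_cell subjects at random from each
    exposure-by-disease cell and record their confounder values."""
    rng = random.Random(seed)
    cells = {}
    for e, d, c in people:
        cells.setdefault((e, d), []).append(c)
    sample = {}
    for key, confounders in cells.items():
        k = min(per_cell, len(confounders))
        # Keep the stage 1 cell size alongside the sampled confounder values.
        sample[key] = (len(confounders), rng.sample(confounders, k))
    return sample

def weighted_counts(sample):
    """Estimate the full (e, d, c) table by giving each stage 2 subject
    the weight N_cell / n_cell (inverse of the sampling fraction)."""
    est = Counter()
    for (e, d), (n_cell, confounders) in sample.items():
        w = n_cell / len(confounders)
        for c in confounders:
            est[(e, d, c)] += w
    return est

def mantel_haenszel_or(est):
    """Mantel-Haenszel odds ratio for exposure and disease, stratified on c."""
    num = den = 0.0
    for c in (0, 1):
        a, b = est[(1, 1, c)], est[(1, 0, c)]    # exposed: diseased, not diseased
        c0, d0 = est[(0, 1, c)], est[(0, 0, c)]  # unexposed: diseased, not diseased
        total = a + b + c0 + d0
        if total > 0:
            num += a * d0 / total
            den += b * c0 / total
    return num / den

if __name__ == "__main__":
    people = simulate_population(20000)
    sample = stage2_sample(people, per_cell=200)
    est = weighted_counts(sample)
    print("Confounder-adjusted odds ratio (weighted MH):", mantel_haenszel_or(est))
```

Sampling the same number of subjects from each exposure-by-disease cell is one possible allocation; the stage 2 sample size per cell (`per_cell` here) is the quantity the paper's tables and power curves are meant to help choose.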
This article was published in Am J Epidemiol and referenced in Journal of Biometrics & Biostatistics.