^{1}Biostatistics Core, Karmanos Cancer Institute, Department of Oncology, Wayne State University School of Medicine, Detroit, MI 48201, USA
^{2}Departments of Bioinformatics and Biostatistics, School of Public Health and Information Sciences, University of Louisville, Louisville, KY 40202, USA
Received date: December 04, 2015; Accepted date: December 24, 2015; Published date:December 30, 2015
Citation: Kim S, Jang H, Wu D, Abrams J (2015) A Bayesian Nonlinear Mixedeffects Disease Progression Model. J Biom Biostat 6:271. doi:10.4172/21556180.1000271
Copyright: © 2015 Kim S, et al. This is an openaccess article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Visit for more related articles at Journal of Biometrics & Biostatistics
A nonlinear mixedeffects approach is developed for disease progression models that incorporate variation in age in a Bayesian framework. We further generalize the probability model for sensitivity to depend on age at diagnosis, time spent in the preclinical state and sojourn time. The developed models are then applied to the Johns Hopkins Lung Project data and the Health Insurance Plan for Greater New York data using Bayesian Markov chain Monte Carlo and are compared with the estimation method that does not consider randomeffects from age. Using the developed models, we obtain not only agespecific individuallevel distributions, but also populationlevel distributions of sensitivity, sojourn time and transition probability.
Cancer screening; Nonlinear mixedeffects models; Sensitivity; Sojourn time
The disease progression model introduced by Zelen and Feinleib [1] assumes that disease progresses through three states: where So corresponds to the diseasefree state, Sp the preclinical disease state when an asymptomatic individual unknowingly has the disease that a screening exam can detect, and Sc the clinical state when the disease manifests itself in clinical symptoms. In addition, if one enters the preclinical state (Sp) at time t_1 and the clinical state (Sc) at time t_{2}, the time difference (t_{2}t_{1}) is called sojourn time. The length of the time (t_{2}t) is called lead time if one is offered a screening exam at time t within the preclinical state (t_{1} ≤ t ≤ _{2}) and the disease is diagnosed.
Screening aims to detect disease in the preclinical state before symptoms appear, which may greatly increase the chances for effective treatment. The disease progression model has been used to analyze the screening data with three key components [2,3]: sensitivity of the screening modality, the sojourn time distribution, and the transition probability density from the diseasefree to the preclinical state. These three parameters are building blocks for screening modeling, because all other parameters of interest can be expressed as a function of these key parameters. In particular, the sensitivity of the screening modality is critical for evaluating the predictive performance of a screening program. Note that the sensitivity is defined as the conditional probability that a screening test is positive when one is in the preclinical state (Sp).
Screening sensitivity may depend on a variety of factors, such as position, location and size of the tumor, experience of the radiologist, age, etc. Wu et al. [2] modeled the sensitivity as a function of age with an agedependent transition probability density and then applied the model to the Health Insurance Plan of Greater New Yorker (HIP), a breast cancer screening study, and later to the Mayo Lung Project data, a lung cancer screening study [3]. However, sensitivity of mammogram increases as a woman’s age increases, while screening sensitivity of chest Xray for lung cancer does not depend on age. In later developments, sensitivity was modeled as a function of the time spent in the preclinical state along with age at diagnosis [4], but sensitivity was influenced by the proportion of time in the preclinical state to the sojourn time more than by age for breast cancer screening. For this reason, Kim and Wu [5] recently modeled sensitivity as a function of the sojourn time and time spent in the preclinical state and then applied the model to the Johns Hopkins Lung Project data. Nevertheless, it still remains unclear which of these models describes sensitivity more appropriately.
Besides, the progress of disease can vary by age. In particular, it is well known that human cancer incidence depends on age, and the risk of being diagnosed with cancer increases with age [6]. Therefore, it is desirable to estimate the results of disease (e.g., cancer) screening for overall summaries as well as for agespecific summaries. To our knowledge, disease progression models have not considered variation in age.
The main objectives of this study were to resolve the two aforementioned issues: (i) finding a proper sensitivity model and (ii) estimating the disease progression models by considering variation in age. To this end, we generalize sensitivity as a function of age at diagnosis, sojourn time, and time spent in the preclinical state and a nonlinear mixedeffects model is developed for disease progression models from a Bayesian framework. We then applied our models to the Johns Hopkins Lung Project (JHLP) data and the Health Insurance Plan for Greater New York (HIP) data. All simulations were run by using the statistical software R (R Development Core Team), and the algorithms described in this study can be obtained upon request from the authors.
The remainder of the paper is organized as follows. In Section 2, we introduce a disease progression model and our generalized sensitivity model. A Bayesian nonlinear mixedeffects model is developed based on a trinomial distribution in Section 3. In Section 4, the developed models are applied to JHLP and HIP data. Concluding remarks are found in Section 5.
Suppose an individual begins the screening exams at the th age group, 1 ≤ i ≤ K where K is the size of age groups, and let represent N_{i} ordered screening exams starting at age, t_{i,0} where D_{i} is the fixed followup time after the last examination. The jth screening interval is , j = 1,2,…Ni at the ith age group. The following notation is used: the jth annual screening exam occurs at age , for j = 1,2,…N_{i}, and , nij is the total number of individuals examined at is the number of cases diagnosed and confirmed at ; and rij is the number of interval cases within the interval .
For each age group, we model sensitivity to vary with three factors, which are screening age t, sojourn time T, and time spent in the preclinical state s, by
(1)
Where t is the average age at entry in the entire study group and s ∈ [0,T] is the time spent in Sp. This is an extension of the sensitivity of Kim and Wu [5] where the sensitivity depends on the sojourn time and the time spent in the preclinical state. Note that sojourn time T is a random variable in this model. Here, in general, the parameters α and γ are responsible for the maximum value and for the rate of the sensitivity, respectively, while the parameter β explains how the behavior of the sensitivity changes with age. Namely, the maximum sensitivity increases as the parameter α increases. When s/Tis close to zero, sensitivity increases rapidly if γ < 1, while sensitivity increases gradually if > 1. Sensitivity is an increasing function of age when the parameter β is positive.
Let D_{ij} be the probability of an individual correctly diagnosed at the jth scheduled exam, given at and started the screening exam at age t_{i,0} (i.e. the ith age group), and I_{ij} the probability of an interval case in . These two probabilities, for j = 1,2,…N_{i}, are:
and
Note that W(x) is the probability density function for transition from S_{o} to Sp at age X and is modeled as a subdensity of a lognormal distribution,
where 20% was selected based on the previous analysis on Lung cancer screening [3]; q(t) is the probability density function (pdf) of the sojourn time in Sp; and is the survivor function of the sojourn time. The loglogistic distribution was used to model the sojourn time [2]:
where κ and are positive scale and location parameters in the loglogistic family. For more detailed descriptions of the loglogistic distribution, refer to Kim and Wu [5].
The disease progression model with the proposed sensitivity model has the parameter For the cohort aged t_{i,0} at study entry, the conditional likelihood function is
(2)
where N_{i} is the number of screenings. As a result, the overall likelihood function for all the study groups is the product of the agespecific contributions across all age groups
(3)
The conventional parameter estimation ignores the agespecific effect since its likelihood function is based on Equation (3). One approach to incorporating the agespecific effect into a model is to use a mixedeffects model. In the next section, a nonlinear mixedeffects model is developed with the agespecific effect as a randomeffect.
The threestage hierarchical nonlinear mixedeffects model is developed for disease progression models from a Bayesian framework. The firststage model has the form of
where D_{i,j} is the probability of an individual correctly diagnosed at the jth scheduled exam given at the ith age group; I_{i,j} is the probability of an interval case in() at the ith age group; n_{ij} is the total number of the individuals examined at at the ith age group; S_{ij}: the number of cases diagnosed at at the ith group; r_{ij}: the number of interval cases within the interval at the ith age group; and θ_{i} is a pdimensional agespecific parameter vector, where, in this project, p=7 and . . Note that Tri stands for a trinomial distribution. The secondstage is structured based on a multivariate normal distribution (MVN) and is given by
where θ is the populationaverage parameter vector, and Σ_{p} is the betweenage covariance matrix. The third stage of the model describes the prior distributions
for given c, C_{p} , , and R_{p}, where W is a Wishart distribution. The posterior distribution for is proportional to
The previous published results in Kim and Wu and Wu et al. were used for selecting of the hyperparameters (i.e., prior information) of the JHLP and HIP data, respectively. That is, we used the following estimates of mean and standard deviations for the hyperparameters (α,β,γ,μ,σ^2,κ,ρ), respectively, (1.676 ± 1.338, 0.085±0.078,0.1293 ± 0.0806,4.3440 ± 0.0008,0.0426 ± 0.0036, 1.6278 ± 0.8242,0.0263 ± 0.0150), for the JHLP data, and (1.676 ± 1.338,0.085 ± 0.078,0.1293±0.0806,4.340 ± 0.076,0.190 ± 0.076,2.509 ± 0.927, 0.886±0.287), for the HIP data. The Gibbs sampler is defined by the following full conditional distributions except for agespecific parameterθ1:
(4)
(5)
where
Because of the nonlinear functions D_{ij} and I_{ij}, the conditional distribution of θ_{1} is nonstandard and known up to a normalizing constant,
(6)
where L_{i}(θ_{1}) denote the conditional loglikelihood for the ith age of Equation (2).
The developed methods are applied to both the Johns Hopkins Lung Project (JHLP) data [5,7] and the Health Insurance Plan for Greater New York (HIP) data [8]. Since no analytical formulas were available for the likelihood function, we used the Markov chain Monte Carlo (MCMC) approach to estimate the posterior distribution for each parameter. In details, let the full conditional posterior distribution of a vector of parameters, θ, be , that is, the posterior distribution of θ given all other quantities in the model. Since our model is nonlinear, it leads to a form for that is nonstandard and is known only up to a normalizing constant. For this reason, we updated the parameters via MetropolisHastingswithinGibbs steps and chose randomwalk Metropolis for updating the θ_{1} parameters, which is a natural choice of MetropolisHastings [9]. For setting of the prior distribution (hyperparameters), we employed the previous results in Kim and Wu [5] and Wu et al. [2] as stated before.
For comparison, we analyzed the JHLP and HIP data using both the conventional and developed approaches. Here the conventional approach means that all the parameters were estimated based on the overall likelihood, which is Equation (3), without considering the variation in different age groups. For convenience, we call the developed algorithm the mixedeffects disease model (MEDM) and the conventional algorithm the fixedeffects disease model (FEDM).
For both MEDM and FEDM, we generated one MCMC chain for each of JHLP and HIP data until the MCMC chain reached at least 20,000 iterations. We then obtained the MCMC chain with the size of 1,000 using the last 10,000 iterations for each chain with burnin of at least 10,000 and a thinning of every 10 steps. Supplementary Information (Figures S1 and S2 in Appendix) are the trace plots of the simulated
Markov chains only for populationlevel parameters. The solid and dotted horizontal lines represent the mean and the median of each parameter, respectively.
The jhlp and hip data
In the Baltimore metropolitan area from 1973 to 1978, the Johns Hopkins Lung Project (JHLP) trials enrolled 10,386 men aged 45 years and older who smoked at least one pack of cigarettes per day (or who had smoked this much within one year of enrollment) and who had no prior history of respiratory cancer. All participants were then randomized to either chest Xray only or a dual screen (chest Xray and sputum cytology) groups, resulting 5,160 men in the chest Xray only arm and 5,226 in the dualscreen arm. Participants in the chest Xray arm received chest Xray screening test annually, for 8 consecutive years. If any of the tests were positive, the screen was considered positive and a definitive workup exam, such as biopsy, was done. In this study, we used the chest Xray arm, including the total number of participants in each screening exam, the number of detected and confirmed cancer cases in each screening exam, and the number of interval cases. These data were stratified by age at study entry from 45 to 88 years old.
The Health Insurance Plan of Greater New York (HIP) study began at the end of 1963 and was the first randomized clinical trial for regular screening exams that include mammography as a screening test for breast cancer [8]. The study enrolled asymptomatic women aged 40 to 64 years who had no history of breast cancer. The participants were randomized into study and control arms, with about 31,000 women in each arm. The screening program for the study arm specified up to four annual breast exams with both a mammogram and a clinical breast exam, while the control arm received usual care. Data from the study arm was used for this study, where the data were stratified by age at study entry from 40 to 64 years old.
Table 1 displays the empirical means, standard deviations, and 95% credible intervals (CIs) of the posterior distributions of parameters. As for MEDM, the fixedeffect estimates (i.e., populationlevel estimates) are considered to compare with the estimates of FEDM. We can see the large absolute difference in log (α) and log (γ) between two estimation methods. However, all 95% CIs of estimates of FEDM overlap with these of populationlevel estimates of MEDM except for the estimate of JHLP and HIP and the estimate ˆñ of HIP (Table 1).
log(α)  β  log(γ)  μ  log(σ^{2})  k  log(ρ)  

JHLP  FEDM  1.10 ± 2.21 (6.50,1.51) 
0.04 ± 0.09 (.15,0.19) 
4.55 ± 5.48 (17.30,1.04) 
4.17 ± 0.32 (3.52,4.50) 
2.71 ± 0.78 (3.74,0.62) 
2.19 ± 0.89 (1.04,4.33) 
3.49 ± 1.39 (5.80,1.51) 
MEDM  0.17 ± 0.14 (0.12,0.44) 
0.02 ± 0.06 (0.10,0.14) 
0.54 ± 0.09 (0.37,0.71) 
5.10 ± 0.16 (4.70,5.33) 
0.19 ± 0.22 (0.78,0.00) 
3.84 ± 0.84 (1.99,4.94) 
3.80 ± 0.30 (4.37,3.14) 

HIP  FEDM  0.92 ± 1.06 (3.28,0.81) 
0.00 ± 0.12 (0.19,0.18) 
1.58 ± 1.32 (4.48,0.59) 
4.22 ± 0.38 (3.52,4.65) 
1.25 ± 0.73 (2.28,0.06) 
1.74 ± 0.69 (1.02,3.56) 
0.94 ± 0.34 (1.62,0.30) 
MEDM  0.18 ± 0.07 (0.03,0.32) 
0.03 ± 0.05 (0.12,0.06) 
0.53 ± 0.12 (0.29,0.77) 
5.21 ± 0.07 (5.04,5.33) 
0.08 ± 0.08 (0.26,0.00) 
4.03 ± 0.73 (2.21,4.96) 
2.82 ± 0.20 (3.17,2.39) 
Table 1: Estimates of fixedeffects and mixedeffects using JHLP and HIP data. The empirical means, standard deviation, and 95% credible intervals of posterior distributions are reported.
Estimates of the variancecovariance matrix Σ of MEDM are shown in Table 2. Since only log (α), β, log (γ), and m are considered as randomeffects, the size of Σ is four by four. For both JHLP and HIP data, there is greater variation in the parameters log (α) and log (γ) than these in other parameters, indicating that sensitivity is influenced by age at diagnosis. Forest plots of each individuallevel estimate of MEDM are plotted in Figure 1. In case of the parameters β and μ, the empirical means of the individuallevel estimates are very close to that of the populationlevel estimate for both JHLP and HIP data. On the other hand, we can see a larger variation of the individuallevel estimates of log (α) and log (γ). These imply that the parameters α and β have a large influence on age, so does sensitivity. The individuallevel estimates of each age at diagnosis can be found in Supplementary Information (Tables S1 and S2 In appendix).
JHLP  

log(α)  β  log(γ)  μ  
log(α)  204.28 ± 83.92 (91.46,404.52) 
0.04 ± 0.56 (1.21,1.05) 
90.80 ± 47.89 (21.52,206.16) 
0.11 ± 0.90 (1.65,1.98) 
β  0.01 ± 0.00 (0.00,0.02) 
0.01 ± 0.38 (0.82,0.73) 
0.00 ± 0.01 (0.01,0.01) 

log(γ)  113.40 ± 42.72 (55.80,217.48) 
0.04 ± 0.56 (1.05,1.18) 

μ  0.03 ± 0.02 (0.01,0.07) 

HIP  
log(α)  β  log(γ)  μ  
log(α)  302.53 ± 120.48 (139.40,572.52) 
0.24 ± 0.62 (1.51,1.13) 
284.07 ± 106.85 (134.47,528.61) 
0.01 ± 0.61 (1.17,1.25) 
β  0.01 ± 0.00 (0.00,0.02) 
0.34 ± 0.83 (1.89,1.48) 
0.00 ± 0.00 (0.01,0.01) 

log(γ)  407.42 ± 123.58 (229.03,709.96) 
0.01 ± 0.75 (1.48,1.63) 

μ  0.01 ± 0.01 (0.01,0.03) 
Table 2: Estimates of variancecovariance matrices of MEDM using JHLP and HIP data. The empirical means, standard deviation, and 95% credible intervals of posterior distributions are reported.
Figure 1: The forest plots of individuallevel estimates of MEDM. (a) JHLP and (b) HIP. The red dotted lines indicate the means of the posterior populationlevel estimates of log (α), β, log (γ), and m. The solid dots and the solid lines represent empirical means and 95% credible intervals of the posterior distributions of each estimate.
The developed sensitivity models depend on age at diagnosis, the time spent in the preclinical state and the sojourn time, resulting in a function of age and the proportion of time spent in the preclinical state to the sojourn time. Note that the average age in Equation (1) is globally set to 55 years for all age groups in both JHLP and HIP data.
Figure 2 shows the posterior sensitivities estimated by FEDM and MEDM on JHLP and HIP data. The populationlevel estimates ˆã of FEDM are less than one (i.e., log(γˆ ) < 0 ), while these of MEDM are greater than one (i.e., log(γˆ ) > 0 ), for both JHLP and HIP data (Table 1). As a result, the populationlevel posterior sensitivities estimated by FEDM are larger than these by MEDM when s/T is near zero, based on the subplots between η and s/T in Figure 2. Interestingly, many individuallevel posterior estimates of γ are less than one (i.e., log(γˆ ) < 0 ), although the populationlevel estimates are larger than one (i.e., log(γˆ ) > 0 ), as can be seen in Figure 1 and Supplementary Information Tables 1 and 2. On the other hand, the populationlevel sensitivities of JHLP are an increasing function in age regardless of estimation methods, while these of HIP are a decreasing function in age in MEDM , based on the subplots between age and s/T in Figure 2. This is because the posterior estimates of JHLP and HIP are positive and negative, respectively. In general, both JHLP and HIP data show large differences in sensitivity between FEDM and MEDM.
The individuallevel posterior sensitivities are shown in Supplementary Information (Figures S3 and S4 in appendix). In particular, these predicted sensitivities show significant variations among age groups, which might be resulted from the large variations in parameters log (α) and log (γ) in Table 2.
Figure 3 shows the posterior transition probability estimated by FEDM and MEDM. The estimates of MEDM for both JHLP and HIP are larger than these of FEDM, resulting that the modes of FEDM are a little smaller than these of MEDM (61 vs. 72 years and 51 vs. 73 years, respectively, for JHLP and HIP). The individuallevel variation of the transition probability can be seen in Supplementary Information (Figure S5 in appendix). The variation in age is larger in JHLP data than in HIP data.
The posterior sojourn time distributions are depicted in Figure 4. As expected by that the 95% CIs of the estimates of HIP are not overlapped, the sojourn time distributions of HIP are very different between FEDM and MEDM of HIP. The modes of sojourn time in HIP are 1.01 and 15.15 years for FEDM and MEDM, respectively, while these of JHLP are 21.21 and 38.38 years for FEDM and MEDM, respectively (Figure 4).
We propose a generalized sensitivity model which depends on age at diagnosis, time spent in the preclinical state and sojourn time, and the developed sensitivity model is applied to a novel nonlinear mixedeffects model for a disease progression model in a Bayesian framework.
As for JHLP data, FEDM along with the developed sensitivity model estimates the parameters using the same data as used in Kim and Wu [5]. The main difference is the sensitivity model, and their sensitivity model is a special case of our model. That is, when the parameter b is equal to zero in Equation (1), our model becomes same as their model. Generally, our results are consistent with these of Kim and Wu [5]. However, our estimate of γ is smaller than their estimate, 0.01 vs. 0.13, respectively (Table 3 in Appendix), resulting that our sensitivity increases much faster than theirs. This might be due to ageeffect in the sensitivity.
When the populationlevel sensitivity of JHLP data is compared with that of HIP data, we can see a different trend from each other in the sense that the sensitivity increases as men get older in lung cancer, while the sensitivity decreases as women get older. In other words, the probability to detect lung cancer is higher in old than in young, and the breast cancer might be detected more in young than in old by a screening exam. This can be explained by the fact that lung cancer does not have a reliable early detection test compared with other cancers [10].
The main advantage of MEDM over FEDM is to incorporate variation in age into the disease progression model, resulting in estimates with better precision. By doing so, we can obtain not only populationlevel estimates but also individuallevel estimates. Accurate estimates are critical for policy makers to predict the performance of a screening exam. In this regard, the proposed MEDM along with a generalized sensitivity model might provide a more accurate assessment of screening for policy makers [11].
We appreciate the helpful comments of the anonymous reviewer and the editor for helpful suggestions. This work is partially supported by NSF grant DMS 1312603. The Biostatistics Core is supported, in part, by NIH Center grant P30 CA022453 to the Karmanos Cancer Institute at Wayne State University.
The authors have declared no conflict of interest.