A New Stochastic Model of Retinoblastoma Involving both Hereditary and Non-hereditary Cancer Cases

Background and purpose: Retinoblastoma is initiated by the mutation or loss or inactivation of the retinoblastoma gene (the Rb gene) in chromosome 13q14. Further, both the germline cells (eggs and sperm) generating the individuals may carry a mutant allele of the Rb gene so that individuals may carry both mutants in the Rb locus at the embryo stage in which case cancer tumor may develop at or before birth. Recent molecular studies have also shown that besides the abrogation of cell dierentiation by the inactivation of the Rb locus, the apoptosis mechanism needs also to be inhibited or abrogated for the generation of retinoblastoma tumor. The purpose of this paper is to develop new stochastic and statistical models for retinoblastoma to incorporate these biological findings.


Introduction
For the human pediatric eye cancer-retinoblastoma, Knudson [1] discovered that the cancer was initiated by the mutation or loss or inactivation of the retinoblastoma gene (Rb gene) in chromosome 13q14. This discovery was further documented by cytogenetic studies by Cavenee et al. [2] (see [3], Chapter 3). For retinoblastoma, Knudson [1] has thus proposed a two-stage model for the development of cancer tumors. According to this model, the person is in the first stage (I 1 stage) if one copy of the Rb gene in an eye stem cell in this person has lost or mutated or inactivated. The person is in the second stage (I 2 stage) if both copies of the Rb gene in an eye stem cell have been mutated or lost or inactivated. According to this model, cancer tumors are derived from primary I 2 cells, where a primary I 2 cell is an I 2 cell generated by an I 1 cells through mutation or deletion or inactivation of the other Rb allele. A specific feature of this model is that both the germ line cells (eggs and sperms) and somatic cells can carry a mutant of the Rb gene leading to inherited cancer cases. The above two stage model for retinoblastoma was taken for granted until the early 1990's when it was discovered that default loss of Rb was not tumorigenesis unless death by apoptosis was also inhibited [4][5][6][7][8][9][10][11]. Based on results of molecular biology, DiCiomme et al. [6] have thus proposed a threestage biological model for carcinogenesis of retinoblastoma with the first two stages being associated with the Rb gene and with the third stage involving abrogation of apoptosis and /or cell cycle control by some genetic and/or epigenetic changes [12,13]. Let I 3 denote the third stage, where an I 3 cell is an I 2 cell with additional genetic and/or epigenetic changes to abrogate or inhibit apoptosis and/or cell cycle control. Then the 3-stage biological model proposed by DiCommo et al. [6] assumes that retinoblastoma develop by the pathway N → I 1 → I 2 → I 3 → Tumor. Despite these biological studies, however, stochastic mathematical models for retinoblastoma based on these biological studies had never been derived, nor had these models been tested against cancer incidence data such as those from SEER from NCI/NIH; for more detail, see Section 4.
Given the above background of retinoblastoma, the objective of this paper is to develop a biologically supported stochastic mathematical model of carcinogenesis for the development of retinoblastoma tumors. Because for retinoblastoma, a large number of babies develop cancer tumors at birth (0) (see data in Table 1), we will also proceed to develop a biologically supported statistical model to incorporate genetic segregation of the mutant allele of the Rb locus in the population.
A biologically supported stochastic model of retinoblastoma incorporating inherited cancer cases A modified MVK model for retinoblastoma with discretetime: Consider a discrete-time model with one time unit corresponding to six months or longer (See Section 4 and Remark s 1 and 4). Let N denote normal stem cell and T cancer tumor. Then the MVK model with discrete-time is characterized by the following postulates: (a) N (I 0 ) → I 1 → I 2 → Tumor by genetic changes and/or epigenetic changes.
(b) The I i cells are subjected to stochastic proliferation (birth) and differentiation (death).
(c) With probability one each I 2 cell generated at time t from an I 1 cell would develop by clonal expansion (i.e., stochastic birthdeath process, [14]) into a detectable cancer tumor at time t + 1.

(d) All cells develop independently of other cells.
Because this model is not exactly the same as the MVK model proposed by Moolgavkar and Venzon (see [3], Chapter 3; [15]), in what follows we will refer this model as the modified MVK two-stage model. In Section 4 it will be shown that this model fits cancer incidence data (SEER data from NCI/NIH) extremely well and much better than a three-stage model.
The following biological observations also suggest that this discretetime MVK model is consistent with the observations of the biological model by Diciommo et al. [6]: (a) Apoptosis is activated only if the number of I 2 stem cells has grown into a cluster of large number of I 2 cells (10 6 ~ 10 8 or larger) in which case the probability of some genetic and/ or epigenetic changes to induce inhibition or abrogation of apoptosis is greatly enhanced. (The rate of genetic and/or epigenetic changes to induce inhibition of apoptosis is about 10 −3 ~ 10 −4 ; see Section (4.3)).
(b) Apoptosis abrogation often occurs in the last stage in the multistage model of carcinogenesis [16,17] by that time there are already a large number of I 2 cells.
(c) Cancer tumors are heterogeneous populations of cells with tumor stem cells being a very small minority [18]. These biological findings imply that once an I 2 cell is generated from an I 1 cell, with probability close to one a genetic change or epigenetic change will quickly and almost inevitably develop, so that, as a close approximation , the third mutation can be ignored mathematically. These results are consistent with a discrete-time two-stage model with the assumption that with probability one a primary I 2 at time t would develop into a detectable tumor at time t + 1.

Remark 1:
To develop stochastic models of carcinogenesis, in the literature [3,19,20] it is conveniently assumed that the last stage cells (i.e. I k cells in a k-stage model) develop instantaneously into cancer tumors as soon as they are generated. In this case, one may identity I k cells as cancer tumors so that T (t) is Markov. When k=2, this is the MVK two-stage model (see [3], Chapter 3; [15]). However, as shown by Yang and Chen [14], Yakovlev and Tsodikov [21], Klebanov et al. [22] and Fakir et al. [23] , in many cases T(t) is not Markov. Nevertheless, if one assumes a discrete time model with one time unit to correspond to six months or one year, then because the growth of I k cells is very rapid, with probability close to one an I k cell generated at time t will develop into a detectable tumor by time t + 1 [16,17] in these cases, one may practically assume T(t) as Markov.

A stochastic model of retinoblastoma involving inherited cancer cases
As first documented by Knudson [1], both the germ line cells (eggs and sperms) and somatic cells may carry the mutant Rb allele r. If both germ line cells (egg and sperm) generating the individual carry the mutant Rb allele r, then this individual is at the I 2 stage at the embryo stage (fertilized egg stage) so that for this individual cancer tumors develop at and /or before birth; see Remark 2. As shown in Table 1, this is clearly the case since the retinoblastoma incidence from the SEER data of NCI/NIH gives the highest cancer rate at birth; see Remark 3 in Section 3 for SEER data.
To account for inherited cancer cases in the stochastic model of retinoblastoma, let p be the frequency of the r gene in the population so that q = 1 − p is the frequency of the R allele (Normal allele of the Rb locus) in the population. Assume that the population is very large and that mating (marriage) between people is random with respect to the retinoblastoma locus. For each individual, let the embryo stage denote the time of the fertilized egg in his/her mother's womb from which this individual is developed. Then, by the Hardy-Weinberg law [24,25] the frequency of individuals with genotypes RR, Rr and rr at the embryo stage in the population are given by q 2 , 2pq and p 2 respectively.
To develop stochastic models to incorporate inherited cancer cases, observe that during pregnancy the proliferation rates of all stem cells are very high; hence, as a close approximation one may practically assume that with probability one individuals with genotype rr at the embryo stage would develop detectable cancer tumors at or before birth. Similarly for individuals with genotype Rr at the embryo stage, the R allele in some stem cells may be mutated or lost or inactivated during pregnancy in which case with positive probability these individuals would carry stem cells with genotype rr at or before birth to develop cancer tumor at birth. Because spontaneous mutation of genes in normal individuals is very low (10 -6 ~ 10 -8 ) during pregnancy [17,19,20], one would expect that normal people at the embryo stage would remain to be normal people at birth. Hence, practically one may assume that individuals with genotype RR at the embryo stage would remain to have genotype RR at birth.

Remark 2:
Without exception, every human being develops from the embryo in his/her mother's womb, when stem cells of different organs divide and differentiate to develop different organs respectively (See Weinberg [26], Chapter 10). To develop stochastic models of retinoblastoma with genetic component, we thus let time at the embryo stage to be the starting time for carcinogenesis.

The probability distributions for developing detectable cancer tumors
For retinoblastoma, the development of cancer tumors of individuals depend on his/her genotype at the embryo stage in his/her mother's womb; see Figure 1.
Embryo stage: Given that the individual has genotype rr at the embryo stage (an I 2 person), then with probability one this individual would develop cancer tumors before or at birth. On the other hand, if the individual has genotype Rr at the embryo stage, then with probability α, the R allele in some stem cells of this individual may At Birth: RR Rr r r Tumor 1-be mutated or lost to generate I 2 stem cells in this individual during pregnancy to develop cancer at birth. Thus for each individual with genotype Rr at the embryo stage, with probability α this individual would develop detectable cancer at birth, and with probability 1 − α this individual would remain an I 1 stage individual at birth in which case one more stage is required to develop retinoblastoma after birth (I 1 → I 2 → Tumor). If the individual has genotype RR (normal people) at the embryo stage, then one may practically assume that with probability one this individual remains a normal people at birth so that after birth, retinoblastoma are developed by a two-stage model N → I 1 → I 2 → Tumor. This is described in Figure 1.
Let Q Rr (j) (Q RR (j)) denote the probability that an individual with genotype Rr (RR) at birth develops cancer during the j−th time period (t j −1, t j ] (j > 1) after birth. For individuals with genotype RR at the embryo stage, the probability that this individual would develop cancer during the j−th time period after birth is Q RR (j). For individuals with genotype Rr at the embryo stage, the probability that these individuals would develop cancer during the j−th period after birth is (1 − α)Q Rr (j).
To derive Q RR (j) and Q Rr (j), let β 1 be the transition rate per cell division of I 1 → I 2 and β 0 the transition rate per cell division of N → Then, by using similar methods given in Tan [16], Tan et al. [17], Tan and Chen [27], Tan et al. [28][29][30], it can be shown that From equations (1)-(2), observe that approximately the probability of developing cancer during the j-th time period after birth in each individual are functions of R i (t j )'s which in turn are functions of the expected number of I 1 stem cells in the individual during the j−th time period. Let N 0 be the number of normal stem cells at birth and denote by b 1 the birth rate per cell division (proliferation rate) of I 1 cells and d 1 the death rate per cell division (differentiation rate) of I 1 cells. Using stochastic difference equations as described in Tan et al. [17], Tan and Chen [27], Tan et al. [28,29] and Tan et al. [31], it can easily be shown that the expected number of I 1 cells at time t from normal individuals at birth and from individuals who are I 1 people at birth are given respectively by: In the above equation, observe that γ 1 is the proliferation rate per cell division of I 1 cells so that the number of I 1 cells will increase if γ 1 > 0 and decrease if γ 1 < 0. Using the basic result

A statistical model and the probability distribution of the number of detectable tumors
The data available for modeling carcinogenesis are usually cancer incidence over different time periods. For example, the SEER data (see Remark 3) of NCI/NIH for retinoblastoma are given by {(y 0 , n 0 ), (y j , n j ), j = 1, . . . , n}, where y 0 is the number of cancer cases at birth and n 0 the total number of birth and where for j ≥ 1, y j is the number of cancer cases during the j-th age group of a one year period and n j is the number of normal people who are at risk for cancer and from whom y j of them have developed cancer during the j-th age group. Given in Table 1 are the SEER data for retinoblastoma cases during the period 1973-2007. From this data set, notice that there are a large number of cancer cases at birth implying a large number of inherited cancer cases. In this section, based on the models in Sections (2)-(3) we will develop a statistical model for these data sets. Note: (1) The observed Incidence rates per 10 6 individuals from SEER are zero after 10 years old. Hence we fit only data up to 10 years old.
(2) The data in Table 1 are yearly data with 0 denoting birth and j the jth years old, j=1,… 10.

Remark 3:
The SEER data are data compiled by the Surveillance and End Results (SEER) Program of the National Cancer Institute/ NIH, a premier source for cancer statistics in the United States. This program collects information on incidence, prevalence and survival from specific geographic areas representing 28 percent of the US population and compile reports on all of these plus cancer mortality for the entire country; for more retail about SEER, the readers are referred to the web of Google Search.
The probability distribution of observed cancer incidence incorporating inherited cancer cases: To incorporate inherited cancer cases, among the n j people at risk for retinoblastoma, let n 1j be the number of individuals who have genotype RR at the embryo stage, n 2j the number of individuals who have genotype Rr at the embryo stage, and n 3j = n j − n 1j − n 2j the number of individuals who have genotype rr at the embryo stage. Then, from results in Section (2.2) and by the Hardy-Weinberg law, the probability that a random individual from the population has genotype RR, Rr and rr at the Rb locus is q 2 , 2pq and p 2 respectively. Hence, from basic probability theory [32], the conditional probability distribution of (n 2j , n 3j ) given n j is multinomial with parameters {n j ; 2pq, p 2 }. That is, (n 2j , n 3j )|n j ~ M L{n j ; 2pq, p 2 }. It follows [32] that n 3j |n j ~ Binomial{n j , p 2 }.
The probability distribution of y 0 : Because y 0 is the number of cancer cases at birth, y 0 derives either from individuals who have genotype rr at the embryo stage or from individuals who have genotype Rr at the embryo stage. Because with probability one an individual with genotype rr at the embryo stage would develop cancer at or before birth, all n 30 individuals with genotype rr at the embryo stage would develop retinoblastoma at birth; on the other hand, with probability α (0 < α < 1), each individual with genotype Rr at the embryo stage would develop cancer at birth. Hence, from basic probability theory [32], y 0 = n 30 + z 0 , where z 0 |n 20 ~ Binomial{n 20 , α}. Furthermore, as shown in Section (3.1), (n 20 , n 30 )|n 0 ~ Multinomial{n 0 ; 2pq, p 2 }. Hence, we have, because n 0 is very large and p is very small: The probability distribution of y j (j ≥ 1) : To derive the probability distribution of y j (j ≥ 1) in the j−th age group after birth, observe that y j can only be developed from individuals who have genotype Rr or RR at the embryo stage.
Among the y j cancer cases, let y 1j be the number of cancer cases generated by people with genotype RR at the embryo stage, and y 2j the number of cancer cases generated by people with genotype Rr at the embryo stage. Then y j = y 1j + y 2j ; y 1j are developed from n 1j people only and y 2j are developed from n 2j people only.
Because cancer develops in each individual independently of other individuals and because individuals with the same genotype are expected to yield the same phenotype , to derive the conditional probability distribution of (y 1j , y 2j = y j − y 1j ) given (n ij , i = 1, 2, n j ), one may practically assume that each individual develops cancer independently of other people and that all individuals with the same genotype at the embryo stage develop cancer by the same mechanism. Then, because the probabilities {P 1 (j) = Q RR (j), P 2 (j) =(1 − α)Q Rr (j)} are very small, the conditional probability distribution of { y 1j , y 2j = y j − y 1j } given {n 1j , n 2j , n j } is { } where h (y; λ) is the density at y of the Poisson distribution with mean λ.
Put Q T (j) = n 1j P 1 (j) + n 2j P 2 (j). From Equation (4), the conditional distribution of y j given {n ij , i = 1, 2, n j } is Poisson with mean Q T (j). It follows that the probability distribution of y j given n j is The probability distribution P (y j |n j ) given by equation (5) is a mixture of Poisson distributions with mixing probability distribution given by the multinomial distribution of {n 1j , n 2j } given n j . This mixing probability distribution represents individuals with different genotypes at the embryo stage in the population.
Let Θ be the set of all unknown parameters (i.e. the parameters (p, α) and the birth rates, the death rates and the mutation rates of N and I 1 cells). Based on data ( , 0,1, .,

The fitting of the model and applications
To illustrate the application of the models in Sections 2-3, in this section we apply the model to some of the NCI/NIH SEER data of retinoblastoma to derive some important information about retinoblastoma. Because the biological findings in references [4]~ [13] suggest a three stage model, we will also fit a three stage model which assumes that retinoblastoma develop by the pathway N → I 1 → I 2 → I 3 → Tumor. For fitting this model to the SEER data, we give in the Appendix the probability distributions of cancer incidence data and some basic formula for implementing the fitting of SEER. Because this model is a special case of the general k-stage model described and analyzed by Tan [16] and Tan et al. [17], for more detail and mathematical theories, we refer the readers to Tan [16] and Tan et al. [17].

Methods for fitting data
Because it is well documented that the Bayesian inference procedures are the most efficient procedures [33,34], we will use the Bayesian approach via the data augmentation and Gibbs sampling procedures to estimate the unknown parameters and to derive predicted cancer cases. For the model of this paper, the basic approach are summarized in five steps: (a) Expand the model and data (n j , y j , j = 0, 1, . . . , m) to include the un-observable variables {n 20 , n 30 , n 1j , n 2j , y 1j , y 2j , j = 1, . . . , m} and derive the joint probability distribution of all random variables. This probability distribution is given in Equation (6) below.
(d) Derive the general conditional posterior distribution of all parameters given {y j , y 1j , n ij , i =1, 2, j = 1, . . . , m}. Notice that this posterior distribution is proportional to the product of the prior distribution P(Θ) and the joint distribution of all the variables given by Equation (6). (The prior distribution of the parameters is given in the next Section (4.2).) (e) Apply the Gibbs sampling procedure to estimate the unknown parameters and the state variables. The details of these steps and the Gibbs sampling procedure and its applications to cancer modeling have been illustrated in great detail in [17,24,28].
Notice that the above distribution is a product of multinomial distributions and Poisson distributions. The above joint density will be used as the kernel for estimating the unknown parameters and for predicting the state variables.

Fitting of the model by cancer incidence data
To fit the SEER retinoblastoma cancer data, we let one time unit correspond to 6 months after birth and let the time at birth be 0; see Remark 4. For the prior distributions of Θ, because biological information have suggested some lower bounds and upper bounds for the mutation rates and for the proliferation rates, we assume P(Θ) α c, (c > 0) where c is a positive constant if these parameters satisfy some biologically specified constraints, and equal to zero for otherwise. These biological constraints are: (1) 0 < α < 1 as α denotes a probability measure, and because the estimated frequency of most mutant genes in human population are from 10 -3 ~ 10 -5 [25], we let 0 < p < 10 −2 ; (2) Because γ 1 = b -d is the proliferation rate per cell division of cells with genotype Rr at the Rb locus and since Rb is a tumor suppressor gene, we let −0.01 < γ 1 < 1; see also estimates from Tan [16], Tan et al. [17], Tan and Chen [27], Tan et al.( [28][29][30]) and Luebeck and Moolgavkar [35].
We will refer the above prior as a partially informative prior which may be considered as an extension of the traditional non-informative prior given in Box and Tiao [37].

Remark 4:
In our model and the fitting of data we have assumed 6 month for one time unit since for most of human cancers, the last stage cells grow very fast and 6 months is a long time interval for the last stage cell to develop into a detectable tumor [3,16,17,[27][28][29][30]. Luebeck and Moolgavkar [35] had used one year as the time unit for fitting human colon cancer. In our computation, we have tried 3 months, 6 months and one year as time unit and found that the 6 month period is the best time for fitting retinoblastoma.
Using this prior distribution and applying the method in Section (4.1) to the models in Sections (2)-(3) and the SEER data in Table 1, we have estimated the unknown parameters and the predicted cancer cases. Given in Table 2 are the estimated parameter values and its standard errors. Given in Figure 2 are the plots of predicted cancer cases from the modified MVK two-stage model of Section (2) and from a threestage model respectively. (For a multi-stage model of carcinogenesis and its fitting to the data [16,17]. For comparison purposes, we have provided numbers of predicted cancer cases from the modified MVK two-stage model and from a three-stage model together with the observed cancer cases over time from SEER. We have also examined SEER data from 1973-2005 and SEER data from 1973-2006 and found that the observed cancer incidence per 10 6 are identical to those from 1973-2007. This indicates that the epidemic of retinoblastoma cancer has reached a steady state condition in US population. Note: 1 1 1 1 Table 2, notice that the standard errors for the estimates of λ 1 and λ 2 are quite large. These results imply wider confidence intervals and higher uncertainty in the estimation of these two parametric functions. These results are expected because λ 1 is a function of 4 parameters each of which are subjected to uncertainty in estimation and because there is considerable uncertainty in the estimation of u 2 (t 0 ).

Fitting results
From results in Tables 1 and 2 and from Figure 2, we have made the following observations: (a) As shown by results in Table 1 and Figure 2, it appeared that the modified MVK two-stage model fitted the SEER data extremely well; on the other hand, the three-stage model can not fit the SEER data; see Table 1 and (b) From Table 2, it is observed that the estimate of γ 1 is quite small (the estimate is of order 10 −3 ~10 −4 ) indicating that the phenotype of Rr is quite close to that of RR; this is consistent with the biological hypothesis that the Rb gene is a tumor suppressor gene.
(c) From Table 2 Table 2, the estimate of p from the SEER data is of order 10 −3 indicating that in the US population, the frequency of the recessive allele r of the Rb gene is approximately of order 10 −3 . Table 2 also showed that the estimate of α was 0.83, indicating that most individuals with genotype Rr would develop cancer at birth. This is consistent with the observation that there are high observed cancer incidences at birth for retinoblastoma in the SEER data.

Discussion and Conclusions
In this paper, by assuming discrete time we have proposed a modified MVK two-stage model to model cancer progression and tumor development of retinoblastoma. We found that the biological findings (a)-(d) given in Section (2.1) imply that this model is consistent with the observation and results of molecular biology studies by DiCiommo et al. [6], Laurie et al. [7], Gallie et al. [12] and Corson and Gallie [13]. To account for inherited cancer cases in the stochastic model of retinoblastoma, we have also developed a generalized mixture model for retinoblastoma in human beings. In this mixture model, the mixing probability is a multinomial distribution to account for the distribution of the three genotypes (rr, Rr, RR) of the Rb locus in chromosome 13q14 among individuals in the population. This mixture model allows us to estimate for the first time the frequency p of the mutant allele of the Rb gene in the US population.
To illustrate the usefulness and applications of our models and methods, we have applied our models and methods to the retinoblastoma SEER data of NCI/NIH. Our analysis clearly showed that the proposed modified MVK two-stage model fitted the data almost perfectly (see Table 2 and Figure 2); on the other hand, the stochastic three-stage model fitted the data poorly (see Table 2 and Figure 2) even though in the three stage model there are four more parameters (see Appendix). Notice, however, our modified MVK two-stage model is more general than the classical MVK two-stage model as described in Tan [3] in that we postulate that cancer tumors develop from primary I 2 cells by clonal expansion [14]. (The stochastic multi-stage models in the literature assume that cancer tumors develop from last stage cells immediately as soon as they are generated, ignoring completely cancer progression [23].
Applying our models and methods to the SEER data of retinoblastoma, we have derived for the first time some useful information on the epidemic of retinoblastoma in the population. Specifically, we mention: (1) For the first time, we have estimated the frequency of the mutant allele of the Rb gene in the US population (p~ 5.81 × 10 −3 ).
(2) The estimate of the proliferation rate (γ 1 ) of I 1 cells (i.e. cells with genotype Rr) is 1 γ = 2.4954 × 10 −4 ~ 0. This is consistent with the biological hypotheses that the Rb gene is a tumor suppressor gene, and unlike the p53 gene in chromosome 17p [38], there is little or no haploid-insufficiency for the Rb gene in cells with genotype Rr.
Using models and methods of this paper, one can easily predict future cancer cases for retinoblastoma in the population. Thus, by comparing results from different populations, our models and methods can be used to assess cancer prevention and control procedures. This will be our future research topics; we will not go any further here.