Received date: November 05, 2013; Accepted date: January 13, 2014; Published date: January 20, 2014
Citation: Montesinos-López OA, Montesinos-López A, Eskridge K, Crossa J (2014) Estimating a Proportion Based on Group Testing for Correlated Binary Response. J Biomet Biostat 5:185. doi:10.4172/2155-6180.1000185
Copyright: © 2014 Montesinos-López OA, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
When the sampling scheme is in clusters and when the pools (of size k) within a cluster are assumed not to be independent, the Dorfman model for estimating the proportion under the binomial model is incorrect. The purpose of this paper is to propose a method for analyzing correlated binary data under the group testing framework. First, assuming that the probability of an individual varies according to a beta distribution, we derived an analytic expression for the probability of a positive pool and the correlation between two pools in each cluster. Second, we derived the exact probability mass function of the number of positive pools in each cluster that should be used to obtain the maximum likelihood estimate (MLE) of the proportion of individuals with a positive outcome. However, this MLE is not efficient in terms of computational resources. For this reason, we proposed another estimator based on the beta-binomial model for obtaining the approximate MLE of the proportion of interest. Based on a simulation study, the approximate estimator produced results that are very close to the exact MLE of the proportion of interest, with the advantage that this approach is computationally more efficient.
Pools; Correlated data; Clusters; Group testing; Beta-binomial model
The group testing model of Dorfman is effective for reducing the number of diagnostic tests because, instead of performing n individual diagnostic tests, it requires only n/k tests when retesting is not done (where k is the pool size). However, caution needs to be exercised when choosing the pool size (k), because if k is too large, the diagnostic test may be sensitive to dilution effects [2,3]. Assuming perfect testing, a pool is declared positive if at least one of the k individuals is positive, and declared free of the disease if the test is negative.
The assumption of a homogeneous distribution of transgenic maize (Zea mays L.) in a population, though easy to use in practice, is unrealistic  and therefore may affect the quality of the estimated proportion of interest. Since plant samples are taken at different locations throughout a geographical region or seed samples are taken from seed lots obtained from different regions, this means that individual plants or seed lots are inherently clustered by design and share common characteristics . This clustering results in correlated samples. Therefore, it is important to develop methods for analyzing pooled data when individuals are correlated and do not require the assumption of homogeneous plant distribution, as in a binomial distribution.
When there is overdispersion (extra-binomial variation), binary data often show greater variability than predicted by the binomial model . Overdispersion is said to be the norm in practice, and nominal dispersion, the exception. Hung and Swallow  studied the robustness of group testing in estimation problems when the underlying assumption of independent individuals is violated. They found that when defectives are clustered, as in a serial correlation model with positive serial correlation, even using a small group size offers little robustness. Group testing to estimate the proportion of defectives in a serially correlated population should be done cautiously. The recommendation is not to form groups directly from the ordered population, but to randomly assign the individuals to groups and destroy the correlation. However, group testing for classification purposes only (whether defective or non-defective) benefits from having the defectives clustered, and the clustering should be preserved and exploited .
Liu et al.  provide confidence interval procedures for estimating proportions estimated by group testing with groups of unequal size adjusted for overdispersion (extra-binomial variation). They used a quasi-likelihood approach to correct for the presence of overdispersion. However, in this case, heterogeneity in pool responses is induced by using different pool sizes (k) and may be due to the number of pools per cluster used in the group testing method. In their study, Liu et al.  introduced heterogeneity by assuming three clusters (m=3) and using a different pool size (k1,k2,k3) in each cluster, with the following number of pools per cluster: N1=5, N2=10 and N3=15. For example, when k1=20, k2=10 and k3=5, they observed that if Y1=5, Y2=7 and Y3=4 then where Yi denote the number of positive pools observed, i=1,2,3, and denotes the estimated dispersion parameter. However, if Y1=1, Y2=7 and Y3=3, then , which indicates that the proportion of group testing varies widely for specific combinations; this also implies the presence of overdispersion. Here it is important to point out that the outcomes of the units in each cluster are assumed to be independent, identically distributed (i.i.d.) binomial distributions with N and p and that testing was conducted with no errors. However, the assumption of i.i.d. binomial distribution with N and p is not appropriate when the sampling process is hierarchical and the plants in each cluster are correlated due to genetic factors or because the plants are spatially adjacent .
Regression models for pooled data have been proposed that incorporate covariates to identify which factors influence prevalence [8-10], while assuming that individual statuses (positive or negative) are independent random variables. Group testing regression models with fixed and random effects have also been developed to handle within-cluster correlation among individual latent binary responses , where the correlation is incorporated into the model by using the clusters as random effects; with help of covariates, it is possible to vary the prevalence between units. However, when we do not have access to covariates, it is not possible to know the unit-specific prevalences that control the correlation between units induced by this variability in the prevalence between units. Also, with these models, it is not possible to get a closed form of the likelihood function and of the correlation between pools (or individuals) induced by the random effect. For this reason, it would be useful to develop an alternative method for analyzing pooled correlated data that takes into account the correlation between individuals when estimating the proportion of interest. Such a method would provide us with an analytical expression for the likelihood function that we could use for calculating the probability of a positive pool and the correlation between two pools.
Furthermore, ignoring the correlation among individuals with cluster data under group testing produces a biased estimate of the proportion of interest; it also makes the confidence intervals too narrow and the p-values of hypothesis tests too small. Data analysis methods are available for data with correlated responses in a non-group testing context, with the correlation incorporated as extra-binomial variation. One way of including extra-binomial variation is by introducing an unobserved continuous variable Pi which is independently distributed on the interval (0,1) with E(Pi)=pi and Var(Pi)=ø pi (1−pi), where ø is the parameter of overdispersion, and by assuming that, conditional on Pi=pi, Ri is binomial (mi,pi), so that unconditionally E(Ri)=mi pi and Var(Ri)=mi pi (1−pi)[1+(mi−1)δ], where δ is the intraclass correlation. However, note that when mi=1 the variance does not change, Var(Ri)=pi (1−pi), but we are still introducing a correlation between the individual binary responses [11,12].
A special case of this model for extra-binomial variation, described by Williams , assumes that Pi has a beta distribution, which results in Ri having a beta-binomial distribution. Another distribution with the same relationship between E(Ri) and Var(Ri) is the correlated-binomial model [13,14], in which ø plays the role of a correlation between the binary components of a population.
Turechek and Madden  used beta-binomial distribution to estimate the proportion (p) when there is heterogeneity. The key element of their approach was to approximate the probability of a positive pool of size k in the presence of heterogeneity with the probability of a positive pool under the binomial model (assuming homogeneity, that is, assuming that p is constant across clusters) and adjust this binomial probability with the design effect . In this case, deff was defined as the ratio of the variance of the beta-binomial model divided by the variance of the proportion under the binomial distribution. Turechek and Madden  then defined the effective pool size which represents the reduction in information obtained in the pool size due to the effects of over-dispersion. Then, to correct for overdispersion when calculating the probability of a positive pool, they replaced k with k2deff in the binomial model to approximate the probability of a positive pool under the beta-binomial model. However, this effective pool size sometimes does not predict the probability of a positive pool in the presence of heterogeneity very well, so they suggested using effective pool size to produce better results . It is important to point out that this approach works well if the correlations between pools are negligible; however, most of the time this assumption is violated in the context of plants collected from the same cluster that share genetic and environmental background.
Recent work by Lendle et al.  proposed group testing procedures for case identification with correlated responses for studying the efficiency of a group testing procedure when units within clusters are correlated, understanding by efficiency the expected number of diagnostic tests per unit required to classify all units as either positive or negative. In the work of Lendle et al. , clusters were assumed to be of equal size with the same distribution, contain exchangeable units and have a particular type of distribution. They used three models to examine how the efficiencies of group testing procedures are affected by correlated responses: a beta-binomial model where π has a beta distribution with mean p and variance σp(1−p); the model of Madsen , which is useful for modeling exchangeable binary data letting π=p with probability 1−σ, π=0 with probability σ(1−p), and π=1 with probability σp; and the model of Morel and Neerchal , which is constructed by letting with probability p and with probability 1−p. However, it is important to point out that the focus of the Lendle et al.  paper was classification, not estimation. In fact, they derived a closed-form expression for the expected number of tests per unit (i.e., efficiency) of hierarchical and matrix-based group testing procedures used for classification when units within clusters are correlated under a class of model for exchangeable binary random variables. Considering the above three models of exchangeable binary random variables in their study, they found that if units from the same cluster are tested together, the efficiency of a particular procedure can be improved, sometimes substantially, relative to random arrangements, which ignore information about cluster membership .
The main objective of this research is to propose a method for estimating binary responses using the Dorfman group testing model without retesting when the data were collected in clusters and the individuals within each cluster are positively and equally correlated. Negative correlations are not discussed here. To account for this correlation in the analysis, we proceed as in the standard context of the group testing binomial model, but vary the parameter p as a beta distribution, which is used to achieve a closed form of the probability mass function (pmf) of the number of positive pools in each cluster. This also allows deriving a closed form for the probability of a positive pool and the correlation between two pools . This pmf is used to estimate the proportion of interest (π) and the correlation between two individuals (δ) .
It is essential to point out that with this method we get a closed-form expression for the probability of a positive pool and for the correlation between two pools, which is not available in conventional approaches for pooled correlated data. However, with the proposed model the maximum likelihood estimates (MLEs) are difficult to compute, so we approximated them by using the beta-binomial distribution, which was applied directly to the pooled correlated data to obtain estimates of the probability of a positive pool and of the correlation between two pools. Equating these two estimates with the closed-form expressions derived for those quantities, we get the approximate MLEs for π and δ by solving a system of nonlinear equations. These approximate MLEs based on the beta-binomial distribution produce results that are close to the exact MLEs derived using the proposed pmf, at a much lower computational cost.
Suppose that our population is composed of I clusters, and that N independent clusters are drawn from the I clusters in the population. Further, within the l-th cluster, we form nl pools of size kl individuals, using the Dorfman model without retesting and random allocation of individuals to the pools. Let Yijl denote a binary random variable that indicates whether the i-th individual within pool j (j=1,2,…,nl) in cluster l (l=1,2,…,N) is diseased (Yijl=1) or not diseased (Yijl=0). Let Zjl be the indicator variable of whether the j-th pool in cluster l is positive (Zjl=1) or negative (Zjl=0).
Let us assume that all clusters are independent and that for each cluster, conditional on p, all individuals have a Bernoulli distribution with parameter p, and that p varies according to a beta distribution with parameters α=π/θ and β=(1−π)/θ, where π, θ>0. It is not difficult to show that for each individual, the unconditional mean and variance, respectively, are π and π(1−π), while the correlation between any two individuals within the same cluster l, Yijl and Yi’jl (i≠i’), is δ=θ/(1+θ) (see Appendix A and Kupper and Haseman for details). In this context, from Appendix A we derived that the probability that a pool of size k is positive is given by
The correlation between any two pools in the same cluster, Zjl and Zj’l (j ≠ j’), is derived in Appendix A and given by
where the remaining term is the probability that a pool of 2k individuals is positive. Although we use only pools of size k, this 2k-pool probability is a convenient shorthand and will be used in the proposed graphical estimation method. In this way, we can see that both the probability that a pool (of size k) is positive, $\pi^{p}$, and the correlation between any two pools, $\delta^{p}$, are functions of the probability that an individual is positive, π, and of the correlation between any two individuals in the same cluster, δ.
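The two pool-level quantities described above can be evaluated numerically. The following is a minimal sketch, not the authors' code: it assumes the beta mixing model stated in the text, under which E[(1−p)^m] equals the product of (1−π+iθ)/(1+iθ) for i=0,…,m−1 with θ=δ/(1−δ). The function names are illustrative only.

```python
def pool_positive_prob(pi, delta, k):
    """Probability that a pool of size k is positive when p ~ Beta(pi/theta, (1-pi)/theta).

    Assumes delta = theta / (1 + theta), i.e. theta = delta / (1 - delta).
    """
    theta = delta / (1.0 - delta)
    prob_all_negative = 1.0  # accumulates E[(1 - p)^k] as a product
    for i in range(k):
        prob_all_negative *= (1.0 - pi + i * theta) / (1.0 + i * theta)
    return 1.0 - prob_all_negative


def pool_correlation(pi, delta, k):
    """Correlation between two pools of size k in the same cluster.

    Uses Cov(Z1, Z2) = E[(1-p)^(2k)] - (E[(1-p)^k])^2, which involves the
    probability that a hypothetical pool of 2k individuals is positive.
    """
    a_k = pool_positive_prob(pi, delta, k)
    a_2k = pool_positive_prob(pi, delta, 2 * k)
    return ((1.0 - a_2k) - (1.0 - a_k) ** 2) / (a_k * (1.0 - a_k))


if __name__ == "__main__":
    print(pool_positive_prob(0.05, 0.05, 25), pool_correlation(0.05, 0.05, 25))
```

With δ=0 the first function reduces to the binomial value 1−(1−π)^k, and with k=1 the pool correlation equals the individual correlation δ, which are two quick sanity checks on the sketch.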
From Appendix C we have that increases with k, and .
Let Zl, the sum of Zjl over j=1,…,nl, denote the number of positive pools in cluster l (l=1,2,…,N). The probability mass function (pmf) of Zl is given by
Details of how this pmf was derived are given in Appendix B. Note that as δ→0, the pmf of Zl reduces to the binomial pmf; that is, the binomial model is recovered when there is no correlation between individuals.
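As an illustration of how such a pmf can be evaluated, here is a hedged sketch that follows the construction described in Appendix B: conditionally on p, the number of positive pools is binomial with success probability 1−(1−p)^k, and expanding (1−(1−p)^k)^z by the binomial theorem leaves an alternating sum of moments E[(1−p)^m]. The exact form printed in the original Equation 3 may differ in notation; the function names are hypothetical.

```python
from math import comb


def neg_moment(pi, delta, m):
    """E[(1 - p)^m] for p ~ Beta(pi/theta, (1 - pi)/theta), theta = delta / (1 - delta)."""
    theta = delta / (1.0 - delta)
    value = 1.0
    for i in range(m):
        value *= (1.0 - pi + i * theta) / (1.0 + i * theta)
    return value


def positive_pools_pmf(z, n, k, pi, delta):
    """P(Z = z) for the number of positive pools among n pools of size k in one cluster."""
    total = 0.0
    for j in range(z + 1):
        # Term j of the expansion of (1 - q^k)^z * q^(k (n - z)) with q = 1 - p.
        total += (-1) ** j * comb(z, j) * neg_moment(pi, delta, k * (n - z + j))
    return comb(n, z) * total


if __name__ == "__main__":
    n, k = 10, 25
    pmf = [positive_pools_pmf(z, n, k, 0.05, 0.05) for z in range(n + 1)]
    print(sum(pmf))  # should be numerically close to 1
```

Because the sum alternates in sign, floating-point cancellation grows with the number of terms, so this direct evaluation is only a reasonable choice for small numbers of pools per cluster.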
Maximum likelihood estimation
Let z=(z1,…, zN) be the vector that contains the number of positive pools of N clusters analyzed. Then, since the clusters are independent, the log-likelihood is given by
Thus the ML estimators of π and δ are obtained by solving the equations
Where , and and are given in Appendix D. This system of equations can be solved iteratively using the Newton-Raphson method.
We first obtain moment estimates for $\pi^{p}$ and $\delta^{p}$; from these we obtain estimates for the parameters of interest (π and δ) by solving a system of nonlinear equations. We define the first and second empirical moments based on the number of positive pools contained in the N clusters sampled, respectively, by
Then, by setting these moments to their expected values and solving for $\pi^{p}$ and $\delta^{p}$, we obtain the moment estimators for cluster l as
where and is the total number of pools analyzed.
Now, since $\pi^{p}$ and $\delta^{p}$ are functions of π and δ (Equation 1 and Equation 2), estimates of the parameters of interest can be obtained by solving the following system of nonlinear equations
By replacing Equation 5 in Equation 6, this system of equations is reduced to
The system of nonlinear equations given by Equation 6 and Equation 7 can be solved iteratively by the Newton-Raphson method. Alternatively, given that the right-hand sides of Equations 6 and 7 are quantities in the interval (0,1) and the parameters lie between 0 and 1, the solution can be approximated by graphing the contours of g(π,δ,k) and g(π,δ,2k) at the levels given by the right-hand sides of Equations 6 and 7, respectively, and then observing where the two contours intersect. This can be done with the R contour command. The resulting solution can be used as the initial value when computing the true maximum likelihood estimates of π and δ.
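As a numerical alternative to the graphical approach, the inversion from pool-level quantities back to (π, δ) can be sketched with nested bisection. This is an illustrative substitute technique, not the Newton-Raphson or contour method named in the text. It assumes that the probability of a positive pool increases in π for fixed δ, and that the pool correlation increases in δ along the curve of constant pool probability. All function names are hypothetical.

```python
def pool_positive_prob(pi, delta, k):
    """P(pool of size k is positive) under the beta mixing model, theta = delta/(1-delta)."""
    theta = delta / (1.0 - delta)
    q = 1.0
    for i in range(k):
        q *= (1.0 - pi + i * theta) / (1.0 + i * theta)
    return 1.0 - q


def pool_correlation(pi, delta, k):
    """Correlation between two pools of size k in the same cluster."""
    a_k = pool_positive_prob(pi, delta, k)
    a_2k = pool_positive_prob(pi, delta, 2 * k)
    return ((1.0 - a_2k) - (1.0 - a_k) ** 2) / (a_k * (1.0 - a_k))


def bisect_root(f, lo, hi, iters=100):
    """Bisection for an increasing function f with f(lo) < 0 < f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def solve_pi_delta(pool_prob, pool_corr, k):
    """Recover (pi, delta) from a pool-level probability and pool-level correlation."""
    def pi_given_delta(delta):
        # For fixed delta, match the probability of a positive pool.
        return bisect_root(lambda pi: pool_positive_prob(pi, delta, k) - pool_prob, 0.0, 1.0)

    # Then choose delta so that the implied pool correlation matches as well.
    delta = bisect_root(
        lambda d: pool_correlation(pi_given_delta(d), d, k) - pool_corr, 1e-8, 0.999
    )
    return pi_given_delta(delta), delta


if __name__ == "__main__":
    a = pool_positive_prob(0.05, 0.05, 25)
    r = pool_correlation(0.05, 0.05, 25)
    print(solve_pi_delta(a, r, 25))  # round trip back to roughly (0.05, 0.05)
```

In practice the inputs would be estimates of the pool-level probability and correlation, so the recovered values inherit their sampling error.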
However, these MLEs are difficult to compute with the proposed model, so we approximated them using the beta-binomial distribution. We applied it directly to the pooled correlated data to obtain estimates of the probability of a positive pool and of the correlation between two pools and, by equating these two estimates with the closed-form expressions derived for those quantities, we obtained approximate MLEs for π and δ by solving a system of nonlinear equations. These approximate MLEs based on the beta-binomial distribution produced results that are close to the exact MLEs derived using the proposed pmf, at a much lower computational cost.
Calculations of MLEs of the parameters of interest (π and δ) using the model described in the last section are difficult due to the complexity of the derived pmf. Therefore, in this section, we propose an alternative approach for estimating the parameters required with the beta-binomial model.
As shown in the previous section, the total number of positive pools in a cluster does not have a beta-binomial distribution; however, within each cluster the pool responses are binary, with a probability of success and a positive correlation that are functions of π and δ, as shown in Equation 1 and Equation 2. Alternative estimates of π and δ can be developed if we assume that the total number of positive pools in each cluster has a beta-binomial distribution with parameters $\pi^{p}$ and $\delta^{p}$, and obtain the MLEs of these two pool-level parameters. Estimates of π and δ then follow either by solving Equations 6 and 7 for π and δ with the pool-level quantities replaced by their MLEs, or by directly maximizing the likelihood over π and δ. The pool-level MLEs can be obtained using the R library VGAM and its betabinomial function. This alternative approach based on the beta-binomial distribution has computational advantages over the exact solution, since for large cluster-level counts (e.g., >20), the exact solution (Equation 3) is unstable due to the alternating sums involved.
The corresponding log-likelihood using the beta-binomial model is given by
where each term is the probability function of the beta-binomial distribution with parameters nl, $\pi^{p}$ and $\theta^{p}$, evaluated at zl; specifically,
where $\pi^{p}$ and $\theta^{p}$ (or $\delta^{p}$) are given by Equations 1 and 2, ignoring the superscripts.
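A beta-binomial log-likelihood of this kind can be sketched as follows. This is a minimal illustration that parameterizes the beta-binomial by a mean (playing the role of the pool-level probability) and a correlation (playing the role of the pool-level correlation), with shape parameters a=mean/θ and b=(1−mean)/θ and θ=corr/(1−corr). These names and the parameterization are assumptions for the example, not the VGAM interface.

```python
from math import comb, exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_logpmf(z, n, mean, corr):
    """Log pmf of a beta-binomial with n trials, given mean and correlation (0 < corr < 1)."""
    theta = corr / (1.0 - corr)
    a = mean / theta
    b = (1.0 - mean) / theta
    return log(comb(n, z)) + log_beta(z + a, n - z + b) - log_beta(a, b)


def betabinom_pmf(z, n, mean, corr):
    """Beta-binomial pmf on the natural scale."""
    return exp(betabinom_logpmf(z, n, mean, corr))


def betabinom_loglik(zs, ns, mean, corr):
    """Log-likelihood over independent clusters with counts zs and pool numbers ns."""
    return sum(betabinom_logpmf(z, n, mean, corr) for z, n in zip(zs, ns))


if __name__ == "__main__":
    print(betabinom_loglik([0, 1, 2, 0], [6, 6, 6, 6], 0.1, 0.05))
```

To estimate π and δ directly, the mean and correlation arguments would be replaced by the closed-form pool-level expressions evaluated at (π, δ) before maximizing.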
We present the results of a simulation study conducted to evaluate the performance of the approximate estimators (using the binomial or beta-binomial distribution) instead of the exact distribution (Equation 3). The simulation study was performed using four values of π (0.025, 0.05, 0.075 and 0.1), four values of δ (0.025, 0.05, 0.075 and 0.1) and five values of N (10, 30, 50, 100, 200) with k=25 and nl=10, l=1,…,N. For each combination of these parameters, we obtained 2000 random samples generated using the model given in Equation 3. To estimate the relative bias (RB) and the relative mean squared error (RMSE), for each of these samples we calculated the corresponding MLEs of the parameters using the true model, the binomial model and the beta-binomial model.
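One way to generate data of the kind used in such a simulation is sketched below. It is an assumption-laden illustration rather than the authors' procedure: each cluster draws its own p from the beta distribution with parameters π/θ and (1−π)/θ, and each of its pools is then positive independently with probability 1−(1−p)^k. The function name and the seed argument are hypothetical.

```python
import random


def simulate_positive_pools(n_clusters, n_pools, k, pi, delta, seed=None):
    """Simulate the number of positive pools per cluster under the beta mixing model."""
    rng = random.Random(seed)
    theta = delta / (1.0 - delta)  # from delta = theta / (1 + theta); requires delta > 0
    alpha = pi / theta
    beta = (1.0 - pi) / theta
    counts = []
    for _ in range(n_clusters):
        p = rng.betavariate(alpha, beta)   # cluster-specific prevalence
        pool_prob = 1.0 - (1.0 - p) ** k   # P(pool positive | p), perfect testing
        counts.append(sum(rng.random() < pool_prob for _ in range(n_pools)))
    return counts


if __name__ == "__main__":
    # One simulated data set with the design used in the text: k=25, 10 pools per cluster.
    print(simulate_positive_pools(30, 10, 25, 0.05, 0.05, seed=1))
```

Repeating this for each (π, δ, N) combination and fitting each model to every simulated data set gives the inputs needed for bias and mean squared error summaries.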
We also evaluated the results of the simulation based on the use of the beta-binomial model in order to approximate the correct distribution given in Equation 3. This approach has an attractive computational advantage over the exact distribution. To evaluate the quality of the approximate estimators, we calculated the relative bias (RB) as
and the relative mean squared error (RMSE) as:
where is the MLE of π using the true model, is the usual MLE of π using the binomial or beta-binomial model, and π0 is the parameter for which the data were generated using the model given in Equation 3.
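The two summary measures can be computed along the following lines. Because the displayed formulas are not available here, this sketch assumes common definitions: RB as the average of (estimate − π0)/π0 over the simulated samples, and RMSE as the mean squared error of the approximate estimator divided by that of the exact estimator. Both definitions and the function names are assumptions that should be checked against the original equations.

```python
def relative_bias(estimates, pi0):
    """Average relative deviation of the estimates from the true value pi0 (assumed definition)."""
    return sum((est - pi0) / pi0 for est in estimates) / len(estimates)


def relative_mse(approx_estimates, exact_estimates, pi0):
    """Ratio of the MSE of the approximate estimator to that of the exact one (assumed definition)."""
    mse_approx = sum((est - pi0) ** 2 for est in approx_estimates) / len(approx_estimates)
    mse_exact = sum((est - pi0) ** 2 for est in exact_estimates) / len(exact_estimates)
    return mse_approx / mse_exact


if __name__ == "__main__":
    approx = [0.040, 0.060, 0.055]
    exact = [0.045, 0.055, 0.052]
    print(relative_bias(approx, 0.05), relative_mse(approx, exact, 0.05))
```

Under these definitions, an RMSE above 1 means the approximate estimator has a larger mean squared error than the exact MLE, which matches how the values are read in the results.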
Figure 1 shows the RMSE plots assuming a binomial model. All the plots show that, under misspecification of the true model (Equation 3), the RMSE is less than 1 when the number of clusters is N=10 and greater than 1 when N is 30, 50, 100 or 200. That is, with only 10 clusters the binomial model has the smaller mean squared error; however, with 30 or more clusters the RMSE of the binomial model is considerably larger and increases linearly with N, and the performance of the binomial model deteriorates further for larger values of δ (Table 1). The binomial model has worse RB (Figure 2) than the beta-binomial model and the true model (Equation 3), and it underestimates the true values; this behavior is less severe as the number of clusters increases. Also, increasing the correlation between individuals (δ) substantially increases the RB (Figure 2).
Table 1: Relative bias (RB) and relative mean squared error (RMSE) for the binomial model using various combinations of π, δ and N with k=25 and nl=10, l=1,2,…,N
Figure 3 depicts the RMSE plots for the same parameters as in Figures 1 and 2. All of these plots show that the beta-binomial approximation to Equation 3 performs well in terms of RMSE. When δ increases, RMSE performance decreases somewhat but is still reasonable for the larger values of δ. In addition, for N ≥ 30, performance is good and similar in all cases studied (Table 2). For the same parameters, Figures 3 and 4 show that the beta-binomial model performs well in RB, except when the number of clusters is less than 30 (N<30), where it is nevertheless comparable with the exact model; additionally, when the correlation between individuals (δ) decreases or N increases, this performance improves substantially and in a similar way for both the beta-binomial approach and the exact model. Furthermore, Figure 4 shows that in all cases the approach using the beta-binomial model has a positive bias, which gradually converges to 0 as N increases, although with different patterns in each combination. The parameter that influences RB the most (in both the exact model and the beta-binomial approximation) is δ. For larger values of δ, convergence of RB to the desired value is slower; for example, for δ=0.025, convergence is reached at approximately N>50, while for δ=0.075, it is reached at approximately N>100. Furthermore, smaller values of π are more affected by δ because they have larger RB values, but again this is observed in both estimators of π (the beta-binomial and the exact model). Therefore, the performance in RMSE and RB of the approach based on the beta-binomial model is good, and it has the advantage of being computationally more efficient than the exact distribution (Equation 3).
Table 2: Relative bias (RB) and relative mean squared error (RMSE) for the beta-binomial model and Relative bias (RBE) for the exact distribution (Equation 3), using various combinations of π, δ and N with k=25 and nl=10, l=1,2,…,N.
In this section, we give two examples to illustrate the methodology.
Example: Transgenic maize estimation
In 2009, a study was conducted to estimate the proportion of genetically modified maize plants in farmers’ fields in the Sierra Juárez region of Oaxaca, Mexico (Table 3) . Of an estimated total of 50 fields in the Santa María Jaltianguis locality, 30 fields were sampled; 300 leaves were collected from plants randomly chosen throughout each field. During leaf collection in each field, 4-mm leaf sections were bulked per field totaling 300 sections per bulk sampled. The remaining leaves were labelled and stored separately (a total of 9000 leaf samples were stored). The bulk samples comprising 4-mm sections of 300 leaves each were subdivided into six pools of 50 leaves each. DNA was extracted, and the presence of 35S and NOSt sequences was determined by polymerase chain reaction (PCR) (Table 3) .
|Location: Santa María Jaltianguis||x||0||1||2||3||4||5||6|
Table 3: Number of pools comprised by leaf samples from Oaxaca, Mexico (2009), with a positive 35S PCR band based on 30 fields and 300 maize leaves per field. x indicates the number of positive pools, Nx is the observed frequency of each category at this location and Nx.s is the frequency of each category simulated as-suming π=0.001324 and δ=0.045.
Each 300-leaf bulk was disaggregated into 50-leaf bulks (6 per field) for DNA extraction, and PCR amplification of HSP101, 35S and NOSt sequences was performed. Data on HSP101 and NOSt amplification are not shown. Results presented in Table 3 correspond to bulks that were confirmed as positive in at least two independent PCRs . Fields 6, 8, 11, 15, 25 and 27 had exactly one positive pool, field 17 had exactly 2 positive pools and field 30 had 3 positive pools.
The traditional binomial approach resulted in an estimated prevalence of transgenic plants of 0.001260; the exact MLEs were and , taking into account the correlation; the beta-binomial approach gave estimates of and . Since the estimated correlation is low , this data set is not appropriate for illustrating the proposed methodology. For the purpose of illustration, we assumed that 0.001324 is the true prevalence (π) and that δ=0.045 is the true correlation between individuals, and we maintained the same number of clusters and individuals per cluster (the frequencies obtained are given in row Nx.s of Table 3). Now the exact MLEs were and , while the MLEs using the beta-binomial approach were and . Again, we see that the approximate MLEs based on the beta-binomial model are very close to the exact MLEs. However, assuming there is no correlation between individuals and pools, the estimated prevalence is equal to 0.001142515. The 95% Wald and profile confidence intervals for π using the exact approach were (-0.000662, 0.003635) and (0.000367, 0.012265), respectively; using the beta-binomial approach, they were (-0.000701, 0.0003711) and (0.000369, 0.012063), respectively; and using the binomial model, they were (0.000435, 0.001850) and (0.000573, 0.002003), respectively. The similarity between the results of the exact and beta-binomial models can be observed in more detail in the profile likelihood shown in Figure 5.
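The binomial estimate quoted above can be checked with a short calculation. The sketch below uses the standard group testing estimator for equal pool sizes without retesting, 1−(1−y/g)^(1/k), where y is the number of positive pools out of g pools of size k. The counts are an assumption read off the field description: 6 fields with one positive pool, one field with two and one field with three give y=11, and 30 fields with 6 pools each give g=180. The function name is illustrative.

```python
def binomial_pool_mle(positive_pools, total_pools, k):
    """MLE of the individual-level proportion under the binomial (independence) model."""
    return 1.0 - (1.0 - positive_pools / total_pools) ** (1.0 / k)


if __name__ == "__main__":
    # Assumed counts: 11 positive pools among 30 fields x 6 pools, pool size 50.
    print(round(binomial_pool_mle(11, 180, 50), 6))
```

This estimator ignores the clustering of pools within fields, which is exactly the simplification that the exact and beta-binomial approaches are meant to relax.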
Example: Seed health assay
We used the data set given in Liu et al.  for detecting seed transmission of the cucumber green mottle mosaic virus (CGMMV). They selected seed lot (1877T-2B) of bottle gourds (Lagenaria siceraria L.) cv. “S-1” for testing. Test seeds of the working samples were soaked in pure water overnight; the suspensions were then used as coating antigens to initiate the indirect enzyme-linked immunosorbent assay (ELISA) for detecting the presence of CGMMV. Fifteen sub-samples were randomly taken from the seed lot. Working samples were prepared using pool sizes (k) of 1, 2, 5, 10, and 100 seeds from each sub-sample (cluster). When k=1, 2, 5, or 10, 10 replicates (nl) of each were used in the experiment. However, if k=100 of a sample, only five replicates were used. The aim of the experiment was to estimate the proportion of infected seeds and its CI with group testing (Table 4).
|Cluster l,||kl||nl||zl||Cluster l,||kl||nl||zl||Cluster l,||kl||nl||zl|
Table 4: Data for detecting CGMMV in seed. kl is the pool size in cluster l, nl is the number of pools in cluster l, and zl is the number of positive pools in cluster l.
The MLEs based on Equation 3 were and while the approximate MLEs using the beta-binomial approach were and . With the conventional binomial model, the approximate estimate was . The approximate MLE based on the beta-binomial approach is almost identical to the exact MLE, whereas the binomial estimate is different. Also, the estimated correlation using the beta-binomial model is very close to that given by the exact MLE. Furthermore, the 95% confidence interval based on the profile likelihood of π using the exact MLE approach is (0.006897, 0.070214) and that of the beta-binomial approach is (0.006928, 0.068484), which indicates the similarity of the results of the two approaches. Indeed, the profile likelihood of this approach overlies the profile exactly, as shown in Figure 6.
Using the traditional binomial model, (0.002724, 0.009334) and (0.003181, 0.010005) are the 95% Wald and profile confidence intervals, respectively. The similarity of these confidence intervals is due to the assumption of independence among individuals (and also among pools) in each cluster and a large sample of 135 pools. Note that these confidence intervals have a narrow width because they ignore extra binomial variation.
The 95% Wald confidence interval for π is (-0.002029, 0.042472) with the exact MLE approach, while with the approximate beta-binomial approach it is (-0.001769, 0.042303). As before, the exact MLE and the approximation based on the beta-binomial approach produced similar results. It is important to point out that the width of our confidence intervals is larger than the width adjusted for overdispersion that Liu et al.  reported. This can be explained by the fact that Liu et al.  used a quasi-likelihood approach to model the number of positive pools by cluster with the assumption that the individuals within each cluster are independent binary variables having the same prevalence. In contrast, our approach is based on the assumption that the responses of all individuals within each cluster are equally correlated binary variables and, as a result, we take into account the induced correlation between individuals and pools.
When we obtained a sample of N independent clusters from a finite population of clusters, we sampled individuals within each selected cluster and randomly allocated these individuals to nl pools of size kl individuals for the detection or estimation of a particular disease (positive). To produce correct estimations, in this case it is important to take into account the correlation between units and pools. For the purpose of estimation, it is important to use the probability mass function (pmf) of the number of positive pools in a cluster derived in this study to correctly estimate the proportion of interest, because it takes into account the fact that the pools formed in each cluster are correlated. Also, we showed that if we use the binomial distribution to estimate the proportion of interest, the results will present a large bias and very inflated mean square errors when N ≥ 30. This result agrees with the paper of Hung and Swallow , who concluded that “for clustered and correlated individuals in each cluster even using a small pool size offers a little robustness.” Since our methods (exact and approximate) induce correlations between individuals with a beta distribution, they are valid for hierarchical sampling because they take into account the correlation between individuals and pools in each cluster. This is an advantage over the approach proposed by Liu et al. , which is not appropriate for a hierarchical sampling process because they assumed that the individuals in each cluster are i.i.d binomial distributed and used a quasi-likelihood approach to correct for the presence of overdispersion.
For this reason, it is important to use the pmf given in Equation 3 to obtain correct estimations of the proportion in a group testing context when the responses are correlated. However, using Equation 3 when the sample size increases is inefficient due to the term involving the sum that it contains. For this reason, we studied an approach based on the beta-binomial model, which according to the simulation study performed, produces results that are very close to those obtained using the exact distribution (Equation 3) with the great advantage that the approach based on the beta-binomial model is computationally more efficient, although we still need to use Equation 1 and Equation 2 to estimate the corresponding parameters required for the beta-binomial model. In addition, we control the induced correlation because we get a closed form of the probability of a positive pool and the correlation between any two pools.
Suppose that conditionally on p, Y has a Bernoulli distribution with parameter p, and that p has a beta distribution with parameters α and β. The mean and variance of Y, respectively, are:
Then, if conditionally on p, Yi and Yj are independent Bernoulli variables with parameter p, the unconditional correlation of Yi and Yj is given by
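For reference, the standard beta-Bernoulli moments behind these statements can be written out as follows. This is a reconstruction from the stated assumptions (p distributed as a beta with parameters α and β, and responses conditionally independent given p), not a copy of the original displays.

```latex
\begin{align*}
E(Y) &= E\{E(Y \mid p)\} = E(p) = \frac{\alpha}{\alpha+\beta}, \\
\operatorname{Var}(Y) &= E(Y)\{1 - E(Y)\} = \frac{\alpha\beta}{(\alpha+\beta)^{2}}, \\
\operatorname{Cov}(Y_i, Y_j) &= E(p^{2}) - \{E(p)\}^{2} = \operatorname{Var}(p)
  = \frac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)}, \\
\operatorname{Corr}(Y_i, Y_j) &= \frac{\operatorname{Cov}(Y_i, Y_j)}{\operatorname{Var}(Y)}
  = \frac{1}{\alpha+\beta+1}.
\end{align*}
```

Substituting α=π/θ and β=(1−π)/θ gives E(Y)=π, Var(Y)=π(1−π) and a correlation of θ/(1+θ), in agreement with the parameterization used in the main text.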
Since Zlj is a binary random variable and ,
This last equality is true because if
Now, since all the individuals within a cluster are independent conditional on p, subsequently any two pools are as well, i.e.,
Thus the correlation between any two pools in the same cluster (l) is
where , which corresponds to the probability that a pool is positive as if it were made up of 2k individuals.
Note that, conditionally on p, Zl has a binomial distribution with parameters nl and 1−(1−p)^k, because all the individuals within a cluster are conditionally independent and each pool is therefore positive with probability 1−(1−p)^k. Hence the marginal distribution of Zl is given by
It is well known that Γ(x+1)=xΓ(x). So by recursively applying this property of the gamma function, we get
Here, it is easy to see the following:
• is increasing with respect to k
To obtain the gradient of first let c be a constant and be the first derivate of the usual gamma function. Note that
Hence, using the parameterization