Statistical Methods for Estimating Within-Cluster Effects for Clustered Poisson Data

Clustered Poisson data frequently appear in medical research. Interest often focuses on examination of an exposure effect within clusters. The objective of this paper is to compare the performance of six methods for estimating the exposure effect for clustered Poisson data: 1) independent Poisson; 2) fixed cluster effects Poisson; 3) conditional likelihood Poisson estimation; 4) Generalized Estimating Equations (GEE); 5) random cluster effects Poisson; and 6) random cluster effects Poisson, with separate betweenand within-cluster effects. Biases and standard errors of withincluster exposure effects are compared across the six statistical methods considering constant or varying exposure ratio (number of exposed to unexposed subjects), constant or varying cluster sizes, different within-cluster exposure effect, different cluster variances, and number of clusters. Simulations and theoretical results show that exposure ratio is a key quantity. With constant exposure ratio designs, maximum likelihood estimates and asymptotic standard errors were obtained in closed form. All models, except GEE, give equivalent estimates and standard errors of the within-cluster exposure effect. With varying exposure ratio designs, conditional likelihood and fixed cluster effects methods yield the same estimates and standard errors for the exposure effect. Results from the random cluster effects Poisson model with separate betweenand within-cluster effects are very similar to those from fixed cluster effects Poisson and conditional Poisson methods. We applied the above approaches to birth cohort data, to analyze incidence of Respiratory Syncytial Virus (RSV) infection in young children in Indonesia. Institute for Health Research, Kaiser Permanente Colorado, Denver, Colorado, USA Journal of Biometrics & Biostatistics J o u rn al of Bio metrics & Bistatis t i c s


Introduction
Clustered data arise often in studies of clinical and health care research. Clusters are formed either by natural matching, for example in family or twin studies, or by matching on some characteristics that may have associations with the outcomes, and the exposure (confounders). In general, subjects in a cluster are correlated, and the strength of correlation depends on the nature of outcomes and the clustering criteria.
This paper focuses on clustered Poisson data, in which some subjects are 'exposed' and some, are not in a cluster. Exposure may refer to any variable, including treatment that varies among units (typically subjects), in a cluster. A covariate in clustered data generally has two components, between-cluster component represented by the cluster mean of the covariate, and within-cluster component which is the deviation from the cluster mean [1]. Our interest is to estimate the within-cluster exposure effect. Several statistical methods can be considered to assess the effect, including 1) Independent Poisson model (IP); 2) Fixed Cluster effects Poisson model (FCP); 3) Conditional likelihood Poisson estimation (CP); 4) Generalized Estimating Equations (GEE); and two versions of random cluster effects Poisson models; one that does not attempt to estimate separate between-and within-cluster effects (5,RCP), and one that does (6,RCP_bw). Detailed descriptions of these methods are given in the next section, along with a review of relevant literature.
Several features are important in design and analysis of studies with clustered data, including number of clusters, number of subjects per cluster (cluster size), and the ratio of number of exposed to unexposed subjects in each cluster (termed "exposure ratio" in this paper). The objective of this paper is to compare the performance of above six methods, for estimating the exposure effect for clustered Poisson data in a variety of designs. Similar investigations have been performed for binary outcomes [1][2][3][4][5][6], but our theoretical and simulation results are unique for clustered Poisson data. It will be seen that the exposure ratio plays a key role in determining behavior and choice of statistical methods. Constant exposure ratio will be shown to have several advantages and can be achieved in designed studies, for example: sampling from large databases, based on subjects' characteristics or propensity scores [7,8].
The motivation for this paper was a study conducted in young children in Indonesia, to examine the association of incidence of Respiratory Syncytial Virus (RSV) infection, with characteristics of the mother. RSV is the major cause of viral lower respiratory tract infections in infants and young children worldwide. The outcome variable was the number of RSV infections, during the follow-up period. It was thought that the time during the year when the child was born could be an important factor, due to temperature and rainfall variation, across the year. To adjust for this effect, children were grouped into 24 birth cohorts of half month each, according to their birth dates.

Statistical Methods
We assume clustered Poisson data from the following scenario. Let Y ij denote the outcome for the j th subject (j=1,…,n i ) in the i th cluster (i=1,…,K), E(Y ij =λ ij , and there is an exposure indicator variable x ij equal to 1 for exposed and 0 for unexposed subjects. We assume a log link function, Where α is the intercept, β is a parameter for the exposure variable x ij , and γ i are cluster specific effects, and independent across clusters. We assume that Y ij |λ ij~P oisson (λ ij ) and Y ij are independent across clusters, and within clusters conditional on γ i . The following six statistical methods were considered for analyzing this type of clustered Poisson data.

Independent Poisson model (IP)
IP assumes that there are no cluster effects, γ i =0 for all clusters. Maximum Likelihood Estimates (MLE) can be obtained with standard softwares, using likelihood techniques. In this paper, SAS PROC GENMOD with log link function and Poisson distribution were used to fit this model. The IP model ignores the correlation among observations within a cluster, and thus serves as a standard for other models, if the correlation is zero. It also serves as a benchmark indicating how biased, inefficient or invalid it is to ignore the correlation, when the actual correlation is not zero.

Fixed Cluster effects Poisson model (FCP)
FCP assumes that the γ i is unknown, but fixed coefficients. MLEs for β and γ i are obtained from standard softwares (e.g. SAS PROC GENMOD), by including clusters as fixed effects in the model. However, the usual asymptotic theory does not hold when the number of parameters increases with the number of clusters. The FCP model cannot be used to estimate between-cluster effects, since they are collinear with the γ i .

Conditional likelihood Poisson estimation (CP)
In CP, the γ i are treated as nuisance parameters, and are conditioned out of the likelihood function [9,10]. The conditional joint density of the n i observations in cluster i is 1,..., is the sufficient statistic for γ i . The above density function is free of the cluster effects, γ i ,due to conditioning on its sufficient statistics. The cluster effect γ i are not necessarily independent of the covariates x ij , although dependencies are conditioned out along with the γ i . To obtain MLEs, SAS PROC NLMIXED was used to fit the CP model, by defining the log likelihood function and using the general (ll) statement. The CP model estimates purely within-cluster effects, and cannot estimate effects of between-cluster covariates.

Generalized Estimating Equations (GEE)
GEE is used to estimate population average parameters, where marginal models are specified. It is not a likelihood based approach. The model is specified as log(λij)=α*+β*x ij , and the correlation among subjects in a cluster is accounted for, by assuming a 'working' correlation matrix, in this case Corr(Y ij ,Y ih )=ρ for all pairs, with j≠h [11]. We used SAS PROC GENMOD to fit the model with the empirical corrected variance estimate for its robustness throughout this paper.

Random Cluster effects Poisson model (RCP)
RCP assumes that the γ i in (1) are independently and identically distributed N(0,σ 2 ), with σ 2 representing between cluster variance. The likelihood of observations in a cluster conditional on γ i is a product of Poisson densities, and the marginal likelihood is obtained by integrating over the γ i . In general, the integral cannot be evaluated in closed form. Numerical integrations were carried out, and MLEs were obtained with SAS PROC NLMIXED.

Random Cluster effects Poisson model separating betweenand within-cluster effects (RCP_bw)
If the clusters do not have the same exposure ratio, the cluster means of exposure ratios represent between-cluster information, and between-and within-cluster exposure effects are separable [1]. We refer to this model as RCP_bw. Equation (1) Where β b represents the between-cluster exposure effect, β w represents the within-cluster exposure effect, and is the focus of this paper, i x is the mean of x i . Likelihood ratio tests can be used to test, if β w =β b . If they are equal, RCP_bw reduces to RCP.
Issues related to appropriate statistical approaches for analysis of clustered data have been investigated in a number of studies with binary outcomes, as mentioned previously. The FCP method accounts for cluster differences, but has been shown to have severe biases in the case of matched pairs [14]. The CP method accounts for clustering effect, and is consistent and asymptotically efficient for matched pairs with binary outcomes [15]. It is also robust to the distribution of random effects, and to the association of random cluster effects γ i with covariates x ij . However, it cannot estimate effects of between-cluster covariates. GEE accounts for clustering of subjects, and is easy to carry out using standard software (e.g. SAS PROC GENMOD). However, it has been noted to have high type I error rates for between-cluster effects in clustered binary data [5]. RCP and RCP_bw account for clustering of subjects, and when applicable, they seem to be the best models for analyzing clustered binary outcomes [1,2]. They do make distributional assumptions on the γ i , however, it has been shown in other studies [16,17], that these assumptions do not substantially affect estimates of covariate effects.
There have been fewer studies with clustered Poisson data as outcome. It has been shown that with Poisson outcomes, FCP and CP, give the same estimate of within-cluster exposure effect, regardless of equality of exposure ratios [18]. Demidenko [19] studied clustered Poisson models and showed equality of some of the estimators described above, assuming constant and balanced (equal numbers of exposed and unexposed in each cluster) exposure ratios and equal cluster sizes. Gail [20] and Petersen and Deddens [21] showed that in designs balanced in all covariates with Poisson data as outcome, removing covariates does not affect estimates of effects of other covariates, which is related to equality of FCP and IP in our situation.

Theoretical Results
With constant exposure ratio, regardless of the number of subjects in each cluster being constant or not, IP, CP, FCP, and RCP methods The coefficients α* and β* are estimated by solving the generalized estimating equations. In the Poisson case, the marginal parameter β* from GEE corresponds to the subject-specific parameter β in (1), although the intercepts α and α* differ [12,13].
And its asymptotic variance is In (2) and (3), n 1 and s 1 are the total number of subjects, and total number of events for the exposed group (x=1), and n 0 and s 0 are the corresponding numbers for the unexposed group (x=0). Note that With constant exposure ratio designs, the results show that cluster effects γ i can be ignored, and simple Poisson regression can be used to estimate the exposure effect, which does not require cluster size being constant. Equations (2) and (3) provide straightforward sample-based expressions of the similar results, with less restrictive assumptions, compared to studies by Gail [20], Petersen and Deddens [21] and Demidenko [19]. Calculation of the marginal likelihood for the RCP method requires integration over the random effects, and in general, the integral cannot be evaluated in closed form, so a simple analytical result is unexpected. In our case, g(γ), the density function of γ, needs not be Gaussian as typically assumed, and (2) and (3) hold generally for random effects distributions, that are sufficiently smooth.
Assuming normally distributed random effects in the RCP model, the theoretical asymptotic variance of β using the expected information is It is not intuitive that v ar( ) β in equation (4) is inversely proportional to the between cluster variance σ 2 . This is due to the fact that the mean of log normal distribution is proportional to the σ 2 , E(Y)=exp(α+βx+σ 2 /2), consequently the standard error of β estimate is smaller because of larger mean of Y. Equation (4) is useful in interpreting the simulation results, and in sample size estimation for designs with constant or near constant exposure ratio across clusters.
Another common random effects model uses Poisson distribution conditional on the random effects that follow a gamma distribution [22,23]. The marginal likelihood of this model is in closed form, and the corresponding MLE and its asymptotic variance can be derived algebraically.
Note that the equalities derived and discussed above refer to estimates and standard errors for the coefficient of exposure effect, β, only. Other aspects of the models (e.g. intercept α, likelihood) may differ.

Simulations and Results
We performed simulations to study the impact of exposure ratio, cluster size, and between cluster variance on bias and precision of estimates of β, the within-cluster exposure effect. 500 data sets were simulated under each condition.
We considered the following cluster characteristics: 1) cluster size (n i ). Two designs were used, constant cluster size with n i =6 subjects per cluster, or varying cluster size with cluster sizes n i =6 to 42 subjects, by an increment of 6 among clusters. 2) Number of clusters (K), K=20, 30 or 50 clusters for constant cluster size design, and K=30 clusters for varying cluster size design. 3) exposure ratios, exposure ratios 3:3 for each cluster, describing constant ratio and balanced clusters; 1:5 (one exposed and 5 unexposed subjects) for each cluster, describing constant ratio, but unbalanced clusters; and a varying ratio design, where the exposure ratio varies randomly among clusters.

Simulation results
Bias of β : Our theoretical derivations showed that under constant exposure ratio designs four models, IP, FCP, CP, and RCP, generate the same estimate of β, which is a simple closed form as in (2). Simulation results (not shown) verified these theoretical results, and also indicated no bias of practical importance in any cases, including the varying exposure ratio and varying cluster size designs. Table 1 shows the precision of estimation of β, measured in three ways. For constant exposure ratio designs, theoretical results show that the asymptotic standard errors are the same for IP, FCP, CP and RCP, which is verified by simulations. It is also shown that avg{SE( )} β correctly depict the variability of β since they agree with  (4). For GEE approach, ( ) SD β is very similar to that of the other methods, but â vg{SE( )} β tends to be slightly smaller with larger σ 2 values, or varying exposure ratio designs, or smaller numbers of clusters (K=20, data not shown). Implications of this for type I error rates are discussed below.

Precision of β :
It also shows greater precision of β for 3:3 designs, compared to 1:5 designs with the same number of clusters and subjects. This can also be seen from equations (3) and (4), and it mirrors the usual benefits in precision from balanced designs (e.g. equal group size t-tests). For constant exposure ratio designs, precision of β increases as σ 2 increases, this agrees with equation (4). Equation (4) also shows that var( ) β is inversely proportional to β, and to the numbers of exposed and unexposed subjects, which agrees with our simulations (data not shown). Note that these results relating precision to cluster variance depend on the normality assumption of the random effects. For The six statistical methods were compared based on bias and standard error of the estimates. Bias was calculated as the average of the 500 estimates of β minus its true value. The Standard Error (SE) of the estimate of β was calculated three ways: as the average of the 500 standard errors given by the procedures, denoted  (1), the method used in the appendix can be used to derive the corresponding theoretical results for other random effect distributions. All of these patterns continue to hold when the cluster sizes are not equal. The varying and fixed cluster size results are not directly comparable, because of different numbers of subjects in a cluster.
For varying exposure ratio designs (   β of GEE tends to be slightly smaller than that of RCP. Again, this affects type I error rates. Compared to RCP and GEE, estimates of β from FCP/CP and RCP_bw tend to have slightly larger variability, as the latter are based on only the within-cluster information, while RCP and GEE use both within-and between-cluster information. Similar patterns have been noted in the binary outcome case [5]. Table 2 shows type I error rates under the null hypothesis, β=0. Type I error rates for constant exposure ratio designs remain at, or near 0.05 for IP, FCP, CP, and RCP in balanced 3:3, unbalanced 1:5, and varying cluster size designs. Type I error rates for GEE tend to be greater than 0.05, up to 0.10, with higher type I error rates occurring for larger σ 2 . This corresponds with the slightly smaller procedure standard errors

Type I error rates:
SD β represents the true empirical variability. Inflated type I error rates have also been observed for GEE with binary outcomes [5]. Patterns are similar for varying exposure ratio designs, except that IP is, as expected, invalid with very high type I error rates. â vg{SE( )} β from GEE in table 1 is the empirical standard errors, as commonly recommended for their robustness. We repeated the analyses with the model-based standard errors. We found that when the between-cluster variance is zero, the model-based standard errors are very similar to the empirical standard errors, but as the between-cluster variance increases, the model-based standard errors become substantially larger than those from other methods, resulting in loss in estimation efficiency.

Respiratory Syncytial Virus (RSV) Infection in Indonesian Infants
We applied these six approaches to the Indonesian infant RSV data. One of our objectives was to examine if child's frequency of RSV (outcome variable) is associated with mother's education, which is also a reflection of the family's socio-economic status. Mother's education level is a dichotomous 'exposure' variable, with 0 representing less than elementary education and 1 representing elementary or higher education. The data consisted of 557 subjects born in 1999, who were followed to the end of January of 2001. Children were grouped into 24 birth cohorts (clusters), of about a half month each, according to their birth dates. The clusters were used to adjust for birth cohort effects, since there is thought to be an irregular seasonal pattern in RSV incidence. Natural log of person times was included as offset in models.
The number of children in a birth cohort (cluster size) ranges from 13 to 36. The percentage of maternal education level higher than elementary education (exposure ratio) ranges from 40 to 82.6% among the birth cohorts, with 20 out of 24 cohorts having 60-80% of mothers with elementary or higher education. The RSV count for children varied from 0 to 3.
Results of analyses of the RSV data are shown in table 3. Consistent with our theoretical and simulation results, for varying exposure ratio designs, FCP and CP generated the same estimates and standard errors, and the estimate and standard error for RCP_bw are also very similar, since all of these methods use within-cluster information. Estimates from IP and RCP methods differ only slightly from those of FCP, CP, and RCP_bw methods and standard errors are only slightly smaller due to the similar exposure ratios across clusters, suggesting that there is not much between cluster information to improve the precision over within-cluster methods. GEE method provided slightly larger estimate and standard error.
Overall, children of mothers with elementary or higher education had lower risk of RSV, by a factor of about 50% (95% CI 0.29, 0.89).

Discussion
We have studied the behavior of six common methods for estimating the within-cluster exposure effects, under a variety of cluster characteristics for clustered Poisson data. Our results show that the exposure ratio plays a key role in determining the behavior of these approaches. For simplicity, we considered a single binary exposure variable, but results can be extended to a covariate with multiple levels, or to multiple within-cluster covariates provided the covariates are balanced across clusters. When exposure ratio is constant, MLE of β and its asymptotic variance are available in simple forms, and several common methods (i.e., IP, CP, FCP, and RCP) give identical estimates and standard errors. Note that neither equal cluster size, nor equal numbers of exposed and unexposed subjects within each cluster are required. When exposure ratio varies across clusters, there is also between-cluster information due to different mean exposure ratios in clusters. Our results indicate that the within-cluster exposure effect can be correctly estimated by CP, FCP and RCP_bw. This agrees with findings in studies with binary outcome. In addition, studies with binary outcomes [24,25] showed RCP_bw is an appropriate approach, even when mean exposure ratio is a potential confounder with cluster effects, i.e., exposure is correlated with cluster effect. RCP_bw also estimate between-cluster exposure effects, which is not available through CP or FCP. Inflated type I error rates for GEE in situations, including some constant exposure ratio designs raise caution in its use. GEE could be formulated with separate between-and within-cluster exposure effects, but due to the inflated type I error rates, we did not pursue this extension. However, it should be noted that GEE does offer protection against over dispersion, or unusual forms of correlation, which without modification the other methods do not.
Our results also have implications for design of studies with clustered Poisson data. When investigators have control over cluster characteristics, for example, when sampling from large databases or creating matched pairs or groups (e.g. using propensity score values), our results indicate advantages of holding the exposure ratio constant across clusters, including simplicity of analysis, robustness to the distribution of random effects, availability of simple theoretical expressions (equation (4)), that can be used for sample size estimation, and apparent improvement in precision over varying ratio designs. Expos. ratio=ratio of number of exposed to control subjects in a cluster; Varying ratio means the exposure ratios vary randomly from cluster to cluster; n i =number of subjects in each cluster; σ 2 =between cluster variance; IP=Independent Poisson model; FCP=Fixed Cluster Poisson model; CP=Conditional likelihood Poisson estimation; GEE=Generalized Estimating Equations with empirical standard error; RCP=Random Cluster effects Poisson model; RCP_bw=Random Cluster effects Poisson model where the within-and between-cluster effects were separated.  Table 3: Estimates (standard errors) of the log rate ratio from 6 models for the Indonesia RSV data.