The Use of Score Tests for Frailty Variance Components in Recurrent Event Data

In the analysis of recurrent event data, frailties are commonly used to model the dependence structure among repeated event times within an individual. Often it is of interest to test whether the variance component in a frailty model is zero. It is well-known that the usual asymptotic mixtures of chi-square distributions of the score statistics for testing constrained variance components do not necessarily hold. In this paper, we propose and explore a stochastic permutation score test based on randomly permuting the indices associated with the individuals of a survival model. An empirical study suggests that the proposed score test has approximately the correct level of significance and is more powerful than the asymptotic score test based on the mixture of chi-square distributions. The proposed test is illustrated using two sets of actual recurrence failure time data obtained from clinical experiments.


Introduction
In many applications of survival data, often the survival times are not independent. Such data can arise when different individuals share some common feature or when we observe recurrent event times within an individual. Frailties may be used to model the dependence structure in such survival data. A frailty can also be used to describe an unobservable genetic effect if individuals are in sibling groups or an environmental effect if individuals are grouped by households. Some experiments lead to repeated event times within an individual, and in such cases, frailties can be used to model the association among the repeated outcomes.
Frailties can also be used to model survival data in the presence of unobserved heterogeneity. For univariate (independent) failure times, these may be used to describe the effects of unobserved covariates in a proportional hazards model. For multivariate (dependent) failure times, these may be used to model the dependence structure among the multivariate observations, where it is usually assumed that given the frailty the multivariate failure times are conditionally independent. Frailty models are extensions of the Cox regression models [1], and provide an alternative way of modelling survival data where the hazard function is not monotonic, or where the hazards are not proportional. Frailties are also used to model survival times for grouped individuals, such as twins or family members, and recurrence survival times for the same individual. It is usually assumed that the frailties are random observations from a probability distribution with mean zero, but with an unknown variance component that needs to be estimated. A number of authors studied frailty models for describing heterogeneity in survival data, which include McGilchrist and Aisbett [2]; Klein [3]; McGilchrist [4]; Aalen [5,6]; Hougaard [7]; Klein and Moeschberger [8]; and Hougaard [9].
Often we are interested in testing whether the individuals in recurrent event data or groups in clustered survival data are homogeneous for given explanatory variables, or equivalently, whether the variance component in a frailty model is zero. In this paper, we investigate a score test for testing the homogeneity of the individuals (or groups). The score statistic is derived from a Taylor series expansion of the likelihood function about the mean of the frailty. The development of the test procedure is computationally less intensive as compared to the full likelihood ratio test in that it only requires the calculation of the score statistic under the null hypothesis of no frailty. It is well-known that for finite samples, the usual one-sided score tests based on mixtures of chi-squares often result in incorrect estimates of the level of significance (see, for example, Shephard and Harvey [10]; Shephard [11]; Pinheiro and Bates [12]; Crainiceanu, Ruppert and Vogelsang [13]; Crainiceanu and Ruppert [14]; Fitzmaurice, Lipsitz and Ibrahim [15]; and Sinha [16]). As a remedy, here we propose and explore a permutation test that approximates the p-value of the one-sided score test for the variance component. The proposed test provides approximately the correct level of significance under the null hypothesis even for small samples, and is also more powerful than tests based on mixtures of chi-square distributions.
The paper is organized as follows. Section "The Score Test" describes the proposed score statistic for testing the variance component in a frailty model. It also describes the permutation method for approximating the p-value of the score test. Section "Illustrative Example" presents an example using a survival model with a shared frailty for recurrent event data, and illustrates the calculation of the score statistic for testing the variance component of the frailty term. Section "Simulation Study" presents results from a simulation study that was carried out to investigate the finite-sample properties of the proposed permutation score test as well as the asymptotic score test based on the mixture of chi-square distributions. Section "Applications" provides applications of the proposed test with two sets of actual recurrence failure time data obtained from clinical experiments. Section "Discussion" concludes the paper with some discussion.
where the i th individual has n i repeated event times. Let T ij denote the j th event time for the i th individual and δ ij denote the censoring information (δ ij = 0 if T ij is right-censored; δ ij = 1 otherwise). We assume that the censoring is non-informative. Let x ij = (x ij1 ,..., x ijp )' denote the vector of explanatory variables associated with the (i,j)th event. Suppose, conditional on the frailty u i , the hazard rate for the j th event is of the form: is a baseline hazard function depending on a vector of unknown parameters θ and β is a vector of unknown regression coefficients. We assume that the frailties u i s are independent and follow a normal distribution with mean 0 and variance component 2 u σ . As the hazard rates within an individual share the same frailty, the event times within an individual are not independent. However, in the limit as 2 0 u σ → , they tend to be independent.
Given the data {(t ij ,δ ij );i=1,...,N;j=1,...,n i }, the marginal likelihood of the model parameters may be expressed in the form: S t being the survivor function for the (i,j)th event at time t, f ij (t), is also the corresponding density function at time t, and i u E denotes the expectation with respect to the distribution of the frailty u i . Similarly to Cox [17] (see also, Dean [18]), we expand the term where To test the homogeneity of individuals, we set the null hypothesis that there is no difference in event times between individuals for the given explanatory variables, whereas the alternative hypothesis is that the event times for a given individual share a common frailty. This is equivalent to testing the null hypothesis being the baseline cumulative hazard function at time t, depending on the parameters θ.
The score statistic is defined as a function of 0 ( ) are the ML estimators of the nuisance parameters ( , ) An approximate variance of the score function U(γ) can be derived from the observed Fisher information matrix: For calculating the Fisher information I(γ), after some algebra, we can show that for the i th individual, To test the null H 0 against the one-sided alternative H 1 , following Silvapulle and Silvapulle [19], we use a score statistic in the form where 0 ( ) The development of this statistic is motivated by the fact that in the limit as N→∞, it becomes the likelihood ratio statistic (see Silvapulle and Silvapulle [19], for details). For positive 2 u σ , the score at zero is positive, and the infimum in (7) becomes zero in {δ>0}. But when 2 u σ is negative, the score at zero is also negative, and so the infimum in (7) is attained at δ=0 and the statistic T becomes zero. As noted by Verbeke and Molenberghs [20], there should be valid models for sufficiently small but negative values of 2 u σ , even under a constrained setting.
As we noted earlier, under the null χ has a chi-square distribution with one degree of freedom.
However, for finite values of N, this mixture of chi-squares may lead to incorrect level of significance. In the case of a linear mixed model with a fixed intercept and a random group (or cluster) effect, Crainiceanu and Ruppert [14] showed that for fixed N and i n → ∞ , the likelihood ratio statistic has a mixture distribution in the form where α N is determined by the group size N. However, for survival models with frailties, the appropriate mixture of chi-squares may not be straightforward to obtain.
To approximate the distribution of the score statistic, we propose a permutation method. The proposed permutation score test provides approximately the correct level of significance under the null, even for small samples. To obtain an approximate p-value of the score test based on the permutation method, we adopt the following algorithm: 1. For given survival data, obtain estimates γ of the nuisance parameters γ under the null hypothesis 3. The approximate p-value of the stochastic permutation score test is determined by the proportion of permutation samples with r obs T T ≥ .
Note that Fitzmaurice, Lipsitz and Ibrahim [15] used a similar permutation method to approximate the p-value of a likelihood ratio test for testing the variance components in multilevel generalized linear mixed models. Here, we adopt the permutation method to approximate the p-value of the proposed score test for testing a frailty variance component in recurrent event data.
We have described the score test for a frailty variance component in the framework of recurrent event data. This approach can also be used for testing the significance of frailties in clustered survival data. We may encounter such clustered data when different individuals share some common characteristics. For example, in a multicenter clinical experiment, the survival times of patients from the same center may be more similar as compared to those from different centers. This could be due to different health care services provided to the patients in the different centers. Here, we can treat the centers as clusters and describe the homogeneity of survival times of the patients within a center using a shared frailty model.

Illustrative Example: A Proportional Hazards Model with a Shared Frailty
Consider a simple two-level exponential hazards model with two covariates x ij1 and x ij2 , and with a single frailty u i : where h 0 (t)=λ is the baseline hazard function, and the frailties u i are assumed to be independently and normally distributed with mean zero and unknown variance component represents the vector of covariates, and Also, for the Fisher information matrix, we have under The score statistic T takes the form (7)  We performed a simulation study to investigate the empirical properties of the proposed permutation score test for the variance component in (8). Details are provided in the next section.

Simulation Study
To study the finite-sample properties of the proposed permutation test, we ran a series of simulations. The "true" event times y ij with . For each combination of N=50, 100, 200 individuals and n=2, 10, 50 repeated event times within individuals, we performed a simulation study based on 1000 replicates of data sets. The censoring times c ij were assumed to be independently and identically distributed exponential random variables with mean * 1/ 1/ 0.08 λ = , and the censoring mechanism was assumed to be non-informative. The observed data were {(t ij ,δ ij );i=1,...,N;j=1,...,n}, where t ij =min(y ij ,c ij ) and δ ij =I(y ij < c ij ).
The value of the frailty variance component 2 u σ in (8) was set to zero when investigating the empirical level of significance of the proposed test. We compared the p-value of the permutation test to the asymptotic p-value obtained by the (0.5, 0.5) mixture of chisquare distributions. The permutation p-value was based on B=1000 permutation samples, and the empirical level of the test was obtained as the proportion of samples for which the estimated p-values were less than the nominal level α=0.05. Table 1 presents the estimated levels of significance of the two score tests. We note from the results that the level of the permutation test is generally much closer to the nominal 0.05 level of significance, as compared to the level based on the mixture of chi-square distributions. For the latter case, the levels get closer to the nominal 0.05 level only when the number of events n and the number of individuals N increase. The permutation test roughly provides the correct level of significance in each of the simulation configurations considered. Also, the approximate 95% normal confidence intervals of the true levels, based on the estimated standard errors, suggest that the empirical levels of the permutation tests are not significantly different from the nominal 0.05 level, whereas most of the levels from the asymptotic score tests are significantly different.
We also investigated the powers of the two score tests using the same simulation configurations as above. The empirical powers were calculated under the alternative hypothesis . We used 1000 simulation replications for each simulation configuration, and also used 1000 permutation samples to find the permutation p-value of the score test. Table 2 presents the empirical powers of the two tests. It is clear that the proposed permutation test is generally more powerful than the test based on the mixture of chi-square distributions for the simulation configurations considered. For N=200 and n=2, however, the score test based on the asymptotic mixture appears to be more powerful than the permutation test. But it should be noted that the empirical powers of the asymptotic score test may not be reliable here as the test provides incorrect level of significance for the given sample size.
The results from the simulation study demonstrate that the proposed permutation test generally has the correct level of significance and is also more powerful than the asymptotic test based on the mixture of chi-square distributions.

Bladder cancer data
Wei, Lin and Weissfeld [21] presented and analyzed some tumor recurrence data obtained from a bladder cancer study conducted by the Veterans Administration Cooperative Urological Research Group (see Byar [22] for details). In this study, all patients entering the trial had superficial bladder tumors. After removing these tumors transurethrally, patients were randomly assigned to one of three treatments: placebo, thiotepa and pyridoxine. During the study, many patients had multiple recurrences of tumors, and new tumors were removed at each visit. Due to the sparseness of the data, only the first four recurrence times were reported. One of the analyses considered by Byar [22] and Wei, Lin and Weissfeld [21] was based on the tumor recurrence times from patients in the two groups placebo and thiotepa.
Here, we revisit the tumor recurrence data, where we consider modelling the recurrence time of a patient measured from the removal of old tumors at a given visit until the recurrence of new tumors. The recurrence time T ij represents the number of months from a given visit until the next j th tumor recurrence for the ith patient (i=1,…,85; j=1, …,4). As before, δ ij represents the censoring information (δ ij =0 if T ij is right-censored; δ ij =1 otherwise). The covariates considered in the study are: TREAT i =1 if the i th patient is in the thiotepa group and 0 otherwise; NUMBER i =number of initial tumors for the i th patient; and SIZE i =size of the largest initial tumor for the i th patient. We consider a proportional hazards model with a shared frailty in the form for i=1,…,85;j=1,…,4, where t is the time from the beginning of the j th recurrence interval and u i is the random effect (frailty) for the i th patient, assumed to be independently and normally distributed with mean 0 and variance component 2 The null hypothesis to be tested is that there is no difference in recurrence times between subjects for the given explanatory variables (    Table 3. Here, the treatment thiotepa appears to decrease the risk of recurrence of tumors, whereas this risk is increased by a large number of initial tumors. Note, that as we compute the score statistics under the null hypothesis of no frailties, the permutation score test does not require much computation time even with a large number of permutation samples. For the above example, we found the permutation p-value of the score test based on 10000 permutation samples in about 4 minutes and 50 seconds using the R package on a 64-bit Operating System with AMD Turion(tm) II P540 Dual-Core Processor 2.40 GHz and with 4.00 GB RAM.

Kidney data
McGilchrist and Aisbett [2] published some recurrence data, and studied the recurrence of infection in kidney patients who were using a portable dialysis machine. The infection in patients occurs at the point of insertion of the catheter. When the infection occurs, the catheter must be removed, the infection cleared up, and then the catheter reinserted. Recurrence times are measured from insertion until the next infection. When the catheter is removed for other reasons, there is right censoring of the data. Also, the final recurrence time may be censored as each patient is followed for a predetermined number of recurrence times. The covariates considered in the study are: AGE, and binary indicators for FEMALE as well as disease types GN, AN and PKD.
We revisit the kidney data, and consider a proportional hazards model with a shared frailty in the form for i=1,…,38;j=1,2, where t is the time from the beginning of the j th recurrence interval and u i is the i th patient effect (frailty), assumed to be independently and normally distributed with mean 0 and variance component 2 Here, we are interested in testing if there is any significant difference in recurrence times between subjects for the given explanatory variables or, equivalently, if the recurrence times for the same individual share the same frailty. Initially, we conducted the score test based on a subset of the data with a fewer number of patients by considering the last 25 individuals (patients 14-38 in Table 1 of McGilchrist and Aisbett [2]) in order to investigate the performance of the two score tests under a small sample. The value of the score statistic is obtained as 0.0594. The asymptotic p-value based on the mixture of chi-squares provides a value of 0.4038, whereas the permutation test based on 10000 permutation samples produced a smaller p-value value of 0.0469. Clearly, unlike the asymptotic score test, the permutation method indicates that there is significant difference at 5% level in recurrence times between subjects for the given explanatory variables. This suggests that for small samples, the two test procedures can provide different conclusions, and the proposed permutation test may be preferable to the asymptotic score test in such a case.
In the next step, we performed the score test based on the recurrence times for all 38 patients. The score statistic produced a small value of 0.0339. The (0.5, 0.5) mixture of chi-squares produced a p-value of 0.4270, whereas the permutation test based on 10000 permutation samples produced a p-value value of 0.4115. Both methods indicates no evidence against the null that the subject-specific frailty is 0. So, we consider fitting the above hazards model with no frailty. The maximum likelihood estimates of the model parameters, and their corresponding asymptotic standard errors are presented in Table 4. Here the FEMALE group and the disease type PKD appear to have higher risk of recurrence of infection in kidney patients.

Discussion
For testing the significance of a variance component in a frailty model, the proposed permutation score test provides a simple alternative to computing the asymptotic p-values of score tests based on the (0.5, 0.5) mixture of chi-square distributions. Our limited simulation study suggests that the permutation test has approximately the correct level of significance, and is also more powerful than tests based on the mixture of chi-square distributions under finite samples. The permutation score test is easy to implement and is also attractive in that it only requires estimation of the the fixed effects parameters under the null hypothesis of no shared frailty in the proportional hazards model.
We have discussed the permutation test for testing homogeneity of individuals in two-level survival data. This test can be easily extended to multilevel survival data with more than two levels. For testing homogeneity of groups at a given level, we can permute the indices corresponding to that level. For example, consider the bladder cancer study discussed in Section "Applications". If the patients are nested within medical practices, then we can test for homogeneity at the practice level by permuting the practices that the patients are assigned to, while the recurrence times would remain with the same patients for a given practice.
Note that Sinha [16] studied a parametric bootstrap score test, based on random samples generated from the fitted model under the null, which approximates the p-value of a one-sided score test for variance components in generalized linear mixed models. This   parametric bootstrap procedure, however, may not be directly applicable to survival data, since the development of the test procedure is complicated by the fact that the survival times are often censored.