Visit for more related articles at Journal of Biometrics & Biostatistics
In the analysis of recurrent event data, frailties are commonly used to model the dependence structure among repeated event times within an individual. Often it is of interest to test whether the variance component in a frailty model is zero. It is wellknown that the usual asymptotic mixtures of chisquare distributions of the score statistics for testing constrained variance components do not necessarily hold. In this paper, we propose and explore a stochastic permutation score test based on randomly permuting the indices associated with the individuals of a survival model. An empirical study suggests that the proposed score test has approximately the correct level of significance and is more powerful than the asymptotic score test based on the mixture of chisquare distributions. The proposed test is illustrated using two sets of actual recurrence failure time data obtained from clinical experiments.
Keywords 
Failure time data; Frailty; Maximum likelihood; Permutation test; Proportional hazards model; Score test; Variance component 
Introduction 
In many applications of survival data, often the survival times are not independent. Such data can arise when different individuals share some common feature or when we observe recurrent event times within an individual. Frailties may be used to model the dependence structure in such survival data. A frailty can also be used to describe an unobservable genetic effect if individuals are in sibling groups or an environmental effect if individuals are grouped by households. Some experiments lead to repeated event times within an individual, and in such cases, frailties can be used to model the association among the repeated outcomes. 
Frailties can also be used to model survival data in the presence of unobserved heterogeneity. For univariate (independent) failure times, these may be used to describe the effects of unobserved covariates in a proportional hazards model. For multivariate (dependent) failure times, these may be used to model the dependence structure among the multivariate observations, where it is usually assumed that given the frailty the multivariate failure times are conditionally independent. Frailty models are extensions of the Cox regression models [1], and provide an alternative way of modelling survival data where the hazard function is not monotonic, or where the hazards are not proportional. Frailties are also used to model survival times for grouped individuals, such as twins or family members, and recurrence survival times for the same individual. It is usually assumed that the frailties are random observations from a probability distribution with mean zero, but with an unknown variance component that needs to be estimated. A number of authors studied frailty models for describing heterogeneity in survival data, which include McGilchrist and Aisbett [2]; Klein [3]; McGilchrist [4]; Aalen [5,6]; Hougaard [7]; Klein and Moeschberger [8]; and Hougaard [9]. 
Often we are interested in testing whether the individuals in recurrent event data or groups in clustered survival data are homogeneous for given explanatory variables, or equivalently, whether the variance component in a frailty model is zero. In this paper, we investigate a score test for testing the homogeneity of the individuals (or groups). The score statistic is derived from a Taylor series expansion of the likelihood function about the mean of the frailty. The development of the test procedure is computationally less intensive as compared to the full likelihood ratio test in that it only requires the calculation of the score statistic under the null hypothesis of no frailty. It is wellknown that for finite samples, the usual onesided score tests based on mixtures of chisquares often result in incorrect estimates of the level of significance (see, for example, Shephard and Harvey [10]; Shephard [11]; Pinheiro and Bates [12]; Crainiceanu, Ruppert and Vogelsang [13]; Crainiceanu and Ruppert [14]; Fitzmaurice, Lipsitz and Ibrahim [15]; and Sinha [16]). As a remedy, here we propose and explore a permutation test that approximates the pvalue of the onesided score test for the variance component. The proposed test provides approximately the correct level of significance under the null hypothesis even for small samples, and is also more powerful than tests based on mixtures of chisquare distributions. 
The paper is organized as follows. Section “The Score Test” describes the proposed score statistic for testing the variance component in a frailty model. It also describes the permutation method for approximating the pvalue of the score test. Section “Illustrative Example” presents an example using a survival model with a shared frailty for recurrent event data, and illustrates the calculation of the score statistic for testing the variance component of the frailty term. Section “Simulation Study” presents results from a simulation study that was carried out to investigate the finitesample properties of the proposed permutation score test as well as the asymptotic score test based on the mixture of chisquare distributions. Section “Applications” provides applications of the proposed test with two sets of actual recurrence failure time data obtained from clinical experiments. Section “Discussion” concludes the paper with some discussion. 
The Score Test 
Suppose, we observe recurrent event data from N individuals, where the i^{th} individual has n_{i} repeated event times. Let T_{ij} denote the j^{th} event time for the i^{th} individual and δ_{ij} denote the censoring information (δ_{ij} = 0 if T_{ij} is rightcensored; δ_{ij} = 1 otherwise). We assume that the censoring is noninformative. Let x_{ij}= (x_{ij1},..., x_{ijp})’ denote the vector of explanatory variables associated with the (i,j)^{th} event. Suppose, conditional on the frailty u_{i}, the hazard rate for the j^{th} event is of the form: 
(1) 
where h_{0}(t)=h_{0}(t;θ) is a baseline hazard function depending on a vector of unknown parameters θ and β is a vector of unknown regression coefficients. We assume that the frailties u_{i}s are independent and follow a normal distribution with mean 0 and variance component . As the hazard rates within an individual share the same frailty, the event times within an individual are not independent. However, in the limit as →0, they tend to be independent. 
Given the data {(t_{ij},δ_{ij});i=1,...,N;j=1,...,n_{i}}, the marginal likelihood of the model parameters may be expressed in the form: 
(2) 
where with S_{ij}(t) being the survivor function for the (i,j)^{th} event at time t, f_{ij}(t), is also the corresponding density function at time t, and denotes the expectation with respect to the distribution of the frailty u_{i}. 
Similarly to Cox [17] (see also, Dean [18]), we expand the term in (2) using a Taylor series expansion about and take expectations with respect to u_{i} to obtain the marginal likelihood for the i^{th} individual as 
(3) 
where 
To test the homogeneity of individuals, we set the null hypothesis that there is no difference in event times between individuals for the given explanatory variables, whereas the alternative hypothesis is that the event times for a given individual share a common frailty. This is equivalent to testing the null hypothesis , against the onesided alternative hypothesis . The score test for testing the null is based on the score function 
(4) 
where and is the score function of for the i^{th} individual evaluated as =0: 
(5) 
with H_{0}(t)=H_{0}(t;θ) being the baseline cumulative hazard function at time t, depending on the parameters θ. 
The score statistic is defined as a function of , where are the ML estimators of the nuisance parameters under . An approximate variance of the score function U(γ) can be derived from the observed Fisher information matrix: 
evaluated at =0, where 
and 
For calculating the Fisher information I(γ), after some algebra, we can show that for the i^{th} individual, 
and 
where 
The variance of the score function U(γ) may be approximated from 
(6) 
To test the null H0 against the onesided alternative H1, following Silvapulle and Silvapulle [19], we use a score statistic in the form 
(7) 
where . The development of this statistic is motivated by the fact that in the limit as N→∞, it becomes the likelihood ratio statistic (see Silvapulle and Silvapulle [19], for details). For positive , the score at zero is positive, and the infimum in (7) becomes zero in {δ>0}. But when is negative, the score at zero is also negative, and so the infimum in (7) is attained at δ=0 and the statistic T becomes zero. As noted by Verbeke and Molenberghs [20], there should be valid models for sufficiently small but negative values of , even under a constrained setting. 
As we noted earlier, under the null , the score statistic T for the onesided test does not follow the usual chisquare distribution, since the value of under H_{0} is on the boundary of the parameter space. As the number of individuals N tends to ∞, T has the mixture distribution:0.5 +0.5, where has a point mass at 0 and has a chisquare distribution with one degree of freedom. However, for finite values of N, this mixture of chisquares may lead to incorrect level of significance. In the case of a linear mixed model with a fixed intercept and a random group (or cluster) effect, Crainiceanu and Ruppert [14] showed that for fixed N and n_{i}→∞, the likelihood ratio statistic has a mixture distribution in the form (1−α_{N})+α_{N}, where α_{N} is determined by the group size N. However, for survival models with frailties, the appropriate mixture of chisquares may not be straightforward to obtain. 
To approximate the distribution of the score statistic, we propose a permutation method. The proposed permutation score test provides approximately the correct level of significance under the null, even for small samples. To obtain an approximate pvalue of the score test based on the permutation method, we adopt the following algorithm: 
1. For given survival data, obtain estimates of the nuisance parameters γ under the null hypothesis . Using equations (4)(7), compute the observed value of the score statistic T, denoted by T_{obs}. 
2. Hold the number of events n_{i} fixed for the i^{th} individual, and then permute the individual indices of the given data at random. Compute the score statistic T based on the permutation sample. Produce a series of test statistics (T^{1},...,T^{R}) for R permutation samples by repeating this step a large number of times R. 
3. The approximate pvalue of the stochastic permutation score test is determined by the proportion of permutation samples with T^{r} ≥ T_{obs}. 
Note that Fitzmaurice, Lipsitz and Ibrahim [15] used a similar permutation method to approximate the pvalue of a likelihood ratio test for testing the variance components in multilevel generalized linear mixed models. Here, we adopt the permutation method to approximate the pvalue of the proposed score test for testing a frailty variance component in recurrent event data. 
We have described the score test for a frailty variance component in the framework of recurrent event data. This approach can also be used for testing the significance of frailties in clustered survival data. We may encounter such clustered data when different individuals share some common characteristics. For example, in a multicenter clinical experiment, the survival times of patients from the same center may be more similar as compared to those from different centers. This could be due to different health care services provided to the patients in the different centers. Here, we can treat the centers as clusters and describe the homogeneity of survival times of the patients within a center using a shared frailty model. 
Illustrative Example: A Proportional Hazards Model with a Shared Frailty 
Consider a simple twolevel exponential hazards model with two covariates x_{ij1} and x_{ij2}, and with a single frailty u_{i}: 
(8) 
where h_{0}(t)=λ is the baseline hazard function, and the frailties u_{i} are assumed to be independently and normally distributed with mean zero and unknown variance component . Model (8) may be rewritten in the form 
where β_{0}= logλ, x_{ij} =(1,x_{ij1,}x_{ij2})′ represents the vector of covariates, and β=(β_{0},β_{1},β_{2})′ represents the vector of unknown regression coefficients. 
For model (8), the score function (4) takes the form 
(9) 
Also, for the Fisher information matrix, we have under : 
and 
The score statistic T takes the form (7) with the score function and the variance function , where and is the ML estimator of β under . We performed a simulation study to investigate the empirical properties of the proposed permutation score test for the variance component in (8). Details are provided in the next section. 
Simulation Study 
To study the finitesample properties of the proposed permutation test, we ran a series of simulations. The “true” event times y_{ij} with frailties were generated from the hazards model (8) with the values of the regression coefficients being fixed at β_{0}=log(λ)=log(0.08), β_{1}=0.5 and β_{2}=−0.25. For each combination of N=50, 100, 200 individuals and n=2, 10, 50 repeated event times within individuals, we performed a simulation study based on 1000 replicates of data sets. The censoring times c_{ij} were assumed to be independently and identically distributed exponential random variables with mean 1/λ^{*}=1/0.08, and the censoring mechanism was assumed to be noninformative. The observed data were {(t_{ij},δ_{ij});i=1,...,N;j=1,...,n}, where t_{ij}=min(y_{ij},c_{ij}) and δ_{ij}=I(y_{ij} < c_{ij}). 
The value of the frailty variance component in (8) was set to zero when investigating the empirical level of significance of the proposed test. We compared the pvalue of the permutation test to the asymptotic pvalue obtained by the (0.5, 0.5) mixture of chisquare distributions. The permutation pvalue was based on B=1000 permutation samples, and the empirical level of the test was obtained as the proportion of samples for which the estimated pvalues were less than the nominal level α=0.05. 
Table 1 presents the estimated levels of significance of the two score tests. We note from the results that the level of the permutation test is generally much closer to the nominal 0.05 level of significance, as compared to the level based on the mixture of chisquare distributions. For the latter case, the levels get closer to the nominal 0.05 level only when the number of events n and the number of individuals N increase. The permutation test roughly provides the correct level of significance in each of the simulation configurations considered. Also, the approximate 95% normal confidence intervals of the true levels, based on the estimated standard errors, suggest that the empirical levels of the permutation tests are not significantly different from the nominal 0.05 level, whereas most of the levels from the asymptotic score tests are significantly different. 
We also investigated the powers of the two score tests using the same simulation configurations as above. The empirical powers were calculated under the alternative hypothesis . We used 1000 simulation replications for each simulation configuration, and also used 1000 permutation samples to find the permutation pvalue of the score test. Table 2 presents the empirical powers of the two tests. It is clear that the proposed permutation test is generally more powerful than the test based on the mixture of chisquare distributions for the simulation configurations considered. For N=200 and n=2, however, the score test based on the asymptotic mixture appears to be more powerful than the permutation test. But it should be noted that the empirical powers of the asymptotic score test may not be reliable here as the test provides incorrect level of significance for the given sample size. 
The results from the simulation study demonstrate that the proposed permutation test generally has the correct level of significance and is also more powerful than the asymptotic test based on the mixture of chisquare distributions. 
Applications 
Bladder cancer data 
Wei, Lin and Weissfeld [21] presented and analyzed some tumor recurrence data obtained from a bladder cancer study conducted by the Veterans Administration Cooperative Urological Research Group (see Byar [22] for details). In this study, all patients entering the trial had superficial bladder tumors. After removing these tumors transurethrally, patients were randomly assigned to one of three treatments: placebo, thiotepa and pyridoxine. During the study, many patients had multiple recurrences of tumors, and new tumors were removed at each visit. Due to the sparseness of the data, only the first four recurrence times were reported. One of the analyses considered by Byar [22] and Wei, Lin and Weissfeld [21] was based on the tumor recurrence times from patients in the two groups placebo and thiotepa. 
Here, we revisit the tumor recurrence data, where we consider modelling the recurrence time of a patient measured from the removal of old tumors at a given visit until the recurrence of new tumors. The recurrence time T_{ij} represents the number of months from a given visit until the next j^{th} tumor recurrence for the i^{th} patient (i=1,…,85; j=1, …,4). As before, δ_{ij} represents the censoring information (δ_{ij}=0 if T_{ij} is rightcensored; δ_{ij}=1 otherwise). The covariates considered in the study are: TREAT_{i}=1 if the i^{th} patient is in the thiotepa group and 0 otherwise; NUMBER_{i}=number of initial tumors for the i^{th} patient; and SIZE_{i}=size of the largest initial tumor for the i^{th} patient. We consider a proportional hazards model with a shared frailty in the form 
(10) 
for i=1,…,85;j=1,…,4, where t is the time from the beginning of the j^{th} recurrence interval and u_{i} is the random effect (frailty) for the i^{th} patient, assumed to be independently and normally distributed with mean 0 and variance component . 
The null hypothesis to be tested is that there is no difference in recurrence times between subjects for the given explanatory variables (), against the alternative that the recurrence times for the same individual share the same frailty (). To perform the proposed score test, we first fit the model under the null using the maximum likelihood method. The score statistic produced a value of 12.001. Here, the permutation test based on 10000 permutation samples produced a pvalue of 0.0005. The usual (0.5, 0.5) mixture of chisquares also produced a small pvalue of 0.00027, as expected. Clearly, both methods indicate strong evidence against the null, that is, there is significant difference in recurrence times between subjects for the given explanatory variables. As is significant, we fit the above hazards model with the frailty. The maximum likelihood estimates of the model parameters, and their corresponding approximate standard errors are shown in Table 3. Here, the treatment thiotepa appears to decrease the risk of recurrence of tumors, whereas this risk is increased by a large number of initial tumors. 
Note, that as we compute the score statistics under the null hypothesis of no frailties, the permutation score test does not require much computation time even with a large number of permutation samples. For the above example, we found the permutation pvalue of the score test based on 10000 permutation samples in about 4 minutes and 50 seconds using the R package on a 64bit Operating System with AMD Turion(tm) II P540 DualCore Processor 2.40 GHz and with 4.00 GB RAM. 
Kidney data 
McGilchrist and Aisbett [2] published some recurrence data, and studied the recurrence of infection in kidney patients who were using a portable dialysis machine. The infection in patients occurs at the point of insertion of the catheter. When the infection occurs, the catheter must be removed, the infection cleared up, and then the catheter reinserted. Recurrence times are measured from insertion until the next infection. When the catheter is removed for other reasons, there is right censoring of the data. Also, the final recurrence time may be censored as each patient is followed for a predetermined number of recurrence times. The covariates considered in the study are: AGE, and binary indicators for FEMALE as well as disease types GN, AN and PKD. 
We revisit the kidney data, and consider a proportional hazards model with a shared frailty in the form 
(11) 
for i=1,…,38;j=1,2, where t is the time from the beginning of the j^{th} recurrence interval and u_{i} is the i^{th} patient effect (frailty), assumed to be independently and normally distributed with mean 0 and variance component . 
Here, we are interested in testing if there is any significant difference in recurrence times between subjects for the given explanatory variables or, equivalently, if the recurrence times for the same individual share the same frailty. Initially, we conducted the score test based on a subset of the data with a fewer number of patients by considering the last 25 individuals (patients 14–38 in Table 1 of McGilchrist and Aisbett [2]) in order to investigate the performance of the two score tests under a small sample. The value of the score statistic is obtained as 0.0594. The asymptotic pvalue based on the mixture of chisquares provides a value of 0.4038, whereas the permutation test based on 10000 permutation samples produced a smaller pvalue value of 0.0469. Clearly, unlike the asymptotic score test, the permutation method indicates that there is significant difference at 5% level in recurrence times between subjects for the given explanatory variables. This suggests that for small samples, the two test procedures can provide different conclusions, and the proposed permutation test may be preferable to the asymptotic score test in such a case. 
In the next step, we performed the score test based on the recurrence times for all 38 patients. The score statistic produced a small value of 0.0339. The (0.5, 0.5) mixture of chisquares produced a pvalue of 0.4270, whereas the permutation test based on 10000 permutation samples produced a pvalue value of 0.4115. Both methods indicates no evidence against the null that the subjectspecific frailty is 0. So, we consider fitting the above hazards model with no frailty. The maximum likelihood estimates of the model parameters, and their corresponding asymptotic standard errors are presented in Table 4. Here the FEMALE group and the disease type PKD appear to have higher risk of recurrence of infection in kidney patients. 
Discussion 
For testing the significance of a variance component in a frailty model, the proposed permutation score test provides a simple alternative to computing the asymptotic pvalues of score tests based on the (0.5, 0.5) mixture of chisquare distributions. Our limited simulation study suggests that the permutation test has approximately the correct level of significance, and is also more powerful than tests based on the mixture of chisquare distributions under finite samples. The permutation score test is easy to implement and is also attractive in that it only requires estimation of the the fixed effects parameters under the null hypothesis of no shared frailty in the proportional hazards model. 
We have discussed the permutation test for testing homogeneity of individuals in twolevel survival data. This test can be easily extended to multilevel survival data with more than two levels. For testing homogeneity of groups at a given level, we can permute the indices corresponding to that level. For example, consider the bladder cancer study discussed in Section “Applications”. If the patients are nested within medical practices, then we can test for homogeneity at the practice level by permuting the practices that the patients are assigned to, while the recurrence times would remain with the same patients for a given practice. 
Note that Sinha [16] studied a parametric bootstrap score test, based on random samples generated from the fitted model under the null, which approximates the pvalue of a onesided score test for variance components in generalized linear mixed models. This parametric bootstrap procedure, however, may not be directly applicable to survival data, since the development of the test procedure is complicated by the fact that the survival times are often censored. 
Acknowledgements 
This research was partially supported by a grant from the Natural Sciences and Engineering Research Council of Canada. 
References 

Table 1  Table 2  Table 3  Table 4 