A New One-Sample Log-Rank Test

The one-sample log-rank test has been frequently used by epidemiologists to compare the survival of a sample to that of a demographically matched standard population. Recently, several researchers have shown that the one-sample log-rank test is conservative. In this article, a modified one-sample log-rank test is proposed and a sample size formula is derived based on its exact variance. Simulation results showed that the proposed test preserves the type I error well and is more efficient than the original one-sample log-rank test.

Journal of Biometrics & Biostatistics


Introduction
Two-sample log-rank tests are frequently used to design and make inferences for randomized phase III survival trials with two treatment arms. The primary aim of such a study is to compare the survival distributions between two treatment groups. In some cases, it is also of interest to compare the survival distribution of a single sample to that of a standard population. Such a comparison arises naturally in epidemiologic studies and clinical trials. For example, in an epidemiologic study in which the survival data of patients with a life-threatening disease have been prospectively collected, it may be of interest to know whether the study sample experiences better survival than the demographically matched standard population. It is not appropriate to use the two-sample log-rank test to make this comparison because the variance could be overestimated; thus, the p-value from the two-sample log-rank test is invalid. However, an analogous test statistic called the one-sample log-rank test [1] can be used for such a study design and comparison.
Relatively little literature is available on the design of, and inference for, studies that compare the survival of a sample to that of a standard population. The one-sample log-rank test was first introduced by Breslow [2]. Its asymptotic properties have been studied by Hyde [3], Anderson et al. [4], and Gill and Ware [5], and applications can be found in Finkelstein et al. [1], Berry [6], Woolson [7], and Anderson et al. [4]. Study designs using the one-sample log-rank test were considered by Finkelstein et al. [1]. Kwak and Jung [8], Jung [9], and Sun et al. [10] applied it to single-arm phase II clinical trial designs.
If a study is planned to determine whether the survival of the new study participants is better than that of a standard population, then the study must be carefully designed to ensure sufficient power to detect a specific difference between the survival distributions. For the study design, a sample size formula for the one-sample log-rank test is given by Finkelstein et al. [1]. Kwak and Jung [8] proposed another sample size formula for single-arm phase II clinical trial design using the one-sample log-rank test. Wu [11] recently derived a new sample size formula based on its exact variance. However, simulations by Kwak and Jung [8], Sun et al. [10], and Wu [11] have shown that the one-sample log-rank test is conservative, even when the sample size is relatively large. Thus, it is necessary to develop a new test statistic that preserves the type I error rate and keeps the power as high as possible. Sun et al. [10] derived two corrections of the one-sample log-rank test statistic based on its Edgeworth expansion. However, a major drawback of their corrected tests is that they are more complicated test statistics involving higher-order moment estimations, which makes it difficult to derive their distributions under the alternative. Thus, they cannot be used for the study design.
Here we propose a new and simple one-sample log-rank test to correct the conservativeness of the original one-sample log-rank test. A sample size formula is also derived for the new test for the purpose of the study design. The rest of the article is organized as follows. In Section 2, a new one-sample log-rank test is proposed. A sample size formula is derived in Section 3. In Section 4, simulation studies are conducted to compare the empirical type I error and power among four test statistics. An example is given in Section 5. Concluding remarks are given in Section 6.

One-Sample Log-Rank Tests
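The one-sample log-rank test was first introduced by Breslow [2], and it has been used frequently by epidemiologists [3]. To introduce the one-sample log-rank test, let $\Lambda_0(x)$ and $S_0(x)$ be the known cumulative hazard and survival functions for the standard population, and let $\Lambda(x)$ and $S(x)$ be the unknown cumulative hazard and survival functions for the new study. Then the study may consider the following hypothesis of interest:

$$H_0: S(x) = S_0(x) \quad \text{versus} \quad H_1: S(x) > S_0(x),$$

or, equivalently, in terms of the cumulative hazard function,

$$H_0: \Lambda(x) = \Lambda_0(x) \quad \text{versus} \quad H_1: \Lambda(x) < \Lambda_0(x).$$

Suppose that during the accrual phase of the trial $n$ subjects are enrolled in the study. Let $T_i$ and $C_i$ denote, respectively, the failure time and censoring time of the $i$th subject. We assume that the failure time $T_i$ and censoring time $C_i$ are independent and that $\{T_i, C_i, i = 1, \ldots, n\}$ are independent and identically distributed. Then the observed failure time and failure indicator are $X_i = T_i \wedge C_i$ and $\Delta_i = I(T_i \le C_i)$, respectively. Define $O = \sum_{i=1}^{n} \Delta_i$ as the observed number of events and $E = \sum_{i=1}^{n} \Lambda_0(X_i)$ as the expected number of events (asymptotically); then the one-sample log-rank test is defined by

$$L_1 = \frac{O - E}{\sqrt{E}}.$$

To study the asymptotic distribution of the one-sample log-rank test statistic, we formulate it using counting-process notation [12]. Let $N_i(x) = \Delta_i I(X_i \le x)$ and $Y_i(x) = I(X_i \ge x)$ be the failure and at-risk processes, respectively; then

$$O = \sum_{i=1}^{n} \int_0^\infty dN_i(x), \qquad E = \sum_{i=1}^{n} \int_0^\infty Y_i(x)\, d\Lambda_0(x).$$

Thus, the counting-process formulation of the one-sample log-rank test is given by

$$L_1 = \frac{W}{\sqrt{E}}, \qquad W = O - E = \sum_{i=1}^{n} \int_0^\infty \{dN_i(x) - Y_i(x)\, d\Lambda_0(x)\}.$$

Under the null hypothesis $H_0$, $\mathbb{E}(W) = 0$ and $\operatorname{Var}(W) = \mathbb{E}(E)$, where $\mathbb{E}(\cdot)$ denotes expectation. Therefore, by the counting-process central limit theorem [12], under the null hypothesis, $L_1$ is asymptotically standard normal. Hence, we reject the null hypothesis $H_0$ with one-sided type I error $\alpha$ if $L_1 < -z_{1-\alpha}$, where $z_{1-\alpha}$ is the $100(1-\alpha)$ percentile of the standard normal distribution.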
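As a minimal sketch of the statistic described above, assuming it takes the standard form $L_1 = (O - E)/\sqrt{E}$ with $O$ the observed number of events and $E = \sum_i \Lambda_0(X_i)$ the expected number, the test can be computed as follows. The function name, the toy data, and the exponential standard population (hazard 0.1 per unit time) are illustrative assumptions, not part of the original study.

```python
import math
from statistics import NormalDist


def one_sample_logrank(times, events, cum_hazard0):
    """Return (O, E, L1, one-sided p-value).

    times       -- observed times X_i = min(T_i, C_i)
    events      -- failure indicators (1 = event, 0 = censored)
    cum_hazard0 -- known cumulative hazard of the standard population
    """
    observed = sum(events)                           # O
    expected = sum(cum_hazard0(x) for x in times)    # E
    l1 = (observed - expected) / math.sqrt(expected)
    # Better survival than the standard population gives O < E, so the
    # one-sided p-value is the lower-tail normal probability of L1.
    p_value = NormalDist().cdf(l1)
    return observed, expected, l1, p_value


# Hypothetical data; standard population assumed exponential, hazard 0.1.
times = [2.0, 5.0, 3.5, 8.0, 10.0, 10.0]
events = [1, 0, 1, 0, 0, 0]
O, E, L1, p = one_sample_logrank(times, events, lambda x: 0.1 * x)
print(f"O={O}, E={E:.2f}, L1={L1:.3f}, one-sided p={p:.3f}")
```

The null hypothesis would be rejected at one-sided level $\alpha$ when the returned p-value falls below $\alpha$, which is equivalent to $L_1 < -z_{1-\alpha}$.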
Simulation results showed, however, that the one-sample log-rank test $L_1$ is conservative, even when the sample size is relatively large [8-11]. For example, the empirical type I error of $L_1$ could be as low as 0.036 for a nominal one-sided type I error rate of 0.05 (Table 1). To preserve the type I error, Sun et al. [10] derived two corrections based on Edgeworth expansion, denoted here by $L_2$ and $L_3$; their explicit forms, given in [10], are written in terms of a statistic $K_n$. Note that Sun et al. [10] defined $K_n = -L_1$, whereas our simulation results showed that it should be $K_n = L_1$. A major drawback of the two corrected tests is that they are more complicated test statistics involving higher-order moment estimations, which makes it difficult to derive their distributions under the alternative. Thus, they cannot be used for the study design.

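Under the null hypothesis, both the observed number of events $O$ and the expected number of events $E$ have expectation equal to the variance of $W = O - E$, as shown in the Appendix. Thus, to correct the conservativeness of the original one-sample log-rank test $L_1$, we propose a new one-sample log-rank test, which is defined as

$$L_4 = \frac{O - E}{\sqrt{(O + E)/2}}.$$

In counting-process formulation, it is given by

$$L_4 = \frac{W}{\sqrt{\tfrac{1}{2} \sum_{i=1}^{n} \int_0^\infty \{dN_i(x) + Y_i(x)\, d\Lambda_0(x)\}}}.$$

As shown in the Appendix, under the null hypothesis, $W$ has mean zero and its variance equals the expectation of $(O + E)/2$. Therefore, again by the counting-process central limit theorem, under the null hypothesis, $L_4$ is asymptotically standard normal. Hence, we reject the null hypothesis $H_0$ with one-sided type I error $\alpha$ if $L_4 < -z_{1-\alpha}$.

Simulation studies are conducted in Section 4 to compare the empirical type I error and power of the original one-sample log-rank test $L_1$ to those of the two corrections $L_2$ and $L_3$ and the new test $L_4$.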
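The following sketch illustrates the proposed statistic under the assumption that it has the form $L_4 = (O - E)/\sqrt{(O + E)/2}$, that is, the original statistic with the variance estimate $E$ replaced by the average of $O$ and $E$. The function name and the toy data are hypothetical and serve only to show the computation.

```python
import math
from statistics import NormalDist


def new_one_sample_logrank(times, events, cum_hazard0):
    """Return (O, E, L4, one-sided p-value) for the modified test."""
    observed = sum(events)                           # O
    expected = sum(cum_hazard0(x) for x in times)    # E
    # Assumed variance estimate: the average of O and E.
    l4 = (observed - expected) / math.sqrt((observed + expected) / 2)
    p_value = NormalDist().cdf(l4)                   # lower-tail p-value
    return observed, expected, l4, p_value


# Hypothetical data; standard population assumed exponential, hazard 0.1.
times = [2.0, 5.0, 3.5, 8.0, 10.0, 10.0]
events = [1, 0, 1, 0, 0, 0]
O, E, L4, p = new_one_sample_logrank(times, events, lambda x: 0.1 * x)
print(f"O={O}, E={E:.2f}, L4={L4:.3f}, one-sided p={p:.3f}")
```

Whenever $O < E$, the denominator $\sqrt{(O + E)/2}$ is smaller than $\sqrt{E}$, so this statistic is more negative than $(O - E)/\sqrt{E}$ on the same data. This is the direction needed to offset a conservative lower-tail test.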

Sample Size Calculation
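To design the study, the sample size must be calculated to detect a specified survival difference under the alternative $H_1: \Lambda(x) < \Lambda_0(x)$, given the type I error $\alpha$ and power $1 - \beta$. For the sample size calculation, the exact variance of $W$ has been derived by Wu [11]. Let the exact mean and variance of $W$ at the alternative be $n\omega$ and $n\sigma^2$, respectively; then $(W - n\omega)/(\sqrt{n}\,\sigma)$ is approximately standard normal under $H_1$. Let $p_0$ and $p_1$ denote the limits of $E/n$ and $O/n$ under the alternative, so that $\omega = p_1 - p_0$. Under the alternative hypothesis, $E/n$ converges to $\sigma_0^2 = p_0$, and the power of the one-sample log-rank test $L_1 = W/\sqrt{E}$ should satisfy the following equation:

$$1 - \beta = P(L_1 < -z_{1-\alpha} \mid H_1) \approx \Phi\left(\frac{-\sigma_0 z_{1-\alpha} - \sqrt{n}\,\omega}{\sigma}\right),$$

where $\Phi$ is the standard normal distribution function. Therefore, the required sample size for the test statistic $L_1$ is given by

$$n = \frac{(\sigma_0 z_{1-\alpha} + \sigma z_{1-\beta})^2}{\omega^2}.$$

Similarly, under the alternative, $(O + E)/(2n)$ converges to $\bar{\sigma}^2 = (p_0 + p_1)/2$, and the power of the new test $L_4 = W/\sqrt{(O + E)/2}$ should satisfy the following equation:

$$1 - \beta = P(L_4 < -z_{1-\alpha} \mid H_1) \approx \Phi\left(\frac{-\bar{\sigma} z_{1-\alpha} - \sqrt{n}\,\omega}{\sigma}\right).$$

Therefore, the required sample size for the test statistic $L_4$ is given by

$$n = \frac{(\bar{\sigma} z_{1-\alpha} + \sigma z_{1-\beta})^2}{\omega^2},$$

where $\sigma^2$ and $\omega$ are the same as given above.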
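The sample size calculation can be sketched numerically as follows. This is an illustration under stated assumptions rather than the published formula. It takes $p_0 = \mathrm{E}\{\Lambda_0(X)\}$, $p_1 = \mathrm{E}(\Delta)$, $p_{00} = \mathrm{E}\{\Lambda_0(X)^2\}/2$, and $p_{01} = \mathrm{E}\{\Delta \Lambda_0(X)\}$ under the alternative, and obtains the per-subject variance by expanding $\mathrm{Var}\{\Delta - \Lambda_0(X)\}$. The closed-form moments cover a hypothetical scenario only: a standard population with unit exponential hazard, a study hazard `delta`, and a fixed follow-up time `c` for every subject. All function names are invented for the example.

```python
import math
from statistics import NormalDist


def exact_variance(p0, p1, p00, p01):
    """Per-subject Var{Delta - Lambda0(X)}, expanded term by term."""
    return p1 - p1**2 + 2 * p00 - p0**2 - 2 * p01 + 2 * p0 * p1


def sample_sizes(p0, p1, p00, p01, alpha=0.05, power=0.90):
    """Return (n for the original test, n for the modified test)."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(power)
    omega = p1 - p0                        # per-subject mean of W
    sigma = math.sqrt(exact_variance(p0, p1, p00, p01))
    sigma0 = math.sqrt(p0)                 # limit of sqrt(E / n)
    sigma_bar = math.sqrt((p0 + p1) / 2)   # limit of sqrt((O + E) / (2n))
    n1 = math.ceil((sigma0 * z_a + sigma * z_b) ** 2 / omega**2)
    n4 = math.ceil((sigma_bar * z_a + sigma * z_b) ** 2 / omega**2)
    return n1, n4


def exponential_moments(delta, c):
    """Closed-form p0, p1, p00, p01 for the hypothetical scenario:
    Lambda0(x) = x, study hazard delta, every subject followed for c."""
    p1 = 1 - math.exp(-delta * c)
    p0 = p1 / delta
    p00 = (1 - math.exp(-delta * c) * (1 + delta * c)) / delta**2
    p01 = delta * p00
    return p0, p1, p00, p01


# Hypothetical design: hazard ratio 0.6, follow-up of 2 time units.
n1, n4 = sample_sizes(*exponential_moments(delta=0.6, c=2.0))
print(f"required n: original test = {n1}, modified test = {n4}")
```

Because $p_1 < p_0$ when the study hazard is lower than the standard hazard, $\bar{\sigma} < \sigma_0$ and the modified test needs fewer subjects than the original test for the same design. When `delta` equals 1 (the null), the variance expression reduces to $p_0 = p_1$, which agrees with the null variance of $W$.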

Simulation Studies
To study the performance of the one-sample log-rank tests and their sample size formulas, we conducted simulation studies to compare the empirical power and type I error under different scenarios. In the simulation studies, the survival distribution of the standard population was taken as the Weibull distribution. We assumed that subjects were recruited uniformly over the accrual period $t_a$ and followed for an additional period $t_f$. We further assumed that no subject was lost to follow-up or dropped out during the study. Then the censoring time is uniformly distributed on the interval $[t_f, t_a + t_f]$. Thus, under the Weibull model, the quantities $p_0$, $p_1$, $p_{00}$, and $p_{01}$, and hence the mean and variance terms required by the sample size formulas, can be calculated by numerical integration. Given the nominal significance level of 0.05 and power of 90%, the required sample sizes for each design scenario were calculated for test statistics $L_1$ and $L_4$ (Table 1). The empirical type I error and power for the corresponding design were also simulated based on 100,000 samples generated from the Weibull distribution (Table 1). To compare the four test statistics, we also simulated the empirical type I error and power of the four test statistics $L_1$-$L_4$ given the same sample size n = 30, 50, 100, and 200 (Table 2).
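A small Monte Carlo sketch of the type I error comparison is given below. It assumes the Weibull parameterization $S_0(x) = \exp\{-(x/\text{scale})^{\text{shape}}\}$, the statistic forms $(O - E)/\sqrt{E}$ and $(O - E)/\sqrt{(O + E)/2}$, and illustrative design values; none of these numbers are taken from the tables. It also uses far fewer replicates than the 100,000 samples reported, so the estimates are rough.

```python
import math
import random
from statistics import NormalDist


def simulate_type1(n, shape, scale, t_a, t_f, alpha=0.05, reps=2000, seed=1):
    """Empirical one-sided type I error of the original and modified tests.

    Failure times follow the standard-population Weibull (null hypothesis);
    censoring times are uniform on [t_f, t_a + t_f].
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha)
    reject_orig = reject_new = 0
    for _ in range(reps):
        observed, expected = 0, 0.0
        for _ in range(n):
            t = rng.weibullvariate(scale, shape)   # failure time
            c = rng.uniform(t_f, t_a + t_f)        # censoring time
            x = min(t, c)
            observed += t <= c                     # event indicator
            expected += (x / scale) ** shape       # Lambda0(x)
        w = observed - expected
        reject_orig += w / math.sqrt(expected) < -z
        reject_new += w / math.sqrt((observed + expected) / 2) < -z
    return reject_orig / reps, reject_new / reps


# Hypothetical scenario: n = 50, exponential case (shape 1), scale 2,
# accrual period 3 and follow-up period 1.
r1, r4 = simulate_type1(n=50, shape=1.0, scale=2.0, t_a=3.0, t_f=1.0)
print(f"empirical type I error: original = {r1:.3f}, modified = {r4:.3f}")
```

Within each simulated sample, a rejection by the original statistic implies a rejection by the modified one, because rejection requires $O < E$ and then $\sqrt{(O + E)/2} < \sqrt{E}$. The modified test's empirical rejection rate is therefore never below that of the original test.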
The sample size calculation (Table 1) showed that the original one-sample log-rank test $L_1$ required a larger sample size than the new test $L_4$. The simulated empirical type I errors for the corresponding sample sizes showed that the type I error of $L_1$ was always less than the nominal level. Thus, the original one-sample log-rank test $L_1$ was conservative. The empirical type I errors of the new test $L_4$ were close to the nominal level in most scenarios and were slightly liberal when the sample size was small. The simulation results in Table 2 with the same sample size further confirmed that the test $L_1$ was conservative and that $L_4$ preserved the type I error well and had higher power than $L_1$. This is consistent with the results from the sample size calculations, in which $L_4$ required a smaller sample size than $L_1$. Simulations were also done for the two corrected tests $L_2$ and $L_3$. The results showed that $L_2$ preserved the type I error well and had higher power than $L_1$ and $L_3$, and that $L_3$ was slightly conservative when the sample size was small. Furthermore, the empirical type I error and power of the test $L_4$ were comparable to those of the two corrections $L_2$ and $L_3$.
To compare the null distribution functions of the four test statistics to the standard normal for small sample sizes, we conducted 100,000 simulation runs to obtain the empirical distribution functions of $L_1$-$L_4$ under the null with sample sizes n = 30 to 200 (Table 3). The simulation results showed that the distribution of $L_1$ had a light left tail, while $L_4$ had a slightly heavier left tail than the standard normal distribution function. These results explain the observations from the previous simulations that the test $L_1$ was conservative and $L_4$ was slightly liberal when the sample size was small. The distribution of $L_2$ was almost the same as the standard normal distribution function, and the distribution of $L_3$ had a slightly lighter left tail when the sample size was small. Overall, $L_4$ preserved the type I error well and had power higher than that of $L_1$-$L_3$. The distribution function of $L_4$ was also close to the standard normal and comparable to those of $L_2$ and $L_3$. The major advantage of $L_4$ is its simplicity and the ease with which its asymptotic distribution under the alternative can be derived. Therefore, the proposed new one-sample log-rank test $L_4$ is preferred for the study design and data analysis of a study comparing the survival of a sample to that of the standard population.

Example
or a p-value of 0.042; thus, we can claim that the mortality from other causes among patients with melanoma is significantly lower than that of the Danish general population.

Conclusions
A simple one-sample log-rank test is proposed, and its sample size formula is derived. Simulation results showed that the new test $L_4$ preserves the type I error well and is comparable to the two corrections based on Edgeworth expansion [10]. The proposed new test $L_4$ had power higher than that of the original test $L_1$ and the two corrections $L_2$ and $L_3$. The sample size formula derived from the new test statistic $L_4$ provides adequate power for the study design. To use the one-sample log-rank test to design a study and make inferences, the underlying distribution or hazard function of the standard population has to be correctly specified, because both study design and inference depend on the validity of this assumption. In an epidemiologic study, the standard population is often well defined. Therefore, one can use the method proposed by Finkelstein et al. [1] to calculate the expected number of events and estimate the survival distribution of the standard population. In a phase II clinical trial, the survival function of the historical control can be estimated from meta-analysis or other sources [10]. In summary, the proposed one-sample log-rank test and its sample size formula provide a study design that preserves the type I error and ensures sufficient power to detect the difference in survival distributions between a sample and a standard population.