
Department of Mathematics and Statistics, Memorial University, St. John’s, A1C 5S7, Newfoundland, Canada

- *Corresponding Author:
- Zhaozhi Fan

Department of Mathematics and Statistics

Memorial University

St. John’s, A1C 5S7

Newfoundland, Canada

**Tel:**1-709-864-8076

**Fax:**1-709-864-3010

**E-mail:**[email protected]

**Received date:** August 17, 2012; **Accepted date:** September 20, 2012; **Published date:** September 25, 2012

**Citation:**Granville K, Fan Z (2012) Accelerated Failure Time Models with Auxiliary Covariates. J Biom Biostat 3:152. doi:10.4172/2155-6180.1000152

**Copyright:** © 2012 Granville K, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited


**Abstract**

In this paper we study a semi-parametric inference procedure for accelerated failure time models with auxiliary information about a main exposure variable. We use a kernel smoothing method to introduce the auxiliary covariate into the likelihood function. The regression parameters are then estimated by maximizing the estimated likelihood function. A consistent estimator of the variance of the estimated regression coefficients is proposed. Simulation studies show a remarkable efficiency gain over using the validation sample alone. The method is illustrated with the PBC data from the Mayo Clinic trial in primary biliary cirrhosis.

**Keywords:** Kernel smoothing; Estimated likelihood function; Accelerated failure time models; Measurement errors; Auxiliary covariates

**Introduction**

Researchers attempting statistical analysis on a set of data commonly run into the problem of missing or mismeasured observations. This is often the case in medical studies where the tests needed for an accurate measurement may be particularly expensive or invasive for the patient. The medical researchers can opt to examine another relevant variable which may be cheaper or easier to measure, even if it does not provide as much information. This variable can be measured in place of the original variable or, when possible, alongside it. Researchers then have the choice either to work with a smaller sample, using just the subjects with measurements for the variable of interest, or to include the imperfect data in the analysis with the goal of gaining efficiency. For example, in the Primary Biliary Cirrhosis (PBC) study conducted at Mayo Clinic between 1974 and 1984, Aspartate Aminotransferase (AST) was an important predictor of the survival time of PBC patients, but it was collected only for patients registered in the double blind clinical trial, for reasons similar to those mentioned above. Another closely related variable, bilirubin, was recorded for all PBC patients [1]. In order to enhance the efficiency of the statistical analysis of the relationship between AST and the patients’ survival, it may be worthwhile to include the available information from all the patients. Motivated by this example, in this article we propose an inferential method for this kind of survival data, where we replace the missing or mismeasured data using kernel smoothing based on an auxiliary covariate, which is measured for each subject.

When it is possible to have some of the desired data measured accurately, these cases form a validation set. The validation set contains measurements for both the variable of interest and the auxiliary covariate. The rest of the cases are placed into the non-validation set where only the auxiliary covariate is available. In the analysis of this data, if the auxiliary covariate is just the original variable with measurement error, one could be inclined to use it in place of the missing data. Unfortunately this naive method will lead to estimation bias for all regression coefficients in the model which, depending on the magnitude of the error, can be quite large [2]. Hence it is very important for researchers to include as many subjects as possible in their analysis, to aim at a higher efficiency, as well as to correct the estimation bias caused by measurement errors.

Much research has been done in this area in the past. Research on how to incorporate missing or mismeasured data in models includes the works of Rubin [3], Fuller [4], Carroll et al. [5], Wang et al. [6], Meng and Schenker [7], Cheng and Wang [8] and Yu and Nan [9], to list a few. A common statistical model chosen for these situations is the Cox model [10]. For details see Cox and Oakes [11], Kalbfleisch and Prentice [12], Hu et al. [13] and Hu and Lin [14], among others. In this article, however, we focus on the parametric accelerated failure time (AFT) models. When an auxiliary covariate is included in the analysis through an estimated likelihood, the AFT model is more efficient if an appropriate distribution of the failure time is known. Research based on an estimated partial likelihood function has been conducted by many authors, such as Pepe and Fleming [15], Pepe [16], Zhou and Pepe [17], Zhou and Wang [18], Zhou et al. [19], Jiang and Zhou [20], Fan and Wang [21] and Liu et al. [22]. Recently He et al. [23] proposed to use the SIMEX method to handle accelerated failure time models when covariates are subject to measurement error. But investigation of the performance of accelerated failure time models with auxiliary covariates is still limited and deserves to be carried out, because (1) the AFT models have a direct physical interpretation, (2) the AFT models can better predict the survival function of a patient and (3) the AFT models are robust to model misspecification, in the sense that ignoring a covariate will not lead to heavily biased estimates of the other regression coefficients [11].

The rest of this article is organized as follows. Section 2 presents the general accelerated failure time model and some special cases which we use in our calculations. Section 3 covers the estimation method. Section 4 discusses the asymptotic properties of our estimator. Section 5 shows the simulation results for finite samples as well as the results from analyzing data from the Mayo Clinic trial in PBC. In Section 6 we put forth our concluding remarks. Finally, we outline the regularity conditions and proof for the theoretical results from Section 4 in the Appendix.

**The accelerated failure time model**

Let {X_{i}, Z_{i}} denote the covariate vector, where X_{i} is the component which is only observed in the validation set and Z_{i} is the component that is always observed. We assume that X_{i} is scalar and that Z_{i} is a vector. For every X_{i}, let W_{i} be the corresponding auxiliary covariate of the form W_{i} = X_{i} + U_{i}, where U_{i} is the measurement error incurred when attempting to observe X_{i}. We assume that U_{i} follows a normal distribution, U_{i} ~ N(0, σ_{u}^{2}). Let T_{i}, C_{i} and δ_{i} represent the i^{th} failure time, censoring time and censoring indicator, δ_{i} = I_{[T_{i} ≤ C_{i}]}. We assume that out of the n subjects, the sample size for the validation sample, where the X_{i}’s are correctly observed, is n_{V}, and the sample size for the non-validation sample, where we do not observe the X_{i}’s, is n − n_{V}. The observed data is therefore {S_{i}, δ_{i}, Z_{i}, X_{i}, W_{i}} for the validation sample and {S_{i}, δ_{i}, Z_{i}, W_{i}} for the non-validation sample, where S_{i} = min(T_{i}, C_{i}).

The accelerated failure time model can be expressed as

Y_{i} = log(T_{i})=β_{1}X_{i} + β′_{2}Z_{i} + ε_{i}, (1)

where β′ = (β_{1}, β′_{2}) is a vector of unknown parameters that we must estimate and ε_{i} is the random error, which has pdf f_{ε}(ε).

Note that the random error term ε_{i} in model (1) is in its general form. When it is standardized, the scale parameter 1/σ or b should be included, as in Lawless [24]. Equation (1) also implicitly assumes that, given (X_{i}, Z_{i}), W_{i} provides no additional information about the failure time.

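The pdf f_{T}(t; β, X_{i}, Z_{i}) of T_{i} depends on the form of f_{ε}(ε). Once we have f_{T}(t; β, X_{i}, Z_{i}), we are able to calculate the survival and hazard functions for failure time T_{i} as shown below.

$$S_T(t; \beta, X_i, Z_i) = \int_t^{\infty} f_T(u; \beta, X_i, Z_i)\,du \qquad (2)$$

and

$$\lambda(t; \beta, X_i, Z_i) = \frac{f_T(t; \beta, X_i, Z_i)}{S_T(t; \beta, X_i, Z_i)}. \qquad (3)$$

The maximum likelihood estimator of the parameters is the maximizer

$$\hat{\beta} = \arg\max_{\beta} L(\beta),$$

where

$$L(\beta) = \prod_{i=1}^{n} f_T(S_i; \beta, X_i, Z_i)^{\delta_i}\, S_T(S_i; \beta, X_i, Z_i)^{1-\delta_i},$$

which using (3) can be rewritten as

$$L(\beta) = \prod_{i=1}^{n} \lambda(S_i; \beta, X_i, Z_i)^{\delta_i}\, S_T(S_i; \beta, X_i, Z_i). \qquad (4)$$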

**Some special cases:** There are some special distributions of the survival time which are of specific interests to practitioners in medical research.

**The generalized gamma distribution**

We begin by demonstrating how to obtain the likelihood function and estimating equations for the generalized gamma distribution model. This is a very useful distribution. It can be reduced into the Weibull, exponential, or log normal models. We may write the general model as

Y_{i} = log (T_{i})= μ + β_{1}X_{i} + β′_{2}Z_{i} + σV_{i}, (5)

where σV_{i} takes the place of ε_{i} from equation (1) and follows the generalized gamma distribution. The likelihood function is given as

(6)

where the function I[a,x] is the incomplete gamma function, defined as

.

So

.

**A reduced case, the exponential regression model**

When μ= 0, θ= 1, and σ= 1, the likelihood function (6) is reduced to

(7)
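As a minimal sketch of the log of this reduced likelihood, assume that ε_{i} follows the standard extreme value (minimum) distribution, so that T_{i} given (X_{i}, Z_{i}) is exponential with mean exp(η_{i}), where η_{i} = β_{1}X_{i} + β_{2}Z_{i}. Each subject then contributes −δ_{i}η_{i} − S_{i}exp(−η_{i}) to the log-likelihood. The function name and the tuple layout are illustrative choices, and Z_{i} is taken to be scalar for brevity.

```python
import math


def exp_aft_loglik(beta1, beta2, rows):
    """Log-likelihood of an exponential AFT model (illustrative sketch).

    rows: iterable of (s, delta, x, z) with s = min(T, C) and
    delta = 1 for an observed failure, 0 for a censored time.
    The layout is an assumption made for this example.
    """
    total = 0.0
    for s, delta, x, z in rows:
        eta = beta1 * x + beta2 * z          # linear predictor on the log-time scale
        log_hazard = -eta                    # constant hazard exp(-eta)
        log_survival = -s * math.exp(-eta)   # log S(s) = -s * exp(-eta)
        total += delta * log_hazard + log_survival
    return total
```

A censored subject contributes only the log-survival term, which is how the factor with exponent δ_{i} in the likelihood is handled.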

**A proportional odds model**

When modeling AFTs, proportional hazards and proportional odds models are frequently used. The above reduced case is a proportional hazards model. Now let us look at the alternative. Letting μ= 0 again, we then let V_{i} in equation (5) follow the standard logistic distribution.

The likelihood function is

(8)
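The log-logistic case can be sketched in the same way. Assuming V_{i} is standard logistic, the standardized residual is w_{i} = (log S_{i} − β_{1}X_{i} − β_{2}Z_{i})/σ, the survival function is 1/(1 + e^{w_{i}}) and the density of T_{i} is e^{w_{i}}/{σ S_{i}(1 + e^{w_{i}})^{2}}. The function below is an illustration under these assumptions, with a scalar Z_{i} and a hypothetical data layout.

```python
import math


def loglogistic_aft_loglik(beta1, beta2, rows, sigma=1.0):
    """Log-likelihood of a log-logistic AFT model (illustrative sketch).

    rows: iterable of (s, delta, x, z); s must be positive.
    """
    total = 0.0
    for s, delta, x, z in rows:
        w = (math.log(s) - beta1 * x - beta2 * z) / sigma   # standardized residual
        log_surv = -math.log1p(math.exp(w))                  # log of 1 / (1 + e^w)
        # log density of T: w - log(sigma) - log(s) - 2 * log(1 + e^w)
        log_dens = w - math.log(sigma) - math.log(s) + 2.0 * log_surv
        total += delta * log_dens + (1 - delta) * log_surv
    return total
```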

**Remark 2.1**

A suitable model can be chosen by following the routine procedure based on the validation sample. The auxiliary information can be utilized based on the selected parametric model following the estimation method introduced in the following section.

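**Estimation method**

The regression parameters can be estimated through the use of the maximum likelihood method, say by $\hat{\beta}$, which solves the estimating equations

$$U(\beta) = 0,$$

where, with λ and S_{T} denoting the hazard and survival functions of T_{i},

$$U(\beta) = \frac{\partial \log L(\beta)}{\partial \beta} = \sum_{i=1}^{n} \left\{ \delta_i \frac{\partial}{\partial \beta} \log \lambda(S_i; \beta, X_i, Z_i) + \frac{\partial}{\partial \beta} \log S_T(S_i; \beta, X_i, Z_i) \right\}. \qquad (9)$$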

Both the hazard and survival functions depend on X_{i}, which is available only for the validation sample. The non-validation sample does not contain the X_{i} measurements. However, there is auxiliary information available. In order to enhance the efficiency of the data analysis, one should take the auxiliary variables into consideration. In this paper we propose to predict the unobserved X_{i}'s from their corresponding auxiliary covariates, the W_{i}'s, by using kernel smoothing and then using these to estimate the hazard and survival functions. For details about kernel smoothing, one can see Nadaraya [25], Watson [26] and Wand and Jones [27]. The equation to estimate the unobserved values is

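$$\hat{X}_i = \frac{\sum_{j \in V} K\left(\frac{W_i - W_j}{h}\right) X_j}{\sum_{j \in V} K\left(\frac{W_i - W_j}{h}\right)}, \qquad i \in \bar{V}, \qquad (10)$$

where V and $\bar{V}$ denote the validation and non-validation sets, K(·) is the kernel function and h is the chosen bandwidth for smoothing. The bandwidth must be selected so that the bandwidth conditions of Theorem 4.1 are satisfied. Here the optimal bandwidth is chosen as h = 2σ_{u}n^{-1/3}, as suggested by Zhou and Wang [18].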
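The smoothing step can be sketched as a Nadaraya-Watson estimate of X given W, computed from the validation pairs. The helper names below are hypothetical. The Gaussian kernel and the bandwidth rule h = 2σ_{u}n^{-1/3} follow the choices described in this paper, while the guard against a zero denominator is an addition for this sketch.

```python
import math


def bandwidth(sigma_u, n):
    """Bandwidth rule h = 2 * sigma_u * n^(-1/3)."""
    return 2.0 * sigma_u * n ** (-1.0 / 3.0)


def gaussian_kernel(u):
    """Standard normal density, a kernel of order 2."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nw_impute(w_target, w_valid, x_valid, h):
    """Kernel-weighted average of validation X values around w_target."""
    weights = [gaussian_kernel((w_target - wj) / h) for wj in w_valid]
    denom = sum(weights)
    if denom == 0.0:
        # All weights underflowed: no validation W is close enough for this h.
        raise ValueError("no validation point within kernel range; increase h")
    return sum(k * xj for k, xj in zip(weights, x_valid)) / denom
```

For a subject in the non-validation set, `nw_impute(W_i, ...)` returns a weighted mean of the validation X values, with most weight on validation subjects whose W lies within a few bandwidths of W_i.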

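We can therefore write the estimated likelihood and estimated log-likelihood as

$$\hat{L}(\beta) = \prod_{i \in V} \lambda(S_i; \beta, X_i, Z_i)^{\delta_i}\, S_T(S_i; \beta, X_i, Z_i) \prod_{i \in \bar{V}} \hat{\lambda}(S_i; \beta, Z_i)^{\delta_i}\, \hat{S}_T(S_i; \beta, Z_i)$$

and

$$\hat{\ell}(\beta) = \sum_{i \in V} \left\{ \delta_i \log \lambda(S_i; \beta, X_i, Z_i) + \log S_T(S_i; \beta, X_i, Z_i) \right\} + \sum_{i \in \bar{V}} \left\{ \delta_i \log \hat{\lambda}(S_i; \beta, Z_i) + \log \hat{S}_T(S_i; \beta, Z_i) \right\}, \qquad (11)$$

where V and $\bar{V}$ are the validation and non-validation sets, $\hat{\lambda}(S_i; \beta, Z_i) = \lambda(S_i; \beta, \hat{X}_i, Z_i)$ and $\hat{S}_T(S_i; \beta, Z_i) = S_T(S_i; \beta, \hat{X}_i, Z_i)$, with $\hat{X}_i$ the kernel-smoothed value from equation (10).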

For our reduced case in Section 2.1.2, equation (11) becomes

(12)

and for our proportional odds example in Section 2.1.3, we have

(13)

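The estimates of the regression parameters are then the maximizers of the estimated log-likelihood function $\hat{\ell}(\beta)$,

$$\hat{\beta}_{EL} = \arg\max_{\beta} \hat{\ell}(\beta),$$

which can be obtained by solving the estimating equations

$$\hat{U}(\beta) = \frac{\partial \hat{\ell}(\beta)}{\partial \beta} = 0.$$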

When the unobserved X_{i}’s in the non-validation sample are replaced using the proposed kernel smoothing, the unknown parameters can be estimated with existing programs, such as those written in R or SAS. However, the variance estimates reported by such programs are not valid, because they ignore the additional variability introduced by estimating the unobserved X_{i}’s. Hence in this paper we propose to use the Newton-Raphson algorithm to estimate the regression parameters. The variance-covariance matrix of the estimator can then be estimated as a by-product of the calculation.
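A minimal Newton-Raphson sketch for the exponential regression case is given below, assuming a scalar Z_{i} and fully observed covariates. All function names are hypothetical, and the step halving is a safeguard added for this illustration. The returned matrix is the ordinary inverse observed information, so it does not include the correction for kernel-imputed covariates that the proposed variance estimator provides.

```python
import math
import random


def loglik(beta, rows):
    """Exponential-AFT log-likelihood; rows hold (s, delta, x, z)."""
    return sum(-d * (beta[0] * x + beta[1] * z)
               - s * math.exp(-(beta[0] * x + beta[1] * z))
               for s, d, x, z in rows)


def score_and_information(beta, rows):
    """Score vector and observed information (minus the Hessian)."""
    u = [0.0, 0.0]
    info = [[0.0, 0.0], [0.0, 0.0]]
    for s, d, x, z in rows:
        r = s * math.exp(-(beta[0] * x + beta[1] * z))
        v = (x, z)
        for a in range(2):
            u[a] += (r - d) * v[a]
            for b in range(2):
                info[a][b] += r * v[a] * v[b]
    return u, info


def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def newton_raphson(rows, start=(0.0, 0.0), tol=1e-8, max_iter=100):
    beta = list(start)
    for _ in range(max_iter):
        u, info = score_and_information(beta, rows)
        inv = inverse_2x2(info)
        step = [inv[0][0] * u[0] + inv[0][1] * u[1],
                inv[1][0] * u[0] + inv[1][1] * u[1]]
        current, t = loglik(beta, rows), 1.0
        # Halve the step until the log-likelihood does not decrease.
        while t > 1e-8 and loglik([beta[0] + t * step[0],
                                   beta[1] + t * step[1]], rows) < current:
            t /= 2.0
        beta = [beta[0] + t * step[0], beta[1] + t * step[1]]
        if max(abs(t * step[0]), abs(t * step[1])) < tol:
            break
    _, info = score_and_information(beta, rows)
    return beta, inverse_2x2(info)


# Illustrative data with true coefficients (log 2, log 1.5).
random.seed(2012)
rows = []
for _ in range(2000):
    x, z = random.uniform(0, 5), random.uniform(0, 5)
    t = math.exp(math.log(2) * x + math.log(1.5) * z) * random.expovariate(1.0)
    c = random.uniform(0, 100)
    rows.append((min(t, c), int(t <= c), x, z))
beta_hat, cov_hat = newton_raphson(rows)
```

Because this log-likelihood is concave in β, the halved Newton step always moves uphill, which makes the iteration insensitive to the starting value.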

**Remark 3.1**

1. The distribution of the failure time needs to be specified in this procedure. An appropriate one can be chosen based on the validation sample by using the routine procedure for parametric model selection. See, for example, Lawless [24].

2. The direct imputation of the unobservable covariate by its kernel smoothing estimate is motivated by model robustness. If the model is slightly misspecified, the maximum likelihood estimator of the regression parameters based on kernel smoothing of the unknown expectation in the likelihood function will be inconsistent. This was also observed in our simulation studies, although those results are not reported here.

3. The scale parameter, if unknown, can be estimated routinely from the validation sample, or by adding another estimating equation obtained by differentiating the estimated log-likelihood $\hat{\ell}$ with respect to this scale parameter, say,

$$\frac{\partial \hat{\ell}(\beta, \sigma)}{\partial \sigma} = 0.$$

4. This method can accommodate both missing covariate and mismeasured covariate problems.

5. This method can be extended by using local linear approximation (see Fan and Wang 2009) instead of the equation (10). In nonparametric smoothing, local linear approximation usually performs better than kernel smoothing. The method also accommodates models with instrumental variables.

**Asymptotics**

Under the regularity conditions (a), (b), (c), and the condition (d) listed in the appendix, our proposed estimates of the regression parameters by maximizing the estimated likelihood function are jointly consistent and asymptotically normally distributed, as described in the following theorem.

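Suppose that the order of the kernel function K is α, that is,

$$\int u^{j} K(u)\,du = 0 \quad (j = 1, \dots, \alpha - 1), \qquad \int u^{\alpha} K(u)\,du \neq 0.$$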

**Theorem 4.1**

Under the conditions (a), (b), (c), and (d) in the appendix, and the bandwidth condition that nh^{2α} → 0, nh^{2} → ∞ we have

1. is a consistent estimator of β.

in distribution, as n → ∞,

where

,

,

,

and ρ is the ratio of the validation sample size to the total sample size.

The variance-covariance matrix of the proposed estimator can be consistently estimated by its sample counterpart from the estimated log-likelihood function,

. (14)

In equation (14), is the observed Fisher information matrix with elements

,

when replacing the unknown regression parameters with their estimates. is the sample variance-covariance matrix of the non-validation half of the estimating function U(β) which estimates its corresponding population counterpart. The proof of the theorem is deferred to the appendix.
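As a quick check, offered as a sketch rather than as part of the proof, the bandwidth h = 2σ_{u}n^{-1/3} used in this paper satisfies the bandwidth condition of Theorem 4.1 when the kernel has order α = 2, as the Gaussian kernel does:

```latex
\begin{aligned}
n h^{2\alpha} &= n \left( 2\sigma_u n^{-1/3} \right)^{4} = 16\,\sigma_u^{4}\, n^{-1/3} \longrightarrow 0, \\
n h^{2} &= n \left( 2\sigma_u n^{-1/3} \right)^{2} = 4\,\sigma_u^{2}\, n^{1/3} \longrightarrow \infty,
\end{aligned}
\qquad \text{as } n \to \infty .
```

Both limits hold for any fixed σ_{u} > 0, so the rate n^{-1/3} lies between the two bounds imposed by the theorem.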

**Simulations**

In this section we investigate the small sample performance of our proposed estimator. We carry out extensive simulations in order to compare its efficiency and accuracy with alternative estimation methods. We compare the proposed estimator, based on the estimated likelihood method (EL), with three other estimators. The first estimator (V) is based only on the validation sample, ignoring the observations with missing values for X_{i}. This does not require the estimation of the unobserved data but as a trade-off must deal with a smaller sample size. The second estimator (N) is based on the naive use of the auxiliary covariate as the true covariate in the sample. In this case we assume that for the non-validation sample, the unobserved X_{i} values are equal to the observed W_{i} values, ignoring the measurement error. The third estimator (C) is based on complete knowledge of the data. This is the best case scenario that would exist if we actually observed the X_{i} values for the non-validation sample and thus were working with a validation sample of the full sample size. We expect the efficiency and accuracy of the proposed estimator to be better than that of the validation-only estimator and close to that of the complete-data estimator.

Simulations are done for the cases in Sections 2.1.2 and 2.1.3. For both, the random X_{i} and Z_{i} data are generated from a uniform distribution with a lower limit of 0 and an upper limit of 5, X_{i}, Z_{i} ~uniform (0,5). The auxiliary covariate W_{i} is defined as W_{i} = X_{i}+ U_{i} where U_{i} ~ N(0,σ_{u}^{2}) and σ_{u}^{2} determines the size of the measurement error in our sampling. Given X_{i} and Z_{i}, the random failure times T_{i} for the first case are generated from the equations

T_{i}= exp{Y_{i}};

and

Y_{i}= β_{1}X_{i} +β_{2}'Z_{i}+ ε_{i};

where the ε_{i}'s are iid and are following a standard extreme value distribution as discussed in Section 2.1.2. For the proportional odds model, we have

Y_{i}= β_{1}X_{i} + β_{2}'Z_{i}+ σ V_{i};

where V_{i} follows the standard logistic distribution as shown in Section 2.1.3, and we let σ = 1. The parameters β′ = (β_{1}, β′_{2}) are chosen prior to the simulations. The random censoring times C_{i} are generated from a uniform distribution, C_{i} ~ uniform[0, c_{lim}], where c_{lim} is chosen such that approximately 30% or 50% of the failure times are censored.
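The data-generating scheme for the exponential regression case, together with one possible way of choosing c_{lim}, can be sketched as follows. Calibrating c_{lim} by bisection on a large simulated sample is an assumption of this sketch, since the text does not say how c_{lim} was found. The function names and the dictionary layout are hypothetical, Z_{i} is scalar, and the first ρn subjects are arbitrarily taken as the validation sample.

```python
import math
import random


def simulate(n, sigma_u, c_lim, rho=0.5,
             beta=(math.log(2), math.log(1.5)), seed=0):
    """Generate one data set under the exponential regression model."""
    rng = random.Random(seed)
    n_valid = round(rho * n)
    rows = []
    for i in range(n):
        x, z = rng.uniform(0, 5), rng.uniform(0, 5)
        w = x + rng.gauss(0.0, sigma_u)               # auxiliary covariate W = X + U
        t = math.exp(beta[0] * x + beta[1] * z) * rng.expovariate(1.0)
        c = rng.uniform(0, c_lim)                     # uniform censoring time
        rows.append({"s": min(t, c), "delta": int(t <= c), "z": z, "w": w,
                     "x": x if i < n_valid else None})  # X hidden outside validation
    return rows


def censoring_rate(rows):
    return sum(1 - r["delta"] for r in rows) / len(rows)


def calibrate_c_lim(target, n=5000, sigma_u=0.2, seed=0, iterations=30):
    """Bisection on c_lim; the fixed seed keeps the rate monotone in c_lim."""
    lo, hi = 1e-6, 1.0
    while censoring_rate(simulate(n, sigma_u, hi, seed=seed)) > target:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if censoring_rate(simulate(n, sigma_u, mid, seed=seed)) > target:
            lo = mid
        else:
            hi = mid
    return hi


c_lim = calibrate_c_lim(0.3)
sample = simulate(5000, 0.2, c_lim, seed=0)
```

Reusing the same seed inside the bisection means that only the scale of the censoring times changes between evaluations, so the empirical censoring rate cannot increase as c_{lim} grows.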

For each set of simulations, the values of n and n_{V} are fixed in advance and the X_{i}, W_{i}, Z_{i}, T_{i}, and C_{i} data are generated as outlined above. We estimate the X_{i} values for the non-validation set, for use in the estimated likelihood method, from the n_{V} X_{i}’s in the validation set and the n W_{i}’s by using kernel smoothing as described in equation (10). For our calculations, we use the Gaussian kernel function, which has an order of 2,

$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{u^{2}}{2} \right),$$

where u = (W_{i} − W_{j})/h, and we take the bandwidth h = 2σ_{u}n^{-1/3} as used by Zhou and Wang [18]. We then calculate all of the estimators through the Newton-Raphson method, using the appropriate set of data for each estimator. In this way we solve the score equations of the ordinary likelihood for the validation-only, naive and complete-data estimators, and the estimating equations of the estimated likelihood for the proposed estimator.

For each set of simulations, we calculate the standard error (SE), the standard deviation (SD), and the coverage probability (CP), which is the proportion of 95% confidence intervals that cover the true parameter value. The standard errors are obtained from the sample variance-covariance matrix of the parameter estimates over all simulations. The standard deviations are obtained from the estimated variance using equation (14). The values for CP are obtained by recording, in each simulation, whether the true β values lie within the 95% confidence interval built around the estimates using that simulation’s estimated SD value.
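The three summaries can be sketched for a single coefficient as follows. The function name is hypothetical, and reporting the average of the per-simulation SD values is an assumption about how the SD column is aggregated.

```python
import statistics


def summarise(estimates, sds, truth, z=1.96):
    """Summaries over simulation replicates for one coefficient.

    estimates: one estimate per replicate
    sds: the model-based SD reported in each replicate
    Returns (mean estimate, SE, mean SD, coverage probability).
    """
    mean_est = statistics.mean(estimates)
    se = statistics.stdev(estimates)      # empirical spread across replicates
    mean_sd = statistics.mean(sds)        # average model-based SD
    covered = sum(1 for b, s in zip(estimates, sds)
                  if b - z * s <= truth <= b + z * s)
    return mean_est, se, mean_sd, covered / len(estimates)
```

When the variance estimator is accurate, the SE and SD values are close and the coverage is near 0.95. A biased estimator can show matching SE and SD values and still have poor coverage.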

The parameter values used in our simulations were β′ = (β_{1}, β_{2}) = (log(2), log(1.5)). We tested these values in several settings: σ_{u} = 0.2 and σ_{u} = 0.8, sample sizes n = 200 and n = 500, and censoring rates of 30% and 50%. We chose a constant validation ratio of ρ = 0.5, and each simulation is repeated 1000 times. The simulation results are summarized in **Table 1** for the exponential regression model, and in **Table 2** for the proportional odds model.

| n | Censoring rate | σ_{u} | Method | Estimate of β_{1} | SE | SD | CP | Estimate of β_{2} | SE | SD | CP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 | 0.3 | 0.2 | V | 0.694 | 0.068 | 0.067 | 0.949 | 0.403 | 0.063 | 0.062 | 0.947 |
| | | | N | 0.690 | 0.047 | 0.046 | 0.939 | 0.408 | 0.044 | 0.043 | 0.950 |
| | | | EL | 0.693 | 0.048 | 0.047 | 0.938 | 0.405 | 0.044 | 0.043 | 0.940 |
| | | | C | 0.694 | 0.047 | 0.047 | 0.942 | 0.404 | 0.044 | 0.043 | 0.950 |
| | | 0.8 | V | 0.695 | 0.065 | 0.067 | 0.947 | 0.403 | 0.063 | 0.062 | 0.952 |
| | | | N | 0.644 | 0.048 | 0.043 | 0.748 | 0.459 | 0.048 | 0.042 | 0.715 |
| | | | EL | 0.699 | 0.051 | 0.050 | 0.950 | 0.407 | 0.048 | 0.046 | 0.943 |
| | | | C | 0.694 | 0.046 | 0.047 | 0.946 | 0.405 | 0.043 | 0.043 | 0.953 |
| | 0.5 | 0.2 | V | 0.694 | 0.085 | 0.085 | 0.955 | 0.403 | 0.076 | 0.074 | 0.937 |
| | | | N | 0.691 | 0.060 | 0.059 | 0.954 | 0.407 | 0.054 | 0.052 | 0.940 |
| | | | EL | 0.694 | 0.060 | 0.059 | 0.949 | 0.404 | 0.054 | 0.052 | 0.939 |
| | | | C | 0.695 | 0.060 | 0.059 | 0.956 | 0.404 | 0.054 | 0.052 | 0.940 |
| | | 0.8 | V | 0.695 | 0.086 | 0.085 | 0.946 | 0.406 | 0.075 | 0.075 | 0.951 |
| | | | N | 0.637 | 0.060 | 0.054 | 0.788 | 0.463 | 0.055 | 0.050 | 0.782 |
| | | | EL | 0.695 | 0.063 | 0.061 | 0.937 | 0.404 | 0.056 | 0.053 | 0.938 |
| | | | C | 0.695 | 0.060 | 0.059 | 0.944 | 0.406 | 0.052 | 0.052 | 0.957 |
| 500 | 0.3 | 0.2 | V | 0.692 | 0.043 | 0.042 | 0.941 | 0.406 | 0.038 | 0.039 | 0.957 |
| | | | N | 0.688 | 0.029 | 0.029 | 0.940 | 0.410 | 0.027 | 0.027 | 0.953 |
| | | | EL | 0.691 | 0.030 | 0.029 | 0.944 | 0.407 | 0.027 | 0.027 | 0.954 |
| | | | C | 0.691 | 0.029 | 0.029 | 0.944 | 0.406 | 0.027 | 0.027 | 0.957 |
| | | 0.8 | V | 0.696 | 0.043 | 0.042 | 0.942 | 0.402 | 0.040 | 0.039 | 0.935 |
| | | | N | 0.644 | 0.032 | 0.027 | 0.530 | 0.458 | 0.031 | 0.026 | 0.473 |
| | | | EL | 0.700 | 0.033 | 0.031 | 0.937 | 0.405 | 0.031 | 0.029 | 0.935 |
| | | | C | 0.694 | 0.031 | 0.029 | 0.932 | 0.403 | 0.028 | 0.027 | 0.946 |
| | 0.5 | 0.2 | V | 0.697 | 0.052 | 0.053 | 0.958 | 0.404 | 0.046 | 0.046 | 0.949 |
| | | | N | 0.691 | 0.036 | 0.037 | 0.949 | 0.409 | 0.032 | 0.033 | 0.945 |
| | | | EL | 0.694 | 0.037 | 0.037 | 0.948 | 0.406 | 0.033 | 0.033 | 0.937 |
| | | | C | 0.695 | 0.036 | 0.037 | 0.953 | 0.405 | 0.032 | 0.033 | 0.945 |
| | | 0.8 | V | 0.696 | 0.053 | 0.053 | 0.946 | 0.404 | 0.047 | 0.046 | 0.952 |
| | | | N | 0.637 | 0.037 | 0.034 | 0.589 | 0.462 | 0.036 | 0.031 | 0.566 |
| | | | EL | 0.695 | 0.039 | 0.038 | 0.950 | 0.404 | 0.036 | 0.034 | 0.937 |
| | | | C | 0.695 | 0.037 | 0.037 | 0.957 | 0.405 | 0.033 | 0.033 | 0.952 |

**Table 1:** Results after 1000 simulations for β′ = (log(2), log(1.5)) = (0.693, 0.405) with ρ = 0.5 and h = 2σ_{u}n^{-1/3} using the exponential regression model. Methods: V, validation sample only; N, naive; EL, estimated likelihood; C, complete data.

| n | Censoring rate | σ_{u} | Method | Estimate of β_{1} | SE | SD | CP | Estimate of β_{2} | SE | SD | CP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 | 0.3 | 0.2 | V | 0.694 | 0.098 | 0.097 | 0.948 | 0.408 | 0.097 | 0.096 | 0.955 |
| | | | N | 0.691 | 0.067 | 0.068 | 0.948 | 0.407 | 0.066 | 0.067 | 0.948 |
| | | | EL | 0.694 | 0.068 | 0.069 | 0.957 | 0.405 | 0.067 | 0.067 | 0.950 |
| | | | C | 0.694 | 0.067 | 0.068 | 0.947 | 0.405 | 0.066 | 0.067 | 0.951 |
| | | 0.8 | V | 0.692 | 0.098 | 0.097 | 0.953 | 0.405 | 0.095 | 0.096 | 0.943 |
| | | | N | 0.639 | 0.069 | 0.066 | 0.848 | 0.447 | 0.066 | 0.066 | 0.909 |
| | | | EL | 0.692 | 0.073 | 0.070 | 0.936 | 0.405 | 0.069 | 0.069 | 0.952 |
| | | | C | 0.692 | 0.068 | 0.068 | 0.953 | 0.407 | 0.065 | 0.067 | 0.961 |
| | 0.5 | 0.2 | V | 0.697 | 0.112 | 0.109 | 0.938 | 0.407 | 0.106 | 0.103 | 0.943 |
| | | | N | 0.689 | 0.081 | 0.076 | 0.937 | 0.412 | 0.076 | 0.072 | 0.943 |
| | | | EL | 0.692 | 0.081 | 0.076 | 0.941 | 0.409 | 0.076 | 0.073 | 0.941 |
| | | | C | 0.693 | 0.081 | 0.076 | 0.937 | 0.409 | 0.076 | 0.072 | 0.945 |
| | | 0.8 | V | 0.697 | 0.108 | 0.109 | 0.954 | 0.405 | 0.102 | 0.104 | 0.956 |
| | | | N | 0.642 | 0.076 | 0.073 | 0.874 | 0.447 | 0.073 | 0.071 | 0.912 |
| | | | EL | 0.693 | 0.080 | 0.078 | 0.948 | 0.403 | 0.076 | 0.074 | 0.946 |
| | | | C | 0.694 | 0.077 | 0.076 | 0.949 | 0.407 | 0.074 | 0.072 | 0.953 |
| 500 | 0.3 | 0.2 | V | 0.695 | 0.061 | 0.061 | 0.948 | 0.401 | 0.060 | 0.060 | 0.951 |
| | | | N | 0.690 | 0.043 | 0.043 | 0.956 | 0.406 | 0.042 | 0.042 | 0.948 |
| | | | EL | 0.693 | 0.043 | 0.043 | 0.952 | 0.403 | 0.042 | 0.042 | 0.948 |
| | | | C | 0.694 | 0.043 | 0.043 | 0.954 | 0.403 | 0.042 | 0.042 | 0.950 |
| | | 0.8 | V | 0.696 | 0.062 | 0.061 | 0.950 | 0.405 | 0.061 | 0.060 | 0.944 |
| | | | N | 0.643 | 0.042 | 0.042 | 0.776 | 0.447 | 0.041 | 0.042 | 0.830 |
| | | | EL | 0.696 | 0.045 | 0.044 | 0.945 | 0.404 | 0.043 | 0.043 | 0.948 |
| | | | C | 0.696 | 0.043 | 0.043 | 0.950 | 0.406 | 0.041 | 0.042 | 0.958 |
| | 0.5 | 0.2 | V | 0.694 | 0.069 | 0.068 | 0.952 | 0.402 | 0.067 | 0.065 | 0.934 |
| | | | N | 0.691 | 0.048 | 0.048 | 0.943 | 0.407 | 0.047 | 0.045 | 0.939 |
| | | | EL | 0.694 | 0.049 | 0.048 | 0.947 | 0.404 | 0.047 | 0.046 | 0.940 |
| | | | C | 0.695 | 0.048 | 0.048 | 0.944 | 0.404 | 0.047 | 0.046 | 0.940 |
| | | 0.8 | V | 0.696 | 0.066 | 0.068 | 0.961 | 0.406 | 0.066 | 0.065 | 0.942 |
| | | | N | 0.640 | 0.048 | 0.046 | 0.753 | 0.449 | 0.047 | 0.045 | 0.826 |
| | | | EL | 0.691 | 0.051 | 0.049 | 0.942 | 0.405 | 0.049 | 0.046 | 0.942 |
| | | | C | 0.694 | 0.049 | 0.048 | 0.953 | 0.408 | 0.047 | 0.046 | 0.956 |

**Table 2:** Results after 1000 simulations for β′ = (log(2), log(1.5)) = (0.693, 0.405) with ρ = 0.5 and h = 2σ_{u}n^{-1/3} using the log-logistic regression model. Methods: V, validation sample only; N, naive; EL, estimated likelihood; C, complete data.

We have also conducted simulations for other parameter settings, such as (1) σ_{u} = 0.6; (2) a lower validation rate of 30%; (3) an unknown but estimated measurement error variance σ_{u}^{2}; (4) an estimated σ in the AFT model. The results were all similar to those reported and are hence omitted.

From **Tables 1 and 2**, we make the following observations:

Both and are performing very well. The naive estimator is biased at higher values of measurement error, σ_{u}.

The proposed estimator (EL) is more efficient than the validation-only estimator (V), in the sense that the latter has larger standard errors.

If ρ were to increase to 1, the relative efficiencies would go to 1 since aside from having to estimate the unobserved X_{i}’s for the non-validation set versus excluding all of the non-validation data, the methods of estimation are the same.

The proposed variance estimator (14) gives a good estimate of the true variance of the proposed estimator for both models: the SD values are close to the corresponding SE values.

The coverage probabilities of the 95% confidence intervals are close to the nominal level for all estimators except the naive estimator when σ_{u} is large. With σ_{u} = 0.8 the coverage of the naive estimator is poor, and it deteriorates further as the sample size increases with ρ held fixed: a larger sample adds more error-prone observations without increasing the proportion of known X_{i} values, so the bias remains while the confidence interval narrows.

In comparing the two models, the exponential regression model has smaller SE and SD values for all four estimators, but the log-logistic regression model does not experience such a dramatic decrease in CP for the naive estimator when σ_{u} is increased. This is likely due to its larger SD values, which give wider confidence intervals.

**Application to PBC data**

We apply the proposed method to analyze data from the Mayo Clinic trial in PBC of the liver. PBC is a chronic liver disease that inflames and slowly destroys the bile ducts in the liver. Bile is a liquid produced in the liver which travels through these bile ducts to assist digestion in the small intestines. When these ducts are damaged, the bile builds up within the liver, causing damage and leading to cirrhosis. Scar tissue will then start to replace healthy liver tissue, impairing its ability to function properly. While the cause of PBC is unknown, it is believed to be a type of autoimmune disorder where the immune system attacks the bile ducts. Approximately 90% of patients who develop PBC are women, most often between the ages of 40 and 60. It is typical for those with PBC to not have any symptoms when diagnosed because it is often diagnosed early from routine blood tests checking the liver. Since it is a slow acting disease, if it is found early the patient may slow the progression of cirrhosis through treatment and still have many years with a healthy lifestyle, and possibly even have a normal life expectancy if their case is not too dire. However, there is currently no known cure for the disease. The only known way to effectively remove PBC is through a liver transplant. If the patient is deemed appropriate for a transplant, steps need to be taken to prevent the immune system from damaging the new liver [28,29].

In the randomized Mayo Clinic trial, a total of 418 patients were eligible. Of these 418, mostly complete data was obtained from the first 312 patients. The other 106 patients were not part of the actual clinical trial but agreed to have some basic measurements taken and to be followed for survival. The variables that we used for our analysis were time, the number of days between registration and the earlier of death, transplantation, or the study analysis date; status, the indicator of a patient’s status at their endpoint in the trial, denoted as 0, 1, or 2, corresponding to censored, transplant, or dead, respectively; AST, aspartate aminotransferase (in U/ml), once referred to as SGOT; bili, serum bilirubin (in mg/dl); albumin, serum albumin (in mg/dl); age, patient’s age (in years); protime, standardized blood clotting time.

In this clinical trial, aspartate aminotransferase was one of the variables measured only for the first 312 cases, due to some difficulties. We are extremely interested in knowing its relationship with the patients’ survival. In order to estimate the unobserved AST values for the other 106 patients, which form the non-validation sample in this analysis, we chose serum bilirubin to act as the auxiliary covariate, W. Serum bilirubin was observed for every patient and was therefore available for use in kernel smoothing. To determine an estimate of σ_{u} for the calculation of the bandwidth, we fitted the regression equation X_{i} = β_{0} + β_{1}W_{i} + ε_{i} by least squares and used the resulting MSE to estimate σ_{u}. The scale parameters for the AFT models were estimated from the validation data only and then applied in the analysis using the proposed approach: we obtained σ = 0.873 for the proportional hazards model and σ = 0.676 for the proportional odds model.
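The bandwidth input can be sketched with a simple least squares fit of X on W over the validation sample. Taking the estimate of σ_{u} to be the square root of the MSE, with n − 2 degrees of freedom, is an assumption of this sketch, and the function name is hypothetical.

```python
import math


def residual_sd(w, x):
    """Least squares fit of x = b0 + b1 * w; returns (b0, b1, sqrt(MSE))."""
    n = len(w)
    w_bar, x_bar = sum(w) / n, sum(x) / n
    sww = sum((wi - w_bar) ** 2 for wi in w)
    swx = sum((wi - w_bar) * (xi - x_bar) for wi, xi in zip(w, x))
    b1 = swx / sww
    b0 = x_bar - b1 * w_bar
    sse = sum((xi - b0 - b1 * wi) ** 2 for wi, xi in zip(w, x))
    mse = sse / (n - 2)          # two regression coefficients were estimated
    return b0, b1, math.sqrt(mse)
```

The third returned value would then play the role of σ_{u} in the bandwidth rule h = 2σ_{u}n^{-1/3}.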

To test alongside AST, we included the variables serum albumin, age and protime in the vector Z. These variables were measured for most of the patients, and thus were good choices for Z. Two cases in the non-validation set had missing values for protime, so they were omitted. This left us with a validation set of 312 patients and a non-validation set of 104 patients. We decided not to include edema, even though it was measured for all patients, because there was not a single patient in the non-validation set that had edema despite diuretic therapy. For our calculations, we took the logarithms of the data for AST, serum bilirubin, serum albumin, and protime. We also treated having a transplant the same as being censored, so a status of 0 or 1 resulted in δ = 0, and a status of 2 resulted in δ = 1.
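The data preparation described above can be sketched for a single patient record as follows. The dictionary keys and the function name are hypothetical. The recoding of status and the log transformations follow the text.

```python
import math


def prepare(record):
    """Map one raw patient record to the analysis variables.

    record: dict with keys time, status, ast, bili, albumin, age, protime
    (hypothetical names). Returns None when protime is missing, and
    x = None when AST is missing (non-validation subject).
    """
    if record.get("protime") is None:
        return None                                   # omitted from the analysis
    ast = record.get("ast")
    return {
        "s": record["time"],
        "delta": 1 if record["status"] == 2 else 0,   # transplant counts as censored
        "x": None if ast is None else math.log(ast),
        "w": math.log(record["bili"]),
        "z": (math.log(record["albumin"]), record["age"],
              math.log(record["protime"])),
    }
```

Records with `x` equal to `None` form the non-validation set, for which log(AST) is later replaced by its kernel-smoothed value given log(bilirubin).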

The proportional hazards and the proportional odds models fit this part of the data equally well, in the sense that we obtained very close AIC values for both. The results of applying both models are therefore provided below.

**Tables 3 and 4** show the results of the analysis of the PBC data using our estimated likelihood method on all 416 observations and the validation set method on just 312 observations, using both of the previously discussed models. Since our auxiliary covariate is a separate variable rather than a mismeasured version of X, the naive method is not appropriate for this example. The estimates of the variables’ coefficients, their estimated standard deviations, and p-values are listed in the tables.

| Method | Variable | Estimate | SD | P-Value |
|---|---|---|---|---|
| VA | log(AST) | -0.269 | 0.160 | 0.093 |
| | log(albumin) | 5.128 | 0.413 | < 0.001 |
| | age | 0.017 | 0.008 | 0.035 |
| | log(protime) | 1.766 | 0.474 | < 0.001 |
| EL | log(AST) | 0.342 | 0.145 | 0.018 |
| | log(albumin) | 4.737 | 0.436 | < 0.001 |
| | age | 0.016 | 0.007 | 0.027 |
| | log(protime) | 2.112 | 0.435 | < 0.001 |

**Table 3:** AFT model analysis of PBC data using validation set and estimated likelihood methods using the exponential regression model.

| Method | Variable | Estimate | SD | P-Value |
|---|---|---|---|---|
| VA | log(AST) | -0.384 | 0.181 | 0.034 |
| | log(albumin) | 6.455 | 0.554 | < 0.001 |
| | age | 0.022 | 0.008 | 0.008 |
| | log(protime) | 1.252 | 0.527 | 0.017 |
| EL | log(AST) | 0.460 | 0.160 | 0.004 |
| | log(albumin) | 5.970 | 0.506 | < 0.001 |
| | age | 0.021 | 0.007 | 0.003 |
| | log(protime) | 1.656 | 0.462 | < 0.001 |

**Table 4:** AFT model analysis of PBC data using validation set and estimated likelihood methods using the log-logistic regression model.

In **Table 3**, we see that except for log(albumin), the standard deviations are all smaller for the estimated likelihood method than for the validation set method, while every standard deviation is smaller for the estimated likelihood method in **Table 4**. In each case, the magnitudes of the estimated coefficients vary between estimation methods, but they show the same relationships between the covariates and time of death. Most importantly, the significance of one of the coefficients differs between estimation methods. For the exponential regression model, the p-value for log(AST) is less than 0.05 only for the estimated likelihood method (0.018, against 0.093 for the validation set method). Therefore, with the smaller validation sample we are unable to conclude under the exponential regression model that all of the coefficients are significantly different from zero, whereas all four coefficients are significant under both models when using the estimated likelihood method. This emphasizes the importance of not discarding part of the data: as we have seen, doing so can lead one to conclude that a significant variable is not significant.
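The tabulated p-values appear consistent, up to rounding, with two-sided Wald tests based on the ratio of each estimate to its SD. The following sketch, which assumes a standard normal reference distribution and uses a hypothetical function name, reproduces the two log(AST) entries of Table 3.

```python
import math


def wald_p_value(estimate, sd):
    """Two-sided p-value for H0: coefficient = 0, normal approximation."""
    z = abs(estimate / sd)
    return math.erfc(z / math.sqrt(2.0))   # P(|N(0, 1)| > z)


p_validation = wald_p_value(-0.269, 0.160)   # validation set method
p_estimated = wald_p_value(0.342, 0.145)     # estimated likelihood method
```

The two values fall on opposite sides of the 0.05 threshold, which is the change in significance discussed above.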

In this paper we proposed using kernel smoothing to incorporate an informative auxiliary covariate into statistical inference for failure time data based on parametric AFT models. An estimator of the regression parameters is obtained by maximizing an estimated likelihood function. The asymptotic properties of the proposed estimator are investigated, and a consistent estimator of its variance is proposed. Simulation studies are conducted for the cases where the error of the AFT model follows a standard extreme value distribution and where it follows a standard logistic distribution. The proposed method is then applied to the PBC data as an illustration.
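To make the estimated likelihood idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single exposure X observed only in the validation set, a scalar auxiliary covariate W observed for everyone, an exponential AFT model with mean exp(beta0 + beta1 * X), a Gaussian kernel, and a fixed bandwidth `h`. All function names and the toy data are hypothetical. For a subject without X, the likelihood contribution is replaced by a kernel-weighted (Nadaraya-Watson type) average of the contributions evaluated at the validation subjects' X values, weighted by how close their W is to the subject's W.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel (assumed choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def contribution(t, delta, x, beta0, beta1):
    """Exponential AFT likelihood contribution of one subject.

    T is exponential with mean mu = exp(beta0 + beta1 * x);
    delta = 1 for an observed failure, 0 for a right-censored time.
    """
    mu = math.exp(beta0 + beta1 * x)
    return (1.0 / mu) ** delta * math.exp(-t / mu)

def estimated_log_likelihood(beta0, beta1, validation, non_validation, h):
    """Estimated log-likelihood.

    validation:     list of (t, delta, x, w) with the exposure x observed
    non_validation: list of (t, delta, w) with only the auxiliary w observed
    h:              kernel bandwidth (assumed fixed, not data-driven)
    """
    loglik = 0.0
    # Validation subjects contribute their exact likelihood terms.
    for t, delta, x, _w in validation:
        loglik += math.log(contribution(t, delta, x, beta0, beta1))
    # Non-validation subjects contribute a kernel-smoothed average over
    # the exposures seen in the validation set.
    for t, delta, w in non_validation:
        weights = [gaussian_kernel((wj - w) / h) for (_, _, _, wj) in validation]
        smoothed = sum(
            k * contribution(t, delta, xj, beta0, beta1)
            for k, (_, _, xj, _) in zip(weights, validation)
        ) / sum(weights)
        loglik += math.log(smoothed)
    return loglik

# Toy data, made up purely for illustration.
validation = [(1.2, 1, 0.3, 0.35), (0.8, 1, -0.5, -0.40), (2.5, 0, 1.1, 1.00)]
non_validation = [(1.6, 1, 0.20), (0.9, 0, -0.30)]
print(estimated_log_likelihood(0.0, 0.5, validation, non_validation, h=0.5))
```

In practice the function would be passed to a numerical optimizer over (beta0, beta1), and the bandwidth would need to be chosen with care; both steps are omitted here.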

The motivation for conducting this study is twofold. First, it is well known that AFT models are robust to misspecification when some of the predictive regressors are ignored: the regression coefficients are invariant, at least for distributions within the Weibull family. Second, the partial likelihood method is less efficient for small samples, although it is asymptotically efficient as the sample size goes to infinity [11].

The authors are currently investigating semi-parametric AFT models with auxiliary covariates. The results will be reported in a forthcoming paper.

The research of both authors was supported in part by the Natural Sciences and Engineering Research Council of Canada.

- Fleming TR, Harrington DP (1991) Counting processes and survival analysis, John Wiley & Sons, Inc. New York.
- Prentice RL (1982) Covariate measurement errors and parameter estimation in failure time regression model. Biometrika 69: 331-342.
- Rubin DB (1976) Inference and missing data. Biometrika 63: 581-592.
- Fuller WA (1987) Measurement error models. John Wiley & Sons, New York.
- Carroll RJ, Ruppert D, Stefanski LA (1995) Measurement Error in Nonlinear Models. Chapman and Hall, London.
- Wang NY, Lin XH, Gutierrez RG, Carroll RJ (1998) Bias analysis and SIMEX approach in generalized linear mixed measurement error models. J Am Statist Assoc 93: 249-261.
- Meng X, Schenker N (1999) Maximum likelihood estimation for linear regression models with right censored outcomes and missing predictors. Comput Stat Data Anal 29: 471-483.
- Cheng S, Wang N (2001) Linear transformation models for failure time data with covariate measurement error. J Am Statist Assoc 96: 706-716.
- Yu M, Nan B (2010) Regression calibration in semiparametric accelerated failure time models. Biometrics 66: 405-414.
- Cox DR (1972) Regression models and life-tables. J R Stat Soc Ser B 34: 187-220.
- Cox DR, Oakes D (1984) Analysis of survival data. Chapman and Hall, London.
- Kalbfleisch JD, Prentice RL (2002) The statistical analysis of failure time data, 2nd edn. New York: John Wiley & Sons, Inc.
- Hu P, Tsiatis AA, Davidian M (1998) Estimating the parameters in the Cox model when covariate variables are measured with error. Biometrics 54: 1407-1419.
- Hu C, Lin DY (2002) Cox regression with covariate measurement error. Scandinavian Journal of Statistics 29: 637-655.
- Pepe MS, Fleming TR (1991) A nonparametric method for dealing with mismeasured covariate data. J Am Statist Assoc 86: 108-113.
- Pepe MS (1992) Inference using surrogate outcome data and a validation sample. Biometrika 79: 355-365.
- Zhou H, Pepe MS (1995) Auxiliary covariate data in failure time regression. Biometrika 82: 139-149.
- Zhou H, Wang C-Y (2000) Failure time regression with continuous covariates measured with error. J R Stat Soc Ser B 62: 657-665.
- Zhou H, Chen J, Cai J (2002) Random effects logistic regression analysis with auxiliary covariates. Biometrics 58: 352-360.
- Jiang J, Zhou H (2007) Additive hazard regression with auxiliary covariates. Biometrika 94: 359-369.
- Fan Z, Wang X (2009) Marginal hazards model for multivariate failure time data with auxiliary covariates. J Nonparametr Stat 21: 771-786.
- Liu Y, Zhou H, Cai J (2009) Estimated pseudo-partial-likelihood method for correlated failure time data with auxiliary covariates. Biometrics 65: 1184-1193.
- He W, Yi GY, Xiong J (2007) Accelerated failure time models with covariates subject to measurement error. Stat Med 26: 4817-4832.
- Lawless JF (2003) Statistical models and methods for lifetime data, 2nd edn, Wiley & Sons Inc., Hoboken, NJ.
- Nadaraya EA (1964) On estimating regression. Theory Probab Appl 9: 141-142.
- Watson GS (1964) Smooth regression analysis. Sankhya: The Indian Journal of Statistics Ser A 26: 359-372.
- Wand M, Jones M (1995) Kernel Smoothing. Chapman and Hall, London.
- (2008) "Primary Biliary Cirrhosis (PBC)." National Digestive Diseases Information Clearinghouse (NDDIC). National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health.
- (2011) "Primary Biliary Cirrhosis (PBC)." American Liver Foundation.
- Cramer H (1951) Mathematical Methods of Statistics. Princeton: Princeton University Press.
