Correction of Retransformation Bias in Nonlinear Predictions on Longitudinal Data with Generalized Linear Mixed Models

Researchers often encounter discrete response data in longitudinal analysis. Generalized linear mixed models are generally applied to account for potential lack of independence inherent in longitudinally data. When parameter estimates are used to describe longitudinal processes, random effects, both between and within subjects, need to be retransformed in nonlinear predictions on the response data; otherwise, serious retransformation bias can arise to an unanticipated extent. This study attempts to go beyond existing work by developing a retransformation method deriving statistically robust longitudinal trajectory of nonlinear predictions. Variances of population-averaged nonlinear predictions are approximated by the delta method. The empirical illustration uses longitudinal data from the Asset and Health Dynamics among the Oldest Old study. Our analysis compares three sets of nonlinear predictions of death rate at six time points, from the retransformation method, the best linear unbiased predictor, and the fixed-effects approach, respectively. The results demonstrate that failure to retransform the random components in generalized linear mixed models results in severely biased nonlinear predictions, as well as much reduced standard error approximates. Journal of Biometrics & Biostatistics J o ur al of Bio metrics & Bistatis t i c s


Introduction
In longitudinal data analysis, researchers frequently encounter discrete response data. When the distribution of the response variable is not normal or the variance/covariance matrices are not homogeneous longitudinally, the use of linear mixed models can lead to erroneous parameter estimates and unrealistic predictions of the response. There are different types of discrete longitudinal data, such as binary, ordinal, count, and multinomial. While there are particular model specifications and estimating procedures for each data type, the underlying expressions and statistical inferences in many of those models can be generalized by following the tradition of generalized linear models, referred to as generalized linear mixed models. Generalized linear mixed models incorporate subject-specific random effects, thereby addressing dependence among subject-specific observations. A variety of approaches have ben advanced to yield statistically efficient, robust, and consistent estimators on parameters of generalized linear mixed models [1][2][3][4][5][6][7].
In displaying analytic results of generalized linear mixed models, parameter estimates are often not directly interpretable due to transformation of distributional functions. With specification of the subject-specific random effects, covariates' regression coefficients in generalized linear mixed models do not necessarily describe changes in the mean response in the study population, and the actual effects must be evaluated by averaging over the distribution of specified random components [8]. The issue of interpretability in the analytic results of generalized linear mixed models thus calls for nonlinear predictions given the covariates' values, the estimated regression coefficients, and the average of random effects. When one converts a transformed linear function to predict the marginal mean at the untransformed scale, normality of the random components in the linear predictor must be retransformed to a non-normal distribution [9,10]; otherwise serious retransformation bias can arise. Even if true values of the parameters are known, the analytic results from generalized linear mixed models cannot be converted to unbiased nonlinear predictions without appropriately retransforming the random components.
In this article, we display an efficient, robust method for nonlinear predictions in generalized linear mixed models that corrects for retransformation bias. In Section 2, we briefly review general specifications of generalized linear mixed models. In Section 3, we present the classical best linear unbiased predictor for nonlinear predictions as an ancillary presentation of generalized linear mixed models. This is followed by a description of the retransformation method in nonlinear predictions. Section 5 presents approximation of the variance-covariance matrix for nonlinear predictions. An example is provided in Section 6 comparing results of nonlinear predictions from various methods. In the final section, we discuss merits and remaining issues in the methods described in this article.

Specifications and Inferences
Generalized linear mixed models are simply an extension of the classical generalized linear modeling from univariate data to clustered measurements. In the longitudinal setting, let i denote subject i in a random sample of N subjects and j=1, ..., n i be the number of repeated measurements nested within i. The response variable Y ij can be regarded as a discrete realization of a random variable Y ij with mean μ ij and variance var(y ij ). The function g(y ij , x ij ) is a nonlinear function linking Y ij to covariate vector X ij and a 1 × q vector of subject-specific random effects . As the random components are unobservable, it is desirable to specify generalized linear mixed models based on expectation [1], written as  [2,4,[5][6][7]. The variance-covariance matrix of withinsubjects uncertainty, if specified, can be approximated by using the Hessian of the log likelihood function or of the log restricted likelihood function. The analytic results from such procedures, however, are not sufficiently interpretable until they are converted into nonlinear predictions.

Best Linear Unbiased Predictor
There is a variety of approximation methods for nonlinear predictions on longitudinal data [1,7,11]. Among those approximation methods, a popular approach is the best linear unbiased predictor, which uses estimates of the fixed effects and the predicted values of the random effects. In this method, the linear predictor in generalized linear mixed models can be written as the mean , β is the maximum likelihood or the restricted maximum likelihood estimate of β, and ˆi b is the prediction of b i . The marginal likelihood can be fitted by averaging over the distribution of the unobserved random effect b i , with the corresponding joint likelihoods over N subjects written as The maximum likelihood or the restricted maximum likelihood estimates β , Ĝ , and ˆ are the values of β, G, and φ that maximize the above likelihood function. The random effect predictions can be obtained as the conditional mean of b i given,β , Ĝ , and φ : Among various approximation techniques for ˆi b , quadrature methods are regarded to generate accurate approximations of the integral specified in the marginal likelihood function [4,5,8,12]. This popular procedure first uses the likelihood calculations given in McGilchrist [11], and then the empirical Bayes estimates of the random effects to approximate the Gaussian quadrature integral [13]. The approximated integral as a marginal likelihood is then optimized for the fixed effects, and the fixed effect estimates are applied to produce the final prediction. In this approach, as the random effect prediction is treated as the fixed effect, some inherent variability is overlooked thereby causing retransformation bias in nonlinear predictions. Therefore, the best linear unbiased predictor is a partial empirical Bayes method in nonlinear predictions. Furthermore, if there is evidence that uncertainty is not negligible, the estimator The marginal mean for a population subgroup can be predicted from subject-specific predictions by creating a scoring dataset that represents an actual or a hypothetical population group taking selected values of covariates. By retaining the predicted random effect for each subject, the mean and the standard deviation of subject-specific predictions in the scoring dataset approximate the populationaveraged prediction and its standard error by use of the best linear unbiased predictor.

Retransformation Method
One limitation in the best linear unbiased predictor is that ˆi b is Where E(y ij ) is the expected value of Y ij , β is an M × 1 vector of unknown regression parameters to be estimated including a time factor, and Z ij is a design block matrix. The 1 × q random effects vector b i is distributed as N(0,G), where G is a q × q covariance matrix. The matrix Z ij can contain time or other covariates whose association with the response is assumed to vary across subjects. With the specification of β and b i , the elements in vector y i =y i1 ,….y ni , are thought to be conditionally independent. The linear predictor, ( ) ( ) With the specification of a link function in generalized linear mixed models, random errors for nonlinear functions depend on the mean function, and accordingly, the variance of y ij is written as Where ν is a specific variance function, andφ represents a scale factor for over-dispersion. Given this flexible specification, y ij can follow a probability distribution other than multivariate normality. Equation (2) includes two distinctive variance components, betweensubjects and within-subject. Given the specification of the variance function, the within-subjects variance cannot be specified freely in non-normal longitudinal data.
Equation (1) does not specify a within-subject error term, implying zero uncertainty given β and b i . From Equation (2), however, it seems desirable to express y ij as a conditional function by including an error term for addressing uncertainty [9]: Where µ  ij is the conditional mean after accounting for the withinsubject random error ε ij . Correspondingly, ij η  is defined as Specification of the variance matrix for within-subjects random errors depends on a specific link function, and this random term may be conveniently assumed to have local normality with property . Due to retransformation of ε ij in expressing the expected value of ij y , very often µ µ α be a correlation parameter, and ( )  R a be the n i × n i matrix of  α describing the correlation pattern within subject i. Then, if within-subjects uncertainty is considered, the within-subject variance-covariance structure can be written as where  i µ is the vector of means over n i time points, ( ) × n i diagonal within-subject variance matrix containing elements ν, evaluated at  i µ , and R i is an unknown matrix to be estimated. For analytic convenience and simplicity, the matrices in Equation (5) are assumed to be common to all subjects. When the subject-specific random effects are specified, The variance-covariance matrix of the linear predictor, denoted by ( ) , can be written as The parameters β and G can be estimated by the maximum likelihood or the restricted maximum likelihood, or other Bayes-treated as the fixed effect in nonlinear predictions. Consequently, a portion of variability is ignored in the retransformation process. For a population group, the entire set of random components needs to be retransformed for correctly approximating the population-averaged mean. We refer to this retransforming of random components in nonlinear predictions as the retransformation method. In this section, we present two such methods, one without and one with considering within-subjects uncertainty, respectively.

Retransformation method without considering intrasubjects variability
We start the description with the case of a log link. Let subject i be a typical case in a population of interest characterized by covariate vector X 0 . The predicted value for the typical subject at time j can then be considered the population-averaged prediction for the population taking covariate values X 0 . Given Equation (1), the marginal mean of y ij given X 0j and b i is is defined as the moment generating function Therefore, the prediction at time j for the population with covariate X 0j is given by Equation (10) indicates that unless all elements in Ĝ take value zero. Therefore, given the log link, the nonlinear response y ij will be under-predicted if retransformation of between-subjects random effects is completely or partially neglected, with the magnitude of such retransformation bias depending on the size of 0 Φ j .
Next, let g(·) represent the popular logit link. Given the logit function, neglect of retransformation of random effects can also lead to tremendous retransformation bias on the binary data. Let y ij denote a dichotomous response variable taking value 0 or 1 for subject i at time j. Then the predicted probability that y ij = 1 for person i at time j given X 0j and b i can be written as where; in the construct of the lognormal distribution, Let the fixed-effects estimator ( ) ( ) which is generally the case in longitudinal data analysis, we have . Therefore, in nonlinear predictions with mixed-effects logit models, ignoring retransformation of between-subjects random effects can lead to strong downward bias in the predicted probability. For other discrete data types, such as the multinomial, the retransformation bias in nonlinear predictions can be equally impactful [9].

Retransformation method with intra-subjects variability
The equations specified in the above section do not specify a random term accounting for within-subjects uncertainty. In the application of generalized linear mixed models, between-subjects random effects are often perceived to reflect individual differences in unspecified characteristics thereby addressing within-subjects variability simultaneously [14]. In certain situations, however, this assumption does not reflect the true experiences generated by the stochastic longitudinal process.
If within-subjects variability is taken into account, the expectation of nonlinear response y for subject i at time j can be written as where ( ) can be understood as a second-order smearing estimate evaluated at ( ), Empirically, the within-subjects random term can be approximated from the partial derivative of the log likelihood function with respect to β in estimating a generalized linear mixed model, given by y is the local approximation of the within-subjects random error with local normality. Such an approximate is model-based; different from linear mixed models, its specification depends on the marginal mean. In nonlinear predictions, the specification of this local approximation step can be ignored only when there is strong evidence that between-subjects variability completely captures within-subjects uncertainty.
Some researchers recommend the application of the latent variable approach to estimate within-subjects random errors in generalized linear mixed models [15]. This standardized approach specifies a constant variance of within-subjects random errors, regardless of the response type and the number of covariates utilized in a particular longitudinal study. It is argued that, when the between-subjects random effects are specified, not that much variability remains in a binary or a multinomial response [15]. Furthermore, this approach does not specify a covariance structure when the data type is multinomial thereby overlooking the multivariate nature of the nonlinear data structure.
If within-subjects random errors are considered non-ignorable in specifying a generalized linear mixed model, the marginal mean of the nonlinear response y ij , taking the logit link, is

Variance Matrix for Nonlinear Predictions
We propose to use the delta method to approximate the standard errors of nonlinear predictions, as following the tradition in nonlinear predictions [12][13][14]16,17]. Suppose that the discrete response is binomial. Let ˆi j L be a random variable of the linear predictor for subject i at time point j from Equation (4) and; of variance In Equation (15) L , respectively, where K denotes the number of non-reference response levels [9]. Correspondingly, the variance- where; ( ) ( ) ( ) In generalized linear mixed models, ( ) var ij L may consist of two components, between-subjects and within-subjects variances, as indicated earlier. Whereas the between-subjects variance is specified in a generalized linear mixed model and thereby estimable, the withinsubjects random component can be approximated from the variance of the intercept after the covariates are rescaled to be centered at selected values [9]. The rationale is that if covariates are centered at some specified values, the intercept represents the transformed margin, and therefore, the variance of the estimated intercept plus the corresponding variance for the between-subjects random effects can be considered the approximation of the variance for the marginal mean.

Illustration
Data used for the illustration come from the Survey of Asset and Health Dynamics among the Oldest Old, a nationally representative investigation of older Americans. This survey, conducted by the Institute for Social Research, University of Michigan, is funded by National Institute on Aging as a supplement to the Health and Retirement Study. To date, the survey consists of nine waves of data collection. The Wave I survey was conducted between October 1993 and April 1994, identifying 9,473 households and 11,965 individuals in the target area range. The survey obtains detailed information on a number of domains, including demographics, health status, health care use, disability, and health and life insurance. Survival information throughout the follow-up waves is obtained through a link to the data of National Death Index. Because the first two waves (1993 and 1995) were based on a slightly different questionnaire from those of the succeeding waves, we use data from the six waves starting with the 1998 wave (1998, 2000, 2002, 2004, 2006, and 2008). For details about the study design, see the Health and Retirement Study website (hrsonline. isr.umich.edu).
In this illustration, the outcome variable is survival status, with 1=death and 0=else. Operationally, we analyze the probability of death between two successive waves, defined as pr(Y ij =1) where j indicates a time interval between time point j-1 and j. Given the focus on nonlinear predictions on longitudinal trajectory of death rate, the main explanatory variable is time, measured as the number of years elapsed since the 1998 survey (t = 0, 2, 4, 6, 8, 10). Among the control variables considered in this illustration, gender is a dichotomous variable with 1=women and 0=men, and age and education are both continuous variables. In estimating the model parameters, the control variables are rescaled to be centered about sample means. The interaction between time and each of the controls was considered for capturing possible convergence of the covariate's effect, but its inclusion did not contribute significantly to the model fit and thus was removed.
Given the binary outcome data, we apply the mixed-effects binary logit model. For illustrative simplicity without loss of generality, we use the random intercept logit model, assuming the effects of covariates on the logit to be fixed over time. The SAS PROC NLMIXED procedure (SAS Institute Inc., Cary, NC) is applied to compute the fixed and the random effects given its tremendous flexibility in estimating and predicting parameters in generalized linear mixed models [16]. We use adaptive Gaussian quadrature to approximate the integral of the likelihood over the random intercept with its advantage over other approximation methods in deriving robust random effect estimates and the model fit statistic [4,12]. With the specification of betweensubjects random intercepts, time is treated as a continuous variable. A combination of time and time × time, the so-called quadratic polynomial time function, was found to best describe the longitudinal trajectory of death rate. Given high correlation between the two time components, the time variable t was rescaled as a centered covariate to reduce numeric instability and collinearity.
To compare statistical efficiency and robustness of different methods for nonlinear predictions in longitudinal analysis, we first create a "full" logit model consisting of all covariates and two random components in the estimation process, assuming the populationaveraged predictions and the corresponding variances/covariances derived from this model to be exact. Then we purposefully exclude the variable "education" from the logit model. As education has a statistically significant effect on death rate, there is definitely additional clustering in the outcome data after its removal, and therefore such a reduced model is actually a misspecified model. We have three operational objectives in this illustration. First, we examine whether the retransformation method can capture the random effect after an important predictor is removed, relative to the results from the full model. Next, we assess whether the retransformation method yields much smaller retransformation bias than the best linear unbiased predictor in nonlinear predictions. Third, we demonstrate how the fixed-effects approach, though tending to generate unbiased estimates of regression coefficients [18], results in tremendous retransformation bias in nonlinear predictions. The best linear unbiased predictor and the retransformation method are actually based on the same model eliminating education from the estimating process, and therefore, we develop three logit models in the illustration: the full mixed-effects logit model, the reduced mixed-effects logit model removing education from regression, and the reduced fixed-effects logit model.
While the best linear unbiased predictor approximates death rate for each observation, we obtain the population-averaged predictions from this method by creating a scoring dataset that specifies the outcome variable as missing and the independent variables as zero, with covariates rescaled to be centered at selected values corresponding to a target population group. In the scoring data, the random effect per individual should be present, and therefore, the mean of the predictions with y=missing and X's=0 in the scoring dataset generates the predicted death rate for the population or a typical subject taking selected values of covariates. Table 1 displays the analytic results of three models. All regression coefficients, full or reduced, random-or fixed-effects, are statistically significant at α=0.05. In each model, the regression coefficient of time on the logit is positive, while that of time × time is negative, which, combined, suggest a decelerating time trend in old-age mortality. The between-subjects random effect on the logit is 0.381 for the full model and 0.380 for the reduced, both statistically significant. The intercept, the regression coefficients, and the standard errors in the reduced fixed-effects model are close to those from the two randomeffects logit models. Such a similarity indicates that in this analysis, the large-sample behavior follows, and consequently, the asymptotic process ( ) 1/ 2ˆ0 j n -β β , where β 0 is the true parameter vector, tends to converge to a normal vector with mean 0 and the covariance matrix as approximated by the inverse of the observed information matrix. The fixed-effects logit model, however, does not have capability to yield a robust and consistent estimator for nonlinear predictions in the longitudinal setting, as will be shown in Table 1.
The predicted death rate at each time point and its variance can be estimated by applying Equations (14) and (15), respectively. In the construct of binary response data, the partial derivative of death rate with respect to the logit function is The variance of the predicted population-averaged death rate ˆi j p can be approximated by L consists of two components if within-subjects random errors are considered. Table 2 presents four sets of the predicted death rates and the standard errors at six time points, computed from the full model, the retransformation method, the best linear unbiased predictor, and the fixed-effects reduced model, respectively.
In Table 2, death rate is shown to increase exponentially over time at the early and the middle stage, and then the increase slows near the end of the observation period, reflecting a "selection of the fittest" process. The retransformation method, transferring both the between-subjects and the within-subject random components in this analysis, generates the closest predictions to those from the full model, indicating its high efficiency and coverage in handling retransformation bias in nonlinear predictions. In the first two panels, the predicted Note: Randomness of the intercept is parameterized by the variance of the random effects. The best linear unbiased predictor provides close predictions at the early times; in the last three time periods, however, the predicted death rate deviates markedly from those from the first two methods. Furthermore, this method results in severely underestimated standard errors of the predictions. The fixed-effects logit model generates the least valid predictions with massive deviations from those of the mixedeffects models. More significantly, the fixed-effects approach severely underestimates standard errors, much more so than the best linear unbiased predictor. With severe underestimation of standard errors, all predicted death rates from the best linear unbiased predictor and the fixed-effects model are very strongly statistically significant, thereby yielding erroneous test results. Figure 1 plots three longitudinal trajectories of death rate and the corresponding 95% confidence limits, approximated from the retransformation method, the best linear unbiased predictor, and the fixed-effects approach, respectively. The solid line and the shaded area in each panel are the trajectory and the 95% confidence region from the full model, used as a standard for comparison. Panel A demonstrates how well the retransformation method predicts old-age mortality and the confidence limits, after one theoretically important, statistically significant predictor is removed. In Panel B, the best linear unbiased predictor estimates death rate fairly nicely in the early stage, but it then falls off systematically from the solid line, with the 95% confidence limits dramatically contracted. In Panel C, the predicted time trend of death rate is completely amiss, with the two 95% confidence limits very narrowly scattered around the predicted curve.

Discussion
As regression coefficients in generalized linear mixed models are often not interpretable, we have seen applications using the fixed-effect estimates for nonlinear predictions. Such an application can lead to considerable retransformation bias due to neglect of retransforming random components. If random disturbances in a generalized linear mixed model truly follow normality, this distribution needs to convert to a non-normal function to correctly predict the nonlinear outcome. Consequently, the expectation of random components is not zero in nonlinear predictions unless the identity link function is specified or both between-and within-subjects random disturbances are zero for all subjects. Without appropriately retransforming the random effects, the variance of nonlinear predictions will be underestimated, thereby affecting the quality of the significance test. The best linear unbiased  Table 2 are based on "reduced" (misspecified) models.  prediction, widely applied in longitudinal analysis, only accounts for a portion of between-subjects variability and overlooks the withinsubject random component that may exist inherently even with the specification of the between-subjects random effects.
In this article, we compare the predicted death rate at six time points from three predicting methods: the retransformation method, the best linear unbiased predictor, and the fixed-effects approach, as relative to a pre-specified full mixed-effects logit model. In particular, we examine how each of these methods behaves in nonlinear predictions after an important predictor variable is removed. The results of our illustration display that failure to retransform random components in generalized linear mixed models can result in severe bias in nonlinear predictions and sizable underestimation of standard errors, even when the estimated regression coefficients are unbiased. Such retransformation bias exerts substantial influences on the quality of significance test on nonlinear predictions. The fixed effects are population-averaged, and therefore, the subject-specific variability is disregarded; once random effects are considered in nonlinear predictions, the inherent variability is much increased, thereby lowering the value of the chi-square statistic. The best linear unbiased predictor reduces some retransformation bias but its effect is shown to be very limited. Relative to the fixed-effects approach and the BLUP estimator that are associated with tremendous retransformation bias, our retransformation method is shown to increase the efficiency and coverage in nonlinear predictions.