Biometrics & Biostatistics

Background: Missing race data is a ubiquitous problem in studies using large administrative datasets such as those of the Veterans Health Administration. The most common approach to this problem has been to analyze only those records with complete data, known as Complete Case Analysis (CCA). CCA requires the assumption that data are Missing Completely At Random (MCAR) and can lead to biased estimates with inflated standard errors. Objective: To examine the performance of a new imputation approach, Latent Class Multiple Imputation (LCMI), for imputing missing race data and to compare it with CCA, Multiple Imputation (MI), and Log-Linear Multiple Imputation (LLMI). Design/Participants: We empirically compared LCMI with CCA, MI, and LLMI using simulated data and demonstrated their application using data from a sample of 13,705 veterans with type 2 diabetes, among whom 23% had unknown or missing race information. Results: Our simulation study shows that under MAR, LCMI leads to lower bias and lower standard error estimates than CCA, MI, and LLMI. Similarly, in our data example, which does not conform to MCAR because subjects with missing race information had lower rates of medical comorbidities than those with race information, LCMI outperformed MI and LLMI, providing lower standard errors, especially when a relatively large number of latent classes is assumed for the latent class imputation model. Conclusions: Our results show that LCMI is a valid statistical technique for imputing missing categorical covariate data, particularly missing race data, and that it offers advantages with respect to the precision of estimates.


Introduction
Much research in the past decade has focused on inequalities in the health and healthcare of different racial/ethnic groups [1]. For example, studies in the Veterans Affairs (VA) health system have revealed disparities in medication adherence, surgery and invasive procedures, and other care processes [2]. In these types of health disparities research, race/ethnicity is the key covariate of interest. Race/ethnicity is also a potential adjustment factor in most analyses of healthcare data. Thus, accurate race/ethnicity information is imperative for these types of studies to lead to high-quality health services research.
However, analysis of race/ethnicity data is hampered by the high proportion of missing race/ethnicity data [3,4]. For example, in data from 1997 to 2005 used by Sohn et al. [3], 45% of Veterans had missing or unknown race/ethnicity in their records. Compounding the problem, race/ethnicity data are not usually missing completely at random (MCAR) as patients with higher degrees of comorbidity and service connectedness are less likely to have missing race/ethnicity [5]. Such missingness in race/ethnicity data can bias results in health disparity studies unless properly accounted for. On the other hand, our experience analyzing both local and national VA data (with over 20% missing race information) indicates both complete case analysis (CCA) and multiple imputation (MI) did not result in different inferences [6,7] despite the fact that CCA is only valid under MCAR and MI is more appropriate under missing at random (MAR).
Several methods of handling missing covariate data are available in the literature. The default analysis in many software programs is Complete Case Analysis (CCA), which requires the strong assumption of MCAR and is known to lead to biased statistical inference if MCAR is violated. Another approach available in most commercial statistical software packages is multiple imputation (MI), which in standard implementations uses the distribution of the observed data to estimate a set of plausible values for the missing data. It requires the assumption of MAR and is widely known to lead to unbiased estimates that reflect the uncertainty due to missingness [8,9]. The most commonly used imputation models for missing categorical data such as race are logistic regression and discriminant analysis models. Both are appropriate for imputing categorical variables with a monotone missing data pattern. The latter is appropriate when the predictors are multivariate normal with equal within-group covariance matrices. Related to MI is multiple imputation using chained equations (MICE) [10]. This is a fully conditional method in which the imputation model for each variable with missing values is specified as a conditional distribution, and draws from that conditional distribution are used to impute the missing values. Potential limitations are that draws from the conditional distributions may not always correspond to a draw from the joint distribution [11] and that the method lacks a theoretical basis [12].
Finally, log-linear multiple imputation (LLMI) uses a fully saturated log-linear imputation model which is a logistic regression model with all two-way and higher order interaction terms included [13,14].
Latent class multiple imputation (LCMI) is an alternative multiple imputation approach [15,16]. It uses a latent class model to estimate the joint distribution of the observed data. It provides a latent classification of subjects (latent classes) which are explained by the relationship among the observed categorical variables. Data for subjects in each class are then used to impute the missing values of subjects within the same class. LCMI has been shown to produce estimates with minimal bias and smaller standard errors in the analysis of data with missing categorical covariates [15]. This report describes an empirical comparison of the performance of CCA, MI, LLMI and LCMI in dealing with missing race/ethnicity data in a dataset similar to those used by many health services researchers.

Motivating Data Example
Our data included a retrospective cohort of 13,705 veterans with type 2 diabetes recruited from a tertiary center and five community-based outpatient clinics in the southeastern United States. The diagnosis of diabetes was based on a previously validated algorithm for VA data [17]. Subjects were followed from September 1996 until death, loss to follow-up, or May 2006. Among these subjects, 77% had observed race/ethnicity information, while the remaining 23% did not have known race/ethnicity data. The details of the creation of the study data set are provided in earlier papers [6,7]. The study was approved by our institutional review board (IRB) and local VA Research and Development committee.
Outcome measures: The outcome variable was annual mean HbA1c calculated from measurements taken in three-month intervals. It was categorized as good control (HbA1c ≤ 8%) or poor control (HbA1c>8%). When HbA1c values were not observed in a 3-month period, they were considered as missing values. For subjects with two or more HbA1c values in a given three-month time interval, the most recent HbA1c for that interval was used.
Predictor variables: Other risk factors (or covariates) included age, gender, race/ethnicity, marital status, employment status, and comorbidities. Based on the age distribution in the VA, age was categorized into four groups (<50, 50-64, 65-74, 75 and above). Race/ethnicity was classified as non-Hispanic white, non-Hispanic black, Hispanic/Other, or unknown race/ethnicity. Marital status was classified as never married, married, or separated/widowed/divorced. Employment was classified as employed, not employed, or retired. Comorbidity variables (Table 1) were defined based on enhanced ICD-9 codes using validated algorithms [18].

Simulation Study
Additionally, we used a limited simulation study to demonstrate and compare the different methods. The details of the simulation study design are as follows. We generated binary data to study how well LCMI performs compared with MI, LLMI, and CCA. In particular, the simulation study addresses how the estimation of a few latent classes can improve upon standard multiple imputation techniques. We simulated data with a binary outcome variable $Y$, where $\Pr(Y=1)$ was determined by the logistic regression model $\text{logit}(\Pr(Y=1 \mid X)) = \beta_0 + \beta_1 X_1 + \dots + \beta_5 X_5$. Each $X_j$ is also a binary variable: $X_1$ denotes race (0=white, 1=black) and can be missing, and the other four covariates, $X_2$ to $X_5$, were generated jointly with $X_1$ and are potential confounders of the relationship between $X_1$ and $Y$. For the simulations, we set $\beta_1 = 0.69$, which is equivalent to an odds ratio of 2.0. After generating complete data according to the above model, data sets with missing race ($X_1$) were generated with 30% and 50% missing proportions. The probability of missingness was $\text{logit}(\Pr(M=1)) = \gamma_0 + \gamma_1 Y + \gamma_2 X_2$, where $M$ is an indicator of missing $X_1$. We considered a range of missingness scenarios based on the missing data mechanism classifications in Little and Rubin (2002), namely missing completely at random (MCAR) and missing at random (MAR). Further specification of the missingness within MAR was based on the dependence of the probability of missing $X_1$ on another covariate $X_2$ or on the outcome $Y$. That is, missing $X_1$ may depend on $X_2$ only (MAR($X_2$)), on both $X_2$ and $Y$ (MAR($X_2$, $Y$)), or on $Y$ only (MAR($Y$)). To create 30% missingness under MCAR, MAR($Y$), MAR($X_2$), and MAR($X_2$, $Y$), we used $\gamma_0 = (-0.9, -1.9, -2.0, -2.2)$, $\gamma_1 = (0, 1.8, 0, 0.8)$, and $\gamma_2 = (0, 0, 0.8, 0.8)$.
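The data-generating scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the text fixes only $\beta_1 = 0.69$, so the remaining coefficients in `BETA` are hypothetical; the covariates are drawn as independent Bernoulli(0.5) variables, whereas the study generated $X_1$ to $X_5$ jointly; and the $\gamma_1$ values are read as (0, 1.8, 0, 0.8) across the four mechanisms, which is an interpretation of the listed values.

```python
import math
import random


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


# (gamma0, gamma1, gamma2) for roughly 30% missing X1 under each mechanism.
# gamma1 multiplies Y and gamma2 multiplies X2; the gamma1 values are an
# assumed reading of the text.
MECHANISMS = {
    "MCAR": (-0.9, 0.0, 0.0),
    "MAR(Y)": (-1.9, 1.8, 0.0),
    "MAR(X2)": (-2.0, 0.0, 0.8),
    "MAR(X2,Y)": (-2.2, 0.8, 0.8),
}

# Hypothetical coefficients: only beta1 = 0.69 (odds ratio 2.0) is from the text.
BETA = (-1.0, 0.69, 0.3, 0.3, 0.3, 0.3)


def simulate(n, mechanism, seed=0):
    """Generate n subjects with binary Y, binary X1..X5, and missing X1."""
    rng = random.Random(seed)
    g0, g1, g2 = MECHANISMS[mechanism]
    rows = []
    for _ in range(n):
        # Assumption: independent covariates (the study generated them jointly).
        x = [1 if rng.random() < 0.5 else 0 for _ in range(5)]
        eta = BETA[0] + sum(b * xj for b, xj in zip(BETA[1:], x))
        y = 1 if rng.random() < expit(eta) else 0
        # Missingness indicator M for X1 depends on Y and X2 only.
        p_miss = expit(g0 + g1 * y + g2 * x[1])
        m = 1 if rng.random() < p_miss else 0
        rows.append({"y": y, "x": x, "m": m, "x1_obs": None if m else x[0]})
    return rows


data = simulate(200, "MAR(Y)", seed=1)
miss_rate = sum(r["m"] for r in data) / len(data)
```

Because missingness depends only on $Y$ and $X_2$, which are always observed, every mechanism in `MECHANISMS` is MCAR or MAR by construction.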

Missing Data Analysis Using LCMI
LCMI is a multiple imputation approach in which the imputation model is based on a latent class model. The latent class imputation model provides a latent classification of subjects (latent classes) that gives a sufficient representation of the joint distribution, explains the complex relationships among the observed categorical variables, and is used as a tool for density estimation [16]. That is, the observed data of subjects within each of the estimated classes are used to impute the missing values of subjects within the same class [15,16]. Advantages of LCMI include the ability to model complex associations between categorical variables, a distinct advantage over LLMI. The details of this approach are in Gebregziabher and Desantis (2010).
LCMI is implemented in four steps as follows. Let $X_{i,\text{obs}}$ denote the observed data for subject $i$ (all fully observed variables in Table 1, including the outcome variable) and $K_i$ the estimated latent class.
Step 1: Fit the latent class model to the observed data, $X_{i,\text{obs}}$.
Step 2: Sample from the posterior probability distribution of latent class given the observed data, $P(K_i = k \mid X_{i,\text{obs}} = x_{i,\text{obs}})$.
Step 3: Sample from the distribution of the missing data conditional on class, $P(X_{i,\text{mis}} \mid K_i = k)$.
Step 4: Use within-class posterior sampling to impute the missing category or value of $X$.
In our case, the latent class model was fitted using PROC LCA Version 1.1.5 [21,22]. PROC LCA is a SAS procedure for latent class analysis developed for SAS Version 9.2 for Windows and is used to estimate latent classes measured by categorical indicators. We used a full Bayesian MCMC approach to sample from the posterior distribution of the missing data model. Finally, after imputing the missing categories, we used likelihood and/or GEE methods to estimate the parameters (regression coefficients and their corresponding standard errors) of the HbA1c models as described below.
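The four steps can be illustrated with a small, self-contained Python sketch for binary items. It is not the PROC LCA and Bayesian MCMC implementation used in this study: here the latent class model is fitted by a plain EM algorithm, and imputations are drawn from the fitted point estimates, so the sketch ignores parameter uncertainty. All function names and the toy data are hypothetical.

```python
import random


def posterior(row, pi, theta):
    """P(K = k | observed items of one subject); None entries are skipped."""
    weights = []
    for k in range(len(pi)):
        lik = pi[k]
        for j, v in enumerate(row):
            if v is not None:
                lik *= theta[k][j] if v == 1 else 1.0 - theta[k][j]
        weights.append(lik)
    total = sum(weights)
    return [w / total for w in weights]


def fit_latent_class(data, n_classes, n_iter=100, seed=0):
    """Step 1: EM fit of class sizes (pi) and item probabilities (theta)."""
    rng = random.Random(seed)
    n_items = len(data[0])
    pi = [1.0 / n_classes] * n_classes
    theta = [[rng.uniform(0.3, 0.7) for _ in range(n_items)]
             for _ in range(n_classes)]
    for _ in range(n_iter):
        post = [posterior(row, pi, theta) for row in data]  # E-step
        for k in range(n_classes):                          # M-step
            pi[k] = sum(p[k] for p in post) / len(data)
            for j in range(n_items):
                num = den = 0.0
                for row, p in zip(data, post):
                    if row[j] is not None:
                        num += p[k] * row[j]
                        den += p[k]
                # Light smoothing keeps probabilities strictly inside (0, 1).
                theta[k][j] = (num + 0.5) / (den + 1.0)
    return pi, theta


def impute(data, pi, theta, rng):
    """Steps 2-4: draw a class per subject, then draw missing items within it."""
    completed = []
    for row in data:
        post = posterior(row, pi, theta)
        k = rng.choices(range(len(pi)), weights=post)[0]
        completed.append([
            v if v is not None else (1 if rng.random() < theta[k][j] else 0)
            for j, v in enumerate(row)
        ])
    return completed


# Toy data: two true classes, five binary items, item 0 missing about 30%.
rng = random.Random(42)
toy = []
for _ in range(300):
    p = 0.8 if rng.random() < 0.5 else 0.2
    row = [1 if rng.random() < p else 0 for _ in range(5)]
    if rng.random() < 0.3:
        row[0] = None
    toy.append(row)

pi, theta = fit_latent_class(toy, n_classes=2)
imputations = [impute(toy, pi, theta, rng) for _ in range(5)]
```

Each element of `imputations` is one completed data set; in a full analysis the substantive model would be fitted to each completed data set and the results combined across imputations.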
For each missing data analysis method, we performed two sets of analyses. First, for the cross-sectional data analysis, we used logistic regression (PROC LOGISTIC, SAS 9.1.3) to study the association between the binary HbA1c outcome (1 = HbA1c > 8%, 0 = HbA1c ≤ 8%) and race/ethnicity, with and without adjusting for demographic and clinical variables. For each subject, good HbA1c control was defined as a mean HbA1c of 8% or less over the entire study period.
Second, for the longitudinal data analysis, we used a generalized estimating equations (GEE) approach [19,20] with PROC GENMOD, SAS 9.1.3, to assess whether HbA1c control varied by race/ethnicity. Both unadjusted and covariate-adjusted models were fitted with HbA1c control at each quarterly visit as the response variable and with time and race/ethnicity as the primary variables of interest. The final model was adjusted for demographic variables and comorbidities.
Table 1 shows the patient characteristics for the 13,705 Veterans with type 2 diabetes included in the study sample, stratified by whether race/ethnicity information was available. The patients with missing race data had lower levels of comorbid conditions than those with race data. For instance, the prevalence of cancer was 11.6% in those without race data compared to 16% in those with race data. Similarly, the prevalence of CHF was 9.6% in those without race data compared to 15% in those with race data. Similar trends were observed for most of the medical and psychiatric comorbidities. Most Veterans (76.4%) had HbA1c values ≤ 8%, and the proportions were 75.6% and 78.6% in those with and without race data, respectively. The mean HbA1c was 7.3 (SD 1.5) in those with race information and 7.2 (SD 1.4) in those without.
Table notes: CCA = complete case analysis; MI = multiple imputation with a logit imputation model; LCMI-k = latent class multiple imputation with k (k = 2, ..., 5) classes; LLMI = log-linear multiple imputation. Longitudinal = estimates from PROC GENMOD with a logit link for the repeated-measures outcome over time; cross-sectional = logistic regression with dichotomized mean HbA1c. The middle two columns are OR and SE estimates from the analysis of the cross-sectional data with binary HbA1c, analyzed using PROC LOGISTIC in SAS 9.2. The OR and SE columns show the odds ratios and their corresponding standard error estimates for NHB and Other races, with NHW as the reference category.
The odds ratio estimates are relatively similar across the different methods, ranging between 1.75 using LLMI to 1.83 using LCMI-4 for NHB and between 0.84 using CCA and using LLMI for Others. The standard error estimates for CCA are slightly higher than those of the imputation methods. For example, in the cross-sectional setting, the SE for the odds ratio comparing NHB and NHW using CCA was 0.052, and this was reduced to 0.046 using LCMI-5, representing a 12% improvement in precision. The same trend is observed in comparing Other to NHW in both the cross-sectional and longitudinal settings. In almost all cases, LCMI provided more precise estimates than the other methods. In all scenarios, LCMI-5 provided lower standard error estimates.
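The 12% figure quoted above corresponds to the relative reduction in the standard error when moving from CCA to LCMI-5. The short Python check below reproduces that arithmetic; the function name is hypothetical, and the calculation assumes that "improvement in precision" means the relative SE reduction.

```python
def relative_se_reduction(se_reference, se_new):
    """Fractional reduction in standard error relative to the reference method."""
    return (se_reference - se_new) / se_reference


# SEs for the NHB vs. NHW odds ratio in the cross-sectional setting.
reduction = relative_se_reduction(0.052, 0.046)
print(f"{reduction:.1%}")  # prints 11.5%, which rounds to the reported 12%
```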

Results
The simulation study results for the 30% and 50% missing data scenarios are reported in Table 3 and Table 4, respectively. Both absolute bias (estimated value minus true value) and asymptotic standard error are reported. Under the 30% scenario, the biases for CCA were not very large and were similar across the MCAR and MAR mechanisms. However, the bias for CCA was two to three times that of MI, LLMI, or LCMI. When the level of missingness was increased to 50%, the bias in CCA increased substantially, to as much as 25%, while the bias in LLMI and LCMI remained low. In summary, the simulation results indicate that LCMI leads to parameter estimates that are less biased and have lower standard errors than those from CCA, MI, and LLMI. A more detailed and rigorous simulation study of these comparisons is reported elsewhere [12].
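As an illustration of how the summary measures in such tables can be computed from simulation replicates, the Python sketch below takes the estimated log odds ratios and their asymptotic standard errors across replicates and returns the bias against the true value of 0.69, the bias as a fraction of the true value, and the mean ASE. The replicate values are made up for illustration and are not results from this study, and the function name is hypothetical.

```python
import statistics

TRUE_LOG_OR = 0.69  # true log odds ratio used in the simulation


def summarize(estimates, std_errors, truth=TRUE_LOG_OR):
    """Bias (mean estimate minus truth), relative bias, and mean ASE."""
    mean_estimate = statistics.fmean(estimates)
    bias = mean_estimate - truth
    return {
        "bias": bias,
        "relative_bias": bias / truth,
        "mean_ase": statistics.fmean(std_errors),
    }


# Made-up replicate results, for illustration only.
estimates = [0.75, 0.66, 0.72, 0.70]
std_errors = [0.31, 0.30, 0.33, 0.32]
summary = summarize(estimates, std_errors)
```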

Discussion
Health services researchers who examine large datasets require complete information on covariates in order to perform accurate analyses. For example, in health disparities research, race/ethnicity is the key covariate of interest. However, race data is substantially missing in some VA data sets as well as other data sources. Thus, robust statistical techniques are needed to deal with the problem of missing race/ ethnicity data in studies of health disparities and in other applications. This report provides empirical evidence on the performance of multiple imputation techniques with varying imputation models in dealing with missing race data.
Imputation techniques are preferable to other approaches in many cases. It is often invalid to assume that race/ethnicity data are missing completely at random [5]. In the case of VA analyses, supplementing missing race/ethnicity information with data from other sources such as Medicare is not always possible. Moreover, it may result in higher rates of misclassification in non-Black minorities [23-25].
Among imputation methods, LCMI offers some advantages over MI and LLMI. Many datasets like ours have multiple variables with missing categorical data, which makes it difficult to satisfy the assumption of a multivariate normal distribution needed to perform MI. In LCMI, the imputation is based on a latent class model, which does not require the same distributional assumptions. Thus, LCMI represents an alternative to MI that may perform better in certain datasets. The LLMI approach requires a saturated imputation model that includes all higher-order associations among categorical variables. Because of this, LLMI may be computationally infeasible even for a moderately large number of variables. In contrast, LCMI represents a more computationally efficient approach. Importantly, LCMI performed comparably to these other techniques in a simulation study while providing more precise estimates with lower standard errors [15]. The main limitation of LCMI is that there is no established approach for determining how many latent classes are sufficient for a latent class imputation model to approximate the joint distribution of the variables in the data well. In the limited studies in the literature, it has been recommended to use as many latent classes as possible [15,16].
Table 4: Bias in the mean log-odds ratio (Bias = estimated mean - 0.69) and asymptotic standard error (ASE) estimates of a logistic regression model from a simulation study with 50% missing race data (n=200, true log odds ratio=0.69).
In summary, LCMI represents a valid statistical approach for the imputation of missing categorical data such as race/ethnicity data that may be of use to a variety of health services researchers working with large administrative datasets.

Disclosure
This study was supported by Grant # REA 08-261, Center for Disease Prevention and Health Interventions for Diverse Populations funded by Veterans Affairs Health Services Research and Development (PI -Leonard Egede).