General Multiple Mediation Analysis With an Application to Explore Racial Disparities in Breast Cancer Survival

Mediation refers to the effect transmitted by mediators that intervene in the relationship between an exposure and a response variable. Mediation analysis has been broadly studied in many fields. However, it remains a challenge for researchers to differentiate the individual effects of multiple mediators. This paper proposes general definitions of mediation effects that are consistent for all types (categorical or continuous) of response, exposure, or mediation variables. With these definitions, multiple mediators can be considered simultaneously, and the indirect effects carried by individual mediators can be separated from the total effect. Moreover, the derived mediation analysis can be performed with general predictive models. For linear predictive models with continuous mediators, we show that the proposed method is equivalent to the conventional coefficients product method. We also establish the relationship between the proposed definitions of direct or indirect effect and the natural direct or indirect effect for binary exposure variables. The proposed method is demonstrated by both simulations and a real example examining racial disparities in three-year survival rates for female breast cancer patients in Louisiana. Journal of Biometrics & Biostatistics


Introduction
Mediation effect refers to the effect conveyed by an intervening variable to an observed relationship between an exposure and a response variable of interest. The concept of mediation originated in psychological research. In 1929, Robert S. Woodworth introduced the expression Stimulus-Organism-Response to describe the pathway between stimulus and response. This concept has then been expanded to many fields such as social science, prevention study, behavior research, and epidemiology. Investigators are interested in discovering not only the relationship between the exposure variable and response variable but also the mechanism of risk factors mediating the relationship. For example, it has been well established that low socioeconomic status is associated with poor health status. To reduce this health disparity, investigators would need to quantify the mediation effects from different risk factors so that efficient interventions can be carried out [1].
There are generally two settings for mediation analysis. One is based on linear models. According to Baron and Kenny [2], three conditions are required to establish a mediation: (a) the exposure variable (X) is significantly associated with the response variable (Y); (b) the presumed mediator (M) is significantly related to X; and (c) M significantly relates to Y controlling for X. When the relationships are represented by linear regression models, the indirect effect is typically measured by one of two methods: the difference in the coefficients of the exposure variable when Y is regressed on it with or without controlling for the mediator (abbreviated as `CD' hereafter) [3][4][5]; or the product of the coefficient of X when M is regressed on X and the coefficient of M in explaining Y controlling for X (abbreviated as `CP') [6]. MacKinnon et al. [7] showed that when the relationships among mediators, exposure and response variables are fitted with linear regression models, the mediation effects measured by CD and CP are equivalent. However, neither CD nor CP is easily adaptable to separate multiple mediation effects when Y or M is not continuous or when the relationships cannot be fitted with linear regressions [8].
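As a quick numerical illustration of the CD and CP measures for a single continuous mediator, the following Python sketch fits the three least-squares regressions on simulated data and compares the two quantities. The data-generating coefficients, the sample size, and the helper names `ols_simple` and `ols_two` are assumptions introduced for illustration only; they are not taken from the paper.

```python
import random

def ols_simple(x, y):
    """Least-squares slope of y on x (model with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def ols_two(x, m, y):
    """Least-squares slopes (on x, on m) of y on x and m (model with intercept)."""
    n = len(x)
    mx, mm, my = sum(x) / n, sum(m) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    smm = sum((a - mm) ** 2 for a in m)
    sxm = sum((a - mx) * (b - mm) for a, b in zip(x, m))
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    smy = sum((a - mm) * (b - my) for a, b in zip(m, y))
    det = sxx * smm - sxm ** 2
    coef_x = (sxy * smm - smy * sxm) / det
    coef_m = (smy * sxx - sxy * sxm) / det
    return coef_x, coef_m

# Simulated data with hypothetical coefficients (illustration only).
rng = random.Random(0)
n = 500
x = [rng.gauss(0, 1) for _ in range(n)]
m = [0.5 + 0.8 * xi + rng.gauss(0, 1) for xi in x]
y = [1.0 + 0.6 * mi + 0.3 * xi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

c_total = ols_simple(x, y)       # coefficient of X without the mediator
c_direct, b = ols_two(x, m, y)   # coefficients of X and M with the mediator
a = ols_simple(x, m)             # coefficient of X when M is regressed on X

cd = c_total - c_direct          # coefficient-difference (CD) measure
cp = a * b                       # coefficient-product (CP) measure
print(f"CD = {cd:.6f}, CP = {cp:.6f}")
```

For ordinary least squares with one mediator the two measures agree up to floating-point error, which is the equivalence attributed above to MacKinnon et al. [7].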
The counterfactual framework is the other popular setting for mediation analysis [9][10][11][12]. Let $Y_i(X)$ denote the post-treatment potential outcome if subject i is exposed to X. When the exposure changes from x to x* (e.g., 0 or 1 for binary X), only one of the responses, $Y_i(x)$ or $Y_i(x^*)$, is observed. The causal effect of treatment on the response variable for subject i is defined as $Y_i(x)-Y_i(x^*)$. It is impossible to estimate the individual causal effect since the estimation depends on a non-observable response. Holland [13] proposed, instead of estimating the causal effect on a specific subject, to estimate the average causal effect over a pool of subjects, $E(Y_i(x)-Y_i(x^*))$. If the subjects are randomly assigned to control or treatment groups, the average causal effect equals the expected conditional causal effect, $E(Y_i \mid X=x)-E(Y_i \mid X=x^*)$.
Denote $M_i(X)$ as the potential value of M when subject i is exposed to treatment X. Let $Y_i(x,m)$ be the potential outcome of subject i for a given x and m. It has been established that the total effect of X on Y when X changes from x to x* is $Y_i(x,M_i(x))-Y_i(x^*,M_i(x^*))$. The conventional mediation analysis decomposes the total effect into direct effect and indirect effect. Namely, the direct effect is the effect of X directly on Y, while the indirect effect is the effect of X on Y through M. Robins and Greenland [9] introduced the concepts of controlled direct effect, defined as $Y_i(x,m)-Y_i(x^*,m)$, and of natural direct effect, defined as $\varsigma_i(x)=Y_i(x,M_i(x^*))-Y_i(x^*,M_i(x^*))$. The difference between controlled and natural direct effects is that the controlled direct effect is measured when M is fixed at m, whereas for the natural direct effect, M is random as if the actual exposure were x*. The difference between the total effect and the natural direct effect is defined as the natural indirect effect, $\delta_i(x)=Y_i(x,M_i(x))-Y_i(x,M_i(x^*))$. In comparison, the difference between a total effect and a controlled direct effect cannot in general be interpreted as an indirect effect [13,14]. A common restriction for definitions of controlled and natural direct effects is that the exposure levels x and x* have to be preset. When the relationship among variables cannot be assumed linear, it is hard to choose representative exposure levels, especially if the exposure variable is multi-categorical or continuous.
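The decomposition described above can be traced with a small worked example. The sketch below uses made-up potential-outcome functions for a single subject (the functions `mediator` and `outcome` and all numeric values are hypothetical) and computes the total, natural direct, natural indirect, and controlled direct effects for x=1 and x*=0.

```python
def mediator(x):
    """Hypothetical potential mediator value M_i(x)."""
    return 1.0 + 2.0 * x

def outcome(x, m):
    """Hypothetical potential outcome Y_i(x, m) with an exposure-mediator interaction."""
    return 3.0 * x + 0.5 * m + 0.25 * x * m

x, x_star = 1.0, 0.0

# Total effect: Y(x, M(x)) - Y(x*, M(x*))
total = outcome(x, mediator(x)) - outcome(x_star, mediator(x_star))

# Natural direct effect: mediator held at the value it would take under x*
natural_direct = outcome(x, mediator(x_star)) - outcome(x_star, mediator(x_star))

# Natural indirect effect: total effect minus natural direct effect
natural_indirect = total - natural_direct

def controlled_direct(m):
    """Controlled direct effect with the mediator fixed at m."""
    return outcome(x, m) - outcome(x_star, m)

print(total, natural_direct, natural_indirect)
print(controlled_direct(1.0), controlled_direct(3.0))
```

Because of the interaction term, the controlled direct effect changes with the level at which the mediator is fixed, so subtracting it from the total effect does not give a single well-defined indirect effect in this example.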
In this paper, we propose general definitions of mediation effects. The derived mediation analysis is promising in that the indirect effects contributed by different mediators are separable, which enables the comparison of relative mediation effects carried by different third variables. Furthermore, the mediation analysis is generalized so that we can deal with binary, multicategorical or continuous exposure, mediator and response variables. The method allows general predictive models, in addition to general linear models, to be used to fit variable relationships. We show that the proposed method is equivalent to the conventional CD and CP methods in single continuous mediator cases. We also establish the relationship between the proposed definitions of direct or indirect effect, and the natural direct or indirect effects in single binary mediator cases.
The paper is organized as follows. In Section 2, we present a motivating example that explores racial disparities in female breast cancer three-year survival rates in Louisiana. Section 3 proposes general definitions of mediation effects. In Sections 4 and 5, we demonstrate the mediation analysis with linear regression and logistic regression, which deal with different types of mediators. In Section 6, we propose algorithms for mediation analysis with non/semi-parametric predictive models and binary exposures. Statistical inference on the indirect effect by the Delta and bootstrap methods is discussed in Section 7. In Section 8, the proposed method is applied to the motivating example, and a simulation study is used to demonstrate its identifiability and sensitivity. Finally, conclusions and future research are discussed in Section 9.

Motivating Example
Breast cancer is the most common cancer and the second leading cause of cancer death among American women of all races. Owing to advanced screening technologies for early stage breast cancer detection, as well as improved treatment modalities, the overall death rate of breast cancer in the US has decreased in recent years. However, when compared with white women, African-American women have a higher death rate from breast cancer despite the fact that their incidence rate is lower. Previous studies have found that more advanced and aggressive tumors and less than optimal treatment may explain the lower survival rates among black women [15][16][17]. There are many other contributing factors, such as patient demographic information and health care provider information. However, the relative effect contributed by each factor to racial disparities in breast cancer survival cannot be differentiated due to the lack of comprehensive data and the limitations of current analysis methods (discussed in Section 1).
The Institute of Medicine (IOM) reported in 1999 that cancer patients did not consistently receive the care known to be effective for their conditions [18]. In response to the IOM reports, the National Program of Cancer Registries (NPCR) of the Centers for Disease Control and Prevention (CDC) established a series of Pattern of Care (PoC) studies. The PoC-Breast and Prostate (BP) study collected information from breast and prostate cancer patients. The routinely collected registry data were supplemented by re-abstracting hospital records and obtaining information about adjuvant treatment and comorbidity from physicians and outpatient facilities. We use the data set collected by the Louisiana Tumor Registry on about 1453 non-Hispanic white and black women diagnosed with malignant breast cancer in 2004 in Louisiana. All patients were followed up for five years or until death, whichever was shorter. We found that the odds of dying of breast cancer within three years for black women were significantly higher than those for white women (OR=2.03; CI: (1.468, 2.809)). To identify and differentiate attributable risk factors for the racial disparities, we developed a novel mediation analysis method.
A third variable can intervene in the relationship between an explanatory variable and a response variable through many forms. In this paper, we focus on the two forms: mediation or confounding. Although they are conceptually distinctive, MacKinnon et al. [19] claimed that these effects are statistically similar in the sense that all of them measure the change of association between the explanatory and response variables when considering a third variable. Therefore the statistical methods developed for mediation framework can be used for confounding effect analysis, although the scientific interpretations of the analysis might be different. In this paper, we call the effect carried by a third variable indirect effect. We demonstrate the proposed mediation analysis method to differentiate the indirect effects from a wide range of potential mediators/confounders that account for racial disparities in breast cancer survival. Multiple potential third variables including patient residence census tract level and individual level variables are considered. Details of the analysis are in Section 8.1.

General Mediation Analysis
We would like to measure the direct effect of a variable X on the response variable Y and the indirect effect of X on Y through the third variable M. What we call a mediator in this paper is treated as equivalent to a confounder in confounding analysis. When there is more than one mediator, we should be able to differentiate the indirect effect from each mediator. For these objectives, we first propose general definitions of mediation effects in the counterfactual framework. Figure 1 shows the conceptual model, where $M_j$ is the $j$th mediator/confounder and Z is the vector of other independent explanatory variables that directly relate to Y but do not interact with X. We are interested in exploring the mechanism of the changes in the response, Y, when X is altered. Let $Y(x,m)$ be the potential outcome if the exposure X is at x and M is at m. Denote the domain of X as $\mathrm{dom}_X$. Let $u^*$ (>0) be the infimum positive unit such that there is a biggest subset of $\mathrm{dom}_X$, denoted as $\mathrm{dom}_{X^*}$, in which any x also satisfies $x+u^* \in \mathrm{dom}_X$, and such that $\mathrm{dom}_X=\mathrm{dom}_{X^*} \cup \{x+u^* \mid x \in \mathrm{dom}_{X^*}\}$. If $u^*$ exists, it is unique. Note that if X is continuous, $u^*=0^+$ and $\mathrm{dom}_X=\mathrm{dom}_{X^*}$; if X is an ordered binary variable, taking the values 0 or 1, then $u^*=1$ and $\mathrm{dom}_{X^*}=\{0\}$. For random variables A, B and C, $A \perp\!\!\!\perp B \mid C$ denotes that given C, A is conditionally independent of B.
For any function $g(x)$, denote $E_x g=\int_{x\in \mathrm{dom}_x} g \cdot f(x)\,dx$ and $E_{x\mid y}\, g=\int_{x\in \mathrm{dom}_x} g \cdot f(x\mid y)\,dx$ when x is continuous; and $E_x g=\sum_{x\in \mathrm{dom}_x} g \cdot f(x)$ and $E_{x\mid y}\, g=\sum_{x\in \mathrm{dom}_x} g \cdot f(x\mid y)$ when x is discrete, where $f(x)$ is the density of x and $f(x\mid y)$ is the conditional density of x given y.

Total effect
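We define the total effect in terms of the average changing rate of Y with the treatment/exposure, X.
Definition 3.1 Given Z, the total effect (TE) of X on Y at $X=x^*$ is defined as the change in $E(Y(X)\mid Z)$ when X changes by a $u^*$ unit:
$$TE_{\mid Z}(x^*)=\lim_{u\to u^*}\frac{E(Y(x^*+u)\mid Z)-E(Y(x^*)\mid Z)}{u}.$$
The average total effect is defined as $ATE_{\mid Z}=E_{x^*}\left[TE_{\mid Z}(x^*)\right]$, where the expectation is taken over $x^* \in \mathrm{dom}_{X^*}$.
For the identification of total effects, we need two assumptions, which were also stated in VanderWeele and Vansteelandt [14]. With the assumptions, it is easy to see that
$$TE_{\mid Z}(x^*)=\lim_{u\to u^*}\frac{E(Y\mid Z,X=x^*+u)-E(Y\mid Z,X=x^*)}{u}.$$
Note that if X is binary, $\mathrm{dom}_{X^*}=\{0\}$, so the (average) total effect is $E(Y\mid Z,X=1)-E(Y\mid Z,X=0)$, which is equivalent to the commonly used definition of total effect in the literature for binary outcomes.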
Compared with conventional definitions of the average total effect that look at the differences in expected Y when X changes from x to x*, we define total effect based on the rate of change. The motivation is that the effect will not change with either the unit or the changing unit (x*-x) of X, thus generalizing the definitions of mediation effects, which will be consistent for exposure variables measured at any scale (binary, multi-categorical, or continuous). The benefits of the modification will be further discussed in Section 4 after the average direct and indirect effects are similarly defined.

Direct and indirect effects
We define the direct effect by fixing M at its marginal distribution, $f(M\mid Z)$, while X shifts.
Definition 3.2 Given Z, the direct effect (DE) of X on Y not from $M_j$ at $X=x^*$ is defined in the same way as the total effect in Definition 3.1, except that $M_j$ is drawn from its marginal distribution $f(M_j\mid Z)$ instead of changing with X.
For the identifiability of the DE, an additional assumption (A3) is needed. Note that if there are overlapping pathways through multiple mediators, A3 is violated. We recommend using the sensitivity analysis proposed by Imai and Yamamoto [20] to assess the robustness of empirical results. If a causal relationship exists among mediators, it is necessary to combine those mediators. For example, if the real relationship is $X \to M_{t1} \to M_{t2} \to Y$, we should combine the effects from $M_{t1}$ and $M_{t2}$ and treat them as a single pathway from X to Y.
It is easy to show that when X is binary, the direct effect not from $M_j$ is the expected difference in Y between X=1 and X=0 when $M_j$ is drawn from its marginal distribution. Mathematically, the definitions of TE and DE differ in that, for given Z, the former takes the conditional expectation over $M_j$ for given $x^*$ while the latter takes the marginal expectation. The direct effect measures the changing rate in the potential outcome with X, where $M_j$ is fixed at its marginal level, while all the other mediators can change with X. As in Figure 1, DE not from $M_j$ is calculated as TE but with the line between X and $M_j$ broken. Therefore, we call the changing rate the direct effect of X on Y not from $M_j$, which is the summation of the direct effect of X on Y and the indirect effects through mediators other than $M_j$. With Definitions 3.1 and 3.2, the definition of indirect effect is straightforward.
Definition 3.3 Given Z, the indirect effect (IE) of X on Y through $M_j$ at $X=x^*$ is the total effect minus the direct effect not from $M_j$: $IE_{M_j\mid Z}(x^*)=TE_{\mid Z}(x^*)-DE_{\setminus M_j\mid Z}(x^*)$.
If we can assume that M includes all potential mediators, $ADE_{\mid Z}$ is the average direct effect of X on Y. Note that the direct effect at a given exposure level, $DE_{\mid Z}(x^*)$, is sometimes more relevant than $ADE_{\mid Z}$ when the direct effects from exposure are not constant across exposure levels.
Based on the definitions, mediation analyses can be generalized to different types of exposure, outcome or mediator variables, whose relationships can be modeled by different predictive models. Moreover, the indirect effect of each individual mediator can be differentiated when there are multiple mediators.

Multiple Mediation Analysis in Linear Regressions
In this section, we show that for continuous mediators and outcomes that are modeled with linear regressions, the proposed average indirect effects are identical to those measured by the CD and CP methods. We also show that the assumption of no exposure-mediator interaction needed for the controlled direct effect is not required for the measurements of mediation effects defined in this paper. For simplicity, we ignore the independent variable(s) Z in the following discussion. Without loss of generality, assume that there are two mediators and that the true relationships among variables are:
$$M_{1i}=a_{01}+a_1X_i+\varepsilon_{1i}, \tag{1}$$
$$M_{2i}=a_{02}+a_2X_i+\varepsilon_{2i}, \tag{2}$$
$$Y_i=b_0+b_1M_{1i}+b_2M_{2i}+cX_i+\varepsilon_{3i}, \tag{3}$$
where $(\varepsilon_{1i},\varepsilon_{2i})$ are independent and identically distributed (iid) bivariate normal and are independent of $\varepsilon_{3i}$, which are also iid.
Lemma 4.1 The average total effect of X on Y is $a_1b_1+a_2b_2+c$, among which the average indirect effect through $M_1$ is $a_1b_1$ and that through $M_2$ is $a_2b_2$; the average direct effect from X is c. Of all the effect from X to Y, fraction $a_1b_1/(a_1b_1+a_2b_2+c)$ is from $M_1$; fraction $a_2b_2/(a_1b_1+a_2b_2+c)$ is from $M_2$; and fraction $c/(a_1b_1+a_2b_2+c)$ is directly from X.
These measurements of average indirect effects are identical to those from the CD and CP methods in linear regressions. Note that correlations among mediators are allowed here. Moreover, compared with the controlled direct effects, we do not require the no exposure-mediator interaction assumption. To illustrate this, assume that there is an interaction effect of $M_1$ and X on Y, so that equation (3) should be $Y_i=b_0+b_1M_{1i}+b_2M_{2i}+cX_i+dX_iM_{1i}+\varepsilon_{3i}$. Based on the models, we have the following lemma for the mediation effects.

Lemma 4.2
The total effect of X on Y at X=x is a 1 b 1 +a 2 b 2 +c+a 01 d+2a 1 dx, among which the indirect effect through M 1 is a 1 b 1 +a 1 dx; and that through M 2 is a 2 b 2 . The direct effect from X is c+a 01 d+a 1 dx.
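The expressions in Lemma 4.2 can be checked numerically. The sketch below assumes linear mediator models with means a01 + a1·x for M 1 and a02 + a2·x for M 2 (the intercept name a02 and every numeric value are hypothetical), substitutes them into the outcome model with the X·M 1 interaction, and compares a finite-difference rate of change of E(Y | X=x) with the closed-form total effect.

```python
# Hypothetical parameter values; a02 is an assumed intercept name for the M2 model.
a01, a1 = 0.5, 0.8
a02, a2 = -0.2, 0.4
b0, b1, b2, c, d = 1.0, 0.6, 0.3, 0.2, 0.1

def expected_y(x):
    """E(Y | X = x) after substituting the mediator means into the outcome model."""
    m1 = a01 + a1 * x
    m2 = a02 + a2 * x
    return b0 + b1 * m1 + b2 * m2 + c * x + d * x * m1

def total_effect(x):
    return a1 * b1 + a2 * b2 + c + a01 * d + 2 * a1 * d * x

def indirect_m1(x):
    return a1 * b1 + a1 * d * x

def indirect_m2(x):
    return a2 * b2

def direct(x):
    return c + a01 * d + a1 * d * x

x0, h = 1.5, 1e-4
# Central difference approximates the rate of change of E(Y | X = x) at x0.
numeric_rate = (expected_y(x0 + h) - expected_y(x0 - h)) / (2 * h)

print(numeric_rate, total_effect(x0))
print(indirect_m1(x0) + indirect_m2(x0) + direct(x0))
```

The three components add up to the total effect at every x, and each x-dependent term comes from the interaction coefficient d.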
When X has no effect on M 1 (a 1 =0) and consequently there is no indirect effect from M 1 , the indirect effect from M 1 defined by Definition 3.3 will be 0 by Lemma 4.2. However, the total effect minus the controlled direct effect may be non-zero because of the exposure-mediator interactions [14]. Similar to our results, the natural direct effect from X and the natural indirect effect of M 1 when X changes from x to x* defined by VanderWeele and Vansteelandt [14] are (c+a 01 d+a 1 dx*)(x-x*) and (a 1 b 1 +a 1 dx)(x-x*), both of which depend on the changing unit of X, x-x*. However, the natural direct effect depends further on the end value of X, x*, while the natural indirect effect depends further on the start value x.

Multiple Mediation Analysis in Logistic Regression
In this section, we illustrate the mediation analysis when the mediator variable is not continuous and when the response variable is binary and a logistic regression is adopted to model its relationship with X and M. In this case, as in the motivating example, the response is the log-odds of the outcome variable. The independent variable X can be continuous or categorical; it is the patient race (white or black) in the motivating example. In the following, we assume the exposure variable X to be binary.

When M is binary
We first consider mediator $M_1$ to be binary and assume that the underlying true models are logistic regressions, with $a_1$ the coefficient of $X_i$ in the model for $\mathrm{logit}\,P(M_{1i}=1)$ and $b_1$ the coefficient of $M_{1i}$ in the model for $Y_i$. In this situation, the CP method cannot be used directly since $a_1$ denotes the change of $\mathrm{logit}\,P(M_{1i}=1)$ with $X_i$, while $b_1$ is the change of $Y_i$ with $M_{1i}$. The CD method is also not readily adaptable: first, we have to assume two true models for Y, one fitted with X and mediators and the other with X only; second, the coefficients of X in two models fitted with different subsets of explanatory variables have different scales and consequently are not comparable; third, the indirect effects from different mediators are not separable. With the definitions in Section 3, we derive the results in Lemma 5.1 to calculate the indirect effect of $M_1$.
The counterfactual framework is popular in dealing with the special case when X is binary (0 or 1 denoting control or treatment), for example, through the natural indirect effect, $\delta_i(x)$, and the natural direct effect, $\varsigma_i(x)$, discussed in Section 1. Based on the definition, Imai et al. [20,21] relied on a no-interaction assumption under which these effects do not depend on the exposure level x. Otherwise, there could be two measurements of indirect effect or direct effect, which brings challenges to generalizing the mediation analysis to multi-categorical or continuous exposures. Our method relaxes this assumption. One can easily show that the average direct effect we defined for binary X in single mediator scenario is (

When M is multi-categorical
When the mediator is multi-categorical with K+1 distinct categories, i.e., M takes one of the values 0, 1, ..., K, a multinomial logit regression model can be adopted to fit the relationship between M and X. The true models are assumed to be:

Multiple Mediation Analysis for Non/Semi-Parametric Predictive Models with Binary Exposure
When (generalized) linear regression is insufficient for describing variable relationships, mediation analysis can be very difficult. The following algorithms, which are derived directly from the definitions of mediation effects, provide a non/semi-parametric method to calculate mediation effects when the exposure variable is binary and the sample size at each exposure level is large. More general mediation analysis with any type of exposure and nonlinear relationships is discussed in a separate paper [22].
Algorithm 6.1 Estimate the total effect: the total effect for binary X is E(Y|X=1)-E(Y|X=0). Under certain conditions, it can be directly obtained by averaging the response variable Y in subgroups of X=0 and 1 separately and taking the difference.
Algorithm 6.2 Estimate the direct effect not through M j , which is defined as 1 Divide the original data sets
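For a binary exposure, the difference-in-means estimate of the total effect and the idea of holding a mediator at its marginal distribution can be sketched in a few lines of Python. This is a minimal plug-in illustration under strong assumptions (a single binary mediator, no covariates Z, every exposure-by-mediator cell non-empty); it is not the paper's Algorithm 6.2, and the function names and simulated probabilities are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def total_effect(records):
    """Difference in mean response between the X=1 and X=0 subgroups."""
    y1 = [y for x, m, y in records if x == 1]
    y0 = [y for x, m, y in records if x == 0]
    return mean(y1) - mean(y0)

def direct_effect(records):
    """Exposure contrast within each mediator level, weighted by the
    marginal (pooled over X) distribution of the mediator."""
    n = len(records)
    de = 0.0
    for level in sorted({m for _, m, _ in records}):
        weight = sum(1 for _, m, _ in records if m == level) / n
        y1 = [y for x, m, y in records if x == 1 and m == level]
        y0 = [y for x, m, y in records if x == 0 and m == level]
        de += weight * (mean(y1) - mean(y0))
    return de

# Simulated binary data with hypothetical probabilities.
rng = random.Random(1)
records = []
for _ in range(20000):
    x = int(rng.random() < 0.5)
    m = int(rng.random() < 0.3 + 0.4 * x)             # X shifts the mediator
    y = int(rng.random() < 0.1 + 0.2 * m + 0.05 * x)  # M and X shift the outcome
    records.append((x, m, y))

te = total_effect(records)
de = direct_effect(records)
ie = te - de   # indirect effect through the mediator
print(f"TE = {te:.3f}, DE = {de:.3f}, IE = {ie:.3f}")
```

In this simulated design the population values are TE = 0.13, DE = 0.05 and IE = 0.08, so the estimates should fall close to those numbers for a sample of this size.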

Inferences on the Estimated Mediation Effects
The Delta and bootstrap methods are two popular ways to measure the uncertainties of mediation estimators [23,24]. When the variable relationships are modeled with (generalized) linear models, the Delta method can be used to estimate the variances of the mediation effect estimators. In the supplementary materials, we provide the Delta method estimated variances of the mediation effects estimated in Lemmas 4.1, 5.1 and 5.2. The variances of mediation effect estimators can also be obtained by the bootstrap method, especially for nonlinear predictive models. First, randomly draw n observations from the original data set with replacement; then conduct the mediation analysis on the bootstrap sample and obtain the direct, indirect and total effect estimates. The resampling and analysis are repeated B times. The variances of the mediation effects are calculated from the B sets of mediation effect estimates from the bootstrap samples.
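The bootstrap procedure described above can be sketched generically. In the Python illustration below, `bootstrap_variance` accepts any estimator function; the difference-in-means estimator and the simulated data are hypothetical stand-ins for a full mediation analysis, and the number of resamples is chosen only to keep the example fast.

```python
import random
import statistics

def bootstrap_variance(data, estimator, n_boot=200, seed=0):
    """Variance of an estimator over n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=n)   # n draws with replacement
        estimates.append(estimator(resample))
    return statistics.variance(estimates), estimates

def difference_in_means(records):
    """Stand-in estimator: mean of y for x=1 minus mean of y for x=0."""
    y1 = [y for x, y in records if x == 1]
    y0 = [y for x, y in records if x == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

# Hypothetical binary exposure/outcome data.
data_rng = random.Random(42)
data = []
for _ in range(300):
    x = int(data_rng.random() < 0.5)
    y = int(data_rng.random() < 0.2 + 0.15 * x)
    data.append((x, y))

var_hat, estimates = bootstrap_variance(data, difference_in_means, n_boot=200)
print(f"bootstrap SE = {var_hat ** 0.5:.4f}")
```

Passing the estimator as a function keeps the resampling loop unchanged when the stand-in is replaced by an estimator of a direct or indirect effect.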

Real Example and Simulation Study
The motivating example
In this example, the explanatory variable is the race of the patient (0 for white and 1 for black). The binary response variable is the patient's vital status at the end of the third year after diagnosis (death from breast cancer or not). The potential mediators (confounders) are listed in Table 1. Among these variables, age is continuous; poverty, education, employment, radiation, chemotherapy and hormonal therapy variables are binary; and all other variables are polytomous.
We used the criteria proposed by Baron and Kenny [2] to check the qualification of potential mediators. Variables significantly associated with race, and with vital status controlling for race, were candidate mediators. A variable not related to race but significantly related to vital status adjusted for race was included as a covariate. Variables not significantly related to vital status controlling for race were excluded from further analysis. As a result, radiation was included as a covariate, and chemotherapy and all census tract level variables were left out of further analysis. The remaining variables were analyzed as potential mediators in the multiple mediation analysis. We used both the logistic model and MART [25] to explore the mediation effects. Note that the IEs were measured in terms of log odds of death in the logistic model, but in terms of probability of death in MART. To compare results, we define the relative indirect effect as the indirect effect divided by the total effect.
Table 2 presents the indirect effect, relative indirect effect estimates, and their 95% confidence intervals from the bootstrap method. There were some differences in the results from logistic regression and from MART: (1) the orders of relative effects were slightly different; (2) comorbidity was a significant mediator by MART but not by logistic regression; and (3) the confidence intervals were narrower using MART. We recommend using MART in this case, since it is more sensitive in finding significant mediators, especially if the assumed relationship in logistic regression is inappropriate. From the results, stage explained 29% of racial disparities in the three-year survival of breast cancer patients. Compared with patients with localized breast cancer, patients diagnosed with regional or distant cancer were significantly more likely to die from breast cancer. Also, larger tumor size and/or worse grade of breast tumor contributed to the higher risk of breast cancer death among black women than among white women; the relative indirect effects from tumor size and grade were 21% and 10%, respectively.
Compared with patients with negative ER/PR receptors, patients with positive ER/PR receptors developed less aggressive breast cancer and had better three-year survival rates. Black women were 54.7% less likely to be diagnosed with positive ER/PR receptors than whites. Compared with patients with lumpectomy surgery or hormonal therapy after surgery, those without surgery or hormonal therapy had worse three-year survival. Black women were less likely to undergo hormonal therapy or lumpectomy surgery. As a result, the relative indirect effects for hormonal receptor, surgery, and hormonal therapy were 10%, 20%, and 10%, respectively.
Black breast cancer patients were more likely to be diagnosed at a younger age, at which patients had longer survival time. This indicated that age at diagnosis was a suppression factor for racial disparities in mortality.
Insurance had a significant indirect effect (8%) on racial disparities in the risk of death from breast cancer. Compared with patients with private insurance, those having no insurance or having Medicaid, Medicare or other public insurance had higher three-year mortality, which might relate to more restricted access to necessary breast cancer treatment. A higher proportion of blacks than whites in this study had Medicaid, Medicare, public or no insurance.
The estimated average direct effect of race on mortality after considering all mediators was -0.014 with 95% confidence interval (-0.034, 0.007), which was not statistically significant. This suggests that racial disparities in breast cancer survival could be satisfactorily explained by all mediators in the model.

Simulation study
To evaluate the sensitivity and identifiability of the proposed method, we conducted simulations to check the bias, type I error, and power in estimating mediation effects. We checked the method comprehensively with all types of exposures, mediators and response variables, and with many types of relationships. In this section, we show a special case where all variables were binary. For more scenarios, readers are referred to Fan [22]. We used two potential mediators, which were either independent of or correlated with each other. Data were generated from logistic models for M 1 , M 2 and Y, in which a j is the coefficient of X in the model for M j , and b j and c are the coefficients of M j and X in the model for Y. The correlation between M 1 and M 2 was controlled by their odds ratio. Larger a j and b j indicated a greater indirect effect from M j . When either of them was zero, there was no mediation effect from the corresponding variable. A total of 6×3^4=486 parameter combinations were used for each of the four sample sizes, yielding 1944 simulation scenarios. We present the simulation results for a 1 =0.518, a 2 =0.518 and sample sizes 500 and 1000 in this paper. The complete simulation results will be provided upon request.
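A data-generating step of this kind can be sketched as follows for the case of independent mediators. The sketch assumes logistic models for M 1 , M 2 and Y with zero intercepts and P(X=1)=0.5; those choices, the value of b 1 , and the function names are hypothetical, and the odds-ratio mechanism used to correlate the mediators is not reproduced here.

```python
import math
import random

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def simulate(n, a1, a2, b1, b2, c, rng, a01=0.0, a02=0.0, b0=0.0):
    """Draw n records (x, m1, m2, y) with binary variables and independent mediators."""
    rows = []
    for _ in range(n):
        x = int(rng.random() < 0.5)                    # assumed P(X=1) = 0.5
        m1 = int(rng.random() < expit(a01 + a1 * x))   # logistic model for M1
        m2 = int(rng.random() < expit(a02 + a2 * x))   # logistic model for M2
        y = int(rng.random() < expit(b0 + b1 * m1 + b2 * m2 + c * x))
        rows.append((x, m1, m2, y))
    return rows

rng = random.Random(2024)
# a1, a2, b2 and c follow values quoted in the text; b1 is a hypothetical choice.
sample = simulate(1000, a1=0.518, a2=0.518, b1=0.5, b2=0.522, c=0.518, rng=rng)
print(sample[:3])
```

Setting a1 or b1 to zero in the call removes the pathway through M 1 , which corresponds to the null scenarios used when assessing type I error.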

Empirical bias
For each simulation scenario, 500 replications were conducted. We estimated the IEs of M 1 and M 2 from these 500 replicates. The empirical bias is the difference between the averaged IE and true IE. The simulation results are summarized in Table 3. For all the scenarios, we found no empirical bias that was significantly different from 0.

Type I error rate and power
For each replication described in the previous section, the variance of the estimated IE was estimated by both the Delta method and the bootstrap method (B=500). We tested the hypothesis $H_0: IE=0$ via the test statistic $\widehat{IE}/SE(\widehat{IE})$, which was assumed to be normally distributed.
The type I error rate or power was the proportion of times the null hypothesis was rejected at the 5% significance level. The simulation results are summarized in Table 4. If the true indirect effect is 0, the values listed in the table represent the type I error rates; otherwise they represent the powers. Figure 2 presents the power curves obtained by the Delta and bootstrap methods (OR=0.2) for mediator 1 at different b 1 and sample sizes when b 2 and c were fixed at 0.522 and 0.518, respectively. The Delta and bootstrap methods showed similar patterns. Figure 2 suggests that for all sample sizes, the statistical power increased with b 1 and then reached a plateau when b 1 hit a certain point. At fixed b 1 , a larger sample size had greater power. For a given sample size, when b 2 and c were fixed, the statistical power increased as b 1 increased; when b 1 and b 2 (or c) were fixed, the statistical power decreased as c (or b 2 ) increased. At the same parameter configuration, indirect effect inference showed slightly greater power if the two mediators were generated independently.
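The test and the rejection proportion described in this section amount to a two-sided normal test applied to each replicate. A minimal Python sketch follows; the function names and the example (estimate, standard error) pairs are hypothetical and do not come from the simulation study.

```python
import math

def z_test(ie_hat, se_hat):
    """Two-sided normal test of H0: IE = 0; returns (z, p-value)."""
    z = ie_hat / se_hat
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return z, p_value

def rejection_rate(results, alpha=0.05):
    """Proportion of replicates whose p-value falls below alpha."""
    rejected = sum(1 for ie_hat, se_hat in results if z_test(ie_hat, se_hat)[1] < alpha)
    return rejected / len(results)

# Hypothetical (estimate, standard error) pairs from four replicates.
replicates = [(0.03, 0.01), (0.005, 0.01), (0.02, 0.01), (0.001, 0.01)]
rate = rejection_rate(replicates)
print(rate)
```

When the true indirect effect is zero the returned proportion estimates the type I error rate, and otherwise it estimates the power.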

Discussion and Future Work
In this paper we propose a mediation analysis through general definitions of total effect, direct effect and indirect effect. We demonstrate the method with various predictive models. The proposed method is a general extension of mediation analysis under the counterfactual framework. It is also related to the traditional CD and CP methods. The method generalizes and improves the existing mediation analysis methodologies in many ways. First, responses, exposure variables and mediators can be measured at any scale: continuous, binary or multi-categorical. Second, multiple mediators of different types are allowed in the pathway analysis simultaneously. The indirect effect transmitted by an individual mediator can be differentiated from the total effect, which enables the comparison of the importance of the mediators. This property is especially useful for developing policies that aim at altering the relationship between a specific exposure variable and a response variable through controlling the intervention from third variables. With the knowledge of the indirect effect carried by each mediator/confounder, a policymaker is able to focus limited resources on changing the most important factors. Third, the mediation study allows correlations among mediators. Fourth, the concepts of mediation analysis can be applied in general predictive models. Finally, we provide two approaches to estimate the variance of the indirect effect in parametric or nonparametric models. Our methods are demonstrated through a real example and a simulation study.
Several aspects of the proposed method will be explored in future work. We provide a non-parametric procedure of mediation analysis for binary exposures. The procedure will be extended to more general predictive models and other types of exposures. Only limited work has focused on mediation analysis in survival model contexts. The idea of mediation analysis proposed in this paper will also be extended to the additive models in survival analysis. With the proposed definitions, we can also take advantage of previous knowledge and information in the analysis. Therefore, we propose further research on implementing multiple mediation analysis in Bayesian settings.

Supplementary Material
1. The proofs of Lemmas 4.1, 5.1, 5.2. 2. Delta method to measure the variances of mediation effects.