Human Genetics Center, UT Health School of Public Health, 1200 Pressler Street, Suite E-447, Houston, Texas, USA
Received Date: January 14, 2016; Accepted Date: February 10, 2016; Published Date: February 17, 2016
Citation: Yazdani A, Yazdani A, Boerwinkle E (2016) Conceptual Aspects of Causal Networks in an Applied Context. J Data Mining Genomics Proteomics 7:188. doi:10.4172/2153-0602.1000188
Copyright: © 2016 Yazdani A, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Causal inference is conceptually straightforward in the setting of a randomized intervention, such as a clinical trial. However, in observational studies, which make up the majority of large-scale epidemiologic studies, causal inference is complicated by confounding and by the lack of clear directionality underlying an observed association. In most large-scale biomedical applications, causal inference is embodied in Directed Acyclic Graphs (DAGs), which illustrate causal relationships (i.e., arrows) among variables (i.e., nodes). A key concept for making causal inference in the context of observational studies is the assignment mechanism (AM), whereby some individuals are treated and some are not. This perspective provides a structure for thinking about causal networks in the context of the AM. Estimation of effect sizes of the observed directed relationships is presented and discussed.
Causal inference; Assignment mechanism; Confounder; Causal network
Inferring cause-effect relationships among variables is of primary importance in many sciences and is growing in importance as a result of very large datasets in health and genomics. Several statistical frameworks have been established for making causal inference, such as Rubin’s potential outcome framework [1,2], Pearl’s structural equation modeling framework [3] and Dawid’s regime indicator framework [4]. These frameworks are hardly known to most biomedical researchers or biostatisticians, who could apply them to address real-world problems. Large segments of the statistical community and decision makers find it hard to benefit from causal analyses. The main reason, we believe, is not a philosophical barrier about data analysis establishing causality, but rather a lack of familiarity with the vocabulary and methods of the field. Undertaking statistical causal inference requires systematic extensions to the standard language of statistics, and this perspective provides a step toward this end.
Among the available statistical causal inference frameworks, Pearl’s causal networks, which are compatible with structural equation models (SEM) [3], can be seen as a pragmatic approach to solving real-world problems, especially in the age of large data sets. Pearl’s framework has been critiqued in [5], which suggests that it requires additional explicit methodological and philosophical justification. The concept of the assignment mechanism developed in [2] describes the circumstances by which some individuals are exposed to a treatment of interest and some are not. In this perspective, we first connect causal networks to the concept of the assignment mechanism (AM). Then, we formalize the causal network parameterization using the AM notation. After discussing the concepts and notation of causal networks and the AM, we present effect/causal effect estimation.
The questions that motivate most studies in the health, economic, social and behavioral sciences concern causal relationships, not merely associations; an example is the efficacy of a given drug in a given population. The classical approach for determining such relationships uses randomized experiments in which one or a few variables are intervened on. Such intervention experiments, however, are expensive, unethical or even infeasible in many cases. Hence, it is desirable to infer causal effects from so-called observational data, obtained by observing a system without subjecting it to interventions. To estimate the effect of a treatment on a response from such data, we need to know how different values of the treatment are assigned. The circumstances by which some individuals are exposed to a treatment of interest and some are not are called the assignment mechanism (AM).
To achieve causal inference, the important data elements include not only the values of the observations but also the reason why one of the possible exposures or treatments has been realized and not the others. The notation AM (K_{R}) is introduced as the third element (in addition to the treatment and response values) and is called the causal element [6]. Practitioners need to understand the underlying mechanisms by which some individuals have a certain exposure level and some do not. The knowledge related to the response is represented by K_{R} and is required to identify the AM. In a randomized clinical trial the AM is straightforward (i.e., the treatment assignment mechanism is unrelated to the response) and presumably under the control of the investigators. In an observational study, many factors (covariates) may influence the AM, but only some of them are related to the response. Variables or covariates that influence both the outcome and the AM are termed confounders [7]. The aim of considering the AM is to identify individuals with similar confounder distributions, as if there had been a randomization. In an epidemiologic study, this is similar to matching [8]. In a data analysis setting, this is equivalent to SEM [3], where the AM is understood and modeled. Formalizing the AM in the context of causal networks compatible with SEM is more practical in the age of big data. Therefore, in this study, we formalize the AM within the context of statistical causal networks.
Causal networks are illustrations of the AM, the data-generating process underlying the study observations. They provide a pragmatic approach to distinguishing confounders of the AM from among the covariates, and allow one to analyze observational data as if an intervention had been carried out. It is important to understand and take into account that any model in a causal setting is conditioned, explicitly or implicitly, on an understanding of the assignment mechanism.
Assume the assignment mechanism over p variables Y_{1}, …, Y_{p} is formalized by a network, here a Directed Acyclic Graph (DAG). The distribution P over these variables is:
$$P(Y_{1}, \ldots, Y_{p}) = \prod_{j=1}^{p} P\left(Y_{j} \mid Y_{pa(j)}\right) \tag{1}$$
where pa (j) denotes the set of predecessors of node j that are directly connected to j in the network, called the parents of node j. For i ∈ pa (j), there is an edge i→j in the DAG, or Y_{i}→Y_{j}. Note that the formula in (1) represents the Markov properties over this set of variables compatible with the underlying DAG, an illustration of the assignment mechanism that governs this set of variables. This is a strong assumption in applications of DAGs, and the dependence on the AM can be made explicit in (1) as
$$P(Y_{1}, \ldots, Y_{p} \mid \mathrm{AM}(K_{R})) = \prod_{j=1}^{p} P\left(Y_{j} \mid Y_{pa(j)}, \mathrm{AM}(K_{R})\right)$$
By conditioning factorized distributions on the causal element AM (K_{R}), we explicitly represent that the AM is taken into account and the work can, therefore, be considered to be within a causal setting.
Assume the AM over four variables X, Y, Z and H is illustrated in Figure 1. The variable of interest is H, and we are typically investigating the influence of the other variables on H. To factorize the joint distribution over these 4 variables, we first identify potential confounders.
Variables X, Y, and Z are all called covariates. However, the effect of X reaches H only through Z and Y. Therefore, X is not a confounder of the value of H. The set of confounders for variable H is C (H) = {Y,Z}. The interested reader is referred to the backdoor criterion in [3] for further information. The joint probability over these variables is then factorized according to the DAG in Figure 1, conditional on AM (K_{R}).
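Figure 1 is not reproduced in this text. As a sketch, suppose it contains the edges X→Y, X→Z, Y→H and Z→H, with no edge between Y and Z (an assumption made here for illustration, chosen to be consistent with X reaching H only through Y and Z). Under that assumed structure, the factorization would read:

```latex
\[
P\big(X, Y, Z, H \mid \mathrm{AM}(K_{R})\big)
  = P\big(X \mid \mathrm{AM}(K_{R})\big)\,
    P\big(Y \mid X, \mathrm{AM}(K_{R})\big)\,
    P\big(Z \mid X, \mathrm{AM}(K_{R})\big)\,
    P\big(H \mid Y, Z, \mathrm{AM}(K_{R})\big)
\]
```

If Figure 1 also contained an edge between Y and Z, the factor for the child node of that edge would gain the other variable as an additional parent.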
Without conditioning on the causal element, AM (K_{R}), such a unique factorization is not possible [9,10].
Assume a DAG D = (v,ε), where v is a set of nodes with p elements corresponding to a set of p random variables with a joint Gaussian distribution, and ε is a set of edges that connect the nodes and represent the conditional dependencies between the corresponding variables. The existence of a directed edge between two nodes shows the direction of effect (the flow of information) between the corresponding variables. The concept of a DAG D = (v,ε) depends on the nodes in v and the edges in ε, and any inference depends on the pair (v,ε). Assume P is a joint probability distribution over the variables Y_{1},…,Y_{p} corresponding to the nodes in D. D and P must satisfy the Markov condition, the strong assumption in causal inference using networks. This means that the variables are related through the causal network D = (v,ε). Furthermore, we assume that these variables have a joint distribution that satisfies the Markov property with respect to D, so that all marginal and conditional independencies can be obtained directly from the graph: every variable Y_{i}, i ∈ v, is independent of any subset of its predecessors conditioned on the set of direct or immediate causes of Y_{i}, corresponding to the parents of i,
$$Y_{i} \perp Y_{k} \mid Y_{pa(i)}$$
where Y_{k} occurs before Y_{i}, k ∉ pa (i), and the parental set pa (i) denotes the set of parents of node i relative to the AM formalized by D = (v,ε).
In SEM and under the assumption of a Gaussian distribution, we can write
$$Y_{i} = \sum_{j \in pa(i)} \lambda_{ij} Y_{j} + U_{i}, \qquad i = 1, \ldots, p \tag{2}$$
where U_{i} is normally distributed and independent of the Y_{j}'s on the right-hand side of the model. λ_{ij} ≠ 0 is equivalent to an edge j→i in the DAG D, which follows from the compatibility of the SEM with the AM formalized as the DAG D. The SEM is a deterministic form of the probability model, or of the conditional dependencies, in which all uncertainty is confined to the variables U_{i}.
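The linear Gaussian SEM can be made concrete with a small simulation. The following is a minimal sketch, not the authors' code: the function name `simulate_linear_sem`, the unit-variance noise, and the requirement that nodes be listed in topological order (so that the coefficient matrix is strictly lower triangular) are all assumptions made here for illustration.

```python
import numpy as np


def simulate_linear_sem(lam, n, rng):
    """Draw n samples from Y_i = sum_j lam[i, j] * Y_j + U_i.

    lam[i, j] != 0 encodes an edge j -> i. Nodes are assumed to be in
    topological order, so lam must be strictly lower triangular.
    The noise terms U_i are independent standard normals (an assumption).
    """
    lam = np.asarray(lam, dtype=float)
    p = lam.shape[0]
    if lam.shape != (p, p) or not np.allclose(lam, np.tril(lam, k=-1)):
        raise ValueError("lam must be square and strictly lower triangular")
    u = rng.standard_normal((n, p))
    y = np.zeros((n, p))
    # Generate each node from its already-generated parents plus its own noise.
    for i in range(p):
        y[:, i] = y[:, :i] @ lam[i, :i] + u[:, i]
    return y


if __name__ == "__main__":
    # Hypothetical three-node network X -> Y, X -> Z, Y -> Z (columns X, Y, Z).
    lam = [[0.0, 0.0, 0.0],
           [1.0, 0.0, 0.0],
           [-1.0, 2.0, 0.0]]
    sample = simulate_linear_sem(lam, n=1000, rng=np.random.default_rng(0))
    print(sample.shape)
```

Generating the nodes in topological order mirrors the reading of the DAG as a data-generating process: each variable is produced from its parents, and all randomness enters through the U_{i}.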
Given a causal network structure, the goal in this section is to discuss effect/causal effect estimation and distinguish it from mere association. To estimate the effect of Y on Z, we consider the causal element AM (K_{R}) embodied in the DAG in Figure 2, which illustrates the assignment mechanism behind the observed variables Y and Z. In estimating the effect of Y on Z, the variable X in the path Y←X→Z is called a confounder. In other words, X confounds the assignment of Y with respect to Z, since X influences both Y and Z. Recall that the causal network structure in Figure 2 is an illustration of the assignment mechanism over these three variables, and all discussion and equations for effect measurement are conditional on this assignment mechanism.
To find the effect of Y (and not X) on Z under a Gaussian assumption, we first adjust for the effect of X on Y by
$$Y = \alpha X + e_{yx} \tag{3}$$
and then find the effect of variations in Y on Z by
$$Z = \gamma e_{yx} + e \tag{4}$$
Here e_{yx} is the residual of Y after removing the effect of X. Equation (4) represents the degree to which variable Y is responsible for the variation in Z, excluding the effect of X.
Therefore, the coefficient γ is interpreted as a causal effect. However, in the regression of Z on Y,
$$Z = \lambda Y + e_{z} \tag{5}$$
the coefficient λ shows only the association between Y and Z, since some of the variation in Z attributed to Y is due to the confounder X.
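The contrast between γ and λ can be made explicit with a short calculation. As a sketch, assume (this is not stated in the source) that Figure 2 corresponds to the linear structural model Y = αX + e_{yx} and Z = γY + bX + u, where α and b are illustrative symbols for the effects of X on Y and of X on Z, and X, e_{yx} and u are mutually independent with mean zero. The population slopes of the two regressions are then:

```latex
\begin{align*}
\frac{\operatorname{Cov}(Z, e_{yx})}{\operatorname{Var}(e_{yx})}
  &= \frac{\operatorname{Cov}\big((\gamma\alpha + b)X + \gamma e_{yx} + u,\; e_{yx}\big)}
          {\operatorname{Var}(e_{yx})}
   = \gamma, \\[4pt]
\lambda = \frac{\operatorname{Cov}(Z, Y)}{\operatorname{Var}(Y)}
  &= \frac{\gamma \operatorname{Var}(Y) + b\,\alpha \operatorname{Var}(X)}{\operatorname{Var}(Y)}
   = \gamma + \frac{b\,\alpha \operatorname{Var}(X)}
                   {\alpha^{2}\operatorname{Var}(X) + \operatorname{Var}(e_{yx})}.
\end{align*}
```

Under these assumptions, regressing Z on the residual e_{yx} recovers γ, while regressing Z on Y itself differs from γ whenever both α and b are nonzero, that is, whenever X affects both Y and Z.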
We estimate the effect of X on Z excluding Y by:
$$e_{z} = \beta X + e$$
where e_{z} is the residual of Z after removing the effect of Y on Z. The coefficient β is interpreted as the effect of X on Z excluding the effect of Y. Mediator effects are not discussed in this section; interested readers are referred to [11-13].
To illustrate the above principles, we simulated three variables based on the underlying network in Figure 2, with primary interest in the effect of Y on Z excluding any effect of X. We simulated 50 data sets; the averages of the estimated effects and of the degrees of association over the 50 sets are tabulated in Table 1 for three different values of the true effect. Standard deviations are given in parentheses in Table 1. The degree of association is measured by regressing Z on Y.
True effect | Estimated effect (SD) | Degree of association (SD) |
---|---|---|
2.000 | 2.005 (0.03) | 1.596 (0.01) |
3.000 | 2.993 (0.03) | 1.729 (0.02) |
4.000 | 3.995 (0.03) | 1.853 (0.03) |
Table 1: Average estimated effects and degrees of association for three different true effects of Y on Z over 50 replicate data sets, with standard deviations in parentheses.
We estimated the effects and the degrees of association using equations (3) through (5), substituting the estimate of e_{yx} into equation (4).
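The procedure can be sketched in a few lines of NumPy. The sample size, the coefficients of X on Y and on Z, the unit-variance noise, and the function names `estimate_once` and `replicate` are all hypothetical choices made here, since the source does not report its simulation settings; the output is therefore not expected to match Table 1.

```python
import numpy as np


def estimate_once(gamma, n, rng, alpha=1.0, b=-1.0):
    """Simulate one data set from X -> Y, X -> Z, Y -> Z and estimate both slopes.

    alpha (X -> Y) and b (X -> Z) are hypothetical values; all variables have
    mean zero, so the regressions are fitted without an intercept.
    """
    x = rng.standard_normal(n)
    y = alpha * x + rng.standard_normal(n)
    z = gamma * y + b * x + rng.standard_normal(n)

    # Step 1: regress Y on X and keep the residual e_yx.
    alpha_hat = (x @ y) / (x @ x)
    e_yx = y - alpha_hat * x

    # Step 2: regress Z on the residual; the slope estimates the effect of Y on Z.
    effect = (e_yx @ z) / (e_yx @ e_yx)

    # Naive regression of Z on Y: the slope measures association only.
    association = (y @ z) / (y @ y)
    return effect, association


def replicate(gamma, n_sets=50, n=1000, seed=0):
    """Average the two estimates over n_sets simulated data sets."""
    rng = np.random.default_rng(seed)
    draws = np.array([estimate_once(gamma, n, rng) for _ in range(n_sets)])
    return draws.mean(axis=0), draws.std(axis=0, ddof=1)


if __name__ == "__main__":
    for gamma in (2.0, 3.0, 4.0):
        mean, sd = replicate(gamma)
        print(f"true={gamma:.3f}  effect={mean[0]:.3f} ({sd[0]:.2f})  "
              f"association={mean[1]:.3f} ({sd[1]:.2f})")
```

With these hypothetical settings the averaged effect estimate should sit close to the true value, while the association is shifted by the confounding path through X.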
We have provided a short and selective perspective on causal inference, including network analysis, the concept of the assignment mechanism, and effect size estimation. A unique aspect of causal inference compared to traditional applied statistics is captured in the concept of the assignment mechanism. To achieve causal inference, the assignment mechanism must be understood, which requires close collaboration between analysts and other biomedical scientists. By taking the AM into account, we are able to identify confounders and distinguish effect from association. The assignment mechanism, here formalized in a DAG, can be either known a priori or estimated by an algorithm for directed structures. In this perspective, we assumed that the assignment mechanism is known. When the AM is known, confounders can be identified from it and the measurements remain in a causal setting.
In most cases, the AM is not known and needs to be estimated. An ambitious approach is data integration. We have introduced an algorithm called granularity DAG (GDAG), which generates causal networks using data integration [14]. In one application, genomic information is extracted from SNPs scattered across the genome by first selecting a subset of informative SNPs using hierarchical clustering and linkage disequilibrium [15] and then applying principal component analysis. The extracted genomic information is used to generate a causal network over phenotypic variables of interest (e.g., body mass index and blood cholesterol levels).
This work is supported by a training fellowship from the Keck Center for Interdisciplinary Bioscience Training of the Gulf Coast Consortia (Grant No. RP140113).