Using Directed Acyclic Graphs for Investigating Causal Paths for Cardiovascular Disease

By testing for conditional dependence, algorithms can generate directed acyclic graphs (DAGs), which may help inform variable selection when building models for statistical risk prediction or for assessing causal influence. Here, we demonstrate how the method may help us understand the relationships between variables commonly used to predict cardiovascular disease (CVD) risk. The sample included people aged 30 to 80 years, free of CVD, who had a CVD risk assessment in primary care and at least 2 years of follow-up. The endpoint was combined CVD events, and the other variables were age, sex, diabetes, smoking, ethnic group, preventive drug use (statins or antihypertensives), blood pressure, family history and cholesterol ratio. We used the ‘grow-shrink’ algorithm in the bnlearn library of the R software to generate a DAG. A total of 6256 individuals were included, and 101 CVD events occurred during follow-up. The accepted causal associations of tobacco smoking and age with CVD were identified in the DAG. Ethnic group also influenced the risk of CVD events, but it did so indirectly, mediated through the effect of smoking. Drug treatment at baseline was influenced by a wide range of other variables, such as family history of CVD, age and diabetes status, but drug treatment did not have a ‘causal’ association with CVD events. Algorithms which generate DAGs are a useful adjunct to traditional statistical methods when deciding on the structure of a regression model to test causal hypotheses. Journal of Biometrics & Biostatistics


Introduction
Most risk prediction and causation models in epidemiology are based on additive combinations of risk factors in a regression framework. Unless interaction effects are introduced, the additive structure implies that each variable contributes to the risk of developing disease without influencing the other variables. Since they are simply mathematical constructs, the models do not necessarily provide a plausible causal representation of how disease develops. One way to consider causality more explicitly is to describe the influence of variables on a particular disease outcome, accounting for causal pathways that are at least plausible, in the form of a directed acyclic graph (DAG). These graphs can be built using Bayesian network learning algorithms [1].
Directed acyclic graphs (DAGs), also known as probabilistic networks, or Bayesian networks, encode a structure of conditional independence between variables, represented by nodes of a graph. Connections between nodes imply causal influence, observed in the data as statistical dependence. These connections are often directed, to indicate which variable influences the other (referred to as directed edges). In this way, DAGs represent a set of conditional dependence and independence properties associated with epidemiological variables [1].
In a DAG, no distinction is made between 'independent' and 'dependent' variables in the sense used in regression modelling. The idea underlying their use is to fuse domain knowledge with information from the collected data into a model which mimics a network of causal influences of how the observed data were generated.
DAGs are therefore useful for elucidating possible causal pathways and have been applied in epidemiology for this purpose [2]. However, they also have a role in forming sensible judgements about variables to be included in regression prediction models. For example, a key idea of Pearl, who has been a proponent of DAG ideas, is that variables may act as 'colliders' [3]. That is, on a causal path between exposure and outcome, another variable on the path is entered and exited through arrowheads, which indicate more than one influence (collision of influences) on the variable. Here, we interchangeably use the terms 'cause' and 'influence' to indicate directional conditional dependence, or a link between variables, generated by a computer algorithm.
This idea of including an explicit causal understanding is absent from much statistical analysis. Including colliders as regressors can result in unpredictable behaviour, biasing measures of association in a regression model. Pearl shows that including them may increase bias rather than reduce it, by introducing dependence from unobserved or other variables. Further, in certain instances, adjusting for colliders or their 'descendants' (that is, variables which are causally influenced by colliders) may indicate no causal influence between the variable of interest and the regression model's outcome variable, when in fact a causal relationship does exist [3]. DAGs derived from data may help identify such variables, so that they can be omitted from regression models rather than included.
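The effect of conditioning on a collider can be illustrated with a small simulation. The following is a minimal sketch, not part of the study's analysis: the variable names, the sample size, the seed and the threshold used to select a stratum are all assumptions chosen for illustration. Two independent variables both influence a third (the collider); restricting attention to one stratum of the collider induces an association between the two that is absent in the full sample.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_collider(n=20000, seed=1):
    """Return (marginal, conditional) correlation of exposure and outcome.

    Hypothetical set-up: exposure and outcome are generated independently,
    and the collider is influenced by both of them.
    """
    rng = random.Random(seed)
    exposure = [rng.gauss(0, 1) for _ in range(n)]
    outcome = [rng.gauss(0, 1) for _ in range(n)]  # independent of exposure
    collider = [e + o + rng.gauss(0, 1) for e, o in zip(exposure, outcome)]

    marginal = pearson(exposure, outcome)

    # "Adjust" for the collider crudely by looking within one stratum of it.
    kept = [(e, o) for e, o, c in zip(exposure, outcome, collider) if c > 1.0]
    conditional = pearson([e for e, _ in kept], [o for _, o in kept])
    return marginal, conditional


if __name__ == "__main__":
    marginal, conditional = simulate_collider()
    print(f"correlation in full sample:      {marginal:.3f}")
    print(f"correlation within collider > 1: {conditional:.3f}")
```

In this set-up the full-sample correlation is close to zero, while the within-stratum correlation is clearly negative, even though neither variable influences the other.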
To develop prediction models, we believe that a causal understanding is likely to lead to more accurate and reliable predictions than those developed using standard statistical methods alone [4,5]. In this study, we explore a small database of CVD and associated risk factors using DAG techniques to inform variable selection for risk prediction models, and, it is hoped, to better explain the development of CVD.

Methods
The analysis is based on a cohort assembled by primary care practitioners in the Auckland and Northland regions of New Zealand using the PREDICT programme for CVD risk management that is integrated with patient electronic health records [6]. Cohort participants were patients attending their primary care practitioners who had their CVD risk formally assessed using a Framingham Heart Study risk prediction equation [7]. The information, from participating GPs, was stored on a secure project web-server and each patient was linked to national health databases via an encrypted version of the New Zealand national health index (the NHI) number. This unique number is allocated to all New Zealand residents and attached to their routine health records.
Databases that were linked included hospital discharges, mortality, and drug dispensing. We selected a group of individuals enrolled between 1 January 2006 and 31 December 2007. Two years after the baseline assessment, we determined whether they had been admitted to hospital with CVD, or had died from CVD or other causes, by consulting hospital diagnosis and cause-of-death information. The diagnosis codes that were used have been listed elsewhere [6].
Individuals under 30 or over 80 years of age at the time of screening were excluded because CVD is uncommon under 30 years and hospital diagnoses are known to be less accurate in older people. We also excluded people with a history of prior CVD or heart failure, identified by a general practitioner diagnosis of CVD or hospitalisation with CVD in the last five years, and those dispensed a loop diuretic in the six months before assessment, who were assumed to have heart failure. The variables considered as candidates in the DAG were: age at enrolment, sex, diabetes, smoking, ethnic group, family history of premature CVD, statin use, antihypertensive drug use, systolic blood pressure, total:high-density-lipoprotein (HDL) cholesterol ratio and CVD events during follow-up. Continuous variables were categorised, mostly into deciles, as this format is required for the particular DAG algorithm (see below) that we selected. The categorical variables were included as dummy variables without an ordinal structure.
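The categorisation of a continuous variable into deciles can be sketched as follows. This is an illustration only: the function name is hypothetical, the choice of cut points (the default method of Python's `statistics.quantiles`) is an assumption, and the study does not report exactly how its cut points were computed.

```python
import statistics
from bisect import bisect_right


def decile_categories(values):
    """Map each value to a decile category from 1 (lowest) to 10 (highest).

    Assumption: cut points are the nine sample deciles returned by
    statistics.quantiles; a value equal to a cut point goes to the
    higher category.
    """
    cut_points = statistics.quantiles(values, n=10)  # nine cut points
    return [bisect_right(cut_points, v) + 1 for v in values]


if __name__ == "__main__":
    # Hypothetical systolic blood pressure readings (mmHg).
    readings = [112, 118, 121, 125, 128, 131, 134, 138, 142, 147,
                151, 155, 160, 166, 172, 115, 123, 136, 145, 158]
    for reading, category in zip(readings, decile_categories(readings)):
        print(reading, category)
```

Each resulting category would then be treated as an unordered level of a discrete variable, as described above.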
The R package bnlearn generated the DAG, using the 'grow-shrink' algorithm, first developed by Pearl [8]. An understandable summary of the algorithm has been documented elsewhere [1,9,10]. The algorithm effectively filters links out of a full skeletal DAG, in which all nodes are initially connected except those 'banned' (see below), based on tests of conditional independence between a pair of nodes given all possible subsets of the remaining nodes. We used the Monte Carlo permutation test option [11], which has performed better than standard chi-square tests in simulations in which the causal structure of the data is known [8]. Logical rules are then applied to determine the direction of links (conditional dependence between variables), so that cycles are not introduced and the patterns of conditional independence found in the data match the generated DAG.
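The idea behind a Monte Carlo permutation test of conditional independence can be sketched in a few lines. The code below is an illustrative sketch and is not bnlearn's implementation: the function names, the choice of a chi-square statistic summed over strata, and the scheme of permuting one variable within levels of the conditioning variable are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def chi_square(xs, ys):
    """Pearson chi-square statistic for two categorical sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    stat = 0.0
    for a, n_a in count_x.items():
        for b, n_b in count_y.items():
            expected = n_a * n_b / n
            stat += (count_xy.get((a, b), 0) - expected) ** 2 / expected
    return stat


def conditional_statistic(xs, ys, zs):
    """Sum of chi-square statistics for X versus Y within each level of Z."""
    strata = defaultdict(lambda: ([], []))
    for x, y, z in zip(xs, ys, zs):
        strata[z][0].append(x)
        strata[z][1].append(y)
    return sum(chi_square(sx, sy) for sx, sy in strata.values())


def permutation_p_value(xs, ys, zs, n_perm=200, seed=0):
    """Monte Carlo p-value for 'X independent of Y given Z'.

    X is shuffled within each level of Z, which preserves the X-Z and
    Y-Z relationships while breaking any X-Y link within strata.
    """
    rng = random.Random(seed)
    observed = conditional_statistic(xs, ys, zs)
    indices_by_z = defaultdict(list)
    for i, z in enumerate(zs):
        indices_by_z[z].append(i)

    at_least_as_extreme = 0
    for _ in range(n_perm):
        permuted = list(xs)
        for indices in indices_by_z.values():
            shuffled = [xs[i] for i in indices]
            rng.shuffle(shuffled)
            for i, value in zip(indices, shuffled):
                permuted[i] = value
        if conditional_statistic(permuted, ys, zs) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_perm + 1)
```

A small p-value is evidence against conditional independence, so the link between the pair of nodes would be retained; a large p-value would allow the link to be filtered out.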
We estimated link influence in the final DAG by estimating the beta-coefficient for a regression for each potential causal effect in which the variable at the base of the arrow ('cause') was considered a covariate, and the variable at the head of the arrow ('effect') was considered the outcome or dependent variable. Other variables which opened 'back door paths' (Pearl's terminology for confounding) between cause and effect variables were included as covariates in the regression. Either linear or logistic regression was used depending on whether the 'effect' variable was continuous or categorical.
For the link between ethnic group and family history of disease, we adjusted for age. Although age directly causes CVD, it cannot influence ethnic group (such a link is in fact 'banned'; see below) and so does not qualify as a confounder. Age does, however, modify the risk of an individual reporting a positive family history of CVD, and so we felt that it was sensible to adjust for age in this instance [12,13]. Other adjustments in the regressions are indicated in Table 1.
The bnlearn algorithm allows implausible causal influences to be 'banned'. The following rules generated the banned list:
• Sex, ethnic group and age must not be caused by any other variable.
• Family history must not be caused by drug treatment variables.
• The outcome, fatal and nonfatal CVD, must not cause any other variable.
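The three rules above can be expanded mechanically into a list of banned (cause, effect) pairs. The sketch below shows one way to do this; the variable names are hypothetical labels for the candidate variables listed in the Methods, and the output is simply a list of pairs. In bnlearn, a banned list is supplied as a two-column table of 'from' and 'to' node names, so pairs of this kind would need to be converted to that form.

```python
from itertools import product

# Hypothetical labels for the candidate variables described in the Methods.
VARIABLES = [
    "age", "sex", "ethnic_group", "diabetes", "smoking", "family_history",
    "statin", "antihypertensive", "systolic_bp", "cholesterol_ratio", "cvd",
]


def build_banned_list(variables):
    """Return the sorted list of banned (cause, effect) pairs."""
    never_caused = {"sex", "ethnic_group", "age"}
    drug_variables = {"statin", "antihypertensive"}
    banned = set()
    for cause, effect in product(variables, repeat=2):
        if cause == effect:
            continue
        # Rule 1: sex, ethnic group and age are not caused by anything.
        if effect in never_caused:
            banned.add((cause, effect))
        # Rule 2: drug treatment does not cause family history.
        if cause in drug_variables and effect == "family_history":
            banned.add((cause, effect))
        # Rule 3: the outcome does not cause any other variable.
        if cause == "cvd":
            banned.add((cause, effect))
    return sorted(banned)


if __name__ == "__main__":
    for cause, effect in build_banned_list(VARIABLES):
        print(f"{cause} -> {effect}")
```

With the eleven labels assumed here, the rules expand to 39 banned pairs: 30 from the first rule, 2 from the second, and 7 further pairs from the third that the first rule does not already cover.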

Results
After the selection criteria were applied, 6256 subjects were available for analysis, 101 (1.6%) of whom experienced a CVD event during follow-up, and 35 (0.6%) of whom died of causes other than CVD. Table 2 shows that age-at-enrolment, ethnic group, smoking status, antihypertensive drug use, systolic blood pressure and diabetes status were significantly associated with event status. Among ethnic groups, Maori were at highest risk of a CVD event (estimated odds ratio: 1.87; 95% CI: 1.09 to 3.10). Those who used either statins or antihypertensive agents were at higher risk of CVD than non-users.
The derived DAG is depicted in Figure 1. Directed arrows indicate the direction of 'causal' influence between variables. Only two direct influences on cardiovascular disease are detected: age and cigarette smoking.
Ethnic group influences the risk of cardiovascular disease, but it does so through the mediating effect of smoking. Age influences several other variables, such as family history of disease and the likelihood of taking preventive drug treatment. Ethnic group influences three variables: family history, smoking and diabetes status. The total:HDL cholesterol ratio is influenced by two variables: sex and cigarette smoking.
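The direct and indirect influences described above can be read off a simple edge list. The sketch below encodes only the links that are named explicitly in this section of the text; Figure 1 contains further links (for example, those involving drug treatment and blood pressure) that are deliberately not reproduced here, and the node labels are hypothetical.

```python
# Links named in the Results text only; not the complete DAG of Figure 1.
EDGES = [
    ("age", "cvd"),
    ("smoking", "cvd"),
    ("ethnic_group", "smoking"),
    ("ethnic_group", "family_history"),
    ("ethnic_group", "diabetes"),
    ("age", "family_history"),
    ("sex", "cholesterol_ratio"),
    ("smoking", "cholesterol_ratio"),
]


def parents(edges, node):
    """Direct influences on a node: the tails of arrows pointing at it."""
    return sorted(cause for cause, effect in edges if effect == node)


def directed_paths(edges, source, target, path=None):
    """All directed paths from source to target, each as a list of nodes."""
    path = (path or []) + [source]
    if source == target:
        return [path]
    found = []
    for cause, effect in edges:
        if cause == source and effect not in path:
            found.extend(directed_paths(edges, effect, target, path))
    return found


if __name__ == "__main__":
    print("direct influences on CVD:", parents(EDGES, "cvd"))
    for p in directed_paths(EDGES, "ethnic_group", "cvd"):
        print("path:", " -> ".join(p))
```

On this partial edge list, the direct influences on CVD are age and smoking, and the only directed path from ethnic group to CVD runs through smoking, which matches the description of smoking as a mediator.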
There was no link between anti-hypertensive or statin therapy and cardiovascular disease. We also observed that commonly accepted causal factors, such as systolic blood pressure and the total:HDL cholesterol ratio, did not show a causal link to CVD events. This contrasts with the strong univariable association between systolic blood pressure and CVD (Table 2). The analysis also did not causally link statin use with the cholesterol ratio variable.
Indices of link influence are given in Table 1. These are beta-coefficients derived from regressing the effect (arrowhead) on the cause (tail of arrow), using either linear or logistic regression, adjusting for other immediately adjacent influences on the effect variable. All links between age and other variables show strong evidence of association, as do the causal links involving ethnic group, male sex and diabetes. Strong associations were noted between diabetes status and use of preventive drugs. From the logistic regression analyses, the greatest odds ratios were between ethnic group and diabetes status. Pacific people were 4.4 times more likely than 'Others' to be diagnosed with diabetes (estimated OR: 6.44; 95% CI: 5.39 to 7.70; prevalence of diabetes among Others: 8.6%) and Indian people were almost four times more likely than 'Others' to have the diagnosis in this cohort (estimated OR: 5.14; 95% CI: 3.78 to 7.00). For continuous outcome measures, those who used antihypertensive drugs had an average systolic blood pressure 7.30 mmHg (95% CI: 6.28 to 8.33) higher than people who did not use these drugs.
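The text does not state how the 'times more likely' figures were obtained from the odds ratios. They are consistent with the standard conversion from an odds ratio to a risk ratio given the risk in the reference group, RR = OR / (1 - p0 + p0 × OR). The sketch below applies that conversion as an assumption, using the reported prevalence of diabetes among 'Others' (8.6%) as p0; the function name is hypothetical.

```python
def odds_ratio_to_risk_ratio(odds_ratio, reference_risk):
    """Convert an odds ratio to a risk ratio.

    reference_risk is the outcome risk (here, prevalence) in the
    reference group; the formula is RR = OR / (1 - p0 + p0 * OR).
    """
    return odds_ratio / (1 - reference_risk + reference_risk * odds_ratio)


if __name__ == "__main__":
    prevalence_others = 0.086  # reported prevalence of diabetes among 'Others'
    pacific = odds_ratio_to_risk_ratio(6.44, prevalence_others)
    indian = odds_ratio_to_risk_ratio(5.14, prevalence_others)
    print(f"Pacific vs Others: {pacific:.2f}")  # about 4.39
    print(f"Indian vs Others:  {indian:.2f}")   # about 3.79
```

Under this assumption the reported odds ratios of 6.44 and 5.14 correspond to risk ratios of about 4.4 and 3.8, in line with the 'times more likely' figures quoted above.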

Discussion
In this exploratory analysis with a relatively small dataset, we have shown that a DAG learning algorithm generated a plausible graph explaining the occurrence of cardiovascular disease. The DAG captures the two known key causal influences of CVD: age and cigarette smoking. It also demonstrates the well-known influence of age on other variables, such as systolic blood pressure [14] and preventive drug use [15]. Positive or higher values of these variables increased with advancing age.
The DAG may help inform variable selection decisions for regression modelling to establish magnitude of effects. For example, from our data, ethnicity influences diabetes, cigarette smoking, and family history of premature CVD. These 'causal' relationships indicate that in trying to assess the effects of ethnic group on CVD, adjusting for any of these mediating variables will bias the association. It is equivalent to adjusting for blood pressure level when investigating if there is a causal relationship between body mass and CVD, as blood pressure is on the causal pathway. Similarly, the DAG simplifies assessing the influence of potential confounding factors on CVD incidence. It also suggests that none of the other baseline variables confounds the relationship between ethnicity and CVD, since no other variable directly influences ethnic group. Thus, when assessing the causal effect of ethnicity, it may only be necessary to adjust for age, since, as we argued before, it is a modifier of the effect of ethnic group on CVD.
An interesting feature of the DAG was the link between age and reported family history, showing a negative relationship. This may reflect the belief and reporting practice of the physician, who may only enquire about family history of CVD in younger patients, assuming that older patients will not have a family history. An alternative assumption is that genetic causes of CVD only manifest disease in younger patients, so that older patients, when risk assessed, are assumed not to have a genetic predisposition.
Again, if this DAG were a valid representation of causality it would suggest that very few of the variables that were measured actually cause CVD, so in assessing the effect of various exposures, some adjustment may cause more harm than good. It also counters the common practice in clinical research of reporting 'independent risk factors' after adjusting for a number of other variables by regression [16] and considering them as causal.
The DAG presented here also may help identify what Pearl terms 'barren proxies' when assessing causal influences. These are variables which have no direct influence on either the exposure or outcome variables, but are themselves causally influenced by factors that are either related to exposure or disease, or possibly both. In this sense, they could be considered as proxy measures of either exposure or disease. For example, consider a scenario in which one was to investigate the statistical evidence for a causal link between sex and CVD incidence. In this case, including the cholesterol ratio variable as a covariate, which, in this data set is influenced by sex, but does not show convincing evidence of influencing disease status, may increase (rather than reduce) bias in estimating the strength of association between sex and CVD in a regression model. Thus, in this dataset the cholesterol ratio would be termed a barren proxy. As with the ethnicity example above, the value of excluding the cholesterol ratio in a causal analysis is distinct from the value which the cholesterol ratio variable may play in predicting disease incidence.
Some known links emerged from the analysis; for example, the link between cigarette smoking and serum lipids has long been described [17]. The DAG did not, however, directly link serum lipids with CVD. In addition, the DAG and the effect estimates in Table 1 identified that anti-hypertensive drug treatment increases systolic blood pressure. As anti-hypertensive drugs are known to lower blood pressure, this at first seems counter-intuitive. However, the drug use data were collected before the blood pressure information, so this is the only sensible direction for the link to be oriented. The orientation of the link means that people were taking the drugs simply because their blood pressure was high, and that treated individuals had, on average, a higher blood pressure than untreated individuals (7.3 mmHg, adjusted for age) when they were screened.
This analysis clearly has some limitations. These include the likelihood that some associations were not identified because of type II errors (only 101 CVD events occurred). There may also be information bias and unmeasured variables that could affect the nature of the DAG. However, the main objective of this paper is to demonstrate the potential of the DAG learning algorithm rather than to add to our knowledge of CVD risk. A further limitation of the DAG algorithm is that it does not deal with time-to-event data, commonly used in cohort studies, which may be censored. In these analyses we used a short, two-year period of follow-up, and in this period there were few losses to follow-up, mostly from non-CVD deaths.
There are a few other studies which have used learning Bayesian networks to explore similar datasets. Twardy et al. [18] used Bayesian network algorithms, based on minimisation of information metrics, to determine the causal structure of the data in two cohort studies of cardiovascular disease. Unlike in our study, the authors did not exclude, or ban, implausible relationships. Their study was also limited by a high proportion of cases in which some covariates were missing. In their 'final' model, several implausible relationships were present, such as diabetes and weight influencing age. Their model described age as the only influence on coronary heart disease and, as in our study, found that age influenced many risk factors: total cholesterol, triglycerides, systolic blood pressure, smoking status and height. Unlike our study, their model included some known causal links, such as that between diabetes and systolic blood pressure, which was not drawn in our DAG even though it is well known that diabetes raises blood pressure [19].
To summarise the implications of our DAG for statistical modelling, we suggest that when using regression to assess the causal influences on cardiovascular disease, an analysis could be done to generate a DAG to estimate the conditional association between disease status and other variables. Only those variables which appear causally related, that is, with arrows that point to disease, should be included in the model. This means that for our data we would include only age and smoking status, along with the exposure of interest, in a regression. Other variables may be justified if they were thought to be important effect modifiers or confounders. Effect modification is not captured in the DAG, so inclusion of variables for this reason will not be informed by the DAG. If variables are justified as confounders, researchers must think carefully about whether they are likely to act in such a way, that is, to causally influence both the exposure of interest and the outcome, rather than act as 'barren proxies'. The Bradford-Hill criteria [20] may be used to guide these decisions. In contrast, for developing prediction algorithms, many variables can be used in statistical models that may be associated with, but not necessarily causally related to, disease.
Also, it is increasingly common practice for researchers to propose a DAG, drawn from informed scientific knowledge, which is then used to inform variable selection when testing causal relationships in observational studies. The algorithm used in this study provides a way of checking whether the assumptions encoded in the researcher-drawn DAG are actually observed in the study data.
In this exploratory study, we demonstrate how a simple DAG could shed light on the likely causal structure of risk factors for incident cardiovascular disease. The derived graph provides useful information to inform variable selection decisions when assessing causal relationships with the disease, and since they are related concepts [4], the DAG also usefully informs the development of models used for prediction.