Today we will go over the basics of quantitative causal inference including examples of the potential outcomes framework and linking it to existing concepts such as confounding. We will walk through some example data to clarify these concepts.
We will discuss two critical conditions, exchangeability and causal consistency, that we need to fulfill in order to estimate and interpret associations as causal effects. We will consider how important third variables (i.e. variables that are neither exposures nor outcomes), such as confounders and mediators, relate to these conditions.
Then, we go over how to draw a causal diagram, which illustrates some of our assumptions about how these variables relate to one another and, in turn, helps us build statistical models (the next step in causal inference).
Finally, we review how the potential outcomes framework and the causal condition requirements relate in the context of framing a causal question.
While there are many philosophies of studying causation, “epidemiologists take a specific pragmatic approach” that quickly narrows the problem down to quantitative questions that can be answered with numeric data. Simply put, because nearly anything can be a cause of something and causes must work in concert with one another in context (“multi-causality”), using numeric data to look for “causes of effects” may be quite fruitless (we will discuss counter-arguments in our health disparities section). Instead, epidemiologists ought to be relying on background theory and knowledge to identify possible causes, choosing a context, and estimating “effects of causes” within that context.
To say it a different way, it is not the best use of numeric data to try to demonstrate whether something can be a cause by simply observing the strength of association observed in a single study. To take an extreme example: Imagine trying to determine whether types of natural materials are “causes” of flammability by studying them underwater…or in space! There are clearly large differences in how well different materials burn, but being able to observe this “effect” relies on the presence of other causes (alternative explanations), that is to say observable effects are context-specific. In certain contexts, some causes don’t matter at all, but I don’t think anyone would disagree that material type is definitely a “cause” or determinant of flammability!
Instead, we can take a more productive approach by starting with the assumption that material type is a cause of different rates of burning. We then specify the context under which we want to investigate this difference in burning rates – e.g. at sea level, indoors, ambient temperature of 75 degrees F, 60% humidity, and using a butane lighter as a source of ignition. A study like this provides us very specific, narrow information: Does the rate of burning for material A differ from material B in this specific context? These findings are more translatable in suggesting when results should or should not replicate, for example, when the context differs too much from that which was observed in this study.
As you are probably already thinking, the concept of “specific” is relative and will change as we gain more information about systems, causes, and mechanisms. Such is the nature of science. However, this aspiration to estimate a narrowly-defined numeric effect size, which comes from specifying the background knowledge (of possible causes) and context for which the effect should apply, is a hallmark of “quantitative causal inference.”
We will talk more specifically about the elements of quantitative causal inference in the next section, but it’s worth giving some of its own context here first: Even though a main goal of public health research should be to produce actionable evidence to improve health, epidemiologists are often too timid to make causal claims about the effects of exposures and fall back on just describing statistical “associations”. This is in part because we know that there are many potential alternative explanations for any observed association. Unfortunately, such non-committal findings often don’t even help us understand what research is needed to develop future public health interventions. Worse, they might be interpreted in ways we didn’t even intend – think of all the news articles flip-flopping about whether wine, coffee, chocolate, etc. are good or bad based on vague associations.
Good quantitative causal inference lays out transparently the alternative explanations we think are important and gives us a way to productively discuss whether or not they are believable. Specifically, it acknowledges that it is difficult or impossible to capture all the complexity that is inherent in human populations and systems and instead opts to answer simple questions in context. Rather than throwing up our hands and saying “it’s just an association,” we do our best to define a question we can reasonably answer and defend our answer to it. In this way, we can at least try to fulfill epidemiology’s mission as the “basic science of public health” and move in the direction of informing future interventions or policies.
Too hungry to focus.
To formalize how we think about “effects of causes,” alternative explanations, and context, we rely on the concept of counterfactuals: hypothetical scenarios that might have played out (for an individual or group of individuals) under different treatments or exposures. For example, if you had decided on a busy day to skip a meal prior to a meeting, a counterfactual “exposure” scenario might have been that you had packed a lunch to eat quickly before the meeting. The latter scenario is counter to the fact (opposed to reality) of you having actually skipped a meal, and there is no way to turn back time to actually see what would have happened had you packed the lunch!
Let’s express this in terms of variables and values, with capital \(X\) representing what you had for lunch. Little \(x = 0\) denotes skipping lunch (what actually happened / what was observed) and a counterfactual (i.e. not actually observed) \(x = 1\) denotes eating a packed lunch. Note that there could be a number of alternate scenarios, such as if you packed a lunch but still didn’t eat it (is it still \(x = 0\)?) or you bought a snack instead (\(x = 2\)?). We’ll come back to these very important considerations.
Now we need to think of an outcome that we are interested in estimating the effect of having lunch on. You can’t have a causal effect without an outcome! Let’s say that we’re trying to estimate the effect of having lunch on being able to remember details from your regular afternoon meeting. We could denote this as a continuous outcome big \(Y\), where little \(y = 1\) means you remembered everything, \(y = 0.5\) means you remembered half of it, and \(y = 0\) means you remembered nothing. Let’s say that what actually happened was you retained about 25% of what was said (\(y = 0.25\)).
An intuitive way of thinking of an individual-level causal effect here is whether, if counter-factually you had lunch, the amount you retained would have been more than 25%. We can represent this intuition as a contrast or a difference between two potential outcomes:
\[\text{Individual causal effect} = Y_{x = 1} - Y_{x = 0}\]
Since we actually observed the outcome \(Y_{x = 0} = 0.25\), the only information we are missing to solve this is the “potential” outcome \(Y_{x = 1}\).
But how do we get information on the potential outcome \(Y_{x = 1}\) since you actually didn’t have lunch (\(x = 0\))?
Or if you actually had lunch (\(x = 1\)), how would we get the counterfactual potential outcome \(Y_{x = 0}\)?
As it turns out, it’s impossible to know an individual’s causal effect because we’ll never be able to turn back time to see all potential outcomes for an individual!
This is known as the fundamental problem of causal inference and is often thought of as a missing data problem (i.e. since we are missing data on at least one of an individual’s potential outcomes).
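To make the missing-data framing concrete, here is a minimal sketch in base R. The two-person table, its values, and the object name `po` are all assumptions made up for illustration; they are not part of the course data.

```r
# Hypothetical illustration: two people, each with one observed and one
# missing potential outcome (values are made up for this sketch).
po <- data.frame(
  ID  = c(1, 2),
  X   = c(0, 1),        # observed exposure: 0 = no lunch, 1 = lunch
  Y_0 = c(0.25, NA),    # potential outcome under no lunch
  Y_1 = c(NA, 0.50)     # potential outcome under lunch
)

# The individual-level contrast needs both columns, so it is missing for everyone
po$ICE <- po$Y_1 - po$Y_0
po
```

Whichever exposure a person actually had, one of the two potential-outcome columns is `NA`, so the individual-level difference cannot be computed from observed data alone.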
So how do we solve this? We must use outcomes from people who were observed to be exposed (\(Y_{x = 1}\)) to stand in for the counterfactual potential outcomes of people who were unexposed (\(x = 0\)) and thus do not have an observed \(Y_{x = 1}\), and vice versa.
However, there is a major problem with this, which is that we don’t know if it’s fair to use one person’s potential outcomes to substitute for another’s. Obviously, each person’s experiences and response to exposures can be very different!
Imagine we could actually see the potential outcomes for three different people who didn’t have lunch (\(x = 0\)) and remembered 25% of what was said in the meeting (\(Y_{x = 0} = 0.25\)).
Go through the three cases below to see the unobserved potential outcome (\(Y_{x = 1}\)) and resultant true Individual Causal Effect (ICE) for three different people:
\[\text{Observed outcome: }\quad Y_{x = 0} = 0.25\]
\[\text{Unobserved potential outcome: }\quad Y_{x = 1} = 0.25\]
\[ICE = Y_{x = 1} - Y_{x = 0} = 0.25 - 0.25 = \mathbf{0}\]
Person 1 would’ve remembered the same amount either way, so eating lunch had no effect (ICE = 0).
Not hungry, but still tired.
\[\text{Observed outcome: }\quad Y_{x = 0} = 0.25\]
\[\text{Unobserved potential outcome: }\quad Y_{x = 1} = 0.50\]
\[ICE = Y_{x = 1} - Y_{x = 0} = 0.50 - 0.25 = \mathbf{0.25}\]
Person 2 would’ve remembered twice as much, so eating lunch would’ve been beneficial (ICE = 0.25).
Focused and on point.
\[\text{Observed outcome: }\quad Y_{x = 0} = 0.25\]
\[\text{Unobserved potential outcome: }\quad Y_{x = 1} = 0\]
\[ICE = Y_{x = 1} - Y_{x = 0} = 0 - 0.25 = \mathbf{-0.25}\]
Person 3 would’ve remembered less (maybe it made them sleepy), so eating lunch would’ve made things worse (ICE = -0.25)!
Food coma.
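The three cases can also be collected into one small table. This is just a restatement of the numbers above in base R; the object name `three` is an assumption for this sketch.

```r
# Potential outcomes for the three people described above
three <- data.frame(
  Person = 1:3,
  Y_0 = c(0.25, 0.25, 0.25),  # observed: no lunch
  Y_1 = c(0.25, 0.50, 0.00)   # unobserved: lunch
)

# Individual Causal Effect: lunch minus no lunch, per person
three$ICE <- three$Y_1 - three$Y_0
three
```

All three people share the same observed outcome, yet their individual effects differ in sign and size, which is exactly what observed data alone cannot reveal.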
These examples highlight several important points:
- Exposures do not have the same effect in everyone
- Whether the exposure is helpful, harmful, or neutral, and to what degree, is a property of the individual’s characteristics and specific context
- Just comparing the observed outcomes for exposed individuals vs. unexposed individuals (association) is not the same as knowing if an exposure is helpful or harmful (causation)
- BUT, the counterfactual / potential outcomes framework gives you a recipe to turn regular associations into causal ones!
While we can never get the causal effect for an individual’s exposure, we can get the average causal effect of an exposure within a group of people. All we need (ha!) to do is to justify that the outcomes observed for the exposed group correctly represent the potential outcomes for the unexposed group if they had counterfactually been exposed, and vice versa.
This is a major condition known as exchangeability.
Let’s look at an example dataset where the exchangeability condition is fulfilled, in other words our assumption about exchangeability is correct. We follow our previous scenario where \(X\) is having lunch and \(Y\) is retention. Like our dataset from Week 5, we have columns representing data that we would never see in real life. In this case, we are given the potential outcome for individuals under no lunch (\(Y_{x = 0}\)) given by \(Y_0\) and the potential outcome for that same individual but having had lunch (\(Y_{x = 1}\)) given by \(Y_1\). Remember that, given a binary exposure, we only get to see one outcome, \(Y_0\) if \(X = 0\) and \(Y_1\) if \(X = 1\), and that is represented by \(Y\). \(Y\) would be the only measure we’d get to see in a study (or in real life)!
However, in this example data, you can easily see that both no-lunch (\(X = 0\)) and lunch (\(X = 1\)) folks had the exact same set of potential outcomes, meaning that they all would have responded identically to no lunch (retain 25%) and to lunch (retain 50%); therefore they all have the same potential of being helped by having lunch. Thus, the average causal effect is easily computed by taking the mean difference between exposed and unexposed:
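The dataset itself is not printed in these notes, so the following is a minimal sketch of one that would behave as described. The ten rows and the even split between exposure groups are assumptions; only the potential outcome values (0.25 and 0.50) come from the text. Base R is used so the sketch runs without extra packages.

```r
# Hypothetical reconstruction of an exchangeable dataset: everyone has the
# same potential outcomes (0.25 without lunch, 0.50 with lunch).
wk6_dat <- data.frame(
  ID  = 1:10,
  X   = c(rep(0, 5), rep(1, 5)),
  Y_0 = rep(0.25, 10),
  Y_1 = rep(0.50, 10)
)
# Only the potential outcome matching the actual exposure is observed
wk6_dat$Y <- ifelse(wk6_dat$X == 1, wk6_dat$Y_1, wk6_dat$Y_0)

# Mean difference between exposed and unexposed
ace <- mean(wk6_dat$Y[wk6_dat$X == 1]) - mean(wk6_dat$Y[wk6_dat$X == 0])
paste("The Average Causal Effect is (E[Y(X = 1)] - E[Y(X = 0)]):", ace)
```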
## [1] "The Average Causal Effect is (E[Y(X = 1)] - E[Y(X = 0)]): 0.25"
Or, equivalently, by fitting the linear regression of \(Y ~ X\):
mdl <- lm(wk6_dat, formula = Y ~ X) %>% summary()
paste("By linear regression, the Average Causal Effect is:", mdl$coefficients[2,1])
## [1] "By linear regression, the Average Causal Effect is: 0.25"
Again, the implication is that the observed \(Y\) for those with \(X = 1\) perfectly captures the potential outcome \(Y_{x = 1}\) for those with \(X = 0\). This equivalence or exchangeability can formally be captured by the logical statement:
\[(Y_{x = 0}, Y_{x = 1}) \quad \perp \quad X \]
Literally, this expression means all potential outcomes are independent of the values of the exposure variable.
This was definitely fulfilled in our example data because the set of potential outcomes for every single exposed individual was exactly the same as the potential outcomes for every single unexposed individual. However, data need not be that uniform to fulfill the exchangeability condition. If you think about how we estimated the causal effects, the potential outcomes for exposed and unexposed groups only need to be identical on average. That is, there needs to be an equal proportion of helped, harmed, or unaffected people in each exposure group.
Consider the following slightly modified dataset in which people have different responses to having lunch (different Individual Causal Effects), including “no benefit,” but the two exposure groups are still exchangeable:
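The modified dataset is not shown in these notes, so here is a minimal sketch of what one could look like. The name `wk6_dat2`, the eight rows, and the specific mix of response types are assumptions chosen so that each exposure group contains the same mix.

```r
# Hypothetical dataset: response types differ between people, but each
# exposure group contains the same mix of them.
wk6_dat2 <- data.frame(
  ID  = 1:8,
  X   = c(rep(0, 4), rep(1, 4)),
  Y_0 = rep(0.25, 8),
  Y_1 = rep(c(0.25, 0.50, 0.50, 0.75), 2)  # ICEs of 0, 0.25, 0.25, 0.5
)
wk6_dat2$ICE <- wk6_dat2$Y_1 - wk6_dat2$Y_0
wk6_dat2$Y   <- ifelse(wk6_dat2$X == 1, wk6_dat2$Y_1, wk6_dat2$Y_0)

# The average individual effect is the same in both exposure groups
tapply(wk6_dat2$ICE, wk6_dat2$X, mean)

# Regression on observed data only (Y and X)
ace2 <- unname(coef(lm(Y ~ X, data = wk6_dat2))["X"])
paste("By linear regression, the Average Causal Effect is:", round(ace2, 2))
```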
## [1] "By linear regression, the Average Causal Effect is: 0.25"
Quick reality check: What types of settings or study designs would have data like this, where exposed and unexposed are balanced (on average) with respect to their potential outcomes?
Answer: Randomized Controlled Trials!
That is in fact why randomized controlled trials (RCTs) don’t need to be balanced on the basis of every covariate. By the randomization design, the exposure groups will be balanced (on average) on the basis of their response type (whether they will be helped, harmed, or unaffected), allowing you (for the most part) to estimate an Average Causal Effect just by comparing mean differences (t-test) or fitting a simple linear regression!
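A small simulation can illustrate why randomization tends to produce this balance. Everything here is an assumption made for the sketch: the sample size, the three response types and their equal frequencies, and the seed. It is not an analysis from the course.

```r
set.seed(2024)  # arbitrary seed so the sketch is reproducible
n <- 10000

# Hypothetical response types: no benefit, modest benefit, large benefit
ice <- sample(c(0, 0.25, 0.50), size = n, replace = TRUE)
Y_0 <- rep(0.25, n)
Y_1 <- Y_0 + ice

# Randomize exposure with a fair coin flip, independent of response type
X <- rbinom(n, size = 1, prob = 0.5)
Y <- ifelse(X == 1, Y_1, Y_0)

# Response types should be close to balanced across arms...
round(prop.table(table(X, ice), margin = 1), 2)

# ...so the simple mean difference should land near the true average ICE
true_ace <- mean(ice)
est_ace  <- mean(Y[X == 1]) - mean(Y[X == 0])
c(true = true_ace, estimated = est_ace)
```

Because the coin flip ignores each person's response type, the mix of types in each arm differs only by chance, and that chance imbalance shrinks as the sample grows.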
Except in the case of a trial or a natural experiment that occurs outside of anyone’s control, people who are exposed are usually systematically different from people who are not exposed. This is because they have circumstances, preferences, behaviors, or conditions that would lead them to be exposed and which also influence how they might respond to the exposure. That is, there are factors that confound the effect of exposures. The potential outcomes framework gives us a new or slightly different way to understand confounding:
A classic confounding scenario is “confounding by indication,” where people who are put on a medication (exposure) by a doctor systematically differ from those who may have the same disease but are not put on the treatment.
In some instances, they may be sicker than the typical patient, so a medication is tried as a last resort. In such a case, the medicine might seem to do poorly, because the potential outcomes for those exposed people (\(x = 1\)) are systematically worse than those of people who aren’t put on treatment (\(x = 0\)). See the example dataset below. Here \(X\) represents treatment and \(Y\) represents an illness score where lower numbers equal better health:
From this dataset, you can clearly see that the Individual Causal Effect is actually the same for everyone (ICE = -1), where, if treated, illness status would improve by 1 point/category for everyone. However, people who were \(Sicker\) were much more likely to get treated (\(X = 1\)), and people who were \(Sicker\) have generally higher illness scores no matter what.
As a consequence, a naive comparison of exposure groups makes it appear that the treatment is actually associated with harm!
## [1] "By simple linear regression, the observed association is: 0.2"
On the other hand, once we account for the differences in potential outcomes that are represented by the confounder, we come to the correct conclusion!
## [1] "Adjusting for illness status, the Average Causal Effect is: -1"
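The dataset behind these two estimates is not printed in these notes. The sketch below is a hypothetical reconstruction (the name `wk6_dat3` and the specific rows are assumptions, modeled on the layout of the other example datasets) in which everyone has ICE = -1 and sicker people are more likely to be treated.

```r
# Hypothetical reconstruction of a confounding-by-indication dataset:
# sicker people are more likely to be treated and have higher illness scores.
wk6_dat3 <- data.frame(
  ID     = 1:10,
  X      = c(rep(0, 5), rep(1, 5)),
  Sicker = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0),
  Y_0    = c(1, 1, 1, 1, 3, 3, 3, 3, 3, 1)
)
wk6_dat3$Y_1 <- wk6_dat3$Y_0 - 1          # treatment improves everyone by 1 point
wk6_dat3$Y   <- ifelse(wk6_dat3$X == 1, wk6_dat3$Y_1, wk6_dat3$Y_0)
wk6_dat3$ICE <- wk6_dat3$Y_1 - wk6_dat3$Y_0

# Naive comparison of exposure groups
naive <- unname(coef(lm(Y ~ X, data = wk6_dat3))["X"])

# Comparison within levels of the confounder
adjusted <- unname(coef(lm(Y ~ X + factor(Sicker), data = wk6_dat3))["X"])

round(c(naive = naive, adjusted = adjusted), 2)
```

With these assumed rows the naive coefficient is positive even though every individual effect is -1, while the coefficient adjusted for \(Sicker\) recovers -1.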
That is why the exchangeability condition is also known as the “no unmeasured confounding” condition. We often refer to “conditional exchangeability” to mean that the potential outcomes for exposed and unexposed groups are exchangeable after adjusting for or conditioning on measured confounders. (We will talk more about adjustment methods over the next two weeks.)
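In the notation used earlier for exchangeability, the conditional version can be sketched by adding a conditioning set. Here \(C\) is a symbol introduced only for illustration; it stands for the measured confounders (for example, \(Sicker\) in the dataset above):

\[(Y_{x = 0}, Y_{x = 1}) \quad \perp \quad X \mid C\]

Read literally: within levels of \(C\), the potential outcomes are independent of the exposure actually received.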
Of course, confounding by indication may also have the opposite implication: People may be given a treatment because they are healthier or judged to be more likely to benefit from it than the typical person with the condition. Organ transplants and fertility treatments are two examples of this. Run the code below to see the previous dataset modified to show this. In this case, however, the treatment has no real effect in anyone, as the \(ICE\) column will show.
wk6_dat4 <- tibble(ID = c(1:10),
X = c(rep(0,5), rep(1,5)),
Healthier = c(0,0,0,0,1, 1,1,1,1,0),
Y_0 = c(3,3,3,3,1, 1,1,1,1,3),
Y_1 = c(3,3,3,3,1, 1,1,1,1,3),
Y = case_when(X == 0 ~ Y_0, X == 1 ~ Y_1),
ICE = Y_1 - Y_0)
wk6_dat4
mdl <- lm(formula = Y ~ X, data = wk6_dat4) %>% summary()
paste("By simple linear regression, the observed association is:", mdl$coefficients[2,1])
mdl_adj <- lm(formula = Y ~ X + factor(Healthier), data = wk6_dat4) %>% summary()
paste("Adjusting for health status, the Average Causal Effect is:", round(mdl_adj$coefficients[2,1], 2))
Do you anticipate the unadjusted analysis to show a benefit (negative effect), harm (positive effect), or no effect? What about the adjusted estimate?
With just these two examples, it becomes clear that you need to know something specific about the context in which you are trying to estimate causal effects. The same sources of confounding bias may result in very different changes to the effect (over- or under-estimating) depending on the clinical context of the disease and typical treatment decisions. This is true for any causal inference effort!
Quick understanding check: If the ICE gives us the true causal effect, why don’t we just compute that every time?
Now that we can better envision how causal inference is enabled by borrowing observed outcomes across exposure groups to serve as substitute potential outcomes, there are in fact three more conditions or assumptions that must be fulfilled in order to reliably estimate causal effects. We will focus on just one of them here and leave the other two (positivity and no interference) to later weeks.
In order for us to directly take the observed outcome from the exposed group as the potential outcome for the unexposed group (or vice versa), it is not sufficient just that the potential outcomes are evenly distributed. We also need to be able to assume that the respective observed outcomes are what would actually happen if we had actively set the exposure level to the observed levels. This condition is known as causal consistency and can be formally represented in the case of a binary exposure as (among other ways):
\[Y(x = 0) = Y_{x = 0} \quad \text{and} \quad Y(x = 1) = Y_{x = 1}\]
That is to say the value of \(Y\) you get when setting \(x = 0\) is identical to the observed \(Y\) for those with observed \(x = 0\) (and so on for \(x = 1\)). Since we cannot actually assign any exposures and observe the consequences, consistency must be justified on the basis of prior scientific knowledge and reasoning about the given study setting. Let’s walk through an example of this consideration.
Remember at the very beginning, we asked what would happen if we had brought lunch, but not actually eaten it. Should we count this as \(x = 0\) or \(x = 1\)? The mechanism that links lunch to remembering meeting details is through actually having eaten something, i.e. having the energy to focus on the meeting. This suggests that we should count bringing a lunch, but not eating it, as \(x = 0\).
An underlying reason for this, even if you have not thought about it in this exact way, is that such an exposure categorization is most likely to be consistent: If we had assigned \(x = 1\) to people who brought a lunch but didn’t eat it, it is unlikely their performance in meetings would be what you would expect had they actually eaten their lunch. That is, the mapping of these people’s outcomes (\(Y_{x = 1}\)) as counterfactual potential outcomes for people with \(x = 0\) would be invalid.
Consider an effect estimate that might arise if you categorized “bringing lunch, eaten or not” as the exposed group (\(Bad_X = 1\)). Note that the last two rows are categorized as \(Bad_X = 1\) even though \(Ate = 0\). Therefore, the observed \(Y\) for these individuals is actually \(Y_{x = 0}\), since the active ingredient, eating a lunch, is actually missing.
Do you think this error would result in an over or underestimate of effect relative to the true effect size of 0.25?
wk6_dat5 <- tibble(ID = c(1:10),
Bad_X = c(rep(0,5), rep(1,5)),
Ate = c(rep(0,5),1,1,1,0,0),
Y_0 = c(rep(0.25,10)),
Y_1 = rep(0.5,10),
Y = case_when(Ate == 0 ~ Y_0, Ate == 1 ~ Y_1))
wk6_dat5
mdl <- lm(formula = Y ~ Bad_X, data = wk6_dat5) %>% summary()
paste("Using an inconsistent exposure of 'bringing lunch', the observed association is:", mdl$coefficients[2,1])
## [1] "Using an inconsistent exposure of 'bringing lunch', the observed association is: 0.15"
mdl_adj <- lm(formula = Y ~ Ate, data = wk6_dat5) %>% summary()
paste("Using a consistent exposure of 'eating lunch', the Average Causal Effect is:", round(mdl_adj$coefficients[2,1], 2))
## [1] "Using a consistent exposure of 'eating lunch', the Average Causal Effect is: 0.25"
On the other hand, we could imagine that people who were able to eat some food prior to the meeting are equally likely to benefit, whether they bought a snack or brought food from home. Thus, we could reason that a categorization where anyone who ate food prior to the meeting is assigned \(x = 1\) and anyone who didn’t is assigned \(x = 0\) is a reasonable exposure measure with consistent effects for the purposes of this effect estimation.
Moreover, by defining a clearer, consistent exposure of “eating lunch,” we can better exclude other correlated factors outside the mechanism of interest.
For example, let’s say we stuck with an exposure definition of “bringing lunch or not” specifically because we think that it is an indicator of people who have better memory, organizational skills, or planning, and these are actually the main causes of better meeting performance. But some people might bring lunch because it was packed for them or because they purchased it at the food court. The observed outcomes might then not represent well the benefit of, e.g., having better memory or planning skills. Either way, we are stuck, because there is too much heterogeneity in how “bringing lunch” might be related to meeting performance (poor consistency), so we could never be confident that our estimated associations are equivalent to what might happen if, say, we were to hand someone a bagged lunch just before a meeting.
On the other hand, defining the exposure as “eating lunch” specifically narrows down the mechanism of interest to consumption of food. In this case, it is clearer that better memory or organizational skills are confounders of this relationship, and not part of the exposure of interest itself. Consequently, while we can’t now estimate the effects of better memory, we are more precisely estimating the effect of providing food just before a meeting.
When we are thinking about specifying a clear, consistent causal exposure we want to estimate effects for, we commonly run into the problem of deciding what to do with other interesting third variables (that are not the exposure or outcome). Above, we showed that when we define an exposure in terms of a specific mechanism, we identify factors or variables we aren’t interested in, for example characteristics of people that might lead them to have better retention at the meeting regardless of whether or not they had lunch. Those kinds of variables, which we previously thought of as alternate explanations of the outcome, we now understand to be sources of imbalance of potential outcomes between exposure groups. We know to treat them as confounders and to remove this imbalance by adjustment and other techniques (next few weeks).
Free food for those who come to the meeting early.
On the other hand, what about a factor like “did the person come to the meeting room early”? Perhaps those who didn’t eat lunch were more likely to go to the meeting room early, start chatting with the other attendees, and start thinking about agenda items? We could reasonably expect that the latter people might retain even more than those who ate on their own.
How do we deal with this variable, which is associated with both the exposure (eating lunch) and outcome (retention of meeting info)? Is it a confounder that we should adjust for?
Importantly, let’s assume again that we are interested specifically in the effects of eating lunch prior to the meeting and that (after necessary adjustments) people who are likely to benefit from eating lunch (i.e. their baseline potential outcomes) are evenly distributed across the main exposure groups \(x = 0\) (no lunch) and \(x = 1\) (had lunch). Does this change your thinking?
If we are interested in the overall effect of a consistent exposure and baseline potential outcomes are balanced, we are done, no further action is necessary to estimate an unbiased Average Causal Effect of that exposure!
That being said, it seems like we have important information about what may happen after the main exposure that has a bearing on the outcome. That is, some people might experience a second exposure that influences the outcome. In this particular case, we might be interested in this second exposure because it may weaken the association between exposure and outcome. If we wanted to translate our finding here to other contexts where this second exposure doesn’t occur, e.g. to Zoom meetings where no one has the ability to show up early to chat, the effect might be underestimated.
Here “early attendance” is a mediator of the effect of (not) eating lunch prior to the meeting; that is, it explains (away) part of the total effect of eating lunch. Let’s see how this might work in numbers by adding a new variable \(Early\) representing early attendance, first looking at the overall ACE:
## lm(formula = Y ~ X, data = wk6_dat6)
## [1] "By linear regression, the Average Causal Effect is: 0.25"
Then, let’s break down the relationships between variables in the sequence in which they would occur in real life.
First, looking at how eating lunch is related to attending early (using a log-binomial regression to produce a risk ratio):
## glm(formula = Early ~ X, family = binomial(link = "log"), data = wk6_dat6)
## [1] "Eating lunch before the meeting is related to 0.67 times the rate of early attendance (compared with not eating lunch)."
As expected, those who ate lunch were less likely to attend early. Then, let’s look at the relationship between early attendance and the outcome:
## lm(formula = Y ~ Early, data = wk6_dat6)
## [1] "Attending early was associated with a 0.13 unit increase in retention at the meeting."
Since those who did not eat lunch were more likely to attend early, and early attendance was associated with greater retention, this explains why the benefit of eating lunch was not as large as anticipated. If counter to the fact, we could make lunch eaters and non-eaters attend early at the same rate (for example, in the Zoom meeting), we ought to see a larger effect associated with eating lunch:
## [1] "In a population where we could set similar proportions of early attendance, the Average Causal Effect of eating lunch is: 0.29."
## [1] "This is 0.04 units greater than the observed ACE (0.25), due to the fact that in this population, non-lunch eaters are more likely to benefit from early attendance."
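The code and data behind these outputs are not shown in these notes. The sketch below uses a separate, hypothetical dataset (the name `med`, the 12 rows, and the effect sizes 0.30 and 0.12 are all assumptions), so its numbers are not expected to match the outputs above. It only illustrates the same pattern: the total effect of lunch is smaller than the effect with early attendance held fixed, because non-eaters attend early more often.

```r
# Hypothetical data (not the dataset used above): lunch adds 0.30 to retention,
# attending early adds 0.12, and non-eaters attend early more often.
med <- data.frame(
  X     = c(rep(0, 6), rep(1, 6)),
  Early = c(1, 1, 1, 0, 0, 0,  1, 1, 0, 0, 0, 0)
)
med$Y <- 0.25 + 0.30 * med$X + 0.12 * med$Early

# Total effect of lunch, leaving early attendance to vary as it naturally does
total <- unname(coef(lm(Y ~ X, data = med))["X"])

# Risk ratio of early attendance, lunch vs. no lunch
rr_early <- mean(med$Early[med$X == 1]) / mean(med$Early[med$X == 0])

# Effect of lunch holding early attendance fixed
fixed <- unname(coef(lm(Y ~ X + Early, data = med))["X"])

round(c(total = total, rr_early = rr_early, fixed = fixed), 2)
```

By construction this toy dataset has no confounders of the mediator and outcome, which is what makes the second regression interpretable here; real data rarely offer that guarantee.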
Some might also consider this to be a moderator, since the third variable reduces the effect of the exposure. However, this is part of a longer discussion of mediation and moderation that we will pick up later.
- Clarity in the exposure definition including timing is critical to understanding how a variable should be treated (exposure, confounder, mediator).
- Here we had to refine our exposure definition to say we were specifically interested in the exposure prior to attending the meeting.
- Therefore, we can consider early attendance, and anything that happened upon joining, as occurring after our main exposure.
- Say for example, those who show up early also order a pizza together and end up having food anyway.
- If we are clear in defining our exposure, we are ok, because we know that this event is separate and may be a consequence of our main exposure.
- Therefore, we would not consider people who ordered pizza after as \(x = 1\) because that event passed.
- If we had left our original definition of “eating lunch,” we would not know how to separate this event.
- Looking at the data table for the different scenarios of confounder, misclassification, and mediator, there is not a lot of difference between the data structures.
- This is by design and reflective of reality. The data (and statistics) do not tell us how we should analyze the data, for example whether we should adjust for a variable.
- This determination comes exclusively from the story we can tell about the relationship between variables.
- In the case where the third variable comes after the exposure in sequence, there may not necessarily be cause to adjust for this variable.
- Even if the adjusted effect is different from the unadjusted, it does not mean the unadjusted is biased!
- In fact, great care should be taken with mediators and it is often safer not to adjust for them. We will come back to this.
- If we cannot rely on the data itself to tell us how to design our study, we must use another approach.
- This is where directed acyclic graphs come in.
Causal diagrams, specifically directed acyclic graphs or DAGs, help us use pictures to diagnose three important concerns in effect estimation:
1. Confounding factors that influence the distribution of potential outcomes (*e.g.* responder types) relative to our main exposure variable
2. The temporal sequence of variables, *i.e.* whether variables are confounders or mediators
3. The inter-relationship between these variables, *e.g.* confounders of confounders
In turn, this plotting serves three practical purposes:
1. Identifying whether there are any important variables, particularly confounders, that are missing from the data
2. Providing a tool for other people to check and challenge your assumptions (including on missing variables)
3. Selecting a proper set of confounders that, once adjusted, will allow you to estimate a causal effect of your exposure
As always, the extent to which this diagramming is successful depends on how clearly the exposure and context have been specified.
Let’s go through a few basic examples.
Node: A variable, the more specific the better
- *Better*: Eating lunch before a weekly meeting
- *Worse*: Lunch
Edge: A line segment, connecting two Nodes
- In a DAG, the edge must be *Directed*, meaning there must be an arrow on exactly one end
- The arrow points into the Node that is caused by the Node at the other end
Here the DAG is showing X causing Y:
ggdag::dagify(Y ~ X, exposure = "X", outcome = "Y") %>% ggdag() + theme_dag()
Like in any causal chain, there may be an infinite number of Nodes that exist on the edge linking the two end Nodes, but they do not need to be put in unless they are mediators of particular interest (for example, “arriving early” in the Lunch example).
Here we can represent the exact same \(X\) and \(Y\) relationship as above, but call out a few mediators of that relationship:
ggdag::dagify(Y ~ M_3, M_3 ~ M_2, M_2 ~ M_1, M_1 ~ X, exposure = "X", outcome = "Y") %>% ggdag() + theme_dag()
Confounders, then, are represented as Nodes that influence both an exposure and an outcome. Here we represent two confounders:
- \(C_1\) is a typical confounder of \(X\) and \(Y\)
- \(C_2\) is a less-commonly recognized confounder of \(X\) and \(M_1\)
ggdag::dagify(Y ~ M_1 + X + C_1,
M_1 ~ X + C_2,
X ~ C_1 + C_2,
exposure = "X", outcome = "Y") %>% ggdag() + theme_dag()
One of the many reasons to be cautious with mediation analyses (e.g. adjusting for \(M_1\)) is the challenge posed by exposure-mediator confounders such as \(C_2\).
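To see which variables would need adjusting in the confounder DAG above, one option is to ask `dagitty` for the minimal adjustment set. This is a sketch under the assumption that `dagitty` is installed; the DAG is re-typed from the example above, and the names `g` and `sets` are illustrative.

```r
library(dagitty)

# Same structure as the confounder DAG above
g <- dagitty("dag {
  C_1 -> X
  C_1 -> Y
  C_2 -> X
  C_2 -> M_1
  X -> M_1
  X -> Y
  M_1 -> Y
}")

# Minimal sufficient adjustment set(s) for the total effect of X on Y
sets <- adjustmentSets(g, exposure = "X", outcome = "Y", effect = "total")
print(sets)
```

For the total effect, both \(C_1\) and \(C_2\) appear in the set, while the mediator \(M_1\) does not, since adjusting for it would remove part of the effect we are trying to estimate.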
Path: A path is a trace through the edges between any two nodes, regardless of direction of the arrows
In all but the simplest two node DAG, there are many possible paths. Here are four paths represented by the previous DAG:
ggdag::dagify(Y ~ M_1 + X + C_1,
M_1 ~ X + C_2,
X ~ C_1 + C_2,
exposure = "X", outcome = "Y") %>% ggdag_paths() + theme_dag()
When a path is traced from a Node starting with an arrow going into that Node, such as \(X\) <– \(C_1\) –> \(Y\), this is known as a back-door path.
Each path shows us how we should interpret and treat statistical relations between Nodes in our data.
With our goal of estimating the unbiased effect of \(X\) on \(Y\), we must “block” all back-door paths from \(X\) to \(Y\) by adjustment or stratification. If we can do this, so that the only open paths remaining are the causal ones running forward from \(X\) to \(Y\), then \(X\) and \(Y\) are said to be “d-separated” (directionally separated) along every non-causal path, meaning there are no confounding effects remaining.
As a somewhat oversimplified link to potential outcomes: if \(X\) and \(Y\) are d-separated (except for the \(X\) –> \(Y\) path), individual potential outcomes are probably exchangeable or, where adjustments were needed to achieve d-separation, conditionally exchangeable.
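The path tracing and blocking described above can also be done programmatically. The following is a minimal sketch, assuming the `dagitty` package is installed; the DAG is re-typed from the confounder example, and the names `g`, `unadjusted`, `adjusted`, and `is_backdoor` are illustrative only.

```r
library(dagitty)

# Confounder DAG from the earlier example
g <- dagitty("dag {
  C_1 -> X
  C_1 -> Y
  C_2 -> X
  C_2 -> M_1
  X -> M_1
  X -> Y
  M_1 -> Y
}")

# All paths between X and Y, ignoring arrow direction, with no adjustment
unadjusted <- paths(g, from = "X", to = "Y")

# A back-door path starts with an arrow pointing into X
is_backdoor <- grepl("^X\\s*<-", unadjusted$paths)

print(data.frame(path = unadjusted$paths,
                 backdoor = is_backdoor,
                 open = unadjusted$open))

# The same paths after adjusting for both confounders
adjusted <- paths(g, from = "X", to = "Y", Z = c("C_1", "C_2"))
print(data.frame(path = adjusted$paths, open = adjusted$open))
```

Comparing the two tables shows which paths are closed by the adjustment: the back-door paths should switch from open to blocked, while the paths running forward from \(X\) stay open.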
By representing adjustments in our DAG, we can quickly scan the DAG to see if there are any back-door paths remaining. In fact, software makes this easy for us. Let’s look at a simple relationship with one confounder \(C_1\), leaving out the \(X\) –> \(Y\) path. Specifying that we control for \(C_1\), we can see they are d-separated, therefore an unbiased effect can be estimated:
ggdag::dagify(Y ~ C_1,
X ~ C_1,
exposure = "X", outcome = "Y") %>% ggdag_dseparated(controlling_for = c("C_1")) + theme_dag()
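The same check can be made without a plot. As a sketch, assuming the `dagitty` package is installed, `dseparated()` returns `TRUE` or `FALSE` for a given conditioning set; the names `g_simple`, `sep_unadjusted`, and `sep_adjusted` are illustrative only.

```r
library(dagitty)

# One confounder, no X -> Y edge (as in the plot above)
g_simple <- dagitty("dag {
  C_1 -> X
  C_1 -> Y
}")

# Are X and Y d-separated with no adjustment?
sep_unadjusted <- dseparated(g_simple, "X", "Y", list())

# Are X and Y d-separated once we condition on C_1?
sep_adjusted <- dseparated(g_simple, "X", "Y", "C_1")

print(c(unadjusted = sep_unadjusted, adjusted = sep_adjusted))
```

Without adjustment the back-door path through \(C_1\) is open, so the first check fails; conditioning on \(C_1\) closes it.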
To take a more complex example, we now have three confounders, one of which, \(U\), is unmeasured, so we cannot adjust for it. In this case, the DAG shows that \(X\) and \(Y\) are not d-separated even after adjusting for \(C_1\) and \(C_2\); therefore, effect estimates will remain confounded.
ggdag::dagify(Y ~ C_1 + C_2 + U,
X ~ C_1 + C_2 + U,
exposure = "X", outcome = "Y") %>% ggdag_dseparated(controlling_for = c("C_1", "C_2")) + theme_dag()
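One way to make the unmeasured status of \(U\) explicit is to declare it latent and ask whether any valid adjustment set exists. This is a sketch assuming the `dagitty` package is installed. Unlike the plot above, it adds an \(X\) –> \(Y\) edge so that there is an effect to identify; that edge and the object names are assumptions made for illustration.

```r
library(dagitty)

# Three confounders of X and Y; the X -> Y edge is added for this sketch
g_u <- dagitty("dag {
  C_1 -> X
  C_1 -> Y
  C_2 -> X
  C_2 -> Y
  U -> X
  U -> Y
  X -> Y
}")

# If all three confounders were measured, adjustment would be possible
sets_if_measured <- adjustmentSets(g_u, exposure = "X", outcome = "Y")
print(sets_if_measured)

# Declare U as unmeasured (latent) and ask again
latents(g_u) <- "U"
sets_with_latent <- adjustmentSets(g_u, exposure = "X", outcome = "Y")
print(length(sets_with_latent))
```

With \(U\) latent, no set of measured variables closes the back-door path through \(U\), so no adjustment set is returned.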
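We can quickly look at a more complex scenario that highlights why we must be cautious when adjusting for mediators (as in mediation analyses). Here \(X\) and \(Y\) have no open back-door paths without any adjustment, because the third variable \(W\) is not a true confounder: the only non-causal path, \(X\) –> \(M_1\) <– \(W\) –> \(Y\), is blocked at \(M_1\). \(X\) and \(Y\) are connected only through the causal path \(X\) –> \(M_1\) –> \(Y\):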
ggdag::dagify(Y ~ M_1 + W,
M_1 ~ X + W,
exposure = "X", outcome = "Y") %>% ggdag_dseparated() + theme_dag()
However, look what happens when we adjust for the mediator \(M_1\), as we might in a mediation analysis. \(M_1\) here is what is known as a collider: a node into which two (or more) arrows point. When colliders are adjusted for, they introduce associations that did not exist previously. In this case, the adjustment induces an association between \(X\) and \(W\), which opens the non-causal path \(X\) –> \(M_1\) <– \(W\) –> \(Y\), so \(X\) and \(Y\) are now connected through \(W\). Thus, in attempting a mediation analysis, a previously unbiased effect estimate is now biased! We will come back to this in future weeks.
ggdag::dagify(Y ~ M_1 + W,
M_1 ~ X + W,
exposure = "X", outcome = "Y") %>% ggdag_dseparated(controlling_for = c("M_1")) + theme_dag()
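A small simulation can make the collider problem concrete. This is a sketch in base R under assumed values: linear relationships, all coefficients set to 1, standard normal noise, and an arbitrary seed and sample size. None of these come from the example above beyond the DAG structure itself.

```r
set.seed(2024)  # arbitrary seed, for reproducibility only
n <- 20000      # arbitrary sample size

# Simulate data that follow the DAG: X -> M_1 -> Y, W -> M_1, W -> Y
X   <- rnorm(n)
W   <- rnorm(n)
M_1 <- X + W + rnorm(n)
Y   <- M_1 + W + rnorm(n)  # X affects Y only through M_1

# Unadjusted model: estimates the total effect of X on Y (true value 1)
total_effect <- unname(coef(lm(Y ~ X))["X"])

# Adjusting for the collider M_1: the X coefficient is meant to capture a
# direct effect (true value 0) but picks up the opened path through W
x_given_m <- unname(coef(lm(Y ~ X + M_1))["X"])

print(round(c(total = total_effect, adjusted_for_M_1 = x_given_m), 2))
```

Under these assumed coefficients, the unadjusted estimate should land near the true total effect of 1, while the estimate adjusted for \(M_1\) should land near -0.5 even though \(X\) has no direct effect on \(Y\) in the simulated data.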
Now let’s consider our Lunch example. We have four variables or Nodes to consider:
- \(Lunch\) - Having lunch before the meeting
- \(Early\) - Arrived at the meeting early
- \(Organized\) - Whether a person is typically organized and a good planner
- \(Remember\) - The amount of information that is retained from the meeting
Use ggdag to place these variables in a DAG, and show what adjustment(s) are needed if any to estimate the effect of \(Lunch\) on \(Remember\).
Now that we’ve discussed the importance of exposure specificity and a clear context when proposing a causal research question, the quantitative reasons for doing so should be clearer. To review, let’s expand upon a few key examples from your readings.
Obesity: While extremely high weights are generally associated with many adverse health outcomes, this does not directly translate into good predictions of what would happen if you were to intervene on individual weight status. Certainly, having underweight people gain 10 pounds gradually over a year is different from having an obese person gain 10 pounds in a month. Losing 50 pounds immediately through bariatric surgery will certainly have different effects than the same reduction by medication, gradual diet and exercise, or as a consequence of advanced cancer.
Consequently, a weight or BMI measure itself does not form the basis of a consistent exposure. That said, a great deal of progress can be made towards clearer consistency and consideration of confounders by defining a specific population or context to which the investigation should apply (for example, healthy adults with BMI in the healthy range living in large US cities) and then defining the exposure with relatively clear timing and dose information (for example, jogging 1-3 times per week for at least 6 months).
Taking a specific exposure with respect to context, population, and timing, we can then map out the variables that may lead to differences in potential outcomes and decide on an adjustment set that leads to exposure and outcome being d-separated (allowing us to estimate an unbiased effect).
Gentle introduction to DAGs using the ‘ggdag’ package (Barrett. CRAN. 2023 May). Introduction to ‘ggdag’ package itself (Barrett. CRAN. 2023 May 28).
Introduction to the ‘dagitty’ package that ‘ggdag’ is built on (Textor, et al. IJE. 2016).
Threats to validity in social science studies can also be represented as DAGs (Matthay and Glymour Epidemiology. 2020).
A formal approach to incorporate evidence synthesis into drawing DAGs (Ferguson, et al. IJE. 2020).
A comprehensive discussion of “selection bias” using DAGs (Lu, et al. Epidemiology. 2022).