Today we will review the causal conditions of exchangeability and consistency, show that the distinctions between confounding, mediation, and collider bias are conceptual (not statistical), and thus reinforce why DAGs are an important tool.
Then we will start in earnest by looking through a dataset: doing basic descriptives and visualization, defining a causal hypothesis, and distinguishing confounders (nuisance variables) from mediators and effect modifiers (variables that play a role in explaining the effect of the exposure of interest).
First, a quick reminder about upcoming assignments and due dates:
- SEP 13 through OCT 11: Project 1 presentations x2 (article + method)
- OCT 11 through DEC 06: Project 2 presentations x3 (question + plan + final)
- Post your slides to Laulima the day after your presentation (at the latest)
- OCT 27 (Fri): DRAFT of analytic plan due
- NOV 17 (Fri): DRAFT of analyses with code due
- DEC 08 (Fri): Final written report
- Post your written reports to Laulima by the end of the day on the due date
The discussion of a paper and related statistical methods (Project 1) will be graded very leniently, as there was a lot of variation in what information was available in the chosen papers and methods. I won’t hold that against you! However, I will have more specific criteria for Project 2, beginning with the presentation of your Causal Hypotheses on October 18th.
For this assignment, I expect each presentation to make an attempt to answer each of the following questions:
Looking ahead to the second presentation on the Analytic Plan, you can also start to answer, if possible:
Don’t forget that the Hawai’i Journal of Health and Social Welfare is accepting original research, with a $500 prize for the best undergraduate, graduate, and staff/faculty papers!
Recall that the main goal of causal inference is setting up the conditions by which we can use exposure group A’s outcomes to represent the (potential) outcomes for exposure group B, and vice versa. This condition is known as exchangeability or, after adjustments, conditional exchangeability. Exchangeability is one intuition behind why we adjust for confounding: to balance out the potential outcomes between exposure groups.
For example, if group A is treated with a drug but is otherwise sicker than group B, their observed outcomes are going to be worse than group B’s would be in the absence of treatment. This is a case of confounding by indication. Even if the treatment is truly beneficial, this would weaken the apparent benefit, potentially even making the treatment appear harmful if the analyses do not account for the sicker group A. Thus, we perform adjustments or stratification to balance the potential outcomes.
We might think that people who are Organized:
We can represent this information in DAG form:
library(ggdag)  # for dagify(), ggdag(), theme_dag(); loads ggplot2 as a dependency

# Organized affects both Lunch and Retention; Lunch affects Retention
lunch_dag <- ggdag::dagify(Retention ~ Lunch + Organized, Lunch ~ Organized, exposure = "Lunch", outcome = "Retention")
ggdag(lunch_dag) + theme_dag() + geom_dag_text(col = "Orange")
Then, we can use DAG rules to see whether adjusting for Organized can balance the potential outcomes between exposure groups, that is, create “conditional exchangeability” (conditional because it depends on adjusting for Organized).
We represent this by first removing the edge from Lunch –> Retention; you can think of this as graphically representing the exchangeability condition we want to have: the observed exposure is independent of its potential outcomes.
Then we represent adjusting for Organized by making Organized a blocking node (shown with a square instead of a circle). This done, we check whether there are any remaining open “backdoor” paths between Lunch and Retention.
If not, Lunch and Retention are considered d-separated (directionally separated), and we should be able to estimate an unbiased effect (if our DAG is correct):
# Remove the Lunch –> Retention edge, block Organized, and check for d-separation
lunch_dag <- ggdag::dagify(Retention ~ Organized, Lunch ~ Organized, exposure = "Lunch", outcome = "Retention")
ggdag_dseparated(lunch_dag, controlling_for = "Organized") + theme_dag() + geom_dag_text(col = "Black")
## [1] "By simple linear regression, the observed association is: 0.25"
## [1] "Adjusting for being Organized, the Average Causal Effect is: 0.21"
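The code behind these two estimates is not shown; below is a minimal sketch of how such a confounded dataset could be simulated and adjusted. All names and effect sizes here are illustrative assumptions, not the course’s actual simulation:

set.seed(42)
n <- 1000
Organized <- rbinom(n, 1, 0.5)                                     # confounder
Lunch <- rbinom(n, 1, 0.3 + 0.4 * Organized)                       # Organized –> Lunch
Retention <- 0.20 * Lunch + 0.15 * Organized + rnorm(n, sd = 0.2)  # both affect Retention
coef(lm(Retention ~ Lunch))["Lunch"]                               # crude estimate: biased upward
coef(lm(Retention ~ Lunch + Organized))["Lunch"]                   # adjusted: recovers ~0.20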
From these data, we can see that those who were Organized indeed had higher retention no matter what: their \(Y_0\) and \(Y_1\) were higher than those of non-Organized people. So the fact that there were more of them in the Lunch = 1 group overestimated the benefit of having Lunch! By adjusting for Organized, we make sure that those potential outcomes are evenly distributed between exposure groups.
However, there is one more important ingredient needed to map the observed outcome in one exposure group to the potential outcome in the other group:
We need to be able to assume that changing the exposure (from one value to another), produces the same predictable effect for anyone and everyone within the given context regardless of how that level of exposure was assigned. In causal terminology, we need the exposure, as it is measured and operationalized, to have consistent effects. Specifically, we must be confident that the potential outcomes we observe in the exposed group would be exactly the same if we had assigned the unexposed group the same exposure level.
Causal consistency can be formally represented in the case of a binary exposure as:
\[Y(x = 0) = Y_{x = 0} \quad \text{and} \quad Y(x = 1) = Y_{x = 1}\]
Meaning, the \(Y\) that is observed for those with exposure level \(x = 1\) is the same as the \(Y\) that would be observed if you had assigned \(x = 1\) by some form of intervention (and similarly for \(x = 0\)).
Last week, we used an example of an inconsistent exposure: whether Lunch was Brought (\(x = 1\)) or Not (\(x = 0\)) to the Meeting. We showed how this was not a consistent exposure: in certain cases people could have brought their lunch but not eaten it, so we couldn’t distinguish between, for example, people who were \(x = 1\) but didn’t have time to eat their lunch (should have no effect) and people who were \(x = 1\) and ate it (would probably have an effect). In other words, some of the people who were observed to have \(x = 1\) actually had the outcomes that would have happened if they were assigned \(X = 0\); thus the exposure definition violated the consistency assumption. This resulted in a biased effect estimate.
## lm(formula = Y ~ Bad_X, data = wk6_dat5)
## [1] "Using an inconsistent exposure of 'bringing lunch', the observed association is: 0.15"
## lm(formula = Y ~ Ate, data = wk6_dat5)
## [1] "Using a consistent exposure of 'eating lunch', the Average Causal Effect is: 0.25"
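As a hedged sketch, here is one way data like wk6_dat5 could be simulated; the probabilities are my own illustrative assumptions. Only eating has a causal effect, while “bringing lunch” (Bad_X) misclassifies some eaters and non-eaters:

set.seed(7)
n <- 1000
Ate <- rbinom(n, 1, 0.5)                                         # the consistent exposure
Bad_X <- ifelse(Ate == 1, rbinom(n, 1, 0.7), rbinom(n, 1, 0.3))  # 'brought lunch', misclassified
Y <- 0.25 * Ate + rnorm(n, sd = 0.2)                             # only eating affects Y
coef(lm(Y ~ Bad_X))["Bad_X"]                                     # attenuated, biased estimate
coef(lm(Y ~ Ate))["Ate"]                                         # recovers the true effect, ~0.25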
Alternatively, someone could have had \(X = 0\) because they ate earlier and didn’t bring anything, and thus we would have observed the outcomes as if they were actually exposed to the causal exposure \(X = 1\). Therefore, we have people who could be assigned the same level of \(X\) but have very different sets of potential outcomes. In a way, this exposure misclassification is a fairly obvious and trivial example of an inconsistent exposure.
The literature on BMI and obesity has even better examples:
Imagine two individuals with the same reduction in BMI, from 28 to 20 \(kg/m^2\). One arrived at it through diet and exercise, and thus should have a set of potential outcomes that reflect a health benefit. For the other, the weight loss reflects advanced progression of cancer, and thus a set of potential outcomes that likely reflect declining health. The context of weight loss is therefore an important part of understanding whether changes in exposure level are causally consistent.
The two main take-away points about consistent exposures are:
- You need to be as clear as possible about the mechanism by which your specific exposure has an effect, and
- understand how levels of that exposure are assigned in the context you are studying
Free food for those who come to the meeting early.
Once we have developed a clear story about exactly what our exposure is (and isn’t), it becomes much clearer what to do with other “explanatory” variables. Basically, there are factors that affect the outcome (and potential outcomes) which are not affected by the exposure (confounders), and others that are. And unless we have a strong reason to believe otherwise, we can assume that most factors that temporally follow the exposure could be influenced by it.
We ended up clarifying that the main exposure of interest was “eating lunch prior to arriving at the meeting (yes/no).” Let’s also clarify that we measure the outcome by quizzing every participant right after the conclusion of the meeting. Therefore, any event between eating lunch (when the exposure is measured) and the end of the meeting (when the outcome is measured) could be considered a mediator. Let’s consider a potential event like “arriving at the meeting room early, prior to the start.”
Perhaps those who didn’t eat lunch were more likely to go to the meeting room early, start chatting with the other attendees, and start thinking about agenda items? We might then imagine that those earlier interactions made people retain more information than those who did not show up early. That is to say, we are dealing with a causal system like so:
# Early is a mediator: Lunch affects Early, which in turn affects Retention
lunch_dag <- ggdag::dagify(Retention ~ Lunch + Early, Early ~ Lunch, exposure = "Lunch", outcome = "Retention")
ggdag(lunch_dag) + theme_dag() + geom_dag_text(col = "Orange")
How do we deal with this variable, which is associated with both the exposure (eating lunch) and the outcome (retention of meeting info)?
Should we adjust for it? Note that there are two paths that lead from Lunch to Retention:
# Enumerate every path from Lunch to Retention
lunch_dag <- ggdag::dagify(Retention ~ Lunch + Early, Early ~ Lunch, exposure = "Lunch", outcome = "Retention")
ggdag_paths(lunch_dag) + theme_dag() + geom_dag_text(col = "Black")
Importantly, let’s assume that we are interested specifically in the effects of eating lunch prior to the meeting and that (after necessary adjustments) people who are likely to benefit from eating lunch (i.e. their baseline potential outcomes) are evenly distributed across the main exposure groups \(x = 0\) (no lunch) and \(x = 1\) (had lunch). Does this change your thinking?
If we are interested in the overall effect of a consistent exposure, that means that we include any potential reasons that Lunch influences Retention, including through sequences of events or “paths” that oppose each other (such as via Early attendance).
That means we can treat all forward paths as one total effect. Recall that an edge may actually include any number of potential mediators and mediating paths, but if you are not particularly interested in them, you do not need to include them as nodes or separate edges.
Thus, in this case, if we are just interested in the total effect of Lunch –> Retention, we can collapse the forward path \(Lunch\) –> \(Early\) –> \(Retention\) into the total effect edge and represent the same DAG simply as:
# Collapse the mediated path into a single total-effect edge
lunch_dag <- ggdag::dagify(Retention ~ Lunch, exposure = "Lunch", outcome = "Retention")
ggdag_paths(lunch_dag) + theme_dag() + geom_dag_text(col = "Black")
Additionally, this DAG tells us that there are no confounders (that is, no backdoor paths); therefore the baseline potential outcomes are balanced. Hence, all we have to do to estimate an unbiased Average Causal Effect of Lunch is ignore the variable Early!
That being said, it seems we have additional contextual information about a second exposure that follows the main exposure and, in turn, affects the outcome. In this particular case, we might be especially interested in this second exposure because it may weaken the association between the main exposure and the outcome.
If we wanted to translate our finding here to other contexts where this second exposure doesn’t occur, e.g. to Zoom meetings where no one has the ability to show up early to chat, we might want to estimate just the effect that excludes the possibility of showing up to meetings early. In particular, because that effect might be larger! That is, we might want just the effect represented by Path 2 in the diagram above. This requires an effect decomposition analysis that breaks down the effect into mediated and non-mediated paths.
Let’s take a look at the relationships between the variables to get a sense of how these effects operate in our specific example. First, as a baseline we look at the overall ACE or total effect:
## lm(formula = Y ~ X, data = wk6_dat6)
## [1] "By linear regression, the Average Causal Effect is: 0.25"
Then, let’s break down the relationships between the variables in the sequence in which they would occur in real life.
First, looking at how eating lunch is related to attending early (using a log-binomial regression to produce a risk ratio):
## glm(formula = Early ~ X, family = binomial(link = "log"), data = wk6_dat6)
## [1] "Eating lunch before the meeting is related to 0.67 times the rate of early attendance (as not eating lunch)."
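For reference, that risk ratio comes from exponentiating the coefficient of the fitted log-binomial model; a short sketch, assuming the wk6_dat6 data frame used above:

fit_rr <- glm(Early ~ X, family = binomial(link = "log"), data = wk6_dat6)
exp(coef(fit_rr)["X"])  # risk ratio, ~0.67: lunch eaters attend early less often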
As expected, those who ate lunch were less likely to attend early. Next, let’s look at the relationship between early attendance and the outcome:
## lm(formula = Y ~ Early, data = wk6_dat6)
## [1] "Attending early was associated with a 0.13 unit increase in retention at the meeting."
Since those who did not eat lunch were more likely to attend early, and early attendance was associated with greater retention, this explains why the benefit of eating lunch was not as large as anticipated. If, counter to fact, we could make lunch eaters and non-eaters attend early at the same rate (for example, in a Zoom meeting), we ought to see a larger effect of eating lunch:
## [1] "In a population where we could set similar proportions of early attendance, the Average Causal Effect of eating lunch is: 0.29."
## [1] "This is 0.04 units greater than the observed ACE (0.25), due to the fact that in this population, non-lunch eaters are more likely to benefit from early attendance."
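One simple way to obtain an estimate like this, valid here only because there is no mediator-outcome confounding and assuming no exposure-mediator interaction, is to add the mediator to the regression. This is a sketch of a controlled direct effect, not necessarily the exact method used to produce the output above:

coef(lm(Y ~ X + Early, data = wk6_dat6))["X"]  # effect of Lunch not through Early, ~0.29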
A note of strong caution: once we start considering adjusting for variables that follow the exposure, such as mediators, we run into increasingly complex challenges in causal interpretation and statistical model fitting. In this case, we have set up the data so there are no problems with this method and its interpretation. As shown by the DAG, there are only three variables and, importantly, no measured or unmeasured confounders. However, in the next section we introduce one of the many problems you may run into when adjusting for variables that follow (whether you know it or not) your exposure of interest. And we will come back to the specific issue of adjusting for mediators.
Now let’s consider another factor that follows the exposure: “filling out a suggestion form for future meeting agenda items” (Filled_Form). In this case, we aren’t sure whether it affects the outcome or is instead affected by it. For example, if it were collected prior to the conclusion of the meeting, it could be a mediator, since it might jog people’s memories about what was discussed. In this case, we find out that the suggestion form was sent by email an hour after the conclusion of the meeting. Should we adjust for this variable? After all, it might explain some of the differences in the outcome.
Again, let’s consider the effect size without adjustment:
## lm(formula = Y ~ X, data = wk7_dat1)
## [1] "By linear regression, the Average Causal Effect is: 0.25"
Then, look at the relationship between exposure and Filled_Form:
## glm(formula = Filled_Form ~ X, family = binomial(link = "log"),
## data = wk7_dat1)
## [1] "Eating lunch before the meeting is associated with 0.67 times the rate of filling the form (as not eating lunch)."
If you ate lunch, you were less likely to fill out the suggestion form; maybe you didn’t arrive early enough to hear the reminder to look out for it.
Then, the relationship between Filled_Form and the amount retained. We can look at this with Filled_Form as the independent variable (linear regression on the continuous \(Y\)) or as the dependent variable (logistic regression on the binary Filled_Form):
## lm(formula = Y ~ Filled_Form, data = wk7_dat1)
## [1] "Filling the form was associated with a 0.13 unit increase in information retention from the meeting."
## glm(formula = Filled_Form ~ Y, family = binomial(link = "logit"),
## data = wk7_dat1)
## [1] "A 1 unit increase in retention was associated with 82.79 times the odds of filling the form."
By both models, higher scores were positively related to filling out the form. Knowing that the form was sent after the conclusion of the meeting and after the measurement of the outcome, perhaps remembering more details from the meeting made people more willing to complete the suggestion form.
Now let’s look at the estimate adjusted for Filled_Form:
## [1] "In a population where we could set similar proportions of completed suggestion forms, the Average Causal Effect of eating lunch is: 0.29."
## [1] "This is 0.04 units greater than the observed ACE (0.25), due to the fact that in this population, non-lunch eaters and higher retainers are more likely to complete the suggestion form."
Is this adjusted value the correct causal effect of eating lunch on retention?
NO! In this case, by adjusting for a consequence of both the exposure and the outcome, we are actually biasing the effect by comparing groups that are now too unalike.
First, some quick intuition about how adjustments work: in a regression model, we assume that the effect sizes estimated within groups with the same (or similar) values of the adjustment variable give you the true effect size. So, associations between exposure and outcome are estimated within categories of the third variable and then averaged together (the same logic applies whether you are doing stratification or regression-based adjustment).
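As a sketch of that stratify-then-average logic, using a hypothetical data frame dat with exposure X, outcome Y, and adjustment variable Z:

strata <- split(dat, dat$Z)                                            # one subset per level of Z
effects <- sapply(strata, function(s) coef(lm(Y ~ X, data = s))["X"])  # within-stratum effects
weights <- sapply(strata, nrow) / nrow(dat)                            # weight by stratum size
sum(effects * weights)  # close to (though not identical to) the X coefficient from lm(Y ~ X + Z)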
Here, by comparing exposure to outcome within levels of Filled_Form, we are picking very specific people to compare: for example, among Filled_Form == 1, you are more likely to be exposed AND have higher outcome scores (to counteract the fact that being exposed is related to a lower chance of filling out the form). Conversely, among Filled_Form == 0, you are more likely to be unexposed AND have lower outcome scores (to counteract the fact that being unexposed is related to a higher chance of filling out the form). By grouping people by an event that follows both the exposure and the outcome, we have created a false relationship between exposure and outcome.
The consequence is that the resulting effect size is exaggerated, because the people being compared end up being the exposed with higher scores and the unexposed with lower scores. Think on this logic for a bit, and if you are still not convinced the results are biased, note also that any event that follows the outcome should not (and cannot be allowed to) change the effect of the exposure on the outcome!
If you don’t follow that logic (we’ll come back to other examples), there is yet another way to figure out what to do:
Make a DAG and follow the DAG “rules”!
Here we know that Filled_Form follows, and is affected by, both the exposure and the outcome.
So the DAG would look like this:
# Filled_Form is a consequence (child) of both Lunch and Retention
lunch_dag <- ggdag::dagify(Retention ~ Lunch, Filled_Form ~ Lunch + Retention, exposure = "Lunch", outcome = "Retention")
ggdag(lunch_dag) + theme_dag() + geom_dag_text(col = "Orange")
We can see that Filled_Form is a particular type of node called a collider, where there are at least two arrows pointing inward at it.
When we adjust for a collider, we create a false (or biasing) relationship between the sources of the arrows pointing at the collider node (its parents). This can be shown automatically if we draw the adjustment with our software:
# Omit the Lunch –> Retention edge to show the false dependence created by the adjustment
lunch_dag <- ggdag::dagify(Filled_Form ~ Lunch + Retention, exposure = "Lunch", outcome = "Retention")
ggdag_dseparated(lunch_dag, controlling_for = "Filled_Form") + theme_dag() + geom_dag_text(col = "Black")
Imagine that there are two truly independent events that happen at random: a bell ringing and a light turning on.
However, you know of an event, a door opening, that depends on both of these events.
If only the bell rings, the door stays closed.
If only the light turns on, the door stays closed.
If both the bell rings AND the light turns on, the door will open.
Similarly, if both the bell is silent AND the light is off, the door will open.
So, if you see the door open and the bell is silent, you know that the light is…
And if you see the door open and you see the light on, you know that the bell is…
By conditioning on a consequence of two independent events, you create a false observed dependency between them.
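We can check this numerically with a quick simulation of the bell, the light, and the door (my own illustrative code, not part of the original example):

set.seed(1)
n <- 10000
Bell <- rbinom(n, 1, 0.5)
Light <- rbinom(n, 1, 0.5)
Door <- as.integer(Bell == Light)       # opens when both happen or neither does
cor(Bell, Light)                        # ~0: truly independent
cor(Bell[Door == 1], Light[Door == 1])  # 1: a perfect, false dependence once we condition on the door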
This false observed dependence is the basis of Berkson’s bias. A recent example: studies relying on hospital samples have many people who were sick for reasons other than COVID-19, and those without COVID-19 included disproportionately more smokers. Thus, it appeared that smoking was related to lower rates of COVID-19 (but only in this selected, hospitalized population).
# Hospitalization is a collider between COVID-19 and smoking; conditioning on it opens a biasing path
ggdag::dagify(Hospitalized ~ COVID19 + Smoking, exposure = "Smoking", outcome = "COVID19") %>% ggdag_dseparated(controlling_for = "Hospitalized") + theme_dag() + geom_dag_text(col = "Black")
For years, studies were interested in estimating how much maternal smoking contributed to increased infant mortality. But researchers knew that smoking also reduced birth weight, which itself increases the risk of mortality. So, they wanted to conduct a sort of mediation analysis in which they adjusted for low birth weight.
When they did this, in study after study, they found that maternal smoking, adjusted for low birth weight (LBW, birth weight < 2500 grams), was associated with reduced infant mortality. After much reasoning, it was realized that birth weight was a collider as it is affected by both smoking (the main exposure) and many other diseases and conditions that have their own strong effects on infant mortality. However, those conditions often go unmeasured / unconsidered in basic analyses.
In a sense, they pictured the relationship like so, which would have Smoking and Mortality d-separated (an unbiased effect):
# The assumed DAG: smoking affects mortality only through LBW, so adjusting for LBW looks safe
ggdag::dagify(Mortality ~ LBW, LBW ~ Smoking, exposure = "Smoking", outcome = "Mortality") %>% ggdag_dseparated(controlling_for = "LBW") + theme_dag() + geom_dag_text(col = "Black")
However, the true relationships were more like this, where \(U\) represents unmeasured sources of infant morbidity:
# The true DAG: U affects both LBW and mortality, making LBW a collider when adjusted for
ggdag::dagify(Mortality ~ LBW + U, LBW ~ Smoking + U, exposure = "Smoking", outcome = "Mortality") %>% ggdag_dseparated(controlling_for = "LBW") + theme_dag() + geom_dag_text(col = "Black")
By adjusting for the mediator LBW, they introduced a false relationship between smoking and these unmeasured health factors. As in the light bulb and bell example, infants with low birth weight whose mothers smoked were less likely to have the other conditions \(U\) that result in mortality. Therefore, within the low-birth-weight group, smoking would appear to protect against mortality.
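A small simulation, with illustrative coefficients of my own choosing, reproduces the paradox; smoking affects mortality only through LBW, while unmeasured morbidity U affects both LBW and mortality:

set.seed(2)
n <- 100000
Smoking <- rbinom(n, 1, 0.3)
U <- rbinom(n, 1, 0.1)                                     # unmeasured infant morbidity
LBW <- rbinom(n, 1, plogis(-2 + 1.5 * Smoking + 3 * U))
Mortality <- rbinom(n, 1, plogis(-4 + 1.5 * LBW + 3 * U))
coef(glm(Mortality ~ Smoking, family = binomial))["Smoking"]        # positive: harm via LBW
coef(glm(Mortality ~ Smoking + LBW, family = binomial))["Smoking"]  # shrinks toward zero or flips negative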
Adjusting for covariates that are caused by the exposure and the outcome is known as collider stratification bias.
- This bias arises specifically because the comparisons made within levels of the collider are not fair, in that they do not represent groups that could occur in real life.
- Selecting samples based on their values of a collider induces the same bias: you can think of selection as a stratification adjustment that just takes the results of one stratum (i.e., those with the value of the collider you selected on).
DAGs can help quickly identify where there might be a problem of collider bias, as long as you are honest about unmeasured variables!
Be especially careful of adjusting for mediators (really, don’t do it)!
In all of our Lunch Meeting examples showing the effects of confounding, inconsistent exposure, mediation, and collider bias, the data and effect estimates were nearly identical. In fact, you could not tell from the estimate itself which one was correct or unbiased.
This was deliberate.
Perhaps the most important concept about causal inference is that there is no way to tell from data alone which is the right model.
Any set of data and correlations can arise from a near-infinite set of relationships. Associations between variables (or their absence) can be explained by any number of alternative hypotheses.
Statistical procedures that produce a biased effect look exactly the same as those that produce the correct effect. Let’s take a look at visualizations of each of these adjustments to illustrate this point further.
This should not be frightening (though maybe a little humbling). Instead, it should motivate you to worry less about statistical relationships and to structure your analysis based on the story (i.e., the background knowledge and theory).
This is why DAGs are so important to help structure what we know (and possibly don’t know) about how the variables should be related to one another.
Let’s think through all the information we have gathered about the different variables over the course of our Lunch Meeting example. Then we assemble our DAG, starting most simply with the temporal sequence (always a good place to start) and then putting in how different variables are thought to influence one another. Once again, since we are trying to estimate the Lunch –> Retention effect, we will omit that edge from the DAG:
Using DAG rules, we can quickly find the minimal adjustment set to estimate the unbiased effect of Lunch on Retention, avoiding unnecessary or dangerous adjustments:
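With ggdag, the minimal adjustment set can also be derived automatically; a sketch, assuming the fully assembled DAG from above is stored as full_dag:

ggdag_adjustment_set(full_dag) + theme_dag()  # plot the minimal adjustment set(s)
dagitty::adjustmentSets(full_dag)             # or list them as text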
Using DAG rules, we can also show where efforts such as mediation analyses can introduce problems:
Seeing the bias that is introduced by adjusting for the collider \(Early\), the DAG also shows us that the solution for producing an unbiased estimate is to also adjust for \(Organized\)!
Exposure to “forever chemicals” (per- and poly-fluoroalkyl substances; PFAS) may cause a number of adverse health outcomes, including high blood pressure, as they have structures similar to fatty acids and may have similar biological functions (hormonal; cell-signaling) but, unlike fatty acids from food, are not easily broken down.
Furthermore, they may have a role to play in ethnic and socioeconomic health disparities, due to differences in exposure and differences in effect. Differences in exposure may arise from different diets and environmental exposures due to historical and structurally discriminatory practices (e.g. less safe food and water) and differences in effect may arise from differences in susceptibility due to factors such as food insecurity or nutritional state.
This project will:
Because of the cross-sectional nature of the data source, we will have to make, and attempt to test where possible, some fairly strong assumptions about the relationships between the variables, and we will talk through them as we go along. First and foremost, we will have to assume that PFAS exposure reflects similar exposure periods across individuals and that those exposures precede the outcome of interest.
This assumption may be more tenable for short- and medium-half-life PFAS and in middle-aged to older adults. We will try to examine this by looking at associations across different PFAS and in different age groups. There are additional questions about causal consistency that we will also address as we work through the example.
We will first load nhanes_merged_18.xlsx, which I have constructed in advance. Please refer to “nhanes_pfas_project.R” for how the file was created. That script will also be the “behind the scenes” for the file that will be built for the analyses. Some of this code will be shown or highlighted, but not all.
This file merges five files from the 2017-2018 wave of the CDC’s NHANES survey. I chose this wave because it is the most recent with publicly available PFAS data.
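A sketch of loading the pre-built working file, assuming the readxl package is installed and the file sits in the working directory:

library(readxl)
nhanes <- read_xlsx("nhanes_merged_18.xlsx")  # the merged 2017-2018 working dataset
dim(nhanes)                                   # participants x merged variables
names(nhanes)[1:10]                           # peek at the first few variable names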
Thankfully, NHANES comes with good documentation of the data, linked here (dataset names in parentheses, with J indicating the survey wave):
Finding the relevant (first set of) variables, merging the data, and building a working dataset can take a large proportion of your time!
After that, tabulating and describing the data with visualizations takes the next large chunk of time.
If we can just do some summarizing of the available data and create a basic (working) DAG, we’ll be in good shape.
Let’s try making a Table 1 with minimal recoding and see what we need to fix:
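One hedged way to draft such a table, assuming the gtsummary and dplyr packages; the variable names below are hypothetical placeholders for the recoded NHANES columns:

library(dplyr)
library(gtsummary)
nhanes %>%
  select(age, gender, pfas_total, systolic_bp) %>%  # hypothetical recoded columns
  tbl_summary(by = gender, missing = "ifany")       # draft Table 1, stratified by gender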
How would we improve this DAG?
The implied adjustment set: