In this handout, we explore a simple example where collider bias messes up our estimates of the causal ATE of X on Y. I strongly encourage you to read A brief introduction to regression and average treatment effects in R before reading this.
Suppose we wanted to estimate the causal effect of some variable X on some other variable Y, and there’s some variable W which is affected by both X and Y.
Let’s simulate some data to explore this. The true model we’ll work with here is the following system of equations:
\(\begin{align} Y &= \alpha + \beta X + \epsilon \\ W &= \gamma X + \delta Y + \mu, \end{align}\)
where \(\epsilon\) and \(\mu\) are both Normally-distributed random error terms (let’s suppose they have mean zero for convenience, and some standard deviation—these details don’t really matter for what we’re after here).
set.seed(210)
library(tidyverse)
library(cowplot)
n <- 1000 # some population with 1000 observations
epsilon <- rnorm(n,mean=0,sd=1) # just a simple random error which we don't observe
Z <- runif(n,min=-1,max=1)
X <- ifelse(Z>0, 1, 0) # X is something super simple, an indicator variable which is 1 if Z is positive and 0 otherwise.
Y <- 10 + 5*X + epsilon # Y is a simple linear function of X and Z. To make Y negatively correlated with Z, just change the + 250 to - 250. To increase the magnitude of the correlation between Y and Z, increase the coefficient from 250 to something larger in absolute value.
W <- - 20*X + 1.5*Y + rnorm(n,mean=0,sd=2) # W is (directly) negatively correlated with X and positively correlated with Y. You can change these correlations by changing the coefficients and their signs.
dataset <- data.frame(Y=Y, X=X, W=W) # Let's put this together in a dataframe. We don't include epsilon since it's an unobserved random error term.
summary(dataset)
## Y X W
## Min. : 7.387 Min. :0.000 Min. :-6.665
## 1st Qu.:10.044 1st Qu.:0.000 1st Qu.: 2.319
## Median :13.511 Median :1.000 Median : 6.507
## Mean :12.608 Mean :0.528 Mean : 8.398
## 3rd Qu.:15.020 3rd Qu.:1.000 3rd Qu.:14.934
## Max. :18.092 Max. :1.000 Max. :23.359
plot_grid(ggplot(data = dataset) + geom_point(aes(x=X,y=Y)), ggplot(data = dataset) + geom_point(aes(x=Y,y=W)), ggplot(data = dataset) + geom_point(aes(x=X,y=W)) )
Our goal is to estimate the ATE of X on Y. We know that the true ATE is 5, so we can judge our methods according to how well they recover the true effect.
Note that though the direct effect of X on W is negative (-20 to be precise), the indirect effect through Y is positive (1.5*5 = 7.5). The combination of these effects is negative, but different coefficient values could change that and give W a net positive correlation with X.
Let’s calculate the ATE by estimating a regression model, \[ Y = \hat{\alpha} + \hat{\beta} X + \epsilon\]
model1 <- lm(Y ~ X, data = dataset)
summary(model1)
##
## Call:
## lm(formula = Y ~ X, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6712 -0.7049 0.0115 0.6666 3.1219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.96593 0.04642 214.69 <2e-16 ***
## X 5.00383 0.06388 78.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.008 on 998 degrees of freedom
## Multiple R-squared: 0.8601, Adjusted R-squared: 0.86
## F-statistic: 6135 on 1 and 998 DF, p-value: < 2.2e-16
Ok, so \(\hat{\beta}\) is basically correct, and so is \(\hat{\alpha}\). You can replicate these results using the filter method if you’d like.
Now let’s do what the causal graph tells us not to do and control for W, estimating \[ Y = \hat{\alpha} + \hat{\beta} X + \hat{\omega} W + \epsilon \] Note that there is no causal interpretation for \(\hat{\omega}\)! We can frame it as an ATE using the usual math of differences in conditional expectations, but there’s just no causal interpretation. The true value of \(\omega\) is 0, so in the absence of any funny business we ought to fail to reject the null hypothesis (because the null hypothesis, that \(\omega = 0\), is correct).
model2 <- lm(Y ~ X + W, data = dataset)
summary(model2)
##
## Call:
## lm(formula = Y ~ X + W, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.51766 -0.53191 0.00706 0.51572 2.43895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277273 0.152796 41.08 <2e-16 ***
## X 8.088878 0.133871 60.42 <2e-16 ***
## W 0.245276 0.009866 24.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7928 on 997 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.9135
## F-statistic: 5273 on 2 and 997 DF, p-value: < 2.2e-16
Ok, wow! Now our \(\hat{\beta}\) is off, as is our \(\hat{\alpha}\). And the standard errors and p-values are giving us a false sense of certainty as to what these parameter values are. We’re even led to reject the null hypothesis that \(\omega=0\)! Collider bias = no good.
Reading about omitted variable bias, you might be tempted to say “to heck with it, I’ll just control for every variable I can possibly cnotrol for”. Collider bias is a cautionary tale about why that instinct is misguided. If you control for a collider, you’re actually going to put yourself in a worse spot than if you had just left it alone.
So what are you supposed to do? No dataset just outright says, “this is an omitted variable/confounder, that is a collider”. This is a huge part of why it’s so important to learn the context. Economics, like any other science (social or not), isn’t done in a vacuum.1 You need to do the work to read up on the context and understand what’s going on before you start putting datasets into RStudio and running your statistical analysis.
Establishing causal inference in social science settings is much more like building a case than it is proving a theorem. The best social science blends qualitative and quantitative analysis to build a strong case for the author—no one will believe you if you just appeal to some theorem or say “but my p-values are small”.
“What about scientists in space?” Yeah ok real clever you know what I meant…↩