A simple example of collider bias

In this handout, we explore a simple example where collider bias messes up our estimates of the causal ATE of X on Y. I strongly encourage you to read A brief introduction to regression and average treatment effects in R before reading this.

A simple model

Suppose we wanted to estimate the causal effect of some variable X on some other variable Y, and there’s some variable W which is affected by both X and Y.

Let’s simulate some data to explore this. The true model we’ll work with here is the following system of equations:

\(\begin{align} Y &= \alpha + \beta X + \epsilon \\ W &= \gamma X + \delta Y + \mu, \end{align}\)

where \(\epsilon\) and \(\mu\) are both Normally-distributed random error terms (let’s suppose they have mean zero for convenience, and some standard deviation—these details don’t really matter for what we’re after here).

set.seed(210)
library(tidyverse)
library(cowplot)
n <- 1000 # some population with 1000 observations
epsilon <- rnorm(n,mean=0,sd=1) # just a simple random error which we don't observe
Z <- runif(n,min=-1,max=1)
X <- ifelse(Z>0, 1, 0) # X is something super simple, an indicator variable which is 1 if Z is positive and 0 otherwise.

Y <- 10 + 5*X + epsilon # Y is a simple linear function of X and Z. To make Y negatively correlated with Z, just change the + 250 to - 250. To increase the magnitude of the correlation between Y and Z, increase the coefficient from 250 to something larger in absolute value.

W <- - 20*X + 1.5*Y + rnorm(n,mean=0,sd=2) # W is (directly) negatively correlated with X and positively correlated with Y. You can change these correlations by changing the coefficients and their signs.

dataset <- data.frame(Y=Y, X=X, W=W) # Let's put this together in a dataframe. We don't include epsilon since it's an unobserved random error term.

summary(dataset)

##        Y                X               W         
##  Min.   : 7.387   Min.   :0.000   Min.   :-6.665  
##  1st Qu.:10.044   1st Qu.:0.000   1st Qu.: 2.319  
##  Median :13.511   Median :1.000   Median : 6.507  
##  Mean   :12.608   Mean   :0.528   Mean   : 8.398  
##  3rd Qu.:15.020   3rd Qu.:1.000   3rd Qu.:14.934  
##  Max.   :18.092   Max.   :1.000   Max.   :23.359

plot_grid(ggplot(data = dataset) + geom_point(aes(x=X,y=Y)), ggplot(data = dataset) + geom_point(aes(x=Y,y=W)), ggplot(data = dataset) + geom_point(aes(x=X,y=W)) )

Our goal is to estimate the ATE of X on Y. We know that the true ATE is 5, so we can judge our methods according to how well they recover the true effect.

Note that though the direct effect of X on W is negative (-20 to be precise), the indirect effect through Y is positive (1.5*5 = 7.5). The combination of these effects is negative, but different coefficient values could change that and give W a net positive correlation with X.

Estimating the ATE without controlling for W

Let’s calculate the ATE by estimating a regression model, \[ Y = \hat{\alpha} + \hat{\beta} X + \epsilon\]

model1 <- lm(Y ~ X, data = dataset)
summary(model1)

## 
## Call:
## lm(formula = Y ~ X, data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6712 -0.7049  0.0115  0.6666  3.1219 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.96593    0.04642  214.69   <2e-16 ***
## X            5.00383    0.06388   78.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.008 on 998 degrees of freedom
## Multiple R-squared:  0.8601, Adjusted R-squared:   0.86 
## F-statistic:  6135 on 1 and 998 DF,  p-value: < 2.2e-16

Ok, so \(\hat{\beta}\) is basically correct, and so is \(\hat{\alpha}\). You can replicate these results using the filter method if you’d like.

Controlling for W and inducing collider bias

Now let’s do what the causal graph tells us not to do and control for W, estimating \[ Y = \hat{\alpha} + \hat{\beta} X + \hat{\omega} W + \epsilon \] Note that there is no causal interpretation for \(\hat{\omega}\)! We can frame it as an ATE using the usual math of differences in conditional expectations, but there’s just no causal interpretation. The true value of \(\omega\) is 0, so in the absence of any funny business we ought to fail to reject the null hypothesis (because the null hypothesis, that \(\omega = 0\), is correct).

model2 <- lm(Y ~ X + W, data = dataset)
summary(model2)

## 
## Call:
## lm(formula = Y ~ X + W, data = dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51766 -0.53191  0.00706  0.51572  2.43895 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.277273   0.152796   41.08   <2e-16 ***
## X           8.088878   0.133871   60.42   <2e-16 ***
## W           0.245276   0.009866   24.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7928 on 997 degrees of freedom
## Multiple R-squared:  0.9136, Adjusted R-squared:  0.9135 
## F-statistic:  5273 on 2 and 997 DF,  p-value: < 2.2e-16

Ok, wow! Now our \(\hat{\beta}\) is off, as is our \(\hat{\alpha}\). And the standard errors and p-values are giving us a false sense of certainty as to what these parameter values are. We’re even led to reject the null hypothesis that \(\omega=0\)! Collider bias = no good.

Exercises at home

1. Try changing the correlations between X and W, Y and W, or both. What do you see? Does it match the OVB sign theorem?

2. Does increasing the sample size mitigate collider bias?

Conclusion

Reading about omitted variable bias, you might be tempted to say “to heck with it, I’ll just control for every variable I can possibly cnotrol for”. Collider bias is a cautionary tale about why that instinct is misguided. If you control for a collider, you’re actually going to put yourself in a worse spot than if you had just left it alone.

So what are you supposed to do? No dataset just outright says, “this is an omitted variable/confounder, that is a collider”. This is a huge part of why it’s so important to learn the context. Economics, like any other science (social or not), isn’t done in a vacuum.¹ You need to do the work to read up on the context and understand what’s going on before you start putting datasets into RStudio and running your statistical analysis.

Establishing causal inference in social science settings is much more like building a case than it is proving a theorem. The best social science blends qualitative and quantitative analysis to build a strong case for the author—no one will believe you if you just appeal to some theorem or say “but my p-values are small”.

“What about scientists in space?” Yeah ok real clever you know what I meant…↩