In this handout, we explore a simple example where omitted variable bias (OVB) messes up our estimates of the causal ATE of X on Y. I strongly encourage you to read A brief introduction to regression and average treatment effects in R before reading this.

A simple model

Suppose we wanted to estimate the causal effect of some variable X on some other variable Y, and there’s some variable Z which affects both X and Y.

The in-class example we used was grit (Z), college completion (X), and lifetime earnings (Y): attending college increases lifetime earnings, and individual grit increases both the chances of completing college and lifetime earnings. Here, though, we'll leave things abstract to focus on the statistical issues and explore different cases.

Let’s simulate some data to explore this.

set.seed(210)
library(tidyverse)
library(cowplot)
n <- 1000 # some population with 1000 observations
epsilon <- rnorm(n,mean=0,sd=1) # just a simple random error which we don't observe
Z <- runif(n,min=-1,max=1)
# X is an indicator variable: 1 if Z is positive, 0 otherwise.
# This setup makes corr(X,Z) positive; to make it negative, swap the 1 and 0 in the ifelse().
X <- ifelse(Z > 0, 1, 0)

# Y is a simple linear function of X and Z.
# To make Y negatively correlated with Z, change the + 250 to - 250; to strengthen the
# correlation between Y and Z, increase that coefficient's absolute value.
Y <- 1000 + 500*X + 250*Z + epsilon

dataset <- data.frame(Y=Y, X=X, Z=Z) # Let's put this together in a dataframe. We don't include epsilon since it's an unobserved random error term.
summary(dataset)
##        Y                X               Z           
##  Min.   : 748.1   Min.   :0.000   Min.   :-0.99950  
##  1st Qu.: 888.4   1st Qu.:0.000   1st Qu.:-0.44864  
##  Median :1515.3   Median :1.000   Median : 0.06093  
##  Mean   :1275.3   Mean   :0.528   Mean   : 0.04542  
##  3rd Qu.:1639.7   3rd Qu.:1.000   3rd Qu.: 0.55966  
##  Max.   :1750.1   Max.   :1.000   Max.   : 0.99856
plot_grid(ggplot(data = dataset) + geom_point(aes(x=X,y=Y)), ggplot(data = dataset) + geom_point(aes(x=Z,y=Y)))

Our goal is to estimate the ATE of X on Y. We know that the true ATE is 500, so we can judge our methods according to how well they recover the true effect.

OVB sign theorem

The OVB sign theorem states that, when a variable Z that affects both X and Y is omitted from the regression of Y on X, the sign (+/-) of the omitted variable bias \(E[\hat{\beta} - \beta]\) is equal to the product of the signs of the correlations between X and Z and between Y and Z. Formally, letting \(sign(a) = +1\) if \(a > 0\) and \(sign(a) = -1\) if \(a < 0\),

\[ sign(E[\hat{\beta} - \beta]) = sign(corr(X,Z)) \cdot sign(corr(Y,Z)).\]

(The notational convention in econometrics when discussing parameter estimates and bias is to put a hat (^) on a symbol to indicate that it's an estimate of the associated parameter. So \(\beta\) is the parameter we want to estimate, and \(\hat{\beta}\) (pronounced “beta-hat”) is our estimate of \(\beta\).)
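
Before estimating anything, we can check the two correlations the theorem refers to. A quick sketch using the simulated dataset above:

cor(dataset$X, dataset$Z) # positive by construction, since X = 1 exactly when Z > 0
cor(dataset$Y, dataset$Z) # positive, since Z enters Y with a coefficient of +250

Both signs are positive, so the theorem predicts \(E[\hat{\beta} - \beta] > 0\): omitting Z should lead us to overestimate the ATE. The next section confirms this.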

Estimating the ATE without controlling for Z

Let’s calculate the ATE by estimating a regression model, \[ Y = \hat{\alpha} + \hat{\beta} X + \epsilon\]

model1 <- lm(Y ~ X, data = dataset)
summary(model1)
## 
## Call:
## lm(formula = Y ~ X, data = dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -131.150  -64.634    2.349   64.129  121.804 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  878.519      3.377   260.2   <2e-16 ***
## X            751.524      4.647   161.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.37 on 998 degrees of freedom
## Multiple R-squared:  0.9632, Adjusted R-squared:  0.9632 
## F-statistic: 2.615e+04 on 1 and 998 DF,  p-value: < 2.2e-16

Ok, so the estimate is in the right direction, but it's a substantial overestimate (about 750 instead of 500), just as the OVB sign theorem would predict. Worse, notice how small the standard error and p-value are: if we take this model at face value, we'll end up overly certain that \(\beta\) is around 750 when it's actually 500.
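
Where does the roughly 750 come from? Because X = 1 exactly when Z > 0, the two groups also differ in Z by about 1 on average (roughly +0.5 versus -0.5), and the naive comparison picks up that gap times Z's coefficient of 250 on top of the true effect of 500. A quick check, using the known true coefficients from the simulation (a sketch, not something you could do with real data):

Z_gap <- mean(dataset$Z[dataset$X == 1]) - mean(dataset$Z[dataset$X == 0]) # roughly 1
500 + 250*Z_gap # roughly 751.5, close to the biased estimate above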

Just to drive the point home and show that this isn’t a problem with using a regression, let’s replicate these results with the filter method, \[ ATE = E[Y|X=1] - E[Y|X=0] .\]

E_Y_X_1 <- dataset %>% filter(X==1) %>% summarise(conditional_mean = mean(Y))
E_Y_X_0 <- dataset %>% filter(X==0) %>% summarise(conditional_mean = mean(Y))
ATE_filter <- E_Y_X_1 - E_Y_X_0
names(ATE_filter) <- c("ATE_filter")
ATE_filter
##   ATE_filter
## 1    751.524

And the t-test:

t.test(dataset$Y[X==1], dataset$Y[X==0])
## 
##  Welch Two Sample t-test
## 
## data:  dataset$Y[X == 1] and dataset$Y[X == 0]
## t = 161.75, df = 986.41, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  742.4062 760.6418
## sample estimates:
## mean of x mean of y 
## 1630.0426  878.5186

Notice that the confidence interval conveys the same misplaced certainty we got from the standard error and p-value.
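
The regression tells an identical story: asking confint() for the interval around the X coefficient in model1 gives a tight range around 750 that comes nowhere near the true value of 500 (a quick sketch):

confint(model1, "X", level = 0.95) # roughly (742, 761): precise, but precisely wrong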

Controlling for Z

Now let’s do what the causal graph tells us to do and control for Z, estimating \[ Y = \hat{\alpha} + \hat{\beta} X + \hat{\gamma} Z + \epsilon \]

model2 <- lm(Y ~ X + Z, data = dataset)
summary(model2)
## 
## Call:
## lm(formula = Y ~ X + Z, data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7854 -0.7188 -0.0143  0.6636  3.1338 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.000e+03  7.022e-02   14242   <2e-16 ***
## X           4.997e+02  1.265e-01    3949   <2e-16 ***
## Z           2.503e+02  1.087e-01    2303   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.006 on 997 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.216e+07 on 2 and 997 DF,  p-value: < 2.2e-16

Ok, wow! Now our \(\hat{\beta}\) is basically 500, the true value. As a bonus, since nothing that affects Y has been omitted besides the random error (which is uncorrelated with both X and Z), \(\hat{\gamma}\) also comes out at basically its true value of 250. (In case your output uses scientific notation: 4.997e+02 means 4.997*10^2, which is about 499.7.)
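
With both models in hand, we can also check the standard OVB decomposition (a sketch; this identity isn't derived in this handout): the slope from the short regression equals the slope from the long regression plus \(\hat{\gamma}\) times the slope from regressing Z on X.

delta_hat <- coef(lm(Z ~ X, data = dataset))["X"] # auxiliary regression of Z on X
coef(model2)["X"] + coef(model2)["Z"]*delta_hat   # reproduces the biased estimate...
coef(model1)["X"]                                 # ...from the short regression, ~751.5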

Exercises at home

1. Try changing the correlations between X and Z, Y and Z, or both. You should see the following results, predicted by the OVB sign theorem:
  • when \(sign(corr(X,Z)) = sign(corr(Y,Z))\), \(E[\hat{\beta} - \beta] > 0\) (OVB causes an overestimate)
  • when \(sign(corr(X,Z)) \neq sign(corr(Y,Z))\), \(E[\hat{\beta} - \beta] < 0\) (OVB causes an underestimate)
2. Try increasing the sample size. It won't help: OVB persists until you control for Z (see the sketch after this list).
3. Try changing the magnitude of the correlations by increasing the coefficient on \(Z\) in the true regression model (see the comment). You’ll notice that the stronger the correlations, the bigger the omitted variable bias.
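
For exercise 2, here's a minimal sketch (the n_big, Z_big, X_big, Y_big objects are just illustrative names for a fresh, larger sample drawn the same way as above): even with 100 times as many observations, the short regression still lands near 750 rather than 500.

n_big <- 100000
Z_big <- runif(n_big, min = -1, max = 1)
X_big <- ifelse(Z_big > 0, 1, 0)
Y_big <- 1000 + 500*X_big + 250*Z_big + rnorm(n_big)
coef(lm(Y_big ~ X_big))["X_big"]         # still about 750: more data, same bias
coef(lm(Y_big ~ X_big + Z_big))["X_big"] # about 500 once Z is controlled for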

Conclusion

Omitted variable bias is a real problem in trying to infer causality. In this handout we saw how it can really mess our estimates up, just as the OVB sign theorem would predict. Try playing with the correlations and sample sizes to build up your intuition for how this comes into play—omitted variable bias is one of the most important details to understand when doing causal inference (whether with experimental or non-experimental data).1


  1. None of this is to say that Central Limit Theorems don’t hold in the presence of OVB. They still hold. Our estimated \(\hat{\beta}\) is still converging to something with a Normal distribution. It’s just not converging to the thing we’re trying to estimate, which is \(\beta\).
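
A minimal sketch of that point (the replication count of 2000 is arbitrary): across repeated samples, the short-regression \(\hat{\beta}\) has a nice bell-shaped sampling distribution; it's just centered near 750 instead of 500.

beta_hats <- replicate(2000, {
  Z_s <- runif(1000, min = -1, max = 1)
  X_s <- ifelse(Z_s > 0, 1, 0)
  Y_s <- 1000 + 500*X_s + 250*Z_s + rnorm(1000)
  coef(lm(Y_s ~ X_s))["X_s"]
})
hist(beta_hats) # approximately Normal...
mean(beta_hats) # ...but centered around 750, not the true beta of 500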