In this handout, we show two ways to estimate the average treatment effect (ATE) of X on Y: first, by using filter and summarise to calculate a difference in conditional means; second, by using lm() to estimate a linear regression model.
Suppose we wanted to estimate the causal effect of some variable X on some other variable Y. Y is also affected by another variable, which we’ll call A. (You can fill in your favorite examples.) To make this concrete, let’s simulate the situation using a linear model of Y, assuming we have the whole population (sampling issues are not the point here).
set.seed(210)
library(tidyverse)
library(cowplot)
n <- 1000 # some population with 1000 observations
epsilon <- rnorm(n,mean=0,sd=1) # just a simple random error which we don't observe
X_underlying <- runif(n,min=-1,max=1)
X <- ifelse(X_underlying>0, 1, 0) # X is something super simple, an indicator variable which is 1 if X_underlying is positive and 0 otherwise
A <- runif(n,min=5,max=6) # A is also something super simple, a uniform random variable living in [5,6]
Y <- 1 + 5*X - A + epsilon # Y is a simple linear function of X and A
dataset <- data.frame(Y=Y, X=X, A=A) # Let's put this together in a dataframe. We don't include epsilon since it's an unobserved random error term.
summary(dataset)
## Y X A
## Min. :-7.5971 Min. :0.000 Min. :5.000
## 1st Qu.:-4.4377 1st Qu.:0.000 1st Qu.:5.246
## Median :-1.1179 Median :1.000 Median :5.511
## Mean :-1.8948 Mean :0.528 Mean :5.503
## 3rd Qu.: 0.4886 3rd Qu.:1.000 3rd Qu.:5.757
## Max. : 4.0478 Max. :1.000 Max. :5.998
plot_grid(ggplot(data = dataset) + geom_point(aes(x=X,y=Y)), ggplot(data = dataset) + geom_point(aes(x=A,y=Y)))
Our goal is to estimate the ATE of X on Y. We know that the true ATE is 5, so we can judge our methods according to how well they recover the true effect.
filter
The first approach has us take advantage of the fact that X is a binary variable and calculate \[ ATE = E[Y|X=1] - E[Y|X=0] .\]
E_Y_X_1 <- dataset %>% filter(X==1) %>% summarise(conditional_mean = mean(Y))
E_Y_X_0 <- dataset %>% filter(X==0) %>% summarise(conditional_mean = mean(Y))
ATE_filter <- E_Y_X_1 - E_Y_X_0
names(ATE_filter) <- c("ATE_filter")
ATE_filter
## ATE_filter
## 1 4.992067
Ok, so this gets us pretty darn close! Just to round things out, let’s also do a t-test on the difference in Y given X=1 and X=0:
t.test(dataset$Y[X==1], dataset$Y[X==0])
##
## Welch Two Sample t-test
##
## data: dataset$Y[X == 1] and dataset$Y[X == 0]
## t = 73.592, df = 979.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.858950 5.125185
## sample estimates:
## mean of x mean of y
## 0.4614252 -4.5306422
The filter method is intuitive, and we can see clearly how to put it into a t-test: just compare the two groups and we’re set! But how do we extend this method to control for A? This is also challenging because A is a continuous variable, not a binary one.
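To see the difficulty, here is a rough sketch (not part of the analysis above) of one way you might try to extend the filter idea: coarsen A into bins, compute the X=1 vs. X=0 difference within each bin, and then average those within-bin differences. The 0.25 bin width is an arbitrary choice, and the whole thing gets clunky fast.
# Bin A, take the difference in conditional means of Y within each bin,
# then average the within-bin differences. A sketch only: the bin width
# (and how to weight the bins) is up to you.
dataset %>%
  mutate(A_bin = cut(A, breaks = seq(5, 6, by = 0.25))) %>%
  group_by(A_bin, X) %>%
  summarise(mean_Y = mean(Y), .groups = "drop") %>%
  pivot_wider(names_from = X, values_from = mean_Y, names_prefix = "X_") %>%
  mutate(diff = X_1 - X_0) %>%
  summarise(ATE_binned = mean(diff, na.rm = TRUE))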
lm()
An alternative approach is to “estimate a linear regression”. This is a fancy way of saying, “let’s have the computer calculate a best-fit line which goes through the data”. That is, let’s estimate \(\alpha\) and \(\beta\) in
\[ Y = \alpha + \beta X + \epsilon ,\]
where \(\epsilon \sim N(0,\sigma)\) is a random error term and \(\beta\) is our ATE.
There are two nice things about this approach, if it works: we can handle continuous X variables, and we can control for multiple things by sticking in additional terms on the right-hand side. But does it work? Let’s find out below. The R function to estimate a model like this is lm(), for linear model.
The syntax for lm() is to give it a formula in the first argument slot, and then data in the second slot. The data argument is familiar from ggplot(), but the formula syntax is a bit different. To estimate \(Y = \alpha + \beta X + \epsilon\), we’ll write Y ~ X: the intercept and random error terms (\(\alpha\) and \(\epsilon\)) are assumed present by default.1 This is because most linear models in statistics have an intercept and a random error term, so it’s a safe default to assume they’re there; the choice of variables to put on the right-hand side, on the other hand, is problem specific and left to the statistician (that’s you, dear reader).
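If you want to convince yourself of that default, here’s a quick check (a small sketch, not from the original code): Y ~ X and Y ~ 1 + X describe the same model, since the 1 (the intercept) is included automatically; writing Y ~ 0 + X would drop it.
# The intercept is included by default, so these two fits have the same coefficients.
all.equal(coef(lm(Y ~ X, data = dataset)),
          coef(lm(Y ~ 1 + X, data = dataset)))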
model <- lm(Y ~ X, data = dataset)
summary(model) # This gives us the nice output we want. The verb for this is "print the model summary".
##
## Call:
## lm(formula = Y ~ X, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7536 -0.7208 -0.0165 0.6872 3.5863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.53064 0.04922 -92.04 <2e-16 ***
## X 4.99207 0.06774 73.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.069 on 998 degrees of freedom
## Multiple R-squared: 0.8448, Adjusted R-squared: 0.8446
## F-statistic: 5431 on 1 and 998 DF, p-value: < 2.2e-16
That’s a lot of stuff, so let’s focus our attention on just a couple things.
First, the coefficient \(\beta\): this is the entry in the row labeled X, under the column labeled Estimate. It’s basically identical to what we got with the filter method!
Second, the t-statistic and p-value on \(\beta\), which are in the row labeled X, under the columns t value and Pr(>|t|). These are also very close to the t-test results from t.test()!
We also get a bunch of other useful output from this, including the standard error on the estimate. Notice, however, that the intercept term (the row labeled (Intercept)) isn’t the correct value (the true intercept is 1, see above): since A is left out of the model, the intercept absorbs the average of the omitted -A term, which is about -5.5. That’s ok if we’re interested in the effect of X on Y, but just worth noting.
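If you’d rather pull these numbers out of the fitted model object than read them off the printed summary, here’s a short sketch using standard accessor functions:
coef(model)["X"]                    # point estimate of beta
summary(model)$coefficients["X", ]  # estimate, std. error, t value, p-value
confint(model, "X")                 # 95% confidence interval for beta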
But does this scale? Let’s try estimating the effects of both X and A using this method. That is, now we estimate
\[ Y = \alpha + \beta X + \gamma A + \epsilon \]
model2 <- lm(Y ~ X + A, data = dataset)
summary(model2)
##
## Call:
## lm(formula = Y ~ X + A, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6506 -0.7343 0.0091 0.6783 3.1523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.33667 0.60430 3.867 0.000117 ***
## X 5.00676 0.06376 78.522 < 2e-16 ***
## A -1.24938 0.10962 -11.398 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.006 on 997 degrees of freedom
## Multiple R-squared: 0.8627, Adjusted R-squared: 0.8624
## F-statistic: 3131 on 2 and 997 DF, p-value: < 2.2e-16
Heyo! We still get a pretty good estimate of \(\beta\)—the true value is 5, and we’re basically there. Our estimate of \(\gamma\) is also really close to what it ought to be, which is -1. The intercept is also closer to the true value of 1. Try increasing the number of observations to 10,000 and see how things change.
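If you want to try that without retyping the whole setup, here’s one way to wrap the simulation in a function; simulate_and_estimate() is a made-up name for this sketch, not something used above.
# Re-run the simulation and the regression for an arbitrary population size n.
simulate_and_estimate <- function(n) {
  epsilon <- rnorm(n, mean = 0, sd = 1)
  X <- ifelse(runif(n, min = -1, max = 1) > 0, 1, 0)
  A <- runif(n, min = 5, max = 6)
  Y <- 1 + 5 * X - A + epsilon
  coef(lm(Y ~ X + A, data = data.frame(Y = Y, X = X, A = A)))
}
simulate_and_estimate(10000)  # estimates should land close to (1, 5, -1)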
When estimating the ATE of X on Y, the simplest approach is to take a binary X variable, use filter and summarise to calculate conditional means, and then take the difference in those means. But what do we do about continuous variables? And how do we control for other variables? That’s where regression comes in.
Regression is a really powerful tool which lets us handle multiple variables, whether discrete or continuous. The estimated ATEs are identical to what an equivalent filter approach would yield, as are the t-test results. It also gives you standard errors, measures of fit (like \(R^2\)), and more.2 These features (and many more) make regression modeling a great way to estimate ATEs.
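As a quick sanity check on the “identical” claim, you can line the two estimates up side by side (a small sketch, not part of the original handout):
# The regression coefficient on X matches the filter/summarise difference
# in means up to floating-point precision.
c(filter = ATE_filter$ATE_filter, regression = unname(coef(model)["X"]))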
Note that the \(\epsilon\) in the regression model isn’t necessarily the same as the epsilon in the simulation, even though they share the same name. It’s confusing, I know, but it’s a bit easier than getting into regression residuals vs true error distributions—save that for Econ 211.↩
While we estimated a linear model (which happened to also be the form of the true model), linear regression can uncover causal ATEs even when the underlying model is not linear. We’ll discuss regression a bit in this class, but a more detailed treatment will have to wait until Econ 211.↩