In this handout, we show two ways to estimate the average treatment effect (ATE) of X on Y: first, by using filter and summarise to calculate a difference in conditional means; second, by using lm() to estimate a linear regression model.
Suppose we wanted to estimate the causal effect of some variable X on some other variable Y. Y is also affected by another variable, which we’ll call A. (You can fill in your favorite examples.) To make this concrete, let’s simulate the situation using a linear model of Y, assuming we have the whole population (sampling issues are not the point here).
set.seed(210)
library(tidyverse)
library(cowplot)
n <- 1000 # some population with 1000 observations
epsilon <- rnorm(n,mean=0,sd=1) # just a simple random error which we don't observe
X_underlying <- runif(n,min=-1,max=1)
X <- ifelse(X_underlying>0, 1, 0) # X is something super simple, an indicator variable which is 1 if X_underlying is positive and 0 otherwise
A <- runif(n,min=5,max=6) # A is also something super simple, a uniform random variable living in [5,6]
Y <- 1 + 5*X - A + epsilon # Y is a simple linear function of X and A
dataset <- data.frame(Y=Y, X=X, A=A) # Let's put this together in a dataframe. We don't include epsilon since it's an unobserved random error term.
summary(dataset)
## Y X A
## Min. :-7.5971 Min. :0.000 Min. :5.000
## 1st Qu.:-4.4377 1st Qu.:0.000 1st Qu.:5.246
## Median :-1.1179 Median :1.000 Median :5.511
## Mean :-1.8948 Mean :0.528 Mean :5.503
## 3rd Qu.: 0.4886 3rd Qu.:1.000 3rd Qu.:5.757
## Max. : 4.0478 Max. :1.000 Max. :5.998
plot_grid(ggplot(data = dataset) + geom_point(aes(x=X,y=Y)), ggplot(data = dataset) + geom_point(aes(x=A,y=Y)))
Our goal is to estimate the ATE of X on Y. We know that the true ATE is 5, so we can judge our methods according to how well they recover the true effect.
filter
The first approach has us take advantage of the fact that X is a binary variable and calculate \[ ATE = E[Y|X=1] - E[Y|X=0] .\]
E_Y_X_1 <- dataset %>% filter(X==1) %>% summarise(conditional_mean = mean(Y))
E_Y_X_0 <- dataset %>% filter(X==0) %>% summarise(conditional_mean = mean(Y))
ATE_filter <- E_Y_X_1 - E_Y_X_0
names(ATE_filter) <- c("ATE_filter")
ATE_filter
## ATE_filter
## 1 4.992067
Ok, so this gets us pretty darn close! Just to round things out, let’s also do a t-test on the difference in Y given X=1 and X=0:
t.test(dataset$Y[X==1], dataset$Y[X==0])
##
## Welch Two Sample t-test
##
## data: dataset$Y[X == 1] and dataset$Y[X == 0]
## t = 73.592, df = 979.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.858950 5.125185
## sample estimates:
## mean of x mean of y
## 0.4614252 -4.5306422
The filter method is intuitive, and we can see clearly how to put it into a t-test: just compare the two groups and we’re set! But how do we extend this method to control for A? This is also challenging because A is a continuous variable, not a binary one.
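To see the difficulty, here is a rough sketch (not part of the analysis above) of one way you might try to extend the filter idea: coarsen A into bins, compute the X=1 vs. X=0 difference within each bin, and then average those within-bin differences. The 0.25 bin width is an arbitrary choice, and the whole thing gets clunky fast.
# Bin A, take the difference in conditional means of Y within each bin,
# then average the within-bin differences. A sketch only: the bin width
# (and how to weight the bins) is up to you.
dataset %>%
  mutate(A_bin = cut(A, breaks = seq(5, 6, by = 0.25))) %>%
  group_by(A_bin, X) %>%
  summarise(mean_Y = mean(Y), .groups = "drop") %>%
  pivot_wider(names_from = X, values_from = mean_Y, names_prefix = "X_") %>%
  mutate(diff = X_1 - X_0) %>%
  summarise(ATE_binned = mean(diff, na.rm = TRUE))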
lm()
An alternative approach is to “estimate a linear regression”. This is a fancy way of saying, “let’s have the computer calculate a best-fit line which goes through the data”. That is, let’s estimate \(\alpha\) and \(\beta\) in
\[ Y = \alpha + \beta X + \epsilon ,\]
where \(\epsilon \sim N(0,\sigma)\) is a random error term and \(\beta\) is our ATE.
There are two nice things about this approach, if it works: we can handle continuous X variables, and we can control for multiple things by sticking in additional terms on the right-hand side. But does it work? Let’s find out below. The R function to estimate a model like this is lm(), for linear model.
The syntax for lm() is to give it a formula in the first argument slot, and then data in the second slot. The data argument is familiar from ggplot(), but the formula syntax is a bit different. To estimate \(Y = \alpha + \beta X + \epsilon\), we’ll write Y ~ X: the intercept and random error terms (\(\alpha\) and \(\epsilon\)) are assumed present by default.1 This is because most linear models in statistics have an intercept and a random error term, so it’s a safe default to assume they’re there; the choice of variables to put on the right-hand side, on the other hand, is problem specific and left to the statistician (that’s you, dear reader).
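If you want to convince yourself of that default, here’s a quick check (a small sketch, not from the original code): Y ~ X and Y ~ 1 + X describe the same model, since the 1 (the intercept) is included automatically; writing Y ~ 0 + X would drop it.
# The intercept is included by default, so these two fits have the same coefficients.
all.equal(coef(lm(Y ~ X, data = dataset)),
          coef(lm(Y ~ 1 + X, data = dataset)))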
model <- lm(Y ~ X, data = dataset)
summary(model) # This gives us the nice output we want. The verb for this is "print the model summary".
##
## Call:
## lm(formula = Y ~ X, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7536 -0.7208 -0.0165 0.6872 3.5863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.53064 0.04922 -92.04 <2e-16 ***
## X 4.99207 0.06774 73.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.069 on 998 degrees of freedom
## Multiple R-squared: 0.8448, Adjusted R-squared: 0.8446
## F-statistic: 5431 on 1 and 998 DF, p-value: < 2.2e-16
That’s a lot of stuff, so let’s focus our attention on just a couple things.
First, the coefficient \(\beta\): this is the entry in the row labeled X, under the column labeled Estimate. It’s basically identical to what we got with the filter method!
Second, the t-statistic and p-value on \(\beta\), which are in the row labeled X, under the columns t value and Pr(>|t|). These are also very close to the t-test results from t.test()!
We also get a bunch of other useful output from this, including the standard error on the estimate. Notice, however, that the intercept term (the row labeled (Intercept)) isn’t the correct value (the true intercept is 1, see above): since A is left out of the model, the intercept absorbs the average of the omitted -A term, which is about -5.5. That’s ok if we’re interested in the effect of X on Y, but just worth noting.
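If you’d rather pull these numbers out of the fitted model object than read them off the printed summary, here’s a short sketch using standard accessor functions:
coef(model)["X"]                    # point estimate of beta
summary(model)$coefficients["X", ]  # estimate, std. error, t value, p-value
confint(model, "X")                 # 95% confidence interval for beta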
But does this scale? Let’s try estimating the effects of both X and A using this method. That is, now we estimate
\[ Y = \alpha + \beta X + \gamma A + \epsilon \]
model2 <- lm(Y ~ X + A, data = dataset)
summary(model2)
##
## Call:
## lm(formula = Y ~ X + A, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6506 -0.7343 0.0091 0.6783 3.1523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.33667 0.60430 3.867 0.000117 ***
## X 5.00676 0.06376 78.522 < 2e-16 ***
## A -1.24938 0.10962 -11.398 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.006 on 997 degrees of freedom
## Multiple R-squared: 0.8627, Adjusted R-squared: 0.8624
## F-statistic: 3131 on 2 and 997 DF, p-value: < 2.2e-16
Heyo! We still get a pretty good estimate of \(\beta\)—the true value is 5, and we’re basically there. Our estimate of \(\gamma\) is also really close to what it ought to be, which is -1. The intercept is also closer to the true value of 1. Try increasing the number of observations to 10,000 and see how things change.
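If you want to try that without retyping the whole setup, here’s one way to wrap the simulation in a function; simulate_and_estimate() is a made-up name for this sketch, not something used above.
# Re-run the simulation and the regression for an arbitrary population size n.
simulate_and_estimate <- function(n) {
  epsilon <- rnorm(n, mean = 0, sd = 1)
  X <- ifelse(runif(n, min = -1, max = 1) > 0, 1, 0)
  A <- runif(n, min = 5, max = 6)
  Y <- 1 + 5 * X - A + epsilon
  coef(lm(Y ~ X + A, data = data.frame(Y = Y, X = X, A = A)))
}
simulate_and_estimate(10000)  # estimates should land close to (1, 5, -1)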
When estimating the ATE of X on Y, the simplest approach is to take a binary X variable, use filter and summarise to calculate conditional means, and then take the difference in those means. But what do we do about continuous variables? And how do we control for other variables? That’s where regression comes in.
Regression is a really powerful tool which lets us handle multiple variables, whether discrete or continuous. The estimated ATEs are identical to what an equivalent filter approach would yield, as are the t-test results. It also gives you standard errors, measures of fit (like \(R^2\)), and more.2 These features (and many more) make regression modeling a great way to estimate ATEs.
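As a quick sanity check on the “identical” claim, you can line the two estimates up side by side (a small sketch, not part of the original handout):
# The regression coefficient on X matches the filter/summarise difference
# in means up to floating-point precision.
c(filter = ATE_filter$ATE_filter, regression = unname(coef(model)["X"]))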
Note that the \(\epsilon\) in the regression model isn’t necessarily the same as the epsilon in the simulation, even though they share the same name. It’s confusing, I know, but it’s a bit easier than getting into regression residuals vs true error distributions—save that for Econ 211.↩
While we estimated a linear model (which happened to also be the form of the true model), linear regression can uncover causal ATEs even when the underlying model is not linear. We’ll discuss regression a bit in this class, but a more detailed treatment will have to wait until Econ 211.↩