Motivating IV

Intuition

Suppose you estimate a regression of income on years of schooling:

\[income_i = \beta_0 + \beta_1 schooling_i + u_i\]

One assumption we make for OLS to be BLUE is exogeneity, or in other words: \(E[u|schooling] = 0\).

Does this seem reasonable in this case? Probably not. It is almost certain that schooling is correlated with the error \(u_i\), because \(u_i\) contains some measure of 'natural ability', which we cannot observe. Further, someone might choose to get more schooling because their parents were wealthy, and wealthy parents tend to boost income on their own. Or, growing up wealthy puts more pressure on kids to go to college. This creates selection bias: the people who go to school longer are different from those who don't in unobservable ways.

In other words, if an individual has higher ability, or if childhood family income impacts their choice of schooling, it is likely that both their salary and their schooling will be higher. One can see that if \(E[u|schooling] \neq 0\), our estimate of \(\beta_1\) will be biased and inconsistent. Recall, we can write:

\[\hat{\beta_1} = \beta_1 + \frac{cov(x,u)}{var(x)}\]

So if we think \(cov(x,u)\) is NOT equal to zero, then our estimates of \(\beta_1\) will be biased.
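To make this concrete, here is a minimal simulated example (not from the original notes) in which schooling and the error share an unobserved 'ability' component. Substituting \(y = \beta_1 x + u\) into \(\hat{\beta_1} = cov(x,y)/var(x)\) gives the formula above, and the simulation confirms it numerically:

set.seed(101)
n <- 10000
ability <- rnorm(n)                        # unobserved
schooling <- 12 + 2 * ability + rnorm(n)   # schooling loads on ability...
u <- 3 * ability + rnorm(n)                # ...and so does the error
income <- 1 + 0.5 * schooling + u          # true beta_1 = 0.5
coef(lm(income ~ schooling))["schooling"]  # lands far from 0.5
0.5 + cov(schooling, u) / var(schooling)   # matches the OLS estimate above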

IV is a tool you can use when you want to estimate the causal effect of some variable on an outcome, but it is hard to tell whether the coefficient you've estimated reflects that effect or is simply due to a correlation with something else that impacts your outcome. IV attempts to separate the exogenous part of \(x\) from the endogenous part, and uses only the exogenous part to give us consistent estimates.

Okay, so maybe the covariance of \(x\) and \(u\) isn't zero: perhaps we have an omitted variable, or a 'two-way street' between \(x\) and \(y\) referred to as 'simultaneity bias.' But maybe we have some other variable we can use that explains our causal variable (\(x\)) well, but does not directly explain our outcome (\(y\)).

To get causal impacts of education on earnings, we would need some variable that impacts people's schooling choices but not their income (at least, not directly). A good example: mandatory minimum schooling, some minimum amount of schooling required by the government. Now, assuming this requirement varies (say, across states or over time), we can see how differing levels of mandatory minimum schooling impact students' education levels, and use that variation to estimate changes in income.

Formal Definition

An instrumental variable, \(z\), for the regression \(y = \beta_1 x + u\) is one that satisfies the following two conditions:

- Exogeneity: it is uncorrelated with the error, \(u\)
- Relevance: it is correlated with \(x\)

In other words, the instrument only impacts the y variable through the x variable.
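To see why these two conditions identify \(\beta_1\), take the covariance of each side of \(y = \beta_1 x + u\) with \(z\):

\[cov(z,y) = \beta_1 cov(z,x) + cov(z,u)\]

Exogeneity sets \(cov(z,u) = 0\), and relevance guarantees \(cov(z,x) \neq 0\), so we can divide through:

\[\beta_1 = \frac{cov(z,y)}{cov(z,x)}\]

Replacing the covariances with their sample analogues yields the simple IV estimator.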

An Example

A common instrument for schooling in the income regression (made popular by Card (1995)) is an individual's proximity to a college. Let's think about whether this satisfies our assumptions.

Is this instrument exogenous? Probably. The main worry is that people who live near colleges tend to live in cities, where wages are higher than in rural areas, so you might accidentally capture some effect of urban location (or of having wealthy parents who chose to live near a college). An easy fix, maybe, would be to control for metropolitan/rural status.

How about relevance? This one we can test by running a regression estimating the effect of distance on years of schooling:

\[schooling_i = \alpha_0 + \alpha_1 distance_i + e_i\]

How would you know if distance impacts schooling? Do a t-test where \(H_0: \alpha_1 = 0\). Check the significance. You know, a p-value! If we reject \(H_0\), then the instrument is relevant. (A common rule of thumb is that the first-stage F-statistic on the instrument should exceed 10; anything smaller suggests a 'weak' instrument.)

Implementation

Two Stage Least Squares

The 4 steps of 2SLS:

1. Find an instrument
2. Argue that it is exogenous.
3. Stage 1. Demonstrate that it is relevant (how?).
4. Stage 2 regression.

This technique is called two stage least squares because we estimate two ordinary least squares regressions. Suppose you are interested in estimating:

\[y_i= \beta_0+\beta_1x_i+u_i\]

but you are concerned about the endogeneity of \(x\). We could use some kind of proxy, say \(z\), in the following regression:

\[y_i = n_0 + n_1z_i + m_i\]

but then we're worried that someone would argue we aren't telling a good story: this regression gives us the effect of \(z\) on \(y\), when we really only care about the effect of \(x\).

So we can use z, if it’s a valid instrument. In our above example, z would be distance from college and x would be schooling. To implement 2SLS we do the following:

Step 1: Regress x on z and save the predicted values of x. That is, estimate the regression:

\[x_i= \alpha_0+\alpha_1z_i+e_i\]

From this regression we can calculate \(\hat{x}\). Note that this regression is telling us how much of \(x\) is explained by the exogenous variable, \(z\). Any left-over variation (the bad, endogenous stuff, plus any bits not related to \(z\)) is thrown into \(e_i\).

Step 2: Now, take the predicted values from stage 1 and estimate the following regression:

\[y_i =c_0 + c_1\hat{x}_i +q_i\]

where \(q_i\) is a new error term.

So if we did our job well and our instrument is really exogenous and relevant, the estimate of \(c_1\) will be consistent. Note that \(q_i\) is not correlated with \(\hat{x}\), since \(\hat{x}\) contains only the 'part' of the variation in \(x\) that is due to \(z\), which we have claimed is exogenous.
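Before turning to real data, here is a minimal sketch on simulated data (not from the original notes) where we know the true coefficient is 0.5. OLS is pulled away from the truth by an unobserved confounder, while the two steps above recover it, because \(\hat{x}\) varies only with \(z\):

set.seed(202)
n <- 10000
z <- rnorm(n)                              # the instrument
ability <- rnorm(n)                        # unobserved confounder
x <- 1 + 0.8 * z + ability + rnorm(n)      # x is endogenous: it loads on ability
y <- 2 + 0.5 * x + 2 * ability + rnorm(n)  # true effect of x on y is 0.5
coef(lm(y ~ x))["x"]                       # OLS: biased upward
x_hat <- fitted.values(lm(x ~ z))          # step 1
coef(lm(y ~ x_hat))["x_hat"]               # step 2: close to 0.5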

2SLS in R

Continuing with the college distance example, let’s load some packages and data.

library(pacman)        # pacman's p_load() installs and loads packages as needed
p_load(tidyverse, AER) # AER provides ivreg() and the CollegeDistance data
data("CollegeDistance")
wage_data <- CollegeDistance

Why is running OLS bad here? Exogeneity is violated, so we get biased estimates. We can see signs of this if we just run OLS:

ols_mod <- lm(wage ~ education + urban + gender + ethnicity + unemp + income, data = wage_data)
summary(ols_mod)
## 
## Call:
## lm(formula = wage ~ education + urban + gender + ethnicity + 
##     unemp + income, data = wage_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3790 -0.8513  0.1701  0.8286  3.8570 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        8.685542   0.157062  55.300  < 2e-16 ***
## education         -0.003891   0.010568  -0.368   0.7128    
## urbanyes           0.078633   0.044691   1.760   0.0786 .  
## genderfemale      -0.076373   0.037060  -2.061   0.0394 *  
## ethnicityafam     -0.533688   0.052335 -10.197  < 2e-16 ***
## ethnicityhispanic -0.516272   0.049014 -10.533  < 2e-16 ***
## unemp              0.135182   0.006716  20.127  < 2e-16 ***
## incomehigh         0.181047   0.042397   4.270 1.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.266 on 4731 degrees of freedom
## Multiple R-squared:  0.1132, Adjusted R-squared:  0.1119 
## F-statistic: 86.29 on 7 and 4731 DF,  p-value: < 2.2e-16

Notice that the coefficient on education is not significant, which doesn't make a lot of sense if we believe schooling raises wages. So we should use 2SLS instead.

The Long Way

We can implement both steps of the 2SLS ourselves. For step one, we regress the x variable, education, on the instrument (and some controls).

stage_one <- lm(education ~ distance + urban + gender + ethnicity + unemp + income, data = wage_data)
summary(stage_one)
## 
## Call:
## lm(formula = education ~ distance + urban + gender + ethnicity + 
##     unemp + income, data = wage_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.679 -1.564 -0.465  1.479  4.691 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       13.678428   0.085939 159.165  < 2e-16 ***
## distance          -0.072826   0.012045  -6.046 1.60e-09 ***
## urbanyes          -0.035527   0.063884  -0.556   0.5782    
## genderfemale       0.015520   0.050790   0.306   0.7599    
## ethnicityafam     -0.403432   0.071541  -5.639 1.81e-08 ***
## ethnicityhispanic -0.144992   0.067187  -2.158   0.0310 *  
## unemp              0.016602   0.009586   1.732   0.0833 .  
## incomehigh         0.794369   0.057067  13.920  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.735 on 4731 degrees of freedom
## Multiple R-squared:  0.06142,    Adjusted R-squared:  0.06003 
## F-statistic: 44.23 on 7 and 4731 DF,  p-value: < 2.2e-16

Great, it looks like the instrument is relevant (note the large t-statistic on distance). Now we need to predict values of x. We can use fitted.values() to return fitted values from a regression object. Let's attach them to our original wage_data dataframe.

wage_data$x_hat <- fitted.values(stage_one)

Next, we do the second stage regression.

stage_two <- lm(wage ~ x_hat + urban + gender + ethnicity + unemp + income, data = wage_data)
summary(stage_two)
## 
## Call:
## lm(formula = wage ~ x_hat + urban + gender + ethnicity + unemp + 
##     income, data = wage_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1816 -0.8702  0.1452  0.8464  3.8382 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.37169    1.64027  -0.836 0.403053    
## x_hat              0.73370    0.12021   6.103 1.12e-09 ***
## urbanyes           0.02368    0.04540   0.522 0.602032    
## genderfemale      -0.09017    0.03698  -2.438 0.014804 *  
## ethnicityafam     -0.24668    0.06992  -3.528 0.000423 ***
## ethnicityhispanic -0.39711    0.05252  -7.562 4.75e-14 ***
## unemp              0.13487    0.00669  20.160  < 2e-16 ***
## incomehigh        -0.42616    0.10724  -3.974 7.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.261 on 4731 degrees of freedom
## Multiple R-squared:  0.1201, Adjusted R-squared:  0.1188 
## F-statistic: 92.27 on 7 and 4731 DF,  p-value: < 2.2e-16

After estimating the equation via 2SLS, we have significance on x_hat (the causal effect of education on earnings). One caveat: the standard errors from this manual second stage are not quite right, because they ignore the fact that \(\hat{x}\) was itself estimated in stage one. The ivreg() approach below computes them correctly.

Using ivreg

We can use the ivreg() function from the AER package to do this in one line of code. You separate, in a sense, your two stages with a | operator. On the left side of the |, write your original equation. On the right side, list the instruments and exogenous controls. R's update notation makes this easy: . - education + distance means 'all of the original regressors, minus the endogenous education, plus the instrument distance.'

reg_iv <- ivreg(wage ~ urban + gender + ethnicity + unemp + income + education | . - education + distance, data = wage_data)
summary(reg_iv)
## 
## Call:
## ivreg(formula = wage ~ urban + gender + ethnicity + unemp + income + 
##     education | . - education + distance, data = wage_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5.57678 -1.21632 -0.02711  1.40379  5.07247 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.371690   2.346001  -0.585  0.55878    
## urbanyes           0.023678   0.064936   0.365  0.71540    
## genderfemale      -0.090166   0.052895  -1.705  0.08833 .  
## ethnicityafam     -0.246683   0.100003  -2.467  0.01367 *  
## ethnicityhispanic -0.397113   0.075111  -5.287 1.30e-07 ***
## unemp              0.134875   0.009569  14.095  < 2e-16 ***
## incomehigh        -0.426157   0.153387  -2.778  0.00549 ** 
## education          0.733704   0.171931   4.267 2.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.803 on 4731 degrees of freedom
## Multiple R-Squared: -0.7999, Adjusted R-squared: -0.8026 
## Wald test:  45.1 on 7 and 4731 DF,  p-value: < 2.2e-16
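Reassuringly, the coefficient on education (0.7337) is identical to the coefficient on x_hat from the manual version; only the standard errors differ, because ivreg() accounts for the two-stage estimation. And don't be alarmed by the negative R-squared: R-squared has no natural interpretation in IV regression and shouldn't be used to judge the model. If you want built-in specification checks, summary() on an ivreg object can also report diagnostics, including a weak-instruments (first-stage F) test:

summary(reg_iv, diagnostics = TRUE)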