Lesson 1: IV, a motivation

Instrumental Variables:

IV is a tool you can use when you want to estimate the causal effect of some variable on an outcome, but it is hard to tell whether the coefficient you’ve estimated reflects that effect or is simply due to a correlation with something else that also has an impact on your outcome.

Suppose you estimate a regression of income on years of schooling, i.e.:

\(income_i = b_0 + b_1 \cdot schooling_i + u_i\)

One assumption we make for OLS to be BLUE is exogeneity, or in other words: E[u|schooling] = 0.

It is almost certain that schooling is correlated with the error u, because u contains some measure of ‘natural ability’, which we cannot observe. Kids from wealthier families might also choose to get more schooling simply because their parents were wealthy, or because growing up wealthy puts more pressure on them to go to college. There are so many problems.

In other words, if an individual has higher ability, or if childhood family income affects their choice of schooling, it is likely that both their salary and their schooling will be higher. One can see that if E[u|schooling] != 0, our estimate of b1 will be biased and inconsistent.

Recall, we can write:

\(\hat{b}_1 = b_1 + \frac{cov(x,u)}{var(x)}\)

So if we think \(cov(x,u)\) is NOT equal to zero, then our estimate of b1 will be biased.
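To see this in action, here is a tiny simulated example (my own illustration, not part of the schooling data): schooling is built to be correlated with unobserved ability, ability also raises income, and OLS overstates the true return to schooling.

set.seed(42)
n <- 10000
ability   <- rnorm(n)                                   # unobserved; lives in the error u
schooling <- 12 + 2*ability + rnorm(n)                  # cov(schooling, u) > 0
income    <- 10 + 1*schooling + 3*ability + rnorm(n)    # true b1 = 1
coef(lm(income ~ schooling))["schooling"]               # comes out well above 1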

Soooooo we use an approach called Instrumental Variables (IV).

Before giving you the formal definition of an instrument, let’s talk intuition for a second.

Okay, so maybe the covariance of x and u isn’t zero. That is, perhaps some excluded variable is driving both x and y, or there is a ‘two-way street’ between them referred to as ‘simultaneity bias.’ But maybe we have some other variable we can use that explains our causal variable well, but does not directly explain our outcome.

Let’s imagine you have a perfect, worldwide laboratory, and you could change some policy that would 1.) impact people’s schooling choices, but 2.) not their income. At least, not directly.

A good example: mandatory minimum schooling - some minimum amount of schooling required by the government.

Now, assuming we can manipulate this new variable, we can see how differing levels of mandatory minimum schooling impact students’ education levels, and use THAT change to estimate changes in income. Perhaps by regressing education levels on the mandatory minimums, and then income on the predicted education levels. A two-stage least squares.

Formal Definition: Instrumental Variable:

An instrumental variable, z, for the regression \(y = b_0 + b_1 \cdot x + u\) is one that satisfies the following two conditions:

  1. Exogeneity - it is uncorrelated with the error, u

  2. Relevance - it is correlated with x

A common instrument for schooling in the income regression (made popular by Card (1995)) is an individual’s proximity to a college. For relevance, we could test by running a regression estimating the effect of distance on years of schooling:

\[schooling_i = a_0 + a_1 \cdot distance_i + e_i\]

How would you know if distance impacts schooling? Do a t-test where H_0: a1 = 0. Check the significance. You know, a p-value!
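As a quick sketch of that relevance check in R (peeking ahead to the CollegeDistance data from the AER package, which we use in the R lesson below; there, education plays the role of schooling):

library(AER)
data("CollegeDistance")
relevance_check <- lm(education ~ distance, data = CollegeDistance)
summary(relevance_check)   # look at the t statistic and p-value on distance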

Lesson 2: Implementation

How would we go about estimating an IV regression?

2 Stage Least Squares (2SLS)

Overview

This technique is called two stage least squares because we estimate two regressions. However, I really like to think of this as being a 4 step procedure. Here is the general set-up; I’ll give you the “two” official stages in the general setting and then give an example where I illustrate why I believe there are a couple of extra steps. Suppose you are interested in estimating:

\[y = b_0 + b_1 \cdot x + u\]

but you are concerned about the endogeneity of x. So you find a valid instrument, call it z.

To implement 2SLS we do the following:

  1. Regress x on z and save the predicted values of x. That is, estimate the regression:

\[x_i = a_0 + a_1 \cdot z_i + e_i\]

From this regression we can calculate \(\hat{x}\) (x-hat). Note that this regression tells us what part of x is explained by the exogenous variable z. Any left-over variation (the bad, endogenous stuff, plus any bits not related to z) is thrown into e_i.

  2. Now, take the predicted values from stage 1 and estimate the following regression:

\[y_i = c_0 + c_1 \cdot \hat{x}_i + q_i\]

where q_i is a new error term. So if we did our job well, and our instrument is really valid, the estimate of c1 will be consistent. Note that q_i is not correlated with \(\hat{x}\), since \(\hat{x}\) captures only the ‘part’ of the variation in x that is due to z, which we have claimed is exogenous.

Let’s think about our new coefficient, c1. c1 is really equal to the coefficient from regressing y on z divided by the coefficient from regressing x on z, which gives us a plim such that

\[\text{plim}(\hat{c}_1) = b_1 + \frac{cov(z,u)}{cov(z,x)}\]

This gives us a nice summary of our issues at hand. The numerator, cov(z,u), ought to be SMALL and the denominator, cov(z,x), ought to be LARGE. What does this mean? (Think about bias!)
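Here is a minimal simulated sketch of the two stages (again my own toy example, not the schooling data), where the true coefficient on x is 1, x is endogenous, and z is a valid instrument:

set.seed(1)
n <- 5000
z <- rnorm(n)                    # instrument: moves x, unrelated to u
u <- rnorm(n)                    # structural error
x <- 0.5*z + u + rnorm(n)        # x is endogenous: correlated with u
y <- 2 + 1*x + u                 # true coefficient on x is 1

stage1 <- lm(x ~ z)              # stage 1: regress x on z
x_hat  <- fitted(stage1)         # keep only the part of x explained by z
stage2 <- lm(y ~ x_hat)          # stage 2: regress y on x_hat

coef(lm(y ~ x))["x"]             # naive OLS: biased well above 1
coef(stage2)["x_hat"]            # 2SLS: close to the true value of 1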

The 4 steps of 2SLS

Back to the schooling regression. Here are the real steps to 2SLS. We need to convince everyone that the first two steps are even worth our time!!

So. Let’s go through these four steps.

  1. Find an instrument

  2. Argue like hell that it is exogenous.

  3. Stage 1. Demonstrate that it is relevant (how?)

  4. Stage 2. Regress the outcome on the fitted values from stage 1.

Lesson 3: Estimating 2SLS in R

As per usual, we’ll need to load some packages.

library(pacman)            # pacman gives us p_load(), which installs and loads packages as needed
p_load(tidyverse, AER)     # AER contains the CollegeDistance data and ivreg()

and we’ll need some data to manipulate. We’re going to use the education-income example above to examine the effects here.

Our dataset, CollegeDistance, ships with the AER package. Let’s use it.

data("CollegeDistance")
wage_data<-CollegeDistance
names(wage_data)
##  [1] "gender"    "ethnicity" "score"     "fcollege"  "mcollege"  "home"     
##  [7] "urban"     "unemp"     "wage"      "distance"  "tuition"   "education"
## [13] "income"    "region"

What variables do we have? How do we examine these?

summary(wage_data)
##     gender        ethnicity        score       fcollege   mcollege    home     
##  male  :2139   other   :3050   Min.   :28.95   no :3753   no :4088   no : 852  
##  female:2600   afam    : 786   1st Qu.:43.92   yes: 986   yes: 651   yes:3887  
##                hispanic: 903   Median :51.19                                   
##                                Mean   :50.89                                   
##                                3rd Qu.:57.77                                   
##                                Max.   :72.81                                   
##  urban          unemp             wage           distance         tuition      
##  no :3635   Min.   : 1.400   Min.   : 6.590   Min.   : 0.000   Min.   :0.2575  
##  yes:1104   1st Qu.: 5.900   1st Qu.: 8.850   1st Qu.: 0.400   1st Qu.:0.4850  
##             Median : 7.100   Median : 9.680   Median : 1.000   Median :0.8245  
##             Mean   : 7.597   Mean   : 9.501   Mean   : 1.803   Mean   :0.8146  
##             3rd Qu.: 8.900   3rd Qu.:10.150   3rd Qu.: 2.500   3rd Qu.:1.1270  
##             Max.   :24.900   Max.   :12.960   Max.   :20.000   Max.   :1.4042  
##    education      income       region    
##  Min.   :12.00   low :3374   other:3796  
##  1st Qu.:12.00   high:1365   west : 943  
##  Median :13.00                           
##  Mean   :13.81                           
##  3rd Qu.:16.00                           
##  Max.   :18.00

OLS

Now, let’s estimate the returns to education (the effect of education on wages) the ‘naive’ way, via OLS:

THINK: Why is it naive to do this?

ols_mod <- lm(wage ~ education+urban + gender + ethnicity + unemp +income , data=wage_data)
summary(ols_mod)
## 
## Call:
## lm(formula = wage ~ education + urban + gender + ethnicity + 
##     unemp + income, data = wage_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3790 -0.8513  0.1701  0.8286  3.8570 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        8.685542   0.157062  55.300  < 2e-16 ***
## education         -0.003891   0.010568  -0.368   0.7128    
## urbanyes           0.078633   0.044691   1.760   0.0786 .  
## genderfemale      -0.076373   0.037060  -2.061   0.0394 *  
## ethnicityafam     -0.533688   0.052335 -10.197  < 2e-16 ***
## ethnicityhispanic -0.516272   0.049014 -10.533  < 2e-16 ***
## unemp              0.135182   0.006716  20.127  < 2e-16 ***
## incomehigh         0.181047   0.042397   4.270 1.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.266 on 4731 degrees of freedom
## Multiple R-squared:  0.1132, Adjusted R-squared:  0.1119 
## F-statistic: 86.29 on 7 and 4731 DF,  p-value: < 2.2e-16

Note that in the case of OLS, there is no significant effect of education on earnings. I guess y’all should leave school now! (Please don’t! Unless you want to…)

Let’s use our approach outlined above.

2SLS

Now let’s treat education as endogenous and instrument for it using distance to college. Recall, we need two conditions to hold for distance to be a valid instrument. What are they?

  1. Exogeneity: Does distance directly impact my earnings? Probably not. I am talking about the direct effect of distance on earnings. Anything we can control for (such as living in a city vs. the country) isn’t really an issue. Ask yourself this: when you go apply for a job, would it be strange for the employer to ask how far away from a college you grew up?

  2. Relevance: Does it impact my own level of schooling? This one we can test. And the answer is probably.

Okay, now let’s implement 2SLS.

Stage 1:

stage_one <- lm(education ~ distance + urban + gender + ethnicity + income + unemp, data = wage_data)   # regress the endogenous variable on the instrument plus controls

Check relevance: is distance a significant predictor of education?

summary(stage_one)
## 
## Call:
## lm(formula = education ~ distance + urban + gender + ethnicity + 
##     income + unemp, data = wage_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.679 -1.564 -0.465  1.479  4.691 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       13.678428   0.085939 159.165  < 2e-16 ***
## distance          -0.072826   0.012045  -6.046 1.60e-09 ***
## urbanyes          -0.035527   0.063884  -0.556   0.5782    
## genderfemale       0.015520   0.050790   0.306   0.7599    
## ethnicityafam     -0.403432   0.071541  -5.639 1.81e-08 ***
## ethnicityhispanic -0.144992   0.067187  -2.158   0.0310 *  
## incomehigh         0.794369   0.057067  13.920  < 2e-16 ***
## unemp              0.016602   0.009586   1.732   0.0833 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.735 on 4731 degrees of freedom
## Multiple R-squared:  0.06142,    Adjusted R-squared:  0.06003 
## F-statistic: 44.23 on 7 and 4731 DF,  p-value: < 2.2e-16

Get the fitted values and add them to our data frame:

wage_data$x_hat <- fitted.values(stage_one)

Stage 2

stage_two <- lm(wage ~ urban + gender + ethnicity + unemp + x_hat + income, data = wage_data)   # education replaced by its fitted values, x_hat

Let’s compare the results.

summary(ols_mod)
## 
## Call:
## lm(formula = wage ~ education + urban + gender + ethnicity + 
##     unemp + income, data = wage_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3790 -0.8513  0.1701  0.8286  3.8570 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        8.685542   0.157062  55.300  < 2e-16 ***
## education         -0.003891   0.010568  -0.368   0.7128    
## urbanyes           0.078633   0.044691   1.760   0.0786 .  
## genderfemale      -0.076373   0.037060  -2.061   0.0394 *  
## ethnicityafam     -0.533688   0.052335 -10.197  < 2e-16 ***
## ethnicityhispanic -0.516272   0.049014 -10.533  < 2e-16 ***
## unemp              0.135182   0.006716  20.127  < 2e-16 ***
## incomehigh         0.181047   0.042397   4.270 1.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.266 on 4731 degrees of freedom
## Multiple R-squared:  0.1132, Adjusted R-squared:  0.1119 
## F-statistic: 86.29 on 7 and 4731 DF,  p-value: < 2.2e-16
summary(stage_two)
## 
## Call:
## lm(formula = wage ~ urban + gender + ethnicity + unemp + x_hat + 
##     income, data = wage_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1816 -0.8702  0.1452  0.8464  3.8382 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.37169    1.64027  -0.836 0.403053    
## urbanyes           0.02368    0.04540   0.522 0.602032    
## genderfemale      -0.09017    0.03698  -2.438 0.014804 *  
## ethnicityafam     -0.24668    0.06992  -3.528 0.000423 ***
## ethnicityhispanic -0.39711    0.05252  -7.562 4.75e-14 ***
## unemp              0.13487    0.00669  20.160  < 2e-16 ***
## x_hat              0.73370    0.12021   6.103 1.12e-09 ***
## incomehigh        -0.42616    0.10724  -3.974 7.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.261 on 4731 degrees of freedom
## Multiple R-squared:  0.1201, Adjusted R-squared:  0.1188 
## F-statistic: 92.27 on 7 and 4731 DF,  p-value: < 2.2e-16

After estimating the equation via 2SLS, we have a significant coefficient on x_hat (our estimate of the causal effect of education on earnings)!

There is also a built-in way to do this in R: the ivreg() function from the AER package. Let’s repeat the process using ivreg. (Bonus: ivreg reports standard errors that account for the first-stage estimation, which our manual stage_two standard errors do not, so ivreg’s are the ones to trust.)

This is a “fun” tool. You separate, in a sense, your two stages with a ‘|’. On the left side of the ‘|’, put your original equation. On the right side, put the variables you will use as instruments, along with the other exogenous regressors. Of course, R won’t know what you’re instrumenting for, so you need to tell it with the shorthand . - education + distance: everything from the first part, minus the endogenous variable (education), plus the instrument (distance).

reg_iv<-AER::ivreg(wage~urban + gender + ethnicity + unemp + income + education|.-education + distance, data=wage_data)
summary(reg_iv)
## 
## Call:
## AER::ivreg(formula = wage ~ urban + gender + ethnicity + unemp + 
##     income + education | . - education + distance, data = wage_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5.57678 -1.21632 -0.02711  1.40379  5.07247 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.371690   2.346001  -0.585  0.55878    
## urbanyes           0.023678   0.064936   0.365  0.71540    
## genderfemale      -0.090166   0.052895  -1.705  0.08833 .  
## ethnicityafam     -0.246683   0.100003  -2.467  0.01367 *  
## ethnicityhispanic -0.397113   0.075111  -5.287 1.30e-07 ***
## unemp              0.134875   0.009569  14.095  < 2e-16 ***
## incomehigh        -0.426157   0.153387  -2.778  0.00549 ** 
## education          0.733704   0.171931   4.267 2.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.803 on 4731 degrees of freedom
## Multiple R-Squared: -0.7999, Adjusted R-squared: -0.8026 
## Wald test:  45.1 on 7 and 4731 DF,  p-value: < 2.2e-16