IV is a tool you can use when you want to estimate the causal effect of some variable on an outcome, but it is hard to tell whether the coefficient you've estimated reflects that causal effect or is simply due to a correlation with something else that also impacts your outcome.
Suppose you estimate a regression of income on years of schooling, i.e.:
\(income_i = b_0 + b_1 \cdot schooling_i + u_i\)
One assumption we make for OLS to be BLUE is exogeneity, or in other words: E[u|schooling] = 0.
It is almost certain that schooling is correlated with the error u, because u contains some measure of 'natural ability', which we cannot observe. Kids from higher-income families might choose to get more schooling because their parents were wealthy. Or, growing up wealthy puts more pressure on kids to go to college. There are so many problems.
In other words, if an individual has higher ability, or if childhood income impacts their choice of schooling, it is likely that both their salary and their schooling will be higher. One can see that if E[u|schooling] != 0, our estimate of b1 will be biased and inconsistent.
Recall, we can write:
\(\hat{b}_1 = b_1 + \frac{cov(x,u)}{var(x)}\)
So if we think \(cov(x,u)\) is NOT equal to zero, then our estimate of b1 will be biased.
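To see this bias in action, here is a minimal simulation sketch (the variable names and numbers are invented for illustration): we bake an unobserved 'ability' term into both schooling and income, set the true effect of schooling to 1, and watch OLS overshoot.
set.seed(42)
n       <- 10000
ability <- rnorm(n)                                  # unobserved; lives in u
school  <- 12 + 0.5 * ability + rnorm(n)             # schooling rises with ability
income  <- 20 + 1 * school + 2 * ability + rnorm(n)  # true coefficient is 1
coef(lm(income ~ school))["school"]                  # comes out well above 1
Because cov(school, u) > 0 here, the estimated coefficient lands around 1.8 instead of the true 1.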
Soooooo we use an approach called Instrumental Variables (IV).
Before giving you the formal definition of an instrument, let's talk intuition for a second.
Okay, so maybe the covariance of x and u isn't zero. Perhaps an omitted variable is lurking in u, or causality runs both ways (a 'two-way street' known as simultaneity bias). But maybe we have some other variable we can use: one that explains our causal variable well, but does not directly explain our outcome.
Let's imagine you have a perfect, worldwide laboratory, and you could change some policy that would (1) impact people's schooling choices, but (2) not their income. At least, not directly.
A good example: mandatory minimum schooling, i.e., some minimum amount of schooling required by the government.
Now, assuming we can manipulate this new variable, we can see how differing levels of mandatory minimum schooling impact students' education levels, and use THAT change to estimate changes in income. Perhaps by regressing education levels on mandatory minimums, and then regressing income on the predicted education levels. A two-stage least squares.
An instrumental variable z for the regression \(y = b_1 x + u\) is one that satisfies the following two conditions:
Exogeneity: it is uncorrelated with the error, u (i.e., cov(z,u) = 0)
Relevance: it is correlated with x (i.e., cov(z,x) != 0)
A common instrument for schooling in the income regression (made popular by Card (1995)) is an individual's proximity to a college. For relevance, we could test by running a regression estimating the effect of distance on years of schooling:
\[schooling_i = a_0 + a_1 \cdot distance_i + e_i\]
How would you know if distance impacts schooling? Do a t-test where \(H_0: a_1 = 0\). Check the significance. You know, a p-value!
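In R, that test is just the usual regression summary. A minimal sketch, assuming a hypothetical data frame df with columns schooling and distance (we'll do this for real with actual data further below):
first_stage_check <- lm(schooling ~ distance, data = df)  # df is a placeholder
summary(first_stage_check)  # look at the t-statistic and p-value on distance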
How would we go about estimating an IV regression?
This technique is called two-stage least squares (2SLS) because we estimate two regressions. However, I really like to think of this as a four-stage procedure. Here is the general set-up; I'll give you the "two" official stages in the general setting, and then give an example where I illustrate why I believe there are a couple of extra steps. Suppose you are interested in estimating:
\[y_i = b_0 + b_1 x_i + u_i\]
but you are concerned about the endogeneity of x. So you find a valid instrument, call it z.
To implement 2SLS we do the following. Stage 1: regress x on the instrument z,
\[x_i = a_0 + a_1 z_i + e_i\]
From this regression we can calculate the fitted values \(\hat{x}_i\). Note that this regression tells us what part of x is explained by the exogenous variable z. Any leftover variation (the bad, endogenous stuff, plus any bits unrelated to z) is thrown into e_i.
Stage 2: regress y on the fitted values,
\[y_i = c_0 + c_1 \hat{x}_i + q_i\]
where q_i is a new error term. So if we did our job well, and our instrument really is valid, the estimate of c1 will be consistent. Note that q_i is not correlated with \(\hat{x}_i\), since \(\hat{x}_i\) contains only the part of the variation in x that is due to z, which we have claimed is exogenous.
Let's think about our new coefficient, c1. In the simple case, c1 is really the coefficient from regressing y on z, divided by the coefficient from regressing x on z. Which gives us a plim such that
\[\text{plim}(\hat{c}_1) = b_1 + \frac{cov(z,u)}{cov(z,x)}\]
This gives us a nice summary of the issues at hand. The numerator, cov(z,u), ought to be SMALL, and the denominator, cov(z,x), ought to be LARGE. What does this mean? (Think about bias!)
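To see where that plim comes from, here's a quick sketch for the simple one-instrument, one-regressor case: substitute the true model \(y = b_0 + b_1 x + u\) into the IV estimator and let the sample covariances converge,
\[\hat{c}_1 = \frac{cov(z, y)}{cov(z, x)} = \frac{cov(z,\, b_0 + b_1 x + u)}{cov(z, x)} = b_1 + \frac{cov(z, u)}{cov(z, x)}\]
Exogeneity kills the second term entirely, and relevance keeps the denominator away from zero, so even a tiny violation of exogeneity doesn't get blown up.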
Back to the schooling regression. Here are the real steps to 2SLS. The first two steps are all about convincing everyone that the "official" two stages are even worth our time!!
So. Let's go through these four steps:
1. Find an instrument.
2. Argue like hell that it is exogenous.
3. Stage 1: Demonstrate that it is relevant (how? Regress the endogenous variable on the instrument and test).
4. Stage 2: Regress the outcome on the fitted values from Stage 1.
As per usual, we’ll need to load some packages.
library(pacman)
p_load(tidyverse, AER)  # installs (if needed) and loads both packages
and we’ll need some data to manipulate. We’re going to use the education-income example above to examine the effects here.
Our dataset is contained in the AER package. Let's use it.
data("CollegeDistance")
wage_data<-CollegeDistance
names(wage_data)
## [1] "gender" "ethnicity" "score" "fcollege" "mcollege" "home"
## [7] "urban" "unemp" "wage" "distance" "tuition" "education"
## [13] "income" "region"
What variables do we have? How do we examine these?
summary(wage_data)
## gender ethnicity score fcollege mcollege home
## male :2139 other :3050 Min. :28.95 no :3753 no :4088 no : 852
## female:2600 afam : 786 1st Qu.:43.92 yes: 986 yes: 651 yes:3887
## hispanic: 903 Median :51.19
## Mean :50.89
## 3rd Qu.:57.77
## Max. :72.81
## urban unemp wage distance tuition
## no :3635 Min. : 1.400 Min. : 6.590 Min. : 0.000 Min. :0.2575
## yes:1104 1st Qu.: 5.900 1st Qu.: 8.850 1st Qu.: 0.400 1st Qu.:0.4850
## Median : 7.100 Median : 9.680 Median : 1.000 Median :0.8245
## Mean : 7.597 Mean : 9.501 Mean : 1.803 Mean :0.8146
## 3rd Qu.: 8.900 3rd Qu.:10.150 3rd Qu.: 2.500 3rd Qu.:1.1270
## Max. :24.900 Max. :12.960 Max. :20.000 Max. :1.4042
## education income region
## Min. :12.00 low :3374 other:3796
## 1st Qu.:12.00 high:1365 west : 943
## Median :13.00
## Mean :13.81
## 3rd Qu.:16.00
## Max. :18.00
Now, let's estimate the returns to education on wages the 'naive' way, via OLS:
THINK: Why is it naive to do this?
ols_mod <- lm(wage ~ education + urban + gender + ethnicity + unemp + income, data = wage_data)
summary(ols_mod)
##
## Call:
## lm(formula = wage ~ education + urban + gender + ethnicity +
## unemp + income, data = wage_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3790 -0.8513 0.1701 0.8286 3.8570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.685542 0.157062 55.300 < 2e-16 ***
## education -0.003891 0.010568 -0.368 0.7128
## urbanyes 0.078633 0.044691 1.760 0.0786 .
## genderfemale -0.076373 0.037060 -2.061 0.0394 *
## ethnicityafam -0.533688 0.052335 -10.197 < 2e-16 ***
## ethnicityhispanic -0.516272 0.049014 -10.533 < 2e-16 ***
## unemp 0.135182 0.006716 20.127 < 2e-16 ***
## incomehigh 0.181047 0.042397 4.270 1.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.266 on 4731 degrees of freedom
## Multiple R-squared: 0.1132, Adjusted R-squared: 0.1119
## F-statistic: 86.29 on 7 and 4731 DF, p-value: < 2.2e-16
Note that in the case of OLS, there is no significant effect of education on earnings. I guess y’all should leave school now! (Please don’t! Unless you want to…)
Let’s use our approach outlined above.
Now let's treat education as endogenous and instrument for it using distance to college. Recall, we need two conditions to hold for distance to be a valid instrument. What are they?
Exogeneity: Does distance directly impact my earnings? Probably not. (The claim is that distance affects earnings only through its effect on education. Anything we can control for, such as living in a city vs. the country, isn't really an issue.) Ask yourself this: when you go apply for a job, would it be strange for the employer to ask you how far away from a college you grew up? Exactly.
Relevance: Does distance impact my own level of schooling? This one we can test. And the answer is probably yes.
Okay, now let's implement 2SLS.
# Stage 1: regress the endogenous variable (education) on the instrument + controls
stage_one <- lm(education ~ distance + urban + gender + ethnicity + income + unemp,
                data = wage_data)
Check relevance:
summary(stage_one)
##
## Call:
## lm(formula = education ~ distance + urban + gender + ethnicity +
## income + unemp, data = wage_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.679 -1.564 -0.465 1.479 4.691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.678428 0.085939 159.165 < 2e-16 ***
## distance -0.072826 0.012045 -6.046 1.60e-09 ***
## urbanyes -0.035527 0.063884 -0.556 0.5782
## genderfemale 0.015520 0.050790 0.306 0.7599
## ethnicityafam -0.403432 0.071541 -5.639 1.81e-08 ***
## ethnicityhispanic -0.144992 0.067187 -2.158 0.0310 *
## incomehigh 0.794369 0.057067 13.920 < 2e-16 ***
## unemp 0.016602 0.009586 1.732 0.0833 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.735 on 4731 degrees of freedom
## Multiple R-squared: 0.06142, Adjusted R-squared: 0.06003
## F-statistic: 44.23 on 7 and 4731 DF, p-value: < 2.2e-16
Get the fitted values and add them to our data frame:
wage_data$x_hat <- fitted.values(stage_one)  # first-stage fitted values
# Stage 2: replace education with x_hat
stage_two <- lm(wage ~ urban + gender + ethnicity + unemp + x_hat + income,
                data = wage_data)
Let's compare the results.
summary(ols_mod)
##
## Call:
## lm(formula = wage ~ education + urban + gender + ethnicity +
## unemp + income, data = wage_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3790 -0.8513 0.1701 0.8286 3.8570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.685542 0.157062 55.300 < 2e-16 ***
## education -0.003891 0.010568 -0.368 0.7128
## urbanyes 0.078633 0.044691 1.760 0.0786 .
## genderfemale -0.076373 0.037060 -2.061 0.0394 *
## ethnicityafam -0.533688 0.052335 -10.197 < 2e-16 ***
## ethnicityhispanic -0.516272 0.049014 -10.533 < 2e-16 ***
## unemp 0.135182 0.006716 20.127 < 2e-16 ***
## incomehigh 0.181047 0.042397 4.270 1.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.266 on 4731 degrees of freedom
## Multiple R-squared: 0.1132, Adjusted R-squared: 0.1119
## F-statistic: 86.29 on 7 and 4731 DF, p-value: < 2.2e-16
summary(stage_two)
##
## Call:
## lm(formula = wage ~ urban + gender + ethnicity + unemp + x_hat +
## income, data = wage_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1816 -0.8702 0.1452 0.8464 3.8382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.37169 1.64027 -0.836 0.403053
## urbanyes 0.02368 0.04540 0.522 0.602032
## genderfemale -0.09017 0.03698 -2.438 0.014804 *
## ethnicityafam -0.24668 0.06992 -3.528 0.000423 ***
## ethnicityhispanic -0.39711 0.05252 -7.562 4.75e-14 ***
## unemp 0.13487 0.00669 20.160 < 2e-16 ***
## x_hat 0.73370 0.12021 6.103 1.12e-09 ***
## incomehigh -0.42616 0.10724 -3.974 7.18e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.261 on 4731 degrees of freedom
## Multiple R-squared: 0.1201, Adjusted R-squared: 0.1188
## F-statistic: 92.27 on 7 and 4731 DF, p-value: < 2.2e-16
After estimating the equation via 2SLS, we have significance on x_hat (the causal effect of education on earnings)!
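A fun sanity check: because we are just-identified (one instrument for one endogenous variable, with the same controls in both regressions), the 2SLS coefficient equals the reduced-form effect of distance on wage divided by the first-stage effect of distance on education. This is exactly the "coefficient of y on z over coefficient of x on z" ratio from earlier:
# Reduced form: regress the outcome directly on the instrument + controls
reduced_form <- lm(wage ~ distance + urban + gender + ethnicity + income + unemp,
                   data = wage_data)
# This ratio should reproduce the x_hat coefficient from stage_two
coef(reduced_form)["distance"] / coef(stage_one)["distance"]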
There is also a built-in way to do this in R, called ivreg (from the AER package). Let's redo the process using ivreg. (Bonus: ivreg also reports correct standard errors. The second-stage standard errors from our manual lm approach are wrong because they ignore the fact that x_hat was itself estimated, which is why the standard errors below differ from stage_two even though the coefficients match.)
This is a "fun" tool. You separate, in a sense, your two stages with a '|'. On the left side, put your original equation. On the right side, tell R which variables are instruments. A convenient shortcut: '.' stands for everything on the left-hand side, so you drop the endogenous variable with . - education and add the instrument with + distance.
reg_iv <- AER::ivreg(wage ~ urban + gender + ethnicity + unemp + income + education |
                       . - education + distance, data = wage_data)
summary(reg_iv)
##
## Call:
## AER::ivreg(formula = wage ~ urban + gender + ethnicity + unemp +
## income + education | . - education + distance, data = wage_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.57678 -1.21632 -0.02711 1.40379 5.07247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.371690 2.346001 -0.585 0.55878
## urbanyes 0.023678 0.064936 0.365 0.71540
## genderfemale -0.090166 0.052895 -1.705 0.08833 .
## ethnicityafam -0.246683 0.100003 -2.467 0.01367 *
## ethnicityhispanic -0.397113 0.075111 -5.287 1.30e-07 ***
## unemp 0.134875 0.009569 14.095 < 2e-16 ***
## incomehigh -0.426157 0.153387 -2.778 0.00549 **
## education 0.733704 0.171931 4.267 2.02e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.803 on 4731 degrees of freedom
## Multiple R-Squared: -0.7999, Adjusted R-squared: -0.8026
## Wald test: 45.1 on 7 and 4731 DF, p-value: < 2.2e-16
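(Don't panic about the negative R-squared, by the way: for 2SLS the usual R-squared has no meaningful interpretation and can be negative.) One last bonus of ivreg: its summary method can report diagnostic tests, including a first-stage F test for weak instruments (a common rule of thumb wants that F comfortably above 10) and a Wu-Hausman test for endogeneity:
summary(reg_iv, diagnostics = TRUE)  # adds weak-instruments and Wu-Hausman tests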