Labor Project 2: IV regression
1 Estimating the economic returns to schooling
Card (1993) investigates the economic returns to schooling using “distance to college” as an instrumental variable.
We load the AER package, which contains survey data on high school graduates with variables for wages, education, average tuition, distance to college, and some other socio-economic measures.
library(AER)
data("CollegeDistance") # load the dataset
You can see details of the dataset by running str(CollegeDistance).
names(CollegeDistance)
 [1] "gender"    "ethnicity" "score"     "fcollege"  "mcollege"  "home"
 [7] "urban"     "unemp"     "wage"      "distance"  "tuition"   "education"
[13] "income"    "region"
dim(CollegeDistance)
[1] 4739   14
head(CollegeDistance)
  gender ethnicity score fcollege mcollege home urban unemp wage distance
1   male     other 39.15      yes       no  yes   yes   6.2 8.09      0.2
2 female     other 48.87       no       no  yes   yes   6.2 8.09      0.2
3   male     other 48.74       no       no  yes   yes   6.2 8.09      0.2
4   male      afam 40.40       no       no  yes   yes   6.2 8.09      0.2
5 female     other 40.48       no       no   no   yes   5.6 8.09      0.4
6   male     other 54.71       no       no  yes   yes   5.6 8.09      0.4
  tuition education income region
1 0.88915        12   high  other
2 0.88915        12    low  other
3 0.88915        12    low  other
4 0.88915        12    low  other
5 0.88915        13    low  other
6 0.88915        12    low  other
summary(CollegeDistance)
    gender        ethnicity         score       fcollege   mcollege    home
 male  :2139   other   :3050   Min.   :28.95   no :3753   no :4088   no : 852
 female:2600   afam    : 786   1st Qu.:43.92   yes: 986   yes: 651   yes:3887
               hispanic: 903   Median :51.19
                               Mean   :50.89
                               3rd Qu.:57.77
                               Max.   :72.81
  urban          unemp             wage           distance          tuition
 no :3635   Min.   : 1.400   Min.   : 6.590   Min.   : 0.000   Min.   :0.2575
 yes:1104   1st Qu.: 5.900   1st Qu.: 8.850   1st Qu.: 0.400   1st Qu.:0.4850
            Median : 7.100   Median : 9.680   Median : 1.000   Median :0.8245
            Mean   : 7.597   Mean   : 9.501   Mean   : 1.803   Mean   :0.8146
            3rd Qu.: 8.900   3rd Qu.:10.150   3rd Qu.: 2.500   3rd Qu.:1.1270
            Max.   :24.900   Max.   :12.960   Max.   :20.000   Max.   :1.4042
   education      income       region
 Min.   :12.00   low :3374   other:3796
 1st Qu.:12.00   high:1365   west : 943
 Median :13.00
 Mean   :13.81
 3rd Qu.:16.00
 Max.   :18.00
2 Linear regressions
2.1 Regress the log wage on education
lm_no_control = lm(log(wage) ~ education, data = CollegeDistance)
summary(lm_no_control)
Call:
lm(formula = log(wage) ~ education, data = CollegeDistance)
Residuals:
Min 1Q Median 3Q Max
-0.36208 -0.06271 0.03255 0.07794 0.32436
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.213220 0.016228 136.379 <2e-16 ***
education 0.002024 0.001166 1.737 0.0825 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1435 on 4737 degrees of freedom
Multiple R-squared: 0.0006364, Adjusted R-squared: 0.0004254
F-statistic: 3.016 on 1 and 4737 DF, p-value: 0.08249
2.2 Regress the log of wage on education and controls
Controls include unemp, hispanic, afam, female and urban.
Note that all controls except unemp are dummy variables.
# build 0/1 dummy variables from the factor columns
CollegeDistance$is_hispanic = ifelse(CollegeDistance$ethnicity == "hispanic", 1, 0)
CollegeDistance$is_afam = ifelse(CollegeDistance$ethnicity == "afam", 1, 0)
CollegeDistance$is_urban = ifelse(CollegeDistance$urban == "yes", 1, 0)
CollegeDistance$is_female = ifelse(CollegeDistance$gender == "female", 1, 0)
lm_controls = lm(log(wage) ~ education + unemp + is_hispanic + is_afam + is_female + is_urban,
                 data = CollegeDistance)
summary(lm_controls)
Call:
lm(formula = log(wage) ~ education + unemp + is_hispanic + is_afam +
is_female + is_urban, data = CollegeDistance)
Residuals:
Min 1Q Median 3Q Max
-0.39998 -0.08223 0.02833 0.09486 0.37945
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1519999 0.0168512 127.706 <2e-16 ***
education 0.0006723 0.0011121 0.605 0.5455
unemp 0.0135938 0.0007203 18.874 <2e-16 ***
is_hispanic -0.0535204 0.0052237 -10.246 <2e-16 ***
is_afam -0.0619139 0.0055990 -11.058 <2e-16 ***
is_female -0.0091150 0.0039785 -2.291 0.0220 *
is_urban 0.0089393 0.0048005 1.862 0.0626 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1361 on 4732 degrees of freedom
Multiple R-squared: 0.1026, Adjusted R-squared: 0.1015
F-statistic: 90.2 on 6 and 4732 DF, p-value: < 2.2e-16
3 IV regression
We employ distance as an instrument for education in both regressions using ivreg().
We first check whether the instrument distance satisfies the relevance assumption by computing its correlation with education:
cor(CollegeDistance$distance,
    CollegeDistance$education)
[1] -0.09318309
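A raw correlation only hints at relevance. A more common diagnostic is the first-stage regression of education on distance, sketched below. The object names first_stage, fs_coefs and fs_F are illustrative and not part of the original project. The rule of thumb that a first-stage F-statistic above 10 suggests the instrument is not weak is an added assumption, not a claim made in this handout.

```r
library(AER)               # provides the CollegeDistance data
data("CollegeDistance")

# First stage without controls: regress the endogenous regressor
# (education) on the instrument (distance).
first_stage <- lm(education ~ distance, data = CollegeDistance)
fs_coefs <- summary(first_stage)$coefficients

# With a single instrument and no controls, the first-stage
# F-statistic equals the squared t-statistic on the instrument.
fs_F <- fs_coefs["distance", "t value"]^2

fs_coefs["distance", ]     # slope, standard error, t value, p-value
fs_F                       # compare with the rule-of-thumb threshold (assumed: 10)
```

The sign of the slope matches the sign of the correlation above, since both describe the same linear relationship between distance and education.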
ivreg_no_control = ivreg(log(wage) ~ education | distance,
                         data = CollegeDistance)
summary(ivreg_no_control)
Call:
ivreg(formula = log(wage) ~ education | distance, data = CollegeDistance)
Residuals:
Min 1Q Median 3Q Max
-0.36022 -0.06094 0.03149 0.07747 0.32330
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.221281 0.172731 12.860 <2e-16 ***
education 0.001441 0.012509 0.115 0.908
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1435 on 4737 degrees of freedom
Multiple R-Squared: 0.0005835, Adjusted R-squared: 0.0003725
Wald test: 0.01326 on 1 and 4737 DF, p-value: 0.9083
ivreg_controls =
  ivreg(log(wage) ~ education + unemp + is_hispanic + is_afam + is_female + is_urban | distance + unemp + is_hispanic + is_afam + is_female + is_urban,
        data = CollegeDistance)
- Here I write out the formula explicitly: log(wage) ~ education + unemp + is_hispanic + is_afam + is_female + is_urban | distance + unemp + is_hispanic + is_afam + is_female + is_urban. A shorthand is log(wage) ~ education + unemp + is_hispanic + is_afam + is_female + is_urban | . - education + distance
summary(ivreg_controls)
Call:
ivreg(formula = log(wage) ~ education + unemp + is_hispanic +
is_afam + is_female + is_urban | distance + unemp + is_hispanic +
is_afam + is_female + is_urban, data = CollegeDistance)
Residuals:
Min 1Q Median 3Q Max
-0.5885016 -0.1191974 -0.0001799 0.1452146 0.4576460
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2171787 0.2018797 6.029 1.77e-09 ***
education 0.0673242 0.0143812 4.681 2.93e-06 ***
unemp 0.0142234 0.0009648 14.743 < 2e-16 ***
is_hispanic -0.0335043 0.0081520 -4.110 4.02e-05 ***
is_afam -0.0277621 0.0104342 -2.661 0.00782 **
is_female -0.0076101 0.0052865 -1.440 0.15007
is_urban 0.0064494 0.0063892 1.009 0.31283
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1805 on 4732 degrees of freedom
Multiple R-Squared: -0.5786, Adjusted R-squared: -0.5806
Wald test: 54.89 on 6 and 4732 DF, p-value: < 2.2e-16
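The explicit and shorthand instrument specifications mentioned for ivreg_controls can be compared directly. The sketch below refits both and checks that the coefficients agree. It rebuilds the dummy variables on a copy of the data so that it runs on its own; the names cd, iv_explicit and iv_short are illustrative and not part of the original project.

```r
library(AER)
data("CollegeDistance")

# Work on a copy and rebuild the 0/1 dummies used in the regressions above.
cd <- CollegeDistance
cd$is_hispanic <- as.numeric(cd$ethnicity == "hispanic")
cd$is_afam     <- as.numeric(cd$ethnicity == "afam")
cd$is_urban    <- as.numeric(cd$urban == "yes")
cd$is_female   <- as.numeric(cd$gender == "female")

# Explicit form: list every instrument (excluded and included) after the bar.
iv_explicit <- ivreg(
  log(wage) ~ education + unemp + is_hispanic + is_afam + is_female + is_urban |
    distance + unemp + is_hispanic + is_afam + is_female + is_urban,
  data = cd)

# Shorthand: start from all regressors (.), drop the endogenous one,
# and add the excluded instrument.
iv_short <- ivreg(
  log(wage) ~ education + unemp + is_hispanic + is_afam + is_female + is_urban |
    . - education + distance,
  data = cd)

all.equal(coef(iv_explicit), coef(iv_short))
```

The shorthand is less error-prone when there are many controls, because each exogenous regressor is typed only once.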
4 Homework 3
Write your own version of the 2SLS algorithm to compute the 2SLS estimates.
Your results should be the same as the output of ivreg(). Note that you only need to replicate the coefficient estimates, not the standard errors.
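As a reminder of what the two stages do, here is a minimal sketch on simulated data. It is not the TSLS() function requested in this homework. All variable names and the data-generating process are hypothetical. The sketch runs the two stages with lm() and compares the result with the closed-form 2SLS expression, which replaces the regressors by their projection onto the instruments.

```r
# Simulated data (hypothetical): x is endogenous because it shares the
# unobserved component u with the outcome y; z is a valid instrument.
set.seed(1)
n <- 1000
z <- rnorm(n)                            # instrument
w <- rnorm(n)                            # exogenous control
u <- rnorm(n)                            # unobserved confounder
x <- 0.8 * z + 0.5 * w + u + rnorm(n)    # endogenous regressor
y <- 1 + 2 * x - 1 * w + 2 * u + rnorm(n)
sim <- data.frame(y, x, w, z)

# Stage 1: regress the endogenous regressor on the instrument and the control.
stage1 <- lm(x ~ z + w, data = sim)
sim$x_hat <- fitted(stage1)

# Stage 2: regress the outcome on the fitted values and the control.
stage2 <- lm(y ~ x_hat + w, data = sim)
coef(stage2)

# Closed form: project the regressor matrix X onto the instrument matrix Z,
# then solve (X_hat' X) b = X_hat' y.
X <- cbind(1, sim$x, sim$w)
Z <- cbind(1, sim$z, sim$w)
X_hat <- Z %*% solve(crossprod(Z), crossprod(Z, X))
beta_2sls <- solve(crossprod(X_hat, X), crossprod(X_hat, sim$y))

all.equal(unname(coef(stage2)), as.vector(beta_2sls))
```

The coefficients from the second lm() call are the 2SLS estimates, but its reported standard errors are not valid: they are based on residuals computed with the fitted values rather than the original regressor. This is why only the coefficients need to match ivreg().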
Below is the boilerplate for the TSLS function.
# complete the function `TSLS()`
TSLS <- function(Y, X, W = NULL, Z, data) {
  # first stage
  fs_model <- lm(as.formula(paste(..., collapse = "+")), data = data)
  X_fitted <- ...
  # second stage
  ss_model <- lm(as.formula(paste(..., paste(..., collapse = "+"))), data = data)
  return(coefficients(...))
}
# use `TSLS()` to reproduce the estimates from Exercise 3
You should submit:
- An Rmd or Qmd file and the generated HTML file.
- In the submitted Rmd/Qmd, write the TSLS function explicitly, and show that its outputs coincide with ivreg_controls$coefficients.
ivreg_controls$coefficients
 (Intercept)    education        unemp  is_hispanic      is_afam    is_female
 1.217178737  0.067324190  0.014223415 -0.033504302 -0.027762084 -0.007610139
    is_urban
 0.006449366