Labor Project 2 — IV regression

1 Estimating the economic returns to schooling

Card (1993) investigates the economic returns to schooling using “distance to college” as an instrumental variable.

We load the AER package, which contains survey data on high school graduates, with variables for wages, education, average tuition, distance to college, and some other socio-economic measures.

library(AER)
data("CollegeDistance") # load the dataset

You can inspect the structure of the dataset by running str(CollegeDistance), and read the variable descriptions with ?CollegeDistance.

names(CollegeDistance)
 [1] "gender"    "ethnicity" "score"     "fcollege"  "mcollege"  "home"     
 [7] "urban"     "unemp"     "wage"      "distance"  "tuition"   "education"
[13] "income"    "region"   
dim(CollegeDistance)
[1] 4739   14
head(CollegeDistance)
  gender ethnicity score fcollege mcollege home urban unemp wage distance
1   male     other 39.15      yes       no  yes   yes   6.2 8.09      0.2
2 female     other 48.87       no       no  yes   yes   6.2 8.09      0.2
3   male     other 48.74       no       no  yes   yes   6.2 8.09      0.2
4   male      afam 40.40       no       no  yes   yes   6.2 8.09      0.2
5 female     other 40.48       no       no   no   yes   5.6 8.09      0.4
6   male     other 54.71       no       no  yes   yes   5.6 8.09      0.4
  tuition education income region
1 0.88915        12   high  other
2 0.88915        12    low  other
3 0.88915        12    low  other
4 0.88915        12    low  other
5 0.88915        13    low  other
6 0.88915        12    low  other
summary(CollegeDistance)
    gender        ethnicity        score       fcollege   mcollege    home     
 male  :2139   other   :3050   Min.   :28.95   no :3753   no :4088   no : 852  
 female:2600   afam    : 786   1st Qu.:43.92   yes: 986   yes: 651   yes:3887  
               hispanic: 903   Median :51.19                                   
                               Mean   :50.89                                   
                               3rd Qu.:57.77                                   
                               Max.   :72.81                                   
 urban          unemp             wage           distance         tuition      
 no :3635   Min.   : 1.400   Min.   : 6.590   Min.   : 0.000   Min.   :0.2575  
 yes:1104   1st Qu.: 5.900   1st Qu.: 8.850   1st Qu.: 0.400   1st Qu.:0.4850  
            Median : 7.100   Median : 9.680   Median : 1.000   Median :0.8245  
            Mean   : 7.597   Mean   : 9.501   Mean   : 1.803   Mean   :0.8146  
            3rd Qu.: 8.900   3rd Qu.:10.150   3rd Qu.: 2.500   3rd Qu.:1.1270  
            Max.   :24.900   Max.   :12.960   Max.   :20.000   Max.   :1.4042  
   education      income       region    
 Min.   :12.00   low :3374   other:3796  
 1st Qu.:12.00   high:1365   west : 943  
 Median :13.00                           
 Mean   :13.81                           
 3rd Qu.:16.00                           
 Max.   :18.00                           

2 Linear regressions

2.1 Regress the log wage on education

lm_no_control = lm(log(wage) ~ education, data = CollegeDistance)
summary(lm_no_control)

Call:
lm(formula = log(wage) ~ education, data = CollegeDistance)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.36208 -0.06271  0.03255  0.07794  0.32436 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.213220   0.016228 136.379   <2e-16 ***
education   0.002024   0.001166   1.737   0.0825 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1435 on 4737 degrees of freedom
Multiple R-squared:  0.0006364, Adjusted R-squared:  0.0004254 
F-statistic: 3.016 on 1 and 4737 DF,  p-value: 0.08249

2.2 Regress the log of wage on education and controls

Controls include unemp, hispanic, afam, female and urban.

Note that all controls except unemp are dummy variables.

CollegeDistance$is_hispanic = 
    ifelse(
        CollegeDistance$ethnicity == "hispanic", 
        1, 
        0)
CollegeDistance$is_afam = 
    ifelse(
        CollegeDistance$ethnicity == "afam", 
        1, 
        0)
CollegeDistance$is_urban = 
    ifelse(
        CollegeDistance$urban == "yes", 
        1, 
        0)
CollegeDistance$is_female = 
    ifelse(
        CollegeDistance$gender == "female", 
        1, 
        0)
lm_controls = lm(log(wage) ~ education + unemp + is_hispanic + is_afam + is_female + is_urban, data = CollegeDistance)
summary(lm_controls)

Call:
lm(formula = log(wage) ~ education + unemp + is_hispanic + is_afam + 
    is_female + is_urban, data = CollegeDistance)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.39998 -0.08223  0.02833  0.09486  0.37945 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.1519999  0.0168512 127.706   <2e-16 ***
education    0.0006723  0.0011121   0.605   0.5455    
unemp        0.0135938  0.0007203  18.874   <2e-16 ***
is_hispanic -0.0535204  0.0052237 -10.246   <2e-16 ***
is_afam     -0.0619139  0.0055990 -11.058   <2e-16 ***
is_female   -0.0091150  0.0039785  -2.291   0.0220 *  
is_urban     0.0089393  0.0048005   1.862   0.0626 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1361 on 4732 degrees of freedom
Multiple R-squared:  0.1026,    Adjusted R-squared:  0.1015 
F-statistic:  90.2 on 6 and 4732 DF,  p-value: < 2.2e-16
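
Because the dependent variable is log(wage), the coefficient on education is read as an approximate proportional change in wage per additional year of schooling. A small sketch of the conversion to percent, using the two OLS estimates printed above (the helper name pct_effect is hypothetical and introduced only for illustration):

```r
# Convert a log-wage coefficient into a percentage change in wage.
# `pct_effect` is an illustrative helper, not part of AER.
pct_effect <- function(b) 100 * (exp(b) - 1)

b_no_control <- 0.002024    # education coefficient from lm_no_control
b_controls   <- 0.0006723   # education coefficient from lm_controls

pct_effect(b_no_control)
pct_effect(b_controls)
```

For coefficients this small, the exact percentage is almost identical to 100 * b, so both OLS estimates imply a return of well under 1% per year of schooling.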

3 IV regression

We employ distance as an instrument for education in both regressions using ivreg().

We first check whether the instrument distance satisfies the relevance assumption, i.e. that it is correlated with education:

cor(CollegeDistance$distance, 
    CollegeDistance$education)  
[1] -0.09318309
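
A simple correlation only hints at relevance. A common complementary check is the first-stage regression of education on distance, whose F-statistic is often compared against the rule-of-thumb threshold of 10. The sketch below assumes the AER package is installed; the object names first_stage and fs_F are chosen here for illustration.

```r
library(AER)
data("CollegeDistance")

# First stage without controls: regress the endogenous regressor on the instrument
first_stage <- lm(education ~ distance, data = CollegeDistance)

# Overall F-statistic of the first stage (with a single instrument this equals
# the squared t-statistic on distance)
fs_F <- summary(first_stage)$fstatistic["value"]
fs_F

# With one regressor, F can also be recovered from the sample correlation r:
# F = r^2 * (n - 2) / (1 - r^2)
r <- cor(CollegeDistance$distance, CollegeDistance$education)
n <- nrow(CollegeDistance)
r^2 * (n - 2) / (1 - r^2)
```

The two numbers should coincide, which shows that in the single-instrument case the first-stage F-statistic carries the same information as the correlation, only rescaled by the sample size.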
ivreg_no_control = ivreg(log(wage) ~ education | distance,
      data = CollegeDistance)
summary(ivreg_no_control)

Call:
ivreg(formula = log(wage) ~ education | distance, data = CollegeDistance)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.36022 -0.06094  0.03149  0.07747  0.32330 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.221281   0.172731  12.860   <2e-16 ***
education   0.001441   0.012509   0.115    0.908    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1435 on 4737 degrees of freedom
Multiple R-Squared: 0.0005835,  Adjusted R-squared: 0.0003725 
Wald test: 0.01326 on 1 and 4737 DF,  p-value: 0.9083 
ivreg_controls = 
    ivreg(log(wage) ~ education + unemp + is_hispanic + is_afam + is_female + is_urban | distance + unemp + is_hispanic + is_afam + is_female + is_urban,
    data = CollegeDistance)
  • Here the formula is written out explicitly: log(wage) ~ education + unemp + is_hispanic + is_afam + is_female + is_urban | distance + unemp + is_hispanic + is_afam + is_female + is_urban. An equivalent shorthand is log(wage) ~ education + unemp + is_hispanic + is_afam + is_female + is_urban | . - education + distance.
summary(ivreg_controls)

Call:
ivreg(formula = log(wage) ~ education + unemp + is_hispanic + 
    is_afam + is_female + is_urban | distance + unemp + is_hispanic + 
    is_afam + is_female + is_urban, data = CollegeDistance)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.5885016 -0.1191974 -0.0001799  0.1452146  0.4576460 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.2171787  0.2018797   6.029 1.77e-09 ***
education    0.0673242  0.0143812   4.681 2.93e-06 ***
unemp        0.0142234  0.0009648  14.743  < 2e-16 ***
is_hispanic -0.0335043  0.0081520  -4.110 4.02e-05 ***
is_afam     -0.0277621  0.0104342  -2.661  0.00782 ** 
is_female   -0.0076101  0.0052865  -1.440  0.15007    
is_urban     0.0064494  0.0063892   1.009  0.31283    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1805 on 4732 degrees of freedom
Multiple R-Squared: -0.5786,    Adjusted R-squared: -0.5806 
Wald test: 54.89 on 6 and 4732 DF,  p-value: < 2.2e-16 
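
The shorthand formula mentioned above can be combined with R's automatic dummy coding of factors, so the hand-made is_* indicators are not strictly needed. The following is a sketch under the assumption that AER is installed; ivreg_short is a name introduced here, and the coefficient labels (for example genderfemale) follow R's default treatment contrasts rather than the is_* names used above.

```r
library(AER)
data("CollegeDistance")

# `. - education + distance` means: take all regressors from the left part,
# drop the endogenous one (education), and add the instrument (distance)
ivreg_short <- ivreg(
    log(wage) ~ education + unemp + ethnicity + gender + urban |
        . - education + distance,
    data = CollegeDistance)

coef(ivreg_short)["education"]

# diagnostics = TRUE adds weak-instrument and Wu-Hausman tests to the summary
summary(ivreg_short, diagnostics = TRUE)
```

Since the factors span the same dummy columns as the is_* variables, the education coefficient should match the one reported for ivreg_controls.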

4 Homework 3

Write your own version of the 2SLS algorithm to compute the 2SLS estimates.

Your results should be the same as the output of ivreg(). Note that you only need to replicate the coefficient estimates, not the standard errors.
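
To make the two-stage logic concrete, here is a minimal sketch on simulated data (not the CollegeDistance data). All variable names and the data-generating process are assumptions made for illustration. It runs the two stages with lm() and compares the result with the closed-form estimator (X'P_Z X)^{-1} X'P_Z y, where P_Z is the projection onto the instrument and the exogenous controls.

```r
set.seed(1)
n <- 500
z <- rnorm(n)                            # instrument
w <- rnorm(n)                            # exogenous control
u <- rnorm(n)                            # unobserved confounder
x <- 0.8 * z + 0.5 * w + u + rnorm(n)    # endogenous regressor
y <- 1 + 2 * x - 1 * w + u + rnorm(n)    # outcome; true effect of x is 2
sim <- data.frame(y, x, w, z)

# Stage 1: regress the endogenous regressor on instrument and control
stage1 <- lm(x ~ z + w, data = sim)
sim$x_hat <- fitted(stage1)

# Stage 2: regress the outcome on the fitted values and the control
stage2 <- lm(y ~ x_hat + w, data = sim)
b_two_step <- unname(coef(stage2))

# Closed form: solve (X' P_Z X) b = X' P_Z y, using P_Z X = first-stage fitted values
X  <- cbind(1, sim$x, sim$w)
Zm <- cbind(1, sim$z, sim$w)
PX <- Zm %*% solve(crossprod(Zm), crossprod(Zm, X))   # P_Z X
b_closed <- as.vector(solve(crossprod(PX, X), crossprod(PX, sim$y)))

cbind(b_two_step, b_closed)
```

The coefficients agree, but the standard errors reported by summary(stage2) are not the 2SLS standard errors, because the second-stage residuals are computed from x_hat rather than x. This is why only the coefficient estimates need to be replicated.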

Below is the boilerplate for the TSLS function.

# complete the function `TSLS()`
TSLS <- function(Y, X, W = NULL, Z, data) {
    # first stage: regress the endogenous regressor on the instrument(s) and controls
    fs_model <- lm(as.formula(paste(..., collapse = "+")), data = data)
    X_fitted <- ...

    # second stage: regress the outcome on the fitted values and the controls
    ss_model <- lm(as.formula(paste(..., paste(..., collapse = "+"))), data = data)

    return(coefficients(...))
}

# use `TSLS()` to reproduce the estimates from Exercise 3
Important

You should submit:

  1. An Rmd or Qmd file and the generated HTML file.

  2. In the submitted Rmd/Qmd, write the TSLS function explicitly, and show that its outputs coincide with ivreg_controls$coefficients.

ivreg_controls$coefficients
 (Intercept)    education        unemp  is_hispanic      is_afam    is_female 
 1.217178737  0.067324190  0.014223415 -0.033504302 -0.027762084 -0.007610139 
    is_urban 
 0.006449366