Causality: IV

Case Study: Impact of education on wages

Author

Mike Aguilar | https://www.linkedin.com/in/mike-aguilar-econ/

Introduction

This exercise is based on a series of studies in labor economics that attempt to determine the returns to education.

The unit of observation is the individual worker, indexed by i.

Key Variables:

  • wage: monthly earnings in dollars (as documented for wage2 in the wooldridge package)
  • hours: number of hours worked per week
  • IQ: IQ score
  • educ: highest years of schooling completed
  • lwage: log of wages
  • age: yrs old
  • exper: yrs of experience
  • feduc: highest years of schooling completed by worker i’s father

Using this document

Throughout this code example, you’ll see several questions indicated by “Q”. Each question is followed by a solution, indicated by “A”.

The best way to learn this material is through active participation. I suggest that you attempt to formulate your answers to each question before viewing the prepared “A” answer.

Housekeeping

knitr::opts_chunk$set(echo = TRUE)
rm(list=ls()) # clear workspace
cat("\014")  # clear console
library("MatchIt")
library("marginaleffects")
library("dplyr")
library("ggplot2")
library("tidyverse")
library("sensemakr")
library("AER")

Load the data

library(wooldridge)
mydata <- wage2

EDA

summary(mydata)
      wage            hours             IQ             KWW       
 Min.   : 115.0   Min.   :20.00   Min.   : 50.0   Min.   :12.00  
 1st Qu.: 669.0   1st Qu.:40.00   1st Qu.: 92.0   1st Qu.:31.00  
 Median : 905.0   Median :40.00   Median :102.0   Median :37.00  
 Mean   : 957.9   Mean   :43.93   Mean   :101.3   Mean   :35.74  
 3rd Qu.:1160.0   3rd Qu.:48.00   3rd Qu.:112.0   3rd Qu.:41.00  
 Max.   :3078.0   Max.   :80.00   Max.   :145.0   Max.   :56.00  
                                                                 
      educ           exper           tenure            age       
 Min.   : 9.00   Min.   : 1.00   Min.   : 0.000   Min.   :28.00  
 1st Qu.:12.00   1st Qu.: 8.00   1st Qu.: 3.000   1st Qu.:30.00  
 Median :12.00   Median :11.00   Median : 7.000   Median :33.00  
 Mean   :13.47   Mean   :11.56   Mean   : 7.234   Mean   :33.08  
 3rd Qu.:16.00   3rd Qu.:15.00   3rd Qu.:11.000   3rd Qu.:36.00  
 Max.   :18.00   Max.   :23.00   Max.   :22.000   Max.   :38.00  
                                                                 
    married          black            south            urban       
 Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :1.000   Median :0.0000   Median :0.0000   Median :1.0000  
 Mean   :0.893   Mean   :0.1283   Mean   :0.3412   Mean   :0.7176  
 3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
                                                                   
      sibs           brthord           meduc           feduc      
 Min.   : 0.000   Min.   : 1.000   Min.   : 0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 8.00   1st Qu.: 8.00  
 Median : 2.000   Median : 2.000   Median :12.00   Median :10.00  
 Mean   : 2.941   Mean   : 2.277   Mean   :10.68   Mean   :10.22  
 3rd Qu.: 4.000   3rd Qu.: 3.000   3rd Qu.:12.00   3rd Qu.:12.00  
 Max.   :14.000   Max.   :10.000   Max.   :18.00   Max.   :18.00  
                  NA's   :83       NA's   :78      NA's   :194    
     lwage      
 Min.   :4.745  
 1st Qu.:6.506  
 Median :6.808  
 Mean   :6.779  
 3rd Qu.:7.056  
 Max.   :8.032  
                
cleandata<-na.omit(mydata)
summary(cleandata)
      wage            hours             IQ             KWW       
 Min.   : 115.0   Min.   :25.00   Min.   : 54.0   Min.   :13.00  
 1st Qu.: 699.0   1st Qu.:40.00   1st Qu.: 94.0   1st Qu.:32.00  
 Median : 937.0   Median :40.00   Median :104.0   Median :37.00  
 Mean   : 988.5   Mean   :44.06   Mean   :102.5   Mean   :36.19  
 3rd Qu.:1200.0   3rd Qu.:48.00   3rd Qu.:113.0   3rd Qu.:41.00  
 Max.   :3078.0   Max.   :80.00   Max.   :145.0   Max.   :56.00  
      educ           exper          tenure            age       
 Min.   : 9.00   Min.   : 1.0   Min.   : 0.000   Min.   :28.00  
 1st Qu.:12.00   1st Qu.: 8.0   1st Qu.: 3.000   1st Qu.:30.00  
 Median :13.00   Median :11.0   Median : 7.000   Median :33.00  
 Mean   :13.68   Mean   :11.4   Mean   : 7.217   Mean   :32.98  
 3rd Qu.:16.00   3rd Qu.:15.0   3rd Qu.:11.000   3rd Qu.:36.00  
 Max.   :18.00   Max.   :22.0   Max.   :22.000   Max.   :38.00  
    married           black             south            urban       
 Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :1.0000   Median :0.00000   Median :0.0000   Median :1.0000  
 Mean   :0.9005   Mean   :0.08145   Mean   :0.3228   Mean   :0.7195  
 3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
      sibs           brthord           meduc           feduc      
 Min.   : 0.000   Min.   : 1.000   Min.   : 0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 9.00   1st Qu.: 8.00  
 Median : 2.000   Median : 2.000   Median :12.00   Median :11.00  
 Mean   : 2.846   Mean   : 2.178   Mean   :10.83   Mean   :10.27  
 3rd Qu.: 4.000   3rd Qu.: 3.000   3rd Qu.:12.00   3rd Qu.:12.00  
 Max.   :14.000   Max.   :10.000   Max.   :18.00   Max.   :18.00  
     lwage      
 Min.   :4.745  
 1st Qu.:6.550  
 Median :6.843  
 Mean   :6.814  
 3rd Qu.:7.090  
 Max.   :8.032  

Wages vs Log Wages

Q: Suggest a reason why researchers who work with this type of dataset often use log wages rather than wages as the outcome of interest.

plot(cleandata$educ,cleandata$wage)

plot(cleandata$educ,cleandata$lwage)

A:

The relationship between education and wages in levels may be nonlinear. Taking the log of wages compresses large values, which tends to make the relationship with education closer to linear.

Q: Run a regression of wages on educ. Next, run a regression of log wages on educ. Contrast the meanings of these coefficients on educ.

model1<-lm(wage~educ,data = cleandata)
model2<-lm(lwage~educ,data = cleandata)
model1$coefficients[2]
    educ 
59.45181 
model2$coefficients[2]
     educ 
0.0595627 

A:

Level: A one-year increase in education is associated with a $59.45 increase in wages.

Log: A one-year increase in education is associated with an approximately 5.96% increase in wages.

The second may be seen as preferable since it does not assume constant marginal DOLLAR effects. A $59 increase on a small base is proportionally larger than the same change on a much larger base. Using log wages instead assumes constant marginal PERCENT effects, i.e. an extra year of education changes wages by the same percentage for those who earn very little as for those who earn a lot.
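The percent reading of a log-level coefficient is an approximation. As a minimal sketch, the snippet below takes the coefficient on educ reported above (0.0595627) and compares the usual approximation, 100 × b, with the exact implied percent change, 100 × (exp(b) − 1). The names b, approx_pct, and exact_pct are introduced here for illustration only.

```r
# Coefficient on educ from the log-wage regression reported above
b <- 0.0595627

# Common approximation: 100 * b percent
approx_pct <- 100 * b

# Exact percent change implied by a log-level model
exact_pct <- 100 * (exp(b) - 1)

round(c(approximate = approx_pct, exact = exact_pct), 2)
```

For coefficients this small the two readings are close; the gap widens as the coefficient grows.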

Q1

Run an OLS regression of log wages on educ.

Q:

  • What is the ATE?
  • Interpret within our case.
  • Is it statistically significant?
ols <- lm(lwage ~ educ, data=cleandata)
summary(ols)

Call:
lm(formula = lwage ~ educ, data = cleandata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.96929 -0.24539  0.03688  0.26627  1.25825 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.999465   0.094273  63.640   <2e-16 ***
educ        0.059563   0.006801   8.757   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3905 on 661 degrees of freedom
Multiple R-squared:  0.104, Adjusted R-squared:  0.1026 
F-statistic: 76.69 on 1 and 661 DF,  p-value: < 2.2e-16

A:

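  • The OLS estimate of the ATE is the coefficient on educ, 0.0596. It has a causal interpretation only if educ is unrelated to the omitted determinants of wages.
  • A one-year increase in education is associated with an approximately 5.96% increase in wages.
  • It is statistically significant, as indicated by the small p-value (< 2e-16).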
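To make the significance statement concrete, here is a small sketch that rebuilds the t statistic, the two-sided p-value, and a 95% confidence interval from the estimate and standard error printed by summary(ols) above. The inputs are copied from that output, so results can differ from R's own in the last digits because of rounding; the object names are illustrative.

```r
# Estimate and standard error for educ, copied from summary(ols) above
est <- 0.059563
se  <- 0.006801
df  <- 661  # residual degrees of freedom reported by summary(ols)

# t statistic for H0: coefficient = 0
t_stat <- est / se

# Two-sided p-value and 95% confidence interval based on the t distribution
p_val <- 2 * pt(-abs(t_stat), df)
ci    <- est + c(-1, 1) * qt(0.975, df) * se

t_stat
p_val
ci
```

With the fitted model in memory, confint(ols) gives the same interval without the rounding.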

Q2

A colleague asks you to run an IV regression of log wages on educ using father’s education as an instrument.

iv <- ivreg(lwage ~ educ | feduc, data=cleandata)
summary(iv)

Call:
ivreg(formula = lwage ~ educ | feduc, data = cleandata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.89557 -0.25696  0.04035  0.26725  1.28810 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.3993     0.2263  23.859  < 2e-16 ***
educ          0.1034     0.0165   6.268 6.62e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4026 on 661 degrees of freedom
Multiple R-Squared: 0.04756,    Adjusted R-squared: 0.04612 
Wald test: 39.28 on 1 and 661 DF,  p-value: 6.617e-10 

Q: What is the IV estimate of ATE?

A: 0.1034

Q: Interpret the IV estimate of ATE.

A:

  • A one-year increase in education is associated with an approximately 10.34% increase in wages.
  • For a policy maker: if the instrument is valid, raising education by one year can be expected to leave individuals this much better off, on average, than they otherwise would have been.

Q: Notice there is a difference between the OLS and IV estimates of ATE. What does the fact that there is a difference suggest?

A:

  • The OLS estimate likely suffers from omitted variable bias (OVB).
  • There are likely factors other than education that influence earnings (e.g. ability, determination, family connections).
  • Importantly, whatever the omitted variable is, it must be related to education; otherwise omitting it would not bias the OLS coefficient on educ. Here the OLS estimate (0.0596) is smaller than the IV estimate (0.1034), i.e. OLS appears attenuated.
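The logic of OVB can be illustrated with a small simulation. This is a sketch on made-up data, not on wage2: every parameter value below is an assumption chosen for illustration, and u stands in for a hypothetical unobserved confounder. The confounder is built to raise x and lower y, so OLS is biased downward, which mirrors the direction (not the magnitude) of the gap seen in the case study.

```r
set.seed(123)
n <- 100000

z <- rnorm(n)  # instrument: affects x, unrelated to u
u <- rnorm(n)  # unobserved confounder (hypothetical)
x <- 0.5 * z + 0.8 * u + rnorm(n)               # treatment
y <- 0.10 * x - 0.10 * u + rnorm(n, sd = 0.4)   # outcome; true effect of x is 0.10

# OLS that omits u: biased because u is related to both x and y
b_ols <- unname(coef(lm(y ~ x))["x"])

# Simple IV (Wald) estimator: cov(z, y) / cov(z, x)
b_iv <- cov(z, y) / cov(z, x)

# OLS that includes u (infeasible in practice, since u is unobserved)
b_oracle <- unname(coef(lm(y ~ x + u))["x"])

round(c(ols = b_ols, iv = b_iv, oracle = b_oracle), 3)
```

Under these assumptions the OLS estimate should land noticeably below 0.10, while the IV and oracle estimates should be close to it. If u were unrelated to x, omitting it would leave the OLS coefficient on x unbiased.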

Q: What are the conditions for justifying a father’s education as a valid instrument?

A:

  • Relevance: Father’s education must influence the child’s education (e.g. a highly educated father might instill the value of education in a child).

  • Exclusion: Father’s education cannot influence earnings directly; it is not related to the child’s earnings after controlling for the child’s education.

  • Independence: Father’s education cannot be related to whatever is omitted (e.g. natural ability).

Q: Do you think a father’s education is a valid instrument?

A:

  • Relevance: The relevance condition seems logical.

  • Exclusion: However, the father’s education might be a proxy for success, and thereby offer opportunities to the child that could influence earnings directly (e.g. a family friend could offer the child a job), so the exogeneity/exclusion assumption might be in question.

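  • Independence: Moreover, if natural ability is partly hereditary, father’s education would be related to the child’s unobserved ability, which would violate independence.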

Q3

Recreate the IV estimator via the first stage and reduced form.

iv.FirstStage <- lm(educ ~ feduc, data=cleandata)

iv.ReducedForm <- lm(lwage ~ feduc, data=cleandata)

iv.manual = iv.ReducedForm$coefficients[2]/iv.FirstStage$coefficients[2]
iv.manual
    feduc 
0.1034326 
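The ratio of the reduced-form slope to the first-stage slope is one of several equivalent ways to write the simple IV estimator. As a sketch, the block below computes it two other ways on the same data: as a ratio of covariances, cov(feduc, lwage) / cov(feduc, educ), and as two-stage least squares done by hand. It assumes the wooldridge package is installed; the object names iv_cov, educ_hat, and iv_2sls are introduced here for illustration.

```r
library(wooldridge)  # provides the wage2 data used above

cleandata <- na.omit(wage2)

# Covariance (Wald) form of the simple IV estimator
iv_cov <- with(cleandata, cov(feduc, lwage) / cov(feduc, educ))

# Two-stage least squares by hand
first_stage <- lm(educ ~ feduc, data = cleandata)
cleandata$educ_hat <- fitted(first_stage)   # part of educ predicted by feduc
second_stage <- lm(lwage ~ educ_hat, data = cleandata)
iv_2sls <- unname(coef(second_stage)["educ_hat"])

c(covariance_ratio = iv_cov, two_stage = iv_2sls)
```

Both numbers should match iv.manual. The standard errors printed by summary(second_stage) are not valid for inference, because they ignore that educ_hat is itself estimated; ivreg handles that correction.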

Q4

Q: Compute the correlation between feduc and educ. What does that imply about the validity of our IV estimates?

cor(cleandata$educ, cleandata$feduc)
[1] 0.424906

A:

  • The correlation is 0.42.
  • This seems large enough to suggest that the instrument feduc is relevant. Correlation speaks only to relevance; it says nothing about exclusion or independence.
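With a single instrument and no covariates, the correlation maps directly into the first-stage fit: the first-stage R-squared is r², and the first-stage F statistic is r²(n − 2) / (1 − r²). The sketch below applies that arithmetic to the correlation reported above. The sample size n = 663 is inferred from the 661 residual degrees of freedom plus two estimated coefficients.

```r
r <- 0.424906  # cor(educ, feduc) reported above
n <- 663       # observations in cleandata (661 residual df + 2 coefficients)

r2 <- r^2                            # first-stage R-squared
f_stat <- r2 * (n - 2) / (1 - r2)    # F statistic for the single instrument

c(first_stage_R2 = r2, first_stage_F = f_stat)
```

The resulting F should be close to the weak-instruments statistic that summary(iv, diagnostics = TRUE) reports for this model.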

Q: Interpret the Weak Instruments test for the IV estimator

summary(iv, diagnostics = TRUE)

Call:
ivreg(formula = lwage ~ educ | feduc, data = cleandata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.89557 -0.25696  0.04035  0.26725  1.28810 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.3993     0.2263  23.859  < 2e-16 ***
educ          0.1034     0.0165   6.268 6.62e-10 ***

Diagnostic tests:
                 df1 df2 statistic p-value    
Weak instruments   1 661   145.634 < 2e-16 ***
Wu-Hausman         1 660     9.281 0.00241 ** 
Sargan             0  NA        NA      NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4026 on 661 degrees of freedom
Multiple R-Squared: 0.04756,    Adjusted R-squared: 0.04612 
Wald test: 39.28 on 1 and 661 DF,  p-value: 6.617e-10 

A:

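  • The weak instruments test: H0: the instrument is weak vs Ha: the instrument is strong.

  • The first-stage F statistic is 145.6 and the p-value is tiny, so we reject the null.

  • We have confidence that the instrument is relevant, in that it is sufficiently related to educ. This test does not address the exclusion or independence conditions.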
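The weak-instruments statistic is an F test on the excluded instrument in the first stage. As a sketch, assuming the wooldridge package is installed, it can be reproduced by comparing the first stage with and without feduc. The threshold of 10 used below is a common rule of thumb, not a formal cutoff, and the object names are illustrative.

```r
library(wooldridge)

cleandata <- na.omit(wage2)

# First stage with and without the instrument
fs_restricted   <- lm(educ ~ 1, data = cleandata)
fs_unrestricted <- lm(educ ~ feduc, data = cleandata)

# F test of the excluded instrument
fs_test <- anova(fs_restricted, fs_unrestricted)
fs_F <- fs_test$F[2]

fs_F
fs_F > 10  # common rule-of-thumb threshold for instrument strength
```

Writing the test as a comparison of nested models carries over to specifications with covariates, where the restricted model keeps the covariates and drops only the instrument.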

Q: Interpret the Wu-Hausman test

A:

  • Wu-Hausman test for endogeneity of educ: H0: educ is exogenous vs Ha: educ is endogenous

  • We reject the null (p = 0.0024), suggesting that we do indeed have an endogeneity problem, consistent with OVB.

  • Supports our use of IV but does not explicitly test any of the criteria for the validity of our instrument.
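One way to see what the Wu-Hausman test does is the regression-based (control function) version: add the first-stage residuals to the outcome regression and test whether their coefficient is zero. The block below is a sketch of that idea, assuming the wooldridge package is installed; v_hat, cf, and hausman_F are names introduced here. With one endogenous regressor, the squared t statistic on v_hat should line up with the Wu-Hausman statistic reported by the diagnostics.

```r
library(wooldridge)

cleandata <- na.omit(wage2)

# Step 1: first stage; keep the residuals (the part of educ not explained by feduc)
first_stage <- lm(educ ~ feduc, data = cleandata)
cleandata$v_hat <- resid(first_stage)

# Step 2: add the first-stage residuals to the outcome regression
cf <- lm(lwage ~ educ + v_hat, data = cleandata)
cf_coefs <- summary(cf)$coefficients

# t test on v_hat is a test of H0: educ is exogenous; its square is an F statistic
hausman_F <- unname(cf_coefs["v_hat", "t value"]^2)
hausman_p <- unname(cf_coefs["v_hat", "Pr(>|t|)"])
educ_cf   <- unname(cf_coefs["educ", "Estimate"])

c(F = hausman_F, p = hausman_p, educ = educ_cf)
```

The coefficient on educ in this regression equals the IV estimate. Like the test it mimics, the result is only meaningful if the instrument itself is valid.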

Q: Run an OLS of lwage on feduc. Why doesn’t the significant coefficient on feduc invalidate its use as an instrument?

A:

summary(iv.ReducedForm)

Call:
lm(formula = lwage ~ feduc, data = cleandata)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.09105 -0.23850  0.02758  0.27645  1.16701 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.507909   0.051080 127.405  < 2e-16 ***
feduc       0.029825   0.004736   6.297 5.52e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4007 on 661 degrees of freedom
Multiple R-squared:  0.0566,    Adjusted R-squared:  0.05517 
F-statistic: 39.66 on 1 and 661 DF,  p-value: 5.517e-10

  • This is the reduced form we saw earlier.
  • The key to the exclusion restriction is that the instrument is not related to the outcome AFTER controlling for the treatment variable.
  • In our case, feduc and educ are related, so the reduced-form coefficient picks up the indirect effect of feduc on wages via educ. It would also pick up any direct effect of feduc on wages, but the reduced form alone cannot separate the two.
  • If feduc were not related to wages at all, the IV estimate (reduced form divided by first stage) would be zero. A significant reduced form is what we expect when the instrument is relevant and education affects wages.

Q5

Expand the IV by controlling for hours, IQ, and exper. What is the new ATE? What does the change in the ATE suggest about the new covariates?

iv2<-ivreg(lwage ~ educ + hours + IQ + exper | feduc + hours + IQ + exper, data = cleandata)
summary(iv2, diagnostics = TRUE)

Call:
ivreg(formula = lwage ~ educ + hours + IQ + exper | feduc + hours + 
    IQ + exper, data = cleandata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.66289 -0.25809  0.01406  0.27684  1.35564 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.632473   0.341076  13.582  < 2e-16 ***
educ         0.159562   0.036883   4.326 1.75e-05 ***
hours       -0.006016   0.002277  -2.642  0.00845 ** 
IQ          -0.001741   0.002828  -0.615  0.53846    
exper        0.038821   0.007639   5.082 4.87e-07 ***

Diagnostic tests:
                 df1 df2 statistic  p-value    
Weak instruments   1 658     44.96 4.34e-11 ***
Wu-Hausman         1 657      9.46  0.00219 ** 
Sargan             0  NA        NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4121 on 658 degrees of freedom
Multiple R-Squared: 0.006749,   Adjusted R-squared: 0.000711 
Wald test: 23.63 on 4 and 658 DF,  p-value: < 2.2e-16 

A:

  • The new ATE is 0.1596.
  • This is quite a bit larger than the previous estimate of 0.1034.
  • The change implies that the new covariates were not only relevant to wages but also related to education.
  • In the previous implementation, these confounders were omitted variables.
  • By explicitly controlling for these channels of confounding we reduce the OVB in the ATE estimate. The cost is precision: the standard error on educ rises from 0.0165 to 0.0369, and the first-stage F falls from 145.6 to 45.0, although it remains well above the usual rule-of-thumb threshold of 10.

# Pairwise correlations among the outcome, the treatment, and the new controls
temp <- cleandata %>%
  dplyr::select(lwage, educ, hours, IQ, exper)
cor(temp)
            lwage        educ       hours         IQ       exper
lwage  1.00000000  0.32243197 -0.07356263  0.3102242  0.03618707
educ   0.32243197  1.00000000  0.08397036  0.5434635 -0.45082581
hours -0.07356263  0.08397036  1.00000000  0.0410471 -0.09736714
IQ     0.31022422  0.54346354  0.04104710  1.0000000 -0.23162567
exper  0.03618707 -0.45082581 -0.09736714 -0.2316257  1.00000000
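To see how the covariates enter the expanded IV in Q5, the estimate can be rebuilt by hand. In the sketch below (assuming the wooldridge package is installed; object names are illustrative), the exogenous controls appear in both stages, and only feduc is excluded from the second stage. The block also computes the partial F statistic for feduc in the first stage, which is what the weak-instruments diagnostic reports once controls are present.

```r
library(wooldridge)

cleandata <- na.omit(wage2)

# First stage: the exogenous controls appear alongside the instrument
fs <- lm(educ ~ feduc + hours + IQ + exper, data = cleandata)
cleandata$educ_hat <- fitted(fs)

# Second stage: replace educ with its first-stage fitted values
ss <- lm(lwage ~ educ_hat + hours + IQ + exper, data = cleandata)
iv2_manual <- unname(coef(ss)["educ_hat"])

# Partial F for the instrument, holding the controls fixed
fs_no_iv <- lm(educ ~ hours + IQ + exper, data = cleandata)
partial_F <- anova(fs_no_iv, fs)$F[2]

c(educ_2sls = iv2_manual, first_stage_partial_F = partial_F)
```

The point estimate should match the educ coefficient from iv2. As with any hand-rolled second stage, the standard errors from summary(ss) should not be used for inference.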