Mike Aguilar | https://www.linkedin.com/in/mike-aguilar-econ/
Introduction
This exercise is based on a series of studies in labor economics that attempt to determine the returns to education.
The unit of observation is worker (i)
Key Variables:
wage: $ earned per week
hours: number of hours worked per week
IQ: IQ score
educ: highest years of schooling completed
lwage: log of wages
age: yrs old
exper: yrs of experience
feduc: highest years of schooling completed by worker i’s father
Using this document
Throughout this code example, you’ll see several questions indicated by “Q”. Each question is followed by a solution, indicated by “A”.
The best way to learn this material is through active participation. I suggest that you attempt to formulate your answers to each question before viewing the prepared “A” answer.
library("tidyverse")
library("sensemakr")
See details in:
Carlos Cinelli and Chad Hazlett (2020). Making Sense of Sensitivity: Extending Omitted Variable Bias. Journal of the Royal Statistical Society, Series B (Statistical Methodology).
library("AER")
Load the data
library(wooldridge)
mydata <- wage2
EDA
summary(mydata)
wage             hours           IQ              KWW
Min.   : 115.0   Min.   :20.00   Min.   : 50.0   Min.   :12.00
1st Qu.: 669.0   1st Qu.:40.00   1st Qu.: 92.0   1st Qu.:31.00
Median : 905.0   Median :40.00   Median :102.0   Median :37.00
Mean   : 957.9   Mean   :43.93   Mean   :101.3   Mean   :35.74
3rd Qu.:1160.0   3rd Qu.:48.00   3rd Qu.:112.0   3rd Qu.:41.00
Max.   :3078.0   Max.   :80.00   Max.   :145.0   Max.   :56.00
educ            exper           tenure           age
Min.   : 9.00   Min.   : 1.00   Min.   : 0.000   Min.   :28.00
1st Qu.:12.00   1st Qu.: 8.00   1st Qu.: 3.000   1st Qu.:30.00
Median :12.00   Median :11.00   Median : 7.000   Median :33.00
Mean   :13.47   Mean   :11.56   Mean   : 7.234   Mean   :33.08
3rd Qu.:16.00   3rd Qu.:15.00   3rd Qu.:11.000   3rd Qu.:36.00
Max.   :18.00   Max.   :23.00   Max.   :22.000   Max.   :38.00
married         black            south            urban
Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000
Median :1.000   Median :0.0000   Median :0.0000   Median :1.0000
Mean   :0.893   Mean   :0.1283   Mean   :0.3412   Mean   :0.7176
3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000
Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
sibs             brthord          meduc           feduc
Min.   : 0.000   Min.   : 1.000   Min.   : 0.00   Min.   : 0.00
1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 8.00   1st Qu.: 8.00
Median : 2.000   Median : 2.000   Median :12.00   Median :10.00
Mean   : 2.941   Mean   : 2.277   Mean   :10.68   Mean   :10.22
3rd Qu.: 4.000   3rd Qu.: 3.000   3rd Qu.:12.00   3rd Qu.:12.00
Max.   :14.000   Max.   :10.000   Max.   :18.00   Max.   :18.00
                 NA's   :83       NA's   :78      NA's   :194
lwage
Min.   :4.745
1st Qu.:6.506
Median :6.808
Mean   :6.779
3rd Qu.:7.056
Max.   :8.032
cleandata <- na.omit(mydata)
summary(cleandata)
wage             hours           IQ              KWW
Min.   : 115.0   Min.   :25.00   Min.   : 54.0   Min.   :13.00
1st Qu.: 699.0   1st Qu.:40.00   1st Qu.: 94.0   1st Qu.:32.00
Median : 937.0   Median :40.00   Median :104.0   Median :37.00
Mean   : 988.5   Mean   :44.06   Mean   :102.5   Mean   :36.19
3rd Qu.:1200.0   3rd Qu.:48.00   3rd Qu.:113.0   3rd Qu.:41.00
Max.   :3078.0   Max.   :80.00   Max.   :145.0   Max.   :56.00
educ            exper          tenure           age
Min.   : 9.00   Min.   : 1.0   Min.   : 0.000   Min.   :28.00
1st Qu.:12.00   1st Qu.: 8.0   1st Qu.: 3.000   1st Qu.:30.00
Median :13.00   Median :11.0   Median : 7.000   Median :33.00
Mean   :13.68   Mean   :11.4   Mean   : 7.217   Mean   :32.98
3rd Qu.:16.00   3rd Qu.:15.0   3rd Qu.:11.000   3rd Qu.:36.00
Max.   :18.00   Max.   :22.0   Max.   :22.000   Max.   :38.00
married          black             south            urban
Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000
1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000
Median :1.0000   Median :0.00000   Median :0.0000   Median :1.0000
Mean   :0.9005   Mean   :0.08145   Mean   :0.3228   Mean   :0.7195
3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000
Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000
sibs             brthord          meduc           feduc
Min.   : 0.000   Min.   : 1.000   Min.   : 0.00   Min.   : 0.00
1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 9.00   1st Qu.: 8.00
Median : 2.000   Median : 2.000   Median :12.00   Median :11.00
Mean   : 2.846   Mean   : 2.178   Mean   :10.83   Mean   :10.27
3rd Qu.: 4.000   3rd Qu.: 3.000   3rd Qu.:12.00   3rd Qu.:12.00
Max.   :14.000   Max.   :10.000   Max.   :18.00   Max.   :18.00
lwage
Min.   :4.745
1st Qu.:6.550
Median :6.843
Mean   :6.814
3rd Qu.:7.090
Max.   :8.032
Wages vs Log Wages
Q: Surmise a reason why the authors who usually explore this type of dataset often use log wages rather than wages as their object of interest.
plot(cleandata$educ,cleandata$wage)
plot(cleandata$educ,cleandata$lwage)
A:
The relationship between education and wages appears to be nonlinear. Taking the log of wages tends to make the relationship closer to linear.
Q: Run a regression of wages on educ. Next, run a regression of log wages on educ. Contrast the meanings of these coefficients on educ.
Level: A one-unit increase in years of education is associated with a $59.45 increase in wages per week.
The second may be seen as preferable since it does not assume constant marginal DOLLAR effects. A $59 increase on a small base matters more than the same increase on a much larger base. Using log wages instead assumes constant marginal PERCENT effects, i.e. an extra year of education changes wages by the same percentage for those who earn very little as for those who earn a lot.
Call:
lm(formula = lwage ~ educ, data = cleandata)
Residuals:
Min 1Q Median 3Q Max
-1.96929 -0.24539 0.03688 0.26627 1.25825
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.999465 0.094273 63.640 <2e-16 ***
educ 0.059563 0.006801 8.757 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3905 on 661 degrees of freedom
Multiple R-squared: 0.104, Adjusted R-squared: 0.1026
F-statistic: 76.69 on 1 and 661 DF, p-value: < 2.2e-16
A:
The OLS estimate of the ATE is the coefficient on educ (0.0596)
It is statistically significant, as indicated by the small p-value
A one-unit increase in years of education is associated with an approximately 5.96% increase in wages per week.
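The percent interpretation above relies on the approximation that a log-level coefficient times 100 is a percent change. A minimal sketch of the exact conversion follows, using only the coefficient reported in the regression output above; the variable names are illustrative and not part of the original code.

```r
# Coefficient on educ from the log-wage regression output above
b_educ <- 0.059563

# Common approximation: 100 * beta percent per extra year of schooling
approx_pct <- 100 * b_educ

# Exact percent change implied by a log-level model
exact_pct <- 100 * (exp(b_educ) - 1)

round(c(approx = approx_pct, exact = exact_pct), 2)
```

The two figures are close when the coefficient is small, and the gap widens as the coefficient grows. For the IV coefficient of 0.1034 reported later, the exact figure is about 10.9% rather than 10.34%.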
Q2
A colleague asks you to run an IV regression of log wages on educ using father’s education as an instrument.
iv <- ivreg(lwage ~ educ | feduc, data = cleandata)
summary(iv)
Call:
ivreg(formula = lwage ~ educ | feduc, data = cleandata)
Residuals:
Min 1Q Median 3Q Max
-1.89557 -0.25696 0.04035 0.26725 1.28810
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.3993 0.2263 23.859 < 2e-16 ***
educ 0.1034 0.0165 6.268 6.62e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4026 on 661 degrees of freedom
Multiple R-Squared: 0.04756, Adjusted R-squared: 0.04612
Wald test: 39.28 on 1 and 661 DF, p-value: 6.617e-10
Q: What is the IV estimate of ATE?
A: .1034
Q: Interpret the IV estimate of ATE.
A:
A one-unit increase in years of education is associated with an approximately 10.34% increase in wages per week.
A policymaker who can increase education by one year can expect individuals to earn about this much more, on average, than they otherwise would have.
Q: Notice there is a difference between the OLS and IV estimates of ATE. What does the fact that there is a difference suggest?
A:
It suggests that the OLS estimate likely suffers from omitted variable bias (OVB).
There are likely factors other than education that influence earnings (e.g. ability, determination, family connections, etc.).
Importantly, whatever that omitted variable might be, it must be related to education; otherwise the OLS estimate of the marginal effect on earnings would not be attenuated relative to the IV estimate.
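The mechanics of omitted variable bias can be sketched on simulated data. Everything below is an assumption made for illustration: the data are generated, not drawn from wage2, and the names `ability`, `gamma`, and `delta` are hypothetical. The sketch shows that the coefficient from the regression that omits the confounder equals the coefficient from the full regression plus the product of two terms: the confounder's effect on the outcome and its slope on education.

```r
set.seed(42)
n <- 5000

# Simulated data (NOT the wage2 sample): ability is the omitted confounder
ability <- rnorm(n)
educ    <- 12 + 1.5 * ability + rnorm(n)
lwage   <- 5 + 0.08 * educ + 0.10 * ability + rnorm(n, sd = 0.3)

# "Long" regression controls for ability; "short" regression omits it
long  <- lm(lwage ~ educ + ability)
short <- lm(lwage ~ educ)

# Auxiliary regression: how the omitted variable moves with educ
aux <- lm(ability ~ educ)

b_long  <- unname(coef(long)["educ"])
b_short <- unname(coef(short)["educ"])
gamma   <- unname(coef(long)["ability"])  # effect of ability on lwage
delta   <- unname(coef(aux)["educ"])      # slope of ability on educ

# OVB identity: short = long + gamma * delta (holds exactly in-sample)
c(short = b_short, long = b_long, bias = gamma * delta)
```

If either `gamma` or `delta` were zero, the bias term would vanish, which is why the omitted variable must be related to both earnings and education. In this simulation the bias is upward by construction. In the wage2 results above, OLS lies below IV, so the sketch illustrates the mechanism only, not the direction of the bias in the real data.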
Q: What are the conditions for justifying a father’s education as a valid instrument?
A:
Relevance: It is believed that Father’s Education might influence the child’s education (e.g. a highly educated father might instill the value of education in a child)
Exclusion: Father’s education cannot influence earnings directly (father’s education is not related to the child’s earnings after controlling for the child’s education)
Independence: The Father’s education cannot be related to whatever is omitted (e.g. natural ability)
Q: Do you think a father’s education is a valid instrument?
A:
Relevance: The relevance condition seems logical.
Exclusion: However, the father’s education might be a proxy for success, and thereby offer opportunities to the child that could influence earnings directly (e.g. a family friend could offer the child a job), so the exogeneity/exclusion assumption might be in question.
Independence: Moreover, I wonder if natural abilities are hereditary.
Q3
Recreate the IV estimator via the first stage and reduced form.
Q: Compute the correlation between feduc and educ. What does that imply about the validity of our IV estimates?
cor(cleandata$educ, cleandata$feduc)
[1] 0.424906
A:
Correlation is .42
Seems large enough to suggest that the instrument feduc is relevant.
Q: Interpret the Weak Instruments test for the IV estimator
summary(iv, diagnostics = TRUE)
Call:
ivreg(formula = lwage ~ educ | feduc, data = cleandata)
Residuals:
Min 1Q Median 3Q Max
-1.89557 -0.25696 0.04035 0.26725 1.28810
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.3993 0.2263 23.859 < 2e-16 ***
educ 0.1034 0.0165 6.268 6.62e-10 ***
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 1 661 145.634 < 2e-16 ***
Wu-Hausman 1 660 9.281 0.00241 **
Sargan 0 NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4026 on 661 degrees of freedom
Multiple R-Squared: 0.04756, Adjusted R-squared: 0.04612
Wald test: 39.28 on 1 and 661 DF, p-value: 6.617e-10
A:
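The Weak instruments test: H0: the instrument is weak vs Ha: the instrument is strong
A small p-value (< 2e-16) suggests we reject the null
We have confidence that the instrument is relevant, in that it is sufficiently related to educ. The test says nothing about the exclusion or independence conditions.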
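With a single instrument and no controls, the weak-instruments statistic is the first-stage F, which can be recovered from the correlation alone. The sketch below uses only numbers reported above (the correlation of 0.424906 and the 661 residual degrees of freedom); the variable names are illustrative.

```r
# Values reported above
r <- 0.424906   # cor(educ, feduc)
n <- 663        # observations: 661 residual df + 2 estimated coefficients

# With one instrument and no controls, the first-stage F equals t^2,
# which can be written in terms of the correlation
F_first <- r^2 * (n - 2) / (1 - r^2)
F_first

# Widely used rule of thumb: F above 10 suggests the instrument is not weak
F_first > 10
```

The result agrees with the weak-instruments statistic of 145.634 in the diagnostics table, up to rounding of the correlation.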
Q: Interpret the Wu-Hausman test
A:
Wu-Hausman test for endogeneity of educ: H0: educ is exogenous vs Ha: educ is endogenous
We reject the null (p-value = 0.0024), suggesting that educ is indeed endogenous, plausibly because of OVB.
This supports our use of IV but does not explicitly test any of the criteria for the validity of our instrument.
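One way to see what an endogeneity test of this kind does is a regression-based (control function) version: add the first-stage residuals to the outcome regression and test their coefficient. The sketch below runs on simulated data, not wage2, and all names (`z`, `u`, `v_hat`, `cf`) are hypothetical. It is meant as an illustration of the idea, not as a reproduction of the exact statistic that `ivreg` reports.

```r
set.seed(1)
n <- 2000

# Simulated data (NOT the wage2 sample); u is an unobserved confounder
z <- rnorm(n)                        # instrument
u <- rnorm(n)                        # unobserved confounder
x <- 0.8 * z + u + rnorm(n)          # endogenous regressor
y <- 1 + 0.5 * x - 0.7 * u + rnorm(n)

# Step 1: first stage, keep the residuals
v_hat <- resid(lm(x ~ z))

# Step 2: add the first-stage residuals to the outcome equation
cf <- lm(y ~ x + v_hat)

# A significant coefficient on v_hat signals that x is endogenous
p_endog <- summary(cf)$coefficients["v_hat", "Pr(>|t|)"]

# The coefficient on x equals the simple IV estimate
b_cf <- unname(coef(cf)["x"])
b_iv <- cov(y, z) / cov(x, z)
c(control_function = b_cf, iv = b_iv, p_value = p_endog)
```

A small p-value on `v_hat` plays the same role as rejecting the null of exogeneity above. As in the answer, the check assumes the instrument is valid and does not test that assumption.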
Q: Run an OLS of lwage on feduc. Why doesn’t the significant coefficient on feduc invalidate its use as an instrument?
A:
iv.ReducedForm <- lm(lwage ~ feduc, data = cleandata)
summary(iv.ReducedForm)
Call:
lm(formula = lwage ~ feduc, data = cleandata)
Residuals:
Min 1Q Median 3Q Max
-2.09105 -0.23850 0.02758 0.27645 1.16701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.507909 0.051080 127.405 < 2e-16 ***
feduc 0.029825 0.004736 6.297 5.52e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4007 on 661 degrees of freedom
Multiple R-squared: 0.0566, Adjusted R-squared: 0.05517
F-statistic: 39.66 on 1 and 661 DF, p-value: 5.517e-10
This is the reduced form we saw earlier
The key to the exclusion restriction is that the instrument is not related to the outcome AFTER controlling for the treatment variable.
In our case, feduc and educ are related. The coefficient from this reduced form could be detecting the direct effects of feduc on wages as well as the indirect effects via educ.
If feduc were not related to wages at all, the instrument likely wouldn’t be relevant, since a relevant instrument moves educ and educ in turn moves wages.
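The link between the reduced form, the first stage, and the IV estimate can be sketched directly. The example below uses simulated data, not wage2, with `z`, `x`, and `y` standing in for feduc, educ, and lwage; all names are hypothetical. With one instrument and one treatment, the IV estimate is the reduced-form slope divided by the first-stage slope.

```r
set.seed(7)
n <- 2000

# Simulated data (NOT the wage2 sample)
z <- rnorm(n)                          # instrument (stand-in for feduc)
u <- rnorm(n)                          # unobserved confounder
x <- 0.5 * z + u + rnorm(n)            # treatment (stand-in for educ)
y <- 2 + 0.1 * x + 0.3 * u + rnorm(n)  # outcome (stand-in for lwage)

# First stage: effect of the instrument on the treatment
first_stage <- unname(coef(lm(x ~ z))["z"])

# Reduced form: effect of the instrument on the outcome
reduced_form <- unname(coef(lm(y ~ z))["z"])

# IV (Wald) estimate: reduced form scaled by the first stage
iv_wald <- reduced_form / first_stage

# Equivalent two-step version: regress y on first-stage fitted values
x_hat   <- fitted(lm(x ~ z))
iv_2sls <- unname(coef(lm(y ~ x_hat))["x_hat"])

c(first_stage = first_stage, reduced_form = reduced_form,
  wald = iv_wald, two_step = iv_2sls)
```

Applied to the numbers reported above, a reduced-form slope of 0.029825 and an IV estimate of 0.1034 imply a first-stage slope of roughly 0.29. A nonzero reduced form is therefore expected whenever the instrument is relevant and education affects wages.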
Q5
Expand the IV by controlling for hours, IQ, and exper. What is the new ATE? What does the change in the ATE suggest about the new covariates?
iv2 <- ivreg(lwage ~ educ + hours + IQ + exper | feduc + hours + IQ + exper, data = cleandata)
summary(iv2, diagnostics = TRUE)
Call:
ivreg(formula = lwage ~ educ + hours + IQ + exper | feduc + hours +
IQ + exper, data = cleandata)
Residuals:
Min 1Q Median 3Q Max
-1.66289 -0.25809 0.01406 0.27684 1.35564
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.632473 0.341076 13.582 < 2e-16 ***
educ 0.159562 0.036883 4.326 1.75e-05 ***
hours -0.006016 0.002277 -2.642 0.00845 **
IQ -0.001741 0.002828 -0.615 0.53846
exper 0.038821 0.007639 5.082 4.87e-07 ***
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 1 658 44.96 4.34e-11 ***
Wu-Hausman 1 657 9.46 0.00219 **
Sargan 0 NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4121 on 658 degrees of freedom
Multiple R-Squared: 0.006749, Adjusted R-squared: 0.000711
Wald test: 23.63 on 4 and 658 DF, p-value: < 2.2e-16
A:
The new ATE is .1596
This is quite a bit larger than the previous estimate of .1034
The change implies that the new covariates were not only relevant to wages but also related to education.
In the previous implementation, these confounders were omitted variables.
By explicitly controlling for these sources of confounding we are reducing the OVB, thereby reducing bias in the ATE estimate. Note that the estimate is also less precise: its standard error rises from 0.0165 to 0.0369.
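In the `ivreg` formula above, the exogenous controls appear on both sides of the `|`, so they serve as their own instruments. A minimal sketch of what that means, on simulated data rather than wage2 and with hypothetical names (`w` for a control, `z` for the instrument): the first stage includes both the instrument and the control, and the second stage keeps the control.

```r
set.seed(11)
n <- 3000

# Simulated data (NOT the wage2 sample)
w <- rnorm(n)                                    # exogenous control
z <- 0.4 * w + rnorm(n)                          # instrument, correlated with w
u <- rnorm(n)                                    # unobserved confounder
x <- 0.6 * z + 0.5 * w + u + rnorm(n)            # endogenous treatment
y <- 1 + 0.1 * x + 0.2 * w + 0.4 * u + rnorm(n)  # outcome

# First stage includes the instrument AND the exogenous control
x_hat <- fitted(lm(x ~ z + w))

# Second stage: replace x with its fitted values, keep the control
second   <- lm(y ~ x_hat + w)
b_manual <- unname(coef(second)["x_hat"])

# Same estimate from the just-identified IV formula (Z'X)^{-1} Z'y
X <- cbind(1, x, w)   # regressors
Z <- cbind(1, z, w)   # instruments: z replaces x, the control instruments itself
b_matrix <- unname(solve(crossprod(Z, X), crossprod(Z, y))[2, 1])

c(manual = b_manual, matrix = b_matrix)
```

The manual second stage reproduces the point estimate only. Its standard errors are not valid because they ignore that `x_hat` is itself estimated, so `ivreg` remains the right tool for inference.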