This document was created with R Markdown and then printed as a PDF for peer-graded evaluation purposes.
Code chunks will not be echoed in the paper.
A challenging and very relevant economic problem is the measurement of the returns to schooling. In this question we will use the following variables on 3010 US men:
- logw: log wage
- educ: number of years of schooling
- age: age of the individual in years
- exper: working experience in years
- smsa: dummy indicating whether the individual lived in a metropolitan area
- south: dummy indicating whether the individual lived in the south
- nearc: dummy indicating whether the individual lived near a 4-year college
- dadeduc: education of the individual’s father (in years)
- momeduc: education of the individual’s mother (in years)
This data is a selection of the data used by D. Card (1995).
\[logw = β_1 + β_2educ + β_3exper + β_4exper^2 + β_5smsa + β_6south + ε\]
Estimating this linear model by OLS gives the following output:
##
## Call:
## lm(formula = logw ~ educ + exper + exper2 + smsa + south, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.71487 -0.22987  0.02268  0.24898  1.38552
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.6110144  0.0678950  67.914  < 2e-16 ***
## educ         0.0815797  0.0034990  23.315  < 2e-16 ***
## exper        0.0838357  0.0067735  12.377  < 2e-16 ***
## exper2      -0.0022021  0.0003238  -6.800 1.26e-11 ***
## smsa         0.1508006  0.0158360   9.523  < 2e-16 ***
## south       -0.1751761  0.0146486 -11.959  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3813 on 3004 degrees of freedom
## Multiple R-squared: 0.2632, Adjusted R-squared: 0.2619
## F-statistic: 214.6 on 5 and 3004 DF, p-value: < 2.2e-16
All coefficients are statistically significant (every p-value is below 1.3e-11, i.e. effectively zero), but overall the explanatory power of the model is low (\(R^2\) is only 26%).
The estimate of \(β_2\) is 0.082. This means that each additional year of education is associated with a predicted increase of 0.082 in \(logw\), holding the other regressors fixed, which corresponds to an increase of about 8.5% in the wage (exp(0.082) ≈ 1.085).
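The conversion from a log-wage coefficient to a percentage change in the wage can be checked directly. The snippet below is a minimal sketch that hard-codes the \(educ\) estimate from the OLS output above; the variable names are illustrative and not part of the original analysis.

```r
# OLS coefficient on educ, copied from the regression output above
b_educ <- 0.0815797

# One extra year of schooling multiplies the predicted wage by exp(b_educ)
wage_factor <- exp(b_educ)

# Exact percentage change implied by the log-linear model
pct_exact <- 100 * (wage_factor - 1)

# Common approximation: read the coefficient itself as a percentage
pct_approx <- 100 * b_educ

cat(sprintf("exact: %.2f%%, approximation: %.2f%%\n", pct_exact, pct_approx))
```

The gap between the two numbers grows with the size of the coefficient, which is why the exact conversion is used in the text.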
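Experience and education alone do not show the whole picture. Factors such as family background, wealth and social class are missing from the model; if they affect wages and are correlated with education, they end up in the error term, which would make \(educ\) endogenous and OLS inconsistent.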
The OLS model is still useful as a descriptive benchmark: the small p-values indicate that the regressors are strongly associated with \(logw\).
Age is strongly related to the number of years of working experience a man has. On the other hand, it should not influence the wage directly once experience is accounted for, so it qualifies as a potential instrument for experience.
The first-stage regression for \(educ\) is:
##
## Call:
## lm(formula = educ ~ age + age2 + smsa + south + nearc + daded +
## momed, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11.2777  -1.5450  -0.2224   1.6957   7.2250
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.652354   3.976343  -1.421 0.155277
## age          0.989610   0.278714   3.551 0.000390 ***
## age2        -0.017019   0.004838  -3.518 0.000441 ***
## smsa         0.529566   0.101504   5.217 1.94e-07 ***
## south       -0.424851   0.091037  -4.667 3.19e-06 ***
## nearc        0.264554   0.099085   2.670 0.007626 **
## daded        0.190443   0.015611  12.199  < 2e-16 ***
## momed        0.234515   0.017028  13.773  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.326 on 3002 degrees of freedom
## Multiple R-squared: 0.2466, Adjusted R-squared: 0.2448
## F-statistic: 140.4 on 7 and 3002 DF, p-value: < 2.2e-16
As shown by the model summary, the additional instruments are significant regressors of education. This is especially true of the latter two (\(daded\) and \(momed\), the education of the father and the mother), which have high t-statistics. This makes perfect sense, as highly educated parents are more likely to support and promote their children’s education as well.
So the instrumental variables and the endogenous variable \(educ\) are significantly related.
We proceed to estimate the model by 2SLS:
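The logic of the two stages can be illustrated on simulated data. The sketch below does not use the Card (1995) sample: the variables `z`, `u`, `x` and `y`, the sample size and the true coefficient of 0.1 are all assumptions chosen for illustration, with `u` playing the role of an unobserved factor that makes `x` endogenous.

```r
# Simulated data only: none of these numbers come from the Card (1995) sample
set.seed(1)
n <- 5000
z <- rnorm(n)                      # instrument: moves x, has no direct effect on y
u <- rnorm(n)                      # unobserved factor that ends up in the error term
x <- z + u + rnorm(n)              # endogenous regressor, correlated with u
y <- 1 + 0.1 * x + u + rnorm(n)    # true coefficient on x is 0.1

# OLS is biased because x is correlated with the omitted factor u
b_ols <- unname(coef(lm(y ~ x))["x"])

# Stage 1: regress the endogenous regressor on the instrument
stage1 <- lm(x ~ z)
x_hat <- fitted(stage1)

# Stage 2: regress y on the fitted values from stage 1
b_2sls <- unname(coef(lm(y ~ x_hat))["x_hat"])

cat(sprintf("true: 0.100, OLS: %.3f, 2SLS: %.3f\n", b_ols, b_2sls))
```

The standard errors reported by a manual second stage like this one are not valid, because they ignore that `x_hat` is itself estimated. A dedicated routine such as `ivreg`, whose call appears in the output of this document, computes them correctly.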
##
## Call:
## ivreg(formula = logw ~ educ + exper + exper2 + smsa + south |
## age + age2 + smsa + south + nearc + daded + momed, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.7494 -0.2360  0.0266  0.2498  1.3468
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.4169039  0.1154208  38.268  < 2e-16 ***
## educ         0.0998429  0.0065738  15.188  < 2e-16 ***
## exper        0.0728669  0.0167134   4.360 1.35e-05 ***
## exper2      -0.0016393  0.0008381  -1.956   0.0506 .
## smsa         0.1349370  0.0167695   8.047 1.21e-15 ***
## south       -0.1589869  0.0156854 -10.136  < 2e-16 ***
##
## Diagnostic tests:
##                           df1  df2 statistic p-value
## Weak instruments (educ)     5 3002   145.511 < 2e-16 ***
## Weak instruments (exper)    5 3002  1257.258 < 2e-16 ***
## Weak instruments (exper2)   5 3002  1098.430 < 2e-16 ***
## Wu-Hausman                  2 3002     5.709 0.00335 **
## Sargan                      2   NA     3.702 0.15705
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3844 on 3004 degrees of freedom
## Multiple R-Squared: 0.2512, Adjusted R-squared: 0.2499
## Wald test: 175.9 on 5 and 3004 DF, p-value: < 2.2e-16
As shown by the summary above, both education and experience still have a positive effect, while squared experience still has a negative effect on \(logw\). The slope of \(educ\) is steeper, moving from 0.08 to 0.10, while the 2SLS estimate for experience, about 0.073, is a bit smaller than the OLS estimate of about 0.084. Both 2SLS and OLS give a small negative coefficient of about -0.002 for squared experience, although under 2SLS it is only marginally significant (p-value 0.0506).
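To make the comparison easier to read, the two sets of estimates can be placed side by side. The sketch below hard-codes the coefficients and standard errors from the OLS and 2SLS outputs shown in this document; the data frame `est` and its column names are illustrative choices.

```r
# Estimates and standard errors copied from the OLS and 2SLS summaries
est <- data.frame(
  term    = c("educ", "exper", "exper2"),
  ols     = c(0.0815797, 0.0838357, -0.0022021),
  ols_se  = c(0.0034990, 0.0067735, 0.0003238),
  tsls    = c(0.0998429, 0.0728669, -0.0016393),
  tsls_se = c(0.0065738, 0.0167134, 0.0008381)
)

# Change in each coefficient when moving from OLS to 2SLS
est$diff <- est$tsls - est$ols

# How much larger the 2SLS standard errors are relative to OLS
est$se_ratio <- est$tsls_se / est$ols_se

print(est, digits = 3)
```

The `se_ratio` column shows that the 2SLS standard errors are roughly two to two and a half times the OLS ones, which is the usual price of instrumenting and explains the weaker significance of \(exper2\).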
The 2SLS summary also reports some useful diagnostic tests.
\(R^2\): still very low, with only 25% of the variation in \(logw\) explained.
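Hausman Test: with a p-value of 0.003 (< 0.01), the test rejects the null hypothesis that the regressors are exogenous. This indicates that \(educ\), \(exper\) and \(exper2\) are, at least jointly, endogenous, as expected; that is, they are correlated with \(ε\), the error term of the model.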
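The idea behind this test can be sketched with the regression-based (Durbin-Wu-Hausman) version on simulated data. This is an illustration under assumed names and parameters (`z`, `u`, `x`, `y`, a single endogenous regressor), not a reproduction of the statistic reported above.

```r
# Simulated data only: x is endogenous by construction through u
set.seed(2)
n <- 5000
z <- rnorm(n)                      # instrument
u <- rnorm(n)                      # unobserved factor in the error term
x <- z + u + rnorm(n)              # endogenous regressor
y <- 1 + 0.1 * x + u + rnorm(n)

# Step 1: residuals of the first-stage regression of x on the instrument
v_hat <- resid(lm(x ~ z))

# Step 2: add those residuals to the structural equation
aug <- lm(y ~ x + v_hat)

# Under the null of exogeneity the coefficient on v_hat is zero,
# so a small p-value is evidence that x is endogenous
p_value <- summary(aug)$coefficients["v_hat", "Pr(>|t|)"]
cat(sprintf("p-value for the first-stage residual: %.3g\n", p_value))
```

With several endogenous regressors, as in this document, one residual series per endogenous regressor is added and their coefficients are tested jointly with an F-test.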
Sargan Test: with a p-value of 0.157 (> 0.15), the test does not reject the null hypothesis that the instruments are valid. There is therefore no evidence that the instruments are correlated with the errors of the linear model for \(logw\), or that they are omitted variables in that model, so they qualify as instruments.
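The mechanics of the Sargan test can be sketched by hand on simulated data: the 2SLS residuals are regressed on all instruments, and \(n R^2\) of that regression is compared with a chi-squared distribution whose degrees of freedom equal the number of instruments minus the number of endogenous regressors. The data, the names (`z1`, `z2`, `x`, `y`) and the setup with two instruments and one endogenous regressor are assumptions made for illustration.

```r
# Simulated data only: two valid instruments, one endogenous regressor
set.seed(3)
n <- 5000
z1 <- rnorm(n)
z2 <- rnorm(n)
u <- rnorm(n)                          # unobserved factor in the error term
x <- z1 + z2 + u + rnorm(n)            # endogenous regressor
y <- 1 + 0.1 * x + u + rnorm(n)

# 2SLS coefficients from a manual two-stage fit
x_hat <- fitted(lm(x ~ z1 + z2))
b <- unname(coef(lm(y ~ x_hat)))

# 2SLS residuals are built with the observed x, not with x_hat
e_iv <- y - b[1] - b[2] * x

# Auxiliary regression of the residuals on all instruments
r2 <- summary(lm(e_iv ~ z1 + z2))$r.squared
sargan <- n * r2

# Degrees of freedom: number of instruments minus number of endogenous regressors
dof <- 2 - 1
p_value <- pchisq(sargan, df = dof, lower.tail = FALSE)
cat(sprintf("Sargan statistic: %.3f, p-value: %.3f\n", sargan, p_value))
```

In the model of this document there are five excluded instruments (\(age\), \(age2\), \(nearc\), \(daded\), \(momed\)) and three endogenous regressors, which matches the 2 degrees of freedom shown in the diagnostics table.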