Econometrics: Methods and Applications by Erasmus University Rotterdam

Week 4 Assignment: Endogeneity

This document was created with R Markdown and then printed as a PDF for peer-graded evaluation.

Code chunks will not be echoed in the paper.


Data set

A challenging and very relevant economic problem is the measurement of the returns to schooling. In this question we will use the following variables on 3010 US men:
- logw: log wage
- educ: number of years of schooling
- age: age of the individual in years
- exper: working experience in years
- smsa: dummy indicating whether the individual lived in a metropolitan area
- south: dummy indicating whether the individual lived in the south
- nearc: dummy indicating whether the individual lived near a 4-year college
- dadeduc: education of the individual’s father (in years)
- momeduc: education of the individual’s mother (in years)
This data is a selection of the data used by D. Card (1995).

Questions

(a) Use OLS to estimate the parameters of the model

\[logw = β_1 + β_2educ + β_3exper + β_4exper^2 + β_5smsa + β_6south + ε\]

Give an interpretation to the estimated \(β_2\) coefficient.

The OLS estimates are as follows:

## 
## Call:
## lm(formula = logw ~ educ + exper + exper2 + smsa + south, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.71487 -0.22987  0.02268  0.24898  1.38552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.6110144  0.0678950  67.914  < 2e-16 ***
## educ         0.0815797  0.0034990  23.315  < 2e-16 ***
## exper        0.0838357  0.0067735  12.377  < 2e-16 ***
## exper2      -0.0022021  0.0003238  -6.800 1.26e-11 ***
## smsa         0.1508006  0.0158360   9.523  < 2e-16 ***
## south       -0.1751761  0.0146486 -11.959  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3813 on 3004 degrees of freedom
## Multiple R-squared:  0.2632, Adjusted R-squared:  0.2619 
## F-statistic: 214.6 on 5 and 3004 DF,  p-value: < 2.2e-16

All coefficients are statistically significant at the 0.1% level (every p-value is at most 1.26e-11), but the overall explanatory power of the model is low: \(R^2\) is only 26%.

The estimated \(β_2\) is 0.082. Holding experience, \(smsa\) and \(south\) constant, each extra year of education is associated with an increase of 0.082 in \(logw\), which corresponds to a wage that is about 8.5% higher (exp(0.082) = 1.085).
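
The conversion from a coefficient in log points to a percentage change can be checked in a few lines. This is a minimal sketch in Python rather than the R used for the analysis; the variable names are illustrative and the coefficient is copied from the OLS summary above.

```python
import math

# OLS coefficient on educ, copied from the regression summary above
beta_educ = 0.0815797

# In a log-linear model, one extra unit of x multiplies the level of y
# by exp(beta), so the exact percentage change is 100 * (exp(beta) - 1).
pct_exact = 100 * (math.exp(beta_educ) - 1)

# For small beta, 100 * beta is a common first-order approximation.
pct_approx = 100 * beta_educ

print(f"exact: {pct_exact:.2f}%  approximation: {pct_approx:.2f}%")
```

The gap between the two numbers shows why the exponential conversion is preferable once a coefficient is not very close to zero.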


(b) OLS may be inconsistent in this case as \(educ\) and \(exper\) may be endogenous. Give a reason why this may be the case. Also indicate whether the estimate in part (a) is still useful.

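Both \(educ\) and \(exper\) are partly the result of individual choices, and the factors behind those choices are not in the model. Unobserved characteristics such as ability, motivation, family background, wealth and social class plausibly affect both the amount of schooling and the wage. They end up in \(ε\), so \(educ\) is correlated with the error term. Experience is closely tied to schooling (more years in school leave fewer years to work), so it suffers from the same problem.

The OLS estimate is still useful as a description of how wages differ, on average, between men with different levels of schooling, and it can be used for prediction. It should not be read as the causal effect of an extra year of schooling, however: small p-values do not protect against inconsistency.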


(c) Give a motivation why \(age\) and \(age^2\) can be used as instruments for \(exper\) and \(exper^2\).

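Age is strongly related to the number of years of working experience a man can have, so it is a relevant instrument for \(exper\), and \(age^2\) plays the same role for \(exper^2\). At the same time, age is not chosen by the individual, so it is plausibly unrelated to unobserved factors such as ability that sit in \(ε\), and it should affect wage only through experience. Both conditions for an instrument (relevance and exogeneity) are therefore plausible.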


(d) Run the first-stage regression for \(educ\) for the two-stage least squares estimation of the parameters in the model above when \(age\), \(age^2\), \(nearc\), \(dadeduc\), and \(momeduc\) are used as additional instruments. What do you conclude about the suitability of these instruments for schooling?

The first-stage regression for \(educ\) gives:

## 
## Call:
## lm(formula = educ ~ age + age2 + smsa + south + nearc + daded + 
##     momed, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2777  -1.5450  -0.2224   1.6957   7.2250 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.652354   3.976343  -1.421 0.155277    
## age          0.989610   0.278714   3.551 0.000390 ***
## age2        -0.017019   0.004838  -3.518 0.000441 ***
## smsa         0.529566   0.101504   5.217 1.94e-07 ***
## south       -0.424851   0.091037  -4.667 3.19e-06 ***
## nearc        0.264554   0.099085   2.670 0.007626 ** 
## daded        0.190443   0.015611  12.199  < 2e-16 ***
## momed        0.234515   0.017028  13.773  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.326 on 3002 degrees of freedom
## Multiple R-squared:  0.2466, Adjusted R-squared:  0.2448 
## F-statistic: 140.4 on 7 and 3002 DF,  p-value: < 2.2e-16

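As shown by the model summary, the additional instruments are significant regressors of education: \(age\) and \(age^2\) at the 0.1% level and \(nearc\) at the 1% level. The relation is strongest for the latter two, \(daded\) and \(momed\) (the father's and mother's education, called \(dadeduc\) and \(momeduc\) in the data description), which have t-statistics above 12. This makes sense, as highly educated parents are more likely to support and promote their children's education.

So the instruments and the endogenous variable \(educ\) are significantly related, which means the instruments are relevant for schooling. Relevance alone does not establish validity: the instruments must also be uncorrelated with \(ε\), which this regression cannot show.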


(e) Estimate the parameters of the model for \(log wage\) using two-stage least squares where you correct for the endogeneity of education and experience. Compare your result to the estimate in part (a).

We re-estimate the model with 2SLS:

## 
## Call:
## ivreg(formula = logw ~ educ + exper + exper2 + smsa + south | 
##     age + age2 + smsa + south + nearc + daded + momed, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7494 -0.2360  0.0266  0.2498  1.3468 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.4169039  0.1154208  38.268  < 2e-16 ***
## educ         0.0998429  0.0065738  15.188  < 2e-16 ***
## exper        0.0728669  0.0167134   4.360 1.35e-05 ***
## exper2      -0.0016393  0.0008381  -1.956   0.0506 .  
## smsa         0.1349370  0.0167695   8.047 1.21e-15 ***
## south       -0.1589869  0.0156854 -10.136  < 2e-16 ***
## 
## Diagnostic tests:
##                            df1  df2 statistic p-value    
## Weak instruments (educ)      5 3002   145.511 < 2e-16 ***
## Weak instruments (exper)     5 3002  1257.258 < 2e-16 ***
## Weak instruments (exper2)    5 3002  1098.430 < 2e-16 ***
## Wu-Hausman                   2 3002     5.709 0.00335 ** 
## Sargan                       2   NA     3.702 0.15705    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3844 on 3004 degrees of freedom
## Multiple R-Squared: 0.2512,  Adjusted R-squared: 0.2499 
## Wald test: 175.9 on 5 and 3004 DF,  p-value: < 2.2e-16

As the summary shows, education and experience still have a positive effect on \(logw\), and squared experience still has a negative one. The coefficient of \(educ\) rises from 0.082 under OLS to 0.100 under 2SLS, while the coefficient of \(exper\) falls slightly, from 0.084 to 0.073. The coefficient of \(exper^2\) is about -0.002 in both cases, but under 2SLS it is only marginally significant (p = 0.051). The 2SLS standard errors of the three instrumented variables are about 1.9 to 2.6 times as large as the OLS ones, so the 2SLS estimates are less precise.
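
The comparison between the two sets of estimates can be laid out side by side. The following is a small sketch in Python (not the R code behind the report); the dictionary names are illustrative, and all numbers are copied from the OLS and 2SLS summaries above.

```python
import math

# (estimate, standard error) pairs copied from the OLS and 2SLS summaries above
ols = {
    "educ": (0.0815797, 0.0034990),
    "exper": (0.0838357, 0.0067735),
    "exper2": (-0.0022021, 0.0003238),
}
tsls = {
    "educ": (0.0998429, 0.0065738),
    "exper": (0.0728669, 0.0167134),
    "exper2": (-0.0016393, 0.0008381),
}

# Change in each coefficient and ratio of standard errors (2SLS relative to OLS)
change = {name: tsls[name][0] - ols[name][0] for name in ols}
se_ratio = {name: tsls[name][1] / ols[name][1] for name in ols}

for name in ols:
    print(f"{name:7s} change = {change[name]:+.4f}  SE ratio = {se_ratio[name]:.2f}")

# Exact percentage return to one extra year of schooling under each estimator
return_ols = 100 * (math.exp(ols["educ"][0]) - 1)
return_tsls = 100 * (math.exp(tsls["educ"][0]) - 1)
print(f"return to schooling: OLS {return_ols:.1f}%, 2SLS {return_tsls:.1f}%")
```

With these numbers the implied return to a year of schooling moves from about 8.5% under OLS to about 10.5% under 2SLS.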


(f) Perform the Sargan test for validity of the instruments. What is your conclusion?

The diagnostics at the end of the 2SLS summary above give the following:
- \(R^2\): still low, with only 25% of the variation in \(logw\) explained.
- Wu-Hausman test: the p-value of 0.003 rejects the null hypothesis that the regressors are exogenous, so at least one of \(educ\), \(exper\) and \(exper^2\) is correlated with the error term \(ε\), as expected.
- Sargan test: the statistic is 3.70 with 2 degrees of freedom (5 instruments minus 3 endogenous regressors) and the p-value is 0.157, so the null hypothesis of valid instruments is not rejected. There is no evidence that the instruments are correlated with the errors of the \(logw\) equation or that they belong in that equation as omitted variables, which supports their use as instruments.
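
The Sargan p-value can be reproduced from the reported statistic alone. The sketch below is in Python rather than R and uses the closed form of the chi-square upper tail that holds for even degrees of freedom; the function name `chi2_sf_even_df` is illustrative, and the statistic is the rounded value printed in the diagnostics table.

```python
import math

sargan_stat = 3.702   # rounded statistic from the diagnostics table above
n_instruments = 5     # age, age2, nearc, daded, momed
n_endogenous = 3      # educ, exper, exper2
df = n_instruments - n_endogenous  # number of overidentifying restrictions


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable, for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    # P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    return math.exp(-half) * sum(
        half**k / math.factorial(k) for k in range(df // 2)
    )


p_value = chi2_sf_even_df(sargan_stat, df)
print(f"df = {df}, p-value = {p_value:.3f}")
```

Because the input statistic is rounded, the result matches the printed p-value only to about three decimals. For odd degrees of freedom a statistics library would be needed instead of this shortcut.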