• See the website for how to submit your answers and how feedback is organized.
• This exercise uses the datafile TestExer4 Wage (available on the website) and requires a computer.
• Obtain insight into the consequences of endogeneity
• Practice identifying causes of endogeneity
• Practice identifying valid instruments
• Obtain hands-on experience with applying 2SLS and the Sargan test
A challenging and very relevant economic problem is the measurement of the returns to schooling. In this question we will use the following variables on 3010 US men:
• logw: log wage
• educ: number of years of schooling
• age: age of the individual in years
• exper: working experience in years
• smsa: dummy indicating whether the individual lived in a metropolitan area
• south: dummy indicating whether the individual lived in the south
• nearc: dummy indicating whether the individual lived near a 4-year college
• dadeduc: education of the individual’s father in years (daded in the datafile)
• momeduc: education of the individual’s mother in years (momed in the datafile)
These data are a selection of the data used by D. Card (1995).
First we load the data and inspect it; the model is fitted after some preparation.
library(ggplot2)
library(dplyr)
library(lmtest)
library(readxl)
df <- read_excel("FINALDATA4.xls", col_names = TRUE)
head(df)
## # A tibble: 6 x 9
## logw educ age exper smsa south nearc daded momed
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6.31 7 29 16 1 0 0 9.94 10.2
## 2 6.18 12 27 9 1 0 0 8 8
## 3 6.58 12 34 16 1 0 0 14 12
## 4 5.52 11 27 10 1 0 1 11 12
## 5 6.59 12 34 16 1 0 1 8 7
## 6 6.21 12 26 8 1 0 1 9 12
We add an observation index and the squared variables:
df$nbr <- 1:nrow(df)
df$exper2 <- df$exper^2
df$age2 <- df$age^2
str(df)
## tibble [3,010 x 12] (S3: tbl_df/tbl/data.frame)
## $ logw : num [1:3010] 6.31 6.18 6.58 5.52 6.59 ...
## $ educ : num [1:3010] 7 12 12 11 12 12 18 14 12 12 ...
## $ age : num [1:3010] 29 27 34 27 34 26 33 29 28 29 ...
## $ exper : num [1:3010] 16 9 16 10 16 8 9 9 10 11 ...
## $ smsa : num [1:3010] 1 1 1 1 1 1 1 1 1 1 ...
## $ south : num [1:3010] 0 0 0 0 0 0 0 0 0 0 ...
## $ nearc : num [1:3010] 0 0 0 1 1 1 1 1 1 1 ...
## $ daded : num [1:3010] 9.94 8 14 11 8 9 14 14 12 12 ...
## $ momed : num [1:3010] 10.2 8 12 12 7 ...
## $ nbr : int [1:3010] 1 2 3 4 5 6 7 8 9 10 ...
## $ exper2: num [1:3010] 256 81 256 100 256 64 81 81 100 121 ...
## $ age2 : num [1:3010] 841 729 1156 729 1156 ...
df %>% ggplot(aes(nbr, logw)) + geom_point()
After these preparations, we fit the OLS model:
fit <- lm(logw ~ educ + exper + exper2 + smsa + south, data = df)
summary(fit)
Call:
lm(formula = logw ~ educ + exper + exper2 + smsa + south, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.71487 -0.22987 0.02268 0.24898 1.38552
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.6110144 0.0678950 67.914 < 2e-16 ***
educ 0.0815797 0.0034990 23.315 < 2e-16 ***
exper 0.0838357 0.0067735 12.377 < 2e-16 ***
exper2 -0.0022021 0.0003238 -6.800 1.26e-11 ***
smsa 0.1508006 0.0158360 9.523 < 2e-16 ***
south -0.1751761 0.0146486 -11.959 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3813 on 3004 degrees of freedom
Multiple R-squared: 0.2632, Adjusted R-squared: 0.2619
F-statistic: 214.6 on 5 and 3004 DF, p-value: < 2.2e-16
Interpretation of β2, the coefficient on educ: the OLS estimate of 0.082 implies that each additional year of schooling is associated with roughly 8.2% higher wages, holding experience and the location dummies fixed.
# Answer:
# Experience and education alone are unlikely to be sufficient; wages also depend on factors such as family background, wealth, ability and social class, which are not in the model.
# The OLS fit itself looks reasonable, as indicated by the small p-values.
# However, wage, education and experience may all be driven by such an omitted variable (ability, family support, social class, etc.): higher ability can lead to a higher wage, a longer education and, at a given age, less working experience, and vice versa.
# In that case educ and exper are endogenous (correlated with the error term), and the OLS estimates are biased and inconsistent, so they are no longer reliable.
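To make this omitted-variable argument concrete, here is a minimal simulation sketch (the variable names and numbers are hypothetical, not taken from the Card data) in which unobserved ability raises both schooling and wages, so that OLS overstates the return to schooling:
set.seed(123)
n <- 3000
ability <- rnorm(n)                               # unobserved by the econometrician
educ_s <- 12 + 2 * ability + rnorm(n)             # ability increases schooling
logw_s <- 4 + 0.08 * educ_s + 0.10 * ability + rnorm(n, sd = 0.3)
coef(lm(logw_s ~ educ_s))["educ_s"]               # around 0.12, above the true 0.08
Because ability is omitted and positively correlated with educ_s, the OLS coefficient absorbs part of the ability effect.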
resid_fitted <- residuals(fit)
df$Resid_fit <- resid_fitted
head(df)
## # A tibble: 6 x 13
## logw educ age exper smsa south nearc daded momed nbr exper2 age2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 6.31 7 29 16 1 0 0 9.94 10.2 1 256 841
## 2 6.18 12 27 9 1 0 0 8 8 2 81 729
## 3 6.58 12 34 16 1 0 0 14 12 3 256 1156
## 4 5.52 11 27 10 1 0 1 11 12 4 100 729
## 5 6.59 12 34 16 1 0 1 8 7 5 256 1156
## 6 6.21 12 26 8 1 0 1 9 12 6 64 676
## # ... with 1 more variable: Resid_fit <dbl>
df %>% ggplot(aes(nbr, Resid_fit)) + geom_point()
# Answer:
# Age is plausibly exogenous: it is not chosen by the individual, and it is clearly related to experience, since younger men cannot have accumulated many years of work. Age therefore explains working experience, while, conditional on experience and education, it should have no direct effect on the wage.
# This makes age a candidate instrument for exper, and age2 a candidate instrument for exper2.
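As a quick relevance check (a sketch of my own, not part of the exercise; exogeneity itself cannot be verified from the data), we can inspect the correlations between the candidate instruments and the endogenous regressors:
# Instruments should be clearly correlated with the regressors they stand in for
with(df, round(cor(cbind(age, age2, exper, exper2, educ)), 2))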
df %>% ggplot(aes(exper2, logw)) + geom_point()
df %>% ggplot(aes(age2,logw)) + geom_point()
# Answer: first-stage regression of educ on the exogenous regressors and the candidate instruments.
fit2 <- lm(educ ~ age + age2 + smsa + south + nearc + daded + momed, data = df)
summary(fit2)
Call:
lm(formula = educ ~ age + age2 + smsa + south + nearc + daded +
momed, data = df)
Residuals:
Min 1Q Median 3Q Max
-11.2777 -1.5450 -0.2224 1.6957 7.2250
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.652354 3.976343 -1.421 0.155277
age 0.989610 0.278714 3.551 0.000390 ***
age2 -0.017019 0.004838 -3.518 0.000441 ***
smsa 0.529566 0.101504 5.217 1.94e-07 ***
south -0.424851 0.091037 -4.667 3.19e-06 ***
nearc 0.264554 0.099085 2.670 0.007626 **
daded 0.190443 0.015611 12.199 < 2e-16 ***
momed 0.234515 0.017028 13.773 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.326 on 3002 degrees of freedom
Multiple R-squared: 0.2466, Adjusted R-squared: 0.2448
F-statistic: 140.4 on 7 and 3002 DF, p-value: < 2.2e-16
As shown by the model summary, the additional instruments are significant regressors of education: age, age2, nearc, daded and momed are all significantly related to education. This holds in particular for the last two (daded and momed, the education of the father and the mother), given their large t-statistics, which makes sense, as highly educated parents are more likely to support and promote their children's education.
Model fit: R² = 0.2466.
So the instruments and the endogenous variable educ are significantly related.
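A useful follow-up (a sketch of my own, not required by the exercise, using the hypothetical name fit2_restricted) is a joint F-test of the excluded instruments in the first stage, comparing fit2 with a version that drops age, age2, nearc, daded and momed; a large F-statistic indicates that the instruments are not weak:
# First stage without the excluded instruments
fit2_restricted <- lm(educ ~ smsa + south, data = df)
# Joint F-test of the excluded instruments
anova(fit2_restricted, fit2)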
Now we compute the fitted values of education:
educ_f <- fitted.values(fit2)
df$educ_fitted <- educ_f
summary(df$educ)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 12.00 13.00 13.26 16.00 18.00
summary(df$educ_fitted)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.87 12.53 13.39 13.26 14.14 17.10
with(df, plot(nbr, educ, type = 'l', col = 'blue', main = "Education", xlab = '# number'))
with(df, lines(nbr, educ_fitted, col = 'red'))  # add the fitted series to the same plot
legend("bottomleft", c("Actual", "Fitted"), horiz = TRUE,
       lty = c(1, 1), lwd = c(2, 2), col = c("blue", "red"), bg = "grey")
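For intuition, here is a manual 2SLS sketch (my own addition, with hypothetical names exper_fit, exper2_fit and fit_manual): regress each endogenous regressor on all instruments and exogenous variables, then use the fitted values in the wage equation. The point estimates should match the ivreg() results below up to rounding, but the standard errors from this two-step lm() are not valid, which is why ivreg() is used instead:
# First stages for the other endogenous regressors
df$exper_fit <- fitted(lm(exper ~ age + age2 + smsa + south + nearc + daded + momed, data = df))
df$exper2_fit <- fitted(lm(exper2 ~ age + age2 + smsa + south + nearc + daded + momed, data = df))
# Second stage with fitted values (coefficients only; ignore these standard errors)
fit_manual <- lm(logw ~ educ_fitted + exper_fit + exper2_fit + smsa + south, data = df)
coef(fit_manual)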
library(AER)
fit3 <- ivreg(formula = logw ~ educ + exper + exper2 + smsa + south | age + age2 + smsa + south + nearc + daded + momed, data = df)
summary(fit3)
Call:
ivreg(formula = logw ~ educ + exper + exper2 + smsa + south |
age + age2 + smsa + south + nearc + daded + momed, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.7494 -0.2360 0.0266 0.2498 1.3468
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.4169039 0.1154208 38.268 < 2e-16 ***
educ 0.0998429 0.0065738 15.188 < 2e-16 ***
exper 0.0728669 0.0167134 4.360 1.35e-05 ***
exper2 -0.0016393 0.0008381 -1.956 0.0506 .
smsa 0.1349370 0.0167695 8.047 1.21e-15 ***
south -0.1589869 0.0156854 -10.136 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3844 on 3004 degrees of freedom
Multiple R-Squared: 0.2512, Adjusted R-squared: 0.2499
Wald test: 175.9 on 5 and 3004 DF, p-value: < 2.2e-16
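As a cross-check (my own addition), the AER package can report the standard IV diagnostics for this model, including weak-instrument tests, the Wu-Hausman endogeneity test and the Sargan test that is computed by hand below:
summary(fit3, diagnostics = TRUE)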
logw_f <- fitted.values(fit3)
df$logw_fitted <- logw_f
summary(df$logw)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.605 5.977 6.287 6.262 6.564 7.785
summary(df$logw_fitted)
with(df, plot(nbr, logw, type = 'l', col = 'gray', main = "Log Wage", xlab = '# number'))
with(df, lines(nbr, logw_fitted, col = 'green'))  # add the fitted series to the same plot
legend("bottomleft", c("Actual", "Fitted"), horiz = TRUE,
       lty = c(1, 1), lwd = c(2, 2), col = c("gray", "green"), bg = "white")
final_r <- residuals(fit3)
df$final_residual <- final_r
fres <- lm(final_residual ~ smsa + south + age + age2 + nearc + daded + momed, data = df)
sargan.tstat <- nrow(df) * summary(fres)$r.squared
sargan.tstat
## [1] 3.702389
residualPlot(fres)
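The Sargan statistic is compared with a chi-squared distribution whose degrees of freedom equal the number of overidentifying restrictions, here 5 excluded instruments (age, age2, nearc, daded, momed) minus 3 endogenous regressors (educ, exper, exper2), i.e. 2. A short sketch of the corresponding p-value:
# p-value of the Sargan test under chi-squared with 2 degrees of freedom
1 - pchisq(sargan.tstat, df = 2)
With the statistic of about 3.70 this gives a p-value of roughly 0.16, so the null hypothesis that the instruments are valid (the overidentifying restrictions hold) is not rejected at the 5% level.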