1. Explain why you might worry that education is endogenous and why quarter of birth could be a valid instrument for education in this model.
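Why education is endogenous: One’s ability plausibly affects both the level of education attained and the wage earned. Because ability cannot be observed or measured, it cannot be included explicitly in the model and is absorbed into the error term. The covariate education is therefore correlated with the error term and is endogenous.
Why quarter of birth could be a valid instrument: school-entry cutoffs and compulsory schooling laws mean that children born at different times of the year reach the legal dropout age with slightly different amounts of schooling, so quarter of birth shifts education. At the same time, the timing of birth within a year is plausibly unrelated to ability and to other determinants of wages.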
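The omitted-ability argument can be illustrated with a small simulation. This is only a sketch on made-up data: the variable `ability`, the coefficient values, and the sample size are all assumptions chosen for illustration and have nothing to do with ak95.csv. The direction of the bias depends on the assumed signs.

```r
# Simulated illustration (all numbers hypothetical, not from ak95.csv)
set.seed(1)
n <- 100000
ability <- rnorm(n)                       # unobserved in real data
educ    <- 12 + 1.0 * ability + rnorm(n)  # ability raises schooling
true_return <- 0.08
lwage   <- 5 + true_return * educ + 0.10 * ability + rnorm(n, sd = 0.5)

# Regression that omits ability (what we can actually run)
beta_short <- unname(coef(lm(lwage ~ educ))["educ"])
# Infeasible regression that controls for ability
beta_long  <- unname(coef(lm(lwage ~ educ + ability))["educ"])

print(c(short = beta_short, long = beta_long))
```

In this setup the short regression overstates the assumed return of 0.08 because educ picks up part of the effect of ability, while the long regression recovers it.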
2. Plot the average years of completed education over years and quarters of birth using the observed data (like Figures I-III of the paper, on pages 983-984). Then construct an analogous plot with log weekly wages instead of years of education on the vertical axis. What do these plots tell you about the validity of this instrumental variables strategy?
# load data
dataset <- read.csv("/Users/ty/Desktop/R_422/ak95.csv")
dataset$lwage<-as.numeric(dataset$lwage)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(pander)
## Warning: package 'pander' was built under R version 3.6.2
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.6.2
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(AER)
## Loading required package: car
## Warning: package 'car' was built under R version 3.6.2
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## Loading required package: sandwich
## Loading required package: survival
yandq<- dataset %>% group_by(yob,qob)
a<-yandq %>% summarise(edu=mean(educ),lw=mean(lwage))
###plot for year(up to quarter) vs education level
plot(a$yob+0.25*a$qob,a$edu,pch=15,type="b",
xlim = c(30,40),ylim=c(12.2,13.2),
ylab = 'Years of Completed Education',
xlab = 'Year of Birth',
sub = 'Years of Education and Season of Birth')
text(a$yob+0.25*a$qob,a$edu,a$qob,pos = 1)
###plot for year(up to quarter) vs log wage
plot(a$yob+0.25*a$qob,a$lw,pch=15,type="b",
xlim = c(30,40),
ylab = 'Log Weekly Wages',
xlab = 'Year of Birth',
sub = 'Log Weekly Wages and Season of Birth')
text(a$yob+0.25*a$qob,a$lw,a$qob,pos = 1)
It can be seen from the first plot that years of education completed vary over quarters of birth following a regular pattern within each year, with people born later in a year attaining more education than those born earlier in that year. Years of education are therefore correlated with quarter of birth according to the plot, which supports the relevance condition.
However, whether quarter of birth is uncorrelated with the unobserved determinants of log weekly wages is less certain. We can see from the second plot that the average log weekly wage is fairly stable across years, but wages still vary over quarters following a pattern within each year. That pattern closely resembles the one for education. So, we might deduce that wages are not affected by quarter of birth except through years of completed education, which is itself affected by quarter of birth. Nonetheless, exogeneity is not established by these plots and remains to be tested.
3. Splitting the data into groups depending on whether the worker was born in the first quarter or not, reproduce the top of Panel B of Table III (on page 996). Calculate the means, differences and standard errors.
#split the dataset
# note: first_q equals 0 for first-quarter births and 1 otherwise
dataset$first_q<-ifelse(dataset$qob==1,0,1)
q1<- subset(dataset,qob==1)
q234<-subset(dataset,qob!=1)
#summarize the information we have
d <- data.frame('Born in 1st quarter of the year' = c(mean(q1$lwage),mean(q1$educ)),
'Born in 2nd, 3rd, or 4th quarter of the year' = c(mean(q234$lwage), mean(q234$educ)),
'Difference' = c(mean(q1$lwage)-mean(q234$lwage),mean(q1$educ)-mean(q234$educ)),
std.error = c(sqrt(var(q1$lwage)/nrow(q1)+var(q234$lwage)/nrow(q234)),
sqrt(var(q1$educ)/nrow(q1)+var(q234$educ)/nrow(q234))))
pandoc.table(d)
##
## --------------------------------------------------------------------------------
## Born.in.1st.quarter.of.the.year Born.in.2nd..3rd..or.4th.quarter.of.the.year
## --------------------------------- ----------------------------------------------
## 5.892 5.903
##
## 12.69 12.8
## --------------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## ------------------------
## Difference std.error
## ------------ -----------
## -0.0111 0.002745
##
## -0.1088 0.01332
## ------------------------
All results match Panel B of Table III of the paper.
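For reference, the standard errors in the table come from the usual unpooled two-sample formula. The symbols here are notation introduced only for this note: \(s_1^2\) and \(s_0^2\) are the sample variances and \(n_1\) and \(n_0\) the sample sizes for those born in the first quarter and in the other quarters, respectively.

\[ SE(\bar x_1 - \bar x_0) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_0^2}{n_0}} \]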
4. Now compute the Wald estimate using the averages computed in question 3 (like in the lower half of Table III, Panel B). Next, compute the Wald estimate using regressions of lwage and educ on your first-quarter indicator and the corresponding OLS estimate of the return to education and discuss what you find.
# Wald estimate computed two ways
beta_wald<-(mean(q1$lwage)-mean(q234$lwage))/(mean(q1$educ)-mean(q234$educ))
beta_wald_reg<-lm(lwage~first_q,data=dataset)$coef[2]/lm(educ~first_q,data=dataset)$coef[2]
#simple OLS
ols<-lm(lwage~educ,data=dataset)
beta_ols <- ols$coef[2]
print(c(beta_wald, beta_wald_reg, beta_ols))
## first_q educ
## 0.10199499 0.10199499 0.07085104
Using the formula \[ \hat\beta_{Wald} = \frac{\bar y_1 - \bar y_0}{\bar d_1 - \bar d_0}, \] we obtain a Wald estimate of 0.1020. The ratio of two regression coefficients gives the same result: the slope from a regression of lwage on the instrument divided by the slope from a regression of educ on the instrument.
However, the OLS return to education is 0.0709, which is smaller than the Wald estimate. If ‘educ’ is endogenous, as we assume, the OLS estimate is biased, so we should prefer the Wald estimate of the return to education even though the OLS estimate is highly significant.
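The equivalence between the two Wald calculations is algebraic rather than a coincidence of this dataset. The sketch below checks it on simulated data; the instrument `z`, the first-stage strength, and the assumed return of 0.08 are hypothetical values chosen for illustration.

```r
# Simulated check that three ways of computing the Wald estimate agree
set.seed(2)
n <- 200000
z <- rbinom(n, 1, 0.75)                  # hypothetical binary instrument
ability <- rnorm(n)                      # unobserved confounder
educ  <- 12 + 0.5 * z + ability + rnorm(n)
lwage <- 5 + 0.08 * educ + 0.10 * ability + rnorm(n, sd = 0.5)

# (a) ratio of differences in group means
wald_means <- (mean(lwage[z == 1]) - mean(lwage[z == 0])) /
              (mean(educ[z == 1])  - mean(educ[z == 0]))
# (b) reduced-form slope divided by first-stage slope
wald_reg <- unname(coef(lm(lwage ~ z))["z"] / coef(lm(educ ~ z))["z"])
# (c) ratio of sample covariances with the instrument
wald_cov <- cov(lwage, z) / cov(educ, z)

print(c(means = wald_means, regressions = wald_reg, covariances = wald_cov))
```

Flipping the coding of the indicator (0 for the first quarter instead of 1) changes the sign of both numerator and denominator, so the ratio is unaffected.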
Next, you will reproduce the first four columns of Table V in the paper (on p. 1000).
5. Explain how the addition of the year of birth variables may improve your estimate of the education effect in this model. Estimate the basic model plus year of birth variables using least squares as in the first column of Table V.
An individual’s wage is influenced by the economic environment, which differs across birth cohorts. Adding year of birth controls for these cohort effects, so the education coefficient is less likely to pick up differences between cohorts. We therefore add year of birth to the OLS model (here as a single linear term, dataset$yob); the summary is shown below. The coefficient on ‘educ’ (i.e. the education effect) is 0.0711 with standard error 0.0003, which matches the result in Table V.
summary(update(ols,.~.+dataset$yob))
##
## Call:
## lm(formula = lwage ~ educ + dataset$yob, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7537 -0.2339 0.0727 0.3372 4.6390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.1620767 0.0137642 375.04 <2e-16 ***
## educ 0.0710812 0.0003390 209.68 <2e-16 ***
## dataset$yob -0.0049081 0.0003829 -12.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6376 on 329506 degrees of freedom
## Multiple R-squared: 0.1177, Adjusted R-squared: 0.1177
## F-statistic: 2.198e+04 on 2 and 329506 DF, p-value: < 2.2e-16
6. Estimate the returns to schooling using two-stage least squares, using all four quarters of birth as instruments, as in column 2 of Table V (except for the last entry). Is it possible to make a Wald estimate of β1 in this model? If so, calculate the Wald estimate, and if not, explain why you can’t make that estimate.
We fit the two-stage least squares model with ivreg() from the AER package. The summary of the second stage is shown below:
library(AER)
tsls <- ivreg(lwage~educ+as.factor(yob)|
as.factor(qob):as.factor(yob)+as.factor(yob),data=dataset)
summary(tsls)
##
## Call:
## ivreg(formula = lwage ~ educ + as.factor(yob) | as.factor(qob):as.factor(yob) +
## as.factor(yob), data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.87808 -0.24040 0.07032 0.34172 4.74912
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.792727 0.200684 23.882 < 2e-16 ***
## educ 0.089115 0.016110 5.532 3.17e-08 ***
## as.factor(yob)31 -0.008881 0.005529 -1.606 0.108228
## as.factor(yob)32 -0.018030 0.005750 -3.136 0.001715 **
## as.factor(yob)33 -0.021796 0.006301 -3.459 0.000542 ***
## as.factor(yob)34 -0.025788 0.006584 -3.917 8.98e-05 ***
## as.factor(yob)35 -0.038827 0.007267 -5.343 9.13e-08 ***
## as.factor(yob)36 -0.038762 0.007977 -4.859 1.18e-06 ***
## as.factor(yob)37 -0.044817 0.008758 -5.117 3.10e-07 ***
## as.factor(yob)38 -0.046545 0.009911 -4.696 2.65e-06 ***
## as.factor(yob)39 -0.058527 0.010458 -5.597 2.19e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6404 on 329498 degrees of freedom
## Multiple R-Squared: 0.1102, Adjusted R-squared: 0.1101
## Wald test: 4.167 on 10 and 329498 DF, p-value: 8.581e-06
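The quantity of interest is the coefficient on ‘educ’, and our estimate is 0.0891 with a standard error of 0.0161. We use the quarters of birth, interacted with year of birth, as our instruments. Because the Wald estimator is defined for a single binary instrument that splits the sample into two groups, and here we use many instruments at once, we cannot compute a single Wald estimate of the education effect in this model.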
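With more than one instrument, the point estimate can instead be obtained by running the two stages by hand. The following is a minimal sketch on simulated data, not the ak95.csv sample; the data-generating values are assumptions. Note that the standard errors printed by the second lm() would be wrong, because they are based on residuals computed with the fitted values rather than the actual educ, which is why ivreg() is used in the analysis itself.

```r
# Manual two-stage least squares with several instruments (simulated data)
set.seed(4)
n <- 50000
yob <- factor(sample(30:39, n, replace = TRUE))
qob <- factor(sample(1:4, n, replace = TRUE))
ability <- rnorm(n)                                    # unobserved confounder
educ  <- 12 + 0.3 * (as.integer(qob) - 1) + ability + rnorm(n)
lwage <- 5 + 0.08 * educ + 0.10 * ability + rnorm(n, sd = 0.5)

# Stage 1: project educ on year dummies and quarter-by-year interactions
stage1   <- lm(educ ~ yob + yob:qob)
educ_hat <- fitted(stage1)

# Stage 2: replace educ by its fitted values (point estimate only)
stage2    <- lm(lwage ~ educ_hat + yob)
beta_2sls <- unname(coef(stage2)["educ_hat"])
beta_ols  <- unname(coef(lm(lwage ~ educ + yob))["educ"])

print(c(tsls = beta_2sls, ols = beta_ols))
```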
7. Show the result of the first-stage regression and a summary of the values you will use for the second stage. Check the relevance of the instruments using the “rule of thumb” given in the slides — do you think the first stage looks valid?
Firstly, we regress the assumed endogenous variable on the year of birth, and its interaction with the quarters of birth in the first stage. The summary of fitted values of the first-stage regression is shown below.
stage1 <- lm(educ~as.factor(qob):as.factor(yob)+as.factor(yob),data=dataset)
summary(stage1$fitted)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.28 12.65 12.75 12.77 12.94 13.12
Summary of the first-stage regression:
summary(stage1)
##
## Call:
## lm(formula = educ ~ as.factor(qob):as.factor(yob) + as.factor(yob),
## data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1165 -1.0134 -0.6509 2.2031 7.7196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.280405 0.035755 343.459 < 2e-16 ***
## as.factor(yob)31 0.260029 0.051796 5.020 5.16e-07 ***
## as.factor(yob)32 0.253526 0.050784 4.992 5.97e-07 ***
## as.factor(yob)33 0.392785 0.051490 7.628 2.38e-14 ***
## as.factor(yob)34 0.366858 0.051551 7.116 1.11e-12 ***
## as.factor(yob)35 0.370502 0.051194 7.237 4.59e-13 ***
## as.factor(yob)36 0.462637 0.050878 9.093 < 2e-16 ***
## as.factor(yob)37 0.551890 0.050885 10.846 < 2e-16 ***
## as.factor(yob)38 0.658272 0.050109 13.137 < 2e-16 ***
## as.factor(yob)39 0.722584 0.050120 14.417 < 2e-16 ***
## as.factor(qob)2:as.factor(yob)30 0.148013 0.050564 2.927 0.003420 **
## as.factor(qob)3:as.factor(yob)30 0.211455 0.050089 4.222 2.43e-05 ***
## as.factor(qob)4:as.factor(yob)30 0.344270 0.051041 6.745 1.53e-11 ***
## as.factor(qob)2:as.factor(yob)31 -0.009389 0.053012 -0.177 0.859419
## as.factor(qob)3:as.factor(yob)31 0.146282 0.052434 2.790 0.005274 **
## as.factor(qob)4:as.factor(yob)31 0.071685 0.053565 1.338 0.180802
## as.factor(qob)2:as.factor(yob)32 0.075668 0.051819 1.460 0.144229
## as.factor(qob)3:as.factor(yob)32 0.126519 0.050911 2.485 0.012952 *
## as.factor(qob)4:as.factor(yob)32 0.193180 0.051572 3.746 0.000180 ***
## as.factor(qob)2:as.factor(yob)33 -0.038483 0.052822 -0.729 0.466279
## as.factor(qob)3:as.factor(yob)33 0.080762 0.052568 1.536 0.124455
## as.factor(qob)4:as.factor(yob)33 0.019077 0.052697 0.362 0.717338
## as.factor(qob)2:as.factor(yob)34 0.080704 0.052848 1.527 0.126739
## as.factor(qob)3:as.factor(yob)34 0.060785 0.051436 1.182 0.237298
## as.factor(qob)4:as.factor(yob)34 0.150605 0.052050 2.893 0.003810 **
## as.factor(qob)2:as.factor(yob)35 0.146019 0.051803 2.819 0.004822 **
## as.factor(qob)3:as.factor(yob)35 0.210444 0.050627 4.157 3.23e-05 ***
## as.factor(qob)4:as.factor(yob)35 0.168732 0.051833 3.255 0.001133 **
## as.factor(qob)2:as.factor(yob)36 0.068034 0.051584 1.319 0.187206
## as.factor(qob)3:as.factor(yob)36 0.143192 0.050607 2.829 0.004663 **
## as.factor(qob)4:as.factor(yob)36 0.184973 0.051560 3.588 0.000334 ***
## as.factor(qob)2:as.factor(yob)37 0.011753 0.051211 0.229 0.818484
## as.factor(qob)3:as.factor(yob)37 0.129877 0.049741 2.611 0.009027 **
## as.factor(qob)4:as.factor(yob)37 0.138687 0.050917 2.724 0.006454 **
## as.factor(qob)2:as.factor(yob)38 0.068986 0.049978 1.380 0.167484
## as.factor(qob)3:as.factor(yob)38 0.047877 0.048762 0.982 0.326174
## as.factor(qob)4:as.factor(yob)38 0.091455 0.049714 1.840 0.065827 .
## as.factor(qob)2:as.factor(yob)39 0.010410 0.049840 0.209 0.834550
## as.factor(qob)3:as.factor(yob)39 -0.013729 0.048394 -0.284 0.776647
## as.factor(qob)4:as.factor(yob)39 0.113541 0.049475 2.295 0.021740 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.276 on 329469 degrees of freedom
## Multiple R-squared: 0.003292, Adjusted R-squared: 0.003174
## F-statistic: 27.9 on 39 and 329469 DF, p-value: < 2.2e-16
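The rule of thumb proposed by Stock, Wright & Yogo (2002) suggests that the first-stage F-statistic should be above 10. The overall F-test for this model has test statistic 27.9 on 39 and 329469 degrees of freedom, which is greater than 10. Note, however, that this overall F-statistic also tests the year-of-birth dummies, which are not excluded instruments. The rule of thumb applies to the F-statistic on the excluded instruments alone, which the ivreg() diagnostics in question 9 report as 4.907 on 30 and 329469 degrees of freedom. So quarter of birth and years of education are clearly correlated, but by the rule of thumb the instruments may be weak, and we continue to the second stage with that caveat.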
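The F-statistic on the excluded instruments can be computed directly by comparing the first stage with and without the quarter-of-birth interactions. This is a sketch on simulated data with assumed effect sizes; the model names `restricted` and `unrestricted` are illustrative.

```r
# Partial F-test on the excluded instruments (simulated data)
set.seed(3)
n <- 50000
yob <- sample(30:39, n, replace = TRUE)
qob <- sample(1:4, n, replace = TRUE)
educ <- 12 + 0.05 * (yob - 30) + 0.1 * (qob - 1) + rnorm(n, sd = 3)
d <- data.frame(educ, yob = factor(yob), qob = factor(qob))

restricted   <- lm(educ ~ yob, data = d)            # exogenous controls only
unrestricted <- lm(educ ~ yob + yob:qob, data = d)  # adds excluded instruments

f_test     <- anova(restricted, unrestricted)
partial_F  <- f_test$F[2]     # F on the quarter-by-year interactions only
partial_df <- f_test$Df[2]    # number of excluded instruments
overall_F  <- unname(summary(unrestricted)$fstatistic["value"])

print(c(partial_F = partial_F, partial_df = partial_df, overall_F = overall_F))
```

The overall F also credits the year dummies for explaining educ, so it can be large even when the quarter-of-birth terms add little.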
8. Next, add age and age squared variables to the model with year-of-birth indicators, and reproduce columns 3 and 4 of Table V (except for the last entry of column 4).
We transform yob into each individual’s age, measured to the quarter, by the formula: \[ age = 1980 - \left(1900 + yob + \frac{qob - 1}{4}\right) \]
#construct variable age and its square
dataset$age<-1980-(1900+ dataset$yob + ((dataset$qob - 1) / 4))
dataset$age_sq<-dataset$age^2
New OLS regression summary:
ols_2<-lm(lwage~educ+age+age_sq,data=dataset)
summary(ols_2)
##
## Call:
## lm(formula = lwage ~ educ + age + age_sq, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7506 -0.2344 0.0730 0.3375 4.6398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.1369414 0.2996881 17.141 <2e-16 ***
## educ 0.0710840 0.0003390 209.671 <2e-16 ***
## age -0.0112729 0.0133287 -0.846 0.398
## age_sq 0.0001782 0.0001477 1.207 0.228
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6376 on 329505 degrees of freedom
## Multiple R-squared: 0.1177, Adjusted R-squared: 0.1177
## F-statistic: 1.465e+04 on 3 and 329505 DF, p-value: < 2.2e-16
New TSLS summary:
tsls_2<- ivreg(lwage~educ+as.factor(yob)+age+age_sq|
as.factor(qob):as.factor(yob)+as.factor(yob)+age+age_sq,data=dataset)
summary(tsls_2)
##
## Call:
## ivreg(formula = lwage ~ educ + as.factor(yob) + age + age_sq |
## as.factor(qob):as.factor(yob) + as.factor(yob) + age + age_sq,
## data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.78624 -0.23531 0.07145 0.33982 4.67455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.8851603 1.4019445 4.911 9.06e-07 ***
## educ 0.0760832 0.0289606 2.627 0.00861 **
## as.factor(yob)31 -0.0055308 0.0094045 -0.588 0.55646
## as.factor(yob)32 -0.0142916 0.0170480 -0.838 0.40185
## as.factor(yob)33 -0.0190994 0.0239052 -0.799 0.42431
## as.factor(yob)34 -0.0261532 0.0304215 -0.860 0.38996
## as.factor(yob)35 -0.0434617 0.0354410 -1.226 0.22008
## as.factor(yob)36 -0.0493922 0.0399116 -1.238 0.21589
## as.factor(yob)37 -0.0631266 0.0438629 -1.439 0.15010
## as.factor(yob)38 -0.0738085 0.0470496 -1.569 0.11671
## as.factor(yob)39 -0.0970774 0.0513056 -1.892 0.05847 .
## age -0.0801672 0.0644586 -1.244 0.21361
## age_sq 0.0008317 0.0007343 1.133 0.25736
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6378 on 329496 degrees of freedom
## Multiple R-Squared: 0.1172, Adjusted R-squared: 0.1171
## Wald test: 3.713 on 12 and 329496 DF, p-value: 1.229e-05
9. Because you have four quarter of birth instruments, conduct tests to check whether the instruments seem truly exogenous in your 2SLS model estimates, like the final entries of columns 2 and 4 of Table V. Discuss what the result of your test implies in plain English.
summary(tsls,diagnostics = TRUE) #Sargan test for the first TSLS
##
## Call:
## ivreg(formula = lwage ~ educ + as.factor(yob) | as.factor(qob):as.factor(yob) +
## as.factor(yob), data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.87808 -0.24040 0.07032 0.34172 4.74912
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.792727 0.200684 23.882 < 2e-16 ***
## educ 0.089115 0.016110 5.532 3.17e-08 ***
## as.factor(yob)31 -0.008881 0.005529 -1.606 0.108228
## as.factor(yob)32 -0.018030 0.005750 -3.136 0.001715 **
## as.factor(yob)33 -0.021796 0.006301 -3.459 0.000542 ***
## as.factor(yob)34 -0.025788 0.006584 -3.917 8.98e-05 ***
## as.factor(yob)35 -0.038827 0.007267 -5.343 9.13e-08 ***
## as.factor(yob)36 -0.038762 0.007977 -4.859 1.18e-06 ***
## as.factor(yob)37 -0.044817 0.008758 -5.117 3.10e-07 ***
## as.factor(yob)38 -0.046545 0.009911 -4.696 2.65e-06 ***
## as.factor(yob)39 -0.058527 0.010458 -5.597 2.19e-08 ***
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 30 329469 4.907 <2e-16 ***
## Wu-Hausman 1 329497 1.264 0.261
## Sargan 29 NA 25.439 0.655
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6404 on 329498 degrees of freedom
## Multiple R-Squared: 0.1102, Adjusted R-squared: 0.1101
## Wald test: 4.167 on 10 and 329498 DF, p-value: 8.581e-06
In the first Sargan test, the test statistic is 25.439 on 29 degrees of freedom. The corresponding p-value is 0.655, so there is not enough evidence to reject the null hypothesis that the instruments are valid. In plain English, the Sargan test finds no sign that the instruments are related to the residuals \(\hat u\).
summary(tsls_2,diagnostics = TRUE)#Sargan test for the second TSLS
##
## Call:
## ivreg(formula = lwage ~ educ + as.factor(yob) + age + age_sq |
## as.factor(qob):as.factor(yob) + as.factor(yob) + age + age_sq,
## data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.78624 -0.23531 0.07145 0.33982 4.67455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.8851603 1.4019445 4.911 9.06e-07 ***
## educ 0.0760832 0.0289606 2.627 0.00861 **
## as.factor(yob)31 -0.0055308 0.0094045 -0.588 0.55646
## as.factor(yob)32 -0.0142916 0.0170480 -0.838 0.40185
## as.factor(yob)33 -0.0190994 0.0239052 -0.799 0.42431
## as.factor(yob)34 -0.0261532 0.0304215 -0.860 0.38996
## as.factor(yob)35 -0.0434617 0.0354410 -1.226 0.22008
## as.factor(yob)36 -0.0493922 0.0399116 -1.238 0.21589
## as.factor(yob)37 -0.0631266 0.0438629 -1.439 0.15010
## as.factor(yob)38 -0.0738085 0.0470496 -1.569 0.11671
## as.factor(yob)39 -0.0970774 0.0513056 -1.892 0.05847 .
## age -0.0801672 0.0644586 -1.244 0.21361
## age_sq 0.0008317 0.0007343 1.133 0.25736
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 29 329468 1.558 0.0282 *
## Wu-Hausman 1 329495 0.030 0.8626
## Sargan 29 NA 23.128 0.7706
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6378 on 329496 degrees of freedom
## Multiple R-Squared: 0.1172, Adjusted R-squared: 0.1171
## Wald test: 3.713 on 12 and 329496 DF, p-value: 1.229e-05
In the second Sargan test, the test statistic is 23.128. The output above reports 29 degrees of freedom, and the p-value of 0.7706 corresponds to that count; one would expect 27, since age and squared age use up two more degrees of freedom, so the reported count is worth double-checking. Either way, the p-value is large and does not provide enough evidence against the null hypothesis. So, the Sargan test suggests that the instruments are not related to the residuals \(\hat u\). By and large, the two tests give the same result: we find no evidence against the exogeneity of our instruments in the TSLS estimates.
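The Sargan statistic can also be computed by hand as n times the R-squared from a regression of the 2SLS residuals on all instruments, compared with a chi-squared distribution whose degrees of freedom equal the number of excluded instruments minus the number of endogenous regressors. The sketch below does this on simulated data in which the instruments are valid by construction; all data-generating values are assumptions. The last line recomputes the p-value for the first reported statistic from its value and degrees of freedom.

```r
# Manual Sargan overidentification test (simulated data, valid instruments)
set.seed(5)
n <- 50000
yob <- factor(sample(30:39, n, replace = TRUE))
qob <- factor(sample(1:4, n, replace = TRUE))
ability <- rnorm(n)
educ  <- 12 + 0.3 * (as.integer(qob) - 1) + ability + rnorm(n)
lwage <- 5 + 0.08 * educ + 0.10 * ability + rnorm(n, sd = 0.5)

# 2SLS point estimates via two manual stages
educ_hat <- fitted(lm(educ ~ yob + yob:qob))
stage2   <- lm(lwage ~ educ_hat + yob)

# 2SLS residuals must use the actual educ, not the fitted values
X     <- model.matrix(~ educ + yob)
u_hat <- as.numeric(lwage - X %*% coef(stage2))

# Regress residuals on all instruments; statistic is n * R^2
aux         <- lm(u_hat ~ yob + yob:qob)
sargan_stat <- n * summary(aux)$r.squared
sargan_df   <- 30 - 1   # 30 excluded instruments, 1 endogenous regressor
sargan_p    <- pchisq(sargan_stat, df = sargan_df, lower.tail = FALSE)
print(c(statistic = sargan_stat, df = sargan_df, p = sargan_p))

# p-value implied by the first reported statistic (25.439 on 29 df)
p_reported <- pchisq(25.439, df = 29, lower.tail = FALSE)
print(p_reported)
```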
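10. Finally, discuss the way that the LATE theorem limits some of the estimates you have made in the previous parts. Discuss the main assumptions, what they mean in the context of this particular analysis, and what you can conclude as a result about what you have found.
The LATE theorem limits the interpretation of our IV estimates: under its assumptions, the Wald and TSLS estimates identify the average return to schooling only for compliers, that is, individuals whose years of education are changed by their quarter of birth, rather than for the whole population. The main assumptions are instrument independence, the exclusion restriction, and monotonicity. In this analysis, independence means that quarter of birth is as good as randomly assigned with respect to ability and other determinants of wages. The exclusion restriction means that quarter of birth affects wages only through years of completed education. Monotonicity means that being born later in the year never reduces anyone’s schooling. The first stage shows that quarter of birth has only a small effect on the level of schooling, so the compliers are likely a small group. The TSLS and OLS estimates are not statistically different from each other (the Wu-Hausman p-values are 0.261 and 0.8626), and because the TSLS point estimates are at or above the OLS ones, any omitted-variable bias in OLS appears to be small or negative. If the assumptions hold, we can conclude that, for individuals whose schooling was shifted by quarter of birth, extra years of education raise wages.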
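The point that an IV estimate speaks only for compliers can be seen in a stylized simulation. Everything here is hypothetical: a binary instrument `z`, a binary schooling indicator `d`, three latent types, and type-specific effects chosen for illustration. Independence, exclusion, and monotonicity hold by construction (there are no defiers).

```r
# Stylized LATE simulation: the Wald estimate tracks the complier effect
set.seed(6)
n <- 1000000
type <- sample(c("complier", "always", "never"), n, replace = TRUE,
               prob = c(0.2, 0.5, 0.3))
z <- rbinom(n, 1, 0.5)                       # randomly assigned instrument
# Treatment take-up: only compliers respond to the instrument
d <- ifelse(type == "always", 1, ifelse(type == "never", 0, z))
# Heterogeneous effects of treatment on the outcome, by type
effect <- ifelse(type == "complier", 0.10,
                 ifelse(type == "always", 0.05, 0.02))
y <- 5 + effect * d + rnorm(n, sd = 0.5)     # z affects y only through d

wald <- (mean(y[z == 1]) - mean(y[z == 0])) /
        (mean(d[z == 1]) - mean(d[z == 0]))
late <- mean(effect[type == "complier"])     # average effect among compliers
ate  <- mean(effect)                         # average effect in the population

print(c(wald = wald, late = late, ate = ate))
```

In this setup the Wald estimate is close to the complier effect and not to the population average, which is the sense in which the estimates in the earlier parts are local.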