library(pacman)
p_load(causalweight, lmtest, sandwich, AER, ivmodel, haven, estimatr, tidyverse,
lubridate, usmap, gridExtra, stringr, readxl,
reshape2, scales, broom, data.table, ggplot2, stargazer,
foreign, ggthemes, ggforce, ggridges, latex2exp, viridis, extrafont,
kableExtra, snakecase, janitor)
Author: David Card
Title: Using Geographic Variation in College
Proximity to Estimate the Returns to Schooling
Journal: NBER
Year: 1993
DOI: 10.3386/w4483
The author wanted to estimate returns to education by analyzing the causal link between education and earnings.
Outcome variable: wage
Treatment variable: years of schooling
The treatment variable is not exogenous because education levels are not randomly assigned among population and those individual characteristics that determine education tend to also affect earnings. Individuals with higher “ability” may acquire more education and have higher marginal returns on earnings. And the returns to schooling are inherently different across population, and people would make education choices based on their true returns. In addition, measurement error in schooling is another major source of endogeneity in such models.
Instrumental variable: college proximity – indicator of the presence of an accredited 4-year college in the local labor market
library(dagitty)
library(ggdag)
dag <- dagitty("dag {
YS -> E
X -> E
CP -> YS
X -> YS
}")
# YS: years of schooling
# E: earnings
# CP: college proximity
ggdag(dag) +
theme_dag()
Individuals live in areas with nearby colleges bear lower costs of schooling, which can potentially encourage people to go to college especially among lower income families. The author ran a linear model and found that the mean level of education in higher for those who grew up near a college for every quantile of predicted education, with strongest effects in the lowest quantile.
According to the author’s reduced form estimates, college proximity has very little and insignificant direct effects on earnings. Instead, college proximity affects earnings mainly through schooling decisions: families who choose to live near a college tend to value education more, and areas with college may have higher education quality.
Using the survey of NLS Young Men Cohort and college proximity as an IV, the author found that men who grew up in areas with a nearby 4-year college have significantly higher schooling and significantly higher earnings. This effect is concentrated among men with poorly-educated parents. Conventional OLS method substantially under-estimate the returns to education.
In Oh and Vukina’s paper, they use the Hausman-Nevo method to instrument for egg prices in California. They ran first-stage regression and showed that the estimates on the IV is statistically significant, implying that the relevance condition holds. Since the prices are also affected by demand-side shocks that are correlated across markets, it’s trickier to prove the exclusion restriction condition. To solve for that, the authors controlled for observed variables that might explain demand variations across market, including Easter and Christmas seasonal dummies. They argued that the sellers would manipulate prices through different marketing strategies in response to rising demand for eggs during these holiday seasons. The authors also controlled for individual banners’ preference parameters that affect the likelihood of purchasing a particular egg UPC at a particular retailer, because retailers have different product differentiation which attract different groups of consumers. Thirdly, the authors included an interaction term between 2015 dummy and cage-free attribute due to the potential influence of the avian flue episode on consumers’ perception on cage-free eggs. Lastly, they argues that the prices of the same egg product across markets are only correlated through common cost shocks as retailers mostly procure eggs locally.
Evidently from the mixed logit demand model, the authors found significant estimates that support their arguments. The likelihood of buying eggs is larger during Easter season the rest of the year; households do exhibit some degree of retail chain loyalty; and households tend to buy less cage-free eggs during two years after the avian flue episode.
library(causalweight)
data(JC)
?JC
table(JC$assignment, JC$trainy1)
##
## 0 1
## 0 1809 1854
## 1 857 4720
# without regression
ITT_manual <- mean(JC$earny4[JC$assignment == 1]) - mean(JC$earny4[JC$assignment == 0])
print(paste("ITT without regression:", ITT_manual))
## [1] "ITT without regression: 16.0551273200988"
# with regression
ITT_model <- lm(earny4 ~ assignment, data = JC)
summary(ITT_model)
##
## Call:
## lm(formula = earny4 ~ assignment, data = JC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -213.98 -164.65 -24.02 99.25 2211.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 197.926 3.212 61.620 < 2e-16 ***
## assignment 16.055 4.134 3.883 0.000104 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 194.4 on 9238 degrees of freedom
## Multiple R-squared: 0.00163, Adjusted R-squared: 0.001522
## F-statistic: 15.08 on 1 and 9238 DF, p-value: 0.0001038
library(AER)
# without regression
LATE_manual <- ITT_manual / complier_share_manual
print(paste("LATE without regression:", LATE_manual))
## [1] "LATE without regression: 47.1944993785445"
# with regression
LATE_model <- ivreg(earny4 ~ trainy1 | assignment, data = JC)
summary(LATE_model)
##
## Call:
## ivreg(formula = earny4 ~ trainy1 | assignment, data = JC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -221.23 -165.43 -22.55 100.01 2235.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 174.039 8.909 19.536 < 2e-16 ***
## trainy1 47.194 12.192 3.871 0.000109 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 195 on 9238 degrees of freedom
## Multiple R-Squared: -0.004797, Adjusted R-squared: -0.004905
## Wald test: 14.98 on 1 and 9238 DF, p-value: 0.0001092
On average, participating in the JC program increased the compliers’ earning by $47.195. While the average effect is $16.055 among all who received the assignment. These estimations are both statistically significant at 0.1%. The ITT is smaller because it includes those who did not comply with the assignment thus not getting treated eventually.
library(haven)
setwd("/Users/gracegao/Desktop/EEFE530/PS4")
card <- read_dta("Card1995.dta")
library(modelsummary)
card$exp <- card$age76 - card$ed76 - 6
# OLS
ols_1 <- lm(lwage76 ~ ed76 + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)
# IV
iv_1 <- ivreg(lwage76 ~ ed76 + exp + I(exp^2/100) + black + reg76r + smsa76r | nearc4 + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)
# first stage
first_stage_1 <- lm(ed76 ~ nearc4 + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)
# reduced form
reduced_form_1 <- lm(lwage76 ~ nearc4 + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)
# summary table
models <- list(
"OLS" = ols_1,
"IV" = iv_1,
"First-Stage" = first_stage_1,
"Reduced Form" = reduced_form_1
)
coef_map <- c("ed76" = "Education",
"nearc4" = "College Proximity",
"exp" = "Experience",
"I(exp^2/100)" = "Experience^2/100",
"black" = "Black",
"reg76r" = "Southern Region",
"smsa76r" = "Urban")
modelsummary(models,
stars = TRUE,
title = "Regression Results: OLS, IV, First-Stage, and Reduced Form",
fmt = 5,
coef_map = coef_map)
| OLS | IV | First-Stage | Reduced Form | |
|---|---|---|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||||
| Education | 0.07401*** | 0.13229** | ||
| (0.00351) | (0.04923) | |||
| College Proximity | 0.30628*** | 0.04462** | ||
| (0.07666) | (0.01701) | |||
| Experience | 0.08360*** | 0.10750*** | -0.35881*** | 0.05326*** |
| (0.00665) | (0.02130) | (0.03040) | (0.00695) | |
| Experience^2/100 | -0.22409*** | -0.22841*** | -0.21620 | -0.21872*** |
| (0.03178) | (0.03341) | (0.14590) | (0.03402) | |
| Black | -0.18963*** | -0.13080* | -1.03873*** | -0.26390*** |
| (0.01763) | (0.05287) | (0.08358) | (0.01848) | |
| Southern Region | -0.12486*** | -0.10490*** | -0.32964*** | -0.14346*** |
| (0.01512) | (0.02307) | (0.07385) | (0.01634) | |
| Urban | 0.16142*** | 0.13132*** | 0.39091*** | 0.18475*** |
| (0.01557) | (0.03013) | (0.07788) | (0.01750) | |
| Num.Obs. | 3010 | 3010 | 3613 | 3010 |
| R2 | 0.291 | 0.225 | 0.481 | 0.187 |
| R2 Adj. | 0.289 | 0.224 | 0.480 | 0.185 |
| AIC | 2633.4 | 2898.4 | 15205.5 | 3043.1 |
| BIC | 2681.5 | 2946.5 | 15255.0 | 3091.2 |
| Log.Lik. | -1308.702 | -7594.738 | -1513.546 | |
| F | 204.932 | 557.743 | 115.164 | |
| RMSE | 0.37 | 0.39 | 1.98 | 0.40 |
# OLS
ols_2 <- lm(lwage76 ~ ed76 + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)
# IV
iv_2 <- ivreg(lwage76 ~ ed76 + exp + I(exp^2/100) + black + reg76r + smsa76r | nearc4a + nearc4b + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)
# first stage
first_stage_2 <- lm(ed76 ~ nearc4a + nearc4b + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)
# reduced form
reduced_form_2 <- lm(lwage76 ~ nearc4a + nearc4b + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)
# summary table
models <- list(
"OLS" = ols_2,
"IV" = iv_2,
"First-Stage" = first_stage_2,
"Reduced Form" = reduced_form_2
)
coef_map <- c("ed76" = "Education",
"nearc4a" = "Public College Proximity",
"nearc4b" = "Private College Proximity",
"exp" = "Experience",
"I(exp^2/100)" = "Experience^2/100",
"black" = "Black",
"reg76r" = "Southern Region",
"smsa76r" = "Urban")
modelsummary(models,
stars = TRUE,
title = "Regression Results: OLS, IV, First-Stage, and Reduced Form",
fmt = 5,
coef_map = coef_map)
| OLS | IV | First-Stage | Reduced Form | |
|---|---|---|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||||
| Education | 0.07401*** | 0.16109*** | ||
| (0.00351) | (0.04077) | |||
| Public College Proximity | 0.39599*** | 0.06401*** | ||
| (0.08119) | (0.01801) | |||
| Private College Proximity | 0.09648 | -0.00011 | ||
| (0.09929) | (0.02191) | |||
| Experience | 0.08360*** | 0.11931*** | -0.36039*** | 0.05258*** |
| (0.00665) | (0.01818) | (0.03036) | (0.00694) | |
| Experience^2/100 | -0.22409*** | -0.23054*** | -0.20488 | -0.21464*** |
| (0.03178) | (0.03503) | (0.14573) | (0.03399) | |
| Black | -0.18963*** | -0.10173* | -1.03976*** | -0.26394*** |
| (0.01763) | (0.04531) | (0.08347) | (0.01846) | |
| Southern Region | -0.12486*** | -0.09504*** | -0.30427*** | -0.13838*** |
| (0.01512) | (0.02165) | (0.07414) | (0.01639) | |
| Urban | 0.16142*** | 0.11645*** | 0.38804*** | 0.18389*** |
| (0.01557) | (0.02705) | (0.07778) | (0.01748) | |
| Num.Obs. | 3010 | 3010 | 3613 | 3010 |
| R2 | 0.291 | 0.145 | 0.483 | 0.190 |
| R2 Adj. | 0.289 | 0.143 | 0.482 | 0.188 |
| AIC | 2633.4 | 3196.0 | 15196.5 | 3034.6 |
| BIC | 2681.5 | 3244.0 | 15252.2 | 3088.7 |
| Log.Lik. | -1308.702 | -7589.231 | -1508.323 | |
| F | 204.932 | 480.965 | 100.513 | |
| RMSE | 0.37 | 0.41 | 1.98 | 0.40 |
Using college proximity as instruments, the return on education is expected to be 16% increase in earnings for every additional years of schooling. The first stage estimation confirms the relevance between eduction choice and public college proximity, implying that living near a public college increases time of schooling by 0.40 years. The OLS model under estimates the effect by about 8.7%.
Education is endogenous because factors like personal abilities and family background are correlated with both education and earnings. Experience is also likely to be endogenous because we define it using years of education. Then, factors like abilities introduce endogeneity into the variable.
card$interaction1 <- card$nearc4a * card$age76
card$interaction2 <- card$nearc4a * (card$age76)^2 / 100
# 2SLS
structural_model <- ivreg(lwage76 ~ ed76 + exp + I(exp^2/100) + black + reg76r + smsa76r | nearc4a + nearc4b + interaction1 + interaction2 + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)
summary(structural_model)
##
## Call:
## ivreg(formula = lwage76 ~ ed76 + exp + I(exp^2/100) + black +
## reg76r + smsa76r | nearc4a + nearc4b + interaction1 + interaction2 +
## exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.61638 -0.22444 0.02206 0.24233 1.34656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.590107 0.106727 43.008 < 2e-16 ***
## ed76 0.082539 0.006030 13.688 < 2e-16 ***
## exp 0.087094 0.006952 12.529 < 2e-16 ***
## I(exp^2/100) -0.224720 0.031817 -7.063 2.02e-12 ***
## black -0.181022 0.018325 -9.878 < 2e-16 ***
## reg76r -0.121940 0.015226 -8.009 1.64e-15 ***
## smsa76r 0.157018 0.015793 9.942 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3746 on 3003 degrees of freedom
## Multiple R-Squared: 0.2891, Adjusted R-squared: 0.2877
## Wald test: 161.6 on 6 and 3003 DF, p-value: < 2.2e-16
This estimated effect is half of the size of the coefficient in the last model.
summary(structural_model, diagnostics = TRUE)
##
## Call:
## ivreg(formula = lwage76 ~ ed76 + exp + I(exp^2/100) + black +
## reg76r + smsa76r | nearc4a + nearc4b + interaction1 + interaction2 +
## exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.61638 -0.22444 0.02206 0.24233 1.34656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.590107 0.106727 43.008 < 2e-16 ***
## ed76 0.082539 0.006030 13.688 < 2e-16 ***
## exp 0.087094 0.006952 12.529 < 2e-16 ***
## I(exp^2/100) -0.224720 0.031817 -7.063 2.02e-12 ***
## black -0.181022 0.018325 -9.878 < 2e-16 ***
## reg76r -0.121940 0.015226 -8.009 1.64e-15 ***
## smsa76r 0.157018 0.015793 9.942 < 2e-16 ***
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 4 3000 384.015 <2e-16 ***
## Wu-Hausman 1 3002 3.034 0.0817 .
## Sargan 3 NA 10.978 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3746 on 3003 degrees of freedom
## Multiple R-Squared: 0.2891, Adjusted R-squared: 0.2877
## Wald test: 161.6 on 6 and 3003 DF, p-value: < 2.2e-16
The Wu-Hausman test gives a test statistic of 3.034 with p-value of 0.0817, which is slightly above conventional cutoff of 0.05. The test gives week evidence of an exogenous ed76 in the structural model, so an IV might still be necessary.