library(pacman)
p_load(causalweight, lmtest, sandwich, AER, ivmodel, haven, estimatr, tidyverse,
lubridate, usmap, gridExtra, stringr, readxl,
reshape2, scales, broom, data.table, ggplot2, stargazer,
foreign, ggthemes, ggforce, ggridges, latex2exp, viridis, extrafont,
kableExtra, snakecase, janitor)

i. Example of an IV paper

a

Author: David Card
Title: Using Geographic Variation in College Proximity to Estimate the Returns to Schooling
Journal: NBER
Year: 1993
DOI: 10.3386/w4483

b

The author wanted to estimate returns to education by analyzing the causal link between education and earnings.

c

Outcome variable: wage
Treatment variable: years of schooling

The treatment variable is not exogenous because education levels are not randomly assigned among population and those individual characteristics that determine education tend to also affect earnings. Individuals with higher “ability” may acquire more education and have higher marginal returns on earnings. And the returns to schooling are inherently different across population, and people would make education choices based on their true returns. In addition, measurement error in schooling is another major source of endogeneity in such models.

d

Instrumental variable: college proximity – indicator of the presence of an accredited 4-year college in the local labor market

e

library(dagitty)
library(ggdag)

dag <- dagitty("dag {
  YS -> E
  X -> E
  CP -> YS
  X -> YS
}")

# YS: years of schooling
# E: earnings
# CP: college proximity

ggdag(dag) +
  theme_dag()

f

Individuals live in areas with nearby colleges bear lower costs of schooling, which can potentially encourage people to go to college especially among lower income families. The author ran a linear model and found that the mean level of education in higher for those who grew up near a college for every quantile of predicted education, with strongest effects in the lowest quantile.

g

According to the author’s reduced form estimates, college proximity has very little and insignificant direct effects on earnings. Instead, college proximity affects earnings mainly through schooling decisions: families who choose to live near a college tend to value education more, and areas with college may have higher education quality.

h

Using the survey of NLS Young Men Cohort and college proximity as an IV, the author found that men who grew up in areas with a nearby 4-year college have significantly higher schooling and significantly higher earnings. This effect is concentrated among men with poorly-educated parents. Conventional OLS method substantially under-estimate the returns to education.

ii. Hausman-Nevo instrument

In Oh and Vukina’s paper, they use the Hausman-Nevo method to instrument for egg prices in California. They ran first-stage regression and showed that the estimates on the IV is statistically significant, implying that the relevance condition holds. Since the prices are also affected by demand-side shocks that are correlated across markets, it’s trickier to prove the exclusion restriction condition. To solve for that, the authors controlled for observed variables that might explain demand variations across market, including Easter and Christmas seasonal dummies. They argued that the sellers would manipulate prices through different marketing strategies in response to rising demand for eggs during these holiday seasons. The authors also controlled for individual banners’ preference parameters that affect the likelihood of purchasing a particular egg UPC at a particular retailer, because retailers have different product differentiation which attract different groups of consumers. Thirdly, the authors included an interaction term between 2015 dummy and cage-free attribute due to the potential influence of the avian flue episode on consumers’ perception on cage-free eggs. Lastly, they argues that the prices of the same egg product across markets are only correlated through common cost shocks as retailers mostly procure eggs locally.

Evidently from the mixed logit demand model, the authors found significant estimates that support their arguments. The likelihood of buying eggs is larger during Easter season the rest of the year; households do exhibit some degree of retail chain loyalty; and households tend to buy less cage-free eggs during two years after the avian flue episode.

iii. ITT and LATE

library(causalweight)
data(JC)
?JC
table(JC$assignment, JC$trainy1)
##    
##        0    1
##   0 1809 1854
##   1  857 4720

a. ITT

# without regression
ITT_manual <- mean(JC$earny4[JC$assignment == 1]) - mean(JC$earny4[JC$assignment == 0])
print(paste("ITT without regression:", ITT_manual))
## [1] "ITT without regression: 16.0551273200988"
# with regression
ITT_model <- lm(earny4 ~ assignment, data = JC)
summary(ITT_model)
## 
## Call:
## lm(formula = earny4 ~ assignment, data = JC)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -213.98 -164.65  -24.02   99.25 2211.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  197.926      3.212  61.620  < 2e-16 ***
## assignment    16.055      4.134   3.883 0.000104 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 194.4 on 9238 degrees of freedom
## Multiple R-squared:  0.00163,    Adjusted R-squared:  0.001522 
## F-statistic: 15.08 on 1 and 9238 DF,  p-value: 0.0001038

b. Complier share

# without regression
complier_share_manual <- mean(JC$trainy1[JC$assignment == 1]) - mean(JC$trainy1[JC$assignment == 0])
print(paste("Complier share without regression:", complier_share_manual))
## [1] "Complier share without regression: 0.340190647882956"
# with regression
complier_share_model <- lm(trainy1 ~ assignment, data = JC)
summary(complier_share_model)
## 
## Call:
## lm(formula = trainy1 ~ assignment, data = JC)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8463 -0.5061  0.1537  0.1537  0.4939 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.506143   0.006964   72.68   <2e-16 ***
## assignment  0.340191   0.008963   37.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4215 on 9238 degrees of freedom
## Multiple R-squared:  0.1349, Adjusted R-squared:  0.1348 
## F-statistic:  1440 on 1 and 9238 DF,  p-value: < 2.2e-16

c. LATE

library(AER)

# without regression
LATE_manual <- ITT_manual / complier_share_manual
print(paste("LATE without regression:", LATE_manual))
## [1] "LATE without regression: 47.1944993785445"
# with regression
LATE_model <- ivreg(earny4 ~ trainy1 | assignment, data = JC)
summary(LATE_model)
## 
## Call:
## ivreg(formula = earny4 ~ trainy1 | assignment, data = JC)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -221.23 -165.43  -22.55  100.01 2235.87 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  174.039      8.909  19.536  < 2e-16 ***
## trainy1       47.194     12.192   3.871 0.000109 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195 on 9238 degrees of freedom
## Multiple R-Squared: -0.004797,   Adjusted R-squared: -0.004905 
## Wald test: 14.98 on 1 and 9238 DF,  p-value: 0.0001092

On average, participating in the JC program increased the compliers’ earning by $47.195. While the average effect is $16.055 among all who received the assignment. These estimations are both statistically significant at 0.1%. The ITT is smaller because it includes those who did not comply with the assignment thus not getting treated eventually.

iv. Replication and extension

library(haven)
setwd("/Users/gracegao/Desktop/EEFE530/PS4")
card <- read_dta("Card1995.dta")

a. IV: proximity to college

library(modelsummary)

card$exp <- card$age76 - card$ed76 - 6

# OLS
ols_1 <- lm(lwage76 ~ ed76 + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)

# IV
iv_1 <- ivreg(lwage76 ~ ed76 + exp + I(exp^2/100) + black + reg76r + smsa76r | nearc4 + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)

# first stage
first_stage_1 <- lm(ed76 ~ nearc4 + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)

# reduced form
reduced_form_1 <- lm(lwage76 ~ nearc4 + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)

# summary table
models <- list(
  "OLS" = ols_1,
  "IV" = iv_1,
  "First-Stage" = first_stage_1,
  "Reduced Form" = reduced_form_1
)

coef_map <- c("ed76" = "Education",
              "nearc4" = "College Proximity",
              "exp" = "Experience",
              "I(exp^2/100)" = "Experience^2/100",
              "black" = "Black",
              "reg76r" = "Southern Region",
              "smsa76r" = "Urban")

modelsummary(models,
             stars = TRUE,
             title = "Regression Results: OLS, IV, First-Stage, and Reduced Form",
             fmt = 5,
             coef_map = coef_map)
Regression Results: OLS, IV, First-Stage, and Reduced Form
OLS IV First-Stage Reduced Form
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Education 0.07401*** 0.13229**
(0.00351) (0.04923)
College Proximity 0.30628*** 0.04462**
(0.07666) (0.01701)
Experience 0.08360*** 0.10750*** -0.35881*** 0.05326***
(0.00665) (0.02130) (0.03040) (0.00695)
Experience^2/100 -0.22409*** -0.22841*** -0.21620 -0.21872***
(0.03178) (0.03341) (0.14590) (0.03402)
Black -0.18963*** -0.13080* -1.03873*** -0.26390***
(0.01763) (0.05287) (0.08358) (0.01848)
Southern Region -0.12486*** -0.10490*** -0.32964*** -0.14346***
(0.01512) (0.02307) (0.07385) (0.01634)
Urban 0.16142*** 0.13132*** 0.39091*** 0.18475***
(0.01557) (0.03013) (0.07788) (0.01750)
Num.Obs. 3010 3010 3613 3010
R2 0.291 0.225 0.481 0.187
R2 Adj. 0.289 0.224 0.480 0.185
AIC 2633.4 2898.4 15205.5 3043.1
BIC 2681.5 2946.5 15255.0 3091.2
Log.Lik. -1308.702 -7594.738 -1513.546
F 204.932 557.743 115.164
RMSE 0.37 0.39 1.98 0.40

b. IV: proximity to private and public college

# OLS
ols_2 <- lm(lwage76 ~ ed76 + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)

# IV
iv_2 <- ivreg(lwage76 ~ ed76 + exp + I(exp^2/100) + black + reg76r + smsa76r | nearc4a + nearc4b + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)

# first stage
first_stage_2 <- lm(ed76 ~ nearc4a + nearc4b + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)

# reduced form
reduced_form_2 <- lm(lwage76 ~ nearc4a + nearc4b + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)

# summary table
models <- list(
  "OLS" = ols_2,
  "IV" = iv_2,
  "First-Stage" = first_stage_2,
  "Reduced Form" = reduced_form_2
)

coef_map <- c("ed76" = "Education",
              "nearc4a" = "Public College Proximity",
              "nearc4b" = "Private College Proximity",
              "exp" = "Experience",
              "I(exp^2/100)" = "Experience^2/100",
              "black" = "Black",
              "reg76r" = "Southern Region",
              "smsa76r" = "Urban")

modelsummary(models,
             stars = TRUE,
             title = "Regression Results: OLS, IV, First-Stage, and Reduced Form",
             fmt = 5,
             coef_map = coef_map)
Regression Results: OLS, IV, First-Stage, and Reduced Form
OLS IV First-Stage Reduced Form
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Education 0.07401*** 0.16109***
(0.00351) (0.04077)
Public College Proximity 0.39599*** 0.06401***
(0.08119) (0.01801)
Private College Proximity 0.09648 -0.00011
(0.09929) (0.02191)
Experience 0.08360*** 0.11931*** -0.36039*** 0.05258***
(0.00665) (0.01818) (0.03036) (0.00694)
Experience^2/100 -0.22409*** -0.23054*** -0.20488 -0.21464***
(0.03178) (0.03503) (0.14573) (0.03399)
Black -0.18963*** -0.10173* -1.03976*** -0.26394***
(0.01763) (0.04531) (0.08347) (0.01846)
Southern Region -0.12486*** -0.09504*** -0.30427*** -0.13838***
(0.01512) (0.02165) (0.07414) (0.01639)
Urban 0.16142*** 0.11645*** 0.38804*** 0.18389***
(0.01557) (0.02705) (0.07778) (0.01748)
Num.Obs. 3010 3010 3613 3010
R2 0.291 0.145 0.483 0.190
R2 Adj. 0.289 0.143 0.482 0.188
AIC 2633.4 3196.0 15196.5 3034.6
BIC 2681.5 3244.0 15252.2 3088.7
Log.Lik. -1308.702 -7589.231 -1508.323
F 204.932 480.965 100.513
RMSE 0.37 0.41 1.98 0.40

Using college proximity as instruments, the return on education is expected to be 16% increase in earnings for every additional years of schooling. The first stage estimation confirms the relevance between eduction choice and public college proximity, implying that living near a public college increases time of schooling by 0.40 years. The OLS model under estimates the effect by about 8.7%.

c. Endogeneity

Education is endogenous because factors like personal abilities and family background are correlated with both education and earnings. Experience is also likely to be endogenous because we define it using years of education. Then, factors like abilities introduce endogeneity into the variable.

card$interaction1 <- card$nearc4a * card$age76
card$interaction2 <- card$nearc4a * (card$age76)^2 / 100

# 2SLS
structural_model <- ivreg(lwage76 ~ ed76 + exp + I(exp^2/100) + black + reg76r + smsa76r | nearc4a + nearc4b + interaction1 + interaction2 + exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)

summary(structural_model)
## 
## Call:
## ivreg(formula = lwage76 ~ ed76 + exp + I(exp^2/100) + black + 
##     reg76r + smsa76r | nearc4a + nearc4b + interaction1 + interaction2 + 
##     exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61638 -0.22444  0.02206  0.24233  1.34656 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.590107   0.106727  43.008  < 2e-16 ***
## ed76          0.082539   0.006030  13.688  < 2e-16 ***
## exp           0.087094   0.006952  12.529  < 2e-16 ***
## I(exp^2/100) -0.224720   0.031817  -7.063 2.02e-12 ***
## black        -0.181022   0.018325  -9.878  < 2e-16 ***
## reg76r       -0.121940   0.015226  -8.009 1.64e-15 ***
## smsa76r       0.157018   0.015793   9.942  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3746 on 3003 degrees of freedom
## Multiple R-Squared: 0.2891,  Adjusted R-squared: 0.2877 
## Wald test: 161.6 on 6 and 3003 DF,  p-value: < 2.2e-16

This estimated effect is half of the size of the coefficient in the last model.

summary(structural_model, diagnostics = TRUE)
## 
## Call:
## ivreg(formula = lwage76 ~ ed76 + exp + I(exp^2/100) + black + 
##     reg76r + smsa76r | nearc4a + nearc4b + interaction1 + interaction2 + 
##     exp + I(exp^2/100) + black + reg76r + smsa76r, data = card)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61638 -0.22444  0.02206  0.24233  1.34656 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.590107   0.106727  43.008  < 2e-16 ***
## ed76          0.082539   0.006030  13.688  < 2e-16 ***
## exp           0.087094   0.006952  12.529  < 2e-16 ***
## I(exp^2/100) -0.224720   0.031817  -7.063 2.02e-12 ***
## black        -0.181022   0.018325  -9.878  < 2e-16 ***
## reg76r       -0.121940   0.015226  -8.009 1.64e-15 ***
## smsa76r       0.157018   0.015793   9.942  < 2e-16 ***
## 
## Diagnostic tests:
##                   df1  df2 statistic p-value    
## Weak instruments    4 3000   384.015  <2e-16 ***
## Wu-Hausman          1 3002     3.034  0.0817 .  
## Sargan              3   NA    10.978  0.0118 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3746 on 3003 degrees of freedom
## Multiple R-Squared: 0.2891,  Adjusted R-squared: 0.2877 
## Wald test: 161.6 on 6 and 3003 DF,  p-value: < 2.2e-16

The Wu-Hausman test gives a test statistic of 3.034 with p-value of 0.0817, which is slightly above conventional cutoff of 0.05. The test gives week evidence of an exogenous ed76 in the structural model, so an IV might still be necessary.