GENERAL NOTES

Objective of Problem Set 4

  • The objective of this problem set is to review and apply the instrumental variables method in R.

Submission process

  • When you are done with the problem set, publish it on RPubs using your temporary account.
  • Copy the RPubs link of your work and submit it on Canvas.
  • If two or more submissions are identical, whoever sent me their RPubs link last will be treated as having copied the others, so timely submission is important. Own your work. I may ask at random for your R script and .Rmd files to double-check. As a standard practice, work in a script file before building the code chunks in your .Rmd file. Your .Rmd file and RPubs submission page MUST show the code used to produce every output you present in your answers.

Academic integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

# Load packages (run install.packages("pacman") once if pacman is not installed)
library(pacman)
p_load(causalweight, lmtest, sandwich, AER, ivmodel, haven, estimatr, tidyverse, 
       lubridate, usmap, gridExtra, stringr, readxl,  
       reshape2, scales, broom, data.table, ggplot2, stargazer,  
       foreign, ggthemes, ggforce, ggridges, latex2exp, viridis, extrafont, 
       kableExtra, snakecase, janitor)

PROBLEM. Instrumental variables

i. Example of an IV paper.

Find one paper in the applied literature that uses an IV approach in its identification strategy. Provide the following information:

    1. Full reference to the paper: authors, title of the paper, journal, year, and DOI.
    2. Very short description of the main research question the paper tries to answer (max 3 lines).
    3. Outcome variable, treatment variable, and main argument of why the treatment is likely endogenous in the causal relationship of interest (max 5 lines).
    4. Instrumental variable(s).
    5. Draw the directed acyclic graph (DAG) corresponding to the identification strategy.
    6. Main argument of why the instrument(s) is/are relevant (max 3 lines).
    7. Main argument of why the instrument(s) is/are valid (max 3 lines).
    8. Main finding of the paper (max 3 lines).

ii. Hausman-Nevo instrument.

One of the instruments commonly used in demand estimation is the Hausman-Nevo instrument, which instruments the price of a specific good (e.g., the price of good \(i\) at time \(t\) in geographical market \(m\)) with the contemporaneous price of the same good in neighboring markets (e.g., the average price of good \(i\) at time \(t\) across the other geographical markets \(-m\)). The identifying assumption has two parts. First, contemporaneous price variation in markets \(-m\) reflects contemporaneous variation in the marginal cost of producing the good, so it is correlated with contemporaneous price variation in market \(m\). Second, these marginal cost shifters or shocks are uncorrelated with demand shocks. You can read more about this in Hausman (NBER 1996), Nevo (Econometrica 2001), Nevo (NBER 2012), Berry & Haile (NBER 2021), or Hahn et al. (2024).
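To make the construction concrete, here is a minimal sketch in base R on a made-up toy panel. The data frame `prices`, the helper `loo_mean`, and the column `hn_iv` are hypothetical names, and the sketch assumes the instrument is the simple leave-one-out average over all other markets in the same period; the papers cited above may restrict the set of markets or weight them differently.

```r
# Toy panel: one good observed in 3 markets over 2 periods (made-up prices).
prices <- data.frame(
  market = rep(c("A", "B", "C"), times = 2),
  period = rep(1:2, each = 3),
  price  = c(2.0, 2.4, 2.2, 2.5, 2.9, 2.6)
)

# Leave-one-out mean: average price of the same good in the same period
# across all markets other than the row's own market.
loo_mean <- function(p) (sum(p) - p) / (length(p) - 1)

# ave() applies loo_mean within each period and returns a vector aligned with
# the original rows.
prices$hn_iv <- ave(prices$price, prices$period, FUN = loo_mean)

print(prices)
```

Each row's `hn_iv` excludes that row's own price, so a demand shock specific to market \(m\) does not enter its own instrument mechanically.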

  • Discussion. Despite having its issues, the use of the Hausman-Nevo instrument remains acceptable in reputable journals. In what follows, you are asked to discuss its validity or exclusion restriction in a published paper.

    • Pick one paper: DellaVigna & Gentzkow (QJE 2019), Oh & Vukina (AJAE 2022), or any recent paper that relies on the Hausman-Nevo instrument in its identification strategy.

    • What evidence did the authors use to support their argument that the instrument satisfies the exclusion restriction?

iii. ITT and LATE

Practice data. The practice data for this question is the JC dataset from the “National Job Corps (JC) Study”. JC is a large education program from the US Department of Labor that enrolled disadvantaged individuals aged 16-24 in education and/or vocational training from late 1994 to early 1996, with the goal of increasing their employment and earnings some years after the program and decreasing their criminal activity. You can read more about this randomized experiment in Schochet et al. (Mathematica 2001) and Schochet et al. (AER 2008), among others. The JC dataset ships with the causalweight package by Bodory and Huber (2018).

The JC sample contains 9,240 observations on individuals randomly assigned to the JC treatment group (5,577 observations) or the control group (3,663 observations), which you can check by examining the assignment variable (assignment). Because of noncompliance, the random program assignment (assignment) differs from the actual treatment variable in the first year after assignment (trainy1). The outcome variable is weekly earnings in US dollars (USD) in the fourth year post-treatment (earny4). There are multiple other variables, which we will use later in a de-biased/double/causal machine learning (DML/CML) exercise. I suggest you call ?JC to open the help file and check the description of the dataset, even though you will only use the variables assignment, trainy1, and earny4 for this problem set.

# Load data
data(JC)
table(JC$assignment)
## 
##    0    1 
## 3663 5577
table(JC$trainy1)
## 
##    0    1 
## 2666 6574
#table(JC$trainy2)
table(JC$assignment, JC$trainy1)
##    
##        0    1
##   0 1809 1854
##   1  857 4720
prop.table(table(JC$assignment, JC$trainy1))
##    
##              0          1
##   0 0.19577922 0.20064935
##   1 0.09274892 0.51082251
as.data.frame(table(JC$assignment, JC$trainy1))
##   Var1 Var2 Freq
## 1    0    0 1809
## 2    1    0  857
## 3    0    1 1854
## 4    1    1 4720
#?JC
  • (a) Intent-To-Treat (ITT) Effect.

    • Without using a regression, compute the estimated ITT. Hint: This is the mean difference in earnings between individuals randomly assigned to JC treatment and those randomly assigned to control.

    • Use a regression to obtain the estimated ITT and its standard error. Hint: This is the effect of JC random assignment on earnings, regardless of whether individuals actually took up the training.

  • (b) Complier share.

    • Note that you are facing a two-sided noncompliance issue. Compute the following difference: the actual treatment take-up rate among individuals randomly assigned to treatment minus the actual treatment take-up rate among those randomly assigned to control.

    • Use a regression to obtain the same difference in actual treatment take-up rates.

  • (c) Complier/Local Average Treatment Effect (LATE).

    • Without using a regression, compute the estimated LATE among compliers.

    • Load the AER package by Kleiber & Zeileis (2008), which contains the ivreg command for 2SLS regression. Use the ivreg command to estimate the LATE and its standard error. Alternatively, you may also load other packages for IV, such as the ivmodel package by Kang et al. (2020) and use the ivmodel command to estimate the LATE and its standard error.

    • Interpret ITT versus LATE estimates of the JC program effect on earnings.
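The relationship between the three quantities in (a) to (c) can be sketched on simulated data. This is not the JC data and not a solution: the variables `z` (assignment), `d` (take-up), `y` (outcome), the type shares, and the effect size of 30 are all assumptions chosen for illustration. The sketch shows that the difference in means equals the OLS slope on assignment, and that the Wald ratio (ITT divided by the difference in take-up rates) equals the 2SLS point estimate when the instrument is binary and there are no covariates.

```r
set.seed(42)
n <- 5000

# Simulated experiment (NOT the JC data): z = random assignment,
# d = actual take-up with two-sided noncompliance, y = outcome.
z <- rbinom(n, 1, 0.6)
type <- sample(c("complier", "always", "never"), n, replace = TRUE,
               prob = c(0.5, 0.3, 0.2))
d <- ifelse(type == "always", 1, ifelse(type == "never", 0, z))
y <- 200 + 30 * d + rnorm(n, sd = 50)
sim <- data.frame(y, d, z)

# ITT: difference in mean outcomes by assignment, and the same number by OLS.
itt_means <- mean(sim$y[sim$z == 1]) - mean(sim$y[sim$z == 0])
itt_ols   <- coef(lm(y ~ z, data = sim))[["z"]]

# First stage: difference in take-up rates by assignment (the complier share
# under monotonicity), and the same number by OLS.
fs_means <- mean(sim$d[sim$z == 1]) - mean(sim$d[sim$z == 0])
fs_ols   <- coef(lm(d ~ z, data = sim))[["z"]]

# Wald estimator of the LATE: ITT scaled by the first-stage difference.
late_wald <- itt_means / fs_means

# Manual 2SLS point estimate: regress y on the first-stage fitted values.
# The standard errors of this second-stage lm() are NOT valid.
sim$d_hat <- fitted(lm(d ~ z, data = sim))
late_2sls <- coef(lm(y ~ d_hat, data = sim))[["d_hat"]]

round(c(ITT = itt_means, first_stage = fs_means, LATE = late_wald), 3)
```

For valid standard errors, a dedicated routine such as `ivreg(y ~ d | z, data = sim)` from the AER package is the safer choice, since it corrects the second-stage variance that the manual two-step approach gets wrong.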

iv. Replication and extension.

You will replicate and extend the work of Card (NBER 1993; in Christofides et al. 1995, Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp) on the returns to schooling using college proximity as an IV for education in a wage regression. The paper’s identifying assumption is that living closer to a college reduces the cost barrier to attending college, thereby increasing the likelihood of enrollment; however, proximity to a college is assumed not to directly affect a student’s skills or abilities and, therefore, should not directly affect their market wage.

Practice data. The data is from the National Longitudinal Survey of Young Men (NLSYM) for 1976. It is sourced from this link and posted as Card1995 on Canvas.

In this exercise, we will focus on using proximity to public college (nearc4a) and private college (nearc4b) as instruments for education (years of schooling) in 1976 (ed76). The dependent variable is the log of weekly earnings (lwage76). Other variables you will need for this exercise include age (years) in 1976 (age76); years of work experience \((exp)\), calculated as \((age76 - ed76 -6)\); \((exp^{2}/100)\); an indicator for black (black); an indicator for residence in the southern region of the U.S. (reg76r); an indicator for urban residence in a standard metropolitan statistical area (smsa76r); and an indicator for growing up in the same county as a 4-year college (nearc4). See the description file for the variable definitions on Canvas.
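As a small sketch of the derived regressors defined above, the following builds experience and its scaled square on made-up rows. Only the column names `age76` and `ed76` come from the problem set; the data frame `toy`, its values, and the new column names `exper` and `exper2` are hypothetical (named to avoid clashing with base R's `exp()` function).

```r
# Toy rows with made-up values; only the column names age76 and ed76
# come from the problem set.
toy <- data.frame(age76 = c(24, 30, 27), ed76 = c(12, 16, 10))

# Years of work experience and its square scaled by 100, as defined in the text.
toy$exper  <- toy$age76 - toy$ed76 - 6
toy$exper2 <- toy$exper^2 / 100

print(toy)
```

Because experience is computed from `ed76`, it inherits any endogeneity in education, which is relevant to the endogeneity discussion later in this problem.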

  • OLS, IV, First-stage, and Reduced form estimations using proximity to college. Estimate and show in the same formatted table the four models described below. You may use packages like stargazer, texreg, or any other package that helps you produce well-formatted estimation tables. Interpret your results.

    • \(lwage76 = \beta_{0} + \beta_{1} ed76 + \beta_{2} exp + \beta_{3} (exp^{2}/100) + \beta_{4} black + \beta_{5} reg76r + \beta_{6} smsa76r + \epsilon\), which should replicate the first OLS model in Table 2 of Card (NBER 1993).

    • the IV model that uses nearc4 as an instrument for ed76.

    • the first-stage regression.

    • the reduced form regression, i.e., the multivariate linear regression of lwage76 as a function of the same independent variables used in the first-stage regression.

  • IV, First-stage, and Reduced form estimations using proximity to public and private colleges as instruments. Estimate and show in the same formatted table the IV model that uses nearc4a and nearc4b as instruments for ed76, first-stage regression, and reduced form regression. You may use packages like stargazer, texreg, or any other package that helps you produce well-formatted estimation tables. Interpret your results.

  • Endogeneity discussion.

    • Is education endogenous in the wage regression? If it is, then is experience endogenous?

    • Create the interactions \((nearc4a * age76)\) and \((nearc4a * age76^{2}/100)\).

    • Estimate the structural equation by 2SLS using nearc4a, nearc4b, and the interactions above as instruments for ed76, \(exp\), and \((exp^{2}/100)\).

    • How do the results differ from your earlier ones instrumenting only for ed76 using nearc4a and nearc4b?

    • Test the hypothesis that ed76 is exogenous for the structural return to schooling.
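One common way to test exogeneity is the regression-based (control function) version of the Durbin-Wu-Hausman test. The sketch below uses simulated data, not the Card data, with a single endogenous regressor `x`, two instruments `z1` and `z2`, and one exogenous control `w`; all names and coefficients are assumptions for illustration. It adds the first-stage residual to the structural equation: the coefficient on `x` then reproduces the 2SLS point estimate, and the t-test on the residual tests the null that `x` is exogenous.

```r
set.seed(123)
n <- 5000

# Simulated data (NOT the Card data): u is an unobserved confounder that makes
# x endogenous; z1 and z2 are valid instruments; w is an exogenous control.
u  <- rnorm(n)
z1 <- rbinom(n, 1, 0.5)
z2 <- rbinom(n, 1, 0.3)
w  <- rnorm(n)
x  <- 1 + 0.8 * z1 + 0.5 * z2 + 0.3 * w + u + rnorm(n)
y  <- 2 + 0.1 * x + 0.2 * w - 1.0 * u + rnorm(n)
sim <- data.frame(y, x, w, z1, z2)

# First stage: endogenous regressor on all instruments and exogenous controls.
first_stage <- lm(x ~ z1 + z2 + w, data = sim)
sim$x_hat <- fitted(first_stage)
sim$v_hat <- resid(first_stage)

# 2SLS point estimate: replace x by its first-stage fitted values.
# (Standard errors from this second-stage lm() are not valid.)
b_2sls <- coef(lm(y ~ x_hat + w, data = sim))[["x_hat"]]

# Control-function regression: add the first-stage residual to the structural
# equation. The coefficient on x reproduces 2SLS, and the t-test on v_hat
# tests the null hypothesis that x is exogenous.
cf     <- lm(y ~ x + w + v_hat, data = sim)
b_cf   <- coef(cf)[["x"]]
p_exog <- summary(cf)$coefficients["v_hat", "Pr(>|t|)"]

# OLS for comparison; it is biased here because x is correlated with u.
b_ols <- coef(lm(y ~ x + w, data = sim))[["x"]]

round(c(OLS = b_ols, TSLS = b_2sls, control_fn = b_cf, p_exogeneity = p_exog), 4)
```

With several endogenous regressors, the same idea would add one first-stage residual per endogenous regressor and test them jointly with an F-test. If you estimate the model with `ivreg` from AER, `summary(..., diagnostics = TRUE)` reports a Wu-Hausman statistic alongside weak-instrument and overidentification diagnostics.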

HAVE FUN AND KEEP FAITH IN THE FUN!