R
. In the
second problem set, you will build on this foundation by combining the
difference-in-differences approach with other analytic methods in
R
.Rpubs
using your temporary account.RPubs
link of your work and submit it on
Canvas.RPubs
link last has copied the others. So, timely
submissions are important. Own your work. I can randomly ask your
R
script and .Rmd files for double-checking purposes. As a
standard practice, work in a script file before making your code chunks
in the .Rmd file. Your .Rmd file and Rpubs
submission page
MUST show the code used to produce any of the outputs you present in
your answers.Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.
Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.
# Load packages
library(pacman)
p_load(wooldridge, multiwayvcov, causalweight, lmtest, did, usmap,
sandwich, AER, ivmodel, haven, estimatr, tidyverse, lubridate,
gridExtra, stringr, readxl, reshape2, scales, broom,
ggplot2, fixest, lfe, stargazer, foreign, ggthemes, ggforce,
ggridges, latex2exp, viridis, extrafont, kableExtra, snakecase,
janitor, tableone, causaldata, multcomp, etable)
Practice data. The practice data of this question is
Kiel & McClain
(JEEM 1995)’s dataset on house prices in North Andover,
Massachusetts, before and after the construction of a garbage
incinerator. You can load the wooldridge
package by Shea
(2021) to access the kielmc
dataset. Our treatment variable
nearinc
is the binary indicator for a house being located
nearer the incinerator (\(nearinc=1\))
versus farther away from the incinerator (\(nearinc=0\)). We have two periods of data
captured by the variable y81
: 1978 or before the
construction of the incinerator (\(y81=0\)) and 1981 or after the construction
of the incinerator (\(y81=1\)). The
outcome variable is the real price of the house in 1978 US dollars
(rprice
). Other variables include the distance from the
house to the central business district discretely coded
(cbd
), number of bedrooms (rooms
), number of
bathrooms (baths
), age of the house in years
(age
) and squared (agesq
), living area in
square feet (area
) and log-transformed
(larea
), lot size in square feet (land
) and
log-transformed (lland
), etc. You may call the dataset
(?kielmc
) for further description.
# Load `kielmc` data
data(kielmc)
attach(kielmc)
#table(nearinc)
#table(y81)
as.data.frame(table(nearinc, y81))
## nearinc y81 Freq
## 1 0 0 123
## 2 1 0 56
## 3 0 1 102
## 4 1 1 40
summary(rprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26000 59000 82000 83721 100230 300000
length(unique(cbd)) # number of clusters
## [1] 34
#?kielmc
cbd
level. You may use the
multiwaycov
package by Graham et al. (2016) and the
lmtest
package to compute the cluster-robust standard
errors. You may use packages like stargazer
,
modelsummary
, texreg
, or any other package
that helps you produce well-formatted estimation tables. Interpret your
results.Re-estimate the treatment effect on house prices using the 2x2
difference-in-differences estimator with the set of covariates
rooms
, baths
, age
,
agesq
, larea
, and lland
, while
clustering the standard errors at the cbd
level.
Re-estimate the treatment effect on house prices using the 2x2
difference-in-difference estimator with the set of covariates
rooms
, baths
, age
,
area
, and land
, while clustering the standard
errors at the cbd
level. Present a well-formatted
estimation table and compare your results to the precedent
ones.
Using the set of covariates rooms
,
baths
, age
, agesq
,
larea
, and lland
, re-estimate the treatment
effect on house prices, while applying a difference-in-differences
estimator based on inverse probability weighting (IPW) with cluster
bootstrap. You may use the didweight
command of the
causalweight
package. Show the treatment effect, standard
errors, and p-value.
Using the set of covariates rooms
,
baths
, age
, area
, and
land
, re-estimate the treatment effect on house prices,
while applying a difference-in-differences estimator based on inverse
probability weighting (IPW) with cluster bootstrap. Show the treatment
effect, standard errors, and p-value.
Interpret and compare these estimation results using covariates.
Of the five difference-in-differences estimates, what will be your preferred estimate? Why?
You observed a discussion between two friends who had access to county-year panel data spanning two time periods (\(t=1,2\)). The dataset includes the outcome variable for county \(i\) at time \(t\) \(\left(Y_{it}\right)\) and a binary treatment indicator \(\left(D_{it}\right)\) for county \(i\) at time \(t\). In this setup, no counties were treated in the first period \(\left(D_{i1}=0 \quad \forall i \right)\), but some counties received treatment in the second period \(\left(D_{i2}=0,1\right)\).
Your friends defined \(Y_{it}(0)\) as the potential outcome for county \(i\) in period \(t\) if it remained untreated in both periods, and \(Y_{it}(1)\) as the potential outcome for county \(i\) in period \(t\) if it was untreated at \(t=1\) but received treatment at \(t=2\). They assumed parallel trends in the average outcome for the treated and untreated counties in the absence of treatment:
\[\begin{equation} \tag{A1} \mathop{E}\left[Y_{i2}(0)-Y_{i1}(0)|D_{i}=1\right] = \mathop{E}\left[Y_{i2}(0)-Y_{i1}(0)|D_{i}=0\right], \end{equation}\] where \(D_{i}=max \lbrace D_{i1}, D_{i2} \rbrace = D_{i2}\) is the treatment status of county \(i\).
Both friends were interested in estimating the average treatment effect on the treated \(\left(ATT\right)\) at \(t=2\) that would be robust to treatment effect heterogeneity. Friend 1 proposed estimating Equation (1) to obtain \(\tau\), while Friend 2 suggested using Equation (2) to estimate \(\delta\). In Equation (1), \(\eta_{i}\) and \(\theta_{t}\) represent county and time fixed effects, respectively, while in Equation (2), \(P_{t}\) serves as the indicator for the second period.
\[\begin{equation} \tag{1} Y_{it} = \eta_{i} + \theta_{t} + \tau D_{it} + \epsilon_{it}. \end{equation}\]
\[\begin{equation} \tag{2} Y_{it} = \alpha + \beta D_{i} + \lambda P_{t} + \delta \left(D_{i} \times P_{t}\right) + \varepsilon_{it}. \end{equation}\]
Your first task is to adjudicate between the two approaches. Demonstrate your reasoning.
Friend 3 joined your discussion and suggested estimating the equation below as in Equation (3) of Roth, Sant’Anna, Bilinski, & Poe (JoE 2023).
\[\begin{equation} \tag{3} Y_{it} = \gamma_{i} + \phi_{t} + \rho \left(D_{i} \times P_{t}\right) + \nu_{it}. \end{equation}\]
Using a subset of their dataset provided below, compare \(\hat{\tau}\), \(\hat{\delta}\), and \(\hat{\rho}\), clustering standard errors at
\(i\) level. You may use packages like
stargazer
, modelsummary
, texreg
,
or any other package that helps you produce well-formatted estimation
tables. Draw your conclusions.
# Friend's data
i <- c(1, 1, 2, 2, 3, 3, 4, 4)
t <- c(1, 2, 1, 2, 1, 2, 1, 2)
P_t <- c(0, 1, 0, 1, 0, 1, 0, 1)
D_it <- c(0, 0, 0, 0, 0, 1, 0, 1)
D_i <- c(0, 0, 0, 0, 1, 1, 1, 1)
Y_it <- c(6, 7, 10, 9, 8, 14, 8, 12)
df <- data.frame(i=i, t=t, P_t=P_t, D_it=D_it, D_i=D_i, Y_it=Y_it)
df
## i t P_t D_it D_i Y_it
## 1 1 1 0 0 0 6
## 2 1 2 1 0 0 7
## 3 2 1 0 0 0 10
## 4 2 2 1 0 0 9
## 5 3 1 0 0 1 8
## 6 3 2 1 1 1 14
## 7 4 1 0 0 1 8
## 8 4 2 1 1 1 12
rm(i, t, P_t, D_it, D_i, Y_it)
Practice data. The practice data of this question is
Callaway &
Sant’Anna (JoE 2021)’s dataset on teen employment in 500 US counties
from 2003 to 2007, before and after introducing a minimum wage policy.
You can load the did
package by Callaway & Sant’Anna
(2020) to access the dataset mpdta
. Counties are identified
by the variable countyreal
, and the year of the observation
by year
. The variable treat
is the binary
indicator for a county introducing a minimum wage in the staggered
sense. The variable first.treat
is the first year in which
the county introduced a minimum wage, which takes value 0 for counties
without minimum wage policy (\(treat=0\)) and values 2004-2007 for
counties with minimum wage policy (\(treat=1\)). The variable lemp
is the log of employment among teenagers in the county in that year.
Other variables include the log of population in thousands measured only
once for the county (lpop
), etc. You may call the dataset
(?mpdta
) for further description.
# Load `mpdta` data
library(did)
data(mpdta)
table(mpdta$treat, mpdta$year)
##
## 2003 2004 2005 2006 2007
## 0 309 309 309 309 309
## 1 191 191 191 191 191
length(unique(mpdta$first.treat)) # cohorts or groups (incl. never-treated)
## [1] 4
table(mpdta$first.treat, mpdta$year)
##
## 2003 2004 2005 2006 2007
## 0 309 309 309 309 309
## 2004 20 20 20 20 20
## 2006 40 40 40 40 40
## 2007 131 131 131 131 131
summary(mpdta$first.treat[mpdta$treat==0])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
summary(mpdta$first.treat[mpdta$treat==1])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2004 2006 2007 2006 2007 2007
summary(mpdta$lemp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.099 4.868 5.697 5.773 6.684 10.443
#?mpdta
Create a single figure combining two county-level maps: one
showing the treat
data, another showing the
first.treat
data. You may use the usmap
package or any other package that helps you produce high-quality maps.
Interpret the figure.
Create county-level maps showing the means of lemp
by first.treat
categories, year-by-year. Combine the maps
as desired. You may use the usmap
package or any other
package that helps you produce high-quality maps. Interpret the
figure.
Using the att_gt
command of the did
package, compute the group-time average treatment effects on the treated
based on outcome regression estimation with the never-treated as the
control group, lpop
as the covariate, and clustering
standard errors at the countyreal
level. Interpret your
results.
Using the att_gt
command of the did
package, compute the group-time average treatment effects on the treated
based on inverse probability weighting (IPW) estimation with the
never-treated as the control group, lpop
as the covariate,
and clustering standard errors at the countyreal
level. Do
your results qualitatively change as compared to the regression
adjustment above?
Using the att_gt
command of the did
package, compute the group-time average treatment effects on the treated
based on a doubly robust (DR) estimation with the never-treated as the
control group, lpop
as the covariate, and clustering
standard errors at the countyreal
level. Do your results
qualitatively change as compared to the regression adjustment and IPW
procedures above?
Using the ggdid
command of the did
package, plot the group-time average treatment effects on the treated
based on a doubly robust (DR) estimation. Do your earlier maps hint to
these results?
Using the aggte
command of the did
package, compute the group-specific and overall average treatment
effects on the treated based on a doubly robust (DR) estimation with the
never-treated as the control group, lpop
as the covariate,
and clustering standard errors at the countyreal
level.
Interpret your results.
Based on Table 2 from Callaway &
Sant’Anna (JoE 2021) or Table 2 from de Chaisemartin &
D’Haultfœuille (Econometrics J 2022), identify three R
packages, apart from did
, that are suitable for
heterogeneity-robust estimations in contexts of staggered treatment
timing. Employ each of these packages to re-estimate the average
treatment effect on the treated and compare the results with those
obtained using the did
package.
HAVE FUN AND KEEP FAITH IN THE FUN!