You will be asked to combine the difference-in-differences approach with other methods (synthetic control, regression discontinuity design). You will also practice machine learning and event studies.
Publish your work on RPubs using your temporary account. Copy the RPubs link of your work and submit it on Canvas. Whoever submits their RPubs link last will be presumed to have copied the others, so timely submissions are important. Own your work. I may randomly ask for your R script and .Rmd files for double-checking purposes. As a standard practice, work in a script file before creating your code chunks in the .Rmd file. Your .Rmd file and RPubs submission page MUST show the code used to produce any of the outputs you present in your answers.
Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.
Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.
# Load packages
library(pacman)
p_load(devtools, synthdid, did, DRDID, gtrendsR, glmnet, causalweight, nnls, grf,
rdrobust, rdd, RDHonest, tidyverse, lubridate, usmap, gridExtra, stringr, readxl,
reshape2, scales, broom, data.table, ggplot2, stargazer, modelsummary,
foreign, ggthemes, ggforce, ggridges, latex2exp, viridis, extrafont,
kableExtra, snakecase, janitor, tableone, lfe, fixest, multcomp, etable)
Table 1. Replicate Table 1 of the paper Arkhangelsky, Athey,
Hirshberg, Imbens, and Wager (AER 2021), focusing on the first three
estimators: synthetic difference-in-differences (SDID), synthetic
control (SC), and difference-in-differences (DID). You may use the
replication package folder 146381-V1 of the paper and the script file
synthdid-sdid-paper.R, both available on Canvas.
Figure 1. Replicate Figure 1 of the same paper and interpret it using your own words.
Restricted donor pool. Remove entirely the states of Nevada and New Hampshire from the data and re-do the analysis. Generate the corresponding Table 1 and Figure 1. You may name the new outputs Table 2 and Figure 2 to differentiate them from part (i). How do the results change?
Restricted pre-treatment periods. Remove entirely the pre-treatment periods 1970-1974 from the data and re-do the analysis. Name the new outputs Table 3 and Figure 3. How do the results change?
Restricted post-treatment periods. Remove entirely the post-treatment periods 1996-2000 from the data and re-do the analysis. Name the new outputs Table 4 and Figure 4. How do the results change?
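The three restrictions above amount to dropping rows from the long panel before rebuilding the outcome matrix. The following is a minimal sketch, assuming the california_prop99 data shipped with the synthdid package (columns State, Year, PacksPerCapita, treated) matches the data in the replication folder; the helper estimate_all is a hypothetical name introduced here, and the sketch returns point estimates only.

```r
library(synthdid)

data("california_prop99")  # long panel: State, Year, PacksPerCapita, treated

# Hypothetical helper: rebuild the panel matrices from a (possibly restricted)
# long panel and return the SDID, SC, and DID point estimates.
estimate_all <- function(panel) {
  setup <- panel.matrices(panel)
  estimators <- list(sdid = synthdid_estimate,
                     sc   = sc_estimate,
                     did  = did_estimate)
  sapply(estimators, function(f) as.numeric(f(setup$Y, setup$N0, setup$T0)))
}

full_sample <- estimate_all(california_prop99)

# Restricted donor pool: drop two control states entirely.
no_nv_nh <- estimate_all(
  subset(california_prop99, !(State %in% c("Nevada", "New Hampshire")))
)

# Restricted pre-treatment periods: drop 1970-1974 for every state.
short_pre <- estimate_all(subset(california_prop99, !(Year %in% 1970:1974)))

# Restricted post-treatment periods: drop 1996-2000 for every state.
short_post <- estimate_all(subset(california_prop99, !(Year %in% 1996:2000)))

round(rbind(full_sample, no_nv_nh, short_pre, short_post), 1)
```

Dropping whole states or whole years keeps the panel balanced, which panel.matrices requires. With a single treated unit, standard errors are typically obtained with the placebo method, for example sqrt(vcov(estimate, method = "placebo")) applied to each estimate object.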
In the last five years in the US, web searches for ‘Tariffs’ began to soar only at the end of the election period in 2024, peaking in late March through April 2025. A similar trend is observed in Canada and Mexico.
Your task is to evaluate the static and dynamic effects of a ‘Tariffs Web Searches’ event on an outcome of your choice. Set up the ‘Tariffs Web Searches’ event in October 2024, and use monthly data on your chosen outcome spanning from April 2020 through March 2025 (or at least some time window covering the periods before and after the event) for three countries of interest: the US, Canada, and Mexico. Using an estimation method of your choice, analyze the impact of the event on the outcome.
Data on your chosen outcome and covariates (if any) must be publicly available. For example, you can use country-year-month-level data from sources like https://fred.stlouisfed.org/, https://www.bea.gov/, international data repositories, etc.
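One possible setup for the static and dynamic estimates is sketched below on simulated data. The outcome y, its data-generating process, the linear trend, and the 6-month event window are all placeholder assumptions, not part of the assignment; replace them with your own outcome and specification. Because all three countries experience the event in the same month, month fixed effects cannot be separated from the event, so this sketch relies on a pre-event trend instead.

```r
set.seed(1)

months    <- seq(as.Date("2020-04-01"), as.Date("2025-03-01"), by = "month")
countries <- c("US", "Canada", "Mexico")

panel <- data.frame(
  month   = rep(months, times = length(countries)),
  country = rep(countries, each = length(months))
)

# Months relative to the event (October 2024 = 0).
yr <- as.integer(format(panel$month, "%Y"))
mo <- as.integer(format(panel$month, "%m"))
panel$rel_month <- 12 * (yr - 2024) + (mo - 10)
panel$post      <- as.integer(panel$rel_month >= 0)

# Placeholder outcome: linear trend plus a level shift of 2 after the event.
panel$y <- 100 + 0.1 * panel$rel_month + 2 * panel$post + rnorm(nrow(panel))

# Static effect: post-event indicator with country fixed effects and a trend.
static_fit    <- lm(y ~ post + rel_month + factor(country), data = panel)
static_effect <- unname(coef(static_fit)["post"])

# Dynamic effects: event-time dummies in a short window, with the month
# before the event (-1) as the reference period.
window       <- subset(panel, rel_month >= -6 & rel_month <= 5)
window$rel_f <- relevel(factor(window$rel_month), ref = "-1")
dynamic_fit  <- lm(y ~ rel_f + factor(country), data = window)

static_effect
round(coef(dynamic_fit)[grepl("^rel_f", names(coef(dynamic_fit)))], 2)
```

The coefficients on rel_f for negative event times serve as a pre-trend check; those for event times 0 through 5 trace the dynamic response.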
Practice dataset. You will use the housing dataset testdata20250121.RDS posted on Canvas and previously used in Problem Sets 1 & 2. It has information on the sale price of the house, state, year in which the house was built, number of bedrooms, bathrooms, fireplaces, stories, square footage, presence or absence of AC, etc.
(a). Clean the dataset as previously preprocessed in Problem Set 1. Then, divide it into two subsets: a training dataset that excludes data from New York, Colorado, and California, and a test dataset comprising only data from these three states.
(b). Estimate a hedonic price model on the training dataset using LASSO regression, where your response \(y\) is log_sale_amount; your \(x\) matrix includes year_built, bedrooms_all_buildings, number_of_bathrooms, number_of_fireplaces, stories_number, log_land_square_footage, and AC_presence; you use 5-fold cross validation and evaluate models using mse.
(c). What is the optimal tuning parameter lambda? What are the LASSO coefficients in your retained model? Based on the LASSO regression, what is the marginal willingness to pay for AC in your training dataset?
(d). Based on the LASSO coefficients derived from the training dataset and the average housing attributes in the test dataset, calculate the difference in predicted average house price between homes with AC and homes without AC in the test dataset.
(e). Compute the difference in observed average house price between homes with AC and homes without AC in the test dataset. How close is this observed difference to the predicted difference?
(f). Using the test data, run an OLS regression of \(y\) on \(x\) separately for each state. Determine the marginal willingness to pay for AC in New York, Colorado, and California. Compare these values to the LASSO-estimated marginal willingness to pay for AC obtained from the training dataset. How similar are the results?
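The workflow in parts (b) through (f) can be sketched as follows on simulated data. The data frame toy, its state column, and the reduced set of four regressors are assumptions made for illustration; the real dataset and full \(x\) matrix should be substituted. With a log outcome, one common reading of the AC coefficient is as an approximate proportional price premium, exp(beta) - 1.

```r
library(glmnet)

set.seed(42)
n <- 1000

# Simulated stand-in for the cleaned housing data (`state` is hypothetical).
toy <- data.frame(
  state = sample(c("NY", "CO", "CA", "TX", "FL", "OH"), n, replace = TRUE),
  year_built = sample(1950:2020, n, replace = TRUE),
  number_of_bathrooms = sample(1:4, n, replace = TRUE),
  log_land_square_footage = rnorm(n, mean = 9, sd = 0.5),
  AC_presence = rbinom(n, 1, 0.5)
)
toy$log_sale_amount <- 2 + 0.004 * toy$year_built +
  0.10 * toy$number_of_bathrooms + 0.20 * toy$log_land_square_footage +
  0.08 * toy$AC_presence + rnorm(n, sd = 0.3)

test_states <- c("NY", "CO", "CA")
train <- subset(toy, !(state %in% test_states))
test  <- subset(toy, state %in% test_states)
x_vars <- c("year_built", "number_of_bathrooms",
            "log_land_square_footage", "AC_presence")

# (b) LASSO (alpha = 1), 5-fold cross-validation scored by MSE.
cv_fit <- cv.glmnet(as.matrix(train[, x_vars]), train$log_sale_amount,
                    alpha = 1, nfolds = 5, type.measure = "mse")

# (c) CV-minimizing lambda, retained coefficients, and the AC premium.
lambda_opt  <- cv_fit$lambda.min
lasso_coef  <- coef(cv_fit, s = "lambda.min")
beta_ac     <- lasso_coef["AC_presence", 1]
pct_premium <- exp(beta_ac) - 1

# (d) Predicted price gap at the test-sample means, toggling AC on and off.
# exp() of a predicted log price ignores retransformation bias.
x_bar <- colMeans(test[, x_vars])
x_ac  <- replace(x_bar, "AC_presence", 1)
x_no  <- replace(x_bar, "AC_presence", 0)
pred  <- predict(cv_fit, newx = rbind(x_ac, x_no), s = "lambda.min")
diff_pred <- exp(pred[1]) - exp(pred[2])

# (e) Observed gap in average price in the test data.
price    <- exp(test$log_sale_amount)
diff_obs <- mean(price[test$AC_presence == 1]) -
  mean(price[test$AC_presence == 0])

# (f) State-by-state OLS coefficient on AC in the test data.
ols_ac <- sapply(split(test, test$state), function(d) {
  fit <- lm(log_sale_amount ~ year_built + number_of_bathrooms +
              log_land_square_footage + AC_presence, data = d)
  unname(coef(fit)["AC_presence"])
})

list(lambda = lambda_opt, beta_ac = beta_ac, diff_pred = diff_pred,
     diff_obs = diff_obs, ols_ac = ols_ac)
```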
You have access to billions of observations in your dataset and intend to estimate Equation (1) using OLS. However, the model takes a very long time to run and eventually crashes. You explained to your friend that you were unable to successfully estimate Equation (1) using OLS. Fortunately, you discovered that estimating Equation (2) or Equation (3) using OLS is considerably faster and runs successfully.
\[\begin{equation} \tag{1} Y_{i} = X_{1}^{'}\beta_{1}+X_{2}^{'}\beta_{2}+\varepsilon_{i}. \end{equation}\]
\[\begin{equation} \tag{2} Y_{i} = X_{1}^{'}\overline{\beta}_{1}+\mu_{i}. \end{equation}\]
\[\begin{equation} \tag{3} Y_{i} = X_{2}^{'}\overline{\beta}_{2}+\nu_{i}. \end{equation}\]
Your friend proposed two alternative solutions. Alternative 1: you estimate Equation (2), obtain the residuals \(\hat{\mu}_{i}\), and then regress these residuals on the set of regressors \(X_{2}\) as in Equation (4) using OLS. Alternative 2: you estimate Equation (3), obtain the residuals \(\hat{\nu}_{i}\), and then regress these residuals on the set of regressors \(X_{1}\) as in Equation (5) using OLS. Their claim is that \(\tilde{\beta}_{2}=\beta_{2}\) and \(\tilde{\beta}_{1}=\beta_{1}\).
\[\begin{equation} \tag{4} \hat{\mu}_{i} = X_{2}^{'}\tilde{\beta}_{2}+\omega_{i}. \end{equation}\]
\[\begin{equation} \tag{5} \hat{\nu}_{i} = X_{1}^{'}\tilde{\beta}_{1}+\rho_{i}. \end{equation}\]
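One way to examine the friend's claim numerically is a small simulation. The sketch below, on simulated data with correlated regressors (all numbers are arbitrary assumptions), compares Alternative 1 with the full-model estimate and with the Frisch-Waugh-Lovell residual-on-residual regression, in which \(X_{2}\) is also residualized on \(X_{1}\) and which is known to reproduce the full-model coefficient exactly.

```r
set.seed(123)
n  <- 5000
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)          # x2 is correlated with x1
y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n)

# Equation (1): the full model.
full <- lm(y ~ x1 + x2)

# Alternative 1: residuals from Equation (2), regressed on the raw x2.
mu_hat <- resid(lm(y ~ x1))
alt1   <- lm(mu_hat ~ x2)

# Frisch-Waugh-Lovell: residualize x2 on x1 as well before the second step.
x2_tilde <- resid(lm(x2 ~ x1))
fwl      <- lm(mu_hat ~ x2_tilde)

beta2_full <- unname(coef(full)["x2"])
beta2_alt1 <- unname(coef(alt1)["x2"])
beta2_fwl  <- unname(coef(fwl)["x2_tilde"])

round(c(full = beta2_full, alt1 = beta2_alt1, fwl = beta2_fwl), 3)
```

Rerunning the simulation with x2 generated independently of x1 shows how the comparison changes when the two sets of regressors are uncorrelated.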
(a). Using the California subset of the test dataset, apply double machine learning to re-estimate the marginal willingness to pay for AC. Use the treatDML command from the causalweight package, with LASSO as the machine learning method and 3-fold cross-fitting.
(b). How could you improve the results?
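A minimal sketch of the treatDML call in part (a) is given below on simulated data. The covariate matrix, the treatment assignment rule, and the true effect of 0.08 are assumptions standing in for the California subset; only the arguments MLmethod and k reflect the settings named in the exercise.

```r
library(causalweight)

set.seed(99)
n <- 1000

# Simulated stand-in for the California covariates.
x <- cbind(
  year_built = sample(1950:2020, n, replace = TRUE),
  number_of_bathrooms = sample(1:4, n, replace = TRUE),
  log_land_square_footage = rnorm(n, mean = 9, sd = 0.5)
)

# Treatment: AC is more likely in newer homes (selection on observables).
d <- rbinom(n, 1, plogis((x[, "year_built"] - 1985) / 20))

# Outcome: log sale price with an assumed AC effect of 0.08.
y <- 2 + 0.004 * x[, "year_built"] + 0.10 * x[, "number_of_bathrooms"] +
  0.20 * x[, "log_land_square_footage"] + 0.08 * d + rnorm(n, sd = 0.3)

# Double machine learning with LASSO nuisance models and 3-fold cross-fitting.
dml_fit <- treatDML(y = y, d = d, x = x, MLmethod = "lasso", k = 3)

c(effect = dml_fit$effect, se = dml_fit$se, pval = dml_fit$pval)
```

Because the outcome is in logs, the reported effect is an approximate proportional premium rather than a dollar amount.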
Practice dataset. You will work with the built-in data rdrobust_RDsenate of the rdrobust package from Calonico, Cattaneo, Farrell, and Titiunik (2021) in R. You have about 1,400 observations on US Senate elections and you want to assess whether winning a seat in the US Senate in an election provides an advantage for winning the same seat in the next election. The running variable margin is the Democratic Party’s margin of victory relative to the Republican Party in the preceding US Senate election: if positive, the Democratic Party won the seat in the preceding election; if negative, it lost the seat. The outcome vote is the Democratic Party’s vote share (in percent) in the next US Senate election. The treatment is the binary indicator for the Democratic Party winning the preceding US Senate election. You can assume that at the threshold value zero of margin, treated and control groups are comparable in terms of regional, political, and other voting background characteristics.
(a). Plot the outcome against the running variable and show the sharp discontinuity in mean outcomes at the threshold value zero of the running variable.
(b). The selection of an optimal bandwidth for the effect estimation has been discussed in several papers, including Calonico, Cattaneo, and Titiunik (Econometrica 2014).
Using the rdrobust command, present conventional and robust inference results of the impact evaluation. Interpret the output.
Without relying on the rdrobust command, reproduce the conventional inference results.
(c). Is the density of the running variable continuous at the threshold?
Plot a histogram for the density of the running variable and the number of observations in each bin of the running variable.
Using the DCdensity command of the rdd package by Dimmery (2016), perform McCrary (JoE 2008)’s test for a discontinuity in the density of the running variable at the threshold. Interpret the output.
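For the part of (b) that avoids rdrobust, one approach is a kernel-weighted local linear regression on each side of the threshold. The sketch below uses simulated data and a hypothetical bandwidth h; on the real data, h would be taken from the rdrobust output. With the same bandwidth, a triangular kernel, and a local linear fit, the coefficient on the treatment indicator should match the conventional point estimate, while the lm standard error will generally differ from rdrobust's default variance estimator.

```r
set.seed(2021)
n <- 1400

# Simulated stand-in for margin and vote with a true jump of 7 at zero.
margin <- runif(n, -50, 50)
won    <- as.integer(margin >= 0)          # sharp treatment assignment
vote   <- 45 + 0.3 * margin + 7 * won + rnorm(n, sd = 4)

h <- 18                                    # hypothetical bandwidth
w <- pmax(0, 1 - abs(margin) / h)          # triangular kernel weights

# Local linear regression with separate slopes on each side of the threshold.
local_fit   <- lm(vote ~ won * margin, weights = w, subset = w > 0)
rd_estimate <- unname(coef(local_fit)["won"])

# Number of observations per 5-point bin of the running variable, as a
# numerical companion to the histogram in part (c).
bin_counts <- table(cut(margin, breaks = seq(-50, 50, by = 5),
                        include.lowest = TRUE))

rd_estimate
bin_counts
```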
Practice dataset. You will work with the built-in data rcp of the RDHonest package from Kolesar (2021) in R. You have about 30,000 observations on household consumption, actual retirement, and pension eligibility. You want to assess the effect of actual retirement on household consumption. The treatment retired is the binary indicator for actual retirement. The outcome cn is household expenditure on non-durable goods in euros per year. The running variable elig_year is the number of years since reaching pension eligibility, centered so that the threshold is zero. You can consider being just above or below the threshold as an instrument for the treatment.
(a). Create a plot of the treatment against the running variable, and another plot of the outcome against the running variable. Based on these visualizations, summarize your argument in support of a fuzzy regression discontinuity design.
(b). Using the rdrobust command with its fuzzy argument, present conventional and robust inference results of the impact evaluation. Interpret the output.
(c). Without relying on the rdrobust command, reproduce the conventional inference results.
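For part (c), the fuzzy estimate can be built as the ratio of two sharp discontinuities: the jump in the outcome divided by the jump in the treatment probability, each from a kernel-weighted local linear regression. The sketch below uses simulated data that borrows the variable names retired, cn, and elig_year; the data-generating process, the bandwidth h, and the convention that eligibility starts at elig_year >= 0 are assumptions.

```r
set.seed(7)
n <- 30000

# Simulated stand-in: discrete running variable centered at the threshold.
elig_year <- sample(-20:20, n, replace = TRUE)
eligible  <- as.integer(elig_year >= 0)    # instrument: above the threshold

# Retirement probability jumps at eligibility but not from 0 to 1 (fuzzy).
retired <- rbinom(n, 1, plogis(-1.5 + 0.05 * elig_year + 1.5 * eligible))

# Consumption with an assumed retirement effect of -3000 euros per year.
cn <- 20000 - 100 * elig_year - 3000 * retired + rnorm(n, sd = 5000)

h <- 10                                    # hypothetical bandwidth
w <- pmax(0, 1 - abs(elig_year) / h)       # triangular kernel weights

# First stage: jump in the treatment probability at the threshold.
first_stage <- lm(retired ~ eligible * elig_year, weights = w, subset = w > 0)

# Reduced form: jump in the outcome at the threshold.
reduced_form <- lm(cn ~ eligible * elig_year, weights = w, subset = w > 0)

jump_treatment <- unname(coef(first_stage)["eligible"])
jump_outcome   <- unname(coef(reduced_form)["eligible"])
fuzzy_estimate <- jump_outcome / jump_treatment

c(first_stage = jump_treatment, reduced_form = jump_outcome,
  fuzzy = fuzzy_estimate)
```

The same ratio is what a two-stage least squares regression with eligible as the instrument would return under identical weights, which offers a route to standard errors.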
Find one paper in the applied literature that bases its identification strategy on combining regression discontinuity design and difference-in-differences approaches. One example is Ruan, Cai, & Jin (AJAE 2021). However, feel free to choose any recent paper from a reputable journal in your field that relies on these methods for its identification strategy.
Provide the following list of information:
(a). Full reference to the paper: DOI link.
(b). Very short description of the research question the paper tries to answer (1-2 paragraphs).
(c). Very short description of the data (1-2 paragraphs).
(d). Main argument of why regression discontinuity design and difference-in-differences are employed, justifying why regression discontinuity design alone would have provided weak causal inference and why difference-in-differences alone would have provided weak causal inference (2-3 paragraphs).
(e). Main finding(s) of the paper (1-2 paragraphs).
(f). Robustness checks that are conducted, including the verification of the main assumption(s) of the difference-in-differences and the validity checks for the regression discontinuity design (2-3 paragraphs).
HAVE FUN AND KEEP FAITH IN THE FUN!