GENERAL NOTES

Objective of Problem Set 6

  • In this final problem set of the semester, you will practice multiple causal inference methods in R. You will be asked to combine the difference-in-differences approach with other methods (synthetic control, regression discontinuity design). You will also practice machine learning and event studies.

Submission process

  • When you are done with the problem set, publish it on RPubs using your temporary account.
  • Copy the RPubs link of your work and submit it on Canvas.
  • If two submissions are entirely identical, whoever sent me their RPubs link last is treated as having copied the other, so timely submissions are important. Own your work. I may randomly ask for your R script and .Rmd files for double-checking purposes. As a standard practice, work in a script file before building your code chunks in the .Rmd file. Your .Rmd file and RPubs submission page MUST show the code used to produce any of the outputs you present in your answers.
  • As discussed in class, you may choose to treat two of the problems as optional and focus on solving the remaining two.

Academic integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

# Load packages (p_load installs any missing packages from CRAN, then loads them)
# install.packages("pacman")   # run once if pacman itself is not yet installed
library(pacman)
p_load(devtools, synthdid, did, DRDID, gtrendsR, glmnet, causalweight, nnls, grf, 
       rdrobust, rdd, RDHonest, tidyverse, lubridate, usmap, gridExtra, stringr, readxl, 
       reshape2, scales, broom, data.table, ggplot2, stargazer, modelsummary,
       foreign, ggthemes, ggforce, ggridges, latex2exp, viridis, extrafont, 
       kableExtra, snakecase, janitor, tableone, lfe, fixest, multcomp, etable)

PROBLEM 1. Diff.-in-Diff. and Synthetic Control

i. Partial replication

  • Table 1. Replicate Table 1 of the paper Arkhangelsky, Athey, Hirshberg, Imbens, and Wager (AER 2021), focusing on the first three estimators: synthetic difference-in-differences (SDID), synthetic control (SC), and difference-in-differences (DID). You may use the replication package folder 146381-V1 of the paper and the script file synthdid-sdid-paper.R, both available on Canvas.

  • Figure 1. Replicate Figure 1 of the same paper and interpret it in your own words.
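
Below is a minimal sketch of one possible starting point. It assumes the california_prop99 panel that ships with the synthdid package matches the replication data; adapt it if you work directly from the 146381-V1 folder instead.

# Minimal sketch: SDID, SC, and DID estimates on the built-in california_prop99 panel
library(synthdid)
data("california_prop99", package = "synthdid")
setup <- panel.matrices(california_prop99)   # outcome matrix Y, N0 control units, T0 pre-periods

tau_sdid <- synthdid_estimate(setup$Y, setup$N0, setup$T0)   # synthetic diff-in-diff
tau_sc   <- sc_estimate(setup$Y, setup$N0, setup$T0)         # synthetic control
tau_did  <- did_estimate(setup$Y, setup$N0, setup$T0)        # diff-in-diff

# Table 1-style point estimates and placebo standard errors
estimates <- list(sdid = tau_sdid, sc = tau_sc, did = tau_did)
sapply(estimates, as.numeric)
sapply(estimates, function(e) sqrt(vcov(e, method = "placebo")))

# Figure 1-style plot for the SDID estimate
synthdid_plot(tau_sdid)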

ii. Extensions

  • Restricted donor pool. Remove the states of Nevada and New Hampshire entirely from the data and redo the analysis. Generate the corresponding Table 1 and Figure 1. You may name the new outputs Table 2 and Figure 2 to differentiate them from part (i). How do the results change? (A minimal filtering sketch follows this list.)

  • Restricted pre-treatment periods. Remove the pre-treatment periods 1970-1974 entirely from the data and redo the analysis. Name the new outputs Table 3 and Figure 3. How do the results change?

  • Restricted post-treatment periods. Remove the post-treatment periods 1996-2000 entirely from the data and redo the analysis. Name the new outputs Table 4 and Figure 4. How do the results change?
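
Each restriction only requires filtering the panel before rebuilding the matrices. A hedged sketch, reusing the objects from part (i) and assuming the built-in panel's State and Year column names (only the SDID call is shown; the SC and DID calls and the plots follow the same pattern):

# Restricted donor pool: drop Nevada and New Hampshire (Table 2 / Figure 2)
no_nv_nh <- subset(california_prop99, !State %in% c("Nevada", "New Hampshire"))
setup_2  <- panel.matrices(no_nv_nh)
synthdid_estimate(setup_2$Y, setup_2$N0, setup_2$T0)

# Restricted pre-treatment periods: drop 1970-1974 (Table 3 / Figure 3)
no_early <- subset(california_prop99, !Year %in% 1970:1974)
setup_3  <- panel.matrices(no_early)
synthdid_estimate(setup_3$Y, setup_3$N0, setup_3$T0)

# Restricted post-treatment periods: drop 1996-2000 (Table 4 / Figure 4)
no_late <- subset(california_prop99, !Year %in% 1996:2000)
setup_4 <- panel.matrices(no_late)
synthdid_estimate(setup_4$Y, setup_4$N0, setup_4$T0)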

PROBLEM 2. Event Study

i. “Tariffs Web Searches” event study

In the last five years of US data, web searches for ‘Tariffs’ began to soar only toward the end of the 2024 election period, peaking from late March through April 2025. A similar trend is observed in Canada and Mexico.

Your task is to evaluate the static and dynamic effects of a ‘Tariffs Web Searches’ event on an outcome of your choice. Set up the ‘Tariffs Web Searches’ event in October 2024, and use monthly data on your chosen outcome spanning from April 2020 through March 2025 (or at least some time window covering the periods before and after the event) for three countries of interest: the US, Canada, and Mexico. Using an estimation method of your choice, analyze the impact of the event on the outcome.

Data on your chosen outcome and covariates (if any) must be publicly available. For example, you can use country-year-month-level data from sources like https://fred.stlouisfed.org/, https://www.bea.gov/, international data repositories, etc.
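
A hedged estimation sketch is below. The data frame panel and its columns (country, date, outcome) are placeholders; the sketch uses the Google Trends series itself as a stand-in outcome purely so the code runs end to end, and you should swap in your chosen outcome and specification. Because all three countries experience the event in the same month, there is no untreated comparison group, so the sketch uses country fixed effects only.

# Event-study sketch (placeholder outcome: the Google Trends series itself)
library(gtrendsR)
library(fixest)
library(dplyr)
library(lubridate)

# Monthly 'Tariffs' searches for the US, Canada, and Mexico, April 2020 - March 2025
trends <- gtrends(keyword = "Tariffs", geo = c("US", "CA", "MX"),
                  time = "2020-04-01 2025-03-31")$interest_over_time

# Stand-in country-month panel; replace `outcome` with your chosen outcome series
panel <- trends %>%
  mutate(date = as.Date(date), country = geo,
         outcome = suppressWarnings(as.numeric(hits))) %>%
  group_by(country, date) %>%
  summarise(outcome = mean(outcome, na.rm = TRUE), .groups = "drop")

# Event time in months relative to October 2024
panel <- panel %>%
  mutate(event_time = 12 * (year(date) - 2024) + (month(date) - 10),
         post = as.integer(event_time >= 0))

# Static effect: country fixed effects only (month fixed effects would absorb the common event)
static <- feols(outcome ~ post | country, data = panel)
summary(static)

# Dynamic effects, normalized on the month before the event
dynamic <- feols(outcome ~ i(event_time, ref = -1) | country, data = panel)
iplot(dynamic)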

PROBLEM 3. Machine Learning

i. Ridge regression

  • Suppose you have the model \(Y = X'\beta + U\), where \(\mathop{E}\left[U|X\right]=0\) and \(X\) has full rank. You want to predict \(Y\) using the ridge regression estimator \(\beta_{R}=(X'X + \lambda I)^{-1}X'Y\), with a tuning parameter \(\lambda > 0\). Will the estimator \(\beta_{R}\) be biased for \(\beta\)?
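
If you want to cross-check your analytical answer numerically, here is a small simulation sketch; the sample size, \(\lambda\), and coefficient values are illustrative placeholders.

# Illustrative simulation: average ridge estimate versus the data-generating beta
set.seed(1)
n <- 500; p <- 3; lambda <- 10
beta <- c(1, -2, 0.5)
ridge_once <- function() {
  X <- matrix(rnorm(n * p), n, p)
  Y <- X %*% beta + rnorm(n)
  drop(solve(t(X) %*% X + lambda * diag(p)) %*% t(X) %*% Y)   # closed-form ridge estimate
}
rowMeans(replicate(200, ridge_once()))   # compare with beta = (1, -2, 0.5)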

ii. LASSO

Practice dataset. You will use the housing dataset testdata20250121.RDS posted on Canvas and previously used in Problem Sets 1 & 2. It has information on the sale price of the house, the state, the year in which the house was built, the number of bedrooms, bathrooms, fireplaces, and stories, the square footage, and the presence or absence of AC, among other attributes.

  • (a). Clean the dataset as previously preprocessed in Problem Set 1. Then, divide it into two subsets: a training dataset that excludes data from New York, Colorado, and California, and a test dataset comprising only data from these three states.

  • (b). Estimate a hedonic price model on the training dataset using LASSO regression, where your response \(y\) is log_sale_amount; your \(x\) matrix includes year_built, bedrooms_all_buildings, number_of_bathrooms, number_of_fireplaces, stories_number, log_land_square_footage, and AC_presence; you use 5-fold cross-validation and evaluate models using mean squared error (MSE). (A minimal cv.glmnet sketch follows this list.)

  • (c). What is the optimal tuning parameter lambda? What are the LASSO coefficients in your retained model? Based on the LASSO regression, what is the marginal willingness to pay for AC in your training dataset?

  • (d). Based on the LASSO coefficients derived from the training dataset and the average housing attributes in the test dataset, calculate the difference in predicted average house price between homes with AC and homes without AC in the test dataset.

  • (e). Compute the difference in observed average house price between homes with AC and homes without AC in the test dataset. How close is this observed difference to the predicted difference?

  • (f). Using the test data, run an OLS regression of \(y\) on \(x\) separately for each state. Determine the marginal willingness to pay for AC in New York, Colorado, and California. Compare these values to the LASSO-estimated marginal willingness to pay for AC obtained from the training dataset. How similar are the results?
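
For parts (b)-(d), a minimal cv.glmnet sketch is below. The data frame names train_df and test_df are hypothetical stand-ins for the cleaned training and test subsets from part (a), and the predictions are on the log-price scale (convert as needed for part (d)).

# LASSO sketch for parts (b)-(d); train_df and test_df are the cleaned subsets from part (a)
library(glmnet)
x_vars <- c("year_built", "bedrooms_all_buildings", "number_of_bathrooms",
            "number_of_fireplaces", "stories_number",
            "log_land_square_footage", "AC_presence")
x_train <- as.matrix(train_df[, x_vars])
y_train <- train_df$log_sale_amount

set.seed(1)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1,        # alpha = 1 is the LASSO penalty
                      nfolds = 5, type.measure = "mse")
cv_lasso$lambda.min                     # optimal tuning parameter (part c)
coef(cv_lasso, s = "lambda.min")        # retained LASSO coefficients (part c)

# Part (d): predicted log prices at the test-set average attributes, with and without AC
x_bar   <- colMeans(as.matrix(test_df[, x_vars]))
x_ac    <- replace(x_bar, "AC_presence", 1)
x_no_ac <- replace(x_bar, "AC_presence", 0)
predict(cv_lasso, newx = rbind(x_ac, x_no_ac), s = "lambda.min")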

iii. Partialling out

You have access to billions of observations in your dataset and intend to estimate Equation (1) using OLS. However, the model takes a very long time to run and eventually crashes. You explain to your friend that you are unable to estimate Equation (1) successfully using OLS. Fortunately, you discover that estimating Equation (2) or Equation (3) using OLS is considerably faster and runs successfully. \[\begin{equation} \tag{1} Y_{i} = X_{1}^{'}\beta_{1}+X_{2}^{'}\beta_{2}+\varepsilon_{i}. \end{equation}\]

\[\begin{equation} \tag{2} Y_{i} = X_{1}^{'}\overline{\beta}_{1}+\mu_{i}. \end{equation}\]

\[\begin{equation} \tag{3} Y_{i} = X_{2}^{'}\overline{\beta}_{2}+\nu_{i}. \end{equation}\]

Your friend proposed two alternative solutions. Alternative 1: you estimate Equation (2), obtain the residuals \(\hat{\mu}_{i}\), and then regress these residuals on the set of regressors \(X_{2}\) as in Equation (4) using OLS. Alternative 2: you estimate Equation (3), obtain the residuals \(\hat{\nu}_{i}\), and then regress these residuals on the set of regressors \(X_{1}\) as in Equation (5) using OLS. Their claim is that \(\tilde{\beta}_{2}=\beta_{2}\) and \(\tilde{\beta}_{1}=\beta_{1}\).

\[\begin{equation} \tag{4} \hat{\mu}_{i} = X_{2}^{'}\tilde{\beta}_{2}+\omega_{i}. \end{equation}\]

\[\begin{equation} \tag{5} \hat{\nu}_{i} = X_{1}^{'}\tilde{\beta}_{1}+\rho_{i}. \end{equation}\]

  • Is your friend’s claim true or false? Explain your reasoning.
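
One way to probe the claim before writing your answer is a small simulation in which \(X_{1}\) and \(X_{2}\) are deliberately correlated; all numbers below are illustrative placeholders.

# Simulation sketch: compare the full OLS fit with the two residual-based alternatives
set.seed(1)
n  <- 10000
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)          # X1 and X2 are correlated on purpose
y  <- 1 * x1 + 2 * x2 + rnorm(n)   # true beta_1 = 1, beta_2 = 2

coef(lm(y ~ x1 + x2))              # Equation (1): the benchmark

mu_hat <- resid(lm(y ~ x1))        # Equation (2) residuals
coef(lm(mu_hat ~ x2))              # Alternative 1: compare with beta_2

nu_hat <- resid(lm(y ~ x2))        # Equation (3) residuals
coef(lm(nu_hat ~ x1))              # Alternative 2: compare with beta_1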

iv. Debiased machine learning

  • (a). Using the California subset of the test dataset, apply double machine learning to re-estimate the marginal willingness to pay for AC. Use the treatDML command from the causalweight package, with LASSO as the machine learning method and 3-fold cross-fitting. (A minimal sketch follows this list.)

  • (b). How could you improve the results?
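
For part (a), a minimal treatDML sketch is below, where test_ca is a hypothetical name for the California subset of the test data with the variable names from part ii.

# Debiased ML sketch; test_ca is the California subset of the test data
library(causalweight)
y <- test_ca$log_sale_amount
d <- test_ca$AC_presence
x <- as.matrix(test_ca[, c("year_built", "bedrooms_all_buildings",
                           "number_of_bathrooms", "number_of_fireplaces",
                           "stories_number", "log_land_square_footage")])
set.seed(1)
dml <- treatDML(y = y, d = d, x = x, MLmethod = "lasso", k = 3)
c(effect = dml$effect, se = dml$se, pvalue = dml$pval)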

PROBLEM 4. Regression Discontinuity Design

i. Conventional and Robust RDD (Sharp design)

Practice dataset. You will work with the built-in data rdrobust_RDsenate of the rdrobust package from Calonico, Cattaneo, Farrell, and Titiunik (2021) in R. You have about 1,400 observations on US Senate elections, and you want to assess whether winning a seat in the US Senate in one election provides an advantage for winning the same seat in the next election. The running variable margin is the Democratic Party’s margin of victory relative to the Republican Party in the preceding US Senate election: if positive, the Democratic Party won the seat in the previous election; if negative, the Democratic Party lost the seat in the previous election. The outcome vote is the Democratic Party’s percentage of votes in the next US Senate election. The treatment is the binary indicator for the Democratic Party winning the previous US Senate election. You can assume that at the threshold value zero of margin, treated and control groups are comparable in terms of regional, political, and other voting background characteristics.

  • (a). Plot the outcome against the running variable and show the sharp discontinuity in mean outcomes at the threshold value zero of the running variable.

  • (b). The selection of an optimal bandwidth for the effect estimation has been discussed in several papers, including Calonico, Cattaneo, and Titiunik (Econometrica 2014).

    • Using the rdrobust command, present conventional and robust inference results of the impact evaluation. Interpret the output.

    • Without relying on the rdrobust command, reproduce the conventional inference results.

  • (c). Is the running variable continuous at the threshold?

    • Plot a histogram of the density of the running variable and the number of observations in each bin of the running variable.

    • Using the DCdensity command of the rdd package by Dimmery (2016), perform McCrary’s (JoE 2008) test for a discontinuity in the density of the running variable at the threshold. Interpret the output.
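
A minimal sketch covering parts (a)-(c), assuming the built-in rdrobust_RDsenate data:

# Sharp RDD sketch with the built-in US Senate data
library(rdrobust)
library(rdd)
data("rdrobust_RDsenate", package = "rdrobust")
senate <- na.omit(rdrobust_RDsenate)
x <- senate$margin   # running variable: Democratic margin in the previous election
y <- senate$vote     # outcome: Democratic vote share in the next election

# (a) Binned RD plot showing the jump in mean outcomes at the cutoff
rdplot(y, x, c = 0,
       x.label = "Democratic margin, previous election",
       y.label = "Democratic vote share, next election")

# (b) Conventional and robust inference with the MSE-optimal bandwidth
summary(rdrobust(y, x, c = 0))

# (c) Density of the running variable and McCrary test at the cutoff
hist(x, breaks = 50, main = "Running variable", xlab = "margin")
DCdensity(x, cutpoint = 0)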

ii. Conventional and Robust RDD (Fuzzy design)

Practice dataset. You will work with the built-in data rcp of the RDHonest package from Kolesar (2021) in R. You have about 30,000 observations on household consumption, actual retirement, and pension eligibility. You want to assess the effect of actual retirement on household consumption. The treatment retired is the binary indicator for actual retirement. The outcome cn is household expenditure on non-durable goods in euros per year. The running variable elig_year is the age in years from reaching pension eligibility, which is centered such that the threshold is zero. You can consider being just above or below the threshold as an instrument for the treatment.

  • (a). Create a plot of the treatment against the running variable, and another plot of the outcome against the running variable. Based on these visualizations, summarize your argument in support of a fuzzy regression discontinuity design.

  • (b). Using the rdrobust command with its fuzzy argument, present conventional and robust inference results of the impact evaluation. Interpret the output.

  • (c). Without relying on the rdrobust command, reproduce the conventional inference results.
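
For parts (a)-(b), a minimal sketch, assuming the rcp data that ships with RDHonest:

# Fuzzy RDD sketch with the retirement-consumption data
library(RDHonest)
library(rdrobust)
data("rcp", package = "RDHonest")

# (a) Treatment take-up and outcome around the eligibility cutoff
rdplot(rcp$retired, rcp$elig_year, c = 0,
       x.label = "Years since reaching pension eligibility", y.label = "Share retired")
rdplot(rcp$cn, rcp$elig_year, c = 0,
       x.label = "Years since reaching pension eligibility",
       y.label = "Non-durable consumption (euros/year)")

# (b) Fuzzy estimate: crossing the eligibility threshold instruments for actual retirement
summary(rdrobust(y = rcp$cn, x = rcp$elig_year, c = 0, fuzzy = rcp$retired))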

iii. RDD and Diff.-in-Diff.

Find one paper in the applied literature whose identification strategy combines the regression discontinuity design and difference-in-differences approaches. One example is Ruan, Cai, & Jin (AJAE 2021), but feel free to suggest any recent paper from a reputable field journal that relies on these methods in its identification strategy.

Provide the following list of information:

  • (a). Full reference to the paper: DOI link.

  • (b). Very short description of the research question the paper tries to answer (1-2 paragraphs).

  • (c). Very short description of the data (1-2 paragraphs).

  • (d). Main argument for why regression discontinuity design and difference-in-differences are employed together, justifying why regression discontinuity design alone would have provided weak causal inference and why difference-in-differences alone would have provided weak causal inference (2-3 paragraphs).

  • (e). Main finding(s) of the paper (1-2 paragraphs).

  • (f). Robustness checks that are conducted, including the verification of the main assumption(s) of the difference-in-differences and the validity checks for the regression discontinuity design (2-3 paragraphs).

HAVE FUN AND KEEP FAITH IN THE FUN!