Part 1: Please skim through the entire article first - A Guide To Using The Difference-In-Differences Regression Model
A. In a simple diff-in-diff paper, you are looking for either the 2 X 2 matrix of regression equations (so that you can construct the diff-in-diff estimator), or the following regression form: Y = α + β1(time) + β2(treatment) + β3(time*treatment), where
-β1 is the expected mean change in outcome from before to after the onset of the intervention era among the control group. It reflects, if you will, the pure effect of the passage of time in the absence of the actual intervention.
-β2 (coefficient of the treatment variable) is the estimated mean difference in Y between the treatment and control groups prior to the intervention: it represents whatever “baseline” differences existed between the groups before the intervention was applied to the treatment group.
-β3 by itself is the difference in differences estimator. In most contexts, it is β3 that is the focus of interest. It tells us whether the expected mean change in outcome from before to after was different in the two groups. (That would typically be the hallmark of an effective intervention, assuming adequate power, etc.)
-To get the estimated mean difference in Y between the treatment and control groups after the intervention, you need to look at β2 + β3. It is possible that you will find that β2 + β3 is significantly different from zero, even though neither β2 nor β3 by itself is.
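As a worked check on these interpretations, the four expected cell means implied by the regression can be written out. This is only algebra on the model stated above, assuming time and treatment are 0/1 dummies:

\[
\begin{aligned}
E[Y \mid \text{control, before}] &= \alpha \\
E[Y \mid \text{control, after}] &= \alpha + \beta_1 \\
E[Y \mid \text{treatment, before}] &= \alpha + \beta_2 \\
E[Y \mid \text{treatment, after}] &= \alpha + \beta_1 + \beta_2 + \beta_3
\end{aligned}
\]

Subtracting rows gives the before-to-after change in the control group (β1), the change in the treatment group (β1 + β3), their difference (β3, the diff-in-diff), and the gap between the groups after the intervention (β2 + β3).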
B. Now, you will run a simple linear regression in R on the dataset in the guide example above, where
-the dependent variable is HPI_CHG
-the independent variables are Time_Period (dummy), Disaster_Affected (dummy), and their interaction term (just multiply the two dummies).
# Load necessary libraries
library(readr)
library(dplyr)
# Load the data
data <- read_csv("./us_fred_coastal_us_states_avg_hpi_before_after_2005-1.csv") # Adjust the path to your file
Rows: 48 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): STATE
dbl (5): HPI_CHG, Time_Period, Disaster_Affected, NUM_DISASTERS, NUM_IND_ASSIST
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Creating the interaction term
data <- mutate(data, Interaction = Time_Period * Disaster_Affected)
# Running the linear regression model
model <- lm(HPI_CHG ~ Time_Period + Disaster_Affected + Interaction, data = data)
# Output the summary of the model
summary(model)
Call:
lm(formula = HPI_CHG ~ Time_Period + Disaster_Affected + Interaction,
data = data)
Residuals:
Min 1Q Median 3Q Max
-0.023081 -0.007610 -0.000171 0.004656 0.035981
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.037090 0.002819 13.157 < 2e-16 ***
Time_Period -0.027847 0.003987 -6.985 1.2e-08 ***
Disaster_Affected -0.013944 0.006176 -2.258 0.0290 *
Interaction 0.019739 0.008734 2.260 0.0288 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01229 on 44 degrees of freedom
Multiple R-squared: 0.5356, Adjusted R-squared: 0.504
F-statistic: 16.92 on 3 and 44 DF, p-value: 1.882e-07
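To connect this output to the interpretation in part A, the four group means can be rebuilt from the coefficients alone. The following is a minimal sketch that hard-codes the rounded estimates printed above (so the results carry rounding error); the object names are illustrative.

```r
# Coefficients copied (rounded) from the summary(model) output above
b0 <- 0.037090   # (Intercept): control group, before
b1 <- -0.027847  # Time_Period
b2 <- -0.013944  # Disaster_Affected
b3 <- 0.019739   # Interaction (the diff-in-diff estimate)

# Expected HPI_CHG in each of the four cells implied by the model
cell_means <- matrix(
  c(b0,      b0 + b2,
    b0 + b1, b0 + b1 + b2 + b3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Pre-Treatment", "Post-Treatment"),
                  c("Control", "Treatment"))
)
print(cell_means)

# Gap between treatment and control after the disaster period
post_gap <- b2 + b3
print(post_gap)
```

These implied values should agree, up to rounding, with group means computed directly from the data.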
C. Let’s try to understand and interpret what you have done.
-What is the control and the control group, and what is the treatment and the treatment group? 4 lines max.
Control: The condition of being unaffected by the disaster.
Control Group: Areas that experienced the control condition, meaning they were not impacted by the disaster.
Treatment: The disaster event itself.
Treatment Group: Areas that were directly affected by the disaster.
-The basic idea is to compare the difference in outcomes between the treatment and control groups before and after the treatment is introduced. By differencing these differences, what do you hope to achieve, and why should it work? Intuitively, why does the methodology work? Describe it in plain, simple English to your grandma.
The basic idea of this approach, called difference-in-differences, is to measure how much impact something specific (like a natural disaster) has on house prices by comparing two sets of areas: one that was affected by the disaster (treatment group) and one that wasn’t (control group).
Here’s why it should work and what we hope to achieve: By looking at how house prices change in both groups before and after the disaster, we can separate the changes that would have happened anyway (like market trends) from the changes that are due to the disaster. This way, we don’t mistakenly attribute normal price changes to the disaster.
In simpler terms: Imagine you and your neighbor both grow tomatoes in your gardens. This year, you tried a new fertilizer (the “treatment”) but your neighbor didn’t. Simply comparing this year’s harvests would be unfair, because your garden may always have been sunnier. Instead, you compare how much your harvest changed from last year with how much your neighbor’s changed. The weather hit both gardens alike, so any extra change in yours is the fertilizer’s doing. In the same way, comparing how house prices changed in affected and unaffected areas helps us see the disaster’s real impact.
-Create the 2 X 2 matrix of regression equations with actual values from the data. HINT - take the mean of your y variable, not the count in each cell. Does your difference in difference coefficient in the linear regression above match the difference in difference effect of the two groups from the 2*2 table created? It should match by the way.
# Assuming data has been loaded into a dataframe called 'data'
# and that it includes 'Time_Period', 'Disaster_Affected', and 'HPI_CHG'
# Calculate the mean HPI_CHG for each group and time period
library(dplyr)
# Creating a summary table to calculate means for each group
summary_table <- data %>%
  group_by(Time_Period, Disaster_Affected) %>%
  summarise(Mean_HPI_CHG = mean(HPI_CHG, na.rm = TRUE)) %>%
  ungroup()
`summarise()` has grouped output by 'Time_Period'. You can override using the
`.groups` argument.
# Display the summary table to see the means
print(summary_table)
# Creating the 2x2 matrix
# Assuming Time_Period and Disaster_Affected are both dummy variables 0 or 1
matrix_2x2 <- matrix(nrow = 2, ncol = 2)
colnames(matrix_2x2) <- c("Control", "Treatment")
rownames(matrix_2x2) <- c("Pre-Treatment", "Post-Treatment")
# Fill the matrix with calculated means
for (i in 0:1) {
  for (j in 0:1) {
    matrix_2x2[i + 1, j + 1] <- summary_table$Mean_HPI_CHG[summary_table$Time_Period == i & summary_table$Disaster_Affected == j]
  }
}
# Print the 2x2 matrix
print(matrix_2x2)
Control Treatment
Pre-Treatment 0.037090020 0.02314612
Post-Treatment 0.009242792 0.01503835
# Compute Difference-in-Differences estimate from the 2x2 matrix
DiD_estimate <- (matrix_2x2[2, 2] - matrix_2x2[1, 2]) - (matrix_2x2[2, 1] - matrix_2x2[1, 1])
print(paste("Difference-in-Differences Estimate from 2x2 Matrix: ", DiD_estimate))
[1] "Difference-in-Differences Estimate from 2x2 Matrix: 0.0197394560315789"
# Compare with regression coefficient of interaction term
# Assuming you have already run the regression and have the model stored in 'model'
interaction_coefficient <- summary(model)$coefficients["Interaction", "Estimate"]
print(paste("Regression Coefficient for Interaction Term: ", interaction_coefficient))
[1] "Regression Coefficient for Interaction Term: 0.0197394560315789"
# Checking if they match
if (round(DiD_estimate, 5) == round(interaction_coefficient, 5)) {
  print("The DiD estimate matches the regression coefficient.")
} else {
  print("The DiD estimate does not match the regression coefficient.")
}
[1] "The DiD estimate matches the regression coefficient."
The difference-in-differences (DiD) coefficient from my linear regression matches the DiD effect calculated manually from the 2x2 table I created. Both methods give a DiD estimate of approximately 0.019739456. The match is expected: with two dummies and their interaction the model has one parameter per cell, so the fitted values are exactly the four cell means and the interaction coefficient is, by construction, the difference in differences of those means.
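The same equivalence can be checked without building the interaction column by hand, since R's formula operator `*` expands to both main effects plus their interaction. Below is a minimal sketch on made-up toy data (not the HPI dataset; the outcome `y` and all parameter values are hypothetical).

```r
set.seed(1)  # hypothetical toy data, not the HPI dataset
toy <- expand.grid(unit = 1:10, Time_Period = 0:1, Disaster_Affected = 0:1)
toy$y <- 1 + 0.5 * toy$Time_Period - 0.3 * toy$Disaster_Affected +
  0.2 * toy$Time_Period * toy$Disaster_Affected + rnorm(nrow(toy), sd = 0.1)

# a * b in a formula means a + b + a:b
fit <- lm(y ~ Time_Period * Disaster_Affected, data = toy)
did_reg <- unname(coef(fit)["Time_Period:Disaster_Affected"])

# Cell means: rows are Time_Period (0/1), columns are Disaster_Affected (0/1)
m <- tapply(toy$y, list(toy$Time_Period, toy$Disaster_Affected), mean)
did_means <- (m["1", "1"] - m["0", "1"]) - (m["1", "0"] - m["0", "0"])

# Compare with a numerical tolerance rather than rounding
print(all.equal(did_reg, did_means))
```

Using `all.equal()` compares the two numbers up to floating-point tolerance, which is a little more robust than rounding both sides to a fixed number of digits.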
D. What are the “threats to identification”? In other words, what are the “implicit assumptions”, like the 5 Gauss Markov Assumptions/conditions we have for simple OLS? Alternatively, under what conditions can you trust the point estimates, and when would you take the study results with a grain of salt? Googling may help with weaknesses of the study, or with when the Diff-in-Diff methodology fails in general. (HINT: parallel trend assumption) (2-3 sentences at least).
Threats to Identification: The most critical threat in DiD analysis is a violation of the parallel trends assumption. This assumption says that, without the treatment, the outcome trajectories for the treated and control groups would have been similar over time. If these trends diverge, or if the groups are impacted differently by external shocks (violating the common shocks assumption), the DiD estimator becomes biased, because it attributes changes driven by other factors to the treatment.
Trust in Point Estimates: Trust in the point estimates from a DiD analysis is contingent on the validity of its underlying assumptions. The parallel trends assumption must hold for the results to be credible. Additionally, there should be no anticipation effects, meaning participants should not change their behavior before the treatment based on their expectations of it. When these assumptions are met, and the model appropriately accounts for potential confounders and external shocks, the point estimates can be considered reliable. However, if there’s evidence that these conditions are compromised, the study results should be taken with caution, acknowledging the potential for biased outcomes.
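One common way to probe parallel trends is to compare the groups' slopes using pre-treatment periods only. The HPI file used above appears to hold just one before and one after observation per state, so this cannot be run on it; the sketch below uses a made-up panel with four pre-periods (all names and numbers are hypothetical).

```r
set.seed(42)  # hypothetical panel with several pre-treatment periods
pre <- expand.grid(unit = 1:20, period = 1:4, treated = 0:1)
# Both groups share the same slope (0.1 per period) by construction;
# the treated group only differs in level (+0.5)
pre$y <- 2 + 0.1 * pre$period + 0.5 * pre$treated +
  rnorm(nrow(pre), sd = 0.01)

# period:treated measures how much the treated group's pre-period
# slope differs from the control group's slope
trend_fit <- lm(y ~ period * treated, data = pre)
slope_gap <- unname(coef(trend_fit)["period:treated"])
print(summary(trend_fit)$coefficients["period:treated", ])
```

A slope gap close to zero is consistent with parallel pre-trends, but it does not prove the trends would have stayed parallel after the treatment date; that part of the assumption is untestable.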
E. In the paper here that uses triple Diff, what are the three margins? Type out the estimating equation and explain the study design - why does it work (who are we comparing against whom)? 8 lines max.
Age Groups (Treatment vs. Control): Younger girls (14-15) eligible for bicycles versus older girls (16-17) who were not.
Gender: Changes in girls’ enrollment compared to boys’ within the same age groups.
Geographic Comparison: Bihar (where the program was implemented) versus neighboring Jharkhand (without the program).
\[ y = \beta_0 + \beta_1 (Female \times Treatment \times Bihar) + controls + \epsilon \]
Here Treatment marks the younger, eligible age group, and the controls must include the three main effects and the three two-way interactions so that \(\beta_1\) is the triple difference. The design works by comparing enrollment changes among younger girls eligible for bicycles in Bihar against older girls in Bihar, younger and older boys in Bihar, and the same groups in Jharkhand. This isolates the specific impact of the bicycle program from other demographic and regional influences on school enrollment.
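The triple-difference logic can be sketched in R in the same way as the simple DiD. The data below are simulated, the variable names (`female`, `treat_age`, `bihar`, `enrolled`) are illustrative rather than taken from the paper, and the outcome is treated as continuous for simplicity.

```r
set.seed(7)  # hypothetical data; variable names are illustrative
ddd <- expand.grid(id = 1:25, female = 0:1, treat_age = 0:1, bihar = 0:1)
ddd$enrolled <- 0.4 + 0.05 * ddd$female * ddd$treat_age * ddd$bihar +
  rnorm(nrow(ddd), sd = 0.05)

# Fully interacted model: 3 main effects, 3 two-way terms, 1 three-way term
fit3 <- lm(enrolled ~ female * treat_age * bihar, data = ddd)
ddd_reg <- unname(coef(fit3)["female:treat_age:bihar"])

# Same quantity from the eight cell means (dims: female, treat_age, bihar)
m <- tapply(ddd$enrolled, list(ddd$female, ddd$treat_age, ddd$bihar), mean)
did_state <- function(s) {
  # (young girls - old girls) minus (young boys - old boys) within state s
  (m["1", "1", s] - m["1", "0", s]) - (m["0", "1", s] - m["0", "0", s])
}
ddd_means <- did_state("1") - did_state("0")  # Bihar minus Jharkhand

print(all.equal(ddd_reg, ddd_means))
```

The three-way coefficient is the girl-versus-boy, young-versus-old difference in Bihar minus the same difference in Jharkhand, which is why the lower-order terms have to be in the model.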
Part 2: Download and read the following article: Card&Kreuger_Diff-in-Diff_paper_reading.pdf (I would begin with reading the abstract, then the conclusion, and then the first half of the paper/sections such that you understand Table 1, Table 2, the two graphs of before and after wage distribution {pg 777}, and Table 3.). Thus, try reading the original article from page 772 to page 779 (end of section III. Employment Effects of the Minimum-Wage Increase A. Differences in Differences).
The most important task is that you understand Table 3 on page 780. There are many articles online that are short summaries of the paper and also provide table replication code, like Leppert 2020 or Mamula 2021, if you are interested in replicating the paper yourself.
Also, note that the difference in difference estimate can be computed in two ways - by running a regression and looking at the interaction term of the two margins (treatment versus control group, and before versus after the policy change), or by looking at the margins table constructed from the same raw data - see example 2 on page 4 of Card&Krueger_Diff-in-Diff_summary.pdf. Both approaches will give you the same point estimate of 2.75 in our study.
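The margins-table route can be sketched in a few lines of R. The four means below are the rounded full-time-equivalent employment figures as I understand them from Table 3 of the paper; treat them as an assumption and verify them against the table before relying on them.

```r
# Mean FTE employment per store, before and after the NJ minimum-wage rise.
# ASSUMED values (rounded) from Table 3; check them against the paper.
fte <- matrix(
  c(23.33, 20.44,
    21.17, 21.03),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Before", "After"), c("PA", "NJ"))
)

# Before-to-after change within each state
change <- fte["After", ] - fte["Before", ]

# Difference in differences: NJ change minus PA change
did <- unname(change["NJ"] - change["PA"])
print(round(did, 2))
```

With these inputs the New Jersey change is 0.59 and the Pennsylvania change is -2.16, so the difference of the two is 2.75.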
After reading the discussion article and resources above, post your comments on what you found to be interesting and what you learned using the following template -
ARTICLE SUMMARY
What are the authors trying to do? In other words, what is the economic hypothesis that the authors are trying to test for? (2-3 sentences in your own words - EG do not simply copy/paste from the article abstract or conclusion) -> DOES MW REDUCE EMPLOYMENT?
The authors, Card and Krueger, are examining the impact of minimum wage increases on employment levels. They aim to test the conventional economic hypothesis that raising the minimum wage leads to lower employment. Their study specifically investigates whether this holds true by analyzing the effects of a minimum wage increase in New Jersey compared to Pennsylvania, where the minimum wage remained unchanged. The final results do not show that the minimum wage increase reduces employment.
What do the authors find? Do the results make sense under a perfectly competitive labor market (where binding minimum wages cause more unemployment)? Explain. (2-3 sentences) -> MW INCREASES EMPLOYMENT! NO, RESULTS DO NOT MAKE SENSE UNDER STANDARD PERFECTLY COMPETITIVE (LABOR) MKT ASSUMPTION.
EXTRA: How can you reconcile the study results within an economic model? EG. The results make sense under a monopsony market structure where wages increase from W2 to W1, where you can see that as wages increase employment also increases. -> MONOPSONY (different from monopoly) MARKET STRUCTURE RECONCILES THE STUDY RESULTS.
The authors, Card and Krueger, find that an increase in minimum wage did not decrease employment levels and might have actually increased employment. This outcome contradicts the predictions of a perfectly competitive labor market, where higher minimum wages should lead to less employment due to increased labor costs. These unexpected results suggest that the labor market may not be perfectly competitive and are better explained by a monopsony model, where fewer employers control the market and higher wages can actually increase employment by reducing turnover and filling vacancies more effectively.
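The monopsony argument can be made concrete with a small numerical sketch. The linear supply and marginal revenue product curves and every parameter value below are made up purely for illustration; they are not estimates from the paper.

```r
# Hypothetical linear labor market (all parameter values are made up)
a <- 2;  b <- 1    # inverse labor supply: w = a + b * L
c0 <- 20; d <- 1   # marginal revenue product: MRP = c0 - d * L

# Competitive benchmark: supply wage equals MRP
L_comp <- (c0 - a) / (b + d)
w_comp <- a + b * L_comp

# Monopsony: marginal cost of labor (a + 2 * b * L) equals MRP
L_mono <- (c0 - a) / (2 * b + d)
w_mono <- a + b * L_mono

employment_at <- function(w_min) {
  # With a binding wage floor (w_min >= w_mono), employment is the short
  # side of the market: workers willing to work vs. workers worth hiring
  min((w_min - a) / b, (c0 - w_min) / d)
}

L_floor <- employment_at(10)  # floor between w_mono and w_comp
print(c(L_mono = L_mono, L_floor = L_floor, L_comp = L_comp))
```

In this toy market a wage floor between the monopsony wage and the competitive wage raises employment above the monopsony level, while a floor above the competitive wage (for example `employment_at(13)`) pushes employment below the competitive level, as the standard model predicts.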
METHODOLOGY / CRITICAL ANALYSIS
What is the treatment here? What is the treatment group, and what is the control group? -> CHANGE IN MW LAWS, NJ IS THE TREATMENT GROUP THAT SAW AN INCREASE IN MW, WHILE THE CONTROL GROUP IS PA.
The treatment in this study is the increase in the minimum wage in New Jersey. Consequently, New Jersey serves as the treatment group that experienced the minimum wage increase. Pennsylvania, where the minimum wage remained unchanged, serves as the control group, allowing for a comparative analysis of employment changes due to the wage policy alteration.
Intuitively, why does methodology work? Describe it in plain simple English to your grandma. Also, do you buy the study’s results and are willing to change the minimum wage of your state if you were the governor? How convinced are you with the study’s methodology? (2-3 sentences at least) -> YOU HAVE EFFECTIVELY CREATED A COUNTERFACTUAL/BASELINE AND CONTROL FOR COMMON TRENDS OR FACTORS THAT AFFECT BOTH GROUPS EQUALLY…simplify more for grandma…NO, I AM NOT CONVINCED ENOUGH TO MAKE POLICY CHANGES BASED ON THE STUDY (personal choice).
This study compares what happened to jobs in New Jersey, where they raised the minimum wage, to Pennsylvania, where they didn’t change the wage. This method works because it’s like watching two neighbors’ gardens to see if the fertilizer one uses makes flowers bloom better than in the garden without it. This comparison helps us figure out if the fertilizer—here, the wage increase—really does help. Personally, though I find the study interesting, I would hesitate to immediately raise the minimum wage based solely on these results. I’d want more evidence before making such a big decision because it’s important to be sure that the change would have the desired effect without unintended consequences.
What are the “threats to identification”? In other words, what are the “implicit assumptions”, like the 5 Gauss Markov Assumptions/conditions we have for simple OLS? Alternatively, under what conditions can you trust the point estimates, and when would you take the study results with a grain of salt? Googling may help with weaknesses of the study, or with when the Diff-in-Diff methodology fails in general. (HINT: parallel trend assumption to begin with) (2-3 sentences at least)
One of the main threats to identification in a Difference-in-Differences (DiD) analysis like this is the violation of the parallel trends assumption. This assumption requires that, absent the intervention (the change in minimum wage), the employment trends in both the treatment group (New Jersey) and the control group (Pennsylvania) would have followed similar trajectories. If this assumption doesn’t hold—if, for example, New Jersey was already on a different economic path due to unrelated policy changes or economic shifts—the results might be misleading. Therefore, while the study provides valuable insights, it’s important to consider the possibility of pre-existing differences between the groups, as these could bias the findings and should prompt caution in interpreting the results too definitively without further validation.
Part 3:
OPTIONAL: Another Econometric Debate (on impact of low skilled immigration)
Mariel BoatLift
Background in pictures. Paper. Paper Explained (the economic debates that followed are interesting). Summary of debates. Crime and the Mariel Boatlift.
The Mariel Boatlift of 1980, where approximately 125,000 Cubans migrated to Miami, provides a crucial case study on the economic and social impacts of large-scale immigration. David Card’s initial 1990 study indicated that this influx had no significant adverse effects on the wages and employment of local low-skilled workers, challenging traditional economic predictions that a labor supply surge would depress wages and raise unemployment. However, subsequent critiques, particularly by George Borjas, suggested that specific subgroups did experience wage declines, though further analyses criticized the methodologies used and suggested demographic shifts within samples might have influenced these findings. Additionally, a study by Alexander Billy and Michael Packard using synthetic control methods showed that the Boatlift led to an increase in property crimes and murders, attributed partly to the demographic characteristics of the immigrants, many of whom were young males with some having criminal backgrounds. This complex scenario highlights the nuanced effects of immigration on local economies and communities, underscoring the need for careful policy planning and integration strategies to manage such dynamics effectively.
A source that may be helpful: the paper by Jessica Lynn Peck titled “Does Uber Reduce Drunk Driving? Evidence from a Natural Experiment in Las Vegas” (2017).