Difference-in-Differences

Dr. Arvind Sharma

Introduction

Difference-in-differences (DiD) is a statistical technique commonly used in econometrics and the social sciences to estimate the causal effect of an intervention or policy change (the “treatment”) on an outcome of interest.
Origins of Diff-in-Diff: John Snow’s Cholera¹ Hypothesis (waterborne) in 1855

Not Jon Snow, the fictional character in George R. R. Martin’s fantasy novel series “A Song of Ice and Fire” and its HBO television adaptation, Game of Thrones

John Snow, the English physician, a founder of early germ theory and a leader in the development of medical hygiene

Card and Krueger (1994) popularized the method in Economics (classic minimum wage study)

Treatment = Event of Interest.

If you are interested in the impact of a worker training program on employment levels, the people who are enrolled in the program and receive the training are the treatment group (TG). The people who are not enrolled and do not receive the training are the control group (CG). The treatment/intervention is the training program.

If you’re studying the impact of a new educational program on student performance, the treatment is the implementation of that program. TG: students who received the new program; CG: students who did not.

In a healthcare setting, the treatment could be the administration of a new drug or therapy. TG: people who received the new drug; CG: people who did not.

In a policy evaluation, the treatment might be the implementation of a new law or regulation. TG: states that implement the regulation, or the firms affected by it; CG: states or firms that are not.

It is now one of the most popular reduced-form causal methods.

Methodology Intuition

We compare the change in the outcome variable between a treatment group (those affected by the intervention or policy change) and a control group (those not affected) before and after the intervention, i.e. over time
- By comparing these changes, DiD attempts to isolate the causal effect of the treatment from other factors that might also influence the outcome
The logic behind DiD is that, had the event never happened, the difference between the treatment group and the control group would have stayed the same over time

Identifying Assumptions

In the canonical difference-in-differences model, where two time periods are available, there is a treated population of units that receives a treatment of interest beginning in the second period, and a comparison population that does not receive the treatment in either period

The key identifying assumption is that the average outcome among the treated and comparison populations would have followed ‘parallel trends’ in the absence of treatment
We also assume that the treatment has no causal effect before its implementation (no anticipation)

Together, these assumptions allow us to identify the average treatment effect on the treated (ATT)
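
One way to write these assumptions down formally is in potential-outcomes notation. The notation is an assumption introduced here for illustration, not something the slides fix: \(Y_t(1)\) and \(Y_t(0)\) are the outcomes in period \(t\) with and without treatment, \(Y_t\) is the observed outcome, \(D = 1\) marks the treated population, and \(t = 1, 2\) are the pre and post periods.

\[
\begin{aligned}
\text{Parallel trends:}\quad & E[Y_2(0) - Y_1(0) \mid D = 1] = E[Y_2(0) - Y_1(0) \mid D = 0] \\
\text{No anticipation:}\quad & Y_1(1) = Y_1(0) \ \text{for treated units} \\
\text{ATT:}\quad & E[Y_2(1) - Y_2(0) \mid D = 1] = \big( E[Y_2 \mid D = 1] - E[Y_1 \mid D = 1] \big) - \big( E[Y_2 \mid D = 0] - E[Y_1 \mid D = 0] \big)
\end{aligned}
\]

The last line follows by using parallel trends to replace the unobserved counterfactual change of the treated group with the observed change of the comparison group, and no anticipation to equate the treated group's observed pre-period outcome with its untreated potential outcome.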

Structure of Diff in Diff Model

Regression & Two Way Table Setup

\[ y = \beta_0 + \beta_1 \ Time + \beta_2 \ Treated + \beta_3 \ Time * Treated + \epsilon \]

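Cell means: A = treatment group, pre; B = control group, pre; C = treatment group, post; D = control group, post.

Coefficient	Calculation	Interpretation
\(\beta_0\)	B	Baseline average (control group, pre-intervention)
\(\beta_1\)	D - B	Time trend in the control group
\(\beta_2\)	A - B	Difference between the two groups pre-intervention
\(\beta_3\)	(C - A) - (D - B)	Difference in changes over time (the DiD estimate)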
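
The mapping in the table can be checked numerically. The following is a minimal sketch on simulated data, assuming A, B, C and D denote the treatment-pre, control-pre, treatment-post and control-post cell means (the reading implied by the Calculation column). The data frame `sim`, the `cell_*` names and all numbers are made up for illustration.

```r
set.seed(1)

# Simulated 2x2 design: 50 observations per cell (all values are hypothetical)
n <- 200
sim <- data.frame(
  Time    = rep(c(0, 1), each  = n / 2),
  Treated = rep(c(0, 1), times = n / 2)
)

# Outcome generated with beta0 = 10, beta1 = 2, beta2 = -1, beta3 = 3
sim$y <- 10 + 2 * sim$Time - 1 * sim$Treated +
  3 * sim$Time * sim$Treated + rnorm(n)

# Cell means, labelled as in the table
cell_A <- mean(sim$y[sim$Treated == 1 & sim$Time == 0])  # treatment, pre
cell_B <- mean(sim$y[sim$Treated == 0 & sim$Time == 0])  # control, pre
cell_C <- mean(sim$y[sim$Treated == 1 & sim$Time == 1])  # treatment, post
cell_D <- mean(sim$y[sim$Treated == 0 & sim$Time == 1])  # control, post

fit <- lm(y ~ Time * Treated, data = sim)

# Regression coefficients next to the cell-mean calculations
comparison <- cbind(
  regression = unname(coef(fit)),
  cell_means = c(cell_B,
                 cell_D - cell_B,
                 cell_A - cell_B,
                 (cell_C - cell_A) - (cell_D - cell_B))
)
rownames(comparison) <- names(coef(fit))
print(comparison)
```

Because the model has one parameter per cell, the two columns agree up to floating-point error. The estimates differ from the generating values only through sampling noise.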

Example

Use the DiD model to estimate the effect of coastal weather events (2005 Atlantic hurricane season) on house prices

Figure: Path of the hurricane (Gulf of Mexico)

Figure: Home damage by hurricane

The Federal Emergency Management Agency (FEMA) released funds to rebuild homes in adversely affected states

\(1^{st}\) Margin: Pre / Post

The 2005 Atlantic hurricane season was the most active hurricane season in recorded history up until 2020
We can create a time dummy that equals 1 for the period after the hurricane/treatment and 0 for the period before

\(2^{nd}\) Margin: Control / Treatment

We can create a treatment dummy for coastal states that were affected by the 2005 hurricane season and were therefore given FEMA funding for housing reconstruction.
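
The two margins can be coded as 0/1 dummies. This is a minimal sketch under assumptions: the raw data frame `raw`, its `state` and `year` columns, the state labels, the 2005 cutoff rule and the `affected_states` vector are all hypothetical. The dataset used below already ships with `Time_Period` and `Disaster_Affected`.

```r
# Hypothetical raw panel: one row per state and year (illustrative values only)
raw <- data.frame(
  state = c("State A", "State A", "State B", "State B"),
  year  = c(2004, 2006, 2004, 2006)
)

# Assumption for illustration: which states count as hurricane-affected
affected_states <- c("State A")

# 1st margin: 0 = before the 2005 season, 1 = after it
raw$Time_Period <- as.integer(raw$year > 2005)

# 2nd margin: 0 = control state, 1 = treated (affected) state
raw$Disaster_Affected <- as.integer(raw$state %in% affected_states)

print(raw)
```

The interaction `Time_Period * Disaster_Affected` then equals 1 only for treated states in the post period, which is the cell the DiD coefficient isolates.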

Loading Library & Importing Data

# Load libraries
library(stargazer)  # formatted summary statistics
library(tidyverse)  # data manipulation (also attaches dplyr)

# Import data
df <- read.csv("us_fred_coastal_us_states_avg_hpi_before_after_2005.csv")

# Print summary statistics for the numeric columns
stargazer(df, type = "text")


================================================
Statistic         N  Mean  St. Dev.  Min    Max 
------------------------------------------------
HPI_CHG           48 0.022  0.017   -0.006 0.061
Time_Period       48 0.500  0.505     0      1  
Disaster_Affected 48 0.208  0.410     0      1  
NUM_DISASTERS     48 3.208  2.143     1     10  
NUM_IND_ASSIST    48 8.583  14.946    0     55  
------------------------------------------------

Observation counts by pre/post and treatment/control groups

# Create a two-way table with labels

raw_table <- table(Time_Period       = ifelse(test = df$Time_Period == 0,       yes = "Pre",     no = "Post"     ),
                   Treatment_Status  = ifelse(test = df$Disaster_Affected == 0, yes = "Control", no = "Treatment")
                   )

raw_table

           Treatment_Status
Time_Period Control Treatment
       Post      19         5
       Pre       19         5

Estimating Equation: \(\small HPI\_CHG_{st} = \beta_0 + \beta_1 \ Time\_Period_t + \beta_2 \ Disaster\_Affected_s + \beta_3 \ Time\_Period_t * Disaster\_Affected_s + \epsilon_{st}\)
Implement in R: regress HPI_CHG on Time_Period * Disaster_Affected with lm() and print the model summary


Call:
lm(formula = HPI_CHG ~ Time_Period * Disaster_Affected, data = df)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.023081 -0.007610 -0.000171  0.004656  0.035981 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    0.037090   0.002819  13.157  < 2e-16 ***
Time_Period                   -0.027847   0.003987  -6.985  1.2e-08 ***
Disaster_Affected             -0.013944   0.006176  -2.258   0.0290 *  
Time_Period:Disaster_Affected  0.019739   0.008734   2.260   0.0288 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01229 on 44 degrees of freedom
Multiple R-squared:  0.5356,    Adjusted R-squared:  0.504 
F-statistic: 16.92 on 3 and 44 DF,  p-value: 1.882e-07

DiD estimate

Time_Period:Disaster_Affected 
                   0.01973946

Average change in the house price index (HPI_CHG) by pre/post and treatment/control groups

mean_table <- tapply(X     = df$HPI_CHG, 
                     INDEX = list(Time_Period      = ifelse(df$Time_Period == 0,       "Pre",     "Post"),
                                  Treatment_Status = ifelse(df$Disaster_Affected == 0, "Control", "Treatment") ),
                     FUN   = mean )

# Display the table
print(mean_table)

           Treatment_Status
Time_Period     Control  Treatment
       Post 0.009242792 0.01503835
       Pre  0.037090020 0.02314612

Calculate DiD effect

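# (treatment - control) after the hurricane, minus (treatment - control) before it
DiD_effect <- ( mean_table["Post", "Treatment"] - mean_table["Post", "Control"] ) -
              ( mean_table["Pre",  "Treatment"] - mean_table["Pre",  "Control"] )
print(DiD_effect)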

[1] 0.01973946

Same as \(\beta_3\) of estimating equation!
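
The four reported cell means pin down every OLS coefficient, not only \(\beta_3\). The sketch below rebuilds a stand-in dataset from the cell counts and means printed above, with every observation set to its cell mean, and refits the model. This is an assumption-laden illustration: `cells` and `standin` are hypothetical names, the stand-in is not the original data, and any standard errors from it are meaningless.

```r
# Reported cell means and counts, copied from the tables above
cells <- data.frame(
  Time_Period       = c(0, 0, 1, 1),
  Disaster_Affected = c(0, 1, 0, 1),
  HPI_CHG           = c(0.037090020, 0.02314612, 0.009242792, 0.01503835),
  n                 = c(19, 5, 19, 5)
)

# Stand-in data: repeat each cell mean n times (NOT the original observations)
standin <- cells[rep(seq_len(nrow(cells)), times = cells$n),
                 c("Time_Period", "Disaster_Affected", "HPI_CHG")]

# Same specification as the regression above; only point estimates are meaningful
fit <- lm(HPI_CHG ~ Time_Period * Disaster_Affected, data = standin)
beta <- coef(fit)

# Difference in differences of the four cell means, computed by hand
did_by_hand <- (0.01503835 - 0.02314612) - (0.009242792 - 0.037090020)

print(round(beta, 6))
print(did_by_hand)
```

The printed coefficients should match the Estimate column of the regression output up to rounding of the printed means, and the interaction coefficient should equal `did_by_hand`, at about 0.0197.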

Tutorial Summary

Summary

The standard two-group, two-period difference-in-differences setup relies on the assumption of parallel trends
- Parallel trends assumes that, in the absence of the intervention, the outcome y would have changed at the same rate in both groups
- Prior to the intervention, y should move in the same direction, and at a similar rate, for both groups
The standard DiD estimator measures the difference between the two groups' changes in y over time
The DiD estimate is equivalent to the coefficient on the interaction of the treatment and time dummies, i.e. the post-minus-pre change in y for the treatment group minus the same change for the control group
- If the parallel trends assumption is violated, we cannot be sure whether the DiD estimator is identifying the effect of the policy or simply some other unaccounted factor causing different trends between these groups

Extensions

DiD with covariates
- Leverage available information about observed characteristics (e.g. covariate-specific trends), or absorb unobserved heterogeneity with fixed effects
DiD with multiple periods
- Estimating the treatment effect over multiple time periods rather than just before and after the intervention
DiD with variation in treatment timing (Staggered DiD²)
- Nurse Licensure Compact and Mobility, “Equal Pay” Laws, …
Triple Difference (DiDiD)
- Cycling to School: Increasing Secondary School Enrollment for Girls in India, …
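
As an illustration of the multiple-periods extension, the sketch below estimates one treatment-by-period interaction per period relative to the last pre-treatment period. It assumes a common treatment date for all treated units (not staggered timing) and runs on simulated data. The data frame `panel`, its columns and all parameter values are hypothetical.

```r
set.seed(42)

# Simulated panel: 40 units, 6 periods, units 1-20 treated from period 4 onward
panel <- expand.grid(unit = 1:40, period = 1:6)
panel$treated <- as.integer(panel$unit <= 20)
panel$post    <- as.integer(panel$period >= 4)

# Outcome: unit effect + common time trend + true effect of 2 + noise
unit_effect <- rnorm(40)
panel$y <- unit_effect[panel$unit] + 0.5 * panel$period +
  2 * panel$treated * panel$post + rnorm(nrow(panel), sd = 0.2)

# Use the last pre-treatment period (3) as the reference period
panel$period_f <- relevel(factor(panel$period), ref = "3")

# Group dummy, period dummies, and one interaction per non-reference period
fit <- lm(y ~ treated * period_f, data = panel)

# Period-specific effects relative to period 3
es <- coef(fit)[grep("^treated:period_f", names(coef(fit)))]
print(round(es, 2))
```

In this simulation the interactions for periods 1 and 2 should sit near zero, consistent with parallel pre-trends, while those for periods 4 to 6 should sit near the simulated effect of 2.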

Appendix

References

John Snow Diff-in-Diff Table

Footnotes

¹ Cholera is a vicious disease that attacks victims suddenly, with acute symptoms such as vomiting and diarrhea. In the nineteenth century, it was usually fatal.
² Units are exposed to treatment at different points in time.