Difference-in-Differences

Author

Alex Cory

Published

September 24, 2025

What is Observational Causal Inference?

  • Traditionally used in the social sciences, popularized in tech by Amazon

  • Compares a treatment group to a control group

  • Does NOT require an experiment or random assignment

  • Avoids the ethical concerns of deliberately assigning a potentially harmful treatment

  • Allows for modeling bad outcomes such as a PR issue’s effect on stock prices

  • Allows for retrospective analysis, such as finding the effect of a new feature on a website after it has launched

  • Used when an experiment is not feasible

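Difference-in-Differences

  • Compares the change over time in a treatment group to the change over the same period in an untreated control group

  • Requires the parallel trends assumption: absent the intervention, both groups would have followed the same trend

  • Estimates the effect as the post-intervention gap between the treatment and control groups minus the gap that already existed before the intervention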

Estimation

  • Treatment Change = Post - Pre

  • Control Change = Post - Pre

  • Difference-in-Differences = Treatment Change - Control Change
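
The three steps above can be sketched as a small R helper. The function name `did_from_means` and the four input means are hypothetical, chosen only to illustrate the arithmetic.

```r
# Hypothetical helper: difference-in-differences from four group-period means
did_from_means <- function(treat_pre, treat_post, control_pre, control_post) {
  treatment_change <- treat_post - treat_pre      # Treatment Change = Post - Pre
  control_change   <- control_post - control_pre  # Control Change = Post - Pre
  treatment_change - control_change               # Difference-in-Differences
}

# Illustrative (made-up) means: treatment rises by 80, control rises by 50
example_did <- did_from_means(treat_pre = 50, treat_post = 130,
                              control_pre = 40, control_post = 90)
print(example_did)  # 30
```

The control group's change of 50 stands in for what would have happened to the treatment group without the intervention, so only the remaining 30 is attributed to the treatment.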

# Parameters
n_time <- 100
intervention_time <- 70
control_intercept <- 5
treatment_intercept <- 15
treat_effect <- 30

time <- 1:n_time

noise_control <- runif(n_time, min=-3, max=3)
noise_treatment <- runif(n_time, min=-3, max=3)

control <- control_intercept + time + noise_control
treatment <- treatment_intercept + time + noise_treatment
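treatment[time > intervention_time] <- treatment[time > intervention_time] + treat_effect

# Combine both series into a long-format data frame for the steps below
df <- data.frame(
  time = rep(time, 2),
  group = rep(c("Control", "Treatment"), each = n_time),
  outcome = c(control, treatment)
)
# Indicator for the post-intervention period (1 = after the intervention)
df$post <- as.numeric(df$time > intervention_time)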

Visual examination of a plot of the two series tells us that the pre-intervention trends of the two groups are roughly parallel, as the parallel trends assumption requires.
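
One way to back up the visual check is to plot both series and compare their pre-intervention slopes. The sketch below is an assumption about how this could be done, not the original plotting code: it re-simulates data with the same parameters (the seed is arbitrary, so the numbers will differ from those reported in this post), draws the series with base R graphics, and fits a pre-period regression whose `time:groupTreatment` term measures the gap in slopes.

```r
set.seed(1)  # arbitrary seed, assumed here for reproducibility only

# Re-create the simulated series with the parameters used above
n_time <- 100
intervention_time <- 70
time <- 1:n_time
control   <- 5  + time + runif(n_time, min = -3, max = 3)
treatment <- 15 + time + runif(n_time, min = -3, max = 3)
treatment[time > intervention_time] <- treatment[time > intervention_time] + 30

df <- data.frame(
  time    = rep(time, 2),
  group   = rep(c("Control", "Treatment"), each = n_time),
  outcome = c(control, treatment)
)

# Plot both groups and mark the intervention with a dashed vertical line
plot(time, treatment, type = "l", col = "red",
     ylim = range(df$outcome), xlab = "Time", ylab = "Outcome")
lines(time, control, col = "blue")
abline(v = intervention_time, lty = 2)
legend("topleft", legend = c("Treatment", "Control"),
       col = c("red", "blue"), lty = 1)

# Pre-period slope comparison: the interaction term is the difference in slopes
pre <- subset(df, time <= intervention_time)
pre_model <- lm(outcome ~ time * group, data = pre)
slope_gap <- coef(pre_model)[["time:groupTreatment"]]
slope_gap  # should be close to 0 if the pre-trends are parallel
```

A slope gap near zero is consistent with parallel pre-trends, but it cannot prove the assumption, which concerns what would have happened after the intervention.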

# Pre-intervention period (time <= intervention_time)
pre_means <- aggregate(outcome ~ group, data=subset(df, time <= intervention_time), mean)

# Post-intervention period (time > intervention_time)
post_means <- aggregate(outcome ~ group, data=subset(df, time > intervention_time), mean)

# Create matrix of means (rows = group, columns = period)
Y <- matrix(
  c(pre_means$outcome[pre_means$group=="Control"], post_means$outcome[post_means$group=="Control"],
    pre_means$outcome[pre_means$group=="Treatment"], post_means$outcome[post_means$group=="Treatment"]),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Control", "Treatment"), c("Pre", "Post"))
)

Group Means Pre and Post Intervention
            Pre     Post
Control     40.46   90.57
Treatment   50.61   130.3

# Calculate DiD
did_estimate <- (Y["Treatment", "Post"] - Y["Treatment", "Pre"]) -
                (Y["Control", "Post"]   - Y["Control", "Pre"])
did_estimate

[1] 29.582

# Run DiD regression
model <- lm(outcome ~ group * post, data=df)
summary(model)

Call:
lm(formula = outcome ~ group * post, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.186 -12.675  -0.972  12.851  36.388 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)           40.460      2.167  18.672  < 2e-16 ***
groupTreatment        10.149      3.064   3.312   0.0011 ** 
post                  50.107      3.956  12.666  < 2e-16 ***
groupTreatment:post   29.582      5.595   5.287  3.3e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.13 on 196 degrees of freedom
Multiple R-squared:  0.7602,    Adjusted R-squared:  0.7565 
F-statistic: 207.1 on 3 and 196 DF,  p-value: < 2.2e-16
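
In a two-group, two-period specification like this one, the `groupTreatment:post` coefficient reproduces the manual difference of mean changes, and the fitted model also provides a confidence interval. The sketch below illustrates this on freshly simulated data with the same parameters; the seed is an arbitrary assumption, so its numbers will not match the output shown above.

```r
set.seed(1)  # arbitrary seed (assumption); results differ from the output above

n_time <- 100
intervention_time <- 70
time <- 1:n_time
control   <- 5  + time + runif(n_time, min = -3, max = 3)
treatment <- 15 + time + runif(n_time, min = -3, max = 3)
treatment[time > intervention_time] <- treatment[time > intervention_time] + 30

df <- data.frame(
  time    = rep(time, 2),
  group   = rep(c("Control", "Treatment"), each = n_time),
  outcome = c(control, treatment)
)
df$post <- as.numeric(df$time > intervention_time)

# Manual DiD from the four group-period means (columns "0" = pre, "1" = post)
cell_means <- tapply(df$outcome, list(df$group, df$post), mean)
manual_did <- (cell_means["Treatment", "1"] - cell_means["Treatment", "0"]) -
              (cell_means["Control", "1"]   - cell_means["Control", "0"])

# Regression DiD: the interaction coefficient is the estimate
model <- lm(outcome ~ group * post, data = df)
reg_did <- coef(model)[["groupTreatment:post"]]
ci <- confint(model, "groupTreatment:post", level = 0.95)

all.equal(manual_did, reg_did)  # the two estimates agree up to floating point
ci                              # 95% confidence interval for the DiD estimate
```

The main practical gain from the regression form is the standard error and interval that come with the estimate.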

Conclusions

Using the manual calculation, we estimated a treatment effect of 29.58, which is close to the true treatment effect of 30. The regression-based approach gives the same estimate of 29.58 as the coefficient on the groupTreatment:post interaction, and its p-value (3.3e-07) shows that the effect is statistically significant.