Difference-in-Differences Doesn’t Just Work

What comes after the basic DID model

Nick Huntington-Klein

Where we’re starting from

I’m assuming you already know:

2x2 difference-in-differences
TWFE
The parallel trends assumption
Issues with staggered treatment

The 2x2, in one breath

We want: how much more did the treated group change than the untreated group?

\[(\bar{Y}^{T}_{after}-\bar{Y}^{T}_{before}) - (\bar{Y}^{C}_{after}-\bar{Y}^{C}_{before})\]

The untreated group’s change is our guess at the treated group’s counterfactual.
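
The arithmetic is small enough to check by hand. A minimal sketch, assuming four made-up group means (the numbers and the `did_2x2` helper are illustrative, not from any dataset):

```python
# Hypothetical group means, chosen only to illustrate the arithmetic.
means = {
    ("treated", "before"): 10.0,
    ("treated", "after"): 16.0,
    ("control", "before"): 8.0,
    ("control", "after"): 11.0,
}

def did_2x2(m):
    """(treated change) minus (control change)."""
    treated_change = m[("treated", "after")] - m[("treated", "before")]
    control_change = m[("control", "after")] - m[("control", "before")]
    return treated_change - control_change

did_estimate = did_2x2(means)

# The implied counterfactual for the treated group's "after" mean:
# its own starting point plus the control group's change.
counterfactual_after = means[("treated", "before")] + (
    means[("control", "after")] - means[("control", "before")]
)
print(did_estimate, counterfactual_after)
```

The estimate is just the treated group's observed "after" mean minus that implied counterfactual.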

The 2x2 When It’s Not 2x2

The logic behind 2x2 and what we’re trying to do (estimate a counterfactual) is what DID is all about
However, applying that logic and estimating (change in treated group) - (change in counterfactual) gets very tricky very fast!
You’ve already seen some of these issues with staggered treatment: TWFE looks like a natural way of expanding beyond 2x2, but it breaks unexpectedly in that common case

The Design vs. the Estimator

The Issue

DID as a research design is wonderfully flexible. The logic — use the control group’s change as the treated group’s counterfactual — survives almost anything you throw at it.

DID as an estimation method (TWFE + OLS) is fragile. It breaks the moment you vary almost anything.
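
For reference, the TWFE regression meant here is usually written as below. The notation is the standard one and is assumed for this sketch rather than taken from the slides: \(\alpha_i\) are unit fixed effects, \(\lambda_t\) are period fixed effects, and \(D_{it}\) is 1 when unit \(i\) is treated in period \(t\).

\[Y_{it} = \alpha_i + \lambda_t + \beta D_{it} + \varepsilon_{it}\]

With two groups, two periods, and a balanced panel, the OLS estimate of \(\beta\) equals the 2x2 difference of changes. The trouble starts once the setting departs from that case.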

Today: what breaks DID, usually the estimator but sometimes the design, and how can we fix it?

A litany of things that don’t “Just Work”

You’d think these would be harmless. They are not:

Adding control variables → doesn’t Just Work
A binary outcome in logit/probit → doesn’t Just Work
A non-binary (continuous) treatment → doesn’t Just Work
A group-specific time trend to fix prior trends → doesn’t Just Work
Logging a skewed outcome → doesn’t Just Work
Staggered treatment timing in TWFE → doesn’t Just Work

Controls in DID

What are controls even for here?

In ordinary regression, controls close back doors / justify conditional independence.

In DID, the design already handles level differences between groups. So controls are not there to make groups comparable or to handle endogeneity in treatment assignment.

Controls in DID exist to rescue parallel trends:

“I don’t think parallel trends holds raw — but it holds conditional on this variable.”

If that’s not the claim you’re making, you don’t want the control.

Three reasons you’d actually want one

Levels imply trends. Low-income schools were already on a faster score trajectory → gap would have widened anyway.
Composition changes. Treated schools start attracting higher-income students → scores drift up for non-treatment reasons.
More-similar groups, more-similar trends. A sort of colloquial assumption that groups similar on covariates are also similar on trends

So just toss them in the regression?

No. TWFE-with-controls quietly does the wrong thing:

For time-varying controls, it effectively conditions on the change in the covariate — but you wanted the level to shift the trend
For time-fixed controls, the fixed effects already swallow the level — they don’t deliver the trend difference you needed
You also get a bizarre weighted average that over-weights the most unusual covariate groups
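
A toy version of the first problem, as a sketch rather than the simulation behind these slides. All numbers are made up: the true effect is 0, each unit's outcome change is 0.5 times its baseline income, and treated units tend to have higher baseline income. With two periods, TWFE with a time-varying control is the same as regressing the outcome change on treatment and the change in income, which the code does exactly using fractions.

```python
from fractions import Fraction as F

# Hypothetical two-period data, one tuple per cell:
# (treated, baseline_income, income_change, count). True effect is 0:
# each unit's outcome change is 0.5 * baseline_income and nothing else.
cells = [
    (1, 4, 0, 2), (1, 4, 1, 2), (1, 2, 0, 1), (1, 2, 1, 1),
    (0, 4, 0, 1), (0, 4, 1, 1), (0, 2, 0, 2), (0, 2, 1, 2),
]
units = [(d, x, dx) for d, x, dx, n in cells for _ in range(n)]
D = [F(d) for d, x, dx in units]
dX = [F(dx) for d, x, dx in units]
dY = [F(x, 2) for d, x, dx in units]

def mean(v):
    return sum(v) / len(v)

def residualize(v, x):
    """Residuals from an OLS fit of v on a constant and x."""
    mv, mx = mean(v), mean(x)
    cov = sum((a - mx) * (b - mv) for a, b in zip(x, v))
    var = sum((a - mx) ** 2 for a in x)
    slope = cov / var
    return [b - mv - slope * (a - mx) for a, b in zip(x, v)]

# First-difference regression of dY on D and dX; Frisch-Waugh gives
# the coefficient on D after partialling out the income change.
rD, rY = residualize(D, dX), residualize(dY, dX)
twfe_with_control = sum(a * b for a, b in zip(rD, rY)) / sum(a * a for a in rD)

# Conditioning on the baseline level instead: DID within each income stratum.
def did_within(level):
    t = [y for (d, x, _), y in zip(units, dY) if d == 1 and x == level]
    c = [y for (d, x, _), y in zip(units, dY) if d == 0 and x == level]
    return mean(t) - mean(c)

stratified = [did_within(level) for level in (2, 4)]
print(twfe_with_control, stratified)
```

Controlling for the change in income leaves the estimate at 1/3 even though the true effect is 0, while comparing treated and untreated units at the same baseline income recovers 0 in both strata. The dedicated estimators automate and generalize that second comparison.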

Demonstration: TWFE with a control, true effect = 0

Parallel trends violated only through income; true treatment effect baked in at 0; income is a clean confounder (not caused by treatment). 1,000 runs:

The density sits well off the truth — and there’s no post-treatment bias anywhere. It still fails.

Some controls are doomed from the start

Post-treatment bias. Any covariate measured after treatment may itself be caused by treatment.

Effect of Medicaid expansion, controlling for hospital funding
Treated schools attract richer students because of the treatment → control it and you control away the effect

Rule of thumb: use covariates measured at baseline only. Many software packages won’t even let you do otherwise.

Even baseline controls can bite: regression to the mean

Match on a pre-treatment covariate that differs a lot between groups. Say the covariate is low for treated schools and high for untreated schools:

You pick the highest-covariate treated school and the lowest-covariate control school (the closest pair, e.g. B and Y)
Those are unusually-far-from-their-mean draws
Both regress back to the pack → a spurious trend you didn’t have before

Demonstration: matching creates the bias

Average DID over 1,000 sims (true effect = 0): full data ≈ 0.006 vs. matched pair ≈ -0.480. Matching the closest pair manufactures a downward “effect” out of pure regression to the mean.
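
A minimal sketch of the same mechanism, not the code behind the numbers above. The assumptions are mine: each unit's value is a stable group mean plus fresh noise each period, the matching variable is the pre-period value itself, there are 5 units per group, and the group means (0 and 5) are arbitrary. The magnitudes will therefore differ from the slide's; only the direction of the bias is the point.

```python
import random
from statistics import mean

def one_sim(rng, n=5, mu_treated=0.0, mu_control=5.0):
    """One draw: a stable group mean plus fresh noise each period (no effect)."""
    def draw(mu):
        return [(mu + rng.gauss(0, 1), mu + rng.gauss(0, 1)) for _ in range(n)]
    treated, control = draw(mu_treated), draw(mu_control)

    # DID on the full data.
    full = (mean(b - a for a, b in treated)
            - mean(b - a for a, b in control))

    # Keep only the single closest pair on the pre-period value.
    t, c = min(((t, c) for t in treated for c in control),
               key=lambda pair: abs(pair[0][0] - pair[1][0]))
    matched = (t[1] - t[0]) - (c[1] - c[0])
    return full, matched

rng = random.Random(0)
results = [one_sim(rng) for _ in range(1000)]
full_avg = mean(r[0] for r in results)
matched_avg = mean(r[1] for r in results)
print(round(full_avg, 3), round(matched_avg, 3))
```

The matched treated unit was picked for an unusually high draw and the matched control for an unusually low one, so the first falls and the second rises in the next period.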

What actually works for controls

Stick to covariates fixed over time, or measured only at baseline
Use an estimator built for it (the same ones that help with staggered DID):
- Callaway & Sant’Anna (2021), Wooldridge ETWFE (2021), Doubly-robust DID (Sant’Anna & Zhao 2020)
Need to control for something time-varying? Mostly out of luck: it is largely off-limits, and only narrow methods exist

Nonlinear Outcomes & Functional Form

The log trap, revisited

Outcome is skewed (counts, spending, wages) → instinct says “log it”
But if parallel trends holds for \(Y\), it generally does not hold for \(\ln Y\), and vice versa
Logging to fix skew silently changes the assumption you’re betting your paper on
Remember: parallel trends is not just a causal assumption, it’s a functional form assumption!

You must decide on which scale you believe parallel trends, and use that outcome.
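
A quick numerical check, using made-up untreated paths. Both groups gain 10 units, so parallel trends holds exactly in levels and there is no treatment effect anywhere:

```python
import math

# Hypothetical untreated outcomes; parallel in levels (+10 for each group).
control = {"before": 10.0, "after": 20.0}
treated_untreated_path = {"before": 20.0, "after": 30.0}

def did(t, c, transform=lambda y: y):
    """Difference of changes after applying a transform to the outcome."""
    return ((transform(t["after"]) - transform(t["before"]))
            - (transform(c["after"]) - transform(c["before"])))

level_gap = did(treated_untreated_path, control)
log_gap = did(treated_untreated_path, control, math.log)
print(level_gap, round(log_gap, 3))
```

The level DID is 0, but the log DID is ln(1.5) - ln(2), which is negative: the same data that satisfy parallel trends in levels violate it in logs.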

Binary outcomes: logit & probit DID

Outcome is 0/1, so you reach for logit/probit. Reasonable!
But in a nonlinear model the DID effect is not the coefficient on the interaction term
The cross-difference you want \(\ne\) the interaction coefficient (Puhani 2012)
And the definition of parallel trends changes too!
This is a big reason people just run OLS / a linear probability model on binary DID outcomes
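
To see the two scales disagree, here is a sketch with made-up cell probabilities. It relies on one fact: a saturated logit (group, period, and their interaction) fits the four cell proportions exactly, so its interaction coefficient equals the cross-difference in log-odds.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

# Hypothetical cell probabilities P(Y = 1) in each group and period.
p = {("control", "before"): 0.1, ("control", "after"): 0.2,
     ("treated", "before"): 0.5, ("treated", "after"): 0.6}

def cross_difference(f):
    """(treated change) minus (control change) on the scale given by f."""
    return ((f(p[("treated", "after")]) - f(p[("treated", "before")]))
            - (f(p[("control", "after")]) - f(p[("control", "before")])))

# Interaction coefficient of the saturated logit: cross-difference in log-odds.
interaction_coef = cross_difference(logit)
# What a linear probability model would report: cross-difference in probabilities.
probability_did = cross_difference(lambda q: q)
print(round(interaction_coef, 3), round(probability_did, 3))
```

Both groups gain 10 percentage points, so the probability-scale DID is zero, yet the logit interaction is ln(2/3), about -0.41. Which one is "the effect" depends on the scale where you believe parallel trends.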

What to do instead

Be explicit about the scale parallel trends lives on (levels? ratios? odds?)
Compute the average marginal cross-difference, not the raw coefficient
For counts, Poisson TWFE is often well-behaved (and ratio-based parallel trends is natural)
Nonlinear ETWFE (Wooldridge follow-up) handles logit/Poisson with staggered timing. Or just run OLS!
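
Ratio-based parallel trends in miniature, with made-up mean counts. The assumption is that, absent treatment, both groups would have grown by the same factor. In a saturated Poisson regression the interaction coefficient is the log of the ratio of ratios computed here.

```python
import math

# Hypothetical mean counts per cell.
counts = {("control", "before"): 10.0, ("control", "after"): 20.0,
          ("treated", "before"): 20.0, ("treated", "after"): 50.0}

control_growth = counts[("control", "after")] / counts[("control", "before")]
treated_growth = counts[("treated", "after")] / counts[("treated", "before")]

# Counterfactual treated count if it had grown at the control group's rate.
counterfactual_after = counts[("treated", "before")] * control_growth

ratio_of_ratios = treated_growth / control_growth
poisson_style_coef = math.log(ratio_of_ratios)   # effect on the log scale
proportional_effect = ratio_of_ratios - 1        # share above counterfactual
print(counterfactual_after, ratio_of_ratios, round(poisson_style_coef, 3))
```

Here the treated group ends 25% above its counterfactual of 40, which is a different claim from the level DID of 20 that the same four numbers would give.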

Continuous & Multi-Valued Treatment

When treatment is a dose, not a switch

Minimum wage rises by different amounts across states
A tax changes by different percentages
A program reaches different intensities

The design still feels obvious: more dose → more change. Right?

Why dose-DID breaks

You now need parallel trends at every dose level — and a stronger “no selection into dose” condition
Comparing higher-dose to lower-dose units makes already-dosed units into controls
Same disease as staggered timing: negative weights, effects that can flip sign
(de Chaisemartin & D’Haultfœuille; Callaway, Goodman-Bacon & Sant’Anna 2021)

What you can actually recover

A well-defined ATT at a given dose, or a dose-response curve — with care
Be honest about the comparison: “effect of more dose vs. some dose” is not “effect of dose vs. none”
Use the continuous-treatment estimators, and report what is identified, not just a slope
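
The "more dose vs. some dose" point in numbers. This sketch assumes a made-up dose-response schedule, the same effect for every unit, and a common trend shared by all doses, which is the best case for the design:

```python
# Hypothetical dose-response: effect of each dose relative to no dose.
effect_vs_none = {0: 0.0, 1: 3.0, 2: 4.0}
common_trend = 1.5  # change everyone would have seen without treatment

def observed_change(dose):
    """Before-to-after change for a unit receiving the given dose."""
    return common_trend + effect_vs_none[dose]

# With an untreated group, dose 2 has a clean comparison.
att_dose_2 = observed_change(2) - observed_change(0)

# With only dosed units, the comparison is dose 2 vs. dose 1.
more_vs_some = observed_change(2) - observed_change(1)
print(att_dose_2, more_vs_some)
```

Even in this best case the second comparison returns 1, not 4. A slope fit through dosed units alone would describe the step from dose 1 to dose 2, not the effect of treatment.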

Triple Differences (DDD)

The unfurling logic of DID

DID watches how a gap in outcome levels changes from before to after.

But who says it has to be a gap in levels? We can run DID on a gap in almost anything:

a gap in a relationship (a slope / coefficient)
a gap in a quantile (DID + quantile regression)
a gap in a DID itself → difference-in-difference-in-differences

DDD to rescue parallel trends

The big use: find a group that shouldn’t be affected, and subtract its “effect” out.

Marshes vs. parks. A policy funds trash removal from marshes in some prefectures.

Run DID on marshes → you get an effect
Run DID on parks (shouldn’t be affected) → you also get an “effect”?!
That parks “effect” is a parallel-trends violation. Subtract it out: DID(marshes) − DID(parks)
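
The subtraction written out, with made-up mean trash levels for the eight cells (site type by prefecture group by period):

```python
# Hypothetical mean trash levels: (site type, prefecture group, period).
y = {
    ("marsh", "funded", "before"): 50.0, ("marsh", "funded", "after"): 30.0,
    ("marsh", "unfunded", "before"): 48.0, ("marsh", "unfunded", "after"): 44.0,
    ("park", "funded", "before"): 20.0, ("park", "funded", "after"): 14.0,
    ("park", "unfunded", "before"): 22.0, ("park", "unfunded", "after"): 21.0,
}

def did(site):
    """Funded-prefecture change minus unfunded-prefecture change for one site type."""
    funded = y[(site, "funded", "after")] - y[(site, "funded", "before")]
    unfunded = y[(site, "unfunded", "after")] - y[(site, "unfunded", "before")]
    return funded - unfunded

did_marshes = did("marsh")   # policy effect plus any prefecture-level drift
did_parks = did("park")      # drift only, if parks are truly unaffected
ddd = did_marshes - did_parks
print(did_marshes, did_parks, ddd)
```

The triple difference is only as good as the claim that the drift in parks equals the drift marshes would have had, which is a parallel-trends assumption of its own.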

Worked example: Collins & Urban (2014)

Maryland 2008 law: mortgage servicers must report their loan-modification activity. Did it change behavior?

Plain DID (Maryland vs. another state) is scary here — it’s 2008, the Great Recession; foreclosures are exploding everywhere
Lucky break: the law only applied to some servicers (ESRR), not others
Third difference: affected vs. unaffected servicers → subtract out the “financial chaos” common to both

What the triple-difference showed

ESRR servicers modified more loans and foreclosed more — the second effect being the opposite of the policy’s intent.

When You Aren’t Sure Parallel Trends Holds

The pre-trends test problem

We “check” parallel trends with pre-treatment event-study coefficients. But:

The test only sees the past; the assumption is about the counterfactual future
Pre-trends tests have low power — you often can’t detect a violation that still matters
Proceeding only when the pre-test passes distorts your final estimates (Roth 2022; see the pretrends package)

“Passing” the pre-trends test does not mean parallel trends holds.
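
A back-of-the-envelope look at the power problem. This is a simplified sketch, assuming a two-sided z-test on a single pre-period coefficient. Real pre-trend tests are often joint tests across several coefficients, and `pretrend_test_power` is a hypothetical helper, not a function from the pretrends package.

```python
from statistics import NormalDist

def pretrend_test_power(violation_in_se, alpha=0.05):
    """Power of a two-sided z-test on one pre-period coefficient.

    violation_in_se is the true pre-period coefficient divided by its
    standard error (a hypothetical input for this sketch).
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(-crit + violation_in_se) + z.cdf(-crit - violation_in_se)

for size in (0.5, 1.0, 2.0):
    print(size, round(pretrend_test_power(size), 3))
```

A violation as large as one standard error is flagged less than a fifth of the time under these assumptions, so a clean-looking pre-period is weak evidence.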

Honest DID (Rambachan & Roth)

Stop pretending parallel trends is exactly true. Instead, bound how badly it could be violated:

Assume the post-treatment violation is no larger than \(\bar{M}\) times the worst violation you saw pre-treatment
Sweep \(\bar{M}\) from 0 upward → get a range of possible effects (partial identification)
Report the breakdown value: how big a violation it takes to kill your result
HonestDiD in R / Stata
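
The bounding logic can be sketched in a few lines. This is a deliberately stripped-down illustration, not the HonestDiD API: it uses made-up event-study coefficients, looks only at the first post-treatment period, and ignores sampling uncertainty entirely, which the real package accounts for.

```python
# Hypothetical event-study estimates, relative to the period before treatment.
pre_coefs = [0.004, -0.002, 0.003, 0.0]   # last entry is the reference period
post_coef = 0.030                          # first post-treatment estimate

# Largest period-to-period movement seen before treatment.
max_pre_violation = max(
    abs(b - a) for a, b in zip(pre_coefs, pre_coefs[1:])
)

def identified_set(m_bar):
    """Bounds on the first-period effect if the post-treatment violation
    is at most m_bar times the largest pre-treatment one."""
    slack = m_bar * max_pre_violation
    return post_coef - slack, post_coef + slack

# Smallest m_bar at which the lower bound reaches zero.
breakdown = post_coef / max_pre_violation
for m_bar in (0.5, 1.0, 2.0):
    print(m_bar, identified_set(m_bar))
print("breakdown M-bar:", breakdown)
```

Sweeping `m_bar` widens the set linearly; the package's robust confidence intervals are wider still because they also reflect estimation error.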

What the output looks like

Medicaid-expansion event study (5 pre-periods, 2 post). Original estimate is a significant positive effect. Relax parallel trends by \(\bar{M}\times\) the max pre-period violation:

delta_rm_results
## # A tibble: 4 × 5
##         lb      ub method Delta    Mbar
##      <dbl>   <dbl> <chr>  <chr>   <dbl>
## 1  0.0241  0.0673  C-LF   DeltaRM   0.5
## 2  0.0171  0.0720  C-LF   DeltaRM   1
## 3  0.00859 0.0796  C-LF   DeltaRM   1.5
## 4 -0.00107 0.0883  C-LF   DeltaRM   2     ← interval finally includes 0

The sensitivity plot

Original CI (left) excludes 0; robust intervals stay above 0 until \(\bar{M}=2\).

Reading the result

At \(\bar{M}=0.5,\,1,\,1.5\) the robust interval still excludes 0 — effect survives
At \(\bar{M}=2\) it just barely includes 0, so the breakdown value lies between 1.5 and 2
“the significant effect holds up even if parallel-trends violations after treatment are up to 1.5 times as large as the worst violation we saw before treatment, and only just fails at twice as large.”
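
Reading the breakdown point off the table can be done mechanically. The rows below are copied from the delta_rm_results output; the variable names are mine.

```python
# Rows copied from the delta_rm_results output: (Mbar, lb, ub).
rows = [
    (0.5, 0.0241, 0.0673),
    (1.0, 0.0171, 0.0720),
    (1.5, 0.00859, 0.0796),
    (2.0, -0.00107, 0.0883),
]

excludes_zero = [m for m, lb, ub in rows if lb > 0 or ub < 0]
includes_zero = [m for m, lb, ub in rows if lb <= 0 <= ub]

largest_surviving = max(excludes_zero)   # last grid point where the effect holds
first_failing = min(includes_zero)       # first grid point where it does not
print(largest_surviving, first_failing)
```

The grid only brackets the breakdown value; a finer sweep of \(\bar{M}\) between the two printed values would pin it down.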

Wrapping Up

The modern toolbox at a glance

Problem	Reach for
Controls done right	Callaway-Sant’Anna, ETWFE, doubly-robust
Binary / count outcome	LPM, Poisson, nonlinear ETWFE
Continuous / dose treatment	de Chaisemartin-D’Haultfœuille; Callaway, Goodman-Bacon & Sant’Anna
DID on a placebo group	Triple differences (DDD)
Unsure about parallel trends	HonestDiD sensitivity analysis

Honorable mentions we skipped

Staggered timing — TWFE’s biggest failure; already covered in its own lecture (Goodman-Bacon, Callaway-Sant’Anna, Sun-Abraham, ETWFE, BJS imputation)
Synthetic control / Synthetic DID (Arkhangelsky et al. 2021) — build the counterfactual from prior outcomes
IV-DID — when treatment “turning on” is itself instrumented; the same fragility compounds
Inference with few clusters (Bertrand, Duflo & Mullainathan 2004) — your standard errors are probably too small

The meta-lesson

Please don’t leave thinking “the old way is broken but this new way Just Works.”

Nothing Just Works in DID. Every change in the setting or data probably requires a change of estimator. That estimator usually exists but you need to find it!

The meta-lesson

For any DID that isn’t plain-OLS-no-covariates-single-period, ask:

What exactly does parallel trends mean here? (scale, transform, covariates, estimator)
Do I believe that version?
Have I shown results with and without controls, and re-checked prior trends after adjusting?

It can work. It just won’t Just Work.

Where to read more

Controls in DID explainer — nickchk.substack.com/p/controls-in-difference-in-differences
The Effect, Ch. 18 (Difference-in-Differences) — theeffectbook.net/ch-DifferenceinDifference.html
Roth, Sant’Anna, Bilinski & Poe (2023) — “What’s Trending in Difference-in-Differences?”
Package homes: did, etwfe, DRDID, HonestDiD, pretrends