I’ve looked at the code from various groups in the WAR/moderation
debate, wondering why they get different results. The answer is clear:
it’s because no one is running the regression correctly.
Currently, the three main parties in the debate are:
- Lakshya Jain from Split Ticket. See his original
post in the debate and a later
response.
- Adam Bonica and Jake Grumbach from Stanford and UC Berkeley,
respectively. See their response to Jain here.
I’ll refer to them as BG for convenience.
- G. Elliot Morris, who writes Strength in Numbers, along with Mark
Rieke. He has two posts here
and here,
where the bulk of the argument is awkwardly behind a paywall.
Two groups have released code, which I have used to check the
analyses (thank you!):
You can also download my code for this article: just click the
dropdown by the title.
All of these groups have been analysing the effect of moderation
using an invalid regression approach. The approach has two steps:
- Run a regression predicting candidates’ election vote shares.
Crucially, this regression does not include ideology as a
predictor.
(This also describes Morris/Rieke’s Bayesian model,
which is essentially a regression.)
- Label the residuals from the first regression “WAR” (wins above
replacement), and run another regression that predicts WAR using
ideology.
Statistically, this is not recommended. To measure the effect of a
variable, we nearly always run one regression. Even in rarer
cases where we fit multiple regression equations, the regressions are
fit jointly. This two-step procedure is not a standard method in
statistics. And the nonstandard approach has caused a big problem: it
leads all the results to be contaminated by omitted variable bias.
The big mistake is in the first step: everyone runs that regression
but leaves out ideology. When you do this, the other predictors that are
correlated with ideology soak up part of the effect of ideology into
their own coefficients (omitted variable bias). Then, when the second
regression is fit, much of the effect of ideology has already been
removed.
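To make the mechanism concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the data are synthetic, `control` stands in for any first-step predictor that is correlated with ideology, and the true ideology effect is set to 1.0. It is not the code released by any of the groups.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_effect = 1.0  # assumed effect of ideology on vote share

# Synthetic data: one first-step predictor that is correlated with ideology
control = rng.normal(size=n)
ideology = 0.8 * control + 0.6 * rng.normal(size=n)
vote = 2.0 * control + true_effect * ideology + rng.normal(size=n)

def ols(y, *cols):
    """Least-squares fit of y on an intercept plus cols; returns (coefficients, residuals)."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef

# Two-step: residualize vote on the control only, then regress the residual ("WAR") on ideology
_, war = ols(vote, control)
two_step = ols(war, ideology)[0][1]

# One-step: ideology enters the same regression as the control
one_step = ols(vote, control, ideology)[0][2]

print(f"two-step estimate: {two_step:.2f}")
print(f"one-step estimate: {one_step:.2f}")
```

In this setup the one-step estimate lands near the assumed effect, while the two-step estimate is much smaller, because `control` has already absorbed most of what ideology explains.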
Why do Bonica and Grumbach find that ideology has no effect? It’s
because they included many more predictors that are correlated with
ideology, and those predictors soaked up more of the effect of ideology.
The same also applies to Morris/Rieke’s analysis, because they also
include more predictors. Lakshya’s original WAR model includes
fewer predictors, so the omitted variable bias is less severe in his
analysis.
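The size of the shrinkage follows from standard regression algebra (the Frisch-Waugh-Lovell theorem). As a sketch, write $y$ for vote share, $z$ for ideology, $X$ for the first-step predictors including an intercept, $M_X = I - X(X^\top X)^{-1}X^\top$ for the matrix that residualizes on $X$, and $\tilde z$ for demeaned ideology. This notation is introduced here for illustration, and it assumes both steps are plain least squares, so it applies only approximately to a Bayesian model:

$$
\hat\beta_{\text{one-step}} = \frac{z^\top M_X\, y}{z^\top M_X\, z},
\qquad
\hat\beta_{\text{two-step}} = \frac{z^\top M_X\, y}{\tilde z^\top \tilde z}
= \left(1 - R^2_{z \mid X}\right)\hat\beta_{\text{one-step}},
$$

where $R^2_{z \mid X}$ is the $R^2$ from regressing ideology on the first-step predictors. Every added predictor that helps predict ideology raises $R^2_{z \mid X}$ and pulls the two-step estimate further toward zero.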
The goal is to find the effect of ideology on vote share. So,
fundamentally, we should run one regression that predicts vote
share, using ideology as a predictor. Then we can check the coefficient
on ideology to find its effect on the vote share. It is incorrect to use
WAR for this!
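As a rough sketch of what the one-step analysis involves, the snippet below fits a single regression and reports the ideology coefficient with a 95% interval. The data are synthetic and the variable names (`partisanship`, `ideology`, `vote_share`) are hypothetical. The standard errors are the classical homoskedastic ones with a normal approximation, which is an assumption made here for brevity and may differ from what a real analysis should use.

```python
import numpy as np

def ols_with_ci(y, X, z=1.96):
    """OLS with an intercept; returns coefficients and normal-approximation 95% bounds."""
    X = np.column_stack([np.ones(len(y)), X])
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)          # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(xtx_inv))   # classical standard errors
    return beta, beta - z * se, beta + z * se

# Synthetic example data (all numbers are made up)
rng = np.random.default_rng(1)
n = 2000
partisanship = rng.normal(size=n)
ideology = 0.5 * partisanship + rng.normal(size=n)
vote_share = 50 + 5 * partisanship + 1.0 * ideology + rng.normal(scale=3, size=n)

beta, lo, hi = ols_with_ci(vote_share, np.column_stack([partisanship, ideology]))
print(f"ideology coefficient: {beta[2]:.2f}, 95% CI [{lo[2]:.2f}, {hi[2]:.2f}]")
```

The coefficient at index 2 is the ideology effect, holding the other predictor fixed; no residuals or WAR values are involved.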
Below I show the results from both the BG and Split Ticket versions
of the analysis, comparing the two-step analysis (incorrect) to the
correct one-step estimate. The error bars are the 95% confidence
interval for the coefficient:

We can see that, when done correctly, BG’s results are consistent
with Lakshya’s results. Both BG’s and Lakshya’s approaches find that
moderation increases vote share.
We can also see that, when done correctly, the effect of moderation
is detectable (statistically significant). This directly contradicts
Elliot Morris, who has insisted that the confidence interval contains
both large negative and large positive effects. When the analysis is
done correctly, the effect of moderation is not hugely
uncertain. Instead we can see that it is highly likely to increase vote
share.
To get a sense of how large this effect is in practice, I’m plotting
the ideology scores of the Democrats below (hover to see the names):
So we’re seeing that, with both versions of this model, moving from a
relatively extreme to a moderate Democrat (an increase of 1 to 1.5 on the
ideology scale) would increase the vote share by around 1 percentage
point (which is a 2-point change in the vote differential).
To some extent we can blame Lakshya for this two-step problem, since
he introduced the two-step procedure that others then copied. But the
results show that it wasn’t much of a problem for his analysis anyway,
and only became serious when other people added more predictors.
So what about WAR?
There’s a broader takeaway here: if you include many predictors in
your WAR model, you are likely to get inaccurate WAR numbers. The WAR
you get will be missing many aspects of candidate quality, due to
omitted variable bias. It’s not just the moderation results; the WAR
values themselves are likely to be messed up! Only Lakshya’s WAR seems
reliable, due to having few predictors.
So this suggests the two new versions of WAR
(BG and Strength in Numbers) cannot be trusted. To make
progress over Lakshya’s original numbers, WAR modelers need to think
much more carefully about how to measure candidate quality, and what
attributes count as candidate quality in the first place. Throwing in
more predictors can easily make WAR worse, not better.
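A small simulation can illustrate how extra predictors strip candidate quality out of WAR. This is a sketch under strong assumptions: `quality` is a made-up latent variable, `lean` is a hypothetical district-partisanship control unrelated to quality, and `proxies` are ten hypothetical predictors that are partly driven by quality. None of this reproduces any group's actual model.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
quality = rng.normal(size=n)   # hypothetical true candidate quality
lean = rng.normal(size=n)      # hypothetical district lean, unrelated to quality
# Ten hypothetical predictors, each partly driven by candidate quality
proxies = 0.7 * quality[:, None] + rng.normal(size=(n, 10))
vote = 3.0 * lean + quality + rng.normal(size=n)

def war(k):
    """Residual ("WAR") from regressing vote on an intercept, lean, and the first k proxies."""
    X = np.column_stack([np.ones(n), lean, proxies[:, :k]])
    coef, *_ = np.linalg.lstsq(X, vote, rcond=None)
    return vote - X @ coef

# How well does WAR track true quality as quality-related predictors are added?
corr_with_quality = {k: np.corrcoef(war(k), quality)[0, 1] for k in (0, 2, 10)}
for k, r in corr_with_quality.items():
    print(f"{k:2d} quality-related predictors: corr(WAR, quality) = {r:.2f}")
```

In this setup the correlation between WAR and true quality falls as more quality-related predictors enter the first regression, since each one absorbs part of the signal that WAR is supposed to capture.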
Other issues
That’s not the only statistical issue plaguing this debate. Here are
a few more that have only been addressed inconsistently.
Spillover effects: Imagine a district in Ohio.
Without changing anything within the district, we swap out all other
Democratic politicians across the country with Alexandria Ocasio-Cortez.
Would that change the votes in Ohio? Of course it would!
This shows that candidate moderation has spillover effects outside of
the candidate’s district, and these spillover effects are likely to be
large. Currently no one in this discussion is attempting to measure
these spillover effects. That means all of these estimates should be
considered lower bound estimates of the total impact of
moderation.
Inaccuracy of ideology scores: I think many people
in this debate are far too credulous of the ideology scores coming from
political science. DW-NOMINATE, the most widely used ideology score in
political science, incorrectly
marks the squad as moderate Democrats. There are also questions
about the accuracy of another popular ideology score, the CFscores
(which are derived from campaign donations), especially in recent
elections. BG, Morris, and I (in this article) all use Bonica’s
“composite scores”, which are heavily
dependent on DW-NOMINATE and various CFscores. Generally, if these
scores are inaccurate, we will underestimate the effect of
moderation.
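The underestimation claim matches the textbook attenuation result for classical measurement error, which the sketch below illustrates. It assumes the scoring error is pure random noise, independent of both true ideology and vote share. Systematic errors, such as consistently mis-scoring one group of members, need not behave this way. All data and numbers are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
true_effect = 1.0
ideology = rng.normal(size=n)                         # hypothetical true ideology
vote = true_effect * ideology + rng.normal(size=n)

def slope(x, y):
    """Slope from a simple least-squares regression of y on x (with intercept)."""
    x_c = x - x.mean()
    return x_c @ (y - y.mean()) / (x_c @ x_c)

slopes = {}
for noise_sd in (0.0, 0.5, 1.0):
    measured = ideology + noise_sd * rng.normal(size=n)   # noisy ideology score
    slopes[noise_sd] = slope(measured, vote)
    print(f"score noise sd {noise_sd}: estimated effect = {slopes[noise_sd]:.2f}")
```

With unit-variance true ideology, the expected slope is the true effect divided by one plus the noise variance, so the noisier the score, the smaller the estimated effect.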
The scores I find most credible are GGUM and text-based
scores. But these aren’t widely available at the moment (GGUM takes
days just to estimate a single Congress *sigh*). GGUM gives the original
squad members the most extreme left scores in Congress, while you can
see above that Bonica’s composite scores do not put the squad farthest
left.
Lakshya’s approach of comparing groups/caucuses is a good robustness
check that should work even if the ideology scores are inaccurate.
Causal inference: Some other predictors in these
regressions may themselves be caused in part by moderation. For example,
if moderation causes incumbency because moderates are more likely to get
elected, then the causal impact of moderation is higher than
what we estimated here. On the other hand, if nominating moderates
causes Republicans to nominate more moderate candidates in response,
then the causal impact of moderation on vote share is lower
than what we estimated here. (Though I would consider more moderate
Republicans to be a positive side effect!) After accounting for this,
the result we estimated could be meaningfully different from the full
causal effect, and it’s not clear whether the causal effect is higher or
lower.
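The incumbency case can be sketched with a toy simulation. The effect sizes are invented, and incumbency is treated as a continuous variable purely to keep the example short. The point is only the direction of the gap between the controlled estimate and the total effect.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
direct = 1.0            # assumed direct effect of moderation on vote share
incumbency_effect = 2.0 # assumed effect of incumbency on vote share

moderation = rng.normal(size=n)
incumbency = 0.5 * moderation + rng.normal(size=n)   # moderation partly causes incumbency
vote = direct * moderation + incumbency_effect * incumbency + rng.normal(size=n)

def coefs(y, *cols):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

controlled = coefs(vote, moderation, incumbency)[1]  # incumbency held fixed
total = coefs(vote, moderation)[1]                   # incumbency allowed to respond

print(f"moderation effect, controlling for incumbency: {controlled:.2f}")
print(f"moderation effect, total: {total:.2f}")
```

Here the regression that controls for incumbency recovers only the direct effect, while the total effect also includes the part that flows through incumbency (0.5 × 2.0 in this setup). The Republican-response channel would push in the opposite direction and is not modeled here.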
