I’ve looked at the code from various groups in the WAR/moderation debate, wondering why they get different results. The answer is clear: it’s because no one is running the regression correctly.

Currently, the three main parties in the debate are

Two groups have released code, which I have used to check the analyses (thank you!):

You can also download my code for this article: just click the dropdown by the title.

All of these groups have been analysing the effect of moderation using an invalid regression approach. The approach has two steps:

  1. Run a regression predicting candidates’ election vote shares. Crucially, this regression does not include ideology as a predictor. (This also describes Morris/Rieke’s Bayesian model, which is essentially a regression.)¹
  2. Label the residuals from the first regression “WAR” (wins above replacement), and run another regression that predicts WAR using ideology.

Statistically, this is not recommended. To measure the effect of a variable, we nearly always run one regression. Even in rarer cases where we fit multiple regression equations, the regressions are fit jointly. This two-step procedure is not a standard method in statistics. And the nonstandard approach has caused a big problem: it leads all the results to be contaminated by omitted variable bias.

The big mistake is in the first step: everyone runs that regression with ideology left out. When you do this, the other predictors that are correlated with ideology soak up part of the effect of ideology into their own coefficients (omitted variable bias). By the time the second regression is fit, much of the effect of ideology has already been removed from the residuals.
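Here is a minimal simulation sketch of that problem. Everything in it is made up for illustration: `partisanship` stands in for any first-step predictor that is correlated with ideology, the coefficients are arbitrary, and plain least squares is used for both steps. It is not the code from any of the groups.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: one first-step predictor that is correlated with ideology.
partisanship = rng.normal(size=n)
ideology = 0.8 * partisanship + 0.6 * rng.normal(size=n)
true_effect = 1.0  # assumed true effect of ideology on vote share
vote_share = 2.0 * partisanship + true_effect * ideology + rng.normal(size=n)


def ols(y, *columns):
    """Least-squares fit with an intercept; returns (coefficients, residuals)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef


# Two-step: regress vote share on the other predictor, call the residuals
# "WAR", then regress WAR on ideology.
_, war = ols(vote_share, partisanship)
two_step = ols(war, ideology)[0][1]

# One-step: ideology goes into the same regression as everything else.
one_step = ols(vote_share, partisanship, ideology)[0][2]

# Share of ideology's variance that the first-step predictor does NOT explain.
_, ideology_resid = ols(ideology, partisanship)
unexplained_share = ideology_resid.var() / ideology.var()

print(f"one-step estimate: {one_step:.3f}")
print(f"two-step estimate: {two_step:.3f}")
print(f"one-step x unexplained share: {one_step * unexplained_share:.3f}")
```

In this setup the one-step estimate lands near the assumed true effect, while the two-step estimate is shrunk toward zero by exactly the share of ideology's variance that the first-step predictor already explains.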

Why do Bonica and Grumbach find that ideology has no effect? It’s because they included many more predictors that are correlated with ideology, and those predictors soaked up more of the effect of ideology. The same also applies to Morris/Rieke’s analysis, because they also include more predictors. Lakshya’s original WAR model includes fewer predictors, so the omitted variable bias is less severe in his analysis.
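The size of the shrinkage can be written down directly. This is a sketch that assumes both steps are ordinary least squares and that the first step includes an intercept (so it does not cover glmnet or a Bayesian model exactly). The notation is mine: y is vote share, X holds the first-step predictors, and z is ideology.

```latex
% y: vote share, X: first-step predictors (including an intercept), z: ideology
% M_X = I - X (X^\top X)^{-1} X^\top turns a variable into its residuals on X
\begin{align*}
\text{WAR} &= M_X y \\
\hat\beta_{\text{two-step}}
  &= \frac{(z - \bar z)^\top M_X y}{\sum_i (z_i - \bar z)^2}
   = \frac{z^\top M_X y}{\sum_i (z_i - \bar z)^2} \\
\hat\beta_{\text{one-step}}
  &= \frac{z^\top M_X y}{z^\top M_X z}
  \quad \text{(Frisch-Waugh-Lovell theorem)} \\
\hat\beta_{\text{two-step}}
  &= \hat\beta_{\text{one-step}} \cdot \frac{z^\top M_X z}{\sum_i (z_i - \bar z)^2}
   = \hat\beta_{\text{one-step}} \left(1 - R^2_{z \mid X}\right)
\end{align*}
```

Here R²(z|X) is the R-squared from regressing ideology on the first-step predictors. Adding predictors can only raise that R-squared, so under these assumptions every extra first-step predictor pushes the two-step estimate closer to zero.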

The goal is to find the effect of ideology on vote share. So, fundamentally, we should run one regression that predicts vote share, using ideology as a predictor. Then we can check the coefficient on ideology to find its effect on the vote share. It is incorrect to use WAR for this!

Below I show the results from both the BG and Split Ticket versions of the analysis, comparing the two-step analysis (incorrect) to the correct one-step estimate.² The error bars are the 95% confidence interval for the coefficient:

We can see that, when done correctly, BG’s results are consistent with Lakshya’s results. Both BG’s and Lakshya’s approach find that moderation increases vote share.

We can also see that, when done correctly, the effect of moderation is detectable (statistically significant). This directly contradicts Elliott Morris, who has insisted that the confidence interval contains both large negative and large positive effects. When the analysis is done correctly, the effect of moderation is not hugely uncertain. Instead we can see that it is highly likely to increase vote share.

To get a sense of how large this effect is in practice, I’m plotting the ideology scores of the Democrats below (hover to see the names):

So we’re seeing that, with both versions of this model, moving from a relatively extreme to a moderate Democrat (an increase of 1 to 1.5 on the ideology scale) would increase the vote share by around 1 percentage point. In a two-party race that is a 2-point swing in the margin, since every point the Democrat gains is a point the opponent loses.

To some extent we can blame Lakshya for this two-step problem, since he introduced the two-step procedure that others then copied. But the results show that it wasn’t much of a problem for his analysis anyway, and only became serious when other people added more predictors.

So what about WAR?

There’s a broader takeaway here: if you include many predictors in your WAR model, you are likely to get inaccurate WAR numbers. The WAR you get will be missing many aspects of candidate quality, due to omitted variable bias. It’s not just the moderation results; the WAR values themselves are likely to be messed up! Only Lakshya’s WAR seems reliable, due to having few predictors.

So this suggests the two new versions of WAR (BG and Strength in Numbers) cannot be trusted. To make progress over Lakshya’s original numbers, WAR modelers need to think much more carefully about how to measure candidate quality, and what attributes count as candidate quality in the first place. Throwing in more predictors can easily make WAR worse, not better.

Other issues

That’s not the only statistical issue plaguing this debate. Here are a few more that have only been addressed inconsistently.

Spillover effects: Imagine a district in Ohio. Without changing anything within the district, we swap out all other Democratic politicians across the country with Alexandria Ocasio-Cortez. Would that change the votes in Ohio? Of course it would!

This shows that candidate moderation has spillover effects outside of the candidate’s district, and these spillover effects are likely to be large. Currently no one in this discussion is attempting to measure these spillover effects. That means all of these estimates should be considered lower bound estimates of the total impact of moderation.

Inaccuracy of ideology scores: I think many people in this debate are far too credulous of the ideology scores coming from political science. DW-NOMINATE, the most widely used ideology score in political science, incorrectly marks the squad as moderate Democrats. There are also questions about the accuracy of another popular ideology score, the CFscores (which are derived from campaign donations), especially in recent elections. BG, Morris, and I (in this article) all use Bonica’s “composite scores”, which are heavily dependent on DW-NOMINATE and various CFscores. Generally, if these scores are inaccurate, we will underestimate the effect of moderation.
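To see why noisy scores lead to underestimates, here is a small simulation sketch. It assumes the simplest case, where the published score is the true ideology plus random noise that is unrelated to everything else. The variable names and numbers are invented, and real ideology scores may be wrong in more systematic ways than this.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Hypothetical setup: the "true" ideology drives vote share, but we only
# observe a score with as much random noise as signal.
true_ideology = rng.normal(size=n)
noisy_score = true_ideology + rng.normal(size=n)
true_effect = 1.0  # assumed true effect of ideology on vote share
vote_share = true_effect * true_ideology + rng.normal(size=n)


def slope(x, y):
    """Slope of a simple least-squares line of y on x."""
    return np.polyfit(x, y, 1)[0]


slope_true = slope(true_ideology, vote_share)
slope_noisy = slope(noisy_score, vote_share)

print(f"slope using true ideology: {slope_true:.3f}")
print(f"slope using noisy score:   {slope_noisy:.3f}")
```

With equal parts signal and noise in the score, the estimated effect comes out at roughly half the assumed true effect. The noisier the score, the more the estimate shrinks toward zero.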

The scores I find most credible are GGUM and text-based scores. But these aren’t widely available at the moment (GGUM takes days just to estimate a single congress *sigh*). GGUM gives the original squad members the most extreme left scores in congress, while you can see above that Bonica’s composite scores do not put the squad farthest left.

Lakshya’s approach of comparing groups/caucuses is a good robustness check that should work even if the ideology scores are inaccurate.

Causal inference: Some other predictors in these regressions may themselves be caused in part by moderation. For example, if moderation causes incumbency because moderates are more likely to get elected, then the causal impact of moderation is higher than what we estimated here. On the other hand, if nominating moderates causes Republicans to nominate more moderate candidates in response, then the causal impact of moderation on vote share is lower than what we estimated here. (Though I would consider more moderate Republicans to be a positive side effect!) After accounting for this, the result we estimated could be meaningfully different than the full causal effect, and it’s not clear if the causal effect is higher or lower.
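The incumbency case can be sketched with a toy simulation. It is purely illustrative: the coefficients are invented, and incumbency is treated as a continuous variable that is partly caused by moderation, which is a simplification.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Hypothetical causal chain: moderation raises incumbency, and both raise
# vote share. Incumbency is continuous here only to keep the numbers simple.
moderation = rng.normal(size=n)
incumbency = 0.5 * moderation + rng.normal(size=n)
vote_share = 1.0 * moderation + 2.0 * incumbency + rng.normal(size=n)


def ols_coefs(y, *columns):
    """Least-squares coefficients with an intercept in position 0."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Controlling for incumbency isolates only the direct path from moderation.
controlled = ols_coefs(vote_share, moderation, incumbency)[1]

# Leaving incumbency out picks up the direct path plus the path through it.
total = ols_coefs(vote_share, moderation)[1]

print(f"moderation coefficient, controlling for incumbency: {controlled:.3f}")
print(f"moderation coefficient, incumbency left out:        {total:.3f}")
```

In this toy world the controlled coefficient recovers only the direct effect (about 1.0), while the total effect of moderation is about 1.0 + 2.0 × 0.5 = 2.0. Which of the two is the right target depends on the question being asked, and the Republican-response story would push in the opposite direction.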


  1. I took a closer look and realized Morris and Rieke use a different type of model, so this criticism may not apply to their model.↩︎

  2. I didn’t run the Strength in Numbers model because they don’t provide the data files needed to reproduce it. I made a few other simplifications to make this easy to run over the weekend: instead of BG’s glmnet I just ran a linear regression (it’s fine for this), and I used Bonica’s composite score as the measure of ideology. For the Split Ticket version, I’m also not sure I have the exact form of the lagged value Lakshya uses, so I just used lagged presidential vote share. The regression only includes 2024, because we’re focused on the most recent results.↩︎

