---
title: "Everyone's wrong about WAR and moderation"
author: "William May"
date: "August 23rd, 2025"
output:
  html_document:
    code_download: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(plotly)
library(tidyverse)
# library(texreg)
library(ggplot2)

# This section copies the preprocessing from the Bonica/Grumbach code, and also
# reads their dataset: https://github.com/abonica/WAR-Analysis-ideology

# Read and preprocess data
war_data <- read.csv('war_dime_merged.csv')
war_data$state_dist     <- paste(war_data$state_name, war_data$district, sep = "_")
war_data$dem_vote_share <- 100 * war_data$dem_vote_share
war_data$dem_pres_vs <- 100 * war_data$dem_pres_vs

# model_data <- dummy_cols(war_data, select_columns = "state_name")
model_data <- war_data

model_data$dem_tenure_sq <- model_data$dem_tenure^2
model_data$small_state_incumbent_tenure <- as.numeric(model_data$is_small_state * model_data$dem_tenure)
model_data$small_state_incumbent        <- as.integer(model_data$small_state_incumbent_tenure > 0)

# Log fundraising gap; log1p() keeps zero receipts finite
model_data$log_cf_diff <- log1p(model_data$dem_total_receipts) - log1p(model_data$rep_total_receipts)
# Incumbency indicators: 1 if incumbent ('I'), 0 otherwise (missing treated as 0)
model_data$rep_inc <- as.numeric(model_data$rep_inc == 'I')
model_data$dem_inc <- as.numeric(model_data$dem_inc == 'I')
model_data$rep_inc[is.na(model_data$rep_inc)] <- 0
model_data$dem_inc[is.na(model_data$dem_inc)] <- 0

# model_data <- model_data %>%
#   group_by(state_dist) %>%
#   mutate(dem_vote_share_lag = lag(dem_vote_share, n = 1, order_by = state_dist)) %>%
#   ungroup() %>%
#   as.data.frame()
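# ^ The original code orders by state_dist, which is constant within each group,
# so the lag follows row order rather than Year. Working replacement: self-merge
# to attach each district's dem_pres_vs from two years earlier.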
model_data_lagged = model_data %>%
  transform(lagged_dem_pres_vs = dem_pres_vs) %>%
  subset(select = c(Year, state_dist, lagged_dem_pres_vs))
model_data = model_data %>%
  transform(lagged_year = Year - 2) %>%
  merge(model_data_lagged, by.x = c('lagged_year', 'state_dist'),
        by.y = c('Year', 'state_dist'), all.x = TRUE, all.y = FALSE)
```

I've looked at the code from various groups in the WAR/moderation debate,
wondering why they get different results. The answer is clear: no one is
running the regression correctly.

Currently, the three main parties in the debate are

- Lakshya Jain from Split Ticket. See his [original
  post](https://split-ticket.org/2025/03/17/are-moderates-more-electable/) in
  the debate and [a later
  response](https://split-ticket.org/2025/08/15/deconstructing-war/).
- Adam Bonica and Jake Grumbach from Stanford and UC Berkeley, respectively. See
  their response to Jain
  [here](https://data4democracy.substack.com/p/do-moderates-do-better). I'll
  refer to them as BG for convenience.
- G. Elliot Morris, who writes Strength in Numbers, along with Mark Rieke. He
  has two posts [here](https://www.gelliottmorris.com/p/moderation-is-overrated)
  and
  [here](https://www.gelliottmorris.com/p/data-over-dogma-a-reply-to-matt-yglesias),
  where the bulk of the argument is awkwardly behind a paywall.

Two groups have released code, which I have used to check the analyses (thank
you!):

- Bonica and Grumbach:
  [github](https://github.com/abonica/WAR-Analysis-ideology)
- Morris and Rieke: [github](https://github.com/markjrieke/2026-war)

You can also download my code for this article-- just click the dropdown by the
title.

All of these groups have been analysing the effect of moderation using an
invalid regression approach. The approach has two steps:

1. Run a regression predicting candidates' election vote shares. Crucially, this
   regression *does not include ideology as a predictor*. ~~(This also describes
   Morris/Rieke's Bayesian model, which is essentially a regression.)~~[^mr]
2. Label the residuals from the first regression "WAR" (wins above replacement),
   and run another regression that predicts WAR using ideology.

[^mr]: I took a closer look and realized Morris and Rieke use a different type of model, so this criticism may not apply to their model.

Statistically, this is not recommended. To measure the effect of a variable, we
nearly always run *one* regression. Even in the rarer cases where we fit
multiple regression equations, they are fit jointly. This two-step procedure is
not a standard method in statistics, and the nonstandard approach has caused a
big problem: it contaminates all of the results with omitted variable bias.

The big mistake is in the first step: everyone runs this regression but leaves
out ideology. When you do this, the other predictors that are correlated with
ideology soak up part of the effect of ideology into their own coefficients
(omitted variable bias). By the time the second regression is fit, much of the
effect of ideology has already been removed.
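
A short derivation shows how large this distortion can be. It is a sketch under simplifying assumptions introduced here for illustration, not something taken from the groups' code: vote share $y$ depends linearly on ideology $x$ and on the other predictors $Z$ (which include the intercept), the error $\varepsilon$ is uncorrelated with both, and the two approaches are fit on the same rows.

$$
\begin{aligned}
y &= \beta x + Z\gamma + \varepsilon, \qquad M_Z = I - Z(Z^\top Z)^{-1}Z^\top \\
\hat\beta_{\text{1-step}} &= \frac{x^\top M_Z\, y}{x^\top M_Z\, x} \\
\hat\beta_{\text{2-step}} &= \frac{x^\top M_Z\, y}{\sum_i (x_i - \bar{x})^2}
  = \left(1 - R^2_{x \mid Z}\right)\hat\beta_{\text{1-step}}
\end{aligned}
$$

Here $M_Z y$ is the first-step residual (the "WAR"), and $R^2_{x \mid Z}$ is the $R^2$ from regressing ideology on the other predictors. The one-step line is the Frisch-Waugh-Lovell result. The two-step estimate divides by the total variation in ideology rather than the variation left over after removing $Z$, so it is shrunk toward zero by the factor $1 - R^2_{x \mid Z}$. The better the other predictors explain ideology, the more of its effect disappears.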

Why do Bonica and Grumbach find that ideology has no effect? It's because they
included many more predictors that are correlated with ideology, and those
predictors soaked up more of the effect of ideology. ~~The same also applies to
Morris/Rieke's analysis, because they also include more predictors.~~ Lakshya's
original WAR model includes fewer predictors, so the omitted variable bias is
less severe in his analysis.
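
To see the mechanism at work, here is a small simulation. It is a sketch on made-up data, not a rerun of any group's model: the variable names, the true effect of 1, and the way the controls are tied to ideology are all assumptions chosen for illustration. It compares the two-step and one-step estimates as the number of ideology-correlated controls grows.

```r
# Illustrative simulation; every name and number here is hypothetical.
set.seed(1)
n <- 5000
beta <- 1  # assumed true effect of ideology on vote share

simulate_once <- function(n_controls) {
  # Controls that are each correlated with ideology
  Z <- matrix(rnorm(n * n_controls), n, n_controls)
  ideology <- 0.5 * rowSums(Z) + rnorm(n)
  vote <- beta * ideology + rowSums(Z) + rnorm(n)
  dat <- data.frame(vote, ideology, Z)

  # Two-step: residualize vote share on the controls only, then regress
  # the residuals ("WAR") on ideology
  war <- resid(lm(vote ~ . - ideology, data = dat))
  two_step <- coef(lm(war ~ ideology, data = dat))[["ideology"]]

  # One-step: ideology and the controls in the same regression
  one_step <- coef(lm(vote ~ ., data = dat))[["ideology"]]

  # R^2 from regressing ideology on the controls
  r2 <- summary(lm(ideology ~ . - vote, data = dat))$r.squared

  c(n_controls = n_controls, one_step = one_step, two_step = two_step,
    predicted_two_step = (1 - r2) * one_step)
}

results <- t(sapply(c(1, 4, 10), simulate_once))
print(round(results, 3))
```

By construction the controls explain 20%, 50% and about 71% of the variance in ideology for 1, 4 and 10 controls, so the two-step estimate should land near 0.8, 0.5 and 0.29 while the one-step estimate stays near the true value of 1.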

The goal is to find the effect of ideology on vote share. So, fundamentally, we
should run *one* regression that predicts vote share, using ideology as a
predictor. Then we can check the coefficient on ideology to find its effect on
the vote share. It is incorrect to use WAR for this!
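
In its simplest form, the one-step approach looks like the sketch below. The data are simulated stand-ins and the column names (`ideology`, `incumbent`, `pres_vote`, `vote_share`) are hypothetical, so this shows only the shape of the analysis, not the actual models compared in this article.

```r
# Simulated stand-in data; column names and effect sizes are hypothetical.
set.seed(42)
n <- 400
sim <- data.frame(
  ideology  = rnorm(n),                      # assumed: higher = more moderate
  incumbent = rbinom(n, 1, 0.5),
  pres_vote = rnorm(n, mean = 50, sd = 10)
)
sim$vote_share <- 5 + 0.8 * sim$ideology + 3 * sim$incumbent +
  0.9 * sim$pres_vote + rnorm(n, sd = 2)

# One regression: ideology sits alongside the other predictors
fit <- lm(vote_share ~ ideology + incumbent + pres_vote, data = sim)

# The effect of ideology is read directly off its coefficient
ideology_est <- coef(summary(fit))["ideology", c("Estimate", "Std. Error")]
ideology_ci  <- confint(fit, "ideology", level = 0.95)  # t-based interval
print(ideology_est)
print(ideology_ci)
```

`confint()` uses the t distribution, so its interval is slightly wider than the estimate plus or minus 1.96 standard errors, though with a few hundred districts the difference is small.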

Below I show the results from both the BG and Split Ticket versions of the
analysis, comparing the two-step analysis (incorrect) to the correct one-step
estimate.[^missingfiles] The error bars are the 95% confidence interval for the
coefficient:

[^missingfiles]: I didn't run the Strength in Numbers model because they don't
    provide the data files needed to reproduce it. I made a few other
    simplifications to make this easy to run over the weekend-- instead of BG's
    glmnet I just ran a linear regression (it's fine for this), and I used
    Bonica's composite score as the measure of ideology. For the Split Ticket
    version, I'm also not sure I have the exact form of the lagged value Lakshya
    uses. I just used lagged presidential vote share. The regression only
    includes 2024, because we're focused on the most recent results.

```{r omitted-bias-results}
mdat2024 = model_data[model_data$cycle == 2024, ]

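# Two-step (WAR-style) estimate: residualize the outcome on the non-ideology
# predictors, then regress those residuals on ideology.
two_step_lm = function(y, xvars, dat) {
  # na.exclude pads the residuals with NA so they stay aligned with rows of dat
  m1 = lm(y ~ ., dat[, xvars], na.action = na.exclude)
  war = resid(m1)
  lm(war ~ dem_composite, dat[, 'dem_composite', drop = FALSE])
}

# Return the ideology coefficient from both the two-step and one-step fits
get_ideo_coefs = function(y, xvars) {
  regr_dat = mdat2024
  m_2step = two_step_lm(y, xvars, regr_dat)
  # One-step: ideology enters the same regression as the other predictors
  m_1step = lm(y ~ ., regr_dat[, c('dem_composite', xvars)])
  out = rbind(summary(m_2step)$coefficients,
              summary(m_1step)$coefficients)
  # Keep only the ideology rows (first is two-step, second is one-step)
  out = out[row.names(out) == 'dem_composite', ]
  out %>%
    as.data.frame %>%
    transform(model = c('2-step (incorrect)', '1-step (correct)'))
}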

x_bg = c("dem_inc", "rep_inc", "dem_pres_vs", "rep_cfscore", "log_cf_diff",
         "is_small_state", "dem_tenure", "dem_tenure_sq",
         "small_state_incumbent", "small_state_incumbent_tenure", "state_name")
x_st = c("dem_inc", "rep_inc", 'lagged_dem_pres_vs')

res_bg = get_ideo_coefs(mdat2024$dem_vote_share, x_bg)
res_st = get_ideo_coefs(with(mdat2024, dem_vote_share - dem_pres_vs), x_st)
res_bg$Group = 'BG'
res_st$Group = 'Split Ticket'
res_all = rbind(res_st, res_bg)
# print(res_all)

ggplot(res_all, aes(x = Group, y = Estimate, fill = model)) +
  geom_bar(position = 'dodge', stat = 'identity') +
  geom_errorbar(aes(ymin = Estimate - `Std. Error` * 1.96,
                    ymax = Estimate + `Std. Error` * 1.96),
                position = position_dodge(width = .9), width = .5,
                color = '#444444') +
  ylab('Percentage-point change in vote share per unit change in ideology\n(2024 elections)') +
  scale_fill_discrete(type = palette.colors()[3:2])
```

We can see that, when done correctly, BG's results are consistent with Lakshya's
results. Both BG's and Lakshya's approaches find that moderation increases vote
share.

We can also see that, when done correctly, the effect of moderation is
detectable (statistically significant). This directly contradicts Elliot Morris,
who has insisted that the confidence interval contains both large negative and
large positive effects. When the analysis is done correctly, the effect of
moderation is *not* hugely uncertain. Instead we can see that it is highly
likely to increase vote share.

To get a sense of how large this effect is in practice, I'm plotting the
ideology scores of the Democrats below (hover to see the names):

<!-- ```{r scores} -->
<!-- hist(mdat2024$dem_composite, breaks=10) -->
<!-- ``` -->

```{r scores2, out.width="800px", out.height="250px", warning=FALSE}
# following https://en.wikipedia.org/wiki/Squad_(U.S._Congress)
squad_members = c(
    'Ocasio-Cortez',
    'omar',
    'pressley',
    'tlaib',
    'casar',
    'Summer Lee',
    'Delia Ramirez'
)
squad_grep = paste(tolower(squad_members), collapse = '|')
# following https://en.wikipedia.org/wiki/Blue_Dog_Coalition#Current_members
blue_dogs = c(
    'Mike Thompson',
    'Adam Gray',
    'Jim Costa',
    'Lou Correa',
    'Sanford Bishop',
    'Jared Golden',
    'Josh Gottheimer',
    'Henry Cuellar',
    'Marie Gluesenkamp Perez',
    'Vicente Gonzalez'
)
bluedog_grep = paste(tolower(blue_dogs), collapse = '|')

mdat2024 %>%
  transform(squad = grepl(squad_grep, Democrat, ignore.case = TRUE),
            blue_dog = grepl(bluedog_grep, Democrat, ignore.case = TRUE)) %>%
  transform(group = ifelse(squad, 'Squad', ifelse(blue_dog, 'Blue Dog', 'Other'))) %>%
  plot_ly(type = 'scatter', mode = 'markers', data = .,
          x = ~dem_composite, y = jitter(rep(0, nrow(mdat2024)), .5),
          text = ~paste(Democrat, '<br>Score:', round(dem_composite, 2)),
          color = ~group,
          colors = c('purple', '#777777', 'red'),
          marker = list(size = 7, opacity = 0.7)) %>%
  layout(xaxis = list(title = 'Bonica composite score'),
         # y-axis is meaningless
         yaxis = list(title = "", zeroline = FALSE, showline = FALSE,
                      showticklabels = FALSE, showgrid = FALSE))
```

So we're seeing that, with both versions of this model, moving from a relatively
extreme to a moderate Democrat (an increase of 1 to 1.5 on the ideology scale)
would increase the vote share by around 1 percentage point, which is a 2-point
change in the vote margin.
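
The conversion from vote share to margin is simple arithmetic, assuming a two-party contest in which the Democratic share $D$ and the Republican share $R$ (both in percent) sum to 100:

$$
M = D - R = D - (100 - D) = 2D - 100
\quad\Longrightarrow\quad \Delta M = 2\,\Delta D
$$

So each 1-point gain in vote share is a 2-point gain in margin. The factor of 2 holds whenever the gain comes entirely at the Republican's expense.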

To some extent we can blame Lakshya for this two-step problem, since he
introduced the two-step procedure that others then copied. But the results show
that it wasn't much of a problem for his analysis anyway, and only became
serious when other people added more predictors.


## So what about WAR?

There's a broader takeaway here: if you include many predictors in your WAR
model, you are likely to get inaccurate WAR numbers. The WAR you get will be
missing many aspects of candidate quality, due to omitted variable bias. It's
not just the moderation results; the WAR values themselves are likely to be
messed up! Only Lakshya's WAR seems reliable, due to having few predictors.

So this suggests the ~~two~~ new version~~s~~ of WAR (BG ~~and Strength in Numbers~~) cannot
be trusted. To make progress over Lakshya's original numbers, WAR modelers need
to think much more carefully about how to measure candidate quality, and what
attributes count as candidate quality in the first place. Throwing in more
predictors can easily make WAR *worse*, not better.


## Other issues

That's not the only statistical issue plaguing this debate. Here are a few more
that have only been addressed inconsistently.

**Spillover effects:** Imagine a district in Ohio. Without changing anything
within the district, we swap out all other Democratic politicians across the
country with Alexandria Ocasio-Cortez. Would that change the votes in Ohio? Of
course it would!

This shows that candidate moderation has spillover effects outside of the
candidate's district, and these spillover effects are likely to be large.
Currently no one in this discussion is attempting to measure these spillover
effects. That means all of these estimates should be considered *lower bound
estimates* of the total impact of moderation.

**Inaccuracy of ideology scores:** I think many people in this debate are far
too credulous about the ideology scores coming from political science. DW-NOMINATE,
the most widely used ideology score in political science, [incorrectly marks the
squad as moderate
Democrats](https://voteview.com/articles/Ocasio-Cortez_Omar_Pressley_Tlaib).
There are also questions about the accuracy of another popular ideology score,
the CFscores (which are derived from campaign donations), especially in recent
elections. BG, Morris, and I (in this article) all use Bonica's "composite
scores", which are [heavily dependent on
DW-NOMINATE](https://bsky.app/profile/wmay.bsky.social/post/3lwkblb5mr22g) and
various CF scores. Generally, if these scores are inaccurate, we will
underestimate the effect of moderation.
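
The direction of this bias follows from the textbook attenuation result, which holds under classical measurement error (noise in the score that is unrelated to both true ideology and vote share). Whether the errors in DW-NOMINATE or the CFscores are actually of that kind is an assumption, not something established here. A minimal simulated sketch, with made-up numbers and hypothetical variable names:

```r
# Illustrative only: classical measurement error in an ideology score.
set.seed(7)
n <- 20000
true_ideology <- rnorm(n)                 # true ideology, variance 1
vote <- 1.0 * true_ideology + rnorm(n)    # assumed true slope = 1

# Observed score = truth + independent noise of increasing size
noise_sd <- c(0, 0.5, 1)
slopes <- sapply(noise_sd, function(s) {
  score <- true_ideology + rnorm(n, sd = s)
  coef(lm(vote ~ score))[["score"]]
})

# Theoretical attenuation factor: var(truth) / (var(truth) + var(noise))
expected <- 1 / (1 + noise_sd^2)
print(round(cbind(noise_sd, slopes, expected), 3))
```

With noise as large as the true variation (`noise_sd = 1`), the estimated slope should fall to roughly half of the true effect.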

The scores I find most credible are [GGUM](https://doi.org/10.1017/pan.2022.33)
and [text-based
scores](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4350550). But these
aren't widely available at the moment (GGUM takes days just to estimate a single
Congress \*sigh\*). GGUM gives the original squad members the most extreme left
scores in Congress, while you can see above that Bonica's composite scores do
not put the squad farthest left.

Lakshya's approach of comparing groups/caucuses is a good robustness check that
should work even if the ideology scores are inaccurate.

**Causal inference:** Some other predictors in these regressions may themselves
be caused in part by moderation. For example, if moderation causes incumbency
because moderates are more likely to get elected, then the causal impact of
moderation is *higher* than what we estimated here. On the other hand, if
nominating moderates causes Republicans to nominate more moderate candidates in
response, then the causal impact of moderation on vote share is *lower* than
what we estimated here. (Though I would consider more moderate Republicans to be
a positive side effect!) After accounting for this, the result we estimated
could be meaningfully different from the full causal effect, and it's not clear
whether the causal effect is higher or lower.
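
The incumbency case can be illustrated with a toy simulation. Everything in it is assumed for the sake of the example: moderation raises the chance of being an incumbent, and incumbency in turn adds to vote share. Controlling for incumbency then recovers only the direct effect of moderation, while the total effect also includes the path through incumbency.

```r
# Illustrative only: incumbency as a mediator of moderation (made-up numbers).
set.seed(3)
n <- 20000
moderation <- rnorm(n)
# Assumed mechanism: more moderate candidates are more likely to be incumbents
incumbent <- rbinom(n, 1, plogis(moderation))
vote <- 1.0 * moderation + 2.0 * incumbent + rnorm(n)

# Controlling for incumbency isolates the direct effect (about 1 here)
direct_only <- coef(lm(vote ~ moderation + incumbent))[["moderation"]]
# Leaving it out picks up the direct effect plus the path via incumbency
total <- coef(lm(vote ~ moderation))[["moderation"]]

print(round(c(direct_only = direct_only, total = total), 3))
```

The opposite case, where Republicans respond by nominating more moderate candidates, would work the same way with the sign of the indirect path reversed.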
