Introduction

In business and finance, return on investment (ROI) measures how much benefit is generated for each dollar spent. While election administration is not a business, it faces a similar problem: how to allocate limited resources to support participation and run elections smoothly.

In this project, I study North Carolina’s 2022 general election and ask: Does spending more money on election administration per registered voter lead to higher turnout?

Counties in North Carolina vary in their election budgets, population, and demographics. This variation provides an opportunity to examine whether higher spending is associated with higher voter turnout.

Research Question: Do counties in North Carolina that spend more per registered voter on operating expenses achieve higher turnout in the 2022 general election?

Hypothesis: Counties with higher spending per registered voter will have higher turnout rates, because greater administrative capacity should increase participation.

Reading and Preparing Data

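library(data.table)  # fread()
library(tidyverse)   # read_csv(), dplyr, ggplot2 (map_data() also needs the maps package installed)
library(scales)      # percent_format()
voter_stats <- fread("voter_stats_20221108.txt")
history_stats <- fread("history_stats_20221108.txt")
finance <- read_csv("finance_2022.csv")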
## Rows: 85 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County, Operating_Expenses, Spending_per_Capita
## dbl (1): Year
## num (1): Population
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Filtering to the 2022 General Election, Aggregating Registered Voters and Ballots Cast, and Computing County-Level Turnout

# Keep only rows for the 2022 general election
voter_stats_2022 <- voter_stats %>%
  filter(election_date == "11/08/2022")

history_stats_2022 <- history_stats %>%
  filter(election_date == "11/08/2022")

# Registered voters per county
reg_county <- voter_stats_2022 %>%
  group_by(county_desc) %>%
  summarise(
    registered_voters = sum(total_voters),
    .groups = "drop"
  )

# Ballots cast per county
county_ballots <- history_stats_2022 %>%
  group_by(county_desc) %>%
  summarise(
    ballots_cast = sum(total_voters),
    .groups = "drop"
  )

# Turnout rate = ballots cast / registered voters
county_turnout <- reg_county %>%
  left_join(county_ballots, by = "county_desc") %>%
  mutate(
    turnout_rate = ballots_cast / registered_voters
  )
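Because the turnout rate depends on a join between two separately aggregated tables, a quick sanity check can catch counties with no matching ballots or with a rate above 100%. The sketch below is only an illustration: `toy_turnout`, its county names, and its counts are invented stand-ins for `county_turnout`, not values from the real files.

```r
library(dplyr)

# Toy stand-in for county_turnout; counties and counts are invented for illustration.
toy_turnout <- data.frame(
  county_desc       = c("ALPHA", "BETA", "GAMMA"),
  registered_voters = c(1000, 2000, 500),
  ballots_cast      = c(520, NA, 510)
) %>%
  mutate(turnout_rate = ballots_cast / registered_voters)

# Flag rows where the join found no ballots or where turnout exceeds 100%.
turnout_problems <- toy_turnout %>%
  filter(is.na(ballots_cast) | turnout_rate > 1)

print(turnout_problems)
```

Applied to the real `county_turnout`, an empty result would suggest the two files line up on `county_desc`; any flagged rows would be worth inspecting before modeling.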

Cleaning Finance Data, Creating a Matching Key, and Merging

The finance data stored operating expenses as strings with dollar signs and commas, so I cleaned these into numeric values and created a spending per registered voter variable.

finance <- finance %>%
  mutate(
    county_key = toupper(
      gsub("\\s*county$", "", County, ignore.case = TRUE)
    ),
    operating_num = as.numeric(gsub("[^0-9.]", "", Operating_Expenses))
  )
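To make the two `gsub()` patterns concrete, here is a small sketch that applies them to a few made-up strings. The input values are hypothetical and only mimic the shape of the `County` and `Operating_Expenses` columns.

```r
# Hypothetical raw values shaped like the finance file's columns.
raw_county   <- c("Wake County", "new hanover county", "Durham")
raw_expenses <- c("$1,234,567.50", "$980,000", "45000")

# Drop a trailing "County" (any case) and upper-case the rest.
county_key <- toupper(gsub("\\s*county$", "", raw_county, ignore.case = TRUE))

# Strip everything except digits and the decimal point, then convert.
operating_num <- as.numeric(gsub("[^0-9.]", "", raw_expenses))

print(county_key)     # "WAKE" "NEW HANOVER" "DURHAM"
print(operating_num)  # 1234567.5 980000.0 45000.0
```

One caveat of this pattern is that it also strips minus signs and parentheses, so a negative amount would silently become positive; that should not matter for operating expenses but is worth knowing.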

county_turnout <- county_turnout %>%
  mutate(county_key = toupper(county_desc))

final_df <- finance %>%
  left_join(county_turnout, by = "county_key") %>%
  mutate(
    spending_per_registered_voter = operating_num / registered_voters
  )

At this point, final_df contains the following for each county: population, operating expenses, spending per registered voter, and turnout rate.
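Since the merge relies on a text key, spelling differences between the two sources would silently produce missing turnout values. One way to check this, sketched here with invented data, is an `anti_join()` that lists finance rows with no match. The table names `toy_finance` and `toy_turnout` and the mismatched spelling are assumptions for illustration only.

```r
library(dplyr)

# Invented example: the third key is spelled differently in the two tables.
toy_finance <- data.frame(
  county_key    = c("WAKE", "DURHAM", "MC DOWELL"),
  operating_num = c(5e6, 2e6, 3e5)
)
toy_turnout <- data.frame(
  county_key   = c("WAKE", "DURHAM", "MCDOWELL"),
  turnout_rate = c(0.55, 0.53, 0.50)
)

# Finance rows that would get NA turnout after a left join.
unmatched <- anti_join(toy_finance, toy_turnout, by = "county_key")

print(unmatched)
```

Running the same call on `finance` and `county_turnout` would show whether any county needs its key corrected by hand.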

Visualization 1: Spending vs. Turnout Scatterplot

# Keep only rows with non-missing, finite spending and turnout
model_data <- final_df %>%
  filter(
    !is.na(spending_per_registered_voter),
    !is.na(turnout_rate),
    is.finite(spending_per_registered_voter),
    is.finite(turnout_rate)
  )


# Scatterplot using the cleaned data
ggplot(model_data,
       aes(x = spending_per_registered_voter,
           y = turnout_rate)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Spending per Registered Voter vs. Turnout Rate",
    x = "Spending per Registered Voter (US$)",
    y = "Turnout Rate"
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

model <- lm(turnout_rate ~ spending_per_registered_voter, data = model_data)
summary(model)
## 
## Call:
## lm(formula = turnout_rate ~ spending_per_registered_voter, data = model_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.151382 -0.024168 -0.000458  0.038220  0.133436 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   0.4976231  0.0119488  41.646  < 2e-16 ***
## spending_per_registered_voter 0.0029584  0.0009373   3.156  0.00223 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05313 on 83 degrees of freedom
## Multiple R-squared:  0.1072, Adjusted R-squared:  0.0964 
## F-statistic: 9.961 on 1 and 83 DF,  p-value: 0.002228

This scatterplot shows a positive relationship between election spending and turnout, with counties that spend more per registered voter generally having higher turnout.

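I fit a simple linear regression of turnout on spending per registered voter. The coefficient is positive and statistically significant (estimate = 0.00296, p = 0.002): each additional dollar spent per registered voter is associated with about 0.3 percentage points higher turnout.

The takeaway is that spending is associated with turnout but is far from the only factor: the model explains only about 11% of the county-level variation in turnout (R-squared = 0.107).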
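To put the slope in practical terms, the fitted line can be evaluated at two spending levels. The coefficients below are copied from the regression output above; the spending levels of $5 and $15 are illustrative choices, not values taken from the data.

```r
# Coefficients copied from the summary(model) output.
intercept <- 0.4976231
slope     <- 0.0029584

# Predicted turnout from the fitted line at a given spending level.
predict_turnout <- function(spend) intercept + slope * spend

# Illustrative spending levels in US$ per registered voter (assumed, not from the data).
low  <- predict_turnout(5)
high <- predict_turnout(15)

# Difference in predicted turnout, in percentage points.
diff_pp <- (high - low) * 100

print(round(c(low = low, high = high), 4))
print(round(diff_pp, 2))  # 2.96
```

Under this model, a $10 difference in spending per registered voter corresponds to roughly 3 percentage points of predicted turnout, which is an association rather than an estimate of a causal effect.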

Visualization 2: Spending-Turnout Quadrant Map

# 1. Compute median cutpoints
spend_med   <- median(final_df$spending_per_registered_voter, na.rm = TRUE)
turnout_med <- median(final_df$turnout_rate, na.rm = TRUE)

# 2. Classify each county into quadrants
final_df_quad <- final_df %>%
  mutate(
    county_key = toupper(county_desc),  # key to join with map
    spend_level = ifelse(spending_per_registered_voter >= spend_med,
                         "High Spending", "Low Spending"),
    turnout_level = ifelse(turnout_rate >= turnout_med,
                           "High Turnout", "Low Turnout"),
    quadrant = dplyr::case_when(
      spend_level == "High Spending" & turnout_level == "High Turnout" ~ "High Spending, High Turnout",
      spend_level == "High Spending" & turnout_level == "Low Turnout"  ~ "High Spending, Low Turnout",
      spend_level == "Low Spending"  & turnout_level == "High Turnout" ~ "Low Spending, High Turnout",
      spend_level == "Low Spending"  & turnout_level == "Low Turnout"  ~ "Low Spending, Low Turnout",
      TRUE ~ NA_character_
    )
  )

# 3. NC county polygons from {maps}, with a matching key
us_counties <- map_data("county")

nc_map <- us_counties %>%
  dplyr::filter(region == "north carolina") %>%
  mutate(
    county_key = toupper(subregion)   # "alamance" -> "ALAMANCE"
  )

# 4. Join quadrants onto polygons
nc_map_quad <- nc_map %>%
  left_join(
    final_df_quad %>%
      dplyr::select(county_key, quadrant),
    by = "county_key"
  ) %>%
  mutate(
    quadrant_label = ifelse(
      is.na(quadrant),
      "Missing finance data",
      quadrant
    )
  )

# 5. Colors and map plot
quad_colors <- c(
  "High Spending, High Turnout" = "green",
  "High Spending, Low Turnout"  = "orange",
  "Low Spending, High Turnout"  = "purple",
  "Low Spending, Low Turnout"   = "red",
  "Missing finance data"        = "grey"
)

ggplot(nc_map_quad,
       aes(x = long, y = lat, group = group, fill = quadrant_label)) +
  geom_polygon(color = "white", linewidth = 0.2) +
  coord_fixed(1.3) +
  scale_fill_manual(
    values = quad_colors,
    name   = "County Type"
  ) +
  labs(
    title = "Spending and Turnout Quadrants by County",
    x = NULL,
    y = NULL
  ) +
  theme_minimal() +
  theme(
    axis.text  = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  )

This map reveals that:

- High Spending, High Turnout counties are consistent with the hypothesis.
- High Spending, Low Turnout counties suggest that spending alone is not enough and that other factors limit participation.
- Low Spending, High Turnout counties appear very efficient, achieving strong participation with less spending.
- Low Spending, Low Turnout counties match what we expected: low investment coincides with lower participation.
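A count of counties per quadrant can complement the map, since the map alone does not show how many counties fall in each group. The sketch below repeats the median-split logic on a small invented table; `toy`, its county labels, and its values are hypothetical.

```r
library(dplyr)

# Invented spending (US$) and turnout values for six hypothetical counties.
toy <- data.frame(
  county  = c("A", "B", "C", "D", "E", "F"),
  spend   = c(4, 6, 8, 10, 12, 14),
  turnout = c(0.48, 0.55, 0.47, 0.52, 0.58, 0.50)
)

spend_med   <- median(toy$spend)
turnout_med <- median(toy$turnout)

# Same rule as the map: at or above the median counts as "High".
quadrant_counts <- toy %>%
  mutate(
    quadrant = paste0(
      ifelse(spend >= spend_med, "High", "Low"), " Spending, ",
      ifelse(turnout >= turnout_med, "High", "Low"), " Turnout"
    )
  ) %>%
  count(quadrant)

print(quadrant_counts)
```

If the two diagonal quadrants (high/high and low/low) hold most counties in the real data, that would agree with the positive slope found in the regression.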

Visualization 3: Turnout by Race

# Registered voters by county and race
reg_race_county <- voter_stats_2022 %>%
  group_by(county_desc, race_code) %>%
  summarise(
    registered = sum(total_voters),
    .groups = "drop"
  )

if ("race_code" %in% names(history_stats_2022)) {
  # Ballots cast by county and race
  ballots_race_county <- history_stats_2022 %>%
    group_by(county_desc, race_code) %>%
    summarise(
      ballots_cast = sum(total_voters),
      .groups = "drop"
    )
  
  # Merge and compute turnout rate
  turnout_race_county <- reg_race_county %>%
    left_join(ballots_race_county,
              by = c("county_desc", "race_code")) %>%
    mutate(
      turnout_rate = ballots_cast / registered
    )
  
} else {
  # If history_stats_2022 doesn't have race_code, create a placeholder with NA
  turnout_race_county <- reg_race_county %>%
    mutate(
      ballots_cast = NA_real_,
      turnout_rate = NA_real_
    )
}

major_races <- c("W", "B", "A", "I")   # White, Black, Asian, American Indian

clean_race_county <- turnout_race_county %>%
  filter(race_code %in% major_races) %>%
  filter(!is.na(turnout_rate))

race_labels <- c(
  "A" = "Asian",
  "B" = "Black",
  "I" = "American Indian",
  "W" = "White"
)

clean_race_county <- clean_race_county %>%
  mutate(race_full = race_labels[race_code])

ggplot(clean_race_county,
       aes(x = county_desc, y = turnout_rate)) +
  geom_col() +
  facet_wrap(~ race_full, ncol = 2) +
  labs(
    title = "Turnout by Race Across NC Counties (Major Races, 2022)",
    x = "County",
    y = "Turnout Rate"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal() +
  theme(
    axis.text.x  = element_text(angle = 90, hjust = 1, size = 5),
    strip.text   = element_text(face = "bold"),
    plot.title   = element_text(hjust = 0.5)
  )

This visualization shows that White voters tend to have the highest turnout rates across most counties. Black and American Indian voters often show lower turnout, with variation across counties. Hispanic voters are not shown because the plot is restricted to the four race codes W, B, A, and I. These graphs can help explain some of the variation seen earlier.
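The per-county bars give every county equal visual weight regardless of size. A pooled rate, which sums ballots and registrations across counties before dividing, is one way to summarize each group statewide. The sketch below uses invented counts; `toy_race` is a hypothetical stand-in for `turnout_race_county`.

```r
library(dplyr)

# Invented counts for two hypothetical counties and two race codes.
toy_race <- data.frame(
  county_desc  = c("ALPHA", "ALPHA", "BETA", "BETA"),
  race_code    = c("W", "B", "W", "B"),
  registered   = c(1000, 400, 200, 600),
  ballots_cast = c(600, 180, 100, 300)
)

# Pool across counties: sum first, then divide.
pooled <- toy_race %>%
  group_by(race_code) %>%
  summarise(
    registered   = sum(registered),
    ballots_cast = sum(ballots_cast),
    .groups = "drop"
  ) %>%
  mutate(turnout_rate = ballots_cast / registered)

print(pooled)
```

The pooled rate can differ from the average of county rates when county sizes vary widely, so the two summaries answer slightly different questions.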

Conclusion

There is a positive, statistically significant relationship between spending per registered voter and turnout.

The effect size is modest: spending explains about 11% of the variation in turnout (R-squared = 0.107), so most of the variation is left unexplained.

Geographic patterns and racial turnout differences also appear to contribute to county-level turnout differences, although they were not tested in the regression.

Overall, greater investment in election administration is associated with increased participation, but is not a sufficient factor on its own.

Limitations and Future Research

Limitations: Only one election year was analyzed, so trends cannot be observed. Also, my financial data only included operating expenses, which may not capture all election spending. Finally, socioeconomic variables such as income and education can affect turnout but were not included in the model.

Future research: I would extend this analysis to multiple election years to study trends over time. I would also compare North Carolina with other states to see whether the spending-turnout relationship holds elsewhere.