Congressional Polling Data: What It Tells Us

Getting Started

I find the way polling works in politics interesting, so for this project I looked into FiveThirtyEight’s polling data on the “generic ballot” question. In a generic ballot poll, pollsters ask people which party they would vote for in Congress without naming specific candidates. FiveThirtyEight wrote about this in 2017 in an article called “Here’s The Best Tool We Have For Understanding How The Midterms Are Shaping Up”.

What makes this dataset great is that it goes back to November 1995, so we can see how voter preferences have shifted between Democrats and Republicans over time. This should give a broader understanding of voter choices than looking at a single poll at a single point in time.

Getting the Data Ready

First, load the libraries and grab the data from GitHub:

library(tidyverse)
library(lubridate)

# Pulling the data
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-generic-ballot/generic_topline_historical.csv"
generic_ballot_data <- read_csv(url)

## Rows: 7669 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): subgroup, modeldate, timestamp
## dbl (6): dem_estimate, dem_hi, dem_lo, rep_estimate, rep_hi, rep_lo
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Display the first few rows of data
head(generic_ballot_data)

## # A tibble: 6 × 9
##   subgroup  modeldate  dem_estimate dem_hi dem_lo rep_estimate rep_hi rep_lo
##   <chr>     <chr>             <dbl>  <dbl>  <dbl>        <dbl>  <dbl>  <dbl>
## 1 All polls 11/9/1995            50   56.3   43.7           44   50.3   37.7
## 2 All polls 11/10/1995           50   56.3   43.7           44   50.3   37.7
## 3 All polls 11/11/1995           50   56.3   43.7           44   50.3   37.7
## 4 All polls 11/12/1995           50   56.3   43.7           44   50.3   37.7
## 5 All polls 11/13/1995           50   56.3   43.7           44   50.3   37.7
## 6 All polls 11/14/1995           50   56.3   43.7           44   50.3   37.7
## # ℹ 1 more variable: timestamp <chr>

The column names are not very intuitive, so I will rename them. First, a look at each column and its data type:

# Display every column with its data type and first few values
glimpse(generic_ballot_data)

## Rows: 7,669
## Columns: 9
## $ subgroup     <chr> "All polls", "All polls", "All polls", "All polls", "All …
## $ modeldate    <chr> "11/9/1995", "11/10/1995", "11/11/1995", "11/12/1995", "1…
## $ dem_estimate <dbl> 50.00000, 50.00000, 50.00000, 50.00000, 50.00000, 50.0000…
## $ dem_hi       <dbl> 56.30568, 56.30568, 56.30568, 56.30568, 56.30568, 56.3056…
## $ dem_lo       <dbl> 43.69432, 43.69432, 43.69432, 43.69432, 43.69432, 43.6943…
## $ rep_estimate <dbl> 44.00000, 44.00000, 44.00000, 44.00000, 44.00000, 44.0000…
## $ rep_hi       <dbl> 50.30568, 50.30568, 50.30568, 50.30568, 50.30568, 50.3056…
## $ rep_lo       <dbl> 37.69432, 37.69432, 37.69432, 37.69432, 37.69432, 37.6943…
## $ timestamp    <chr> "11:37:15  3 Sep 2020", "11:37:14  3 Sep 2020", "11:37:14…

The data needs some cleanup: the dates are stored as text rather than as dates, the column names could be clearer, and it helps to calculate the advantage one party has over the other.

# Making the data easier to work with
cleaned_ballot_data <- generic_ballot_data %>%
  # Parse the month/day/year text into Date values
  mutate(poll_date = as.Date(modeldate, format = "%m/%d/%Y"),
         # Democratic lead in percentage points (negative when Democrats trail)
         dem_lead = dem_estimate - rep_estimate,
         # Width of each party's interval (helper columns, dropped by select() below)
         dem_uncertainty = dem_hi - dem_lo,
         rep_uncertainty = rep_hi - rep_lo,
         # Extract the year for easier grouping
         year = year(poll_date)) %>%
  # Keep only the columns we need, with clearer names
  select(date = poll_date, 
         year, 
         democrat = dem_estimate, 
         republican = rep_estimate,
         dem_lead,
         dem_high = dem_hi, 
         dem_low = dem_lo,
         rep_high = rep_hi,
         rep_low = rep_lo)

# Cleaned version
head(cleaned_ballot_data)

## # A tibble: 6 × 9
##   date        year democrat republican dem_lead dem_high dem_low rep_high
##   <date>     <dbl>    <dbl>      <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
## 1 1995-11-09  1995       50         44        6     56.3    43.7     50.3
## 2 1995-11-10  1995       50         44        6     56.3    43.7     50.3
## 3 1995-11-11  1995       50         44        6     56.3    43.7     50.3
## 4 1995-11-12  1995       50         44        6     56.3    43.7     50.3
## 5 1995-11-13  1995       50         44        6     56.3    43.7     50.3
## 6 1995-11-14  1995       50         44        6     56.3    43.7     50.3
## # ℹ 1 more variable: rep_low <dbl>
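
One thing worth checking after this step is whether every date actually parsed, because `as.Date()` returns `NA` instead of raising an error when a string does not match the format. The sketch below is only an illustration: the first three strings are copied from the `modeldate` output above, and the last one is a made-up malformed value that I added to show the failure case.

```r
# Sample values copied from the modeldate column shown above;
# the last entry is a hypothetical malformed value, added for illustration only.
sample_dates <- c("11/9/1995", "11/10/1995", "11/14/1995", "1995-11-15")

# Same format string as in the cleaning step
parsed <- as.Date(sample_dates, format = "%m/%d/%Y")

# Strings that do not match the format come back as NA
parsed

# Count how many values failed to parse
sum(is.na(parsed))
```

Running the same `sum(is.na(...))` check on the `date` column of `cleaned_ballot_data` would show whether any rows in the real file were affected.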

Looking at the Results

Now I want to see how these numbers change year by year:

# Yearly averages
yearly_summary <- cleaned_ballot_data %>%
  group_by(year) %>%
  summarize(avg_dem = mean(democrat), 
            avg_rep = mean(republican),
            avg_lead = mean(dem_lead),
            .groups = "drop")

# Display averages
yearly_summary

## # A tibble: 22 × 4
##     year avg_dem avg_rep avg_lead
##    <dbl>   <dbl>   <dbl>    <dbl>
##  1  1995    47.5    40.8    6.73 
##  2  1996    47.2    41.6    5.68 
##  3  1997    45.9    40.8    5.10 
##  4  1998    42.4    39.9    2.50 
##  5  1999    43.1    39.8    3.30 
##  6  2000    42.8    40.8    1.98 
##  7  2001    44.8    41.2    3.65 
##  8  2002    42.0    41.1    0.931
##  9  2003    44.3    44.8   -0.558
## 10  2004    45.2    41.3    3.88 
## # ℹ 12 more rows
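
To make the sign of `avg_lead` easier to read, each year can be labeled by the party that led on average. This is a minimal sketch, not part of the original analysis: `yearly_sample`, `leader`, and `rep_led_years` are names I made up, and the values are only the ten rows printed above rather than the full summary.

```r
library(tidyverse)

# The ten rows of yearly_summary printed above, copied by hand
yearly_sample <- tibble(
  year = 1995:2004,
  avg_lead = c(6.73, 5.68, 5.10, 2.50, 3.30, 1.98, 3.65, 0.931, -0.558, 3.88)
)

# Label each year by which party led on average
# (an exact tie of 0 would fall into the "Republicans" branch here)
labeled <- yearly_sample %>%
  mutate(leader = if_else(avg_lead > 0, "Democrats", "Republicans"))

# Keep only the years where Republicans led
rep_led_years <- labeled %>%
  filter(leader == "Republicans")

rep_led_years
```

Among these ten years, only 2003 has a negative average lead. Swapping `yearly_sample` for the full `yearly_summary` would cover the remaining rows.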

The plot below shows each party’s estimated support over the full period:

# Visualize trends
ggplot(cleaned_ballot_data, 
       aes(x = date)) +
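  geom_line(aes(y = democrat, color = "Democrats"),
            alpha = 0.7) +
  geom_line(aes(y = republican, color = "Republicans"),
            alpha = 0.7) +
  labs(title = "Who's Winning? Party Support Over Time",
       # Build the year range from the data (the first row is dated 1995, not 1996)
       subtitle = paste0("Based on FiveThirtyEight's polling data (",
                         min(cleaned_ballot_data$year), "-",
                         max(cleaned_ballot_data$year), ")"),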
       x = "Year",
       y = "Support (%)",
       color = "Party") +
  scale_color_manual(values = c("Democrats" = "blue", 
                                "Republicans" = "red")) +
  theme_minimal()
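
The high and low columns are kept in the cleaned data but not drawn in the plot above. One possible way to show them is a shaded band around each line using `geom_ribbon()`. This is a sketch under assumptions: it expects the column names from the cleaning step, and the fallback tibble only repeats the six rows printed earlier so that the block also runs on its own. `interval_plot` is a name I introduced.

```r
library(tidyverse)

# Fall back to the six rows printed earlier if the cleaned data is not in memory
if (!exists("cleaned_ballot_data")) {
  cleaned_ballot_data <- tibble(
    date = as.Date("1995-11-09") + 0:5,
    democrat = 50, republican = 44,
    dem_high = 56.3, dem_low = 43.7,
    rep_high = 50.3, rep_low = 37.7
  )
}

# Shaded bands span the low-to-high range; lines show the point estimates
interval_plot <- ggplot(cleaned_ballot_data, aes(x = date)) +
  geom_ribbon(aes(ymin = dem_low, ymax = dem_high), fill = "blue", alpha = 0.15) +
  geom_ribbon(aes(ymin = rep_low, ymax = rep_high), fill = "red", alpha = 0.15) +
  geom_line(aes(y = democrat), color = "blue") +
  geom_line(aes(y = republican), color = "red") +
  labs(x = "Date", y = "Support (%)") +
  theme_minimal()

interval_plot
```

Where the two bands overlap, the estimates are too close to call a clear leader from this data alone.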

Lessons Learned & What Is Next

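Looking at this data, I was surprised by how much the parties’ support has shifted over time. In the yearly averages, the Democratic lead shrinks from about 6.7 points in 1995 to under 1 point in 2002 and turns slightly negative (about -0.6) in 2003. There appear to be patterns around election years, and neither party holds a large advantage for very long.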

Next steps:

Compare these poll numbers with actual election results: how close were the predictions?
Find out whether the polls became more precise over time, since the gap between the high and low estimates seems tighter toward the end.
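
The second step could start from something like the sketch below, which averages the width of each party’s interval by year. It is a rough starting point, not a finished analysis: `interval_by_year`, `dem_width`, and `rep_width` are names I made up, and the fallback tibble only repeats the six rows printed earlier so that the block runs on its own. A narrower interval means the published estimate carries less stated uncertainty, which is not the same thing as being closer to the actual election result.

```r
library(tidyverse)

# Fall back to the six rows printed earlier if the cleaned data is not in memory
if (!exists("cleaned_ballot_data")) {
  cleaned_ballot_data <- tibble(
    date = as.Date("1995-11-09") + 0:5,
    year = 1995,
    dem_high = 56.3, dem_low = 43.7,
    rep_high = 50.3, rep_low = 37.7
  )
}

# Average interval width per year for each party
interval_by_year <- cleaned_ballot_data %>%
  mutate(dem_width = dem_high - dem_low,
         rep_width = rep_high - rep_low) %>%
  group_by(year) %>%
  summarize(avg_dem_width = mean(dem_width),
            avg_rep_width = mean(rep_width),
            .groups = "drop")

interval_by_year
```

If the averages fall steadily in later years of the full data, that would support the impression that the intervals tighten over time.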

DATA607 - Chapter 1 - FiveThirtyEight Generic Ballot Analysis

Sergio Belich

2025-05-01
