I find political polling interesting, so for this project I looked into FiveThirtyEight’s polling data on the “generic ballot” question. On a generic ballot, pollsters ask people which party they would vote for in a congressional election, without naming specific candidates. FiveThirtyEight wrote about this in a 2017 article called “Here’s The Best Tool We Have For Understanding How The Midterms Are Shaping Up”; you can find it here.
What makes this dataset great is that it goes all the way back to November 1995, so we can see how voter preferences have shifted between Democrats and Republicans over time. This should give a broader understanding of voter choices than looking at one poll at one point in time.
First, load the libraries and grab the data from GitHub:
library(tidyverse)
library(lubridate)
# Pulling the data
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-generic-ballot/generic_topline_historical.csv"
generic_ballot_data <- read_csv(url)
## Rows: 7669 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): subgroup, modeldate, timestamp
## dbl (6): dem_estimate, dem_hi, dem_lo, rep_estimate, rep_hi, rep_lo
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the data
head(generic_ballot_data)
## # A tibble: 6 × 9
## subgroup modeldate dem_estimate dem_hi dem_lo rep_estimate rep_hi rep_lo
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 All polls 11/9/1995 50 56.3 43.7 44 50.3 37.7
## 2 All polls 11/10/1995 50 56.3 43.7 44 50.3 37.7
## 3 All polls 11/11/1995 50 56.3 43.7 44 50.3 37.7
## 4 All polls 11/12/1995 50 56.3 43.7 44 50.3 37.7
## 5 All polls 11/13/1995 50 56.3 43.7 44 50.3 37.7
## 6 All polls 11/14/1995 50 56.3 43.7 44 50.3 37.7
## # ℹ 1 more variable: timestamp <chr>
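The column-specification message above can be avoided by declaring the column types up front. The sketch below shows one way to do that on a two-row sample copied from the output above and written to a temporary file, so it runs without a network connection. The temporary file and the names `sample_file` and `sample_ballot` are illustrative assumptions, not part of the original analysis.

```r
library(readr)

# Two rows copied from the output above, saved to a temporary CSV file.
sample_file <- tempfile(fileext = ".csv")
writeLines(c(
  "subgroup,modeldate,dem_estimate,dem_hi,dem_lo,rep_estimate,rep_hi,rep_lo,timestamp",
  "All polls,11/9/1995,50,56.30568,43.69432,44,50.30568,37.69432,11:37:15 3 Sep 2020",
  "All polls,11/10/1995,50,56.30568,43.69432,44,50.30568,37.69432,11:37:14 3 Sep 2020"
), sample_file)

# Declare the three text columns explicitly and default the rest to double,
# matching the types readr guessed in the message above.
sample_ballot <- read_csv(
  sample_file,
  col_types = cols(
    subgroup  = col_character(),
    modeldate = col_character(),
    timestamp = col_character(),
    .default  = col_double()
  )
)

sample_ballot
```

The same `col_types` argument should work when reading from the GitHub URL. Passing `show_col_types = FALSE` instead is the lighter option the message itself suggests.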
The column names are not very intuitive. Before renaming them, let’s look at every column and its data type.
# Show each column with its data type and first few values
glimpse(generic_ballot_data)
## Rows: 7,669
## Columns: 9
## $ subgroup <chr> "All polls", "All polls", "All polls", "All polls", "All …
## $ modeldate <chr> "11/9/1995", "11/10/1995", "11/11/1995", "11/12/1995", "1…
## $ dem_estimate <dbl> 50.00000, 50.00000, 50.00000, 50.00000, 50.00000, 50.0000…
## $ dem_hi <dbl> 56.30568, 56.30568, 56.30568, 56.30568, 56.30568, 56.3056…
## $ dem_lo <dbl> 43.69432, 43.69432, 43.69432, 43.69432, 43.69432, 43.6943…
## $ rep_estimate <dbl> 44.00000, 44.00000, 44.00000, 44.00000, 44.00000, 44.0000…
## $ rep_hi <dbl> 50.30568, 50.30568, 50.30568, 50.30568, 50.30568, 50.3056…
## $ rep_lo <dbl> 37.69432, 37.69432, 37.69432, 37.69432, 37.69432, 37.6943…
## $ timestamp <chr> "11:37:15 3 Sep 2020", "11:37:14 3 Sep 2020", "11:37:14…
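The preview shows only “All polls” in the `subgroup` column, but a preview cannot confirm that for the whole file. A minimal sketch of two checks that might be worth running before averaging anything: how many rows each subgroup contributes, and whether any date appears more than once. It uses a three-row stand-in (`sample_ballot`, a hypothetical name) with values copied from the output above; the same calls could be pointed at `generic_ballot_data`.

```r
library(dplyr)

# Small stand-in with the same key columns as generic_ballot_data
# (values copied from the first rows shown above).
sample_ballot <- tibble(
  subgroup  = rep("All polls", 3),
  modeldate = c("11/9/1995", "11/10/1995", "11/11/1995")
)

# How many rows does each subgroup contribute?
subgroup_counts <- count(sample_ballot, subgroup)

# Any date that appears more than once would be counted twice in an average.
repeated_dates <- sample_ballot %>%
  count(modeldate) %>%
  filter(n > 1)

subgroup_counts
repeated_dates
```

An empty `repeated_dates` table would mean there is one row per day, which is what the yearly averages later on implicitly assume.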
Time to clean up: the dates are stored as text rather than as dates, the column names could be clearer, and we can calculate the lead one party has over the other.
# Making the data easier to work with
cleaned_ballot_data <- generic_ballot_data %>%
  # Convert the month/day/year text into proper dates
  mutate(poll_date = as.Date(modeldate, format = "%m/%d/%Y"),
         # Democratic lead in points (negative when Democrats trail)
         dem_lead = dem_estimate - rep_estimate,
         # Width of the interval around each estimate
         # (computed here, but not kept by select() below)
         dem_uncertainty = dem_hi - dem_lo,
         rep_uncertainty = rep_hi - rep_lo,
         # Use year for easier grouping
         year = year(poll_date)) %>%
  # Keep what we need, with better names
  select(date = poll_date,
         year,
         democrat = dem_estimate,
         republican = rep_estimate,
         dem_lead,
         dem_high = dem_hi,
         dem_low = dem_lo,
         rep_high = rep_hi,
         rep_low = rep_lo)
# Cleaned version
head(cleaned_ballot_data)
## # A tibble: 6 × 9
## date year democrat republican dem_lead dem_high dem_low rep_high
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1995-11-09 1995 50 44 6 56.3 43.7 50.3
## 2 1995-11-10 1995 50 44 6 56.3 43.7 50.3
## 3 1995-11-11 1995 50 44 6 56.3 43.7 50.3
## 4 1995-11-12 1995 50 44 6 56.3 43.7 50.3
## 5 1995-11-13 1995 50 44 6 56.3 43.7 50.3
## 6 1995-11-14 1995 50 44 6 56.3 43.7 50.3
## # ℹ 1 more variable: rep_low <dbl>
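To make the two key transformations above concrete, here is a small base R sketch. It shows how `as.Date()` handles the unpadded month/day/year strings, what happens to a string that does not match the format, and how the lead in the first row is computed. The string "not a date" and the names `raw_dates` and `parsed` are made up for illustration.

```r
# Dates in the file are month/day/year strings without zero padding.
raw_dates <- c("11/9/1995", "11/10/1995", "not a date")

parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
parsed

# Strings that do not match the format become NA rather than raising an error,
# so counting NAs is a cheap check after any conversion.
sum(is.na(parsed))

# dem_lead is a plain difference: positive means Democrats are ahead.
dem_lead <- 50 - 44
dem_lead
```

Running `sum(is.na(cleaned_ballot_data$date))` after the cleaning step would be the equivalent check on the real data.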
Now I want to see how these numbers change year by year:
# Yearly averages
yearly_summary <- cleaned_ballot_data %>%
  group_by(year) %>%
  summarize(avg_dem = mean(democrat),
            avg_rep = mean(republican),
            avg_lead = mean(dem_lead),
            .groups = "drop")
# Display averages
yearly_summary
## # A tibble: 22 × 4
## year avg_dem avg_rep avg_lead
## <dbl> <dbl> <dbl> <dbl>
## 1 1995 47.5 40.8 6.73
## 2 1996 47.2 41.6 5.68
## 3 1997 45.9 40.8 5.10
## 4 1998 42.4 39.9 2.50
## 5 1999 43.1 39.8 3.30
## 6 2000 42.8 40.8 1.98
## 7 2001 44.8 41.2 3.65
## 8 2002 42.0 41.1 0.931
## 9 2003 44.3 44.8 -0.558
## 10 2004 45.2 41.3 3.88
## # ℹ 12 more rows
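One caveat with yearly averages is that every daily row counts equally, and the data shown earlier starts on 1995-11-09, so the 1995 figure covers only part of a year. A minimal sketch of how coverage could be reported next to the average, using a toy table (`toy_ballot`, a hypothetical name). The 1995 leads match the output above; the 1996 values are made up.

```r
library(dplyr)

# Toy stand-in for cleaned_ballot_data: two days in 1995, three in 1996.
# The 1995 values come from the output above; the 1996 values are made up.
toy_ballot <- tibble(
  date     = as.Date(c("1995-11-09", "1995-11-10",
                       "1996-01-01", "1996-01-02", "1996-01-03")),
  year     = c(1995, 1995, 1996, 1996, 1996),
  dem_lead = c(6, 6, 5, 4, 6)
)

# Report how many days each average is based on, and which dates it spans.
coverage <- toy_ballot %>%
  group_by(year) %>%
  summarize(n_days     = n(),
            first_date = min(date),
            last_date  = max(date),
            avg_lead   = mean(dem_lead),
            .groups = "drop")

coverage
```

Adding `n_days` to `yearly_summary` in the same way would show at a glance which years are only partially covered.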
Here is a graph of each party’s support over time:
# Visualize trends
ggplot(cleaned_ballot_data,
       aes(x = date)) +
  geom_line(aes(y = democrat,
                color = "Democrats"),
            alpha = 0.7) +
  geom_line(aes(y = republican,
                color = "Republicans"),
            alpha = 0.7) +
  labs(title = "Who's Winning? Party Support Over Time",
       subtitle = "Based on FiveThirtyEight's polling data (1996-2020)",
       x = "Year",
       y = "Support (%)",
       color = "Party") +
  scale_color_manual(values = c("Democrats" = "blue",
                                "Republicans" = "red")) +
  theme_minimal()
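The cleaned data also keeps the high and low bounds for each party, which the plot above does not use. One possible extension, sketched here on made-up toy values, is to draw those bounds as shaded bands with `geom_ribbon()`. The names `toy_ballot` and `band_plot` are illustrative only; swapping in `cleaned_ballot_data` should give the full picture.

```r
library(ggplot2)

# Toy stand-in for cleaned_ballot_data; the values are made up for illustration.
toy_ballot <- data.frame(
  date       = as.Date("2000-01-01") + 0:3,
  democrat   = c(45, 46, 44, 45),
  dem_low    = c(41, 42, 40, 41),
  dem_high   = c(49, 50, 48, 49),
  republican = c(42, 41, 43, 42),
  rep_low    = c(38, 37, 39, 38),
  rep_high   = c(46, 45, 47, 46)
)

band_plot <- ggplot(toy_ballot, aes(x = date)) +
  # Shaded bands span the low and high bounds for each party
  geom_ribbon(aes(ymin = dem_low, ymax = dem_high), fill = "blue", alpha = 0.15) +
  geom_ribbon(aes(ymin = rep_low, ymax = rep_high), fill = "red", alpha = 0.15) +
  # Lines show the point estimates, as in the plot above
  geom_line(aes(y = democrat, color = "Democrats")) +
  geom_line(aes(y = republican, color = "Republicans")) +
  scale_color_manual(values = c("Democrats" = "blue", "Republicans" = "red")) +
  labs(x = "Year", y = "Support (%)", color = "Party") +
  theme_minimal()

band_plot
```

Where the two bands overlap, the gap between the parties is smaller than the stated uncertainty, which is useful context when reading the lines alone.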
Looking at this data, I was surprised by how much the parties’ support has shifted over time. There are clear patterns around election years, and neither party holds a large advantage for very long.
Next steps:
Compare these poll numbers with actual election results: how close were the polls?
Find out whether the polls became more precise over time, since the spread between the high and low estimates seems tighter at the end.
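As a starting point for the second item, the width of the interval around each estimate can be averaged by year. This is only a sketch on made-up toy values, and it assumes that a narrower gap between the high and low bounds is a reasonable proxy for polls "getting better". The names `toy_ballot`, `dem_width` and `interval_by_year` are hypothetical.

```r
library(dplyr)

# Toy stand-in for cleaned_ballot_data; the values are made up for illustration.
toy_ballot <- tibble(
  year     = c(1996, 1996, 2016, 2016),
  dem_high = c(56, 54, 47, 48),
  dem_low  = c(44, 44, 43, 44)
)

# Average width of the Democratic interval in each year
interval_by_year <- toy_ballot %>%
  mutate(dem_width = dem_high - dem_low) %>%
  group_by(year) %>%
  summarize(avg_dem_width = mean(dem_width), .groups = "drop")

interval_by_year
```

On the real data, the same pipeline applied to `cleaned_ballot_data` (and repeated for `rep_high` and `rep_low`) would show whether the intervals actually narrow over time.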