I was immediately drawn to the title of the “Do Voters Want Democrats or Republicans in Congress” Polls - Generic Ballot 2022 article, although I was not sure what I would find once I got into the data or whether it would prove interesting. Please note that I did not review the article before pulling in the data and working through it myself. The concept intrigued me, and I was surprised to see that what I did with the data is exactly what 538 did. The first step was to load the data to my GitHub and read it into RStudio:
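# Load readr for read_csv() and dplyr for %>%, rename(), and mutate()
library(readr)
library(dplyr)
# Define the URL for the CSV file (using the raw URL)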
data_url <- "https://raw.githubusercontent.com/tcgraham-data/data-607-week-1/main/congress-generic-ballot/generic_ballot_averages.csv"
# Load the data
generic_ballot_data <- read_csv(data_url)
## Rows: 3986 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): candidate
## dbl (4): pct_estimate, lo, hi, cycle
## date (2): date, election
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
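The message above suggests specifying the column types explicitly. A minimal sketch of what that could look like follows, assuming the column names and types reported in the specification above. The inline CSV is a stand-in built from two rows of the `head()` output shown further down, and `sample_csv`, `ballot_types`, and `sample_data` are illustrative names, not part of the original analysis.

```r
library(readr)

# Stand-in for the real file: two rows taken from the head() output,
# using the original column name pct_estimate
sample_csv <- I("candidate,pct_estimate,lo,hi,date,election,cycle
Democrats,43.9,39.3,48.6,2017-04-15,2018-11-06,2018
Republicans,39.5,34.9,44.2,2017-04-15,2018-11-06,2018
")

# Explicit column types, mirroring what read_csv() guessed above
ballot_types <- cols(
  candidate    = col_character(),
  pct_estimate = col_double(),
  lo           = col_double(),
  hi           = col_double(),
  date         = col_date(),
  election     = col_date(),
  cycle        = col_double()
)

# With col_types supplied, read_csv() no longer prints the specification message
sample_data <- read_csv(sample_csv, col_types = ballot_types)
```

The same `col_types = ballot_types` argument could be passed alongside `data_url` in the real call, assuming the file's columns match the specification printed above.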
Looking at the data, three columns immediately interested me: “pct_estimate”, “lo”, and “hi”. I double-checked and noted that “pct_estimate” sits at the midpoint between “lo” and “hi”, so I treated it as the median and renamed the column accordingly.
# Rename the 'pct_estimate' column to 'median' using dplyr
generic_ballot_data <- generic_ballot_data %>%
  rename(median = pct_estimate)
# Display the first few rows to verify the change
head(generic_ballot_data)
## # A tibble: 6 × 7
##   candidate   median    lo    hi date       election   cycle
##   <chr>        <dbl> <dbl> <dbl> <date>     <date>     <dbl>
## 1 Democrats     43.9  39.3  48.6 2017-04-15 2018-11-06  2018
## 2 Republicans   39.5  34.9  44.2 2017-04-15 2018-11-06  2018
## 3 Democrats     43.7  39.1  48.4 2017-04-16 2018-11-06  2018
## 4 Republicans   39.6  35.0  44.2 2017-04-16 2018-11-06  2018
## 5 Democrats     43.7  39.1  48.4 2017-04-17 2018-11-06  2018
## 6 Republicans   39.6  35.0  44.2 2017-04-17 2018-11-06  2018
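One way to double-check that the renamed column really sits halfway between “lo” and “hi” is sketched below. It uses only the six rows printed above (rounded as displayed), so it is a spot check rather than a test of the full file. The names `sample_rows`, `midpoint_check`, and `max_gap` are illustrative.

```r
library(dplyr)
library(tibble)

# The six rows printed by head() above, with values rounded as displayed
sample_rows <- tribble(
  ~candidate,    ~median, ~lo,  ~hi,
  "Democrats",   43.9,    39.3, 48.6,
  "Republicans", 39.5,    34.9, 44.2,
  "Democrats",   43.7,    39.1, 48.4,
  "Republicans", 39.6,    35.0, 44.2,
  "Democrats",   43.7,    39.1, 48.4,
  "Republicans", 39.6,    35.0, 44.2
)

# Compare the reported value with the arithmetic midpoint of lo and hi
midpoint_check <- sample_rows %>%
  mutate(midpoint = (lo + hi) / 2,
         gap = abs(median - midpoint))

# Largest gap across the sample; anything under 0.1 is within display rounding
max_gap <- max(midpoint_check$gap)
print(max_gap)
```

Running the same `mutate()` on the full `generic_ballot_data` would show whether the relationship holds for every row.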
Next, I wanted to plot the median line over time, with bands extending up to “hi” and down to “lo”, for both Republicans and Democrats, and overlay the two on a single chart of percentage over time.
# Ensure the date column is of Date type (read_csv() already parsed it as a date, so this is only a safeguard)
generic_ballot_data <- generic_ballot_data %>%
  mutate(date = as.Date(date))
# Load ggplot2 library
library(ggplot2)
# Create the overlay plot with ribbons for the confidence bands and median lines
ggplot(generic_ballot_data, aes(x = date, group = candidate)) +
  # Add a ribbon for each candidate that extends from lo to hi
  geom_ribbon(aes(ymin = lo, ymax = hi, fill = candidate), alpha = 0.2) +
  # Overlay a solid line for the median value
  # (linewidth replaces the deprecated size argument for lines in ggplot2 3.4.0 and later)
  geom_line(aes(y = median, color = candidate), linewidth = 1) +
  labs(title = "Generic Ballot Percentages Over Time",
       x = "Date",
       y = "Percentage") +
  theme_minimal()
The most interesting takeaway for me from this plot is how, in 2021, voters began to shift away from a “middle” ground, where the median preferences were separated and the margins of error overlapped only slightly, to a place from 2022 onward where the median lines cross repeatedly and the margin zones overlap completely. This is not an analytical report by any means, but it is very interesting to see that there appear to be no more independent voters. The middle has nearly vanished.
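One rough way to put numbers on the crossings and overlap described above is sketched here. The values in `toy` are made up for illustration and are not taken from the 538 file; only the column names match `generic_ballot_data`. The names `toy`, `wide`, `n_crossings`, and `share_overlap` are likewise illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up toy values (NOT from the 538 data), same column names as the real table
toy <- tribble(
  ~candidate,    ~median, ~lo,  ~hi,  ~date,
  "Democrats",   47.0,    45.0, 49.0, "2022-01-01",
  "Republicans", 40.0,    36.0, 44.0, "2022-01-01",
  "Democrats",   43.0,    39.0, 47.0, "2022-01-02",
  "Republicans", 44.0,    40.0, 48.0, "2022-01-02",
  "Democrats",   44.5,    40.5, 48.5, "2022-01-03",
  "Republicans", 44.0,    40.0, 48.0, "2022-01-03"
) %>%
  mutate(date = as.Date(date))

# One row per date, with columns such as median_Democrats and lo_Republicans
wide <- toy %>%
  pivot_wider(names_from = candidate, values_from = c(median, lo, hi)) %>%
  arrange(date) %>%
  mutate(
    dem_lead = median_Democrats - median_Republicans,
    # Bands overlap when each party's lo is at or below the other party's hi
    bands_overlap = lo_Democrats <= hi_Republicans &
      lo_Republicans <= hi_Democrats,
    # A crossing is a change in the sign of the lead from the previous date
    crossed = sign(dem_lead) != sign(lag(dem_lead))
  )

# Number of times the median lines cross (the first date has no predecessor)
n_crossings <- sum(wide$crossed, na.rm = TRUE)
# Share of dates on which the two lo-to-hi bands overlap
share_overlap <- mean(wide$bands_overlap)
```

Applied to the real data, grouping by `cycle` or by year before summarising would make it possible to compare the earlier period with 2022 onward. Note that a lead of exactly zero counts as a sign change in this sketch.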