This analysis uses polling data from FiveThirtyEight covering U.S. Senate races across many states. The copy of the dataset used here has 294 rows and 44 columns; each row records one candidate's result for one poll question, along with details such as the polling firm, state, poll start and end dates, and the candidate's percentage of support.
The primary objective of this analysis is to clean and transform this extensive dataset to focus on key variables such as the polling firm, state, candidate names, and their respective support percentages. We will identify and visualize the top candidates based on their average polling percentages to gain insights into the current standings in various Senate races.
The data come from FiveThirtyEight’s Senate polls dataset; this analysis reads a copy of the CSV file hosted on GitHub.
We will start by loading the necessary libraries and the dataset.
# Load necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(ggplot2)
# Load the data from a CSV file
# Replace `url` with the path to your github CSV file
url <- "https://raw.githubusercontent.com/simonchy/DATA607/refs/heads/main/week%201/senate_polls.csv"
data <- read_csv(url)
## Rows: 294 Columns: 44
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): pollster, sponsors, display_name, pollster_rating_name, fte_grade,...
## dbl (12): poll_id, pollster_id, pollster_rating_id, transparency_score, spon...
## num (1): sponsor_ids
## lgl (7): subpopulation, tracking, source, internal, nationwide_batch, ranke...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
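The message above suggests declaring column types explicitly. A minimal sketch of that approach follows, assuming only the six columns used later in this analysis are needed. The inline CSV text is a stand-in built from the first rows of the dataset so the example runs offline, and `polls` is a hypothetical name. Whether the date format handles every value in the full file is worth checking before relying on it.

```r
library(readr)

# Inline stand-in for the remote file: same column names and values as the
# first rows of the dataset, wrapped in I() so readr treats it as literal data.
csv_text <- I("pollster,state,start_date,end_date,candidate_name,pct
Change Research,Nebraska,11/13/23,11/16/23,Deb Fischer,38
Change Research,Nebraska,11/13/23,11/16/23,Dan Osborn,40
EPIC/MRA,Michigan,11/10/23,11/16/23,Elissa Slotkin,39
")

# cols_only() reads just the listed columns and parses the dates at load time,
# which also silences the column specification message.
polls <- read_csv(
  csv_text,
  col_types = cols_only(
    pollster = col_character(),
    state = col_character(),
    start_date = col_date(format = "%m/%d/%y"),
    end_date = col_date(format = "%m/%d/%y"),
    candidate_name = col_character(),
    pct = col_double()
  )
)

polls
```

With this approach the later `as.Date()` conversion step would no longer be needed, since the date columns arrive already typed.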
Let’s take a look at the structure of the data and preview the first few rows.
# View the structure of the dataset
str(data)
## spc_tbl_ [294 × 44] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ poll_id : num [1:294] 84873 84873 84709 84709 84709 ...
## $ pollster_id : num [1:294] 1365 1365 143 143 143 ...
## $ pollster : chr [1:294] "Change Research" "Change Research" "EPIC/MRA" "EPIC/MRA" ...
## $ sponsor_ids : num [1:294] 2059 2059 NA NA NA ...
## $ sponsors : chr [1:294] "Nebraska Railroaders for Public Safety" "Nebraska Railroaders for Public Safety" NA NA ...
## $ display_name : chr [1:294] "Change Research" "Change Research" "EPIC-MRA" "EPIC-MRA" ...
## $ pollster_rating_id : num [1:294] 48 48 84 84 84 84 263 263 88 88 ...
## $ pollster_rating_name : chr [1:294] "Change Research" "Change Research" "EPIC-MRA" "EPIC-MRA" ...
## $ fte_grade : chr [1:294] "B-" "B-" "B+" "B+" ...
## $ methodology : chr [1:294] "Text-to-Web/Online Ad" "Text-to-Web/Online Ad" "Live Phone" "Live Phone" ...
## $ transparency_score : num [1:294] 5 5 3 3 3 3 4 4 7 7 ...
## $ state : chr [1:294] "Nebraska" "Nebraska" "Michigan" "Michigan" ...
## $ start_date : chr [1:294] "11/13/23" "11/13/23" "11/10/23" "11/10/23" ...
## $ end_date : chr [1:294] "11/16/23" "11/16/23" "11/16/23" "11/16/23" ...
## $ sponsor_candidate_id : num [1:294] NA NA NA NA NA NA NA NA NA NA ...
## $ sponsor_candidate : chr [1:294] NA NA NA NA ...
## $ sponsor_candidate_party : chr [1:294] NA NA NA NA ...
## $ question_id : num [1:294] 187981 187981 187421 187421 187422 ...
## $ sample_size : num [1:294] 1048 1048 600 600 600 ...
## $ population : chr [1:294] "lv" "lv" "lv" "lv" ...
## $ subpopulation : logi [1:294] NA NA NA NA NA NA ...
## $ population_full : chr [1:294] "lv" "lv" "lv" "lv" ...
## $ tracking : logi [1:294] NA NA NA NA NA NA ...
## $ created_at : chr [1:294] "12/6/23 10:12" "12/6/23 10:12" "11/18/23 10:36" "11/18/23 10:36" ...
## $ notes : chr [1:294] NA NA NA NA ...
## $ url : chr [1:294] "https://theintercept.com/2023/12/04/nebraska-senate-dan-osborn-deb-fisher/" "https://theintercept.com/2023/12/04/nebraska-senate-dan-osborn-deb-fisher/" "https://www.freep.com/story/news/politics/elections/2023/11/18/election-2024-biden-trump-poll-michigan/71619518007/" "https://www.freep.com/story/news/politics/elections/2023/11/18/election-2024-biden-trump-poll-michigan/71619518007/" ...
## $ source : logi [1:294] NA NA NA NA NA NA ...
## $ internal : logi [1:294] NA NA NA NA NA NA ...
## $ partisan : chr [1:294] "IND" "IND" NA NA ...
## $ race_id : num [1:294] 9520 9520 9515 9515 9515 ...
## $ cycle : num [1:294] 2024 2024 2024 2024 2024 ...
## $ office_type : chr [1:294] "U.S. Senate" "U.S. Senate" "U.S. Senate" "U.S. Senate" ...
## $ seat_number : num [1:294] 0 0 0 0 0 0 0 0 0 0 ...
## $ seat_name : chr [1:294] "Class I" "Class I" "Class I" "Class I" ...
## $ election_date : chr [1:294] "11/5/24" "11/5/24" "11/5/24" "11/5/24" ...
## $ stage : chr [1:294] "general" "general" "general" "general" ...
## $ nationwide_batch : logi [1:294] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ ranked_choice_reallocated: logi [1:294] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ ranked_choice_round : logi [1:294] NA NA NA NA NA NA ...
## $ party : chr [1:294] "REP" "IND" "DEM" "REP" ...
## $ answer : chr [1:294] "Fischer" "Osborn" "Slotkin" "Rogers" ...
## $ candidate_id : num [1:294] 31125 31126 31025 31045 31025 ...
## $ candidate_name : chr [1:294] "Deb Fischer" "Dan Osborn" "Elissa Slotkin" "Mike Rogers" ...
## $ pct : num [1:294] 38 40 39 37 40 38 51 38 40.5 37.7 ...
## - attr(*, "spec")=
## .. cols(
## .. poll_id = col_double(),
## .. pollster_id = col_double(),
## .. pollster = col_character(),
## .. sponsor_ids = col_number(),
## .. sponsors = col_character(),
## .. display_name = col_character(),
## .. pollster_rating_id = col_double(),
## .. pollster_rating_name = col_character(),
## .. fte_grade = col_character(),
## .. methodology = col_character(),
## .. transparency_score = col_double(),
## .. state = col_character(),
## .. start_date = col_character(),
## .. end_date = col_character(),
## .. sponsor_candidate_id = col_double(),
## .. sponsor_candidate = col_character(),
## .. sponsor_candidate_party = col_character(),
## .. question_id = col_double(),
## .. sample_size = col_double(),
## .. population = col_character(),
## .. subpopulation = col_logical(),
## .. population_full = col_character(),
## .. tracking = col_logical(),
## .. created_at = col_character(),
## .. notes = col_character(),
## .. url = col_character(),
## .. source = col_logical(),
## .. internal = col_logical(),
## .. partisan = col_character(),
## .. race_id = col_double(),
## .. cycle = col_double(),
## .. office_type = col_character(),
## .. seat_number = col_double(),
## .. seat_name = col_character(),
## .. election_date = col_character(),
## .. stage = col_character(),
## .. nationwide_batch = col_logical(),
## .. ranked_choice_reallocated = col_logical(),
## .. ranked_choice_round = col_logical(),
## .. party = col_character(),
## .. answer = col_character(),
## .. candidate_id = col_double(),
## .. candidate_name = col_character(),
## .. pct = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Preview the first few rows
head(data)
## # A tibble: 6 × 44
## poll_id pollster_id pollster sponsor_ids sponsors display_name
## <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 84873 1365 Change Research 2059 Nebraska Railroa… Change Rese…
## 2 84873 1365 Change Research 2059 Nebraska Railroa… Change Rese…
## 3 84709 143 EPIC/MRA NA <NA> EPIC-MRA
## 4 84709 143 EPIC/MRA NA <NA> EPIC-MRA
## 5 84709 143 EPIC/MRA NA <NA> EPIC-MRA
## 6 84709 143 EPIC/MRA NA <NA> EPIC-MRA
## # ℹ 38 more variables: pollster_rating_id <dbl>, pollster_rating_name <chr>,
## # fte_grade <chr>, methodology <chr>, transparency_score <dbl>, state <chr>,
## # start_date <chr>, end_date <chr>, sponsor_candidate_id <dbl>,
## # sponsor_candidate <chr>, sponsor_candidate_party <chr>, question_id <dbl>,
## # sample_size <dbl>, population <chr>, subpopulation <lgl>,
## # population_full <chr>, tracking <lgl>, created_at <chr>, notes <chr>,
## # url <chr>, source <lgl>, internal <lgl>, partisan <chr>, race_id <dbl>, …
We’ll now select relevant columns, rename them for clarity, and convert date columns to the appropriate format.
# Select relevant columns and rename them
cleaned_data <- data %>%
  select(
    Pollster = pollster,
    State = state,
    StartDate = start_date,
    EndDate = end_date,
    CandidateName = candidate_name,
    Percentage = pct
  )
# Convert StartDate and EndDate to Date format
cleaned_data <- cleaned_data %>%
  mutate(
    StartDate = as.Date(StartDate, format = "%m/%d/%y"),
    EndDate = as.Date(EndDate, format = "%m/%d/%y")
  )
# Preview the cleaned data
head(cleaned_data)
## # A tibble: 6 × 6
## Pollster State StartDate EndDate CandidateName Percentage
## <chr> <chr> <date> <date> <chr> <dbl>
## 1 Change Research Nebraska 2023-11-13 2023-11-16 Deb Fischer 38
## 2 Change Research Nebraska 2023-11-13 2023-11-16 Dan Osborn 40
## 3 EPIC/MRA Michigan 2023-11-10 2023-11-16 Elissa Slotkin 39
## 4 EPIC/MRA Michigan 2023-11-10 2023-11-16 Mike Rogers 37
## 5 EPIC/MRA Michigan 2023-11-10 2023-11-16 Elissa Slotkin 40
## 6 EPIC/MRA Michigan 2023-11-10 2023-11-16 James Craig 38
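Because `as.Date()` returns `NA` without raising an error when a string does not match the format, it can be worth confirming that no dates were lost in the conversion. The sketch below illustrates the check on a small hypothetical vector in which the last value is deliberately in the wrong format; `raw_dates`, `parsed`, and `failed` are illustrative names.

```r
# Hypothetical input: two dates in the dataset's format, one that is not.
raw_dates <- c("11/13/23", "11/16/23", "Nov 10 2023")

# Same conversion as in the cleaning step above.
parsed <- as.Date(raw_dates, format = "%m/%d/%y")

# Values that were present in the input but became NA after parsing.
failed <- raw_dates[is.na(parsed) & !is.na(raw_dates)]

failed
```

Applied to the real data, the same comparison could be run on the original `start_date` and `end_date` columns against `StartDate` and `EndDate`.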
Let’s summarize the data to see each candidate’s average polling percentage and the number of poll results that average is based on.
# Summarize the data by candidate
summary_data <- cleaned_data %>%
  group_by(CandidateName) %>%
  summarise(
    AvgPercentage = mean(Percentage, na.rm = TRUE),
    Count = n()
  )
# Display the summary data
summary_data
## # A tibble: 69 × 3
## CandidateName AvgPercentage Count
## <chr> <dbl> <int>
## 1 Adam B. Schiff 26 1
## 2 Adam Paul Laxalt 41.9 1
## 3 Alexander X. Mooney 36.5 7
## 4 Andy Kim 46 1
## 5 Bernie Moreno 35.2 7
## 6 Blake Masters 29.4 5
## 7 Brian Wright 31.6 2
## 8 Charles D. Baker 48.9 1
## 9 Chris Christie 23.5 2
## 10 Colin Allred 38.5 2
## # ℹ 59 more rows
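Several candidates in the summary have a `Count` of 1, so their average is a single poll result. One way to keep such entries from dominating a ranking is to require a minimum number of polls before comparing averages. The sketch below uses the six rows shown in the cleaned data preview; `min_polls` is a hypothetical threshold, not a value taken from the analysis.

```r
library(dplyr)

# The six rows from the cleaned data preview (candidate and percentage only).
cleaned_sample <- tibble(
  CandidateName = c("Deb Fischer", "Dan Osborn", "Elissa Slotkin",
                    "Mike Rogers", "Elissa Slotkin", "James Craig"),
  Percentage = c(38, 40, 39, 37, 40, 38)
)

# Hypothetical cutoff: keep only candidates with at least this many results.
min_polls <- 2

reliable_summary <- cleaned_sample %>%
  group_by(CandidateName) %>%
  summarise(
    AvgPercentage = mean(Percentage, na.rm = TRUE),
    Count = n(),
    .groups = "drop"
  ) %>%
  filter(Count >= min_polls) %>%
  arrange(desc(AvgPercentage))

reliable_summary
```

On the full dataset a higher threshold may be more appropriate; the right value depends on how many polls each race has.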
We can visualize the top five candidates by average polling percentage using a bar chart.
# Sort the data by average percentage in descending order
top_five_candidates <- summary_data %>%
  arrange(desc(AvgPercentage)) %>%
  head(5)
# Plot only the top five candidates
ggplot(top_five_candidates, aes(x = reorder(CandidateName, -AvgPercentage), y = AvgPercentage, fill = CandidateName)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Top 5 Candidates by Average Polling Percentage", x = "Candidate", y = "Average Percentage") +
  theme(legend.position = "none")
The analysis shows that Maria Cantwell and Charles D. Baker have nearly identical average support across the polls in which they appear, followed by Tammy Baldwin, Robert P. Casey Jr., and James Justice with slightly lower averages. These figures should be read with care: Baker's average of 48.9 rests on a single poll result (Count = 1 in the summary table), and an average over one or two results is far less stable than one over many.
Because each Senate race is confined to one state, this ranking compares candidates across separate races rather than describing a contest among them; it indicates which candidates poll most strongly, not who leads a shared race. Future analyses could focus on tracking changes in support over time or comparing these results with other polling sources to assess consistency and identify any emerging trends.
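As a rough illustration of tracking support over time, the sketch below averages a candidate's polls by the month in which each poll ended. The candidate name and all values are invented for illustration and do not come from the dataset; the column names follow the cleaned data above.

```r
library(dplyr)

# Hypothetical polls for one placeholder candidate (values are made up).
polls_sample <- tibble(
  CandidateName = "Candidate A",
  EndDate = as.Date(c("2023-09-12", "2023-09-28", "2023-10-15", "2023-11-16")),
  Percentage = c(40, 42, 44, 45)
)

# Bucket each poll by the year and month of its end date, then average.
monthly_trend <- polls_sample %>%
  mutate(Month = format(EndDate, "%Y-%m")) %>%
  group_by(CandidateName, Month) %>%
  summarise(
    AvgPercentage = mean(Percentage),
    Polls = n(),
    .groups = "drop"
  ) %>%
  arrange(CandidateName, Month)

monthly_trend
```

Keeping the `Polls` count next to each monthly average makes it easier to see which points on a trend line rest on very little data.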
Additionally, expanding the analysis to include other factors such as polling methodology, sample size, and geographic distribution could provide a more comprehensive understanding of the dynamics in these Senate races.
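One simple way to bring sample size into the picture is to weight each poll by its number of respondents when averaging. The sketch below is a minimal illustration with invented values and placeholder candidate names. It assumes the `sample_size` column from the original data has been kept under the hypothetical name `SampleSize`, which the cleaning step above does not currently do.

```r
library(dplyr)

# Hypothetical polls (values are made up for illustration).
polls_sample <- tibble(
  CandidateName = c("Candidate A", "Candidate A", "Candidate B"),
  Percentage = c(40, 50, 45),
  SampleSize = c(1000, 500, 800)
)

# Compare the plain mean with a mean weighted by respondents per poll.
weighted_summary <- polls_sample %>%
  group_by(CandidateName) %>%
  summarise(
    AvgPercentage = mean(Percentage),
    WeightedAvg = weighted.mean(Percentage, w = SampleSize),
    .groups = "drop"
  )

weighted_summary
```

For Candidate A the larger poll pulls the weighted average toward 40, below the unweighted mean of 45. Weighting by sample size alone ignores pollster quality and recency, so this is only a first step.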