Introduction

This analysis uses polling data from FiveThirtyEight covering U.S. Senate races across multiple states. The dataset read here has 294 rows and 44 columns, with each row recording one candidate’s result in one poll question. It includes the polling firm, state, start and end dates of each poll, and the percentage of respondents supporting each candidate.

The primary objective of this analysis is to clean and transform this extensive dataset to focus on key variables such as the polling firm, state, candidate names, and their respective support percentages. We will identify and visualize the top candidates based on their average polling percentages to gain insights into the current standings in various Senate races.

The data for this analysis come from FiveThirtyEight’s Senate polls dataset.

Load Libraries and Data

We will start by loading the necessary libraries and the dataset.

# Load necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(ggplot2)

# Load the data from a CSV file
# Replace `url` with the raw URL or local path of your own copy of the CSV file
url <- "https://raw.githubusercontent.com/simonchy/DATA607/refs/heads/main/week%201/senate_polls.csv"
data <- read_csv(url)
## Rows: 294 Columns: 44
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): pollster, sponsors, display_name, pollster_rating_name, fte_grade,...
## dbl (12): poll_id, pollster_id, pollster_rating_id, transparency_score, spon...
## num  (1): sponsor_ids
## lgl  (7): subpopulation, tracking, source, internal, nationwide_batch, ranke...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
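
The message above suggests specifying column types. A minimal sketch of that, assuming readr 2.0 or later (for literal data via `I()`), is shown here on a small made-up CSV rather than the real file. The rows, pollster names, and candidate names are hypothetical; only the column names and the date format come from the dataset.

```r
library(readr)

# Toy stand-in for the real file; rows and values are illustrative only
csv_text <- I("pollster,state,start_date,end_date,candidate_name,pct
Pollster A,Nebraska,11/13/23,11/16/23,Candidate X,38
Pollster A,Nebraska,11/13/23,11/16/23,Candidate Y,40
")

# Declaring types up front silences the column-specification message
# and parses the date columns while reading
polls <- read_csv(
  csv_text,
  col_types = cols(
    start_date = col_date(format = "%m/%d/%y"),
    end_date   = col_date(format = "%m/%d/%y"),
    pct        = col_double(),
    .default   = col_character()
  )
)
```

With the dates parsed at read time, the later `as.Date()` conversion step would no longer be needed. Whether that is preferable depends on how much of the raw file you want to inspect first.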

Data Exploration

Let’s take a look at the structure of the data and preview the first few rows.

# View the structure of the dataset
str(data)
## spc_tbl_ [294 × 44] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ poll_id                  : num [1:294] 84873 84873 84709 84709 84709 ...
##  $ pollster_id              : num [1:294] 1365 1365 143 143 143 ...
##  $ pollster                 : chr [1:294] "Change Research" "Change Research" "EPIC/MRA" "EPIC/MRA" ...
##  $ sponsor_ids              : num [1:294] 2059 2059 NA NA NA ...
##  $ sponsors                 : chr [1:294] "Nebraska Railroaders for Public Safety" "Nebraska Railroaders for Public Safety" NA NA ...
##  $ display_name             : chr [1:294] "Change Research" "Change Research" "EPIC-MRA" "EPIC-MRA" ...
##  $ pollster_rating_id       : num [1:294] 48 48 84 84 84 84 263 263 88 88 ...
##  $ pollster_rating_name     : chr [1:294] "Change Research" "Change Research" "EPIC-MRA" "EPIC-MRA" ...
##  $ fte_grade                : chr [1:294] "B-" "B-" "B+" "B+" ...
##  $ methodology              : chr [1:294] "Text-to-Web/Online Ad" "Text-to-Web/Online Ad" "Live Phone" "Live Phone" ...
##  $ transparency_score       : num [1:294] 5 5 3 3 3 3 4 4 7 7 ...
##  $ state                    : chr [1:294] "Nebraska" "Nebraska" "Michigan" "Michigan" ...
##  $ start_date               : chr [1:294] "11/13/23" "11/13/23" "11/10/23" "11/10/23" ...
##  $ end_date                 : chr [1:294] "11/16/23" "11/16/23" "11/16/23" "11/16/23" ...
##  $ sponsor_candidate_id     : num [1:294] NA NA NA NA NA NA NA NA NA NA ...
##  $ sponsor_candidate        : chr [1:294] NA NA NA NA ...
##  $ sponsor_candidate_party  : chr [1:294] NA NA NA NA ...
##  $ question_id              : num [1:294] 187981 187981 187421 187421 187422 ...
##  $ sample_size              : num [1:294] 1048 1048 600 600 600 ...
##  $ population               : chr [1:294] "lv" "lv" "lv" "lv" ...
##  $ subpopulation            : logi [1:294] NA NA NA NA NA NA ...
##  $ population_full          : chr [1:294] "lv" "lv" "lv" "lv" ...
##  $ tracking                 : logi [1:294] NA NA NA NA NA NA ...
##  $ created_at               : chr [1:294] "12/6/23 10:12" "12/6/23 10:12" "11/18/23 10:36" "11/18/23 10:36" ...
##  $ notes                    : chr [1:294] NA NA NA NA ...
##  $ url                      : chr [1:294] "https://theintercept.com/2023/12/04/nebraska-senate-dan-osborn-deb-fisher/" "https://theintercept.com/2023/12/04/nebraska-senate-dan-osborn-deb-fisher/" "https://www.freep.com/story/news/politics/elections/2023/11/18/election-2024-biden-trump-poll-michigan/71619518007/" "https://www.freep.com/story/news/politics/elections/2023/11/18/election-2024-biden-trump-poll-michigan/71619518007/" ...
##  $ source                   : logi [1:294] NA NA NA NA NA NA ...
##  $ internal                 : logi [1:294] NA NA NA NA NA NA ...
##  $ partisan                 : chr [1:294] "IND" "IND" NA NA ...
##  $ race_id                  : num [1:294] 9520 9520 9515 9515 9515 ...
##  $ cycle                    : num [1:294] 2024 2024 2024 2024 2024 ...
##  $ office_type              : chr [1:294] "U.S. Senate" "U.S. Senate" "U.S. Senate" "U.S. Senate" ...
##  $ seat_number              : num [1:294] 0 0 0 0 0 0 0 0 0 0 ...
##  $ seat_name                : chr [1:294] "Class I" "Class I" "Class I" "Class I" ...
##  $ election_date            : chr [1:294] "11/5/24" "11/5/24" "11/5/24" "11/5/24" ...
##  $ stage                    : chr [1:294] "general" "general" "general" "general" ...
##  $ nationwide_batch         : logi [1:294] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ ranked_choice_reallocated: logi [1:294] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ ranked_choice_round      : logi [1:294] NA NA NA NA NA NA ...
##  $ party                    : chr [1:294] "REP" "IND" "DEM" "REP" ...
##  $ answer                   : chr [1:294] "Fischer" "Osborn" "Slotkin" "Rogers" ...
##  $ candidate_id             : num [1:294] 31125 31126 31025 31045 31025 ...
##  $ candidate_name           : chr [1:294] "Deb Fischer" "Dan Osborn" "Elissa Slotkin" "Mike Rogers" ...
##  $ pct                      : num [1:294] 38 40 39 37 40 38 51 38 40.5 37.7 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   poll_id = col_double(),
##   ..   pollster_id = col_double(),
##   ..   pollster = col_character(),
##   ..   sponsor_ids = col_number(),
##   ..   sponsors = col_character(),
##   ..   display_name = col_character(),
##   ..   pollster_rating_id = col_double(),
##   ..   pollster_rating_name = col_character(),
##   ..   fte_grade = col_character(),
##   ..   methodology = col_character(),
##   ..   transparency_score = col_double(),
##   ..   state = col_character(),
##   ..   start_date = col_character(),
##   ..   end_date = col_character(),
##   ..   sponsor_candidate_id = col_double(),
##   ..   sponsor_candidate = col_character(),
##   ..   sponsor_candidate_party = col_character(),
##   ..   question_id = col_double(),
##   ..   sample_size = col_double(),
##   ..   population = col_character(),
##   ..   subpopulation = col_logical(),
##   ..   population_full = col_character(),
##   ..   tracking = col_logical(),
##   ..   created_at = col_character(),
##   ..   notes = col_character(),
##   ..   url = col_character(),
##   ..   source = col_logical(),
##   ..   internal = col_logical(),
##   ..   partisan = col_character(),
##   ..   race_id = col_double(),
##   ..   cycle = col_double(),
##   ..   office_type = col_character(),
##   ..   seat_number = col_double(),
##   ..   seat_name = col_character(),
##   ..   election_date = col_character(),
##   ..   stage = col_character(),
##   ..   nationwide_batch = col_logical(),
##   ..   ranked_choice_reallocated = col_logical(),
##   ..   ranked_choice_round = col_logical(),
##   ..   party = col_character(),
##   ..   answer = col_character(),
##   ..   candidate_id = col_double(),
##   ..   candidate_name = col_character(),
##   ..   pct = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Preview the first few rows
head(data)
## # A tibble: 6 × 44
##   poll_id pollster_id pollster        sponsor_ids sponsors          display_name
##     <dbl>       <dbl> <chr>                 <dbl> <chr>             <chr>       
## 1   84873        1365 Change Research        2059 Nebraska Railroa… Change Rese…
## 2   84873        1365 Change Research        2059 Nebraska Railroa… Change Rese…
## 3   84709         143 EPIC/MRA                 NA <NA>              EPIC-MRA    
## 4   84709         143 EPIC/MRA                 NA <NA>              EPIC-MRA    
## 5   84709         143 EPIC/MRA                 NA <NA>              EPIC-MRA    
## 6   84709         143 EPIC/MRA                 NA <NA>              EPIC-MRA    
## # ℹ 38 more variables: pollster_rating_id <dbl>, pollster_rating_name <chr>,
## #   fte_grade <chr>, methodology <chr>, transparency_score <dbl>, state <chr>,
## #   start_date <chr>, end_date <chr>, sponsor_candidate_id <dbl>,
## #   sponsor_candidate <chr>, sponsor_candidate_party <chr>, question_id <dbl>,
## #   sample_size <dbl>, population <chr>, subpopulation <lgl>,
## #   population_full <chr>, tracking <lgl>, created_at <chr>, notes <chr>,
## #   url <chr>, source <lgl>, internal <lgl>, partisan <chr>, race_id <dbl>, …
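
Beyond `str()` and `head()`, two quick checks are often useful at this stage: how many rows each state contributes, and whether the support column has missing values. The sketch below runs on a small made-up tibble; the rows are hypothetical, and only the column names `state` and `pct` match the dataset.

```r
library(dplyr)

# Toy stand-in for `data`; values are illustrative only
polls <- tibble(
  state = c("Nebraska", "Nebraska", "Michigan", "Michigan", "Michigan"),
  pct   = c(38, 40, 39, 37, NA)
)

# Rows per state, largest first
rows_per_state <- polls %>%
  count(state, sort = TRUE)

# Number of rows with a missing support percentage
missing_pct <- sum(is.na(polls$pct))
```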

Data Cleaning and Transformation

We’ll now select relevant columns, rename them for clarity, and convert date columns to the appropriate format.

# Select relevant columns and rename them
cleaned_data <- data %>%
  select(
    Pollster = pollster,
    State = state,
    StartDate = start_date,
    EndDate = end_date,
    CandidateName = candidate_name,
    Percentage = pct
  )

# Convert StartDate and EndDate to Date format
cleaned_data <- cleaned_data %>%
  mutate(
    StartDate = as.Date(StartDate, format = "%m/%d/%y"),
    EndDate = as.Date(EndDate, format = "%m/%d/%y")
  )

# Preview the cleaned data
head(cleaned_data)
## # A tibble: 6 × 6
##   Pollster        State    StartDate  EndDate    CandidateName  Percentage
##   <chr>           <chr>    <date>     <date>     <chr>               <dbl>
## 1 Change Research Nebraska 2023-11-13 2023-11-16 Deb Fischer            38
## 2 Change Research Nebraska 2023-11-13 2023-11-16 Dan Osborn             40
## 3 EPIC/MRA        Michigan 2023-11-10 2023-11-16 Elissa Slotkin         39
## 4 EPIC/MRA        Michigan 2023-11-10 2023-11-16 Mike Rogers            37
## 5 EPIC/MRA        Michigan 2023-11-10 2023-11-16 Elissa Slotkin         40
## 6 EPIC/MRA        Michigan 2023-11-10 2023-11-16 James Craig            38
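
One thing worth checking after the conversion is whether any date failed to parse, since `as.Date()` returns `NA` silently when a string does not match the format. The sketch below uses a few made-up strings, including a deliberately bad one, to show the check. Note that base R reads two-digit years 00 to 68 as 20xx and 69 to 99 as 19xx.

```r
# Illustrative inputs; the last one is deliberately malformed
raw_dates <- c("11/13/23", "11/5/24", "not a date")

parsed <- as.Date(raw_dates, format = "%m/%d/%y")

# Positions where the input was present but parsing failed
failed <- which(is.na(parsed) & !is.na(raw_dates))
```

Applied to the real columns, an empty `failed` vector would indicate that every non-missing date matched the `%m/%d/%y` format.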

Data Analysis

Let’s summarize the data to see each candidate’s average level of support across polls, along with the number of poll results that average is based on.

# Summarize the data by candidate
summary_data <- cleaned_data %>%
  group_by(CandidateName) %>%
  summarise(
    AvgPercentage = mean(Percentage, na.rm = TRUE),
    Count = n()
  )

# Display the summary data
summary_data
## # A tibble: 69 × 3
##    CandidateName       AvgPercentage Count
##    <chr>                       <dbl> <int>
##  1 Adam B. Schiff               26       1
##  2 Adam Paul Laxalt             41.9     1
##  3 Alexander X. Mooney          36.5     7
##  4 Andy Kim                     46       1
##  5 Bernie Moreno                35.2     7
##  6 Blake Masters                29.4     5
##  7 Brian Wright                 31.6     2
##  8 Charles D. Baker             48.9     1
##  9 Chris Christie               23.5     2
## 10 Colin Allred                 38.5     2
## # ℹ 59 more rows
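
Several candidates in the table above appear in only one poll, so their averages are less stable than those built from many polls. One possible refinement, sketched here on made-up data, is to group by state as well as candidate and keep only candidates with a minimum number of poll results. The threshold `min_polls` and all values in the toy tibble are assumptions for illustration.

```r
library(dplyr)

# Toy stand-in for `cleaned_data`; values are illustrative only
toy <- tibble(
  State         = c("A", "A", "A", "A", "B", "B"),
  CandidateName = c("X", "X", "Y", "Y", "Z", "W"),
  Percentage    = c(40, 44, 38, 36, 49, 30)
)

min_polls <- 2  # hypothetical threshold, not taken from the analysis

race_summary <- toy %>%
  group_by(State, CandidateName) %>%
  summarise(
    AvgPercentage = mean(Percentage, na.rm = TRUE),
    Count = n(),
    .groups = "drop"
  ) %>%
  filter(Count >= min_polls) %>%   # drop averages based on too few polls
  arrange(desc(AvgPercentage))
```

In the toy data, candidates Z and W are dropped because each has a single poll result, even though Z has the highest raw percentage.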

Visualization

We visualize the five candidates with the highest average poll support in a bar chart.

# Sort the data by average percentage in descending order
top_five_candidates <- summary_data %>%
  arrange(desc(AvgPercentage)) %>%
  head(5)

# Plot only the top five candidates
ggplot(
  top_five_candidates,
  aes(x = reorder(CandidateName, -AvgPercentage), y = AvgPercentage, fill = CandidateName)
) +
  geom_col() +  # equivalent to geom_bar(stat = "identity")
  theme_minimal() +
  labs(title = "Top 5 Candidates by Average Poll Support", x = "Candidate", y = "Average Support (%)") +
  theme(legend.position = "none")
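
Because the bars show averages over differing numbers of polls, it may help to print the poll count on each bar. The following is a sketch on a made-up summary table: the candidate names and values are hypothetical, and `slice_max()` is used as an alternative to `arrange()` plus `head()`.

```r
library(dplyr)
library(ggplot2)

# Toy summary table; names and values are illustrative only
toy_summary <- tibble(
  CandidateName = c("Candidate X", "Candidate Y", "Candidate Z"),
  AvgPercentage = c(48.9, 47.5, 36.5),
  Count         = c(1L, 6L, 7L)
)

# Keep the rows with the highest averages (ties, if any, are all kept)
top_two <- toy_summary %>%
  slice_max(AvgPercentage, n = 2)

# Bar chart with the number of polls printed above each bar
p <- ggplot(top_two, aes(x = reorder(CandidateName, -AvgPercentage), y = AvgPercentage)) +
  geom_col() +
  geom_text(aes(label = paste0("n = ", Count)), vjust = -0.5) +
  theme_minimal() +
  labs(x = "Candidate", y = "Average Percentage")
```

Printing `p` draws the chart. The `n = 1` label on the first bar makes it clear that its height reflects a single poll.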

Conclusions

The polling data show that Maria Cantwell and Charles D. Baker have nearly identical average support across the polls. They are followed by Tammy Baldwin, Robert P. Casey Jr., and James Justice, whose averages are slightly lower.

These five candidates poll strongest on average, with Cantwell and Baker leading the list. The ranking should be read with care: the candidates are running in different states, so it compares strength within separate races rather than a head-to-head contest, and some averages rest on very few polls (Baker’s comes from a single poll in the summary table). Future analyses could track changes in support over time or compare these results with other polling sources to assess consistency and identify emerging trends.

Additionally, expanding the analysis to include other factors such as polling methodology, sample size, and geographic distribution could provide a more comprehensive understanding of the dynamics in these Senate races.
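
As one illustration of bringing sample size into the analysis, the sketch below computes a sample-size-weighted average with base R’s `weighted.mean()`. It assumes the `sample_size` column had been kept during cleaning (here renamed `SampleSize`, a hypothetical name), and all values are made up.

```r
library(dplyr)

# Toy data; SampleSize stands in for the dataset's sample_size column
toy <- tibble(
  CandidateName = c("X", "X", "Y", "Y"),
  Percentage    = c(40, 50, 30, 34),
  SampleSize    = c(1000, 500, 600, 600)
)

# Larger polls count for more in each candidate's average
weighted_summary <- toy %>%
  group_by(CandidateName) %>%
  summarise(
    WeightedAvg = weighted.mean(Percentage, w = SampleSize),
    .groups = "drop"
  )
```

For candidate X the weighted average is pulled toward 40, the result from the larger poll, whereas a simple mean would give 45. This is only one of several reasonable weighting schemes.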