Introduction

This is a brief overview of data from the latest polls on the US presidential election. For the complete article and the source of the data, please see

https://projects.fivethirtyeight.com/polls/.

This file includes R chunks that show how to load the data.

Install / Load libraries

Load the libraries the analysis depends on first. In this case, we load dplyr and ggplot2.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Specify the URL of the raw CSV file on GitHub

For reproducibility, the data is pulled from a URL on GitHub.

url <- "https://raw.githubusercontent.com/Lfirenzeg/msds607labs/main/president_polls.csv"

Load the data into R

We create a data frame called total_data_polls that holds the entire original data set.

total_data_polls <- read.csv(url)
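
As a minimal sketch of what read.csv() does with this kind of file, the example below parses a small CSV string instead of the real URL. The toy_polls name and the three-line CSV text are assumptions made for illustration, and the na.strings argument is an optional choice that is not used in the original call.

```r
# Toy CSV text standing in for the real file (hypothetical, for illustration).
csv_text <- "state,sample_size,pct,notes
Nevada,800,46.5,
Arizona,800,48.4,rounded"

# `text =` lets read.csv() parse a string instead of a file path or URL.
# na.strings = c("", "NA") turns empty fields into NA instead of "".
toy_polls <- read.csv(text = csv_text, na.strings = c("", "NA"))

dim(toy_polls)          # number of rows and columns
str(toy_polls)          # column types at a glance
is.na(toy_polls$notes)  # which rows had an empty notes field
```

Treating empty fields as NA can make missing values easier to count later, but it is a design choice; the default keeps them as empty strings.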

View the first few rows of the data

Use the command head to confirm that the data loaded correctly.

head(total_data_polls)
##   poll_id pollster_id         pollster sponsor_ids sponsors     display_name
## 1   87989         235 InsiderAdvantage                      InsiderAdvantage
## 2   87989         235 InsiderAdvantage                      InsiderAdvantage
## 3   87990         235 InsiderAdvantage                      InsiderAdvantage
## 4   87990         235 InsiderAdvantage                      InsiderAdvantage
## 5   87994         235 InsiderAdvantage                      InsiderAdvantage
## 6   87994         235 InsiderAdvantage                      InsiderAdvantage
##   pollster_rating_id pollster_rating_name numeric_grade pollscore methodology
## 1                243     InsiderAdvantage             2      -0.4            
## 2                243     InsiderAdvantage             2      -0.4            
## 3                243     InsiderAdvantage             2      -0.4            
## 4                243     InsiderAdvantage             2      -0.4            
## 5                243     InsiderAdvantage             2      -0.4            
## 6                243     InsiderAdvantage             2      -0.4            
##   transparency_score          state start_date end_date sponsor_candidate_id
## 1                  5         Nevada    8/29/24  8/31/24                   NA
## 2                  5         Nevada    8/29/24  8/31/24                   NA
## 3                  5        Arizona    8/29/24  8/31/24                   NA
## 4                  5        Arizona    8/29/24  8/31/24                   NA
## 5                  5 North Carolina    8/29/24  8/31/24                   NA
## 6                  5 North Carolina    8/29/24  8/31/24                   NA
##   sponsor_candidate sponsor_candidate_party endorsed_candidate_id
## 1                                                              NA
## 2                                                              NA
## 3                                                              NA
## 4                                                              NA
## 5                                                              NA
## 6                                                              NA
##   endorsed_candidate_name endorsed_candidate_party question_id sample_size
## 1                      NA                       NA      207585         800
## 2                      NA                       NA      207585         800
## 3                      NA                       NA      207586         800
## 4                      NA                       NA      207586         800
## 5                      NA                       NA      207583         800
## 6                      NA                       NA      207583         800
##   population subpopulation population_full tracking    created_at notes
## 1         lv            NA              lv       NA 8/31/24 13:18      
## 2         lv            NA              lv       NA 8/31/24 13:18      
## 3         lv            NA              lv       NA 8/31/24 13:18      
## 4         lv            NA              lv       NA 8/31/24 13:18      
## 5         lv            NA              lv       NA 8/31/24 13:18      
## 6         lv            NA              lv       NA 8/31/24 13:18      
##                                                                                                            url
## 1 https://insideradvantage.com/nevada-trump-leads-by-one-point-rosen-holds-substantial-lead-in-senate-contest/
## 2 https://insideradvantage.com/nevada-trump-leads-by-one-point-rosen-holds-substantial-lead-in-senate-contest/
## 3                            https://insideradvantage.com/arizona-trump-leads-by-one-point-gallego-up-by-four/
## 4                            https://insideradvantage.com/arizona-trump-leads-by-one-point-gallego-up-by-four/
## 5      https://insideradvantage.com/north-carolina-trump-leads-harris-by-one-point-rounded-numbers-below-tabs/
## 6      https://insideradvantage.com/north-carolina-trump-leads-harris-by-one-point-rounded-numbers-below-tabs/
##   source internal partisan race_id cycle    office_type seat_number seat_name
## 1     NA                      8857  2024 U.S. President           0        NA
## 2     NA                      8857  2024 U.S. President           0        NA
## 3     NA                      8759  2024 U.S. President           0        NA
## 4     NA                      8759  2024 U.S. President           0        NA
## 5     NA                      8839  2024 U.S. President           0        NA
## 6     NA                      8839  2024 U.S. President           0        NA
##   election_date   stage nationwide_batch ranked_choice_reallocated
## 1       11/5/24 general            false                     false
## 2       11/5/24 general            false                     false
## 3       11/5/24 general            false                     false
## 4       11/5/24 general            false                     false
## 5       11/5/24 general            false                     false
## 6       11/5/24 general            false                     false
##   ranked_choice_round party answer candidate_id candidate_name  pct
## 1                  NA   DEM Harris        16661  Kamala Harris 46.5
## 2                  NA   REP  Trump        16651   Donald Trump 47.9
## 3                  NA   DEM Harris        16661  Kamala Harris 48.4
## 4                  NA   REP  Trump        16651   Donald Trump 48.6
## 5                  NA   DEM Harris        16661  Kamala Harris 48.4
## 6                  NA   REP  Trump        16651   Donald Trump 49.2

Select and rename the columns we are interested in

However, the data above is too wide to inspect easily, so the next step is to focus on a handful of columns. We create a new subset called main_data_polls.

main_data_polls <- total_data_polls %>%
  select(state, start_date, end_date, sample_size, party, answer, candidate_id, candidate_name, pct)

That reduced the number of columns from 48 to 9.
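
A quick way to confirm a column count is to compare ncol() before and after the select() call. The sketch below does this on a small made-up data frame; toy_total and toy_main are hypothetical names, and the values only mimic the shape of the polls table.

```r
library(dplyr)

# Hypothetical miniature of the polls table with a few extra columns.
toy_total <- data.frame(
  poll_id  = c(1, 1),
  pollster = c("A", "A"),
  state    = c("Nevada", "Nevada"),
  answer   = c("Harris", "Trump"),
  pct      = c(46.5, 47.9)
)

# Keep only the columns of interest.
toy_main <- toy_total %>%
  select(state, answer, pct)

ncol(toy_total)  # columns before select()
ncol(toy_main)   # columns after select()
names(toy_main)  # which columns were kept
```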

Column names can also be changed. Let’s rename them to make them easier to read.

main_data_polls <- main_data_polls %>%
  rename(
    State = state,
    Start_Date = start_date,
    End_Date = end_date,
    SampleSize = sample_size,
    Party = party,
    Choice = answer,
    CandidateID = candidate_id,
    CandidateName = candidate_name,
    Percentage = pct
  )
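
As an aside, dplyr's select() also accepts new_name = old_name pairs, so selecting and renaming can be done in one step. The sketch below shows this on a made-up data frame (toy_total and toy_main are hypothetical names); it is an alternative to the two-step approach above, not a replacement for it.

```r
library(dplyr)

# Hypothetical miniature of the polls table.
toy_total <- data.frame(
  poll_id = c(1, 1),
  state   = c("Nevada", "Nevada"),
  answer  = c("Harris", "Trump"),
  pct     = c(46.5, 47.9)
)

# Select and rename in a single call: new name on the left, old on the right.
toy_main <- toy_total %>%
  select(State = state, Choice = answer, Percentage = pct)

names(toy_main)
```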

Let’s see how the renamed columns look using the command head:

head(main_data_polls)
##            State Start_Date End_Date SampleSize Party Choice CandidateID
## 1         Nevada    8/29/24  8/31/24        800   DEM Harris       16661
## 2         Nevada    8/29/24  8/31/24        800   REP  Trump       16651
## 3        Arizona    8/29/24  8/31/24        800   DEM Harris       16661
## 4        Arizona    8/29/24  8/31/24        800   REP  Trump       16651
## 5 North Carolina    8/29/24  8/31/24        800   DEM Harris       16661
## 6 North Carolina    8/29/24  8/31/24        800   REP  Trump       16651
##   CandidateName Percentage
## 1 Kamala Harris       46.5
## 2  Donald Trump       47.9
## 3 Kamala Harris       48.4
## 4  Donald Trump       48.6
## 5 Kamala Harris       48.4
## 6  Donald Trump       49.2

Generate a graph to visualize information

Now that we have our subset ready, let’s create a graph. First, let’s define the most recent poll dates as the ones that will be used to create the graph.

Convert the date columns to Date type

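# The raw dates use two-digit years (e.g. "8/29/24"), so parse them with %y;
# %Y would read "24" as the year 0024.
main_data_polls$Start_Date <- as.Date(main_data_polls$Start_Date, format = "%m/%d/%y")
main_data_polls$End_Date <- as.Date(main_data_polls$End_Date, format = "%m/%d/%y")
# Bounds of the polling window to plot
start_date <- as.Date("2024-08-29")
end_date <- as.Date("2024-08-31")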
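
Two-digit years such as "8/29/24" need care when parsing. The sketch below compares the two year specifiers in base R: %Y expects a full year and reads "24" literally, while %y expands two-digit years (00 to 68 become 20xx). The variable names are hypothetical.

```r
raw_date <- "8/29/24"

# %Y treats "24" as the full year, i.e. year 24.
with_big_Y <- as.Date(raw_date, format = "%m/%d/%Y")

# %y treats "24" as a two-digit year, i.e. 2024.
with_small_y <- as.Date(raw_date, format = "%m/%d/%y")

as.numeric(format(with_big_Y, "%Y"))    # year parsed with %Y
as.numeric(format(with_small_y, "%Y"))  # year parsed with %y
```

Whichever specifier is used, the comparison dates in the filter step have to fall in the same century as the parsed columns, otherwise the filter returns no rows.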

Filter the data for the specific period

filtered_data <- main_data_polls %>%
  filter(Start_Date >= start_date & End_Date <= end_date)
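
It can help to check how much data survives a date filter before plotting. The sketch below applies the same kind of filter to a small made-up data frame and counts the remaining rows and distinct states; toy_polls, window_start, and window_end are hypothetical names, and the Ohio row is invented to show a poll falling outside the window.

```r
library(dplyr)

# Hypothetical polls: three inside the window, one outside it.
toy_polls <- data.frame(
  State      = c("Nevada", "Nevada", "Arizona", "Ohio"),
  Start_Date = as.Date(c("2024-08-29", "2024-08-29", "2024-08-30", "2024-08-20")),
  End_Date   = as.Date(c("2024-08-31", "2024-08-31", "2024-08-31", "2024-08-22"))
)

window_start <- as.Date("2024-08-29")
window_end   <- as.Date("2024-08-31")

# Conditions separated by commas in filter() are combined with AND.
toy_filtered <- toy_polls %>%
  filter(Start_Date >= window_start, End_Date <= window_end)

nrow(toy_filtered)              # rows kept by the filter
n_distinct(toy_filtered$State)  # number of different states kept
```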

And finally create the plot

ggplot(filtered_data, aes(x = State, y = Percentage, fill = Choice)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("Trump" = "red", "Harris" = "blue")) + 
  geom_text(aes(label = paste0(Percentage, "%")), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, 
            color = "black") +  # Add labels to the bars
  labs(title = "Percentage of Voters Between Trump and Harris",
       x = "State",
       y = "Percentage of Voters") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
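
A trimmed-down variant of the plot is sketched below, assuming one might want to keep the plot in a variable and write it to a file. It uses geom_col(), which is ggplot2's shorthand for geom_bar(stat = "identity"), and ggsave() to save a PDF to a temporary path. The names toy_plot_data, poll_plot, and out_file are hypothetical; the four rows reuse the Nevada and Arizona values shown earlier.

```r
library(ggplot2)

# Four rows taken from the head() output above.
toy_plot_data <- data.frame(
  State      = c("Nevada", "Nevada", "Arizona", "Arizona"),
  Choice     = c("Harris", "Trump", "Harris", "Trump"),
  Percentage = c(46.5, 47.9, 48.4, 48.6)
)

# Store the plot in an object instead of printing it directly.
poll_plot <- ggplot(toy_plot_data, aes(x = State, y = Percentage, fill = Choice)) +
  geom_col(position = "dodge") +  # same bars as geom_bar(stat = "identity")
  scale_fill_manual(values = c("Trump" = "red", "Harris" = "blue")) +
  theme_minimal()

# Write the plot to a temporary PDF file (size in inches).
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = poll_plot, width = 7, height = 4)
file.exists(out_file)
```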

Conclusions

This data is limited to polls conducted between 08/29/2024 and 08/31/2024 in 5 states, with varying sample sizes. From the data alone it is hard to determine how reliable the information is. However, as a first exercise in creating ggplot2 graphs and modifying tables in RStudio, it illustrates how easily everything can be customized. It also shows that how data is presented is as important as the data itself, or even more so. One way to expand this analysis would be to include more pollsters and rate how accurate their polling has been in the past against actual results.