This is a brief overview of the data from the latest polls regarding presidential elections in the US. For the complete article and source of the data please see
https://projects.fivethirtyeight.com/polls/.
This file includes R chunks that show how to load the data.
First, load the libraries the analysis depends on; in this case, dplyr and ggplot2.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Because the analysis needs to be reproducible, the data is pulled from a URL on GitHub.
url <- "https://raw.githubusercontent.com/Lfirenzeg/msds607labs/main/president_polls.csv"
We create a data frame called total_data_polls that holds the entire original data set.
total_data_polls <- read.csv(url)
Use the head command to confirm that the data loaded correctly.
head(total_data_polls)
## poll_id pollster_id pollster sponsor_ids sponsors display_name
## 1 87989 235 InsiderAdvantage InsiderAdvantage
## 2 87989 235 InsiderAdvantage InsiderAdvantage
## 3 87990 235 InsiderAdvantage InsiderAdvantage
## 4 87990 235 InsiderAdvantage InsiderAdvantage
## 5 87994 235 InsiderAdvantage InsiderAdvantage
## 6 87994 235 InsiderAdvantage InsiderAdvantage
## pollster_rating_id pollster_rating_name numeric_grade pollscore methodology
## 1 243 InsiderAdvantage 2 -0.4
## 2 243 InsiderAdvantage 2 -0.4
## 3 243 InsiderAdvantage 2 -0.4
## 4 243 InsiderAdvantage 2 -0.4
## 5 243 InsiderAdvantage 2 -0.4
## 6 243 InsiderAdvantage 2 -0.4
## transparency_score state start_date end_date sponsor_candidate_id
## 1 5 Nevada 8/29/24 8/31/24 NA
## 2 5 Nevada 8/29/24 8/31/24 NA
## 3 5 Arizona 8/29/24 8/31/24 NA
## 4 5 Arizona 8/29/24 8/31/24 NA
## 5 5 North Carolina 8/29/24 8/31/24 NA
## 6 5 North Carolina 8/29/24 8/31/24 NA
## sponsor_candidate sponsor_candidate_party endorsed_candidate_id
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## endorsed_candidate_name endorsed_candidate_party question_id sample_size
## 1 NA NA 207585 800
## 2 NA NA 207585 800
## 3 NA NA 207586 800
## 4 NA NA 207586 800
## 5 NA NA 207583 800
## 6 NA NA 207583 800
## population subpopulation population_full tracking created_at notes
## 1 lv NA lv NA 8/31/24 13:18
## 2 lv NA lv NA 8/31/24 13:18
## 3 lv NA lv NA 8/31/24 13:18
## 4 lv NA lv NA 8/31/24 13:18
## 5 lv NA lv NA 8/31/24 13:18
## 6 lv NA lv NA 8/31/24 13:18
## url
## 1 https://insideradvantage.com/nevada-trump-leads-by-one-point-rosen-holds-substantial-lead-in-senate-contest/
## 2 https://insideradvantage.com/nevada-trump-leads-by-one-point-rosen-holds-substantial-lead-in-senate-contest/
## 3 https://insideradvantage.com/arizona-trump-leads-by-one-point-gallego-up-by-four/
## 4 https://insideradvantage.com/arizona-trump-leads-by-one-point-gallego-up-by-four/
## 5 https://insideradvantage.com/north-carolina-trump-leads-harris-by-one-point-rounded-numbers-below-tabs/
## 6 https://insideradvantage.com/north-carolina-trump-leads-harris-by-one-point-rounded-numbers-below-tabs/
## source internal partisan race_id cycle office_type seat_number seat_name
## 1 NA 8857 2024 U.S. President 0 NA
## 2 NA 8857 2024 U.S. President 0 NA
## 3 NA 8759 2024 U.S. President 0 NA
## 4 NA 8759 2024 U.S. President 0 NA
## 5 NA 8839 2024 U.S. President 0 NA
## 6 NA 8839 2024 U.S. President 0 NA
## election_date stage nationwide_batch ranked_choice_reallocated
## 1 11/5/24 general false false
## 2 11/5/24 general false false
## 3 11/5/24 general false false
## 4 11/5/24 general false false
## 5 11/5/24 general false false
## 6 11/5/24 general false false
## ranked_choice_round party answer candidate_id candidate_name pct
## 1 NA DEM Harris 16661 Kamala Harris 46.5
## 2 NA REP Trump 16651 Donald Trump 47.9
## 3 NA DEM Harris 16661 Kamala Harris 48.4
## 4 NA REP Trump 16651 Donald Trump 48.6
## 5 NA DEM Harris 16661 Kamala Harris 48.4
## 6 NA REP Trump 16651 Donald Trump 49.2
However, the data above has too many columns to inspect easily, so the next step is to focus on a handful of them. We create a new subset called main_data_polls.
main_data_polls <- total_data_polls %>%
  select(state, start_date, end_date, sample_size, party, answer, candidate_id, candidate_name, pct)
That reduced the number of columns from 48 to 9.
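One way to double-check a column count like this is to compare ncol() before and after the selection. The sketch below uses a small made-up data frame (toy_polls and its values are illustrative, not the real file) and base R indexing in place of select(), so it runs without extra packages.

```r
# Toy stand-in for the full poll table (values are made up for illustration)
toy_polls <- data.frame(
  poll_id = c(1, 1),
  pollster = c("Example Pollster", "Example Pollster"),
  state = c("Nevada", "Nevada"),
  answer = c("Harris", "Trump"),
  pct = c(46.5, 47.9)
)

# Keep only the columns of interest (base R counterpart of dplyr::select)
keep <- c("state", "answer", "pct")
toy_subset <- toy_polls[, keep]

# Compare the column counts before and after the selection
cols_before <- ncol(toy_polls)
cols_after <- ncol(toy_subset)
cat("Columns before:", cols_before, "after:", cols_after, "\n")
```

Running the same two ncol() calls on total_data_polls and main_data_polls would give the counts for the real data.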
As an example of renaming columns, let’s change the names to make them easier to read.
main_data_polls <- main_data_polls %>%
  rename(
    State = state,
    Start_Date = start_date,
    End_Date = end_date,
    SampleSize = sample_size,
    Party = party,
    Choice = answer,
    CandidateID = candidate_id,
    CandidateName = candidate_name,
    Percentage = pct
  )
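The same kind of renaming can also be sketched in base R with a lookup vector, which may be handy when the list of names is long or kept elsewhere. The data frame toy and the vector lookup are hypothetical names introduced here for illustration only.

```r
# Toy data frame with two of the original column names (values are made up)
toy <- data.frame(state = c("Nevada", "Arizona"), pct = c(46.5, 48.4))

# Lookup vector: names are the old column names, values are the new ones
lookup <- c(state = "State", pct = "Percentage")

# Rename only the columns that appear in the lookup
hits <- names(toy) %in% names(lookup)
names(toy)[hits] <- unname(lookup[names(toy)[hits]])

print(names(toy))
```

Columns that are not listed in lookup keep their original names.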
Let’s see how the renamed columns look using the head command:
head(main_data_polls)
## State Start_Date End_Date SampleSize Party Choice CandidateID
## 1 Nevada 8/29/24 8/31/24 800 DEM Harris 16661
## 2 Nevada 8/29/24 8/31/24 800 REP Trump 16651
## 3 Arizona 8/29/24 8/31/24 800 DEM Harris 16661
## 4 Arizona 8/29/24 8/31/24 800 REP Trump 16651
## 5 North Carolina 8/29/24 8/31/24 800 DEM Harris 16661
## 6 North Carolina 8/29/24 8/31/24 800 REP Trump 16651
## CandidateName Percentage
## 1 Kamala Harris 46.5
## 2 Donald Trump 47.9
## 3 Kamala Harris 48.4
## 4 Donald Trump 48.6
## 5 Kamala Harris 48.4
## 6 Donald Trump 49.2
Now that we have our subset ready, let’s create a graph. First, let’s define the most recent poll dates as the window that will be used for the graph.
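# The file stores two-digit years (e.g. 8/29/24), so parse them with %y, not %Y
main_data_polls$Start_Date <- as.Date(main_data_polls$Start_Date, format = "%m/%d/%y")
main_data_polls$End_Date <- as.Date(main_data_polls$End_Date, format = "%m/%d/%y")
# Date window for the most recent polls
start_date <- as.Date("2024-08-29")
end_date <- as.Date("2024-08-31")
# Keep only polls that started and ended inside the window
filtered_data <- main_data_polls %>%
  filter(Start_Date >= start_date & End_Date <= end_date)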
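Because the file stores dates with two-digit years such as 8/29/24, the choice between %y and %Y in the format string determines which century the parsed dates land in. The sketch below, using made-up date strings and hypothetical variable names, parses with %y and then filters against a plain 2024 window.

```r
# Two-digit-year date strings in the same style as the poll file
raw_dates <- c("8/29/24", "8/31/24", "9/2/24")

# %y reads a two-digit year; on input, values 00 to 68 are prefixed by 20
parsed <- as.Date(raw_dates, format = "%m/%d/%y")

# Filter window written as ordinary ISO dates
window_start <- as.Date("2024-08-29")
window_end <- as.Date("2024-08-31")

# Logical vector marking the dates that fall inside the window
in_window <- parsed >= window_start & parsed <= window_end
print(format(parsed[in_window], "%Y-%m-%d"))
```

Printing a few parsed values right after as.Date() is a cheap way to catch a wrong century before any filtering happens.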
Finally, create the plot:
ggplot(filtered_data, aes(x = State, y = Percentage, fill = Choice)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("Trump" = "red", "Harris" = "blue")) +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_dodge(width = 0.9),
            vjust = -0.5,
            color = "black") + # Add labels to the bars
  labs(title = "Percentage of Voters Between Trump and Harris",
       x = "State",
       y = "Percentage of Voters") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
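If more than one poll for the same state falls inside the date window, bars and labels for the same candidate may overlap. One possible way to handle this, sketched below with made-up numbers and base R's aggregate(), is to average the percentages per state and candidate before plotting. The data frame polls is hypothetical, and using a simple unweighted mean is an assumption; weighting by sample size might be preferable.

```r
# Made-up example: two polls in Nevada, one in Arizona
polls <- data.frame(
  State = c("Nevada", "Nevada", "Nevada", "Nevada", "Arizona", "Arizona"),
  Choice = c("Harris", "Trump", "Harris", "Trump", "Harris", "Trump"),
  Percentage = c(46, 48, 48, 47, 48, 49)
)

# One row per state and candidate, holding the unweighted mean percentage
avg_polls <- aggregate(Percentage ~ State + Choice, data = polls, FUN = mean)

print(avg_polls)
```

The averaged table has the same State, Choice, and Percentage columns, so it could be passed to the plotting code in place of the filtered data.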
This data is limited to polls conducted between 08/29/2024 and 08/31/2024 in 5 states, with varying sample sizes. From the data alone it is hard to determine how reliable the information is. However, as a first approach to creating ggplots and modifying tables in RStudio, it illustrates how easily everything can be customized. It also shows that how data is presented is as important as, or even more important than, the data itself. One way to expand this analysis would be to include more polling organizations and rate how accurate their polling has been in the past against actual results.