Team <- c("Anthony C.", "James N.", "Koohyar P.", "Victor T.")
With the presidential election season in full swing, we decided to explore the polling data sources available online. Many individual sources exist; the website RealClear Politics gathers, summarizes, and presents the results of the various polls in one location. While this website is good for a summary view, the underlying polling data must be extracted from the original sources (if available) for further review and analysis. The polling sources include Emerson College, The Economist Magazine, The New York Times/Siena College, CBS News, and many others. Some sources are free, while others charge a fee. Polls tend to distinguish between “Registered Voters” (RV) and “Likely Voters” (LV), and the common belief is that LV results are more indicative of election outcomes. However, a Berkeley Haas study in 2020 reported that although polls are reported at a 95% confidence level, the actual election results matched the polls only 60% of the time. We decided to look at the Super Tuesday primary results: examine the number of votes cast for each party’s candidates, compare voter participation between the parties, and compare voter participation with both the registered-voter and eligible adult populations by state.
We reviewed and evaluated various data sources, primarily scraped from websites. These sources are diverse, encompassing tables on websites, attached PDF documents, and CSV files. Some data required substantial preparation on an additional platform before we could proceed with evaluation and analysis. This process also involved matching responses to ensure equivalency. Ultimately, we decided to use Ballotpedia’s Super Tuesday primaries page and the registered-voters-by-state dataset at https://worldpopulationreview.com/state-rankings/registered-voters-by-state. These websites provide unbiased raw data from different states on our topics of interest, including voter turnout, registrations, voter enthusiasm, voter distribution, and total votes for different candidates.
Here I load the required libraries and check that all the required packages are installed before running the code blocks that follow.
## [1] "All required packages are installed"
## Warning: package 'openxlsx' was built under R version 4.3.3
## Warning: package 'reactable' was built under R version 4.3.3
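The package check described above can be sketched roughly as follows. This is a minimal sketch rather than the original chunk: the helper name `find_missing_packages` is hypothetical, and the package list is an assumption inferred from the functions and warnings that appear in this report.

```r
# Hypothetical helper: return the names of packages that are NOT installed.
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Assumed package list (inferred from this report, not copied from the original chunk).
required_packages <- c("rvest", "dplyr", "tidyr", "ggplot2", "knitr",
                       "openxlsx", "reactable")

missing_packages <- find_missing_packages(required_packages)
if (length(missing_packages) == 0) {
  print("All required packages are installed")
} else {
  message("Missing packages: ", paste(missing_packages, collapse = ", "))
}
```

Reporting the missing names, instead of installing them silently, leaves the decision to install with whoever runs the report.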
All our files are stored in the GitHub Data/* directory for productivity and collaboration. In this section, I verify the list of files in the Data folder and then load them all into R. All files are in CSV format and are readily accessible from RStudio. However, since they originate from different sources, we must first tidy, clean, and organize them.
## [1] "NYT_Sienna%20Poll_table_1.csv" "NYT_Sienna%20Poll_table_2.csv"
## [3] "RC%20Poll%20Topline%20Feb2024.txt" "president_polls.csv"
## [5] "us02212024_regvoter_upox75.txt"
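As an illustration of the list-then-load step, here is a minimal base R sketch. It assumes a local folder of files; the helper `load_csv_files`, the temporary demo folder, and the demo file names are all hypothetical stand-ins for the real Data folder.

```r
# Hypothetical helper: read every .csv file in a directory into a named list.
load_csv_files <- function(data_dir) {
  paths <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
  frames <- lapply(paths, read.csv, stringsAsFactors = FALSE)
  names(frames) <- tools::file_path_sans_ext(basename(paths))
  frames
}

# Self-contained demo: a temporary folder stands in for the real Data folder.
demo_dir <- tempfile("Data_demo_")
dir.create(demo_dir)
write.csv(data.frame(state = "alabama", votes = 784804),
          file.path(demo_dir, "example_poll.csv"), row.names = FALSE)
writeLines("free-form notes", file.path(demo_dir, "notes.txt"))

print(list.files(demo_dir))         # verify what is in the folder
loaded <- load_csv_files(demo_dir)  # only the .csv file is read
str(loaded)
```

Keeping the data frames in a named list makes it easy to loop over them during the tidying steps.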
Echo and Message are both set to FALSE so that unneeded output is not shown.
https://ballotpedia.org/Super_Tuesday_primaries,_2024
# Read the Ballotpedia page and select the '.portal-section' nodes
url <- "https://ballotpedia.org/Super_Tuesday_primaries,_2024"
webpage <- read_html(url)
st_table <- html_nodes(webpage, '.portal-section')
# Names assigned to the scraped sections by position
table_names <- c("Alabama", "Alaska", "American Samoa", "Arkansas", "California", "Colorado", "Iowa", "Maine", "Massachusetts", "Minnesota", "North_Carolina", "Oklahoma", "Tennessee", "Texas", "Utah", "Vermont", "Virginia")
# Parse each section into a data frame and store it under its name
table_frames <- list()
for (i in seq_along(st_table)) {
  table_data <- html_table(st_table[[i]])
  table_name <- table_names[i]
  table_frames[[table_name]] <- table_data
}
# Sample Data Frame
print("An Example of Raw Data is shown for information")
## [1] "An Example of Raw Data is shown for information"
kable(table_frames$Alabama)
X1 | X2 | X3 | X4 | X5 |
---|---|---|---|---|
NA | Candidate | % | Votes | Pledged delegates |
NA | Joe Biden | 89.5 | 167,165 | 52 |
NA | Dean Phillips | 4.5 | 8,391 | 0 |
NA | Other | 6.0 | 11,213 | 0 |
NA | 99% reporting Source | Total votes: 186,769 • Total pledged delegates: 52 | NA | NA |
NA | Candidate | % | Votes | Pledged delegates |
NA | Donald Trump | 83.2 | 497,739 | 50 |
NA | Nikki Haley | 13.0 | 77,564 | 0 |
NA | Ron DeSantis | 1.4 | 8,426 | 0 |
NA | Vivek Ramaswamy | 0.3 | 1,859 | 0 |
NA | Chris Christie | 0.2 | 1,436 | 0 |
NA | David James Stuckenberg | 0.1 | 748 | 0 |
NA | Ryan Binkley | 0.1 | 508 | 0 |
NA | Other | 1.6 | 9,755 | 0 |
NA | 99% reporting Source | Total votes: 598,035 • Total pledged delegates: 50 | NA | NA |
# Loop through each data frame removing the rows that contain "% reporting" in column X2
for (i in seq_along(table_frames)) {
  # Keep only the rows that do not contain the character string "% reporting"
  table_frames[[i]] <- table_frames[[i]][!grepl("% reporting", table_frames[[i]]$X2), ]
}
# Loop through each data frame removing the rows that contain "Source" in column X2
for (i in seq_along(table_frames)) {
  # Keep only the rows that do not contain the character string "Source"
  table_frames[[i]] <- table_frames[[i]][!grepl("Source", table_frames[[i]]$X2), ]
}
# for (i in seq_along(table_frames)) {
# print(table_frames[[i]])
#}
#table_frames[[1]]
print("Data is cleaned by removing 99% reporting, just the first 10 rows are shown")
## [1] "Data is cleaned by removing 99% reporting, just the first 10 rows are shown"
head(table_frames[[1]],10)
## # A tibble: 10 × 5
## X1 X2 X3 X4 X5
## <lgl> <chr> <chr> <chr> <chr>
## 1 NA Candidate % Votes Pledged delegates
## 2 NA Joe Biden 89.5 167,165 52
## 3 NA Dean Phillips 4.5 8,391 0
## 4 NA Other 6.0 11,213 0
## 5 NA Candidate % Votes Pledged delegates
## 6 NA Donald Trump 83.2 497,739 50
## 7 NA Nikki Haley 13.0 77,564 0
## 8 NA Ron DeSantis 1.4 8,426 0
## 9 NA Vivek Ramaswamy 0.3 1,859 0
## 10 NA Chris Christie 0.2 1,436 0
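The two filtering loops above can also be expressed as a single pass. The sketch below is one possible alternative, not the code used in this report: the helper `drop_footer_rows` is hypothetical, and the toy table only mimics the shape of the scraped Alabama data.

```r
# Toy stand-in for one scraped table (values copied from the Alabama example).
toy_table <- data.frame(
  X2 = c("Candidate", "Joe Biden", "99% reporting Source",
         "Candidate", "Donald Trump", "99% reporting Source"),
  X4 = c("Votes", "167,165", NA, "Votes", "497,739", NA),
  stringsAsFactors = FALSE
)
toy_frames <- list(Alabama = toy_table)

# Hypothetical helper: drop rows whose X2 mentions "% reporting" or "Source".
drop_footer_rows <- function(df) {
  df[!grepl("% reporting|Source", df$X2), , drop = FALSE]
}

# Apply the same cleaning to every data frame in the list.
cleaned_frames <- lapply(toy_frames, drop_footer_rows)
print(cleaned_frames$Alabama)
```

Using `lapply()` keeps the list names intact, so each state can still be looked up by name afterwards.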
Here we select the needed columns and drop the unnecessary ones. This step also places “Democrat” into the party column for ALL candidates; a later section corrects this by assigning “Republican” to the appropriate candidates. It also pulls the individual state data frames out of the list so they can be worked on individually.
X2 | X3 | X4 | X6 | X7 |
---|---|---|---|---|
Candidate | % | Votes | state | party |
Joe Biden | 89.5 | 167,165 | alabama | Democrat |
Dean Phillips | 4.5 | 8,391 | alabama | Democrat |
Other | 6.0 | 11,213 | alabama | Democrat |
Candidate | % | Votes | state | party |
Donald Trump | 83.2 | 497,739 | alabama | Democrat |
Nikki Haley | 13.0 | 77,564 | alabama | Democrat |
Ron DeSantis | 1.4 | 8,426 | alabama | Democrat |
Vivek Ramaswamy | 0.3 | 1,859 | alabama | Democrat |
Chris Christie | 0.2 | 1,436 | alabama | Democrat |
David James Stuckenberg | 0.1 | 748 | alabama | Democrat |
Ryan Binkley | 0.1 | 508 | alabama | Democrat |
Other | 1.6 | 9,755 | alabama | Democrat |
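A minimal base R sketch of this column selection is shown here. The helper `add_state_and_party` is hypothetical, and the way the header rows receive the labels "state" and "party" is an assumption made to mirror the table printed above.

```r
# Toy stand-in for one cleaned table (values copied from the Alabama example).
toy_table <- data.frame(
  X1 = NA,
  X2 = c("Candidate", "Joe Biden", "Candidate", "Donald Trump"),
  X3 = c("%", "89.5", "%", "83.2"),
  X4 = c("Votes", "167,165", "Votes", "497,739"),
  X5 = c("Pledged delegates", "52", "Pledged delegates", "50"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep X2:X4, then add a state column (X6) and a
# party column (X7) that defaults to "Democrat" for every candidate row.
add_state_and_party <- function(df, state_name) {
  out <- df[, c("X2", "X3", "X4")]
  is_header <- out$X2 == "Candidate"
  out$X6 <- ifelse(is_header, "state", state_name)
  out$X7 <- ifelse(is_header, "party", "Democrat")
  out
}

alabama_demo <- add_state_and_party(toy_table, "alabama")
print(alabama_demo)
```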
This step uses a filter that finds the second occurrence of the “Candidate” header row in column X2, which separates the Republican candidates from the Democratic candidates, and then replaces “Democrat” with “Republican” for the rows after it. This is done for each individual data frame for the Super Tuesday states.
X2 | X3 | X4 | X6 | X7 |
---|---|---|---|---|
Candidate | % | Votes | state | party |
Joe Biden | 89.5 | 167,165 | alabama | Democrat |
Dean Phillips | 4.5 | 8,391 | alabama | Democrat |
Other | 6.0 | 11,213 | alabama | Democrat |
Candidate | % | Votes | state | party |
Donald Trump | 83.2 | 497,739 | alabama | Republican |
Nikki Haley | 13.0 | 77,564 | alabama | Republican |
Ron DeSantis | 1.4 | 8,426 | alabama | Republican |
Vivek Ramaswamy | 0.3 | 1,859 | alabama | Republican |
Chris Christie | 0.2 | 1,436 | alabama | Republican |
David James Stuckenberg | 0.1 | 748 | alabama | Republican |
Ryan Binkley | 0.1 | 508 | alabama | Republican |
Other | 1.6 | 9,755 | alabama | Republican |
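One way to locate the second block is to count header rows with `cumsum()`. The sketch below assumes the header rows are the ones whose X2 value is "Candidate"; the helper `relabel_second_block` is hypothetical and may differ from the filter actually used.

```r
# Toy table in the same shape as the output above.
toy_table <- data.frame(
  X2 = c("Candidate", "Joe Biden", "Dean Phillips",
         "Candidate", "Donald Trump", "Nikki Haley"),
  X7 = c("party", "Democrat", "Democrat", "party", "Democrat", "Democrat"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows from the second "Candidate" header onward
# belong to the Republican table, so their party label is replaced.
relabel_second_block <- function(df) {
  is_header <- df$X2 == "Candidate"
  block <- cumsum(is_header)  # 1 within the first table, 2 within the second
  df$X7[block >= 2 & !is_header] <- "Republican"
  df
}

relabelled <- relabel_second_block(toy_table)
print(relabelled)
```

Counting headers avoids hard-coding row positions, which differ from state to state because the number of candidates varies.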
In this section, we remove the “Candidate” header rows from each of the data frames and rename the columns from X2, X3, and so on to what they actually represent. We also address the accidental import of California candidates who are not running for the national office of president; they were imported because they were placed in the same table on the website. Those candidates had their party affiliation and percent_vote values changed to NA.
candidate | percent_vote | num_vote | state | party |
---|---|---|---|---|
Joe Biden | 89.4 | 2,410,731 | california | Democrat |
Marianne Williamson | 3.7 | 99,321 | california | Democrat |
Dean Phillips | 2.8 | 75,313 | california | Democrat |
Armando Perez-Serrato | 1.2 | 32,927 | california | Democrat |
Gabriel Cornejo | 1.2 | 31,985 | california | Democrat |
President Boddie | 0.7 | 19,651 | california | Democrat |
Stephen Lyons Sr. | 0.6 | 16,946 | california | Democrat |
Eban Cambridge | 0.3 | 9,106 | california | Democrat |
Donald Trump | 78.8 | 1,505,027 | california | Republican |
Nikki Haley | 17.8 | 339,502 | california | Republican |
Ron DeSantis | 1.5 | 27,820 | california | Republican |
Chris Christie | 0.9 | 16,476 | california | Republican |
Vivek Ramaswamy | 0.4 | 8,171 | california | Republican |
Rachel Hannah Swift | 0.2 | 3,393 | california | Republican |
David James Stuckenberg | 0.2 | 3,086 | california | Republican |
Ryan Binkley | 0.2 | 2,947 | california | Republican |
Asa Hutchinson | 0.1 | 2,697 | california | Republican |
Jill Stein | NA | 12,856 | california | NA |
Charles Ballay | NA | 19,043 | california | NA |
James P. Bradley | NA | 40,290 | california | NA |
Claudia De La Cruz | NA | 5,273 | california | NA |
Cornel West | NA | 4,433 | california | NA |
Jasmine Sherman | NA | 1,491 | california | NA |
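The renaming and blanking step could look roughly like this. It is a sketch under stated assumptions: the helper `tidy_state_table` is hypothetical, the affected candidates are identified by a hand-written name vector, and the percent and party values shown for Jill Stein before blanking are placeholders, not real data.

```r
# Toy table; the last row stands in for a candidate outside the presidential tables.
toy_table <- data.frame(
  X2 = c("Candidate", "Joe Biden", "Candidate", "Donald Trump", "Jill Stein"),
  X3 = c("%", "89.4", "%", "78.8", "0.0"),  # "0.0" is a placeholder value
  X4 = c("Votes", "2,410,731", "Votes", "1,505,027", "12,856"),
  X6 = c("state", "california", "state", "california", "california"),
  X7 = c("party", "Democrat", "party", "Republican", "Republican"),  # last label is a placeholder
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop header rows, rename columns, and set party and
# percent_vote to NA for the candidates listed in `non_presidential`.
tidy_state_table <- function(df, non_presidential = character(0)) {
  out <- df[df$X2 != "Candidate", , drop = FALSE]
  names(out) <- c("candidate", "percent_vote", "num_vote", "state", "party")
  blank <- out$candidate %in% non_presidential
  out$percent_vote[blank] <- NA
  out$party[blank] <- NA
  rownames(out) <- NULL
  out
}

california_demo <- tidy_state_table(toy_table, non_presidential = "Jill Stein")
print(california_demo)
```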
In this section we combine all the Super Tuesday state data frames into one tidy data frame containing five (5) variables and 205 rows. Once combined, we remove all the commas in the vote counts so the data can be used as numeric.
# Combine the individual state data frames into one
super_tuesday_combine <- bind_rows(alabama, arkansas, california, colorado, maine, massachusetts, minnesota, north_carolina, oklahoma, tennessee, texas, utah, vermont, virginia)
# Remove the commas and convert the vote counts to numeric
super_tuesday_combine$num_vote <- as.numeric(gsub(",", "", super_tuesday_combine$num_vote))
#super_tuesday_combine
In this section we compute the total number of votes cast in each state, and then the totals grouped by party affiliation.
state | state_total | state_party_total_d | state_party_total_r |
---|---|---|---|
alabama | 784804 | 186769 | 598035 |
arkansas | 347379 | 81205 | 266174 |
california | 4688485 | 2695980 | 1909119 |
colorado | 1439162 | 573357 | 865805 |
maine | 165400 | 63538 | 101862 |
massachusetts | 1188835 | 632869 | 555966 |
minnesota | 581940 | 242291 | 339649 |
north carolina | 1770034 | 694323 | 1075711 |
oklahoma | 403701 | 91561 | 312140 |
tennessee | 711514 | 133296 | 578218 |
texas | 3294966 | 976680 | 2318286 |
utah | 136207 | 61411 | 74796 |
vermont | 135796 | 63609 | 72187 |
virginia | 1044813 | 351141 | 693672 |
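The totals in the table above can be reproduced for a single state with base R. The sketch below uses the Alabama vote counts shown earlier; the helper `party_total` is hypothetical and is not necessarily how the summary was originally computed.

```r
# Alabama vote counts from the tables above, already converted to numeric.
votes <- data.frame(
  state = "alabama",
  party = c(rep("Democrat", 3), rep("Republican", 8)),
  num_vote = c(167165, 8391, 11213,
               497739, 77564, 8426, 1859, 1436, 748, 508, 9755)
)

# Total votes per state, regardless of party.
state_total <- aggregate(num_vote ~ state, data = votes, FUN = sum)
names(state_total)[2] <- "state_total"

# Hypothetical helper: total votes per state for one party.
party_total <- function(df, party_name, new_name) {
  sub <- df[!is.na(df$party) & df$party == party_name, ]
  out <- aggregate(num_vote ~ state, data = sub, FUN = sum)
  names(out)[2] <- new_name
  out
}

summary_demo <- merge(state_total,
                      party_total(votes, "Democrat", "state_party_total_d"),
                      by = "state")
summary_demo <- merge(summary_demo,
                      party_total(votes, "Republican", "state_party_total_r"),
                      by = "state")
print(summary_demo)
```

Because the state total is computed from all rows, it can exceed the sum of the two party totals when some candidates have an NA party, as is the case for California.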
In this section we import the state-level data on registered voters and party affiliations. This comes from the website https://worldpopulationreview.com/state-rankings/registered-voters-by-state. These figures are from October 2022, but should be accurate enough for our purposes of comparison with the voting electorate on Super Tuesday 2024. This import was much simpler because the website presents the data in a structured table, so the transformation and cleanup are also much simpler. The initial import required us to move the first row to the column names and then remove the first row as an observation.
In this section we are going to clean up the data frame a bit. We need to remove the commas and % in the values so we can do calculations later. We need to adjust the column names to be more workable, and we will need to mutate the population numbers to be comparable with the values in the other data frame (these are in thousands).
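A minimal sketch of this cleanup is shown here. The raw column names and the raw string formats (counts in thousands with commas, percentages with a % sign) are assumptions about the imported table, and the helper `clean_registration` is hypothetical.

```r
# Assumed shape of the raw import: counts in thousands, percentages as text.
raw_registration <- data.frame(
  State = c("Alabama", "Vermont"),
  RegisteredVoters = c("2,527", "365"),
  PercentRegistered = c("68.0%", "73.0%"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip commas and % signs, rename the columns, and
# scale the counts from thousands to individual voters.
clean_registration <- function(df) {
  data.frame(
    state = tolower(df$State),
    reg_voter_num = as.numeric(gsub(",", "", df$RegisteredVoters)) * 1000,
    perc_vote_pop = as.numeric(gsub("%", "", df$PercentRegistered)) / 100,
    stringsAsFactors = FALSE
  )
}

vote_2022_demo <- clean_registration(raw_registration)
print(vote_2022_demo)
```

Lower-casing the state names matters because the later merge matches on the `state` column, which is lower case in the Super Tuesday data.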
super_states <- state_final_2024$state
vote_2022af <- vote_2022[vote_2022$state %in% super_states, ]
combine_data_frame <- merge(state_final_2024, vote_2022af, by="state")
kable(combine_data_frame)
state | state_total | state_party_total_d | state_party_total_r | reg_voter_num | perc_vote_pop |
---|---|---|---|---|---|
alabama | 784804 | 186769 | 598035 | 2527000 | 0.680 |
arkansas | 347379 | 81205 | 266174 | 1361000 | 0.620 |
california | 4688485 | 2695980 | 1909119 | 18001000 | 0.694 |
colorado | 1439162 | 573357 | 865805 | 2993000 | 0.713 |
maine | 165400 | 63538 | 101862 | 832000 | 0.774 |
massachusetts | 1188835 | 632869 | 555966 | 3546000 | 0.724 |
minnesota | 581940 | 242291 | 339649 | 3436000 | 0.829 |
north carolina | 1770034 | 694323 | 1075711 | 5161000 | 0.698 |
oklahoma | 403701 | 91561 | 312140 | 1884000 | 0.673 |
tennessee | 711514 | 133296 | 578218 | 3742000 | 0.743 |
texas | 3294966 | 976680 | 2318286 | 13343000 | 0.718 |
utah | 136207 | 61411 | 74796 | 1468000 | 0.674 |
vermont | 135796 | 63609 | 72187 | 365000 | 0.730 |
virginia | 1044813 | 351141 | 693672 | 4541000 | 0.760 |
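With the merged table in hand, primary participation relative to registered voters is a simple ratio. The sketch below uses two rows copied from the table above; the column names `turnout_reg` and `turnout_reg_pct` are hypothetical and do not appear in the original analysis.

```r
# Two rows copied from the merged table above.
turnout_demo <- data.frame(
  state = c("alabama", "vermont"),
  state_total = c(784804, 135796),
  reg_voter_num = c(2527000, 365000)
)

# Share of registered voters who cast a primary vote (hypothetical columns).
turnout_demo$turnout_reg <- turnout_demo$state_total / turnout_demo$reg_voter_num
turnout_demo$turnout_reg_pct <- round(100 * turnout_demo$turnout_reg, 1)
print(turnout_demo)
```

Since the registration figures date from October 2022, the resulting percentages are approximate.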
## Merging super_tuesday_combine with combine_data_frame
merge_combine <- merge(super_tuesday_combine, combine_data_frame, by = "state", all.x = TRUE) %>%
  arrange(state, percent_vote)
#merge_combine
# Row-wise mean of the candidate votes, both party totals, and the registered-voter count
merge_combine_1 <- mutate(merge_combine, Average_total = rowMeans(select(merge_combine, num_vote, state_party_total_d, state_party_total_r, reg_voter_num)))
This step filters the data by party, producing one data frame for Democrats and one for Republicans.
# Party filters; filter() already drops rows where party is NA
Party_D <- merge_combine_1 %>% filter(party == "Democrat")
Party_R <- merge_combine_1 %>% filter(party == "Republican")
There are two plots below: one shows Democratic votes by state and the other shows Republican votes by state.
## Democratic by state
D_plot <- ggplot(state_final_2024, aes(x=state_party_total_d, y=state, fill=state)) +
geom_bar(stat = "identity") +
scale_x_continuous(labels = scales::comma) +
labs (x="Democratic Vote Totals", y="State") +
theme(plot.title = element_text(hjust=0.5)) +
ggtitle("Democratic Vote by State-Super Tuesday")
D_plot
## Republican by state
R_plot <- ggplot(state_final_2024, aes(x=state_party_total_r, y=state, fill=state)) +
geom_bar(stat = "identity") +
scale_x_continuous(labels = scales::comma) +
labs (x="Republican Vote Totals", y="State") +
theme(plot.title = element_text(hjust=0.5)) +
ggtitle("Republican Vote by State-Super Tuesday")
R_plot
The plot below shows each party's percentage of the total votes by state. Republicans account for the larger share in most states; California and Massachusetts, where Democrats cast more votes, are the exceptions.
state_final_2024$perc_rep <- (state_final_2024$state_party_total_r/state_final_2024$state_total)*100
state_final_2024$perc_dem <- (state_final_2024$state_party_total_d/state_final_2024$state_total)*100
state_data_long <- pivot_longer(state_final_2024, cols = c(perc_rep, perc_dem),
names_to="Party", values_to = "Percentage")
ggplot(state_data_long, aes(x = Percentage, y = state, fill = Party)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Percentage of Total Votes (%)", y = "State") +
ggtitle("Percentage of Total Votes by Party for Each State") +
scale_fill_manual(values = c("skyblue","salmon"),
labels = c("Democrat", "Republican")) +
geom_text(aes(label = sprintf("%.0f%%", Percentage)), position = position_dodge(width = 1.0), vjust = 0.5, hjust= -0.25) +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Super Tuesday stands as a pivotal juncture in American politics, offering a glimpse into the electorate’s priorities and the impact of campaign messaging. While numerous sources compile and present data, the challenge lies in identifying reliable and unbiased information that truly captures voter sentiment. Our team conducted a quick review of various sources and decided to go with Ballotpedia’s Super Tuesday primaries and worldpopulationreview. The initial step involved meticulously cleaning and organizing the data extracted from HTML sources to facilitate comprehensive state-by-state analysis. This process included several stages of data refinement to ensure accuracy and relevance.
Key metrics scrutinized were voter turnout and registration trends. References [1], [2], and [3] indicate that these factors are crucial indicators of electoral success for candidates. Our analysis revealed a pronounced Republican presence at the polls compared to Democrats: Republican vote totals exceeded Democratic totals in 12 of the 14 states analyzed, including Minnesota and traditionally Republican strongholds like Tennessee, with California and Massachusetts as the exceptions. The disparity between registered Republicans and Democrats, relative to the total number of registered voters, further underscores the enthusiasm within the Republican electorate.
Moreover, we observed the significant influence of uncommitted votes, particularly those aligned with Nikki Haley, which could potentially sway election outcomes. The disposition of Haley’s voter base—whether they support Joe Biden or Donald Trump—merits further investigation. The magnitude of Haley’s support could be a decisive factor in flipping battleground states.
In summary, our analysis underscores the complexity of voter behavior and the multifaceted nature of political engagement, setting the stage for a dynamic electoral landscape. On a project note, gathering and working with various data sources posed a significant challenge, as did managing the complexity of teamwork. It took us some time to collaborate effectively, and additional time was required to develop the code for analysis and presentation.
GitHub: https://github.com/kohyarp/DATA607_Project3
Rpub: https://rpubs.com/kohyarp/1163803
https://electionlab.mit.edu/articles/voter-confidence-and-electoral-participation
https://time.com/6223871/midterms-voting-turnout-results/
https://escholarship.org/content/qt83s3p2t2/qt83s3p2t2_noSplash_5f3d629c82c3feb4af00babba1b97e8d.pdf
Ballotpedia’s Super Tuesday primaries (https://ballotpedia.org/Super_Tuesday_primaries,_2024)
worldpopulationreview (https://worldpopulationreview.com/state-rankings/registered-voters-by-state)
https://www.census.gov/newsroom/press-releases/2022/2020-presidential-election-voting-report.html
https://www.pewresearch.org/politics/2019/03/14/political-independents-who-they-are-what-they-think/