DATA607 Project III

Introduction

While the presidential election season is in full swing, we decided to explore the polling data sources that exist online. Many individual sources are available; however, the website RealClearPolitics gathers, summarizes, and presents the results of the various polls in one location. While this website is good for a summary view, the underlying polling data must be extracted from the individual polling sources (if available) for further review and analysis. The polling sources include Emerson College, The Economist, The New York Times/Siena College, CBS News, and many others. Some sources are free, while others charge a fee. Polls tend to distinguish between “Registered Voters” (RV) and “Likely Voters” (LV), and the common belief is that LV results are more indicative of election outcomes. However, a Berkeley Haas study in 2020 reported that although polls typically report a 95% confidence level, the actual election results matched the polls only 60% of the time.

Data Sources

We are currently in discussion to identify the data sources and the type of analysis we wish to perform. The sources are varied and include tables on websites, attached PDF documents, and CSV files. Some will require us to prepare the data on another platform before we can evaluate and analyze it. This also needs to include a matching/pairing of questions and responses across polls to ensure the questions are equivalent. Data sources identified so far include The New York Times/Siena, Roanoke College, and Emerson College polls.

Code Initialization

Here I load the required libraries and ensure all the required packages are installed before running the following blocks of code.

## [1] "All required packages are installed"
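The installation check can be sketched as follows. This is a minimal sketch, not the code that produced the output above: `missing_packages()` is a hypothetical helper, and the package list is an assumption inferred from the functions called later in this report.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Assumed package list, inferred from functions used later in the report
required <- c("rvest", "dplyr", "ggplot2", "knitr", "reactable")
to_install <- missing_packages(required)

if (length(to_install) == 0) {
  print("All required packages are installed")
} else {
  # Report rather than install, so the check has no side effects
  message("Missing packages: ", paste(to_install, collapse = ", "))
}
```

Anything reported as missing could then be passed to `install.packages()` before the libraries are loaded.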

Load files from GitHub (KP)

All our files are stored in the GitHub Data/* directory for productivity and collaboration. In this section, I verify the list of files in the Data folder and then load them all into R. All files are in CSV format and are readily accessible by RStudio. However, since they originate from different sources, we must first tidy, clean, and organize them.

## [1] "NYT_Sienna%20Poll_table_1.csv"     "NYT_Sienna%20Poll_table_2.csv"    
## [3] "RC%20Poll%20Topline%20Feb2024.txt" "president_polls.csv"              
## [5] "us02212024_regvoter_upox75.txt"

The chunk options echo and message are both set to FALSE so that unneeded output is not shown.

Data Import (AC)

https://ballotpedia.org/Super_Tuesday_primaries,_2024

url <- "https://ballotpedia.org/Super_Tuesday_primaries,_2024"
webpage <- read_html(url)
st_table <- html_nodes(webpage, '.portal-section')
table_names <- c("Alabama", "Alaska", "American Samoa", "Arkansas", "California", "Colorado", "Iowa", "Maine", "Massachusetts", "Minnesota", "North_Carolina", "Oklahoma", "Tennessee", "Texas", "Utah", "Vermont", "Virginia")
table_frames <- list()
for (i in seq_along(st_table)) {
  table_data <- html_table(st_table[[i]])
  table_name <- table_names[i]
  table_frames[[table_name]] <- table_data
}
# Sample Data Frame
print("An Example of Raw Data is shown for information")

## [1] "An Example of Raw Data is shown for information"

kable(table_frames$Alabama)

X1	X2	X3	X4	X5
NA	Candidate	%	Votes	Pledged delegates
NA	Joe Biden	89.5	167,165	52
NA	Dean Phillips	4.5	8,391	0
NA	Other	6.0	11,213	0
NA	99% reporting Source	Total votes: 186,769 • Total pledged delegates: 52	NA	NA
NA	Candidate	%	Votes	Pledged delegates
NA	Donald Trump	83.2	497,739	50
NA	Nikki Haley	13.0	77,564	0
NA	Ron DeSantis	1.4	8,426	0
NA	Vivek Ramaswamy	0.3	1,859	0
NA	Chris Christie	0.2	1,436	0
NA	David James Stuckenberg	0.1	748	0
NA	Ryan Binkley	0.1	508	0
NA	Other	1.6	9,755	0
NA	99% reporting Source	Total votes: 598,035 • Total pledged delegates: 50	NA	NA

Remove Extraneous Rows (AC)

for (i in seq_along(table_frames)){
  # Filter the rows that contain the character string "% reporting"
  table_frames[[i]] <- table_frames[[i]][!grepl("% reporting", table_frames[[i]]$X2), ]
}
# Loop through each data frame removing the row that contains "Source" in column X2
for (i in seq_along(table_frames)){
  # Filter the rows that contain the character string "Source"
  table_frames[[i]] <- table_frames[[i]][!grepl("Source", table_frames[[i]]$X2), ]
}
# for (i in seq_along(table_frames)) {
#  print(table_frames[[i]])
#}
#table_frames[[1]]
print("Data is cleaned by removing 99% reporting, just the first 10 rows are shown")

## [1] "Data is cleaned by removing 99% reporting, just the first 10 rows are shown"

head(table_frames[[1]],10)

## # A tibble: 10 × 5
##    X1    X2              X3    X4      X5               
##    <lgl> <chr>           <chr> <chr>   <chr>            
##  1 NA    Candidate       %     Votes   Pledged delegates
##  2 NA    Joe Biden       89.5  167,165 52               
##  3 NA    Dean Phillips   4.5   8,391   0                
##  4 NA    Other           6.0   11,213  0                
##  5 NA    Candidate       %     Votes   Pledged delegates
##  6 NA    Donald Trump    83.2  497,739 50               
##  7 NA    Nikki Haley     13.0  77,564  0                
##  8 NA    Ron DeSantis    1.4   8,426   0                
##  9 NA    Vivek Ramaswamy 0.3   1,859   0                
## 10 NA    Chris Christie  0.2   1,436   0
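The two filtering loops above could also be collapsed into a single pass. The following is a minimal sketch in base R on a toy table; the values are copied from the Alabama output, and `drop_footer()` is a hypothetical helper name.

```r
# Toy stand-in for one scraped table (values copied from the Alabama output)
alabama_raw <- data.frame(
  X2 = c("Candidate", "Joe Biden", "Dean Phillips", "Other", "99% reporting Source"),
  X4 = c("Votes", "167,165", "8,391", "11,213", NA),
  stringsAsFactors = FALSE
)
tables <- list(Alabama = alabama_raw)

# Hypothetical helper: one combined pattern drops rows matching either string
drop_footer <- function(df) {
  df[!grepl("% reporting|Source", df$X2), , drop = FALSE]
}

# Apply the same filter to every table in the list
tables <- lapply(tables, drop_footer)
print(tables$Alabama)
```

Using `lapply()` keeps the list names intact, so each state's table can still be retrieved by name afterwards.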

Adding Columns for State and Party affiliation (AC)

Extracting of Data Frames from List for Super Tuesday States (AC)

Here we select the proper columns and drop the unnecessary ones. This step also places “Democrat” into the party column for ALL candidates, which will be fixed in a later section by assigning “Republican” to the correct candidates. It also pulls the individual state data frames out of the list so they can be worked on individually.

X2	X3	X4	X6	X7
Candidate	%	Votes	state	party
Joe Biden	89.5	167,165	alabama	Democrat
Dean Phillips	4.5	8,391	alabama	Democrat
Other	6.0	11,213	alabama	Democrat
Candidate	%	Votes	state	party
Donald Trump	83.2	497,739	alabama	Democrat
Nikki Haley	13.0	77,564	alabama	Democrat
Ron DeSantis	1.4	8,426	alabama	Democrat
Vivek Ramaswamy	0.3	1,859	alabama	Democrat
Chris Christie	0.2	1,436	alabama	Democrat
David James Stuckenberg	0.1	748	alabama	Democrat
Ryan Binkley	0.1	508	alabama	Democrat
Other	1.6	9,755	alabama	Democrat
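The code for this step is not shown in the report, so the following is only a sketch of one way to produce a table shaped like the one above. The helper name `add_state_party()` is hypothetical, and the toy input reuses the first Alabama rows.

```r
# Toy stand-in for one cleaned table (first rows of the Alabama output)
alabama <- data.frame(
  X1 = NA,
  X2 = c("Candidate", "Joe Biden", "Dean Phillips", "Other"),
  X3 = c("%", "89.5", "4.5", "6.0"),
  X4 = c("Votes", "167,165", "8,391", "11,213"),
  X5 = c("Pledged delegates", "52", "0", "0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the three useful columns, then tag state and party
add_state_party <- function(df, state_name) {
  out <- df[, c("X2", "X3", "X4")]
  is_header <- out$X2 == "Candidate"
  # Header rows get column labels; all other rows get the state and a placeholder party
  out$X6 <- ifelse(is_header, "state", state_name)
  out$X7 <- ifelse(is_header, "party", "Democrat")
  out
}

alabama <- add_state_party(alabama, "alabama")
print(alabama)
```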

Substituting Republican in the rows (AC)

This uses a filter that identifies the second occurrence of the “Candidate” header row in column X2, which separates the Republican candidates from the Democratic candidates, and then replaces “Democrat” with “Republican” in the rows that follow. This is done for each individual data frame for the Super Tuesday states.

X2	X3	X4	X6	X7
Candidate	%	Votes	state	party
Joe Biden	89.5	167,165	alabama	Democrat
Dean Phillips	4.5	8,391	alabama	Democrat
Other	6.0	11,213	alabama	Democrat
Candidate	%	Votes	state	party
Donald Trump	83.2	497,739	alabama	Republican
Nikki Haley	13.0	77,564	alabama	Republican
Ron DeSantis	1.4	8,426	alabama	Republican
Vivek Ramaswamy	0.3	1,859	alabama	Republican
Chris Christie	0.2	1,436	alabama	Republican
David James Stuckenberg	0.1	748	alabama	Republican
Ryan Binkley	0.1	508	alabama	Republican
Other	1.6	9,755	alabama	Republican
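The relabeling logic described above can be sketched in base R. This is an illustration under the assumption that the second "Candidate" row always marks the start of the Republican block; `relabel_republicans()` is a hypothetical helper, not the report's own code.

```r
# Toy table with two "Candidate" header rows, as in the Alabama output
df <- data.frame(
  X2 = c("Candidate", "Joe Biden", "Dean Phillips", "Other",
         "Candidate", "Donald Trump", "Nikki Haley"),
  X7 = c("party", "Democrat", "Democrat", "Democrat",
         "party", "Democrat", "Democrat"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows after the second header become Republican
relabel_republicans <- function(df) {
  header_rows <- which(df$X2 == "Candidate")
  if (length(header_rows) >= 2) {
    after_second <- seq_len(nrow(df)) > header_rows[2]
    df$X7[after_second] <- "Republican"
  }
  df
}

alabama <- relabel_republicans(df)
print(alabama)
```

Wrapping the step in a function lets the same rule be applied to every state with `lapply()` instead of repeating it per data frame.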

Removing Candidate Rows (AC)

In this section, we remove the “Candidate” rows from each of the data frames and rename the columns from X-style names to what they actually represent. We also address the accidental import of California candidates who are not running for the national office of president; they were imported because they appear in the same table on the website. Those candidates had their party and percent_vote values changed to NA.

candidate	percent_vote	num_vote	state	party
Joe Biden	89.4	2,410,731	california	Democrat
Marianne Williamson	3.7	99,321	california	Democrat
Dean Phillips	2.8	75,313	california	Democrat
Armando Perez-Serrato	1.2	32,927	california	Democrat
Gabriel Cornejo	1.2	31,985	california	Democrat
President Boddie	0.7	19,651	california	Democrat
Stephen Lyons Sr.	0.6	16,946	california	Democrat
Eban Cambridge	0.3	9,106	california	Democrat
Donald Trump	78.8	1,505,027	california	Republican
Nikki Haley	17.8	339,502	california	Republican
Ron DeSantis	1.5	27,820	california	Republican
Chris Christie	0.9	16,476	california	Republican
Vivek Ramaswamy	0.4	8,171	california	Republican
Rachel Hannah Swift	0.2	3,393	california	Republican
David James Stuckenberg	0.2	3,086	california	Republican
Ryan Binkley	0.2	2,947	california	Republican
Asa Hutchinson	0.1	2,697	california	Republican
Jill Stein	NA	12,856	california	NA
Charles Ballay	NA	19,043	california	NA
James P. Bradley	NA	40,290	california	NA
Claudia De La Cruz	NA	5,273	california	NA
Cornel West	NA	4,433	california	NA
Jasmine Sherman	NA	1,491	california	NA

Combining the Entire Data Frame (AC)

In this section we combine all the Super Tuesday state data frames into one tidy data frame containing five (5) variables and 205 rows. Once they are combined, we remove all the commas in the vote counts so we can use the data as numeric.

# The Combine
super_tuesday_combine <- bind_rows(alabama, arkansas, california, colorado, maine, massachusets, minnesota, north_carolina, oklahoma, tennessee, texas, utah, vermont, virginia)

# Remove the commas and convert the counts from character to numeric
super_tuesday_combine$num_vote <- as.numeric(gsub(",", "", super_tuesday_combine$num_vote))
#super_tuesday_combine

Total Votes Cast by Party Affiliation and State (AC)

In this section we calculate the total number of votes cast in each state, and then the totals within each state by party affiliation.

state	state_total	state_party_total_d	state_party_total_r
alabama	784804	186769	598035
arkansas	347379	81205	266174
california	4688485	2695980	1909119
colorado	1439162	573357	865805
maine	165400	63538	101862
massachusets	1188835	632869	555966
minnesota	581940	242291	339649
north_carolina	1770034	694323	1075711
oklahoma	403701	91561	312140
tennnessee	711514	133296	578218
texas	3294966	976680	2318286
utah	136207	61411	74796
vermont	135854	63609	72245
virginia	1044813	351141	693672
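The aggregation code is not shown, so here is a minimal base R sketch of how such totals can be computed. It uses the Alabama rows from earlier in this report; the result is in long format rather than the wide layout of the table above.

```r
# Alabama rows copied from the tables earlier in this report
votes <- data.frame(
  state = "alabama",
  party = c(rep("Democrat", 3), rep("Republican", 8)),
  num_vote = c("167,165", "8,391", "11,213",
               "497,739", "77,564", "8,426", "1,859", "1,436", "748", "508", "9,755"),
  stringsAsFactors = FALSE
)

# Strip thousands separators and convert to numeric before summing
votes$num_vote <- as.numeric(gsub(",", "", votes$num_vote))

# Totals by state and party, then by state alone
party_totals <- aggregate(num_vote ~ state + party, data = votes, FUN = sum)
state_totals <- aggregate(num_vote ~ state, data = votes, FUN = sum)

print(party_totals)
print(state_totals)
```

For Alabama this reproduces the 186769, 598035, and 784804 shown in the first row of the table.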

Importing Next Dataset (AC)

In this section we import the state-level data on registered voters and party affiliations. This comes from the website https://worldpopulationreview.com/state-rankings/registered-voters-by-state. These figures are from October 2022, but should be accurate enough for our purposes of comparison with the voting electorate on Super Tuesday 2024. This import was much simpler because the website presents the data in a structured table, so the transformation and cleanup are also much simpler. The initial import required us to move the first row to the column names and then remove the first row as an observation.
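Promoting the first row to column names can be sketched as below. The column labels and raw values here are illustrative assumptions, not the website's actual headers.

```r
# Toy import where the header landed in the first data row
# (column labels and values are assumed for illustration)
raw <- data.frame(
  X1 = c("State", "Alabama", "Arkansas"),
  X2 = c("Registered Voters", "2,527", "1,361"),
  stringsAsFactors = FALSE
)

# Use the first row as the column names, then drop it as an observation
names(raw) <- as.character(unlist(raw[1, ]))
vote_2022 <- raw[-1, , drop = FALSE]
rownames(vote_2022) <- NULL

print(vote_2022)
```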

Tidying and Transforming (AC)

In this section we are going to clean up the data frame a bit. We need to remove the commas and % in the values so we can do calculations later. We need to adjust the column names to be more workable, and we will need to mutate the population numbers to be comparable with the values in the other data frame (these are in thousands).
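The cleanup described above might look like the following sketch. The raw column names and the string formats ("2,527", "68.0%") are assumptions chosen so that the result matches the values shown later for Alabama and Arkansas.

```r
# Assumed raw layout: counts in thousands with commas, percentages with a % sign
vote_2022 <- data.frame(
  State = c("Alabama", "Arkansas"),
  `Registered Voters` = c("2,527", "1,361"),
  `Percent` = c("68.0%", "62.0%"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Workable column names and lower-case state keys
names(vote_2022) <- c("state", "reg_voter_num", "perc_vote_pop")
vote_2022$state <- tolower(vote_2022$state)

# Strip commas, convert to numeric, and scale thousands to raw counts
vote_2022$reg_voter_num <- as.numeric(gsub(",", "", vote_2022$reg_voter_num)) * 1000

# Strip the percent sign and express as a proportion
vote_2022$perc_vote_pop <- as.numeric(gsub("%", "", vote_2022$perc_vote_pop)) / 100

print(vote_2022)
```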

Filtering the State Level Data to Only Super Tuesday States (AC)

super_states <- state_final_2024$state
vote_2022af <- vote_2022[vote_2022$state %in% super_states, ]

combine_data_frame <- merge(state_final_2024, vote_2022af, by="state")

kable(combine_data_frame)

state	state_total	state_party_total_d	state_party_total_r	reg_voter_num	perc_vote_pop
alabama	784804	186769	598035	2527000	0.680
arkansas	347379	81205	266174	1361000	0.620
california	4688485	2695980	1909119	18001000	0.694
colorado	1439162	573357	865805	2993000	0.713
maine	165400	63538	101862	832000	0.774
minnesota	581940	242291	339649	3436000	0.829
oklahoma	403701	91561	312140	1884000	0.673
texas	3294966	976680	2318286	13343000	0.718
utah	136207	61411	74796	1468000	0.674
vermont	135854	63609	72245	365000	0.730
virginia	1044813	351141	693672	4541000	0.760
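The merged table has 11 rows, while the vote totals table lists 14 states: massachusets, north_carolina, and tennnessee are absent. By default `merge()` keeps only keys present in both inputs, so one possible cause is a spelling mismatch between the two state columns. The sketch below shows how such mismatches could be detected; the spellings on the registered-voter side are assumptions.

```r
# State keys as they appear in the vote totals table (subset)
state_final_2024 <- data.frame(
  state = c("alabama", "massachusets", "north_carolina", "tennnessee", "texas"),
  stringsAsFactors = FALSE
)

# Assumed spellings on the registered-voter side
vote_2022 <- data.frame(
  state = c("alabama", "massachusetts", "north carolina", "tennessee", "texas"),
  stringsAsFactors = FALSE
)

# Keys that would be silently dropped by an inner merge
unmatched <- setdiff(state_final_2024$state, vote_2022$state)
print(unmatched)
```

Running a check like this before merging makes dropped rows visible, so the keys can be harmonized first.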

Merging both Data sets (JN)

## Merging Super_tuesday_combine with combine_data_frame
merge_combine<- merge(super_tuesday_combine,combine_data_frame, by="state",all.x = TRUE) %>%
  arrange((state), percent_vote)

#merge_combine

Analysis (JN)

Calculate Average (NJ)

To better visualize the data, I provide a table that displays each state with its average total.

# Average across the candidate vote count, both party totals, and registered voters
merge_combine_1 <- mutate(merge_combine, "Average_total"= rowMeans( select(merge_combine,"num_vote", "state_party_total_d", "state_party_total_r", "reg_voter_num") ) )

reactable(select(merge_combine_1, "state", "Average_total"))

Filter Votes by Party (NJ)

This step filters the data by party into Democratic and Republican subsets.

# Party filters; filter() already drops rows where party is NA
Party_D <- merge_combine_1 %>% filter(party == "Democrat")
Party_R <- merge_combine_1 %>% filter(party == "Republican")

Democratic vs Republican vote by state (NJ)

There are two plots below: one shows Republican votes by state and the other shows Democratic votes by state.

## Democratic by state
# Plot the Democratic subset with a bare column name; geom_bar() counts rows per state and takes no binwidth
D_plot <- ggplot(Party_D, aes(y = state, fill = state)) + geom_bar() +
  ggtitle("Democratic Vote By State") +
  theme(plot.title = element_text(hjust=0.5)) +
  ylab("State Vote") +
  xlab("Democratic Party")

D_plot

## Republican by state
# Plot the Republican subset with a bare column name; geom_bar() counts rows per state and takes no binwidth
R_plot <- ggplot(Party_R, aes(y = state, fill = state)) + geom_bar() +
  ggtitle("Republican Vote By State") +
  theme(plot.title = element_text(hjust=0.5)) +
  ylab("State Vote") +
  xlab("Republican Party")

R_plot

Party Percentage by state (NJ)

The plot below shows the party percentage by state. We can see that the Republican Party has a higher percentage by state than the Democratic Party.

# after_stat(prop) replaces the deprecated ..prop.. notation; stat is an argument, not an aesthetic
myplot <- ggplot(merge_combine_1, aes(party, group = state)) +
  geom_text(aes(label = scales::percent(after_stat(prop)),
                y = after_stat(prop)), stat = "count", vjust = -.5) +
  geom_bar(aes(y = after_stat(prop), fill = state), stat = "count") +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Party Percentage by State") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ylab("Percentage of Voters by State") +
  xlab("Party")

myplot
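As a complement to the plot, the party share of votes in each state can be computed directly from the totals table shown earlier. This is a minimal sketch using three rows copied from that table; the `share_d` and `share_r` column names are hypothetical.

```r
# Three rows copied from the state totals table earlier in this report
totals <- data.frame(
  state = c("alabama", "massachusets", "vermont"),
  state_party_total_d = c(186769, 632869, 63609),
  state_party_total_r = c(598035, 555966, 72245)
)

# Share of the two-party vote won by each party
two_party <- totals$state_party_total_d + totals$state_party_total_r
totals$share_d <- totals$state_party_total_d / two_party
totals$share_r <- totals$state_party_total_r / two_party

print(totals[, c("state", "share_d", "share_r")])
```

These shares are based on votes cast rather than on the number of candidate rows, so they may differ from the proportions displayed by a count-based bar chart.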

Number of vote (JN)

Average Total by Party (JN)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

DATA607 Project III - Teamwork

2024-03-18