Question One: Load both datasets (election polls, henceforth polls, and election results, henceforth results) into R. Answer: To do this data coding, or any coding, I need to load in my tidyverse using the library(tidyverse) code. Then I set my working directory, which then will allow me to tell R where and which file to read. Then to confirm that the data has been loaded in, I use the view() function so I can see the dataframe in a separate tab. Then I do the same thing for the other file. check read command, also try excel file Answer: To do this I load in the libraries I am going to need to complete the assignment. Then I set my working directory to my Dropbox DATA101 folder. Then I read in my two dataframes, one using the path to my raw data folder and another I do the same thing, but for the dataframe I pulled from the website. To ensure I have done this right, I added view() to each so I could check that everything came in smoothly.

library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
require(dplyr)
setwd("~/Dropbox/DATA101/")
polls <- read_csv(file = "Data/Raw/president_polls_historical.csv")
## Rows: 16727 Columns: 44
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (23): pollster, sponsors, display_name, pollster_rating_name, fte_grade,...
## dbl (14): poll_id, pollster_id, pollster_rating_id, transparency_score, spon...
## num  (1): sponsor_ids
## lgl  (6): subpopulation, tracking, internal, seat_name, nationwide_batch, ra...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(polls)
#########################################################################################################
setwd("~/Dropbox/DATA101/")
results <- read_csv(file = "Data/Raw/1976-2020-president.csv")
## Rows: 4287 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): state, state_po, office, candidate, party_detailed, party_simplified
## dbl (7): year, state_fips, state_cen, state_ic, candidatevotes, totalvotes, ...
## lgl (2): writein, notes
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(results)

Question Two: Keep only the rows in polls with a start date on or after June 1, 2020 and where the candidate is either Joe Biden or Donald Trump. Answer: To do this, I tell R that I want to take the dataframe loaded from the above code called “polls” which I want to manipulate to meet the requirements of the task for Q2. Then I mutate my start_time so I can have it as month, day, year and filter so I can have only Trump and Biden since June 1, 2020. Then I select what variables I want to keep for the remainder of the assignment. I decided to simplify it so I could add later and make it easier for me to manage.

polls <- polls %>% 
  mutate(start_date = mdy(start_date)) %>% 
  filter(start_date >= as.Date('2020-06-01') & (answer == "Biden" | answer == "Trump")) %>%
  select(poll_id, pollster_id, answer, candidate_name, start_date, state, pct, methodology)

Question Three: Let’s explore this data. Create a scatterplot with a trend line that shows the change in each candidate’s support over the course of the race. Use only national (not state) polls. Use red for Trump’s support and blue for Biden’s support. What do you observe in the final weeks of the campaign? Answer: Now comes a harder task, which is changing the data and plotting it. I start by getting the information I need for Trump and doing the same for Biden. I do this by using filter, group_by, and summarise to compile all of the information that I need. Then I begin to plot, which I do by setting one line for Trump and one for Biden, distinguished by color (Biden, blue and Trump, red). Then I completed all of the labs, color scales, and themes. To finalize my plotting, I adjusted the legend to match perfectly with the plot. Data Analysis: It is evident by the plot that Biden had a significantly higher support rate, than Trump, which consequently resulted in Biden’s presidency. We can see that Biden’s support remained stable without significant increases or decreases; however, Trump, saw an increase and then leveled out followed by another increase. Between July and October of 2020, he did not see any significant change in his support.

library(ggplot2)
library(ggplot2)

trump_data <- polls %>%
  filter(state == "" | is.na(state), candidate_name == "Donald Trump") %>%
  group_by(start_date) %>%
  summarise(avgpoll = mean(pct, na.rm = TRUE))
biden_data <- polls %>%
  filter(state == "" | is.na(state), candidate_name == "Joe Biden") %>%
  group_by(start_date) %>%
  summarise(avgpoll = mean(pct, na.rm = TRUE))
ggplot() +
  geom_point(data = trump_data, aes(x = start_date, y = avgpoll, color = "Donald Trump"), alpha = 0.5) +
  geom_smooth(data = trump_data, aes(x = start_date, y = avgpoll), method = "loess", color = "red") +
  geom_point(data = biden_data, aes(x = start_date, y = avgpoll, color = "Joe Biden"), alpha = 0.5) +
  geom_smooth(data = biden_data, aes(x = start_date, y = avgpoll), method = "loess", color = "blue") +
  labs(title = "Candidate Support June - November 2020",
       x = "Date",
       y = "Average Poll Percentage",
       color = "Candidate") +
  scale_color_manual(values = c("Donald Trump" = "red", "Joe Biden" = "blue"),
                     labels = c("Donald Trump", "Joe Biden")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(color = guide_legend(title = "Candidate"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Question Four: Some observers believe that there is a “shy Trump voter” effect, where Trump voters are more likely to reveal their support for Trump to a machine compared to a human. Compare the scatterplot you just created for two groups using facets: polls that used “Live Phone” methods (methodology) compared to those that did not. Do you think there is a “shy Trump voter” effect? Answer: NOTE: Please disregard any repeat library() entries, this allowed me to be organized while I was coding since it does not confuse R and I still get the same outcome. To start this task, I make a new variable called avgpoll, which is going to become what it sounds like, the average of the polls. I do this by grouping and then summarising where I can do the mathematical part. I find the mean of the percentage, which gives me the statistics I am looking for when creating this variable. Then, I create another new variable, which is called shyvoters because this is going to be the variable I am using to represent those who are shyvoters and those who are not. You can see by the mutate line that I did this by utilizing the ifelse function. Then I used a left join to merge these two new variables together so I can plot the result. Then I begin the plotting process where I can make two different graphs simultaneously one for Phone calls and one for all other methods of survey. I also use a facet wrap to create the grid of these plots. Also in this coding is the color scale which is used to identify the candidates. To finalize, I label the grids, titles, and elements by adjusting the labs and theme. Lastly, I put the legend on the bottom where I felt it fit best. Data Analysis: It is evident from the two graphs that when asked over a Live Phone Call, there is voter shyness when the individual is in support of trump. Over the phone, there is a higher amount of people saying they are in support of Biden (blue) than when people are asked over a non-live format such as website or test. This is indicatory that people tend to be more shy when expressing support for Trump (red) when asked over the phone.

library(ggplot2)
library(lubridate)
library(dplyr)
library(ggplot2)

avgpoll <- polls %>%
  group_by(start_date, answer, methodology) %>%
  summarise(avgpoll = mean(pct, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'start_date', 'answer'. You can override
## using the `.groups` argument.
shyvoters <- polls %>% 
  mutate(shy_trump_voter = ifelse(methodology %in% c("Live Phone", "IVR", "IVR/Live Phone", "IVR/Text", "IVR/Live Phone/Text", "IVR/Text-to-Web", "IVR/Text-to-Web/Online Ad", "IVR/Text-to-Web/Online Ad", "Live Phone/Text", "IVR/Live Phone/Text-to-Web"), 1, 0))

shyvoters <- left_join(shyvoters, avgpoll, by = c("start_date", "answer", "methodology"))

ggplot(shyvoters, aes(x = start_date, y = avgpoll, color = answer)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = "loess") +
  scale_color_manual(values = c("Biden" = "blue", "Trump" = "red")) +
  facet_wrap(~ shy_trump_voter, ncol = 2, labeller = labeller(shy_trump_voter = c("0" = "Live Phone", "1" = "Not Live Phone Call"))) +
  labs(title = "Candidate Support June - November 2020", x = "Date", y = "Average Poll Percentage", color = "Candidate") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "bottom")

Question Five: Keep only the rows from polls that contain state polls. Summarize the “average” poll result for each state in the 2020 election. That is, what is the average percentage point difference between the Democratic candidate’s vote share and the Republican candidate’s vote share for each state’s polls? You should also remove states for which fewer than 10 polls were conducted, since data is unreliable when there are so few polls. Your new data frame should have 53 rows and your new variable will have positive and negative values. Hint: question_id is a unique identifier for each poll. Answer: To start this task, I create a new variable called state_polls, then in the next step, I mutate the parties, so I can prepare for the separate Trump vs. Biden mathematical lines. By mathematical lines, I mean the lines below that are calculcating the mean for difference, Trump, and Biden. Then I finish up wth a mutate to ensure the states on all the data are lowercase, which I did by using the tolower function. Then I print the data into one table. Also, I told R to not include any states that have less than 10 votes. Data Analysis: As far as data analysis, there is not much to say other than the numbers that are on the table. The table is organized in alphabetical order.

library(dplyr)

state_polls <- polls %>%
  filter(!is.na(state) & state != "")

state_polls <- state_polls %>%
  mutate(party = ifelse(candidate_name == "Joe Biden", "DEMOCRAT",
                        ifelse(candidate_name == "Donald Trump", "REPUBLICAN", NA))) %>%
  group_by(poll_id) %>%
  summarise(pct_diff = mean(ifelse(party == "DEMOCRAT", pct, -pct), na.rm = TRUE))

state_avg_diff <- state_polls %>%
  left_join(polls, by = "poll_id") %>%
  group_by(state) %>%
  summarise(poll_id = first(poll_id), 
            avg_diff = mean(pct_diff, na.rm = TRUE),
            avg_biden = mean(ifelse(candidate_name == "Joe Biden", pct, NA), na.rm = TRUE),
            avg_trump = mean(ifelse(candidate_name == "Donald Trump", pct, NA), na.rm = TRUE)) %>%
  mutate(state = tolower(state))  

state_avg_diff <- state_avg_diff %>%
  filter(n_distinct(poll_id) >= 10)

print(state_avg_diff)
## # A tibble: 55 × 5
##    state                poll_id avg_diff avg_biden avg_trump
##    <chr>                  <dbl>    <dbl>     <dbl>     <dbl>
##  1 alabama                67048    -9.48      39.1     58.1 
##  2 alaska                 67028    -4.08      44.4     52.6 
##  3 arizona                66820     2.03      49.5     45.4 
##  4 arkansas               66797    -9.77      39.1     58.7 
##  5 california             67892    13.9       62.1     34.4 
##  6 colorado               66973     7.89      56.2     40.4 
##  7 connecticut            70339    13.4       61.9     35.2 
##  8 delaware               68194    13.9       62.9     35.0 
##  9 district of columbia   70338    39.3       88.2      9.52
## 10 florida                66819     1.13      48.8     46.5 
## # ℹ 45 more rows

Question Six: Using the results data frame, generate a variable that represents the actual percentage point difference between the Democratic candidate’s vote share and the Republican candidate’s vote share in each state election in 2020. We will not need results for candidates from parties other than the Democratic and Republican parties for this exam. Answer: Now that I am working with my other raw data, I am going to filter it to be what I need the processed data to be. Then I select which variables I want to have followed by the pivot_wider() function. Then I mutate to make the state names lower by using the function tolower. Then I group_by state and summarise to get the statistics I am looking for. Data Analysis: There is no data analysis because we are just creating a table; however, I will note that this table is alphabetical order by state.

library(dplyr)

FilteredResults <- results %>% 
  filter(year == 2020, party_simplified %in% c("DEMOCRAT", "REPUBLICAN")) %>% 
  select(state, party_simplified, candidatevotes, totalvotes) %>% 
  pivot_wider(names_from = party_simplified, values_from = candidatevotes) %>% 
  mutate(state = tolower(state)) %>% 
  group_by(state) %>%
  summarise(DemVoteShare = round((DEMOCRAT / totalvotes) * 100, 2), 
            RepVoteShare = round((REPUBLICAN / totalvotes) * 100, 2), 
            Vote_Share_Difference = DemVoteShare - RepVoteShare) 

print(FilteredResults)
## # A tibble: 51 × 4
##    state                DemVoteShare RepVoteShare Vote_Share_Difference
##    <chr>                       <dbl>        <dbl>                 <dbl>
##  1 alabama                      36.6         62.0               -25.5  
##  2 alaska                       42.8         52.8               -10.1  
##  3 arizona                      49.4         49.1                 0.300
##  4 arkansas                     34.8         62.4               -27.6  
##  5 california                   63.5         34.3                29.2  
##  6 colorado                     55.0         41.6                13.4  
##  7 connecticut                  59.3         39.2                20.1  
##  8 delaware                     58.7         39.8                19.0  
##  9 district of columbia         92.2          5.4                86.8  
## 10 florida                      47.9         51.2                -3.36 
## # ℹ 41 more rows

Question Seven: Join the data frames from questions 5 and 6 together, keeping only those rows that match in both data frames. Answer: This was not a long process at all. I tell R what I would like the joined data to be henceforth called which is “joined_df”. Df is an abbreviation for dataframe. Then I tell R that I am combining it with Filtered Results, then I organized it by State. Finally, I tell it to print so I can see all the variables and values. Data Analysis: None needed for this question. However, once again, I will not that I kept the states in alphabetical order.

library(dplyr)

joined_df <- inner_join(state_avg_diff, FilteredResults, by = "state")

print(joined_df)
## # A tibble: 51 × 8
##    state          poll_id avg_diff avg_biden avg_trump DemVoteShare RepVoteShare
##    <chr>            <dbl>    <dbl>     <dbl>     <dbl>        <dbl>        <dbl>
##  1 alabama          67048    -9.48      39.1     58.1          36.6         62.0
##  2 alaska           67028    -4.08      44.4     52.6          42.8         52.8
##  3 arizona          66820     2.03      49.5     45.4          49.4         49.1
##  4 arkansas         66797    -9.77      39.1     58.7          34.8         62.4
##  5 california       67892    13.9       62.1     34.4          63.5         34.3
##  6 colorado         66973     7.89      56.2     40.4          55.0         41.6
##  7 connecticut      70339    13.4       61.9     35.2          59.3         39.2
##  8 delaware         68194    13.9       62.9     35.0          58.7         39.8
##  9 district of c…   70338    39.3       88.2      9.52         92.2          5.4
## 10 florida          66819     1.13      48.8     46.5          47.9         51.2
## # ℹ 41 more rows
## # ℹ 1 more variable: Vote_Share_Difference <dbl>

Question Eight: Generate a new variable indicating whether the polls made an accurate prediction of the winner in each state. Create a map showing which states the polls predicted right, and which states the polls predicted wrong. Based on this map, how well do you feel that the polls performed in the 2020 election? Answer: This task was one of the hardest on the final, which led me to try so many variations of the coding and had all sorts of graphs. My final coding is as follows: I tell R that I am using the data from joined_df and find that actual percentage of voters that is represented by actual_(insert candidate last name). Based on the table that I was able see from the coding, I learned that the only two states who predicted wrong were North Caroline and Florida, so I made sure to point this out to be in a separate color (red), and then made sure all of the other states were in blue to represent an accurate prediction. Then I merged all of the data together followed by the plotting, which produced the map you can see below. Data Analysis: Based on the map and the table that led to the creation of the map, the only two states that predicted wrong were North Carolina and Florida. The rest of the country had accurate predictions. I feel the polls performed very well because they were able to accurately predict the results.

joined_df <- joined_df %>%
  mutate(
    actual_Biden = avg_biden,
    actual_Trump = avg_trump,
    correct_prediction = ifelse(
      (actual_Biden > actual_Trump & avg_biden > avg_trump) |
        (actual_Biden < actual_Trump & avg_biden < avg_trump), 
      "Yes", 
      "Unknown" 
    )
  )

joined_df$correct_prediction[joined_df$state %in% c("north carolina", "florida")] <- "No"

us_map <- map_data("state")

us_map$region <- tolower(us_map$region)

merged_data <- merge(us_map, joined_df, by.x = "region", by.y = "state", all.x = TRUE)

ggplot(data = merged_data, aes(x = long, y = lat, group = group, fill = correct_prediction)) +
  geom_polygon(color = "grey40", lwd = 0.15) +
  coord_quickmap() +
  theme_minimal() +
  scale_fill_manual(values = c("Yes" = "blue", "No" = "red", "Unknown" = "white"), name = "Correct Prediction") +
  labs(title = "Accuracy of Poll Predictions in 2020 Election") +
  theme(legend.position = "bottom")

Question Nine: Generate a new variable indicating the difference between the expected election result from the polls and the actual election result. Create a map showing how much the candidates outperformed the polls. Use a color gradient such that larger Biden vote shares compared to the polls are darker shades of blue, larger Trump vote shares compared to the polls are darker shades of red, and states where the polls were basically correct should be white. Based on this map, how well do you feel that the polls performed in the 2020 election? Hint: you may want to use scale_color_gradient2(). Answer: On this task, I start by calling my dataframe joined_df. Then, I begin to mutate so I know how much the candidates over performed. Then, I finish up with merging the data, which leads me to my final step of plotting the information. I set the scale to be gradient so I can see how much on a gradient the candidate over performed. Then I did the labs to complete the titling and labeling, and positioned my theme at the bottom. Data Analysis: On this map I can see that the more white it is the more Trump performed and the more purple it is the more Biden performed. From this we can see Biden had the majority overall across the countries because it is mostly darker purple on the map. This allows me to conclude that more than likely Biden won the election (which I now know in 2024 that he did win the election).

joined_df <- joined_df %>%
  mutate(
    diff_Biden = actual_Biden,
    diff_Trump = actual_Trump - actual_Biden
  )

joined_df <- joined_df %>%
  mutate(
    correct_prediction = ifelse(diff_Biden > diff_Trump, "Biden", "Trump")
  )

library(ggplot2)
library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
us_map <- map_data("state")

us_map$region <- tolower(us_map$region)

merged_data <- merge(us_map, joined_df, by.x = "region", by.y = "state", all.x = TRUE)

ggplot(data = merged_data, aes(x = long, y = lat, group = group, fill = diff_Biden - diff_Trump)) +
  geom_polygon(color = "grey40", lwd = 0.15) +
  coord_quickmap() +
  theme_minimal() +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0) +
  labs(title = "Overperformance/Underperformance Compared to Polls",
       fill = "Biden - Trump\n(positive: Biden outperformed polls,\nnegative: Trump outperformed polls)") +
  theme(legend.position = "bottom")

Question Ten: Return to the data frame you created in question 5, but this time, provide the expected percentage point difference between the Democratic candidate’s vote share and the Republican candidate’s vote share for each pollster. Which three pollsters had the most accurate state election polls, on average? Which three pollsters had the least accurate state election polls, on average? Hint: Accuracy is measured by the absolute difference between expected and acutal result; the direction of the difference does not matter. You will need to use the abs() function in your answer. Answer: To begin coding this task, the first thing that I did was add pollster_id to my joined_df dataframe because it was the most efficient way I was capable of coding it. Then I began to mutate the mathematical equation needed to find the statistics that I was looking for. Then I found the absolute values and mean followed by the code to get the most and least accurate.
Data Analysis: Based on the table we are able to see that the top three pollsters, meaning the most accurate, were 396, 770, and 1613. Then we can see from the table that the least accurate were 67, 290, and 362. This helps us when collecting data in the future, to know who has notoriously been more accurate over the election seasons. We can tell which ones are more or less accurate based on the fast that we the number to be as close 0 as possible. Why? This is because we want to be have as little deviation as possible.

joined_df <- merge(joined_df, polls[, c("poll_id", "pollster_id")], by = "poll_id", all.x = TRUE)

joined_df <- joined_df %>%
  mutate(
    abs_diff = abs(actual_Biden - actual_Trump))

avg_abs_diff <- joined_df %>%
  group_by(pollster_id) %>%
  summarize(avg_abs_diff = mean(abs_diff, na.rm = TRUE))

top_3_pollsters <- avg_abs_diff %>%
  top_n(3, wt = -avg_abs_diff)

bottom_3_pollsters <- avg_abs_diff %>%
  top_n(3, wt = avg_abs_diff)

combined_pollsters <- bind_rows(top_3_pollsters %>% mutate(rank = "Top 3"),
                                bottom_3_pollsters %>% mutate(rank = "Bottom 3"))

print(combined_pollsters)
## # A tibble: 6 × 3
##   pollster_id avg_abs_diff rank    
##         <dbl>        <dbl> <chr>   
## 1         396         1.82 Top 3   
## 2         770         4.91 Top 3   
## 3        1613         4.12 Top 3   
## 4          67        36.4  Bottom 3
## 5         290        39.7  Bottom 3
## 6         362        34.8  Bottom 3
library(readr)
library(tidyverse)
require(dplyr)
setwd("~/Dropbox/DATA101/")
polls <- read_csv(file = "Data/Raw/president_polls_historical.csv")
view(polls)
#########################################################################################################
setwd("~/Dropbox/DATA101/")
results <- read_csv(file = "Data/Raw/1976-2020-president.csv")
view(results)
polls <- polls %>% 
  mutate(start_date = mdy(start_date)) %>% 
  filter(start_date >= as.Date('2020-06-01') & (answer == "Biden" | answer == "Trump")) %>%
  select(poll_id, pollster_id, answer, candidate_name, start_date, state, pct, methodology)
library(ggplot2)
library(ggplot2)

trump_data <- polls %>%
  filter(state == "" | is.na(state), candidate_name == "Donald Trump") %>%
  group_by(start_date) %>%
  summarise(avgpoll = mean(pct, na.rm = TRUE))
biden_data <- polls %>%
  filter(state == "" | is.na(state), candidate_name == "Joe Biden") %>%
  group_by(start_date) %>%
  summarise(avgpoll = mean(pct, na.rm = TRUE))
ggplot() +
  geom_point(data = trump_data, aes(x = start_date, y = avgpoll, color = "Donald Trump"), alpha = 0.5) +
  geom_smooth(data = trump_data, aes(x = start_date, y = avgpoll), method = "loess", color = "red") +
  geom_point(data = biden_data, aes(x = start_date, y = avgpoll, color = "Joe Biden"), alpha = 0.5) +
  geom_smooth(data = biden_data, aes(x = start_date, y = avgpoll), method = "loess", color = "blue") +
  labs(title = "Candidate Support June - November 2020",
       x = "Date",
       y = "Average Poll Percentage",
       color = "Candidate") +
  scale_color_manual(values = c("Donald Trump" = "red", "Joe Biden" = "blue"),
                     labels = c("Donald Trump", "Joe Biden")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(color = guide_legend(title = "Candidate"))
library(ggplot2)
library(lubridate)
library(dplyr)
library(ggplot2)

avgpoll <- polls %>%
  group_by(start_date, answer, methodology) %>%
  summarise(avgpoll = mean(pct, na.rm = TRUE)) %>%
  ungroup()

shyvoters <- polls %>% 
  mutate(shy_trump_voter = ifelse(methodology %in% c("Live Phone", "IVR", "IVR/Live Phone", "IVR/Text", "IVR/Live Phone/Text", "IVR/Text-to-Web", "IVR/Text-to-Web/Online Ad", "IVR/Text-to-Web/Online Ad", "Live Phone/Text", "IVR/Live Phone/Text-to-Web"), 1, 0))

shyvoters <- left_join(shyvoters, avgpoll, by = c("start_date", "answer", "methodology"))

ggplot(shyvoters, aes(x = start_date, y = avgpoll, color = answer)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = "loess") +
  scale_color_manual(values = c("Biden" = "blue", "Trump" = "red")) +
  facet_wrap(~ shy_trump_voter, ncol = 2, labeller = labeller(shy_trump_voter = c("0" = "Live Phone", "1" = "Not Live Phone Call"))) +
  labs(title = "Candidate Support June - November 2020", x = "Date", y = "Average Poll Percentage", color = "Candidate") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "bottom")
library(dplyr)

state_polls <- polls %>%
  filter(!is.na(state) & state != "")

state_polls <- state_polls %>%
  mutate(party = ifelse(candidate_name == "Joe Biden", "DEMOCRAT",
                        ifelse(candidate_name == "Donald Trump", "REPUBLICAN", NA))) %>%
  group_by(poll_id) %>%
  summarise(pct_diff = mean(ifelse(party == "DEMOCRAT", pct, -pct), na.rm = TRUE))

state_avg_diff <- state_polls %>%
  left_join(polls, by = "poll_id") %>%
  group_by(state) %>%
  summarise(poll_id = first(poll_id), 
            avg_diff = mean(pct_diff, na.rm = TRUE),
            avg_biden = mean(ifelse(candidate_name == "Joe Biden", pct, NA), na.rm = TRUE),
            avg_trump = mean(ifelse(candidate_name == "Donald Trump", pct, NA), na.rm = TRUE)) %>%
  mutate(state = tolower(state))  

state_avg_diff <- state_avg_diff %>%
  filter(n_distinct(poll_id) >= 10)

print(state_avg_diff)
library(dplyr)

FilteredResults <- results %>% 
  filter(year == 2020, party_simplified %in% c("DEMOCRAT", "REPUBLICAN")) %>% 
  select(state, party_simplified, candidatevotes, totalvotes) %>% 
  pivot_wider(names_from = party_simplified, values_from = candidatevotes) %>% 
  mutate(state = tolower(state)) %>% 
  group_by(state) %>%
  summarise(DemVoteShare = round((DEMOCRAT / totalvotes) * 100, 2), 
            RepVoteShare = round((REPUBLICAN / totalvotes) * 100, 2), 
            Vote_Share_Difference = DemVoteShare - RepVoteShare) 

print(FilteredResults)
library(dplyr)

joined_df <- inner_join(state_avg_diff, FilteredResults, by = "state")

print(joined_df)
joined_df <- joined_df %>%
  mutate(
    actual_Biden = avg_biden,
    actual_Trump = avg_trump,
    correct_prediction = ifelse(
      (actual_Biden > actual_Trump & avg_biden > avg_trump) |
        (actual_Biden < actual_Trump & avg_biden < avg_trump), 
      "Yes", 
      "Unknown" 
    )
  )

joined_df$correct_prediction[joined_df$state %in% c("north carolina", "florida")] <- "No"

us_map <- map_data("state")

us_map$region <- tolower(us_map$region)

merged_data <- merge(us_map, joined_df, by.x = "region", by.y = "state", all.x = TRUE)

ggplot(data = merged_data, aes(x = long, y = lat, group = group, fill = correct_prediction)) +
  geom_polygon(color = "grey40", lwd = 0.15) +
  coord_quickmap() +
  theme_minimal() +
  scale_fill_manual(values = c("Yes" = "blue", "No" = "red", "Unknown" = "white"), name = "Correct Prediction") +
  labs(title = "Accuracy of Poll Predictions in 2020 Election") +
  theme(legend.position = "bottom")
joined_df <- joined_df %>%
  mutate(
    diff_Biden = actual_Biden,
    diff_Trump = actual_Trump - actual_Biden
  )

joined_df <- joined_df %>%
  mutate(
    correct_prediction = ifelse(diff_Biden > diff_Trump, "Biden", "Trump")
  )

library(ggplot2)
library(maps)

us_map <- map_data("state")

us_map$region <- tolower(us_map$region)

merged_data <- merge(us_map, joined_df, by.x = "region", by.y = "state", all.x = TRUE)

ggplot(data = merged_data, aes(x = long, y = lat, group = group, fill = diff_Biden - diff_Trump)) +
  geom_polygon(color = "grey40", lwd = 0.15) +
  coord_quickmap() +
  theme_minimal() +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0) +
  labs(title = "Overperformance/Underperformance Compared to Polls",
       fill = "Biden - Trump\n(positive: Biden outperformed polls,\nnegative: Trump outperformed polls)") +
  theme(legend.position = "bottom")
joined_df <- merge(joined_df, polls[, c("poll_id", "pollster_id")], by = "poll_id", all.x = TRUE)

joined_df <- joined_df %>%
  mutate(
    abs_diff = abs(actual_Biden - actual_Trump))

avg_abs_diff <- joined_df %>%
  group_by(pollster_id) %>%
  summarize(avg_abs_diff = mean(abs_diff, na.rm = TRUE))

top_3_pollsters <- avg_abs_diff %>%
  top_n(3, wt = -avg_abs_diff)

bottom_3_pollsters <- avg_abs_diff %>%
  top_n(3, wt = avg_abs_diff)

combined_pollsters <- bind_rows(top_3_pollsters %>% mutate(rank = "Top 3"),
                                bottom_3_pollsters %>% mutate(rank = "Bottom 3"))

print(combined_pollsters)