Poll Predictions of the 2020 Election


  1. Load both datasets (election polls, henceforth polls, and election results, henceforth results) into R. (5 points)
Election Polls Dataset Example
poll_id pollster_id pollster sponsor_ids sponsors
57025 399 Rasmussen (Pulse Opinion Research)
57025 399 Rasmussen (Pulse Opinion Research)
57026 383 PPP 1363 DEM (partisan)
57026 383 PPP 1363 DEM (partisan)
57026 383 PPP 1363 DEM (partisan)
57026 383 PPP 1363 DEM (partisan)
Election Results Dataset Example
year state state_po state_fips state_cen
1976 ALABAMA AL 1 63
1976 ALABAMA AL 1 63
1976 ALABAMA AL 1 63
1976 ALABAMA AL 1 63
1976 ALABAMA AL 1 63
1976 ALABAMA AL 1 63

  1. Keep only the rows in polls with a start date on or after June 1, 2020 and where the candidate is either Joe Biden or Donald Trump. (5 points)
Filtered Poll Dataset Example
poll_id pollster_id answer candidate_name polldate
68377 1189 Biden Joe Biden 2020-06-01
68377 1189 Trump Donald Trump 2020-06-01
66763 1189 Biden Joe Biden 2020-06-01
66763 1189 Trump Donald Trump 2020-06-01
66745 1416 Biden Joe Biden 2020-06-01
66745 1416 Trump Donald Trump 2020-06-01

  1. Let’s explore this data. Create a scatterplot with a trend line that shows the change in each candidate’s support over the course of the race. Use only national (not state) polls. Use red for Trump’s support and blue for Biden’s support. What do you observe in the final weeks of the campaign? (10 points)

During the final weeks of the campaign, there was a sharp increase in support for Trump after a late-summer, early-fall dip, while Biden remained consistent between 50% and 52.5% after cresting 50% in mid-June.


  1. Some observers believe that there is a “shy Trump voter” effect, where Trump voters are more likely to reveal their support for Trump to a machine compared to a human. Compare the scatterplot you just created for two groups using facets: polls that used “Live Phone” methods (methodology) compared to those that did not. Do you think there is a “shy Trump voter” effect? (10 points)

When comparing Live Phone polls versus all others, it appears there there is a “Shy Trump Voter” effect being displayed as there are multiple instances where the live phone results are much lower (2-3%) than the other poll results - example, the latest polls(beginning of November) show Trump at approximately 45% versus approximately 42.5% for Non-Phone and Live Phone polls, respectively.


  1. Keep only the rows from polls that contain state polls. Summarize the “average” poll result for each state in the 2020 election. That is, what is the average percentage point difference between the Democratic candidate’s vote share and the Republican candidate’s vote share for each state’s polls? You should also remove states for which fewer than 10 polls were conducted, since data is unreliable when there are so few polls. Your new data frame should have 53 rows and your new variable will have positive and negative values. (10 points) Hint: question_id is a unique identifier for each poll.
Average Poll Prediction by State (2020)
State Count Performance (Biden)
Alabama 59 -18.9662712
Alaska 59 -8.1664407
Arizona 166 4.0644578
Arkansas 53 -19.5398113
California 64 27.7845312
Colorado 72 15.7755556
Connecticut 52 26.7211538
Delaware 52 27.8928846
District Of Columbia 50 78.6506000
Florida 187 2.2510695
Georgia 123 1.7035772
Hawaii 52 30.9040385
Idaho 51 -19.4119608
Illinois 55 19.5614545
Indiana 57 -10.0466667
Iowa 95 -0.7995789
Kansas 62 -6.8698387
Kentucky 62 -17.2280645
Louisiana 56 -16.5769643
Maine 74 13.4012162
Maine Cd-1 15 23.0633333
Maine Cd-2 19 1.1647368
Maryland 56 34.7946429
Massachusetts 56 39.7267857
Michigan 175 7.5772571
Minnesota 89 10.6530337
Mississippi 55 -16.2047273
Missouri 63 -7.3688889
Montana 69 -8.1365217
Nebraska 50 -7.9806000
Nevada 68 4.1195588
New Hampshire 68 12.1448529
New Jersey 63 22.8273016
New Mexico 55 8.7652727
New York 55 30.7107273
North Carolina 185 2.9697838
North Dakota 52 -21.2173077
Ohio 98 -1.8155102
Oklahoma 56 -20.6767857
Oregon 53 21.2920755
Pennsylvania 205 5.8112683
Rhode Island 50 31.6886000
South Carolina 73 -6.5724658
South Dakota 52 -15.4313462
Tennessee 51 -13.6288235
Texas 111 -1.1617117
Utah 55 -10.3609091
Vermont 52 36.4092308
Virginia 66 13.5907576
Washington 55 24.5669091
West Virginia 53 -32.0288679
Wisconsin 260 7.6444231
Wyoming 51 -34.8515686

  1. Using the results data frame, generate a variable that represents the actual percentage point dif- ference between the Democratic candidate’s vote share and the Republican candidate’s vote share in each state election in 2020. (10 points) We will not need results for candidates from parties other than the Democratic and Republican parties for this exam.
Vote Share Differential by State
State Biden Trump Difference
Alabama 36.57 62.03 -25.46
Alaska 42.77 52.83 -10.06
Arizona 49.36 49.06 0.30
Arkansas 34.78 62.40 -27.62
California 63.48 34.32 29.16
Colorado 55.01 41.60 13.41
Connecticut 59.26 39.19 20.07
Delaware 58.74 39.77 18.97
District Of Columbia 92.15 5.40 86.75
Florida 47.86 51.22 -3.36
Georgia 49.47 49.24 0.23
Hawaii 63.15 33.95 29.20
Idaho 33.07 63.84 -30.77
Illinois 57.54 40.55 16.99
Indiana 40.96 57.02 -16.06
Iowa 44.65 52.80 -8.15
Kansas 41.56 56.21 -14.65
Kentucky 36.15 62.09 -25.94
Louisiana 39.85 58.46 -18.61
Maine 52.53 43.55 8.98
Maryland 65.36 32.15 33.21
Massachusetts 65.12 31.91 33.21
Michigan 50.62 47.84 2.78
Minnesota 52.40 45.28 7.12
Mississippi 41.06 57.60 -16.54
Missouri 41.41 56.80 -15.39
Montana 40.55 56.92 -16.37
Nebraska 39.17 58.22 -19.05
Nevada 50.06 47.67 2.39
New Hampshire 52.71 45.36 7.35
New Jersey 57.33 41.40 15.93
New Mexico 54.29 43.50 10.79
New York 60.39 37.46 22.93
North Carolina 48.59 49.93 -1.34
North Dakota 31.76 65.11 -33.35
Ohio 45.24 53.27 -8.03
Oklahoma 32.29 65.37 -33.08
Oregon 56.45 40.37 16.08
Pennsylvania 50.01 48.84 1.17
Rhode Island 59.48 38.67 20.81
South Carolina 43.42 55.09 -11.67
South Dakota 35.61 61.77 -26.16
Tennessee 37.45 60.66 -23.21
Texas 46.48 52.06 -5.58
Utah 37.65 58.13 -20.48
Vermont 65.46 30.38 35.08
Virginia 54.11 44.00 10.11
Washington 57.97 38.77 19.20
West Virginia 29.70 68.63 -38.93
Wisconsin 49.45 48.82 0.63
Wyoming 26.39 69.50 -43.11

  1. Join the data frames from questions 5 and 6 together, keeping only those rows that match in both data frames. (5 points)
Date Frame Merge - Poll Prediction vs. Actual Vote Share
State Count Prediction (Biden) Actual (Biden) Actual (Trump) Actual (Delta)
Alabama 59 -18.9662712 36.57 62.03 -25.46
Alaska 59 -8.1664407 42.77 52.83 -10.06
Arizona 166 4.0644578 49.36 49.06 0.30
Arkansas 53 -19.5398113 34.78 62.40 -27.62
California 64 27.7845312 63.48 34.32 29.16
Colorado 72 15.7755556 55.01 41.60 13.41
Connecticut 52 26.7211538 59.26 39.19 20.07
Delaware 52 27.8928846 58.74 39.77 18.97
District Of Columbia 50 78.6506000 92.15 5.40 86.75
Florida 187 2.2510695 47.86 51.22 -3.36
Georgia 123 1.7035772 49.47 49.24 0.23
Hawaii 52 30.9040385 63.15 33.95 29.20
Idaho 51 -19.4119608 33.07 63.84 -30.77
Illinois 55 19.5614545 57.54 40.55 16.99
Indiana 57 -10.0466667 40.96 57.02 -16.06
Iowa 95 -0.7995789 44.65 52.80 -8.15
Kansas 62 -6.8698387 41.56 56.21 -14.65
Kentucky 62 -17.2280645 36.15 62.09 -25.94
Louisiana 56 -16.5769643 39.85 58.46 -18.61
Maine 74 13.4012162 52.53 43.55 8.98
Maryland 56 34.7946429 65.36 32.15 33.21
Massachusetts 56 39.7267857 65.12 31.91 33.21
Michigan 175 7.5772571 50.62 47.84 2.78
Minnesota 89 10.6530337 52.40 45.28 7.12
Mississippi 55 -16.2047273 41.06 57.60 -16.54
Missouri 63 -7.3688889 41.41 56.80 -15.39
Montana 69 -8.1365217 40.55 56.92 -16.37
Nebraska 50 -7.9806000 39.17 58.22 -19.05
Nevada 68 4.1195588 50.06 47.67 2.39
New Hampshire 68 12.1448529 52.71 45.36 7.35
New Jersey 63 22.8273016 57.33 41.40 15.93
New Mexico 55 8.7652727 54.29 43.50 10.79
New York 55 30.7107273 60.39 37.46 22.93
North Carolina 185 2.9697838 48.59 49.93 -1.34
North Dakota 52 -21.2173077 31.76 65.11 -33.35
Ohio 98 -1.8155102 45.24 53.27 -8.03
Oklahoma 56 -20.6767857 32.29 65.37 -33.08
Oregon 53 21.2920755 56.45 40.37 16.08
Pennsylvania 205 5.8112683 50.01 48.84 1.17
Rhode Island 50 31.6886000 59.48 38.67 20.81
South Carolina 73 -6.5724658 43.42 55.09 -11.67
South Dakota 52 -15.4313462 35.61 61.77 -26.16
Tennessee 51 -13.6288235 37.45 60.66 -23.21
Texas 111 -1.1617117 46.48 52.06 -5.58
Utah 55 -10.3609091 37.65 58.13 -20.48
Vermont 52 36.4092308 65.46 30.38 35.08
Virginia 66 13.5907576 54.11 44.00 10.11
Washington 55 24.5669091 57.97 38.77 19.20
West Virginia 53 -32.0288679 29.70 68.63 -38.93
Wisconsin 260 7.6444231 49.45 48.82 0.63
Wyoming 51 -34.8515686 26.39 69.50 -43.11

  1. Generate a new variable indicating whether the polls made an accurate prediction of the winner in each state. Create a map showing which states the polls predicted right, and which states the polls predicted wrong. Based on this map, how well do you feel that the polls performed in the 2020 election? (15 points)
Poll Accuracy Variable Example
State Count Prediction (Biden) Vote Share Diff. Accurate?
Alabama 59 -18.966271 -25.46 Yes
Alaska 59 -8.166441 -10.06 Yes
Arizona 166 4.064458 0.30 Yes
Arkansas 53 -19.539811 -27.62 Yes
California 64 27.784531 29.16 Yes
Colorado 72 15.775556 13.41 Yes

Based on the data and generated map, I feel the polls performed well, with the results of only 2 states differing from the polls conducted.


  1. Generate a new variable indicating the difference between the expected election result from the polls and the actual election result. Create a map showing how much the candidates outperformed the polls. Use a color gradient such that larger Biden vote shares compared to the polls are darker shades of blue, larger Trump vote shares compared to the polls are darker shades of red, and states where the polls were basically correct should be white. Based on this map, how well do you feel that the polls performed in the 2020 election? (15 points) Hint: you may want to use scale_color_gradient2().

Based on the above map, I still believe that the polls performed well but, as discussed above, they were unable to account for the “Shy Trump Voters” so there were more occurrences of Trump over performance than there were of Biden over performance.


  1. Return to the data frame you created in question 5, but this time, provide the expected percentage point difference between the Democratic candidate’s vote share and the Republican candidate’s vote share for each pollster. Which three pollsters had the most accurate state election polls, on average? Which three pollsters had the least accurate state election polls, on average? (15 points) Hint: Accuracy is measured by the absolute difference between expected and actual result; the direction of the difference does not matter. You will need to use the abs() function in your answer.
Most Accurate Pollsters - Predicted vs. Actual (Biden vs. Trump)
Pollster Predicted Actual Pollster Deviation
Spry Strategies -4.441667 -4.49 0.0483333
Fairleigh Dickinson 16.000000 15.93 0.0700000
Keating Research/OnSight Public Affairs/Melanson 13.500000 13.41 0.0900000
Least Accurate Pollsters - Predicted vs. Actual (Biden vs. Trump)
Pollster Predicted Actual Pollster Deviation
Research America -14.00 -38.93 24.93
Triton Polling & Research -19.05 -38.93 19.88
Bluegrass Data Analytics -7.00 -25.94 18.94

On average, the three most accurate Pollsters were Spry Strategies, Fairleigh Dickinson, and Keating Research and the least accurate were Research America, Triton Polling & Research, and Bluegrass Data Analytics.

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(dplyr)
library(knitr)
library(lubridate)
library(stringr)
library(mapdata)
library(ggplot2)
library(usmap)

setwd("c:/users/trega/Dropbox/DATA1010/Data/Raw")

PresidentPolls<-read.csv("president_polls_historical.csv")

ElectionResults<-read.csv("1976-2020-president.csv")

state<-map_data("state")


kable(head(PresidentPolls[,1:5]),align='ccccc',
      caption="Election Polls Dataset Example")


kable(head(ElectionResults[,1:5]),align='ccccc',
      caption="Election Results Dataset Example")

FilteredPolls<-PresidentPolls %>% 
  mutate(polldate=mdy(start_date)) %>% 
    filter(polldate>='2020-06-01',answer=="Biden"|answer=="Trump")

FilteredPollExample<- FilteredPolls %>% 
  select(poll_id, pollster_id,answer,candidate_name,polldate)

kable(head(FilteredPollExample),align='ccccc',
      caption="Filtered Poll Dataset Example")

NatlTrend <- FilteredPolls %>% 
  filter(state=="") %>% 
  group_by(polldate, answer) %>% 
  summarise(avgpoll=mean(pct))


ggplot(NatlTrend, mapping = aes(x=polldate, y=avgpoll, color=answer)) +
  geom_point()+
  geom_smooth(formula = y ~ x, method = "loess")+
  scale_color_manual(values=c("Biden"="blue","Trump"="red"))+
  labs(title="Candidate Support June - November 2020", x="Date", y="Average Poll Percentage", color="Candidate")+
  theme(plot.title = element_text(hjust = 0.5),
        legend.position="bottom")

ShyVoter<-FilteredPolls %>% 
  mutate(ContactType=ifelse((methodology=="Live Phone"),"Live Phone","Non-Phone")) %>% 
  filter(state=="") %>% 
  group_by(polldate, answer, ContactType) %>% 
  summarise(avgpoll=mean(pct), .groups="drop")

ggplot(ShyVoter, mapping = aes(x=polldate, y=avgpoll, color=answer)) +
  geom_point()+
  geom_smooth(formula = y ~ x, method = "loess")+
  scale_color_manual(values=c("Biden"="blue","Trump"="red"))+
  facet_wrap(~ContactType, ncol=2)+
  labs(title="Candidate Support June - November 2020", x="Date", y="Average Poll Percentage", color="Candidate")+
  theme(plot.title = element_text(hjust = 0.5),
        legend.position="bottom")

StateTrend <- FilteredPolls %>% 
  filter(state!="") %>%
  select(state, party, question_id, pct) %>% 
  pivot_wider(names_from = party, values_from = pct ) %>%  
  mutate(polldiff=round(DEM-REP,2), 
         state=tolower(state)) %>% 
  group_by(state) %>% 
  summarise(Poll_Count=n(), 
            Dem_to_Rep = mean(polldiff)) %>% 
  filter(Poll_Count>10)

StateTrendUp<-StateTrend %>% 
  mutate(state=str_to_title(state))

kable(StateTrendUp, align = 'ccc', 
      col.names = c("State","Count", "Performance (Biden)"), 
      caption="Average Poll Prediction by State (2020)")


FilteredResults <-ElectionResults %>% 
  filter(year=="2020", party_simplified=="DEMOCRAT"|party_simplified=="REPUBLICAN") %>% 
  select(state, party_simplified, candidatevotes, totalvotes) %>% 
  pivot_wider(names_from=party_simplified, values_from = candidatevotes) %>% 
  mutate(state=tolower(state)) %>% 
  group_by(state) %>% 
  summarise(DemVoteShare = round((DEMOCRAT/totalvotes)*100,2), 
            RepVoteShare = round((REPUBLICAN/totalvotes)*100,2), 
    Vote_Share_Difference = DemVoteShare-RepVoteShare) 

FilteredResults2<-FilteredResults %>% 
  mutate(state=str_to_title(state))

kable(FilteredResults2, align='cccc', 
      col.names = c("State", "Biden", "Trump", "Difference"), 
      caption="Vote Share Differential by State")


DataJoin<-StateTrend %>% 
  inner_join(FilteredResults, by=join_by(state)) %>% 
  mutate(state=str_to_title(state))

kable(DataJoin, align='cccccc', 
      col.names = c("State","Count", "Prediction (Biden)", "Actual (Biden)", "Actual (Trump)","Actual (Delta)" ), 
      caption="Date Frame Merge - Poll Prediction vs. Actual Vote Share")


PollAccuracy<-DataJoin %>% 
  mutate(PollWinner = ifelse(Dem_to_Rep>0, "Dem","Rep"), 
         ActualWinner = ifelse(Vote_Share_Difference>0, "Dem","Rep"),
         Accurate = ifelse(PollWinner==ActualWinner, "Yes", "No"), 
         region=tolower(state)) %>% 
  select(-state)

PollAccuracy2<-DataJoin %>% 
  mutate(PollWinner = ifelse(Dem_to_Rep>0, "Dem","Rep"), 
         ActualWinner = ifelse(Vote_Share_Difference>0, "Dem","Rep"),
         Accurate = ifelse(PollWinner==ActualWinner, "Yes", "No"), 
         region=tolower(state)) %>% 
  select(region, Poll_Count, Dem_to_Rep, Vote_Share_Difference, Accurate) %>% 
  mutate(region=str_to_title(region))

kable(head(PollAccuracy2), align='cc', 
      col.names = c("State", "Count", "Prediction (Biden)", "Vote Share Diff.", "Accurate?"), 
      caption="Poll Accuracy Variable Example")

StateJoin<-state %>% 
  left_join(PollAccuracy, by=join_by("region"))

ggplot(data=StateJoin)+
  geom_polygon(aes(x=long,y=lat,group=group,fill=as.factor(Accurate)),col="white",lwd=0.005)+
  scale_fill_manual(values = c("Yes"="blue",
                                "No"="red"))+
  coord_quickmap()+
  theme_void()+
  labs(title="Poll Accuracy by-State",
    fill="Polls Accurate?")+
   theme(plot.title = element_text(hjust = 0.5),
        legend.position="bottom")

StateTrend2 <- FilteredPolls %>% 
  filter(state!="") %>%
  select(state, party, question_id, pct) %>% 
  pivot_wider(names_from = party, values_from = pct) %>% 
  group_by(state) %>% 
  summarise(polltotaldem=mean(DEM), 
            polltotalrep=mean(REP)) %>% 
  mutate(Winner=ifelse(polltotaldem>polltotalrep, "Biden", "Trump"),
         state=tolower(state))

StateMerge3<-StateTrend2 %>% 
  left_join(FilteredResults,by=join_by(state)) %>% 
  select(-Vote_Share_Difference) %>% 
  mutate(demdiff=DemVoteShare-polltotaldem,
         repdiff=RepVoteShare-polltotalrep)

StateJoin3<- StateMerge3 %>% 
  left_join(state, by=c("state"="region"))

ggplot(data=StateJoin3)+
  geom_polygon(aes(x=long, y=lat, group=group, fill=demdiff))+
  scale_fill_gradient2(low="red", mid="white", high="blue", midpoint=0)+
  coord_quickmap()+
    theme_void()+
  labs(title="Over Performance vs. Polls",
    fill="Trump vs. Biden")+
   theme(plot.title = element_text(hjust = 0.5),
        legend.position="")
  

PollsterTrend <- FilteredPolls %>% 
  filter(state!="") %>%
  select(state, party, question_id, pct, pollster) %>% 
  pivot_wider(names_from = party, values_from = pct ) %>% 
  mutate(state=tolower(state))

PollsterTrend2<-PollsterTrend %>% 
  left_join(PollAccuracy, by=c("state"="region")) %>% 
  select(state,pollster,DEM, REP, DemVoteShare, RepVoteShare) %>% 
  group_by(pollster) %>% 
  summarise(PollsterDemvRep=mean(DEM-REP),
            ActualDemvRep=mean(DemVoteShare-RepVoteShare),
            PollsterDev = abs(PollsterDemvRep - ActualDemvRep)) %>% 
  arrange(-PollsterDev)

PollsterTrend3<-PollsterTrend %>% 
  left_join(PollAccuracy, by=c("state"="region")) %>% 
  select(state,pollster,DEM, REP, DemVoteShare, RepVoteShare) %>% 
  group_by(pollster) %>% 
  summarise(PollsterDemvRep=mean(DEM-REP),
            ActualDemvRep=mean(DemVoteShare-RepVoteShare),
            PollsterDev = abs(PollsterDemvRep - ActualDemvRep)) %>% 
  arrange(PollsterDev)

kable(head(PollsterTrend3,3), align='ccc',
      col.names = c("Pollster", "Predicted", "Actual", "Pollster Deviation"),
      caption="Most Accurate Pollsters - Predicted vs. Actual (Biden vs. Trump)")

kable(head(PollsterTrend2,3), align='ccc',
      col.names = c("Pollster", "Predicted", "Actual", "Pollster Deviation"),
      caption="Least Accurate Pollsters - Predicted vs. Actual (Biden vs. Trump)")