Final Exam

Poll Predictions of the 2020 Election

Load both datasets (election polls, henceforth polls, and election results, henceforth results) into R. (5 points)

Election Polls Dataset Example
poll_id	pollster_id	pollster	sponsor_ids	sponsors
57025	399	Rasmussen (Pulse Opinion Research)
57025	399	Rasmussen (Pulse Opinion Research)
57026	383	PPP	1363	DEM (partisan)
57026	383	PPP	1363	DEM (partisan)
57026	383	PPP	1363	DEM (partisan)
57026	383	PPP	1363	DEM (partisan)

Election Results Dataset Example
year	state	state_po	state_fips	state_cen
1976	ALABAMA	AL	1	63
1976	ALABAMA	AL	1	63
1976	ALABAMA	AL	1	63
1976	ALABAMA	AL	1	63
1976	ALABAMA	AL	1	63
1976	ALABAMA	AL	1	63

Keep only the rows in polls with a start date on or after June 1, 2020 and where the candidate is either Joe Biden or Donald Trump. (5 points)

Filtered Poll Dataset Example
poll_id	pollster_id	answer	candidate_name	polldate
68377	1189	Biden	Joe Biden	2020-06-01
68377	1189	Trump	Donald Trump	2020-06-01
66763	1189	Biden	Joe Biden	2020-06-01
66763	1189	Trump	Donald Trump	2020-06-01
66745	1416	Biden	Joe Biden	2020-06-01
66745	1416	Trump	Donald Trump	2020-06-01

Let’s explore this data. Create a scatterplot with a trend line that shows the change in each candidate’s support over the course of the race. Use only national (not state) polls. Use red for Trump’s support and blue for Biden’s support. What do you observe in the final weeks of the campaign? (10 points)

During the final weeks of the campaign, there was a sharp increase in support for Trump after a late-summer, early-fall dip, while Biden remained consistent between 50% and 52.5% after cresting 50% in mid-June.

Some observers believe that there is a “shy Trump voter” effect, where Trump voters are more likely to reveal their support for Trump to a machine compared to a human. Compare the scatterplot you just created for two groups using facets: polls that used “Live Phone” methods (methodology) compared to those that did not. Do you think there is a “shy Trump voter” effect? (10 points)

When comparing Live Phone polls versus all others, it appears there there is a “Shy Trump Voter” effect being displayed as there are multiple instances where the live phone results are much lower (2-3%) than the other poll results - example, the latest polls(beginning of November) show Trump at approximately 45% versus approximately 42.5% for Non-Phone and Live Phone polls, respectively.

Keep only the rows from polls that contain state polls. Summarize the “average” poll result for each state in the 2020 election. That is, what is the average percentage point difference between the Democratic candidate’s vote share and the Republican candidate’s vote share for each state’s polls? You should also remove states for which fewer than 10 polls were conducted, since data is unreliable when there are so few polls. Your new data frame should have 53 rows and your new variable will have positive and negative values. (10 points) Hint: question_id is a unique identifier for each poll.

Average Poll Prediction by State (2020)
State	Count	Performance (Biden)
Alabama	59	-18.9662712
Alaska	59	-8.1664407
Arizona	166	4.0644578
Arkansas	53	-19.5398113
California	64	27.7845312
Colorado	72	15.7755556
Connecticut	52	26.7211538
Delaware	52	27.8928846
District Of Columbia	50	78.6506000
Florida	187	2.2510695
Georgia	123	1.7035772
Hawaii	52	30.9040385
Idaho	51	-19.4119608
Illinois	55	19.5614545
Indiana	57	-10.0466667
Iowa	95	-0.7995789
Kansas	62	-6.8698387
Kentucky	62	-17.2280645
Louisiana	56	-16.5769643
Maine	74	13.4012162
Maine Cd-1	15	23.0633333
Maine Cd-2	19	1.1647368
Maryland	56	34.7946429
Massachusetts	56	39.7267857
Michigan	175	7.5772571
Minnesota	89	10.6530337
Mississippi	55	-16.2047273
Missouri	63	-7.3688889
Montana	69	-8.1365217
Nebraska	50	-7.9806000
Nevada	68	4.1195588
New Hampshire	68	12.1448529
New Jersey	63	22.8273016
New Mexico	55	8.7652727
New York	55	30.7107273
North Carolina	185	2.9697838
North Dakota	52	-21.2173077
Ohio	98	-1.8155102
Oklahoma	56	-20.6767857
Oregon	53	21.2920755
Pennsylvania	205	5.8112683
Rhode Island	50	31.6886000
South Carolina	73	-6.5724658
South Dakota	52	-15.4313462
Tennessee	51	-13.6288235
Texas	111	-1.1617117
Utah	55	-10.3609091
Vermont	52	36.4092308
Virginia	66	13.5907576
Washington	55	24.5669091
West Virginia	53	-32.0288679
Wisconsin	260	7.6444231
Wyoming	51	-34.8515686

Using the results data frame, generate a variable that represents the actual percentage point dif- ference between the Democratic candidate’s vote share and the Republican candidate’s vote share in each state election in 2020. (10 points) We will not need results for candidates from parties other than the Democratic and Republican parties for this exam.

Vote Share Differential by State
State	Biden	Trump	Difference
Alabama	36.57	62.03	-25.46
Alaska	42.77	52.83	-10.06
Arizona	49.36	49.06	0.30
Arkansas	34.78	62.40	-27.62
California	63.48	34.32	29.16
Colorado	55.01	41.60	13.41
Connecticut	59.26	39.19	20.07
Delaware	58.74	39.77	18.97
District Of Columbia	92.15	5.40	86.75
Florida	47.86	51.22	-3.36
Georgia	49.47	49.24	0.23
Hawaii	63.15	33.95	29.20
Idaho	33.07	63.84	-30.77
Illinois	57.54	40.55	16.99
Indiana	40.96	57.02	-16.06
Iowa	44.65	52.80	-8.15
Kansas	41.56	56.21	-14.65
Kentucky	36.15	62.09	-25.94
Louisiana	39.85	58.46	-18.61
Maine	52.53	43.55	8.98
Maryland	65.36	32.15	33.21
Massachusetts	65.12	31.91	33.21
Michigan	50.62	47.84	2.78
Minnesota	52.40	45.28	7.12
Mississippi	41.06	57.60	-16.54
Missouri	41.41	56.80	-15.39
Montana	40.55	56.92	-16.37
Nebraska	39.17	58.22	-19.05
Nevada	50.06	47.67	2.39
New Hampshire	52.71	45.36	7.35
New Jersey	57.33	41.40	15.93
New Mexico	54.29	43.50	10.79
New York	60.39	37.46	22.93
North Carolina	48.59	49.93	-1.34
North Dakota	31.76	65.11	-33.35
Ohio	45.24	53.27	-8.03
Oklahoma	32.29	65.37	-33.08
Oregon	56.45	40.37	16.08
Pennsylvania	50.01	48.84	1.17
Rhode Island	59.48	38.67	20.81
South Carolina	43.42	55.09	-11.67
South Dakota	35.61	61.77	-26.16
Tennessee	37.45	60.66	-23.21
Texas	46.48	52.06	-5.58
Utah	37.65	58.13	-20.48
Vermont	65.46	30.38	35.08
Virginia	54.11	44.00	10.11
Washington	57.97	38.77	19.20
West Virginia	29.70	68.63	-38.93
Wisconsin	49.45	48.82	0.63
Wyoming	26.39	69.50	-43.11

Join the data frames from questions 5 and 6 together, keeping only those rows that match in both data frames. (5 points)

Date Frame Merge - Poll Prediction vs. Actual Vote Share
State	Count	Prediction (Biden)	Actual (Biden)	Actual (Trump)	Actual (Delta)
Alabama	59	-18.9662712	36.57	62.03	-25.46
Alaska	59	-8.1664407	42.77	52.83	-10.06
Arizona	166	4.0644578	49.36	49.06	0.30
Arkansas	53	-19.5398113	34.78	62.40	-27.62
California	64	27.7845312	63.48	34.32	29.16
Colorado	72	15.7755556	55.01	41.60	13.41
Connecticut	52	26.7211538	59.26	39.19	20.07
Delaware	52	27.8928846	58.74	39.77	18.97
District Of Columbia	50	78.6506000	92.15	5.40	86.75
Florida	187	2.2510695	47.86	51.22	-3.36
Georgia	123	1.7035772	49.47	49.24	0.23
Hawaii	52	30.9040385	63.15	33.95	29.20
Idaho	51	-19.4119608	33.07	63.84	-30.77
Illinois	55	19.5614545	57.54	40.55	16.99
Indiana	57	-10.0466667	40.96	57.02	-16.06
Iowa	95	-0.7995789	44.65	52.80	-8.15
Kansas	62	-6.8698387	41.56	56.21	-14.65
Kentucky	62	-17.2280645	36.15	62.09	-25.94
Louisiana	56	-16.5769643	39.85	58.46	-18.61
Maine	74	13.4012162	52.53	43.55	8.98
Maryland	56	34.7946429	65.36	32.15	33.21
Massachusetts	56	39.7267857	65.12	31.91	33.21
Michigan	175	7.5772571	50.62	47.84	2.78
Minnesota	89	10.6530337	52.40	45.28	7.12
Mississippi	55	-16.2047273	41.06	57.60	-16.54
Missouri	63	-7.3688889	41.41	56.80	-15.39
Montana	69	-8.1365217	40.55	56.92	-16.37
Nebraska	50	-7.9806000	39.17	58.22	-19.05
Nevada	68	4.1195588	50.06	47.67	2.39
New Hampshire	68	12.1448529	52.71	45.36	7.35
New Jersey	63	22.8273016	57.33	41.40	15.93
New Mexico	55	8.7652727	54.29	43.50	10.79
New York	55	30.7107273	60.39	37.46	22.93
North Carolina	185	2.9697838	48.59	49.93	-1.34
North Dakota	52	-21.2173077	31.76	65.11	-33.35
Ohio	98	-1.8155102	45.24	53.27	-8.03
Oklahoma	56	-20.6767857	32.29	65.37	-33.08
Oregon	53	21.2920755	56.45	40.37	16.08
Pennsylvania	205	5.8112683	50.01	48.84	1.17
Rhode Island	50	31.6886000	59.48	38.67	20.81
South Carolina	73	-6.5724658	43.42	55.09	-11.67
South Dakota	52	-15.4313462	35.61	61.77	-26.16
Tennessee	51	-13.6288235	37.45	60.66	-23.21
Texas	111	-1.1617117	46.48	52.06	-5.58
Utah	55	-10.3609091	37.65	58.13	-20.48
Vermont	52	36.4092308	65.46	30.38	35.08
Virginia	66	13.5907576	54.11	44.00	10.11
Washington	55	24.5669091	57.97	38.77	19.20
West Virginia	53	-32.0288679	29.70	68.63	-38.93
Wisconsin	260	7.6444231	49.45	48.82	0.63
Wyoming	51	-34.8515686	26.39	69.50	-43.11

Generate a new variable indicating whether the polls made an accurate prediction of the winner in each state. Create a map showing which states the polls predicted right, and which states the polls predicted wrong. Based on this map, how well do you feel that the polls performed in the 2020 election? (15 points)

Poll Accuracy Variable Example
State	Count	Prediction (Biden)	Vote Share Diff.	Accurate?
Alabama	59	-18.966271	-25.46	Yes
Alaska	59	-8.166441	-10.06	Yes
Arizona	166	4.064458	0.30	Yes
Arkansas	53	-19.539811	-27.62	Yes
California	64	27.784531	29.16	Yes
Colorado	72	15.775556	13.41	Yes

Based on the data and generated map, I feel the polls performed well, with the results of only 2 states differing from the polls conducted.

Generate a new variable indicating the difference between the expected election result from the polls and the actual election result. Create a map showing how much the candidates outperformed the polls. Use a color gradient such that larger Biden vote shares compared to the polls are darker shades of blue, larger Trump vote shares compared to the polls are darker shades of red, and states where the polls were basically correct should be white. Based on this map, how well do you feel that the polls performed in the 2020 election? (15 points) Hint: you may want to use scale_color_gradient2().

Based on the above map, I still believe that the polls performed well but, as discussed above, they were unable to account for the “Shy Trump Voters” so there were more occurrences of Trump over performance than there were of Biden over performance.

Return to the data frame you created in question 5, but this time, provide the expected percentage point difference between the Democratic candidate’s vote share and the Republican candidate’s vote share for each pollster. Which three pollsters had the most accurate state election polls, on average? Which three pollsters had the least accurate state election polls, on average? (15 points) Hint: Accuracy is measured by the absolute difference between expected and actual result; the direction of the difference does not matter. You will need to use the abs() function in your answer.

Most Accurate Pollsters - Predicted vs. Actual (Biden vs. Trump)
Pollster	Predicted	Actual	Pollster Deviation
Spry Strategies	-4.441667	-4.49	0.0483333
Fairleigh Dickinson	16.000000	15.93	0.0700000
Keating Research/OnSight Public Affairs/Melanson	13.500000	13.41	0.0900000

Least Accurate Pollsters - Predicted vs. Actual (Biden vs. Trump)
Pollster	Predicted	Actual	Pollster Deviation
Research America	-14.00	-38.93	24.93
Triton Polling & Research	-19.05	-38.93	19.88
Bluegrass Data Analytics	-7.00	-25.94	18.94

On average, the three most accurate Pollsters were Spry Strategies, Fairleigh Dickinson, and Keating Research and the least accurate were Research America, Triton Polling & Research, and Bluegrass Data Analytics.

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(dplyr)
library(knitr)
library(lubridate)
library(stringr)
library(mapdata)
library(ggplot2)
library(usmap)

setwd("c:/users/trega/Dropbox/DATA1010/Data/Raw")

PresidentPolls<-read.csv("president_polls_historical.csv")

ElectionResults<-read.csv("1976-2020-president.csv")

state<-map_data("state")


kable(head(PresidentPolls[,1:5]),align='ccccc',
      caption="Election Polls Dataset Example")


kable(head(ElectionResults[,1:5]),align='ccccc',
      caption="Election Results Dataset Example")

FilteredPolls<-PresidentPolls %>% 
  mutate(polldate=mdy(start_date)) %>% 
    filter(polldate>='2020-06-01',answer=="Biden"|answer=="Trump")

FilteredPollExample<- FilteredPolls %>% 
  select(poll_id, pollster_id,answer,candidate_name,polldate)

kable(head(FilteredPollExample),align='ccccc',
      caption="Filtered Poll Dataset Example")

NatlTrend <- FilteredPolls %>% 
  filter(state=="") %>% 
  group_by(polldate, answer) %>% 
  summarise(avgpoll=mean(pct))


ggplot(NatlTrend, mapping = aes(x=polldate, y=avgpoll, color=answer)) +
  geom_point()+
  geom_smooth(formula = y ~ x, method = "loess")+
  scale_color_manual(values=c("Biden"="blue","Trump"="red"))+
  labs(title="Candidate Support June - November 2020", x="Date", y="Average Poll Percentage", color="Candidate")+
  theme(plot.title = element_text(hjust = 0.5),
        legend.position="bottom")

ShyVoter<-FilteredPolls %>% 
  mutate(ContactType=ifelse((methodology=="Live Phone"),"Live Phone","Non-Phone")) %>% 
  filter(state=="") %>% 
  group_by(polldate, answer, ContactType) %>% 
  summarise(avgpoll=mean(pct), .groups="drop")

ggplot(ShyVoter, mapping = aes(x=polldate, y=avgpoll, color=answer)) +
  geom_point()+
  geom_smooth(formula = y ~ x, method = "loess")+
  scale_color_manual(values=c("Biden"="blue","Trump"="red"))+
  facet_wrap(~ContactType, ncol=2)+
  labs(title="Candidate Support June - November 2020", x="Date", y="Average Poll Percentage", color="Candidate")+
  theme(plot.title = element_text(hjust = 0.5),
        legend.position="bottom")

StateTrend <- FilteredPolls %>% 
  filter(state!="") %>%
  select(state, party, question_id, pct) %>% 
  pivot_wider(names_from = party, values_from = pct ) %>%  
  mutate(polldiff=round(DEM-REP,2), 
         state=tolower(state)) %>% 
  group_by(state) %>% 
  summarise(Poll_Count=n(), 
            Dem_to_Rep = mean(polldiff)) %>% 
  filter(Poll_Count>10)

StateTrendUp<-StateTrend %>% 
  mutate(state=str_to_title(state))

kable(StateTrendUp, align = 'ccc', 
      col.names = c("State","Count", "Performance (Biden)"), 
      caption="Average Poll Prediction by State (2020)")


FilteredResults <-ElectionResults %>% 
  filter(year=="2020", party_simplified=="DEMOCRAT"|party_simplified=="REPUBLICAN") %>% 
  select(state, party_simplified, candidatevotes, totalvotes) %>% 
  pivot_wider(names_from=party_simplified, values_from = candidatevotes) %>% 
  mutate(state=tolower(state)) %>% 
  group_by(state) %>% 
  summarise(DemVoteShare = round((DEMOCRAT/totalvotes)*100,2), 
            RepVoteShare = round((REPUBLICAN/totalvotes)*100,2), 
    Vote_Share_Difference = DemVoteShare-RepVoteShare) 

FilteredResults2<-FilteredResults %>% 
  mutate(state=str_to_title(state))

kable(FilteredResults2, align='cccc', 
      col.names = c("State", "Biden", "Trump", "Difference"), 
      caption="Vote Share Differential by State")


DataJoin<-StateTrend %>% 
  inner_join(FilteredResults, by=join_by(state)) %>% 
  mutate(state=str_to_title(state))

kable(DataJoin, align='cccccc', 
      col.names = c("State","Count", "Prediction (Biden)", "Actual (Biden)", "Actual (Trump)","Actual (Delta)" ), 
      caption="Date Frame Merge - Poll Prediction vs. Actual Vote Share")


PollAccuracy<-DataJoin %>% 
  mutate(PollWinner = ifelse(Dem_to_Rep>0, "Dem","Rep"), 
         ActualWinner = ifelse(Vote_Share_Difference>0, "Dem","Rep"),
         Accurate = ifelse(PollWinner==ActualWinner, "Yes", "No"), 
         region=tolower(state)) %>% 
  select(-state)

PollAccuracy2<-DataJoin %>% 
  mutate(PollWinner = ifelse(Dem_to_Rep>0, "Dem","Rep"), 
         ActualWinner = ifelse(Vote_Share_Difference>0, "Dem","Rep"),
         Accurate = ifelse(PollWinner==ActualWinner, "Yes", "No"), 
         region=tolower(state)) %>% 
  select(region, Poll_Count, Dem_to_Rep, Vote_Share_Difference, Accurate) %>% 
  mutate(region=str_to_title(region))

kable(head(PollAccuracy2), align='cc', 
      col.names = c("State", "Count", "Prediction (Biden)", "Vote Share Diff.", "Accurate?"), 
      caption="Poll Accuracy Variable Example")

StateJoin<-state %>% 
  left_join(PollAccuracy, by=join_by("region"))

ggplot(data=StateJoin)+
  geom_polygon(aes(x=long,y=lat,group=group,fill=as.factor(Accurate)),col="white",lwd=0.005)+
  scale_fill_manual(values = c("Yes"="blue",
                                "No"="red"))+
  coord_quickmap()+
  theme_void()+
  labs(title="Poll Accuracy by-State",
    fill="Polls Accurate?")+
   theme(plot.title = element_text(hjust = 0.5),
        legend.position="bottom")

StateTrend2 <- FilteredPolls %>% 
  filter(state!="") %>%
  select(state, party, question_id, pct) %>% 
  pivot_wider(names_from = party, values_from = pct) %>% 
  group_by(state) %>% 
  summarise(polltotaldem=mean(DEM), 
            polltotalrep=mean(REP)) %>% 
  mutate(Winner=ifelse(polltotaldem>polltotalrep, "Biden", "Trump"),
         state=tolower(state))

StateMerge3<-StateTrend2 %>% 
  left_join(FilteredResults,by=join_by(state)) %>% 
  select(-Vote_Share_Difference) %>% 
  mutate(demdiff=DemVoteShare-polltotaldem,
         repdiff=RepVoteShare-polltotalrep)

StateJoin3<- StateMerge3 %>% 
  left_join(state, by=c("state"="region"))

ggplot(data=StateJoin3)+
  geom_polygon(aes(x=long, y=lat, group=group, fill=demdiff))+
  scale_fill_gradient2(low="red", mid="white", high="blue", midpoint=0)+
  coord_quickmap()+
    theme_void()+
  labs(title="Over Performance vs. Polls",
    fill="Trump vs. Biden")+
   theme(plot.title = element_text(hjust = 0.5),
        legend.position="")
  

PollsterTrend <- FilteredPolls %>% 
  filter(state!="") %>%
  select(state, party, question_id, pct, pollster) %>% 
  pivot_wider(names_from = party, values_from = pct ) %>% 
  mutate(state=tolower(state))

PollsterTrend2<-PollsterTrend %>% 
  left_join(PollAccuracy, by=c("state"="region")) %>% 
  select(state,pollster,DEM, REP, DemVoteShare, RepVoteShare) %>% 
  group_by(pollster) %>% 
  summarise(PollsterDemvRep=mean(DEM-REP),
            ActualDemvRep=mean(DemVoteShare-RepVoteShare),
            PollsterDev = abs(PollsterDemvRep - ActualDemvRep)) %>% 
  arrange(-PollsterDev)

PollsterTrend3<-PollsterTrend %>% 
  left_join(PollAccuracy, by=c("state"="region")) %>% 
  select(state,pollster,DEM, REP, DemVoteShare, RepVoteShare) %>% 
  group_by(pollster) %>% 
  summarise(PollsterDemvRep=mean(DEM-REP),
            ActualDemvRep=mean(DemVoteShare-RepVoteShare),
            PollsterDev = abs(PollsterDemvRep - ActualDemvRep)) %>% 
  arrange(PollsterDev)

kable(head(PollsterTrend3,3), align='ccc',
      col.names = c("Pollster", "Predicted", "Actual", "Pollster Deviation"),
      caption="Most Accurate Pollsters - Predicted vs. Actual (Biden vs. Trump)")

kable(head(PollsterTrend2,3), align='ccc',
      col.names = c("Pollster", "Predicted", "Actual", "Pollster Deviation"),
      caption="Least Accurate Pollsters - Predicted vs. Actual (Biden vs. Trump)")

Final Exam

Tom Regan

2023-10-21

Poll Predictions of the 2020 Election