How reliable is polling data?

Leading up to the 2016 presidential election, many pollsters predicted that the Democratic candidate, Hillary Clinton, would win a “decisive victory.” However, as we all know, the election was won by the Republican candidate, and current president, Donald Trump. During class we discussed how general biases, not accounted for by prediction models, often affect many pollsters in the same way. In this homework, you are going to further investigate these biases through comparisons across both national and state-level races.

The repository for this homework includes an .RData file, elections_polls.RData, containing a data.frame (polls) with several years' worth of polling data (2008, 2010, 2012, 2014 and 2016). The polls cover federal elections for house representatives, senators and the president, and include polling data from up to a year before the election date. The Presidential election polls were collected from the RealClearPolitics website, and the Congressional and Senatorial polls were collected from the FiveThirtyEight GitHub repository.

library(tidyverse)
library(dplyr)
load("elections_polls.RData")

The polls data.frame contains the following columns (those referenced in this homework): race_state, state, year, type, enddate, democrat_poll, republican_poll, democrat_result and republican_result.

Problem 1

Subset the polls data.frame to only keep polls which ended within approximately 6 weeks preceding any Election Day (i.e. in October or November). You will be using this smaller data set for the remainder of this homework. Hint: you might need to extract the month from the enddate. The strftime function might be useful for this.

# Election Day falls in early November each cycle (e.g. November 8, 2016), so polls
# ending in October or November cover roughly the preceding 6 weeks.
# strftime() extracts the month from enddate for the filter.
polls <- polls %>%
  mutate(endmonth = strftime(enddate, "%m")) %>%
  filter(endmonth %in% c("10", "11"))

The original ‘polls’ dataframe has 6847 observations. After subsetting to polls that ended within approximately 6 weeks of an Election Day (i.e. in October or November), the reduced dataframe contains 4330 observations.
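As a quick sanity check of these counts (a sketch: it loads the raw data into a separate environment so the already-filtered polls object is not overwritten):

# Sketch: verify the observation counts quoted above without touching the filtered `polls`.
raw <- new.env()
load("elections_polls.RData", envir = raw)
nrow(raw$polls)   # 6847 polls in total
nrow(polls)       # 4330 polls ending in October or November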

Problem 2

For each poll, calculate the difference between the fraction of people saying they would vote for the Republican Party and the fraction of people saying they would vote for the Democratic Party. Add these values to your data.frame as a new column named spread. Similarly, calculate the true (actual) difference between the fraction of people who ended up voting for the Republican Party and the fraction of people who ended up voting for the Democratic Party. Again, add the true (actual) difference as a new column named spread_act to your data.frame.

polls$spread <- (polls$republican_poll - polls$democrat_poll)/100
polls$spread_act <- (polls$republican_result - polls$democrat_result)/100

From the subsetted dataframe, we use the two columns ‘democrat_poll’ and ‘republican_poll’, which give the percentage of respondents in each poll saying they would vote for the Democratic and Republican candidate, to calculate the projected difference. The difference between these two columns (divided by 100 to obtain a fraction) is stored in a new column ‘spread’.

For the second half of this problem, we use the ‘democrat_result’ and ‘republican_result’ columns, which give the actual percentage of people who voted for the Democratic and Republican candidate, to calculate the true difference. This difference is stored in a new column ‘spread_act’.
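The same two columns can also be added inside the pipe; an equivalent tidyverse formulation (a sketch that produces the same result as the base R assignments above):

# Equivalent dplyr version of the two assignments above (same result).
polls <- polls %>%
  mutate(spread     = (republican_poll   - democrat_poll)   / 100,
         spread_act = (republican_result - democrat_result) / 100)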

Problem 3

Now, we are going to collapse polls for each race. For this, we group polls by the type, year, and state of the corresponding election. There are several polls for each race, and each one provides an approximation of the real \(d\) value. Generate a point estimate for each race, \(\hat{d}\), that summarizes the polls for that race using the following steps: [1] use the column race_state to group polls by type, year, and state, and [2] use the summarize function to generate a new data.frame called reduced_polls with the following columns:

  1. the mean spread,
  2. the standard deviation of the spread,
  3. the mean spread_act, and
  4. the number of polls per race.

Make sure you also keep information about the year and state of each race in this new data.frame.

reduced_polls <- polls %>%
  group_by(race_state) %>%
  summarize(spread.mean     = mean(spread),
            spread.sd       = sd(spread),
            spread.act.mean = mean(spread_act),
            n               = n(),
            year            = unique(year),
            type            = unique(type),
            state           = unique(state))

reduced_polls
## # A tibble: 423 x 8
##    race_state     spread.mean spread.sd spread.act.mean     n  year type  
##    <chr>                <dbl>     <dbl>           <dbl> <int> <dbl> <chr> 
##  1 2008_House-G_…     -0.0725    0.015          0.0517      4  2008 House…
##  2 2008_House-G_…     -0.025     0.0636        -0.00620     2  2008 House…
##  3 2008_House-G_…      0.1      NA              0.120       1  2008 House…
##  4 2008_House-G_…     -0.015     0.0881        -0.0113      4  2008 House…
##  5 2008_House-G_…     -0.03     NA             -0.00810     1  2008 House…
##  6 2008_House-G_…      0.035     0.151          0.0894      8  2008 House…
##  7 2008_House-G_…     -0.105     0.0919        -0.145       2  2008 House…
##  8 2008_House-G_…      0.135     0.120          0.211       2  2008 House…
##  9 2008_House-G_…      0.103     0.292          0.132       3  2008 House…
## 10 2008_House-G_…     -0.0850    0.114         -0.0940      4  2008 House…
## # ... with 413 more rows, and 1 more variable: state <chr>

Problem 4

Note that the previous question merges different congressional elections held in the same year across districts in a state. Using the collapsed data.frame from the previous question, filter out races from congressional elections. Also, filter out races that had fewer than 3 polls. The reduced_polls data.frame should now contain only Presidential and Senatorial elections. For each remaining race, build a 95% confidence interval for \(\hat{d}\). Include the boundaries (upper and lower) of these confidence intervals in the reduced_polls data.frame.

In this question, we want to filter out House races so that only presidential and senatorial polls remain. We use the ‘type’ column, which indicates whether a race is presidential (Pres), senatorial (Sen-G) or a House race (House-G). We then also filter out races that had fewer than 3 polls.

reduced_polls <- 
  reduced_polls %>% filter(!grepl("House-G", type) &
                           n > 2)

reduced_polls$upper <- reduced_polls$spread.mean + 1.96 * (reduced_polls$spread.sd/sqrt(reduced_polls$n))

reduced_polls$lower <- reduced_polls$spread.mean - 1.96 * (reduced_polls$spread.sd/sqrt(reduced_polls$n))

For each remaining race, we build a 95% confidence interval. Since \(E(\bar{X})=d\) and \(SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}\), an approximate 95% confidence interval for \(d\) has upper boundary \(\hat{d} + 1.96\frac{s}{\sqrt{n}}\) and lower boundary \(\hat{d} - 1.96\frac{s}{\sqrt{n}}\), where \(\hat{d}\) is the mean spread and \(s\) is the sample standard deviation of the spread across the polls in that race. These boundaries are added to the existing reduced_polls dataframe.

Problem 5

For each election type in each year, calculate the fraction of states where the actual result was outside of the 95% confidence interval. Which race was the most unpredictable (i.e., for which race was the polling data most inaccurate compared to the actual result)?

reduced_polls$CI <- 
  ifelse(reduced_polls$spread.act.mean >= reduced_polls$lower & reduced_polls$spread.act.mean <= reduced_polls$upper, 0, 1)

reduced_polls %>% 
  group_by(year, type) %>% 
  summarise(Deviation = sum(CI)/length(CI))
## # A tibble: 7 x 3
## # Groups:   year [?]
##    year type  Deviation
##   <dbl> <chr>     <dbl>
## 1  2008 Pres      0.421
## 2  2008 Sen-G     0.615
## 3  2010 Sen-G     0.708
## 4  2012 Pres      0.667
## 5  2012 Sen-G     0.762
## 6  2014 Sen-G     0.4  
## 7  2016 Pres      0.745

An indicator column “CI” is created: 1 is assigned to each race whose actual mean spread fell outside the 95% confidence interval, and 0 to each race whose actual mean spread fell within it. Summing this column and dividing by the number of races, for each election type in each year, gives a fraction between 0 and 1, stored in the “Deviation” column. Fractions near 0 indicate that the confidence intervals usually captured the true spread; fractions near 1 indicate that the predicted spread frequently missed the actual result. Looking at the summary above, the most unpredictable race was the 2012 Senate race, where about 76% of states fell outside their confidence intervals.
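The same conclusion can be pulled out programmatically (a sketch; mean(CI) is equivalent to sum(CI)/length(CI) above):

# Sketch: extract the race type/year with the largest fraction of misses.
reduced_polls %>%
  group_by(year, type) %>%
  summarise(Deviation = mean(CI)) %>%
  ungroup() %>%
  filter(Deviation == max(Deviation))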

Problem 6

Using data from only the 2016 presidential election, make a plot of states (\(x\)-axis) and \(\hat{d}\) estimates (\(y\)-axis). Using the geom_errorbar function, include the 95% confidence intervals of \(\hat{d}\) for each state. Finally, using a different color, include the actual results for each state. Describe the resulting plot.

reduced_polls %>% 
  filter(year == 2016, type == "Pres") %>%
  ggplot(aes(reorder(state, spread.mean), spread.mean)) + 
  geom_point() + 
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.4) + 
  geom_point(aes(state, spread.act.mean), col = "lightblue") +
  geom_hline(yintercept=0, col = "grey") + 
  theme_classic() + 
  ylab("Estimates") +
  xlab("State") + 
  ggtitle("2016 Presidential Election Prediction vs. Estimate") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme(plot.title = element_text(hjust = 0.5))

This plot displays the predicted and actual spread of votes for the 2016 presidential election. Each black dot is the predicted spread for a state, with positive estimates favoring the Republican candidate and negative estimates favoring the Democratic candidate. The black error bar for each state is the 95% confidence interval around the predicted spread. The blue dot is the actual spread observed in the election.

For easier interpretation, a horizontal line was drawn at y = 0 to distinguish whether each state was predicted to favor the Democrats or the Republicans. This line shows that roughly half of the states were predicted to favor the Democrats and the other half the Republicans (a quick count below confirms this).

States that are generally blue, such as California, were indeed in favor of Hillary Clinton, whereas states that are typically red, such as Texas, were in favor of Donald Trump, consistent with the actual election result. Although the predictions split the states roughly in half, the actual spread (denoted by the blue dots) was in favor of Donald Trump in the 2016 presidential election.
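A rough check of that split (a sketch; it simply counts states by the sign of the predicted and actual spreads):

# Sketch: count 2016 states by the sign of the predicted and actual spread.
reduced_polls %>%
  filter(year == 2016, type == "Pres") %>%
  summarise(predicted_rep = sum(spread.mean > 0),
            predicted_dem = sum(spread.mean < 0),
            actual_rep    = sum(spread.act.mean > 0),
            actual_dem    = sum(spread.act.mean < 0))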

Problem 7

Which states did Donald Trump win in the 2016 presidential election, despite the entire 95% confidence intervals being in favor of his opponent, Hillary Clinton?

Recall how both spread columns are calculated: the Democratic fraction is subtracted from the Republican fraction. Therefore, for the entire 95% confidence interval to be in favor of Hillary Clinton while Donald Trump won the state, both the lower and upper boundaries of the confidence interval must be negative while the actual result (spread_act) is positive.

reduced_polls %>%
  filter(year == 2016, type == "Pres", lower < 0 & upper < 0, spread.act.mean > 0)
## # A tibble: 5 x 11
##   race_state spread.mean spread.sd spread.act.mean     n  year type  state
##   <chr>            <dbl>     <dbl>           <dbl> <int> <dbl> <chr> <chr>
## 1 2016_Pres…     -0.0120    0.0363          0.012     81  2016 Pres  FL   
## 2 2016_Pres…     -0.0489    0.0395          0.002     49  2016 Pres  MI   
## 3 2016_Pres…     -0.0183    0.0372          0.0360    69  2016 Pres  NC   
## 4 2016_Pres…     -0.0432    0.0314          0.007     71  2016 Pres  PA   
## 5 2016_Pres…     -0.0627    0.0420          0.007     45  2016 Pres  WI   
## # ... with 3 more variables: upper <dbl>, lower <dbl>, CI <dbl>

Donald Trump won the 2016 presidential election in Florida (I hear your sentiment), Michigan, North Carolina, Pennsylvania, and Wisconsin, despite the entire 95% confidence intervals being in favor of his opponent, Hillary Clinton.

Problem 8

Looking again at all races, calculate the difference between \(d\) and \(\hat{d}\) (Hint: use the data for all races in the reduced_polls object created in Problem 4). We call this the bias term. Add these values as a column to reduced_polls.

reduced_polls$biasterm <- reduced_polls$spread.act.mean - reduced_polls$spread.mean

Problem 9

Plot and compare the distribution of bias terms for races in each year. Describe the bias patterns. Are these centered around zero? Give possible explanations.

reduced_polls %>%
  ggplot(aes(as.factor(year), biasterm, color = type)) + 
  geom_boxplot() + 
  theme_classic() + 
  labs(x= "Year", y = "Bias term") + 
  ggtitle("Distribution of bias terms for races in each year") +
  geom_hline(yintercept=0, col = "grey") + 
  theme(plot.title = element_text(hjust = 0.5))

We visualized the distribution of the bias terms for the races in each election year using boxplots. First, at a quick glance, there do not appear to be many extreme outliers in any election. A horizontal line was drawn at y = 0 to examine whether the bias terms are centered around zero. Comparing the boxplots with this line, the bias appears to have been in favor of the Democrats in the earlier elections (2010 and 2012); this trend then shifts in 2014 and 2016, where the bias appears to favor the Republicans.

The 2008 elections (both presidential and senatorial) appear to be the most closely centered around zero. This coincides with our finding from Problem 5, where the 2008 presidential election was among the most predictable. Conversely, Problem 5 identified the 2016 presidential election as one of the most unpredictable, which also matches the boxplot: the 2016 presidential election has the median farthest from the zero line of all the elections, in addition to a conspicuous outlier with a bias term close to -0.2.
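The medians that the boxplots are drawn around can also be tabulated directly (a sketch):

# Sketch: median bias term per election type and year, matching the boxplot medians above.
reduced_polls %>%
  group_by(year, type) %>%
  summarise(median_bias = median(biasterm))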

Problem 10

Using the fiftystater package, create a plot for each of the last three presidential elections showing the bias estimates for each state on a map of the United States. Describe any patterns or differences between the three elections.

Instead of the ‘fiftystater’ package, we will use a new package ‘usmap’ to create the maps of the United States.

library(usmap)

# Keep only the presidential races for each of the last three election years.
election2008 <- reduced_polls[reduced_polls$year == 2008 & reduced_polls$type == "Pres", ]
election2012 <- reduced_polls[reduced_polls$year == 2012 & reduced_polls$type == "Pres", ]
election2016 <- reduced_polls[reduced_polls$year == 2016 & reduced_polls$type == "Pres", ]

plot_usmap(data = election2008, values = "biasterm", labels = TRUE) +
  scale_fill_continuous(low = "white", high = "purple", name = "Bias term") +
  theme(legend.position = "right") +
  labs(title = "2008 Presidential election - distribution of bias term across the states")

plot_usmap(data = election2012, values = "biasterm", labels = TRUE) +
  scale_fill_continuous(low = "white", high = "purple", name = "Bias term") +
  theme(legend.position = "right") +
  labs(title = "2012 Presidential election - distribution of bias term across the states")

plot_usmap(data = election2016, values = "biasterm", labels = TRUE) +
  scale_fill_continuous(low = "white", high = "purple", name = "Bias term") +
  theme(legend.position = "right") +
  labs(title = "2016 Presidential election - distribution of bias term across the states")

The last three presidential elections were held in 2008, 2012, and 2016. Three separate dataframes were derived from reduced_polls to visualize the distribution of the bias term across states for each presidential election. Overall, many states show a bias near zero in 2008, which is also what the previous problem suggested. There is more variation in the 2012 election, and the 2016 map shows much darker shading overall.

Based on the color maps of the United States, it appears that, on average, many states transitioned from negative bias terms to positive bias terms. Recall that the bias term is calculated as \(d - \hat{d}\). Because we calculated the spread as the Republican result minus the Democratic result, a positive bias term indicates that the actual result was more favorable to the Republicans than the polls predicted.

These distributions make sense because the Democrats won the 2008 and 2012 elections, whereas the Republicans won the 2016 presidential election. It is also reasonable that the 2016 map has a much darker shade, since many prominent political outlets, including Politico, described the 2016 election as one of the most unpredictable.

We can also look at these maps at a more individual level. For example, California is a state that usually favors the Democrats, and in all three presidential elections California has a negative bias term, in favor of the Democrats.
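This can be checked directly (a sketch; the state abbreviation "CA" is assumed, matching the abbreviations shown in the Problem 7 output):

# Sketch: California's bias term in each presidential election year.
reduced_polls %>%
  filter(state == "CA", type == "Pres") %>%
  select(year, biasterm)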