Getting the Polling Data

Here are two ways to get the polling data. The first way is to grab it directly from FiveThirtyEight.com’s GitHub page as follows:

library(RCurl)
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/pollster-ratings/raw-polls.csv"
x <- getURL(url)
raw_polls <- read.csv(text = x)

The second is to read in the .csv stored in our shared folder. Note that if the “Data_Science_Data” folder is not in your working directory, you’ll need to navigate to it first. Alternatively, you could use read.csv(file.choose()) to browse to the file interactively.

raw_polls <- read.csv("Data_Science_Data/election_forecasting/raw-polls.csv")

One advantage of writing code that draws the data directly from GitHub is that, if you return to this analysis months or even years from now, FiveThirtyEight may have added to their data set, and you can re-run your analysis using the updated data.
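
If you want the analysis to keep working when GitHub is unreachable, one option is to try the URL first and fall back to the local copy. The sketch below is only an illustration and not part of the original lab: load_polls() is a hypothetical helper name, and it assumes read.csv() can read from an https URL directly, which should hold in recent versions of R.

```r
# Hypothetical helper (not from the lab): read from a primary source
# (a URL or a file path) and fall back to a local file if that fails.
load_polls <- function(primary, fallback) {
  tryCatch(
    read.csv(primary),
    error = function(e) {
      message("Could not read ", primary, "; using ", fallback, " instead.")
      read.csv(fallback)
    }
  )
}
```

With the url defined earlier, this could be called as raw_polls <- load_polls(url, "Data_Science_Data/election_forecasting/raw-polls.csv").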

There’s a description of this data here. Take some time to look over this data set and try to understand what you’re looking at.

View(raw_polls)
summary(raw_polls)

The Easiest Elections to Poll… and other Summaries.

Here’s some code to get you started.

library(dplyr)
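# na.rm = TRUE skips any polls whose sample size is missing

# Summary by type of election
raw_polls %>% group_by(type_simple) %>% 
  summarize(n = n(), mean_error = mean(error), 
            mean_sample = mean(samplesize, na.rm = TRUE))

# Summary by year
raw_polls %>% group_by(year) %>% 
  summarize(n = n(), mean_error = mean(error), 
            mean_sample = mean(samplesize, na.rm = TRUE))

# Summary by pollster, keeping only pollsters with at least 50 polls
raw_polls %>% group_by(pollster) %>% 
  summarize(n = n(), mean_error = mean(error), 
            mean_sample = mean(samplesize, na.rm = TRUE)) %>% 
  filter(n >= 50)

# Summary by race, keeping only races with at least 50 polls
raw_polls %>% group_by(race) %>% 
  summarize(n = n(), mean_error = mean(error), 
            mean_sample = mean(samplesize, na.rm = TRUE),
            mean_bias = mean(bias, na.rm = TRUE),
            right_call = mean(rightcall, na.rm = TRUE)) %>% 
  filter(n >= 50)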

Question: Are there issues with rating pollsters by comparing their average errors in this data set? If so, what are they?

Challenge: Try to come up with a good way of comparing the quality of different pollsters.
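
As a hint rather than a full answer, here is one possible starting point: compare each poll’s error to the average error of all polls in the same race, so that a pollster is not penalized simply for polling races that are hard to call. This is a sketch on made-up toy data; the pollster names, race labels, and error values are invented, and only the column names mirror raw_polls.

```r
library(dplyr)

# Toy data (invented): pollster A only polls the "easy" race r1,
# pollster C only polls the "hard" race r2, and B polls both.
toy_polls <- data.frame(
  pollster = c("A", "B", "B", "C"),
  race     = c("r1", "r1", "r2", "r2"),
  error    = c(2, 4, 9, 11)
)

relative_errors <- toy_polls %>%
  group_by(race) %>%
  # How much worse (or better) was this poll than the race average?
  mutate(error_vs_race = error - mean(error)) %>%
  group_by(pollster) %>%
  summarize(n = n(),
            mean_error = mean(error),
            mean_error_vs_race = mean(error_vs_race))

relative_errors
```

In this toy example the raw mean errors make C look far worse than A, while the race-adjusted column shows them only slightly above and below the average for the races they polled. Whether this adjustment is adequate for the real data (for example, for pollsters with very few polls) is part of the challenge.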

Do Polls Get Better Closer to Elections?

If I want to make the best election predictions, should I average all of the polls for a given race, or am I better off looking only at the most recent poll? Or is there a solution that’s better than either of these extremes?

Your second challenge is to answer this question using the data. Here’s some code for finding the number of days (and weeks) between the poll and the election:

library(lubridate)
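# Parse the date columns, then compute the gap between poll and election.
# If the dates in your copy use four-digit years, change %y to %Y.
raw_polls <- raw_polls %>% 
  mutate(polldate = as.Date(polldate, "%m/%d/%y"), 
         electiondate = as.Date(electiondate, "%m/%d/%y"),
         days_before_election = as.numeric(difftime(electiondate, polldate, units = "days")),
         # as.numeric() keeps this a plain number, like days_before_election
         weeks_before_election = round(as.numeric(difftime(electiondate, polldate, units = "weeks"))))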
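
Once weeks_before_election exists, a natural first look is the mean error for each week before the election. The sketch below shows the pattern on made-up toy data; the numbers are invented for illustration and say nothing about what the real polls show.

```r
library(dplyr)

# Toy data (invented) with the same column names as raw_polls
toy_polls <- data.frame(
  weeks_before_election = c(0, 0, 1, 1, 2, 2),
  error                 = c(2, 4, 3, 5, 6, 8)
)

# Number of polls and mean error for each week before the election
error_by_week <- toy_polls %>%
  group_by(weeks_before_election) %>%
  summarize(n = n(), mean_error = mean(error)) %>%
  arrange(weeks_before_election)

error_by_week
```

Replacing toy_polls with raw_polls applies the same summary to the real data.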

Here’s code to find the latest poll for each election (or the latest polls, in case of a tie):

last_polls <- raw_polls %>% 
  group_by(race) %>% 
  top_n(1, desc(days_before_election)) %>% 
  ungroup()

View(last_polls)
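
To compare the two extremes directly, one approach is to compute, for each race, the error of the simple average of all polls and the error of the most recent poll. The sketch below uses made-up toy data and assumes columns named margin_poll and margin_actual (the polled and actual margins); check those names against your copy of the data before adapting it. Unlike last_polls above, which.min() keeps only the first poll when there is a tie.

```r
library(dplyr)

# Toy data (invented); column names are assumed to match raw_polls
toy_polls <- data.frame(
  race                 = c("r1", "r1", "r1", "r2", "r2"),
  days_before_election = c(20, 10, 2, 15, 5),
  margin_poll          = c(8, 6, 5, -2, 1),
  margin_actual        = c(4, 4, 4, 2, 2)
)

comparison <- toy_polls %>%
  group_by(race) %>%
  summarize(
    actual         = first(margin_actual),
    average_margin = mean(margin_poll),
    # Margin from the poll taken closest to the election
    last_margin    = margin_poll[which.min(days_before_election)]
  ) %>%
  mutate(
    average_error = abs(average_margin - actual),
    last_error    = abs(last_margin - actual)
  )

comparison
```

A weighted average that gives recent polls more weight sits between these two extremes and could be added as a third column in the same way.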