Please do not cite without permission from author. Findings are preliminary.
In this document, I attempt to recreate a 2012 Primary Polling Accuracy analysis done by Nate Silver of FiveThirtyEight (then published by the New York Times). In the analysis, I will look at all Primary and Caucus polls held within 50 days of a Republican election contest and attempt to track their accuracy. The plot presented below suggests convergence towards the truth as the polls approach the date of the election; however, there are substantial outliers in this cycle.
Below, I will demonstrate how I performed this analysis. First, I will show how I formatted the data; second, I will demonstrate how I calculated the election margins; third, I will show how I imported polling; and, fourth, I will tie it all together into a database, culminating in the plot presented below.
Note: The code used below is not optimized for speed and does require some hardcoding.
In this section, I import data provided by Real Clear Politics. Then, I convert the columns of the data frame to usable formats by removing commas from numerical strings and converting election dates to Date formats.
# Loading Data, data taken from Real Clear Politics and prepared by author
GOP <- read.csv("https://raw.githubusercontent.com/apodkul/Data/Data/Republican2016.csv")
# Formatting Data
GOP$Popular.Trump <- as.numeric(gsub(",", "", GOP$Popular.Trump))
GOP$Popular.Cruz <- as.numeric(gsub(",", "", GOP$Popular.Cruz))
GOP$Popular.Rubio <- as.numeric(gsub(",", "", GOP$Popular.Rubio))
GOP$Popular.Kasich <- as.numeric(gsub(",", "", GOP$Popular.Kasich))
GOP$Date <- paste0("2016-", GOP$Date)
GOP$Date <- as.Date(GOP$Date, format = "%Y-%d-%b")
Here, I find in each election cycle the first place and second place winner in each election and determine the election margin between first and second place, in accordance with Silver’s stated method. Note: This calculation only considers the percentages of the vote in accordance to the popular vote for Trump, Cruz, Rubio, and Kasich. A more accurate analysis would gather actual vote counts from secretaries of state offices and incorporate all of the candidates listed on the ballot.
# Creating baseline
GOP$FirstPlace.Percent <- NA
GOP$SecondPlace.Percent <- NA
GOP$FirstPlace.Name <- NA
GOP$SecondPlace.Name <- NA
for(i in 1:nrow(GOP)){
if(!is.na(GOP$Popular.Trump[i])){
GOP$FirstPlace.Percent[i] <- max(GOP$Popular.Trump[i], GOP$Popular.Cruz[i], GOP$Popular.Rubio[i], GOP$Popular.Kasich[i])/(sum(GOP$Popular.Trump[i], GOP$Popular.Cruz[i], GOP$Popular.Rubio[i], GOP$Popular.Kasich[i]))
GOP$SecondPlace.Percent[i] <- sort(c(GOP$Popular.Trump[i], GOP$Popular.Cruz[i], GOP$Popular.Rubio[i], GOP$Popular.Kasich[i]), decreasing=T)[2]/(sum(GOP$Popular.Trump[i], GOP$Popular.Cruz[i], GOP$Popular.Rubio[i], GOP$Popular.Kasich[i]))
GOP$FirstPlace.Name[i] <- c("Trump", "Cruz", "Rubio", "Kasich")[which.max(c(GOP$Popular.Trump[i], GOP$Popular.Cruz[i], GOP$Popular.Rubio[i], GOP$Popular.Kasich[i]))]
GOP$SecondPlace.Name[i] <- c("Trump", "Cruz", "Rubio", "Kasich")[which(c(GOP$Popular.Trump[i], GOP$Popular.Cruz[i], GOP$Popular.Rubio[i], GOP$Popular.Kasich[i])==max(c(GOP$Popular.Trump[i], GOP$Popular.Cruz[i], GOP$Popular.Rubio[i], GOP$Popular.Kasich[i])[c(GOP$Popular.Trump[i], GOP$Popular.Cruz[i], GOP$Popular.Rubio[i], GOP$Popular.Kasich[i])!=max(c(GOP$Popular.Trump[i], GOP$Popular.Cruz[i], GOP$Popular.Rubio[i], GOP$Popular.Kasich[i]))]))]
}
}
I then imported the .csv files of polls consolidated by the Huffington Post Pollster. I included these urls by hand because they do not all follow the same format (e.g. “Iowa-Presidential” compared to “Texas Republican”, etc.). I excluded files that did not include any polls within the 50 day window. An alternative approach would be to use the API data but since there were only about 30 states the “labor-intensive” copy and paste approach sufficed.
#Getting Pollster Data from Huffington Post Pollster
GOP.poll.links <- c("http://elections.huffingtonpost.com/pollster/2016-iowa-presidential-republican-caucus.csv",
"http://elections.huffingtonpost.com/pollster/2016-new-hampshire-presidential-republican-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-south-carolina-presidential-republican-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-nevada-republican-presidential-caucus.csv",
"http://elections.huffingtonpost.com/pollster/2016-texas-republican-presidential-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-georgia-republican-presidential-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-massachusetts-republican-presidential-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-tennessee-republican-presidential-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-virginia-republican-presidential-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-alabama-republican-presidential-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-minnesota-republican-presidential-caucus.csv",
"http://elections.huffingtonpost.com/pollster/2016-oklahoma-republican-presidential-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-arkansas-republican-presidential-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-vermont-republican-presidential-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-alaska-presidential-republican-caucus.csv",
"http://elections.huffingtonpost.com/pollster/2016-maine-republican-presidential-caucus.csv",
"http://elections.huffingtonpost.com/pollster/2016-kansas-republican-presidential-caucus.csv",
"http://elections.huffingtonpost.com/pollster/2016-louisiana-republican-presidential-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-michigan-presidential-republican-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-mississippi-presidential-republican-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-idaho-republican-presidential-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-florida-presidential-republican-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-illinois-presidential-republican-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-missouri-presidential-republican-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-north-carolina-republican-presidential-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-ohio-republican-presidential-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-arizona-presidential-republican-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-utah-presidential-republican-primary.csv",
"http://elections.huffingtonpost.com/pollster/2016-wisconsin-presidential-republican-primary.csv")
list.of.poll.files <- c()
for(i in 1:length(GOP.poll.links)){
filename <- substr(GOP.poll.links[i], 51, 57)
filename <- sub("-", "", filename)
eval(parse(text=paste0(filename, '<- read.csv("', GOP.poll.links[i], '")')))
list.of.poll.files <- c(list.of.poll.files, filename)
}
Next, I aimed to produce a data frame that has three columns – type of race (Caucus or Primary), polling error, and days before the primary. The unit of analysis is each poll. Days before the primary was calculated according the end date of polling. Polling error was calculated according to Silver’s method by looking at the difference between the top two candidates in the result compared to those two candidates in the poll.
# Tie it all together
errors <- c()
daysout <- c()
type.of.contest <- c()
GOP <- GOP[!is.na(GOP$Popular.Trump),]
GOP <- GOP[order(GOP$State),]
GOP <- GOP[GOP$State!="District of Columbia"& GOP$State!="Hawaii"& GOP$State!="Kentucky",]
list.of.poll.files <- list.of.poll.files[order(list.of.poll.files)]
#cycle through states,
for(i in 1:nrow(GOP)){
first <- GOP$FirstPlace.Name[i]
second <- GOP$SecondPlace.Name[i]
margin <- (GOP$FirstPlace.Percent[i] - GOP$SecondPlace.Percent[i])*100
type <- as.character(GOP$Type[i])
election.day<- GOP$Date[i]
polls <- eval(parse(text=list.of.poll.files[i]))
polls <- polls[as.numeric(election.day-as.Date(polls$End.Date))<50,]
if(nrow(polls)>0){
polls$Margin <- eval(parse(text=paste0("polls$", first, "-", "polls$", second)))
polls$type <- type
for(ii in 1:nrow(polls)){
errors <- c(errors, abs(margin - polls$Margin[ii]))
daysout <- c(daysout, as.numeric(election.day-as.Date(polls$End.Date[ii])))
type.of.contest <- c(type.of.contest, polls$type[ii])
}
}
}
st.data <- data.frame(errors, daysout, type.of.contest)
Finally, I produced a plot of this data with the line represented a smooth spline with a smoothing paramter of .5
Future analyses should be more careful about calculating the actual margin of victory.