Somebody on some site the other day was talking about Nate Silver's forecasts, forecasts in general, skepticism about odds, and what-have-you. So I decided to make a quick-and-dirty simulation in an afternoon with the following assumptions:
Of course, this is so unrealistic, it's almost about two different people, Mittens and Bronco Bama, but let's roll with it. I found, on http://www.electoral-vote.com, some polling data. Here's what it looks like:
library("plyr")
library("ggplot2")
polls <- read.csv("http://www.electoral-vote.com/evp2012/Pres/pres_polls.csv")
head(polls)
## Day Len State EV Dem GOP Ind Date X X.1 X.2 X.3 X.4 X.5 X.6
## 1 226.5 4 Alabama 9 36 54 NA Aug 16 NA NA NA NA NA NA NA
## 2 177.0 3 Alabama 9 36 51 NA Jun 27 NA NA NA NA NA NA NA
## 3 1.0 1 Alabama 9 39 60 NA Jan 01 NA NA NA NA NA NA NA
## 4 1.0 1 Alaska 3 38 62 NA Jan 01 NA NA NA NA NA NA NA
## 5 306.5 2 Arizona 11 46 53 NA Nov 03 NA NA NA NA NA NA NA
## 6 294.0 1 Arizona 11 44 52 NA Oct 21 NA NA NA NA NA NA NA
## Pollster
## 1 Capital Survey-4
## 2 Capital Survey-3
## 3 Election 2008-1
## 4 Election 2008-1
## 5 PPP-2
## 6 Rasmussen-1
And there's stuff all the way down. Note that there are multiple polls for some states and some less interesting states have no polling data in the set, but, rather, have only 2008 voting data (these are polls with \( Day = 1.0 \) or \( Date = Jan 01 \)). Those are also the boring sorts of states that aren't going to change in the election.
The first order of business is to compute the difference between Mittens and Bronco (adding 2 to “unskew”“) and add a "margin of error” variable which will just be a constant (3).
polls1 <- ddply(polls, .(State, Day, Date, Pollster, EV), summarise, diff = (GOP +
2 - Dem), MoE = 5)
polls1 <- subset(ddply(polls1, .(State), transform, newdate = (max(Day) - Day)),
((newdate < 30) & (Pollster != "Angus-Reid-3")))
I'm strapped for time at the moment, so I'm being lazy and looking at an odd definition of “most recent” polls - I look at the newest poll and go back 30 days. I'll go back later and change it to take the best Mittens poll in the last month unless there hasn't been one, in which case it will take the most recent Mittens poll. It's just a little too much mucking around with \( ddply \) for me right now. I'll also make an “average” poll data frame.
bestresult <- ddply(polls1, .(State, EV, MoE), summarise, maxdiff = max(diff))
avgresult <- ddply(polls1, .(State, EV, MoE), summarise, maxdiff = mean(diff))
This is far too generous, but we'll work with it for the moment. I also calculate an average result which is an average of the last month's polls with our bias. Now we'll just do 10,000 simulations of each. I'll write a function that outputs how many EVs Mittens gets.
simulationresult <- function(statematrix) {
n <- dim(statematrix)[1]
randvector <- runif(n, -1, 1)
votevector <- statematrix[, 4] + randvector * statematrix[, 3]
simulationresult <- sum((votevector > 0) * statematrix[, 2])
}
simlength <- 1e+05
z <- rep(0, simlength)
y <- rep(0, simlength)
for (i in 1:simlength) {
z[i] = simulationresult(bestresult)
y[i] = simulationresult(avgresult)
}
percent <- sum(z >= 270)/length(z)
colorz <- (z < 270)
percenty <- sum(y >= 270)/length(y)
colory <- (y < 270)
So it looks like Mittens wins 99.839 percent of the time in the generous simulation and 48.488 percent of the time in the average simulation.
qplot(z, geom = "histogram", binwidth = 1, fill = colorz)
qplot(y, geom = "histogram", binwidth = 1, fill = colory)
The first plot is the histogram of Mittens' results in the generous, best poll result + 2 scenario. The second plot is the histogram of the Mittens' results in the average of the last month's polls + 2 scenario. If we cherry-pick the best poll and add 2 points, Mittens will win. If we take the average of recent polls and add 2 points, Bronco wins more often than not (as of 11/4, 51.512 percent of the time).