I previously made a Mittens-friendly simulation of the election out of curiosity (http://rpubs.com/gzt/mittenssimulation), but here's a rather simple “unbiased” simulation based on the same data. I'll walk through what I do below, and I'll show you my source code because why not. This is a lazy simulation and the code is ugly and hacked together, but it does what I say it does.
library("plyr")
library("ggplot2")
polls <- read.csv("http://www.electoral-vote.com/evp2012/Pres/pres_polls.csv")
head(polls)
## Day Len State EV Dem GOP Ind Date X X.1 X.2 X.3 X.4 X.5 X.6
## 1 226.5 4 Alabama 9 36 54 NA Aug 16 NA NA NA NA NA NA NA
## 2 177.0 3 Alabama 9 36 51 NA Jun 27 NA NA NA NA NA NA NA
## 3 1.0 1 Alabama 9 39 60 NA Jan 01 NA NA NA NA NA NA NA
## 4 1.0 1 Alaska 3 38 62 NA Jan 01 NA NA NA NA NA NA NA
## 5 306.5 2 Arizona 11 46 53 NA Nov 03 NA NA NA NA NA NA NA
## 6 294.0 1 Arizona 11 44 52 NA Oct 21 NA NA NA NA NA NA NA
## Pollster
## 1 Capital Survey-4
## 2 Capital Survey-3
## 3 Election 2008-1
## 4 Election 2008-1
## 5 PPP-2
## 6 Rasmussen-1
This is what the polling data look like and how to get them, and the file continues like this all the way down. Note that some states have multiple polls, while some less interesting states have no polling data in the set and instead carry only 2008 voting results (these are the rows with \( Day = 1.0 \) or \( Date = Jan 01 \)). Those are also the boring sorts of states that aren't going to change in this election.
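If you're curious which states fall into that 2008-only bucket, a quick throwaway check along these lines would do it (this snippet and the name only2008 are mine, not part of the simulation):

# States whose only "poll" in the file is the 2008 result (Day == 1 rows):
only2008 <- ddply(polls, .(State), summarise, haspolls = any(Day > 1))
subset(only2008, !haspolls)$State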
The first order of business is to compute the margin between Mittens and Bronco and add a “margin of error” variable, which is just whichever is larger of 5 and half the range of a state's margins. I also removed a pollster who gave very bad results.
polls1 <- ddply(polls, .(State, Day, Date, Pollster, EV), summarise,
                diff = GOP - Dem)
# Keep only polls within 30 days of each state's most recent poll,
# and drop the one bad pollster:
polls1 <- subset(ddply(polls1, .(State), transform, newdate = max(Day) - Day),
                 (newdate < 30) & (Pollster != "Angus-Reid-3"))
# Margin of error: half the spread of a state's margins, floored at 5:
polls1 <- ddply(polls1, .(State), transform,
                MoE = max(0.5 * (max(diff) - min(diff)), 5))
I'm strapped for time at the moment, so I'm being lazy and using an odd definition of “most recent” polls: for each state, I take its newest poll and go back 30 days from there. I'll go back later and change it to take the polls from the last month, unless a state hasn't had one, in which case it would take that state's single most recent poll (a sketch of that rule follows). It's just a little too much mucking around with \( ddply \) for me right now.
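For reference, here's roughly what that intended rule might look like. This is a hypothetical sketch against the raw polls frame, and lastday and recent are names I've made up; it is not what the code above does:

# Hypothetical version of the intended rule: keep each state's polls from
# the final 30 days of the season, falling back to the single most recent
# poll for states with nothing in that window.
lastday <- max(polls$Day)
recent <- ddply(polls, .(State), function(df) {
    inwindow <- subset(df, Day > lastday - 30)
    if (nrow(inwindow) > 0) inwindow else df[which.max(df$Day), ]
})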
# Per-state point estimate: the median margin across the retained polls.
avgresult <- ddply(polls1, .(State, EV, MoE), summarise, maxdiff = median(diff))
simulationresult <- function(statematrix) {
    # One simulated election: jitter each state's margin by N(0, MoE)
    # noise and total the electoral votes of states where the GOP leads.
    # Columns: 2 = EV, 3 = MoE, 4 = maxdiff (median margin).
    n <- nrow(statematrix)
    randvector <- rnorm(n, 0, 1)
    votevector <- statematrix[, 4] + randvector * statematrix[, 3]
    sum((votevector > 0) * statematrix[, 2])
}
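A single call gives one simulated election, returning Mittens' electoral-vote total for that draw. For instance (the set.seed call is mine, added just to make the draw repeatable):

set.seed(42)  # not in the original; makes this one draw reproducible
simulationresult(avgresult)  # one simulated GOP electoral-vote total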
simlength <- 1e+05
y <- rep(0, simlength)
for (i in 1:simlength) {
    y[i] <- simulationresult(avgresult)
}
# Share of simulations where Mittens reaches 270 electoral votes:
percenty <- sum(y >= 270)/length(y)
z <- (y < 270)  # TRUE where Bronco wins; used to color the histogram
So it looks like Mittens wins 16.4 percent of the time.
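As an aside, the explicit loop could be written more idiomatically with \( replicate \), and with \( 10^5 \) draws the Monte Carlo error on that 16.4% figure is negligible. A quick sketch, using the objects defined above:

# Equivalent, slightly more idiomatic alternative to the for loop:
y <- replicate(simlength, simulationresult(avgresult))
# Back-of-the-envelope Monte Carlo standard error on the win probability:
# sqrt(p * (1 - p) / n) is about 0.0012 here, so the headline number is
# stable to roughly a tenth of a percentage point.
sqrt(percenty * (1 - percenty)/simlength)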
# Histogram of simulated GOP electoral-vote totals, colored by winner:
qplot(y, geom = "histogram", binwidth = 1, fill = z)
If we take the average of recent polls, Bronco wins more often than not (as of 11/4, 83.6 percent of the time). Of course, some of these polls are of low quality, some are “known” to be skewed, and perhaps the most recent polls should be weighted more heavily than polls even a couple of weeks old, but I haven't done anything with that. You can see everything I did right here, and you can fiddle with the parameters yourself. Leaving that one pollster in means more Obama victories, and less noise also means more Obama victories. I picked 5 as my default “margin of error” out of a hat; it coincidentally produces results rather close to Silver's. A different way of producing a margin of error is certainly possible (one variant is sketched below), but I think this amount of noise is more generous than reality.
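For example, one hypothetical alternative is to base each state's noise on the standard deviation of its poll margins rather than half their range. This is my sketch, not something the code above does, and the floors of 1 and 5 are arbitrary choices of mine:

# Hypothetical variant: per-state MoE from the spread of its poll margins,
# falling back to 5 for states with only a single poll.
polls2 <- ddply(polls1, .(State), transform,
                MoE = if (length(diff) > 1) max(sd(diff), 1) else 5)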