Election Forecasting

United States Presidential Elections

  • A president is elected every four years

  • Generally, only two competitive candidates
    • Republican
    • Democratic

The Electoral College

  • The United States have 50 states
  • Each assigned a number of electoral votes based on population
    • Most votes: 55 (California)
    • Least votes: 3 (multiple states)
    • Reassigned periodically based on population changes
  • Winner takes all: candidate with the most votes in a state gets all its electoral votes
  • Candidate with most electoral votes win election

2000 Election: Bush vs Gore

Election Prediction

  • Goal: Use polling data to predict state winners
  • Then - New York Times columnist Nate Silver famously took on this task for the 2012 election

The Dataset

  • Data from RealClearPolitics.com
  • Instances represent a state in a given election
    • State: Name of state
    • Year: Election year (2004, 2008, 20012)
  • Dependent variable
    • Republican: 1 if Republican won state, 0 if Democrat won
  • Independent variables
    • Rasmussen, SurveyUSA: Polled R% - Polled D%
    • DiffCount: Polls with R winner - Polls with D winner
    • PrepR: Polls with R winner / # polls

Simple Approaches to Missing Data

  • Delete the missing observations
    • We would be throwing away more than 50% of the data
    • We want to predict for all states
  • Delete variables with missing values
    • We want to retain data from Rasmussen/SurveyUSA
  • Fill missing data points with average values
    • The average value for a poll will be close to 0 (tie between Democrat and Republican)
    • If other polls in a state favor one candidate, the missing one probably would have, too

Multiple Imputation

  • Fill in missing values based on non-missing values
    • If Rasmussen is very negative, then a missing SurveyUSA value will likely be negative
    • Just like sample.split results will differ between runs unless you fix the random seed
  • Although the method is complicated, we can use it easily through R’s libraries
  • We will use Multiple Imputation by Chained Equations (mice) package

Election Forecasting in R

Load in the data

# Read in data
polling = read.csv("PollingData.csv")
str(polling)
## 'data.frame':    145 obs. of  7 variables:
##  $ State     : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 3 4 4 4 ...
##  $ Year      : int  2004 2008 2004 2008 2004 2008 2012 2004 2008 2012 ...
##  $ Rasmussen : int  11 21 NA 16 5 5 8 7 10 NA ...
##  $ SurveyUSA : int  18 25 NA NA 15 NA NA 5 NA NA ...
##  $ DiffCount : int  5 5 1 6 8 9 4 8 5 2 ...
##  $ PropR     : num  1 1 1 1 1 ...
##  $ Republican: int  1 1 1 1 1 1 1 1 1 1 ...
z = table(polling$Year)
kable(z)
Var1 Freq
2004 50
2008 50
2012 45
z = summary(polling)
kable(z)
State Year Rasmussen SurveyUSA DiffCount PropR Republican
Arizona : 3 Min. :2004 Min. :-41.0000 Min. :-33.0000 Min. :-19.000 Min. :0.0000 Min. :0.0000
Arkansas : 3 1st Qu.:2004 1st Qu.: -8.0000 1st Qu.:-11.7500 1st Qu.: -6.000 1st Qu.:0.0000 1st Qu.:0.0000
California : 3 Median :2008 Median : 1.0000 Median : -2.0000 Median : 1.000 Median :0.6250 Median :1.0000
Colorado : 3 Mean :2008 Mean : 0.0404 Mean : -0.8243 Mean : -1.269 Mean :0.5259 Mean :0.5103
Connecticut: 3 3rd Qu.:2012 3rd Qu.: 8.5000 3rd Qu.: 8.0000 3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.:1.0000
Florida : 3 Max. :2012 Max. : 39.0000 Max. : 30.0000 Max. : 11.000 Max. :1.0000 Max. :1.0000
(Other) :127 NA NA’s :46 NA’s :71 NA NA NA

Load mice package

# Load mice package
library(mice)

Multiple imputations

# Multiple imputation
simple = polling[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")]
z = summary(simple)
kable(z)
Rasmussen SurveyUSA PropR DiffCount
Min. :-41.0000 Min. :-33.0000 Min. :0.0000 Min. :-19.000
1st Qu.: -8.0000 1st Qu.:-11.7500 1st Qu.:0.0000 1st Qu.: -6.000
Median : 1.0000 Median : -2.0000 Median :0.6250 Median : 1.000
Mean : 0.0404 Mean : -0.8243 Mean :0.5259 Mean : -1.269
3rd Qu.: 8.5000 3rd Qu.: 8.0000 3rd Qu.:1.0000 3rd Qu.: 4.000
Max. : 39.0000 Max. : 30.0000 Max. :1.0000 Max. : 11.000
NA’s :46 NA’s :71 NA NA

Split the data

# Split the data
set.seed(144)
imputed = complete(mice(simple))
## 
##  iter imp variable
##   1   1  Rasmussen  SurveyUSA
##   1   2  Rasmussen  SurveyUSA
##   1   3  Rasmussen  SurveyUSA
##   1   4  Rasmussen  SurveyUSA
##   1   5  Rasmussen  SurveyUSA
##   2   1  Rasmussen  SurveyUSA
##   2   2  Rasmussen  SurveyUSA
##   2   3  Rasmussen  SurveyUSA
##   2   4  Rasmussen  SurveyUSA
##   2   5  Rasmussen  SurveyUSA
##   3   1  Rasmussen  SurveyUSA
##   3   2  Rasmussen  SurveyUSA
##   3   3  Rasmussen  SurveyUSA
##   3   4  Rasmussen  SurveyUSA
##   3   5  Rasmussen  SurveyUSA
##   4   1  Rasmussen  SurveyUSA
##   4   2  Rasmussen  SurveyUSA
##   4   3  Rasmussen  SurveyUSA
##   4   4  Rasmussen  SurveyUSA
##   4   5  Rasmussen  SurveyUSA
##   5   1  Rasmussen  SurveyUSA
##   5   2  Rasmussen  SurveyUSA
##   5   3  Rasmussen  SurveyUSA
##   5   4  Rasmussen  SurveyUSA
##   5   5  Rasmussen  SurveyUSA
z = summary(imputed)
kable(z)
Rasmussen SurveyUSA PropR DiffCount
Min. :-41.000 Min. :-33.000 Min. :0.0000 Min. :-19.000
1st Qu.: -8.000 1st Qu.:-11.000 1st Qu.:0.0000 1st Qu.: -6.000
Median : 3.000 Median : 1.000 Median :0.6250 Median : 1.000
Mean : 1.731 Mean : 1.517 Mean :0.5259 Mean : -1.269
3rd Qu.: 11.000 3rd Qu.: 18.000 3rd Qu.:1.0000 3rd Qu.: 4.000
Max. : 39.000 Max. : 30.000 Max. :1.0000 Max. : 11.000
polling$Rasmussen = imputed$Rasmussen
polling$SurveyUSA = imputed$SurveyUSA
z = summary(polling)
kable(z)
State Year Rasmussen SurveyUSA DiffCount PropR Republican
Arizona : 3 Min. :2004 Min. :-41.000 Min. :-33.000 Min. :-19.000 Min. :0.0000 Min. :0.0000
Arkansas : 3 1st Qu.:2004 1st Qu.: -8.000 1st Qu.:-11.000 1st Qu.: -6.000 1st Qu.:0.0000 1st Qu.:0.0000
California : 3 Median :2008 Median : 3.000 Median : 1.000 Median : 1.000 Median :0.6250 Median :1.0000
Colorado : 3 Mean :2008 Mean : 1.731 Mean : 1.517 Mean : -1.269 Mean :0.5259 Mean :0.5103
Connecticut: 3 3rd Qu.:2012 3rd Qu.: 11.000 3rd Qu.: 18.000 3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.:1.0000
Florida : 3 Max. :2012 Max. : 39.000 Max. : 30.000 Max. : 11.000 Max. :1.0000 Max. :1.0000
(Other) :127 NA NA NA NA NA NA
Train = subset(polling, Year == 2004 | Year == 2008)
Test = subset(polling, Year == 2012)

Baseline

# Smart Baseline
z = table(Train$Republican)
kable(z)
Var1 Freq
0 47
1 53
sign(20)
## [1] 1
sign(-10)
## [1] -1
sign(0)
## [1] 0
z = table(sign(Train$Rasmussen))
kable(z)
Var1 Freq
-1 42
0 3
1 55
z = table(Train$Republican, sign(Train$Rasmussen))
kable(z)
-1 0 1
0 42 2 3
1 0 1 52

Multicollinearity

# Multicollinearity
cor(Train[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount", "Republican")])
##            Rasmussen SurveyUSA     PropR DiffCount Republican
## Rasmussen  1.0000000 0.9194508 0.8404803 0.5124098  0.8021191
## SurveyUSA  0.9194508 1.0000000 0.8756581 0.5541816  0.8205806
## PropR      0.8404803 0.8756581 1.0000000 0.8273785  0.9484204
## DiffCount  0.5124098 0.5541816 0.8273785 1.0000000  0.8092777
## Republican 0.8021191 0.8205806 0.9484204 0.8092777  1.0000000

Logistic Regression

# Logistic Regression Model
mod1 = glm(Republican~PropR, data=Train, family="binomial")
summary(mod1)
## 
## Call:
## glm(formula = Republican ~ PropR, family = "binomial", data = Train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.22880  -0.06541   0.10260   0.10260   1.37392  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -6.146      1.977  -3.108 0.001882 ** 
## PropR         11.390      3.153   3.613 0.000303 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.269  on 99  degrees of freedom
## Residual deviance:  15.772  on 98  degrees of freedom
## AIC: 19.772
## 
## Number of Fisher Scoring iterations: 8

Make predictions

# Training set predictions
pred1 = predict(mod1, type="response")
z = table(Train$Republican, pred1 >= 0.5)
kable(z)
FALSE TRUE
0 45 2
1 2 51
### T wo-varia ble Model
# Two-variable model
mod2 = glm(Republican~SurveyUSA+DiffCount, data=Train, family="binomial")
# Make predictions
pred2 = predict(mod2, type="response")
z = table(Train$Republican, pred2 >= 0.5)
kable(z)
FALSE TRUE
0 45 2
1 1 52
summary(mod2)
## 
## Call:
## glm(formula = Republican ~ SurveyUSA + DiffCount, family = "binomial", 
##     data = Train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.01196  -0.00698   0.01005   0.05074   1.54975  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -1.1405     1.2456  -0.916   0.3599  
## SurveyUSA     0.2976     0.1949   1.527   0.1267  
## DiffCount     0.7673     0.4188   1.832   0.0669 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.269  on 99  degrees of freedom
## Residual deviance:  12.439  on 97  degrees of freedom
## AIC: 18.439
## 
## Number of Fisher Scoring iterations: 10

Analysis

# Smart baseline accuracy
z = table(Test$Republican, sign(Test$Rasmussen))
kable(z)
-1 0 1
0 18 2 4
1 0 0 21

# Test set predictions
TestPrediction = predict(mod2, newdata=Test, type="response")
z = table(Test$Republican, TestPrediction >= 0.5)
kable(z)
FALSE TRUE
0 23 1
1 0 21

# Analyze mistake
subset(Test, TestPrediction >= 0.5 & Republican == 0)
##      State Year Rasmussen SurveyUSA DiffCount     PropR Republican
## 24 Florida 2012         2         0         6 0.6666667          0