Analytics Edge: Unit 3 - Predicting the Winner Before any Votes are Cast

Election Forecasting

United States Presidential Elections

A president is elected every four years
Generally, only two competitive candidates
- Republican
- Democratic

The Electoral College

The United States have 50 states
Each assigned a number of electoral votes based on population
- Most votes: 55 (California)
- Least votes: 3 (multiple states)
- Reassigned periodically based on population changes
Winner takes all: candidate with the most votes in a state gets all its electoral votes
Candidate with most electoral votes win election

2000 Election: Bush vs Gore

Election Prediction

Goal: Use polling data to predict state winners
Then - New York Times columnist Nate Silver famously took on this task for the 2012 election

The Dataset

Data from RealClearPolitics.com
Instances represent a state in a given election
- State: Name of state
- Year: Election year (2004, 2008, 20012)
Dependent variable
- Republican: 1 if Republican won state, 0 if Democrat won
Independent variables
- Rasmussen, SurveyUSA: Polled R% - Polled D%
- DiffCount: Polls with R winner - Polls with D winner
- PrepR: Polls with R winner / # polls

Simple Approaches to Missing Data

Delete the missing observations
- We would be throwing away more than 50% of the data
- We want to predict for all states
Delete variables with missing values
- We want to retain data from Rasmussen/SurveyUSA
Fill missing data points with average values
- The average value for a poll will be close to 0 (tie between Democrat and Republican)
- If other polls in a state favor one candidate, the missing one probably would have, too

Multiple Imputation

Fill in missing values based on non-missing values
- If Rasmussen is very negative, then a missing SurveyUSA value will likely be negative
- Just like sample.split results will differ between runs unless you fix the random seed
Although the method is complicated, we can use it easily through R’s libraries
We will use Multiple Imputation by Chained Equations (mice) package

Election Forecasting in R

Load in the data

# Read in data
polling = read.csv("PollingData.csv")
str(polling)
## 'data.frame':    145 obs. of  7 variables:
##  $ State     : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 3 4 4 4 ...
##  $ Year      : int  2004 2008 2004 2008 2004 2008 2012 2004 2008 2012 ...
##  $ Rasmussen : int  11 21 NA 16 5 5 8 7 10 NA ...
##  $ SurveyUSA : int  18 25 NA NA 15 NA NA 5 NA NA ...
##  $ DiffCount : int  5 5 1 6 8 9 4 8 5 2 ...
##  $ PropR     : num  1 1 1 1 1 ...
##  $ Republican: int  1 1 1 1 1 1 1 1 1 1 ...
z = table(polling$Year)
kable(z)

Var1	Freq
2004	50
2008	50
2012	45

z = summary(polling)
kable(z)

State	Year	Rasmussen	SurveyUSA	DiffCount	PropR	Republican
Arizona : 3	Min. :2004	Min. :-41.0000	Min. :-33.0000	Min. :-19.000	Min. :0.0000	Min. :0.0000
Arkansas : 3	1st Qu.:2004	1st Qu.: -8.0000	1st Qu.:-11.7500	1st Qu.: -6.000	1st Qu.:0.0000	1st Qu.:0.0000
California : 3	Median :2008	Median : 1.0000	Median : -2.0000	Median : 1.000	Median :0.6250	Median :1.0000
Colorado : 3	Mean :2008	Mean : 0.0404	Mean : -0.8243	Mean : -1.269	Mean :0.5259	Mean :0.5103
Connecticut: 3	3rd Qu.:2012	3rd Qu.: 8.5000	3rd Qu.: 8.0000	3rd Qu.: 4.000	3rd Qu.:1.0000	3rd Qu.:1.0000
Florida : 3	Max. :2012	Max. : 39.0000	Max. : 30.0000	Max. : 11.000	Max. :1.0000	Max. :1.0000
(Other) :127	NA	NA’s :46	NA’s :71	NA	NA	NA

Load mice package

# Load mice package
library(mice)

Multiple imputations

# Multiple imputation
simple = polling[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")]
z = summary(simple)
kable(z)

Rasmussen	SurveyUSA	PropR	DiffCount
Min. :-41.0000	Min. :-33.0000	Min. :0.0000	Min. :-19.000
1st Qu.: -8.0000	1st Qu.:-11.7500	1st Qu.:0.0000	1st Qu.: -6.000
Median : 1.0000	Median : -2.0000	Median :0.6250	Median : 1.000
Mean : 0.0404	Mean : -0.8243	Mean :0.5259	Mean : -1.269
3rd Qu.: 8.5000	3rd Qu.: 8.0000	3rd Qu.:1.0000	3rd Qu.: 4.000
Max. : 39.0000	Max. : 30.0000	Max. :1.0000	Max. : 11.000
NA’s :46	NA’s :71	NA	NA

Split the data

# Split the data
set.seed(144)
imputed = complete(mice(simple))
## 
##  iter imp variable
##   1   1  Rasmussen  SurveyUSA
##   1   2  Rasmussen  SurveyUSA
##   1   3  Rasmussen  SurveyUSA
##   1   4  Rasmussen  SurveyUSA
##   1   5  Rasmussen  SurveyUSA
##   2   1  Rasmussen  SurveyUSA
##   2   2  Rasmussen  SurveyUSA
##   2   3  Rasmussen  SurveyUSA
##   2   4  Rasmussen  SurveyUSA
##   2   5  Rasmussen  SurveyUSA
##   3   1  Rasmussen  SurveyUSA
##   3   2  Rasmussen  SurveyUSA
##   3   3  Rasmussen  SurveyUSA
##   3   4  Rasmussen  SurveyUSA
##   3   5  Rasmussen  SurveyUSA
##   4   1  Rasmussen  SurveyUSA
##   4   2  Rasmussen  SurveyUSA
##   4   3  Rasmussen  SurveyUSA
##   4   4  Rasmussen  SurveyUSA
##   4   5  Rasmussen  SurveyUSA
##   5   1  Rasmussen  SurveyUSA
##   5   2  Rasmussen  SurveyUSA
##   5   3  Rasmussen  SurveyUSA
##   5   4  Rasmussen  SurveyUSA
##   5   5  Rasmussen  SurveyUSA
z = summary(imputed)
kable(z)

Rasmussen	SurveyUSA	PropR	DiffCount
Min. :-41.000	Min. :-33.000	Min. :0.0000	Min. :-19.000
1st Qu.: -8.000	1st Qu.:-11.000	1st Qu.:0.0000	1st Qu.: -6.000
Median : 3.000	Median : 1.000	Median :0.6250	Median : 1.000
Mean : 1.731	Mean : 1.517	Mean :0.5259	Mean : -1.269
3rd Qu.: 11.000	3rd Qu.: 18.000	3rd Qu.:1.0000	3rd Qu.: 4.000
Max. : 39.000	Max. : 30.000	Max. :1.0000	Max. : 11.000

polling$Rasmussen = imputed$Rasmussen
polling$SurveyUSA = imputed$SurveyUSA
z = summary(polling)
kable(z)

State	Year	Rasmussen	SurveyUSA	DiffCount	PropR	Republican
Arizona : 3	Min. :2004	Min. :-41.000	Min. :-33.000	Min. :-19.000	Min. :0.0000	Min. :0.0000
Arkansas : 3	1st Qu.:2004	1st Qu.: -8.000	1st Qu.:-11.000	1st Qu.: -6.000	1st Qu.:0.0000	1st Qu.:0.0000
California : 3	Median :2008	Median : 3.000	Median : 1.000	Median : 1.000	Median :0.6250	Median :1.0000
Colorado : 3	Mean :2008	Mean : 1.731	Mean : 1.517	Mean : -1.269	Mean :0.5259	Mean :0.5103
Connecticut: 3	3rd Qu.:2012	3rd Qu.: 11.000	3rd Qu.: 18.000	3rd Qu.: 4.000	3rd Qu.:1.0000	3rd Qu.:1.0000
Florida : 3	Max. :2012	Max. : 39.000	Max. : 30.000	Max. : 11.000	Max. :1.0000	Max. :1.0000
(Other) :127	NA	NA	NA	NA	NA	NA

Train = subset(polling, Year == 2004 | Year == 2008)
Test = subset(polling, Year == 2012)

Baseline

# Smart Baseline
z = table(Train$Republican)
kable(z)

Var1	Freq
0	47
1	53

sign(20)
## [1] 1
sign(-10)
## [1] -1
sign(0)
## [1] 0
z = table(sign(Train$Rasmussen))
kable(z)

Var1	Freq
-1	42
0	3
1	55

z = table(Train$Republican, sign(Train$Rasmussen))
kable(z)

	-1	0	1
0	42	2	3
1	0	1	52

Multicollinearity

# Multicollinearity
cor(Train[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount", "Republican")])
##            Rasmussen SurveyUSA     PropR DiffCount Republican
## Rasmussen  1.0000000 0.9194508 0.8404803 0.5124098  0.8021191
## SurveyUSA  0.9194508 1.0000000 0.8756581 0.5541816  0.8205806
## PropR      0.8404803 0.8756581 1.0000000 0.8273785  0.9484204
## DiffCount  0.5124098 0.5541816 0.8273785 1.0000000  0.8092777
## Republican 0.8021191 0.8205806 0.9484204 0.8092777  1.0000000

Logistic Regression

# Logistic Regression Model
mod1 = glm(Republican~PropR, data=Train, family="binomial")
summary(mod1)
## 
## Call:
## glm(formula = Republican ~ PropR, family = "binomial", data = Train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.22880  -0.06541   0.10260   0.10260   1.37392  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -6.146      1.977  -3.108 0.001882 ** 
## PropR         11.390      3.153   3.613 0.000303 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.269  on 99  degrees of freedom
## Residual deviance:  15.772  on 98  degrees of freedom
## AIC: 19.772
## 
## Number of Fisher Scoring iterations: 8

Make predictions

# Training set predictions
pred1 = predict(mod1, type="response")
z = table(Train$Republican, pred1 >= 0.5)
kable(z)

	FALSE	TRUE
0	45	2
1	2	51
### T	wo-varia	ble Model

# Two-variable model
mod2 = glm(Republican~SurveyUSA+DiffCount, data=Train, family="binomial")
# Make predictions
pred2 = predict(mod2, type="response")
z = table(Train$Republican, pred2 >= 0.5)
kable(z)

	FALSE	TRUE
0	45	2
1	1	52

summary(mod2)
## 
## Call:
## glm(formula = Republican ~ SurveyUSA + DiffCount, family = "binomial", 
##     data = Train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.01196  -0.00698   0.01005   0.05074   1.54975  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -1.1405     1.2456  -0.916   0.3599  
## SurveyUSA     0.2976     0.1949   1.527   0.1267  
## DiffCount     0.7673     0.4188   1.832   0.0669 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.269  on 99  degrees of freedom
## Residual deviance:  12.439  on 97  degrees of freedom
## AIC: 18.439
## 
## Number of Fisher Scoring iterations: 10

Analysis

# Smart baseline accuracy
z = table(Test$Republican, sign(Test$Rasmussen))
kable(z)

	-1	0	1
0	18	2	4
1	0	0	21


# Test set predictions
TestPrediction = predict(mod2, newdata=Test, type="response")
z = table(Test$Republican, TestPrediction >= 0.5)
kable(z)

	FALSE	TRUE
0	23	1
1	0	21


# Analyze mistake
subset(Test, TestPrediction >= 0.5 & Republican == 0)
##      State Year Rasmussen SurveyUSA DiffCount     PropR Republican
## 24 Florida 2012         2         0         6 0.6666667          0

Analytics Edge: Unit 3 - Predicting the Winner Before any Votes are Cast

Sulman Khan

October 26, 2018

Election Forecasting

United States Presidential Elections

The Electoral College

2000 Election: Bush vs Gore

Election Prediction

The Dataset

Simple Approaches to Missing Data

Multiple Imputation

Election Forecasting in R

Load in the data

Load mice package

Multiple imputations

Split the data

Baseline

Multicollinearity

Logistic Regression

Make predictions

Analysis