In this project we will use a simple logistic regression to predict USA 2012 election results. We will use the state R data set and a dataset from different survey companies.Let’s start by loading the datasets, spliting in training and testing and visualize a USA map

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
library(ggmap)
library(maps)
statesMap=map_data("state")
str(statesMap)
## 'data.frame':    15537 obs. of  6 variables:
##  $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ subregion: chr  NA NA NA NA ...
polling<-read.csv("PollingImputed.csv")
Train<-subset(polling,Year>=2004 & Year<2012)
str(Train)
## 'data.frame':    100 obs. of  7 variables:
##  $ State     : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 4 4 5 5 ...
##  $ Year      : int  2004 2008 2004 2008 2004 2008 2004 2008 2004 2008 ...
##  $ Rasmussen : int  11 21 19 16 5 5 7 10 -11 -27 ...
##  $ SurveyUSA : int  18 25 21 18 15 3 5 7 -11 -24 ...
##  $ DiffCount : int  5 5 1 6 8 9 8 5 -8 -5 ...
##  $ PropR     : num  1 1 1 1 1 1 1 1 0 0 ...
##  $ Republican: int  1 1 1 1 1 1 1 1 0 0 ...
Test<-subset(polling,Year==2012)
ggplot(statesMap, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "black")

Next we will create our logistic regression model and the prediction vector

mod1<-glm(Republican~SurveyUSA+DiffCount,data=Train,family="binomial")
summary(mod1)
## 
## Call:
## glm(formula = Republican ~ SurveyUSA + DiffCount, family = "binomial", 
##     data = Train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.04741  -0.00977   0.00561   0.03751   1.32999  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -0.6827     1.0468  -0.652   0.5143  
## SurveyUSA     0.3309     0.2226   1.487   0.1371  
## DiffCount     0.6619     0.3663   1.807   0.0708 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.269  on 99  degrees of freedom
## Residual deviance:  11.154  on 97  degrees of freedom
## AIC: 17.154
## 
## Number of Fisher Scoring iterations: 9
TestPrediction = predict(mod1, newdata=Test, type="response")

TestPrediction gives the predicted probabilities for each state, but let’s also create a vector of “Republican” or “Democrat” predictions and put the predictions and state labels in a data.frame so that we can use ggplot

TestPredictionBinary = as.numeric(TestPrediction > 0.5)
predictionDataFrame = data.frame(TestPrediction, TestPredictionBinary, Test$State)

Now, we need to merge “predictionDataFrame” with the map data “statesMap”. Before doing so, we need to convert the Test.State variable to lowercase, so that it matches the region variable in statesMap

predictionDataFrame$region = tolower(predictionDataFrame$Test.State)
predictionMap = merge(statesMap, predictionDataFrame, by = "region")
predictionMap = predictionMap[order(predictionMap$order),]

Now we are ready to color the US map with our predictions! You can color the states according to our binary predictions

ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary)) + geom_polygon(color = "black")

We see that the legend displays a blue gradient for outcomes between 0 and 1. However, when plotting the binary predictions there are only two possible outcomes: 0 or 1. Let’s replot the map with discrete outcomes. We can also change the color scheme to blue and red, to match the blue color associated with the Democratic Party in the US and the red color associated with the Republican Party in the US

ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary))+ geom_polygon(color = "black") + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")