In this project we will use a simple logistic regression to predict USA 2012 election results. We will use the state R data set and a dataset from different survey companies.Let’s start by loading the datasets, spliting in training and testing and visualize a USA map
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
library(ggmap)
library(maps)
statesMap=map_data("state")
str(statesMap)
## 'data.frame': 15537 obs. of 6 variables:
## $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
## $ lat : num 30.4 30.4 30.4 30.3 30.3 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ region : chr "alabama" "alabama" "alabama" "alabama" ...
## $ subregion: chr NA NA NA NA ...
polling<-read.csv("PollingImputed.csv")
Train<-subset(polling,Year>=2004 & Year<2012)
str(Train)
## 'data.frame': 100 obs. of 7 variables:
## $ State : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 4 4 5 5 ...
## $ Year : int 2004 2008 2004 2008 2004 2008 2004 2008 2004 2008 ...
## $ Rasmussen : int 11 21 19 16 5 5 7 10 -11 -27 ...
## $ SurveyUSA : int 18 25 21 18 15 3 5 7 -11 -24 ...
## $ DiffCount : int 5 5 1 6 8 9 8 5 -8 -5 ...
## $ PropR : num 1 1 1 1 1 1 1 1 0 0 ...
## $ Republican: int 1 1 1 1 1 1 1 1 0 0 ...
Test<-subset(polling,Year==2012)
ggplot(statesMap, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "black")
Next we will create our logistic regression model and the prediction vector
mod1<-glm(Republican~SurveyUSA+DiffCount,data=Train,family="binomial")
summary(mod1)
##
## Call:
## glm(formula = Republican ~ SurveyUSA + DiffCount, family = "binomial",
## data = Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.04741 -0.00977 0.00561 0.03751 1.32999
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6827 1.0468 -0.652 0.5143
## SurveyUSA 0.3309 0.2226 1.487 0.1371
## DiffCount 0.6619 0.3663 1.807 0.0708 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.269 on 99 degrees of freedom
## Residual deviance: 11.154 on 97 degrees of freedom
## AIC: 17.154
##
## Number of Fisher Scoring iterations: 9
TestPrediction = predict(mod1, newdata=Test, type="response")
TestPrediction gives the predicted probabilities for each state, but let’s also create a vector of “Republican” or “Democrat” predictions and put the predictions and state labels in a data.frame so that we can use ggplot
TestPredictionBinary = as.numeric(TestPrediction > 0.5)
predictionDataFrame = data.frame(TestPrediction, TestPredictionBinary, Test$State)
Now, we need to merge “predictionDataFrame” with the map data “statesMap”. Before doing so, we need to convert the Test.State variable to lowercase, so that it matches the region variable in statesMap
predictionDataFrame$region = tolower(predictionDataFrame$Test.State)
predictionMap = merge(statesMap, predictionDataFrame, by = "region")
predictionMap = predictionMap[order(predictionMap$order),]
Now we are ready to color the US map with our predictions! You can color the states according to our binary predictions
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary)) + geom_polygon(color = "black")
We see that the legend displays a blue gradient for outcomes between 0 and 1. However, when plotting the binary predictions there are only two possible outcomes: 0 or 1. Let’s replot the map with discrete outcomes. We can also change the color scheme to blue and red, to match the blue color associated with the Democratic Party in the US and the red color associated with the Republican Party in the US
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary))+ geom_polygon(color = "black") + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")