Election Forecasting Revisited

Background Information on the Dataset

In the recitation from Unit 3, we used logistic regression on polling data in order to construct US presidential election predictions. We separated our data into a training set, containing data from 2004 and 2008 polls, and a test set, containing the data from 2012 polls. We then proceeded to develop a logistic regression model to forecast the 2012 US presidential election.

In this homework problem, we’ll revisit our logistic regression model from Unit 3, and learn how to plot the output on a map of the United States. Unlike what we did in the Crime lecture, this time we’ll be plotting predictions rather than data!

First, load the ggplot2, maps, and ggmap packages using the library function. All three packages should be installed on your computer from lecture, but if not, you may need to install them too using the install.packages function.

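# Load packages
library(ggplot2)  # plotting functions (ggplot, geom_polygon, map_data)
library(maps)     # map outlines used by map_data
library(ggmap)
library(knitr)    # provides kable, used below to print tables
# register_google is only needed for Google basemaps, which this problem
# does not use. Never publish a real API key; substitute your own if needed.
# register_google(key = "YOUR_API_KEY")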

Then, load the US map and save it to the variable statesMap, like we did during the Crime lecture:

# Load StateMap
statesMap = map_data("state")

The maps package contains other built-in maps, including a US county map, a world map, and maps for France and Italy.

Drawing a Map of the US

If you look at the structure of the statesMap data frame using the str function, you should see that there are 6 variables. One of the variables, group, defines the different shapes or polygons on the map. Sometimes a state may have multiple groups, for example, if it includes islands.

# Output structure 
str(statesMap)
## 'data.frame':    15537 obs. of  6 variables:
##  $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ subregion: chr  NA NA NA NA ...

How many different groups are there?

z = table(statesMap$group)
kable(z)
Var1 Freq
1 202
2 149
3 312
4 516
5 79
6 91
7 94
8 10
9 872
10 381
11 233
12 329
13 257
14 256
15 113
16 397
17 650
18 399
19 566
20 36
21 220
22 30
23 460
24 370
25 373
26 382
27 315
28 238
29 208
30 70
31 125
32 205
33 78
34 16
35 290
36 21
37 168
38 37
39 733
40 12
41 105
42 238
43 284
44 236
45 172
46 66
47 304
48 166
49 289
50 1088
51 59
52 129
53 96
54 15
55 623
56 17
57 17
58 19
59 44
60 448
61 373
62 388
63 68

There are 63 different groups.
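
Rather than counting the rows of the table by eye, the number of groups can be computed directly. The sketch below uses a small made-up data frame (toyMap, which is not the real statesMap) so that it runs without the maps package; the same two calls should work on statesMap$group and statesMap$region.

```r
# Toy stand-in for statesMap: two regions, one of which has two polygons
toyMap <- data.frame(
  region = c("alabama", "alabama", "michigan", "michigan", "michigan"),
  group  = c(1, 1, 2, 2, 3)
)

# Number of distinct polygons (groups)
nGroups <- length(unique(toyMap$group))

# Number of polygons per region; values above 1 flag multi-part states
groupsPerRegion <- tapply(toyMap$group, toyMap$region,
                          function(g) length(unique(g)))

print(nGroups)
print(groupsPerRegion)
```

On the real data, the second call would show which states are drawn from more than one polygon.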

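The command below draws a map of the United States. It passes two colors to geom_polygon, fill and color. Which one defines the color of the outline of the states?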

# Draw Map of US
ggplot(statesMap, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "black") 

The color argument defines the outline of the states; fill sets the interior color.

Coloring the States by Predictions

Now, let’s color the map of the US according to our 2012 US presidential election predictions from the Unit 3 Recitation. We’ll rebuild the model here, using the dataset PollingImputed.csv. Be sure to use this file so that you don’t have to redo the imputation to fill in the missing values, like we did in the Unit 3 Recitation.

Load the data using the read.csv function, and call it “polling”. Then split the data using the subset function into a training set called “Train” that has observations from 2004 and 2008, and a testing set called “Test” that has observations from 2012.

Note that we only have 45 states in our testing set, since we are missing observations for Alaska, Delaware, Alabama, Wyoming, and Vermont, so these states will not appear colored in our map.
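
One way to check which states are missing is to compare the test-set state names against R's built-in state.name vector. The sketch below is only an illustration: testStates is a hypothetical stand-in for Test$State, built here by removing the five states named above, so that the code runs without PollingImputed.csv.

```r
# Hypothetical stand-in for Test$State: all 50 states except the five
# that the text says are missing from the 2012 polls
missingFromText <- c("Alaska", "Delaware", "Alabama", "Wyoming", "Vermont")
testStates <- state.name[!(state.name %in% missingFromText)]

# States with no test-set observation, in the order of state.name
missingStates <- setdiff(state.name, testStates)

print(length(testStates))
print(missingStates)
```

With the real data, setdiff(state.name, Test$State) would be the corresponding call.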

Then, create a logistic regression model and make predictions on the test set using the following commands:

# Read data 
polling = read.csv("PollingImputed.csv")

# Split the data
Train = subset(polling, Year == 2004 | Year == 2008)
Test = subset(polling, Year == 2012)
# Logistic Regression
mod2 = glm(Republican~SurveyUSA+DiffCount, data=Train, family="binomial")
# Make predictions
TestPrediction = predict(mod2, newdata=Test, type="response")
# Vector of Republican/Democrat
TestPredictionBinary = as.numeric(TestPrediction > 0.5)
# Store predictions and state labels into a dataframe
predictionDataFrame = data.frame(TestPrediction, TestPredictionBinary, Test$State)
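
The predict-then-threshold pattern used above can be tried on a tiny made-up dataset. This is a minimal sketch, not the polling model: toyTrain, toyTest, and the variables x and y are invented for illustration, and the data are deliberately not perfectly separable so that glm fits cleanly.

```r
# Made-up training data: one predictor, binary outcome
toyTrain <- data.frame(
  x = c(-3, -2, -1, -0.5, 0.5, 1, 2, 3),
  y = c( 0,  0,  1,    0,   1, 0, 1, 1)
)
toyTest <- data.frame(x = c(-2.5, 2.5))

# Fit a logistic regression model
toyMod <- glm(y ~ x, data = toyTrain, family = "binomial")

# type = "response" returns probabilities rather than log-odds
toyProb <- predict(toyMod, newdata = toyTest, type = "response")

# Threshold at 0.5 to turn each probability into a 0/1 prediction
toyBinary <- as.numeric(toyProb > 0.5)

print(toyProb)
print(toyBinary)
```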

For how many states is our binary prediction 1 (for 2012), corresponding to Republican?

# Tabulate the binary predictions
z = table(TestPredictionBinary)
kable(z)
TestPredictionBinary Freq
0 23
1 22

22 states have a binary prediction of 1

What is the average predicted probability of our model (on the Test set, for 2012)?

# Average predicted probability
summary(TestPrediction)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000001 0.0000926 0.0648667 0.4852626 0.9986385 0.9998655

Average Predicted Probability = 0.4853

Merge Data

Now, we need to merge “predictionDataFrame” with the map data “statesMap”, like we did in lecture. Before doing so, we need to convert the Test.State variable to lowercase, so that it matches the region variable in statesMap. Do this by typing the following in your R console:

# Add a lowercase copy of the state names to match statesMap$region
predictionDataFrame$region = tolower(predictionDataFrame$Test.State)

Now, merge the two data frames using the following command:

# Merge the data
predictionMap = merge(statesMap, predictionDataFrame, by = "region")

Lastly, we need to make sure the observations are in order so that the map is drawn properly, by typing the following:

# Order the observations
predictionMap = predictionMap[order(predictionMap$order),]

How many observations are there in predictionMap?

# Number of observations
nrow(predictionMap)
## [1] 15034

15034 observations.

How many observations are there in statesMap?

# Number of observations
nrow(statesMap)
## [1] 15537

15537 observations.

When we merged the data in the previous problem, it caused the number of observations to change. Why?

By default, merge keeps only the observations whose key value appears in both data frames. Because we merged on the region variable, every row of statesMap whose region has no match in predictionDataFrame was dropped: 15537 - 15034 = 503 rows, belonging to states on the map for which we have no prediction, such as Alabama, Delaware, Vermont, and Wyoming. You can change this default behavior by using the all.x and all.y arguments of the merge function.
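
The effect of all.x can be seen on a small example. The data frames below (toyMap and toyPred) are made up for illustration and only mimic the columns involved in the merge.

```r
# Toy map data: two regions, three rows in total
toyMap <- data.frame(
  region = c("alabama", "florida", "florida"),
  order  = c(1, 2, 3)
)

# Toy predictions: florida is on the map, hawaii is not
toyPred <- data.frame(
  region = c("florida", "hawaii"),
  TestPredictionBinary = c(1, 0)
)

# Default merge keeps only regions present in both data frames
inner <- merge(toyMap, toyPred, by = "region")

# all.x = TRUE keeps every map row; unmatched rows get NA predictions
left <- merge(toyMap, toyPred, by = "region", all.x = TRUE)

# Restore the drawing order, as is done for predictionMap
left <- left[order(left$order), ]

print(nrow(inner))
print(left)
```

With all.x = TRUE on the real data, states without a prediction would stay in predictionMap with NA values instead of disappearing from the map.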

Color the US map with our predictions

You can color the states according to our binary predictions by typing the following in your R console:

# Color the US Map
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary)) + geom_polygon(color = "black")

The states appear light blue and dark blue in this map. Which color represents a Republican prediction?

Our logistic regression model assigned 1 to Republican and 0 to Democrat. As we can see from the legend, 1 corresponds to a light blue color on the map and 0 corresponds to a dark blue color on the map.

Replot with discrete outcomes

We see that the legend displays a blue gradient for outcomes between 0 and 1. However, when plotting the binary predictions there are only two possible outcomes: 0 or 1. Let’s replot the map with discrete outcomes. We can also change the color scheme to blue and red, to match the blue color associated with the Democratic Party in the US and the red color associated with the Republican Party in the US. This can be done with the following command:

# Replot to show discrete
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary))+ geom_polygon(color = "black") + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")
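
An alternative way to get a two-entry legend, not the one used in this assignment, is to treat the prediction as a factor and supply the colors with scale_fill_manual. The sketch below assumes ggplot2 is installed and uses two made-up squares (toyMap) in place of predictionMap.

```r
library(ggplot2)

# Two unit squares standing in for two states (made-up coordinates)
toyMap <- data.frame(
  long  = c(0, 1, 1, 0,   2, 3, 3, 2),
  lat   = c(0, 0, 1, 1,   0, 0, 1, 1),
  group = rep(c(1, 2), each = 4),
  TestPredictionBinary = rep(c(0, 1), each = 4)
)

# factor() makes the fill discrete, so the legend shows exactly two keys
p <- ggplot(toyMap, aes(x = long, y = lat, group = group,
                        fill = factor(TestPredictionBinary))) +
  geom_polygon(color = "black") +
  scale_fill_manual(values = c("0" = "blue", "1" = "red"),
                    breaks = c("0", "1"),
                    labels = c("Democrat", "Republican"),
                    name = "Prediction 2012")
print(p)
```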

Alternatively, we could plot the probabilities instead of the binary predictions. Change the plot command above to instead color the states by the variable TestPrediction. You should see a gradient of colors ranging from red to blue.

# Replot using the predicted probabilities
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black") + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

Do the colors of the states in the map for TestPrediction look different from the colors of the states in the map with TestPredictionBinary? Why or why not?

The two maps look very similar. This is because most of our predicted probabilities are close to 0 or close to 1.
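
The claim can be checked numerically. The sketch below copies the 45 predicted probabilities from the prediction table printed in the Florida question of this write-up; nearExtreme and its tol argument are illustrative names, not part of the assignment.

```r
# The 45 predicted probabilities, in the order of the prediction table
testProb <- c(
  0.9739028, 0.9994949, 0.0000926, 0.0094330, 0.0000343,
  0.9640395, 0.9901680, 0.0000478, 0.9996372, 0.0000926,
  0.9992970, 0.0648667, 0.9506137, 0.9901659, 0.9994949,
  0.0009383, 0.0000024, 0.0000001, 0.0000177, 0.0004843,
  0.9325489, 0.9990219, 0.9986385, 0.9998655, 0.0001795,
  0.0000665, 0.0000127, 0.0018172, 0.0000013, 0.9506205,
  0.9998655, 0.0000024, 0.9996372, 0.0035166, 0.0000926,
  0.0004844, 0.9994949, 0.9949023, 0.9996372, 0.9973641,
  0.9992969, 0.0181252, 0.0000246, 0.9981049, 0.0006740
)

# TRUE when a probability lies within tol of 0 or of 1
nearExtreme <- function(p, tol = 0.05) {
  p < tol | p > 1 - tol
}

nExtreme <- sum(nearExtreme(testProb))
print(nExtreme)
print(length(testProb))
```

With a tolerance of 0.05, only Iowa and Mississippi fall outside the two extreme bands, which is why the gradient map looks almost the same as the binary one.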

Understanding the Predictions

In the 2012 election, the state of Florida ended up being a very close race. It was ultimately won by the Democratic party.

Did we predict this state correctly or incorrectly?

We incorrectly predicted this state by predicting that it would be won by the Republican party.

What was our predicted probability for the state of Florida?

z = predictionDataFrame
kable(z)
     TestPrediction  TestPredictionBinary  Test.State      region
  7       0.9739028                     1  Arizona         arizona
 10       0.9994949                     1  Arkansas        arkansas
 13       0.0000926                     0  California      california
 16       0.0094330                     0  Colorado        colorado
 19       0.0000343                     0  Connecticut     connecticut
 24       0.9640395                     1  Florida         florida
 27       0.9901680                     1  Georgia         georgia
 30       0.0000478                     0  Hawaii          hawaii
 33       0.9996372                     1  Idaho           idaho
 36       0.0000926                     0  Illinois        illinois
 39       0.9992970                     1  Indiana         indiana
 42       0.0648667                     0  Iowa            iowa
 45       0.9506137                     1  Kansas          kansas
 48       0.9901659                     1  Kentucky        kentucky
 51       0.9994949                     1  Louisiana       louisiana
 54       0.0009383                     0  Maine           maine
 57       0.0000024                     0  Maryland        maryland
 60       0.0000001                     0  Massachusetts   massachusetts
 63       0.0000177                     0  Michigan        michigan
 66       0.0004843                     0  Minnesota       minnesota
 69       0.9325489                     1  Mississippi     mississippi
 72       0.9990219                     1  Missouri        missouri
 75       0.9986385                     1  Montana         montana
 78       0.9998655                     1  Nebraska        nebraska
 81       0.0001795                     0  Nevada          nevada
 84       0.0000665                     0  New Hampshire   new hampshire
 87       0.0000127                     0  New Jersey      new jersey
 90       0.0018172                     0  New Mexico      new mexico
 93       0.0000013                     0  New York        new york
 96       0.9506205                     1  North Carolina  north carolina
 99       0.9998655                     1  North Dakota    north dakota
102       0.0000024                     0  Ohio            ohio
105       0.9996372                     1  Oklahoma        oklahoma
108       0.0035166                     0  Oregon          oregon
111       0.0000926                     0  Pennsylvania    pennsylvania
114       0.0004844                     0  Rhode Island    rhode island
117       0.9994949                     1  South Carolina  south carolina
120       0.9949023                     1  South Dakota    south dakota
123       0.9996372                     1  Tennessee       tennessee
126       0.9973641                     1  Texas           texas
129       0.9992969                     1  Utah            utah
134       0.0181252                     0  Virginia        virginia
137       0.0000246                     0  Washington      washington
140       0.9981049                     1  West Virginia   west virginia
143       0.0006740                     0  Wisconsin       wisconsin

Predicted Probability = 0.9640395
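
Instead of printing the whole table and scanning for Florida, the row can be selected directly. The sketch below uses three rows copied from the table as a stand-in (toyPredictions) for the full predictionDataFrame.

```r
# Three rows copied from the prediction table
toyPredictions <- data.frame(
  TestPrediction = c(0.9640395, 0.0648667, 0.0000024),
  TestPredictionBinary = c(1, 0, 0),
  Test.State = c("Florida", "Iowa", "Ohio")
)

# Keep only the Florida row
florida <- subset(toyPredictions, Test.State == "Florida")
print(florida$TestPrediction)
```

On the real data frame, subset(predictionDataFrame, Test.State == "Florida") would be the corresponding call.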

What does this imply?

Our prediction model did not do a very good job of correctly predicting the state of Florida, and we were very confident in our incorrect prediction.

Parameter Settings

In this part, we’ll explore what the different parameter settings of geom_polygon do. Throughout the problem, use the help page for geom_polygon, which can be accessed by ?geom_polygon. To see more information about a certain parameter, just type a question mark and then the parameter name to get the help page for that parameter. Experiment with different parameter settings to try and replicate the plots!

We’ll be asking questions about the following three plots:

# Plot 1
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black", linetype=3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

# Plot 2
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black", size=3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

# Plot 3
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black", alpha=0.3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

What is the name of the parameter we changed to create plot (1)?

linetype

What is the name of the parameter we changed to create plot (2)?

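size (in ggplot2 3.4.0 and later, the width of lines and outlines is set with linewidth, and using size for this purpose is deprecated)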

Plot (3) was created by changing the value of a different geom_polygon parameter to have value 0.3. Which parameter did we use?

alpha
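
The three parameters are easier to compare on a single shape. The sketch below assumes a recent ggplot2 (3.4.0 or later, where linewidth replaces size for outlines) and uses a made-up unit square (toySquare) instead of predictionMap.

```r
library(ggplot2)

# One unit square standing in for a state (made-up coordinates)
toySquare <- data.frame(
  long  = c(0, 1, 1, 0),
  lat   = c(0, 0, 1, 1),
  group = 1
)
base <- ggplot(toySquare, aes(x = long, y = lat, group = group))

# linetype = 3 draws a dotted outline
p1 <- base + geom_polygon(fill = "red", color = "black", linetype = 3)

# linewidth = 3 draws a thick outline (size = 3 in older ggplot2)
p2 <- base + geom_polygon(fill = "red", color = "black", linewidth = 3)

# alpha = 0.3 makes the fill mostly transparent
p3 <- base + geom_polygon(fill = "red", color = "black", alpha = 0.3)

print(p1)
print(p2)
print(p3)
```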