library(ggplot2)
library(maps)
statesMap <- map_data("state")
str(statesMap)
## 'data.frame': 15537 obs. of 6 variables:
## $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
## $ lat : num 30.4 30.4 30.4 30.3 30.3 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ region : chr "alabama" "alabama" "alabama" "alabama" ...
## $ subregion: chr NA NA NA NA ...
One of the variables, group, defines the different shapes or polygons on the map. Sometimes a state may have multiple groups, for example, if it includes islands.
table(statesMap$group)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 202 149 312 516 79 91 94 10 872 381 233 329 257 256 113
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 397 650 399 566 36 220 30 460 370 373 382 315 238 208 70
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 125 205 78 16 290 21 168 37 733 12 105 238 284 236 172
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 66 304 166 289 1088 59 129 96 15 623 17 17 19 44 448
## 61 62 63
## 373 388 68
ggplot(statesMap,aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "black")
Load the data
setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_7_Visualization")
polling <- read.csv("PollingImputed.csv")
str(polling)
## 'data.frame': 145 obs. of 7 variables:
## $ State : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 3 4 4 4 ...
## $ Year : int 2004 2008 2004 2008 2004 2008 2012 2004 2008 2012 ...
## $ Rasmussen : int 11 21 19 16 5 5 8 7 10 13 ...
## $ SurveyUSA : int 18 25 21 18 15 3 5 5 7 21 ...
## $ DiffCount : int 5 5 1 6 8 9 4 8 5 2 ...
## $ PropR : num 1 1 1 1 1 ...
## $ Republican: int 1 1 1 1 1 1 1 1 1 1 ...
Split data to training and test set
Train <- subset(polling, Year >= 2004 & Year <= 2008)
Test <- subset( polling, Year >= 2012)
Note that we only have 45 states in our testing set, since we are missing observations for Alaska, Delaware, Alabama, Wyoming, and Vermont, so these states will not appear colored in our map.
Create a logistic regression model and evaluate it Conver the outcome to a binary variable and store it in a new variable
mod2 <- glm(Republican~SurveyUSA + DiffCount, data = Train, family = binomial)
TestPrediction <- predict(mod2, newdata = Test, type = "response")
TestPredictionBinary <- as.numeric(TestPrediction > 0.5)
predictionDataFrame <- data.frame(TestPrediction, TestPredictionBinary, Test$State)
table(predictionDataFrame$TestPredictionBinary)
##
## 0 1
## 23 22
First convert Test.State variable to lowercase
predictionDataFrame$region <- tolower(predictionDataFrame$Test.State)
predictionMap <- merge(statesMap, predictionDataFrame, by = "region")
Make sure the observations are in order so that the map is drawn properly
predictionMap <- predictionMap[order(predictionMap$order),]
str(predictionMap)
## 'data.frame': 15034 obs. of 9 variables:
## $ region : chr "arizona" "arizona" "arizona" "arizona" ...
## $ long : num -115 -115 -115 -115 -115 ...
## $ lat : num 35 35.1 35.1 35.2 35.2 ...
## $ group : num 2 2 2 2 2 2 2 2 2 2 ...
## $ order : int 204 205 206 207 208 209 210 211 212 213 ...
## $ subregion : chr NA NA NA NA ...
## $ TestPrediction : num 0.974 0.974 0.974 0.974 0.974 ...
## $ TestPredictionBinary: num 1 1 1 1 1 1 1 1 1 1 ...
## $ Test.State : Factor w/ 50 levels "Alabama","Alaska",..: 3 3 3 3 3 3 3 3 3 3 ...
str(statesMap)
## 'data.frame': 15537 obs. of 6 variables:
## $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
## $ lat : num 30.4 30.4 30.4 30.3 30.3 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ region : chr "alabama" "alabama" "alabama" "alabama" ...
## $ subregion: chr NA NA NA NA ...
When we merge data, it only merged the observations that exist in both data sets. So since we are merging based on the region variable, we will lose all observations that have a value of “region” that doesn’t exist in both data frames. You can change this default behavior by using the all.x and all.y arguments of the merge function.
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary)) + geom_polygon(color = "black")
We see that the legend displays a blue gradient for outcomes between 0 and 1. However, when plotting the binary predictions there are only two possible outcomes: 0 or 1. Let’s replot the map with discrete outcomes. We can also change the color scheme to blue and red, to match the blue color associated with the Democratic Party in the US and the red color associated with the Republican Party in the US.
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary)) + geom_polygon(color = "black") + scale_fill_gradient(low ="blue", high = "red", guide = "legend", breaks = c(0, 1), labels = c("Democrat", "Republican"),name = "Prediction2012")
Alternatively, we could plot the probabilities instead of the binary predictions. Change the plot command above to instead color the states by the variable TestPrediction. We should see a gradient of colors ranging from red to blue.
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction)) + geom_polygon(color = "black") + scale_fill_gradient(low ="blue", high = "red", guide = "legend", breaks = c(0, 1), labels = c("Democrat", "Republican"),name = "Prediction2012")
The only state that appears purple (the color between red and blue) is the state of Iowa, so the maps look very similar. If we take a look at TestPrediction, we can see that most of our predicted probabilities are very close to 0 or very close to 1.
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black", alpha=0.3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")
The “alpha” parameter controls the transparency or darkness of the color. A smaller value of alpha will make the colors lighter.
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black", size = 3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black", linetype = 3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")