Techniques: geom_polygon parameters, merge

library(ggplot2)
library(maps)
statesMap <- map_data("state")
str(statesMap)
## 'data.frame':    15537 obs. of  6 variables:
##  $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ subregion: chr  NA NA NA NA ...

One of the variables, group, defines the different shapes or polygons on the map. Sometimes a state may have multiple groups, for example, if it includes islands.

table(statesMap$group)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##  202  149  312  516   79   91   94   10  872  381  233  329  257  256  113 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
##  397  650  399  566   36  220   30  460  370  373  382  315  238  208   70 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
##  125  205   78   16  290   21  168   37  733   12  105  238  284  236  172 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
##   66  304  166  289 1088   59  129   96   15  623   17   17   19   44  448 
##   61   62   63 
##  373  388   68

Make a map of the United States

ggplot(statesMap,aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "black")

Coloring the states by predictions

Load the data

setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_7_Visualization")
polling <- read.csv("PollingImputed.csv")
str(polling)
## 'data.frame':    145 obs. of  7 variables:
##  $ State     : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 3 4 4 4 ...
##  $ Year      : int  2004 2008 2004 2008 2004 2008 2012 2004 2008 2012 ...
##  $ Rasmussen : int  11 21 19 16 5 5 8 7 10 13 ...
##  $ SurveyUSA : int  18 25 21 18 15 3 5 5 7 21 ...
##  $ DiffCount : int  5 5 1 6 8 9 4 8 5 2 ...
##  $ PropR     : num  1 1 1 1 1 ...
##  $ Republican: int  1 1 1 1 1 1 1 1 1 1 ...

Split data to training and test set

Train <- subset(polling, Year >= 2004 & Year <= 2008)
Test <- subset( polling, Year >= 2012)

Note that we only have 45 states in our testing set, since we are missing observations for Alaska, Delaware, Alabama, Wyoming, and Vermont, so these states will not appear colored in our map.

Create a logistic regression model and evaluate it Conver the outcome to a binary variable and store it in a new variable

mod2 <- glm(Republican~SurveyUSA + DiffCount, data = Train, family = binomial)
TestPrediction <- predict(mod2, newdata = Test, type = "response")
TestPredictionBinary <- as.numeric(TestPrediction > 0.5)
predictionDataFrame <- data.frame(TestPrediction, TestPredictionBinary, Test$State)
table(predictionDataFrame$TestPredictionBinary)
## 
##  0  1 
## 23 22

Merge data frame with the map data stateMap

First convert Test.State variable to lowercase

predictionDataFrame$region <- tolower(predictionDataFrame$Test.State)
predictionMap <- merge(statesMap, predictionDataFrame, by = "region")

Make sure the observations are in order so that the map is drawn properly

predictionMap <- predictionMap[order(predictionMap$order),]
str(predictionMap)
## 'data.frame':    15034 obs. of  9 variables:
##  $ region              : chr  "arizona" "arizona" "arizona" "arizona" ...
##  $ long                : num  -115 -115 -115 -115 -115 ...
##  $ lat                 : num  35 35.1 35.1 35.2 35.2 ...
##  $ group               : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ order               : int  204 205 206 207 208 209 210 211 212 213 ...
##  $ subregion           : chr  NA NA NA NA ...
##  $ TestPrediction      : num  0.974 0.974 0.974 0.974 0.974 ...
##  $ TestPredictionBinary: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Test.State          : Factor w/ 50 levels "Alabama","Alaska",..: 3 3 3 3 3 3 3 3 3 3 ...
str(statesMap)
## 'data.frame':    15537 obs. of  6 variables:
##  $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ subregion: chr  NA NA NA NA ...

When we merge data, it only merged the observations that exist in both data sets. So since we are merging based on the region variable, we will lose all observations that have a value of “region” that doesn’t exist in both data frames. You can change this default behavior by using the all.x and all.y arguments of the merge function.

Color the predictions on the map

ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary)) + geom_polygon(color = "black")

We see that the legend displays a blue gradient for outcomes between 0 and 1. However, when plotting the binary predictions there are only two possible outcomes: 0 or 1. Let’s replot the map with discrete outcomes. We can also change the color scheme to blue and red, to match the blue color associated with the Democratic Party in the US and the red color associated with the Republican Party in the US.

ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary)) + geom_polygon(color = "black") + scale_fill_gradient(low ="blue", high = "red", guide = "legend", breaks = c(0, 1), labels = c("Democrat", "Republican"),name = "Prediction2012")

Alternatively, we could plot the probabilities instead of the binary predictions. Change the plot command above to instead color the states by the variable TestPrediction. We should see a gradient of colors ranging from red to blue.

ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction)) + geom_polygon(color = "black") + scale_fill_gradient(low ="blue", high = "red", guide = "legend", breaks = c(0, 1), labels = c("Democrat", "Republican"),name = "Prediction2012")

The only state that appears purple (the color between red and blue) is the state of Iowa, so the maps look very similar. If we take a look at TestPrediction, we can see that most of our predicted probabilities are very close to 0 or very close to 1.

Parameter settings

ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black", alpha=0.3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

The “alpha” parameter controls the transparency or darkness of the color. A smaller value of alpha will make the colors lighter.

ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black", size = 3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black", linetype = 3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")