1. Drawing a Map of the US
1.1
If you look at the structure of the statesMap data frame using the str function, you should see that there are 6 variables. One of the variables, group, defines the different shapes or polygons on the map. Sometimes a state may have multiple groups, for example, if it includes islands. How many different groups are there?
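library(ggplot2)   # ggplot(), map_data(); map_data() also needs the maps package installed
library(magrittr)  # provides the %>% pipe
statesMap = map_data('state')
table(statesMap$group) %>% length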
[1] 63
### Method 2
statesMap = map_data("state")
str(statesMap)
'data.frame': 15537 obs. of 6 variables:
$ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
$ lat : num 30.4 30.4 30.4 30.3 30.3 ...
$ group : num 1 1 1 1 1 1 1 1 1 1 ...
$ order : int 1 2 3 4 5 6 7 8 9 10 ...
$ region : chr "alabama" "alabama" "alabama" "alabama" ...
$ subregion: chr NA NA NA NA ...
head(statesMap)
table(statesMap$group) # 63
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
202 149 312 516 79 91 94 10 872 381 233 329 257 256 113 397 650
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
399 566 36 220 30 460 370 373 382 315 238 208 70 125 205 78 16
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
290 21 168 37 733 12 105 238 284 236 172 66 304 166 289 1088 59
52 53 54 55 56 57 58 59 60 61 62 63
129 96 15 623 17 17 19 44 448 373 388 68
length(table(statesMap$group))
[1] 63
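As a minimal sketch of what the group count measures, the toy data frame below (hypothetical values, not the real statesMap) has one region split across two polygons. Counting the distinct group values with length(unique(...)) gives the same number as length(table(...)).

```r
# Toy stand-in for statesMap (hypothetical values): "michigan" is split into
# two polygons, so it contributes two group ids.
toyMap <- data.frame(
  long   = c(-87, -86, -86, -84, -83, -83, -90, -89, -89),
  lat    = c(30, 30, 31, 42, 42, 43, 46, 46, 47),
  group  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  region = c(rep("alabama", 3), rep("michigan", 6))
)

nGroups  <- length(unique(toyMap$group))    # number of distinct polygons
nRegions <- length(unique(toyMap$region))   # number of distinct states

nGroups    # same value as length(table(toyMap$group))
nRegions
```

There are more groups than regions whenever a state is drawn as several polygons, which is why the real map has 63 groups.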
1.2
You can draw a map of the United States by typing the following in your R console:
ggplot(statesMap, aes(x=long, y=lat, group=group)) +
geom_polygon(fill="white", color="black")

We specified two colors in geom_polygon – fill and color. Which one defined the color of the outline of the states?
- fill: the color used to fill the interior of each region
- color: the color of the region borders (correct answer)
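One way to check the two roles is to build a tiny plot and inspect the aesthetics ggplot2 computed for the polygon layer. This is only a sketch: the unit square is a hypothetical stand-in for a state, and it assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical unit square standing in for one state polygon.
square <- data.frame(long = c(0, 1, 1, 0), lat = c(0, 0, 1, 1), group = 1)

p <- ggplot(square, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black")  # fill = interior, color = outline

# Inspect the aesthetics ggplot2 computed for the polygon layer.
layerData <- ggplot_build(p)$data[[1]]
unique(layerData[, c("fill", "colour")])
```

ggplot2 stores the outline under the name colour; color is accepted as an alias.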
2. Coloring the States by Predictions
2.1 Predictive Model
Now, let’s color the map of the US according to our 2012 US presidential election predictions from the Unit 3 Recitation. We’ll rebuild the model here, using the dataset PollingImputed.csv. Be sure to use this file so that you don’t have to redo the imputation to fill in the missing values, like we did in the Unit 3 Recitation.
Load the data using the read.csv function, and call it “polling”. Then split the data using the subset function into a training set called “Train” that has observations from 2004 and 2008, and a testing set called “Test” that has observations from 2012.
Note that we only have 45 states in our testing set, since we are missing observations for Alaska, Delaware, Alabama, Wyoming, and Vermont, so these states will not appear colored in our map.
Then, create a logistic regression model and make predictions on the test set using the following commands:
polling = read.csv('data/PollingImputed.csv')
trn = subset(polling, Year != 2012)
tst = subset(polling, Year == 2012)
mod2 = glm(Republican~SurveyUSA+DiffCount, trn, family=binomial)
pred = predict(mod2,tst,type='response')
repub = as.numeric(pred > 0.5)
df = data.frame(pred, repub, state=tst$State)
head(df)
For how many states is our binary prediction 1 (for 2012), corresponding to Republican?
sum(repub)
[1] 22
What is the average predicted probability of our model (on the Test set, for 2012)?
mean(pred)
[1] 0.4853
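The same fit, predict, threshold sequence can be tried on a small made-up data set when PollingImputed.csv is not at hand. Everything below is hypothetical (the variable names x and y and all values); it only sketches the mechanics of turning predicted probabilities into 0/1 predictions.

```r
# Hypothetical training data: x is a poll margin, y is the binary outcome.
train <- data.frame(
  x = c(-20, -15, -10, -5, -2, 1, 3, 6, 12, 18),
  y = c(  0,   0,   0,  0,  1, 0, 1, 1,  1,  1)
)
test <- data.frame(x = c(-12, 0, 9))

fit  <- glm(y ~ x, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")  # probabilities in (0, 1)
pred <- as.numeric(prob > 0.5)                           # 1 = predicted positive

sum(pred)    # number of test rows predicted as class 1
mean(prob)   # average predicted probability
```

Note that sum(pred) counts thresholded predictions while mean(prob) averages the raw probabilities, so the two answers need not agree.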
### Method 2
PollingImputed <- read.csv("Data/PollingImputed.csv")
TR <- subset(PollingImputed, Year == 2004 | Year == 2008)
TS <- subset(PollingImputed, Year == 2012)
str(TS)
'data.frame': 45 obs. of 7 variables:
$ State : Factor w/ 50 levels "Alabama","Alaska",..: 3 4 5 6 7 9 10 11 12 13 ...
$ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ Rasmussen : int 8 13 -12 3 -7 2 5 -22 31 -22 ...
$ SurveyUSA : int 5 21 -14 -2 -13 0 8 -24 24 -16 ...
$ DiffCount : int 4 2 -6 -5 -8 6 4 -2 1 -5 ...
$ PropR : num 0.833 1 0 0.308 0 ...
$ Republican: int 1 1 0 0 0 0 1 0 1 0 ...
mod2 = glm(Republican~SurveyUSA+DiffCount, data=TR, family="binomial")
TestPrediction = predict(mod2, newdata=TS, type="response")
TestPredictionBinary = as.numeric(TestPrediction > 0.5)
predictionDataFrame = data.frame(TestPrediction, TestPredictionBinary, TS$State)
str(predictionDataFrame)
'data.frame': 45 obs. of 3 variables:
$ TestPrediction : num 0.9739028 0.9994949 0.0000926 0.009433 0.0000343 ...
$ TestPredictionBinary: num 1 1 0 0 0 1 1 0 1 0 ...
$ TS.State : Factor w/ 50 levels "Alabama","Alaska",..: 3 4 5 6 7 9 10 11 12 13 ...
summary(predictionDataFrame) # Mean = 0.4852626
TestPrediction TestPredictionBinary TS.State
Min. :0.0000 Min. :0.000 Arizona : 1
1st Qu.:0.0001 1st Qu.:0.000 Arkansas : 1
Median :0.0649 Median :0.000 California : 1
Mean :0.4853 Mean :0.489 Colorado : 1
3rd Qu.:0.9986 3rd Qu.:1.000 Connecticut: 1
Max. :0.9999 Max. :1.000 Florida : 1
(Other) :39
table(predictionDataFrame$TestPredictionBinary) # 22
0 1
23 22
2.2 Merge Data into Map
Now we need to merge the prediction data frame (“predictionDataFrame”, or df in the first method) with the map data “statesMap”, as we did in lecture. Before doing so, we need to convert the state variable (Test.State in the assignment; df$state or TS.State here) to lowercase so that it matches the region variable in statesMap. Do this by typing the following in your R console:
df$region = tolower(df$state)
pmap = merge(statesMap, df, by='region')
How many observations are there in predictionMap (pmap in the first method)?
nrow(pmap) # 15034
[1] 15034
How many observations are there in statesMap?
nrow(statesMap) # 15537
[1] 15537
### Method 2
predictionDataFrame$region = tolower(predictionDataFrame$TS.State)
predictionMap = merge(statesMap, predictionDataFrame, by = "region")
predictionMap = predictionMap[order(predictionMap$order),]
str(predictionMap) # 15034
'data.frame': 15034 obs. of 9 variables:
$ region : chr "arizona" "arizona" "arizona" "arizona" ...
$ long : num -115 -115 -115 -115 -115 ...
$ lat : num 35 35.1 35.1 35.2 35.2 ...
$ group : num 2 2 2 2 2 2 2 2 2 2 ...
$ order : int 204 205 206 207 208 209 210 211 212 213 ...
$ subregion : chr NA NA NA NA ...
$ TestPrediction : num 0.974 0.974 0.974 0.974 0.974 ...
$ TestPredictionBinary: num 1 1 1 1 1 1 1 1 1 1 ...
$ TS.State : Factor w/ 50 levels "Alabama","Alaska",..: 3 3 3 3 3 3 3 3 3 3 ...
str(statesMap) # 15537
'data.frame': 15537 obs. of 6 variables:
$ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
$ lat : num 30.4 30.4 30.4 30.3 30.3 ...
$ group : num 1 1 1 1 1 1 1 1 1 1 ...
$ order : int 1 2 3 4 5 6 7 8 9 10 ...
$ region : chr "alabama" "alabama" "alabama" "alabama" ...
$ subregion: chr NA NA NA NA ...
2.3 The Rule of merge()
When we merged the data in the previous problem, it caused the number of observations to change. Why? Check out the help page for merge by typing ?merge to help you answer this question.
- Because we only make predictions for 45 states, we no longer have observations for some of the states. These observations were removed in the merging process.
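The row loss can be reproduced on a miniature example. The two data frames below are hypothetical; the sketch contrasts the default behaviour of merge, which keeps only rows with a match in both tables, with all.x = TRUE, which keeps every row of the first table.

```r
# Hypothetical miniature versions of the map and prediction tables.
mapRows <- data.frame(
  region = c("alabama", "alabama", "arizona", "arizona", "arizona"),
  order  = 1:5
)
preds <- data.frame(region = "arizona", repub = 1)  # no prediction for alabama

inner <- merge(mapRows, preds, by = "region")                # default: matched rows only
left  <- merge(mapRows, preds, by = "region", all.x = TRUE)  # keep every map row

nrow(mapRows)  # 5
nrow(inner)    # 3: the two alabama rows were dropped
nrow(left)     # 5: alabama kept, with repub = NA
```

In the real data the same effect accounts for the 15537 - 15034 = 503 rows that disappear after the merge.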
2.4 Plot the color map
Now we are ready to color the US map with our predictions! You can color the states according to our binary predictions by typing the following in your R console:
pmap = pmap[order(pmap$group, pmap$order) , ]
ggplot(pmap, aes(x=long, y=lat, group=group, fill=repub)) +
geom_polygon(color='black')
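The reordering step matters because merge sorts its result on the key column, which can scramble the sequence in which polygon vertices are drawn. A minimal sketch with hypothetical data:

```r
# Hypothetical map rows: "b" is drawn first (order 1-2), then "a" (order 3-4).
mapRows <- data.frame(region = c("b", "b", "a", "a"),
                      group  = c(1, 1, 2, 2),
                      order  = 1:4)
vals <- data.frame(region = c("a", "b"), value = c(0, 1))

merged <- merge(mapRows, vals, by = "region")  # sorted by region, not by order
merged$order                                   # may no longer be 1, 2, 3, 4

# Restore the drawing sequence before passing the data to geom_polygon().
restored <- merged[order(merged$group, merged$order), ]
restored$order
```

Without this step, geom_polygon may connect vertices in the wrong sequence and draw stray lines across the map.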

### Method 2
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary)) + geom_polygon(color = "black")

str(PollingImputed) # (1) Light blue
'data.frame': 145 obs. of 7 variables:
$ State : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 3 4 4 4 ...
$ Year : int 2004 2008 2004 2008 2004 2008 2012 2004 2008 2012 ...
$ Rasmussen : int 11 21 19 16 5 5 8 7 10 13 ...
$ SurveyUSA : int 18 25 21 18 15 3 5 5 7 21 ...
$ DiffCount : int 5 5 1 6 8 9 4 8 5 2 ...
$ PropR : num 1 1 1 1 1 ...
$ Republican: int 1 1 1 1 1 1 1 1 1 1 ...
# In the Republican column, 1 means Republican and 0 means Democrat
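The states appear light blue and dark blue in this map. Which color represents a Republican prediction?
- Light blue: a Republican prediction is coded as 1, which sits at the light end of ggplot2's default blue gradient.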
2.5
We see that the legend displays a blue gradient for outcomes between 0 and 1. However, the binary predictions have only two possible outcomes: 0 or 1. Let’s replot the map with discrete outcomes. We can also change the color scheme to blue and red, matching the blue associated with the Democratic Party and the red associated with the Republican Party in the US. This is done by adding scale_fill_gradient to the plot command.
Alternatively, we could plot the probabilities instead of the binary predictions by coloring the states by the predicted probability (TestPrediction, or pred in the first method) rather than the binary variable.
ggplot(pmap, aes(x=long, y=lat, group=group, fill=pred)) +
geom_polygon(color='black') +
scale_fill_gradient(
low="blue", high="red",
guide="legend", breaks= c(0,1),
labels=c("Democrat", "Republican"), name="Prediction 2012")

# scale: controls the colors
### Method 2
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary)) + geom_polygon(color = "black") + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction)) + geom_polygon(color = "black") + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

# (1) The two maps look very similar. This is because most of our predicted probabilities are close to 0 or close to 1.
You should see a gradient of colors ranging from red to blue. Do the colors of the states in the map for TestPrediction look different from the colors of the states in the map with TestPredictionBinary? Why or why not?
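- The colors differ only slightly; once plotted, the two maps look almost the same.
- TestPredictionBinary is discrete, with only two possible values: yes (1, Republican) or no (0, Democrat).
- TestPrediction is continuous: it is a probability, so any value between 0 and 1 is possible.
- The two maps differ little because both variables lie in the range 0 to 1, and when a state's political leaning is clear-cut the predicted probability is close to 0 or 1, which is nearly the same as the binary result.
- The colors would differ noticeably only if political leanings were unclear across the whole country: the probabilities would then sit near 0.5, and the entire map would take on the purple that the gradient assigns to values around 0.5.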
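The claim that the probabilities hug 0 and 1 can be checked numerically. The sketch below uses only the first five predicted probabilities printed by str(predictionDataFrame) in 2.1, not all 45, so it illustrates the idea rather than summarizing the full test set.

```r
# First five predicted probabilities, copied from the str() output in 2.1.
prob   <- c(0.9739028, 0.9994949, 0.0000926, 0.009433, 0.0000343)
binary <- as.numeric(prob > 0.5)

gap <- abs(prob - binary)   # how far each probability is from its 0/1 label
max(gap)                    # largest gap among these five values
mean(gap < 0.05)            # share of values within 0.05 of 0 or 1
```

When every gap is this small, the fill colors of the probability map are almost indistinguishable from those of the binary map.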
3. Understanding the Predictions
3.1
In the 2012 election, the state of Florida ended up being a very close race. It was ultimately won by the Democratic party.
df$pred[ df$state == 'Florida'] # 0.96404
[1] 0.964
Did we predict this state correctly or incorrectly?
- We incorrectly predicted this state by predicting that it would be won by the Republican party.
3.2
What was our predicted probability for the state of Florida?
df$pred[ df$state == 'Florida'] # 0.96404
[1] 0.964
### Method 2
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction)) + geom_polygon(color = "black") + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

table(PollingImputed$State, PollingImputed$Republican)
0 1
Alabama 0 2
Alaska 0 2
Arizona 0 3
Arkansas 0 3
California 3 0
Colorado 2 1
Connecticut 3 0
Delaware 2 0
Florida 2 1
Georgia 0 3
Hawaii 3 0
Idaho 0 3
Illinois 3 0
Indiana 1 2
Iowa 2 1
Kansas 0 3
Kentucky 0 3
Louisiana 0 3
Maine 3 0
Maryland 3 0
Massachusetts 3 0
Michigan 3 0
Minnesota 3 0
Mississippi 0 3
Missouri 0 3
Montana 0 3
Nebraska 0 3
Nevada 2 1
New Hampshire 3 0
New Jersey 3 0
New Mexico 2 1
New York 3 0
North Carolina 1 2
North Dakota 0 3
Ohio 2 1
Oklahoma 0 3
Oregon 3 0
Pennsylvania 3 0
Rhode Island 3 0
South Carolina 0 3
South Dakota 0 3
Tennessee 0 3
Texas 0 3
Utah 0 3
Vermont 2 0
Virginia 2 1
Washington 3 0
West Virginia 0 3
Wisconsin 3 0
Wyoming 0 2
predictionDataFrame # 9.640395e-01 = 0.9640395
What does this imply?
- Our prediction model did not do a very good job of correctly predicting the state of Florida, and we were very confident in our incorrect prediction.
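A small table makes "confident but wrong" concrete. In the sketch below, the Florida probability (0.964) and outcome (Democratic, coded 0) and the Arizona probability (0.974) and outcome (Republican, coded 1) come from the outputs in this document; the column names actual, correct and confidence are hypothetical and introduced only for illustration.

```r
# Two states taken from the outputs in this document.
results <- data.frame(
  state  = c("Florida", "Arizona"),
  pred   = c(0.964, 0.974),   # predicted probability of a Republican win
  actual = c(0, 1)            # 1 = Republican won, 0 = Democrat won
)
results$repub      <- as.numeric(results$pred > 0.5)
results$correct    <- results$repub == results$actual
results$confidence <- pmax(results$pred, 1 - results$pred)  # probability given to the predicted class

results[results$state == "Florida", ]
```

Florida has nearly the same confidence as Arizona, yet only the Arizona prediction is correct.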
4. Parameter Settings
In this part, we’ll explore what the different parameter settings of geom_polygon do. Throughout the problem, use the help page for geom_polygon, which can be accessed by ?geom_polygon. To see more information about a certain parameter, just type a question mark and then the parameter name to get the help page for that parameter. Experiment with different parameter settings to try and replicate the plots!
We’ll be asking questions about the following three plots:
grad = scale_fill_gradient(
low="blue", high="red",
guide="legend", breaks= c(0,1),
labels=c("Democrat", "Republican"), name="Prediction 2012")
ggplot(pmap, aes(x=long, y=lat, group=group, fill=repub)) + grad +
geom_polygon(color='black',linetype=3,size=1) + ggtitle("Plot(1)")

ggplot(pmap, aes(x=long, y=lat, group=group, fill=repub)) + grad +
geom_polygon(color='black',linetype=1,size=3) + ggtitle("Plot(2)")

ggplot(pmap, aes(x=long, y=lat, group=group, fill=repub)) + grad +
geom_polygon(color='black',linetype=1,size=1,alpha=0.3) + ggtitle("Plot(3)")

4.1
Plots (1) and (2) were created by changing different parameters of geom_polygon from their default values.
### linetype
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black", linetype=3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

### Plot(1): dotted line (linetype = 3)
What is the name of the parameter we changed to create plot (1)?
- linetype
What is the name of the parameter we changed to create plot (2)?
- size
4.2
Plot (3) was created by changing the value of a different geom_polygon parameter to have value 0.3. Which parameter did we use?
### alpha
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black", alpha=0.3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

?alpha
### alpha: Modify colour transparency. Vectorised in both colour and alpha.
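To see what an alpha of 0.3 does to a color, the base-R function adjustcolor from grDevices can attach an alpha channel to a named color. This is an illustration outside ggplot2, so it only approximates what geom_polygon does internally with its alpha parameter.

```r
# Base-R illustration (grDevices), not ggplot2 itself: attach an alpha
# channel of 0.3 to opaque red and blue.
solid <- c("red", "blue")
faded <- adjustcolor(solid, alpha.f = 0.3)

faded                          # "#RRGGBBAA" strings; the last two hex digits are the alpha
col2rgb(faded, alpha = TRUE)   # the alpha row is roughly 0.3 * 255
```

An alpha of 1 is fully opaque and 0 is fully transparent, so 0.3 lets most of the background show through the fill.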
