Main topic: overlaying data on administrative boundary maps
Sys.setlocale('LC_ALL','C')
## [1] "C"
library(ggplot2)
library(maps)
library(ggmap)
library(caTools)
If you look at the structure of the statesMap data frame using the str function, you should see that there are 6 variables. One of the variables, group, defines the different shapes or polygons on the map. Sometimes a state may have multiple groups, for example, if it includes islands. How many different groups are there? + 63
statesMap = map_data("state")
table(statesMap$group)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 202 149 312 516 79 91 94 10 872 381 233 329 257 256 113
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 397 650 399 566 36 220 30 460 370 373 382 315 238 208 70
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 125 205 78 16 290 21 168 37 733 12 105 238 284 236 172
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 66 304 166 289 1088 59 129 96 15 623 17 17 19 44 448
## 61 62 63
## 373 388 68
The variable “order” defines the order to connect the points within each group, and the variable “region” gives the name of the state.
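The roles of group, order, and region can be seen on a tiny hand-made data frame. The sketch below uses made-up coordinates and a hypothetical region name (toyMap and "toyland" are not part of the maps package); it only illustrates how one region can own several polygons.

```r
# Hypothetical toy data in the same layout as map_data("state"):
# one region made of two polygons (a mainland and an island).
toyMap <- data.frame(
  long   = c(0, 4, 4, 0,   6, 7, 6.5),
  lat    = c(0, 0, 3, 3,   1, 1, 2),
  group  = c(1, 1, 1, 1,   2, 2, 2),
  order  = 1:7,
  region = "toyland"
)

# Number of distinct polygons in the whole map
n_groups <- length(unique(toyMap$group))

# Number of polygons per region: more than 1 means the region has separate pieces
groups_per_region <- tapply(toyMap$group, toyMap$region,
                            function(g) length(unique(g)))

print(n_groups)
print(groups_per_region)
```

Applied to the real data, length(unique(statesMap$group)) should agree with the number of columns in the table above.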
You can draw a map of the United States by typing the following in your R console:
ggplot(statesMap, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "black")
# Each state is drawn as one or more polygons, which together make up the map (geom_polygon draws polygons)
We specified two colors in geom_polygon – fill and color. Which one defined the color of the outline of the states? + color
Now, let’s color the map of the US according to our 2012 US presidential election predictions from the Unit 3 Recitation. We’ll rebuild the model here, using the dataset PollingImputed.csv. Be sure to use this file so that you don’t have to redo the imputation to fill in the missing values, like we did in the Unit 3 Recitation.
Load the data using the read.csv function, and call it “polling”. Then split the data using the subset function into a training set called “Train” (tr in the code below) that has observations from 2004 and 2008, and a testing set called “Test” (ts in the code below) that has observations from 2012.
Note that we only have 45 states in our testing set, since we are missing observations for Alaska, Delaware, Alabama, Wyoming, and Vermont, so these states will not appear colored in our map.
Then, create a logistic regression model and make predictions on the test set using the following commands:
polling = read.csv("data/PollingImputed.csv")
tr = subset(polling, Year == 2004 | Year == 2008) # training set: 2004 and 2008
ts = subset(polling, Year == 2012) # test set: 2012
mod = glm(Republican ~ SurveyUSA + DiffCount, data = tr, family = "binomial")
pred = predict(mod, newdata = ts, type = "response")
pred_binary = as.numeric(pred > 0.5)
df = data.frame(pred, pred_binary, ts$State) # combine the predicted probabilities, the binary predictions (probability > 0.5), and the state names into one data frame for plotting
table(ts$Republican)
##
## 0 1
## 24 21
mean(pred)
## [1] 0.4852626
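To make sure everything went smoothly, answer the following questions. For how many states is our binary prediction 1 (for 2012), corresponding to Republican? + Count these with table(pred_binary). Note that table(ts$Republican) above tabulates the actual 2012 outcomes (21 Republican states), not the predictions, so the two counts need not agree.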
What is the average predicted probability of our model (on the Test set, for 2012)? + 0.4853
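Predicted classes and actual outcomes are easy to mix up, so it can help to tabulate them separately and then against each other. The sketch below uses a small made-up data frame (toy, with hypothetical columns poll and republican) rather than PollingImputed.csv, so the numbers are illustrative only.

```r
# Hypothetical toy data: a poll margin and the actual outcome (1 = Republican)
toy <- data.frame(
  poll       = c(-12, -8, -5, -3, -1, 2, 1, 4, 7, 10, -2, 3),
  republican = c(  0,  0,  0,  0,  1, 0, 1, 1, 1,  1,  0, 1)
)

fit      <- glm(republican ~ poll, data = toy, family = "binomial")
prob     <- predict(fit, type = "response")  # fitted probabilities on the toy data
pred_bin <- as.numeric(prob > 0.5)           # binary prediction at the 0.5 threshold

print(table(pred_bin))                       # counts of PREDICTED classes
print(table(toy$republican))                 # counts of ACTUAL outcomes
print(table(actual = toy$republican, predicted = pred_bin))  # confusion matrix
print(mean(prob))                            # average predicted probability
```

The first table answers "how many states do we predict as Republican", the second answers "how many states actually were Republican", and the third shows where the two disagree.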
Now, we need to merge “predictionDataFrame” (df in the code here) with the map data “statesMap”, like we did in lecture. Before doing so, we need to convert the Test.State variable (ts.State here) to lowercase, so that it matches the region variable in statesMap. Do this by typing the following in your R console:
df$region = tolower(df$ts.State) #convert the ts.State variable to lowercase, so that it matches the region variable in statesMap.
map = merge(statesMap, df, by = "region") #merge the two data frames
map = map[order(map$order,map$group),] #make sure the observations are in order
How many observations are there in predictionMap (map in the code here)? + 15034
How many observations are there in statesMap? + 15537
When we merged the data in the previous problem, it caused the number of observations to change. Why? Check out the help page for merge by typing ?merge to help you answer this question. + Because we only make predictions for 45 states, we no longer have observations for some of the states. These observations were removed in the merging process. + To keep all observations from both data frames, set all.x = TRUE and all.y = TRUE (or simply all = TRUE) in merge().
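The effect of merge on the row count can be reproduced on a few rows. This is a minimal sketch with made-up data: toyMap, toyPred, and the region names "alpha" and "beta" are hypothetical stand-ins for statesMap and the prediction data frame.

```r
# Hypothetical toy stand-ins for statesMap and the prediction data frame
toyMap <- data.frame(
  region = c("alpha", "alpha", "alpha", "beta", "beta", "beta"),
  order  = 1:6,
  long   = c(0, 1, 0, 2, 3, 2),
  lat    = c(0, 0, 1, 0, 0, 1)
)
toyPred <- data.frame(State = "Alpha", pred_binary = 1)

# Lowercase the key so it matches the region names in the map data
toyPred$region <- tolower(toyPred$State)

# Default merge keeps only rows whose key appears in BOTH data frames
inner <- merge(toyMap, toyPred, by = "region")

# all.x = TRUE keeps every map row and fills missing predictions with NA
left <- merge(toyMap, toyPred, by = "region", all.x = TRUE)

# merge may reorder rows, so restore the polygon drawing order
left <- left[order(left$order), ]

print(nrow(inner))
print(nrow(left))
print(sum(is.na(left$pred_binary)))
```

With all.x = TRUE the states without a prediction stay in the map data with an NA fill value instead of disappearing from the plot.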
Now we are ready to color the US map with our predictions! You can color the states according to our binary predictions by typing the following in your R console:
ggplot(map, aes(x = long, y = lat, group = group, fill = pred_binary)) + geom_polygon(color = "black")
# Blank areas are states with no data
The states appear light blue and dark blue in this map. Which color represents a Republican prediction? + Light blue
We see that the legend displays a blue gradient for outcomes between 0 and 1. However, when plotting the binary predictions there are only two possible outcomes: 0 or 1. Let’s replot the map with discrete outcomes. We can also change the color scheme to blue and red, to match the blue color associated with the Democratic Party in the US and the red color associated with the Republican Party in the US. This can be done with the following command:
# MIT's version
ggplot(map, aes(x = long, y = lat, group = group, fill = pred_binary))+ geom_polygon(color = "black") + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")
# My version
ggplot(map, aes(x=long, y=lat, group=group, fill=pred_binary))+geom_polygon(color="black")+ scale_fill_gradient(low="purple", high="yellow", guide="legend", breaks=c(0,1), labels=c("Democrat","Republican"), name="Prediction 2012")
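Another common way to get a two-color legend is to treat the binary prediction as a factor and pick the colors with scale_fill_manual. This is an alternative to the scale_fill_gradient calls above, offered as a sketch; the toy polygons and the name toyMap are made up so the example runs without the polling data.

```r
library(ggplot2)

# Hypothetical toy map: two square "states" with one binary prediction each
toyMap <- data.frame(
  long        = c(0, 1, 1, 0,  1, 2, 2, 1),
  lat         = c(0, 0, 1, 1,  0, 0, 1, 1),
  group       = rep(c(1, 2), each = 4),
  pred_binary = rep(c(0, 1), each = 4)
)

# factor() makes the fill discrete, so the legend shows two keys, not a gradient
p <- ggplot(toyMap, aes(x = long, y = lat, group = group,
                        fill = factor(pred_binary))) +
  geom_polygon(color = "black") +
  scale_fill_manual(values = c("0" = "blue", "1" = "red"),
                    labels = c("Democrat", "Republican"),
                    name = "Prediction 2012")

print(p)  # draws the plot on the active graphics device
```

The same fill = factor(pred_binary) mapping should work on the merged map data, since only the fill aesthetic and the scale change.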
Alternatively, we could plot the probabilities instead of the binary predictions. Change the plot command above to instead color the states by the variable TestPrediction (pred in the code here). You should see a gradient of colors ranging from red to blue. Do the colors of the states in the map for TestPrediction look different from the colors of the states in the map with TestPredictionBinary (pred_binary here)? Why or why not? + The two maps look very similar. This is because most of our predicted probabilities are close to 0 or close to 1.
In the 2012 election, the state of Florida ended up being a very close race. It was ultimately won by the Democratic party.
What was our predicted probability for the state of Florida?
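df$pred[df$ts.State == "Florida"] # the pred value (predicted probability) for the row of df whose state is Florida
# Note: data.frame() named the state column ts.State, so df$state does not exist and df$state == 'Florida' returns numeric(0)
# Even though the predicted probability is very high, Florida did not go Republican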
What does this imply? + Our prediction model did not do a very good job of correctly predicting the state of Florida, and we were very confident in our incorrect prediction.
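When a lookup like the Florida one comes back empty, the column name is the first thing to check. The sketch below uses made-up values (ts_toy, pred_toy, and the probabilities are hypothetical, not model output) to show how data.frame() names a column built from an expression such as ts$State.

```r
# Hypothetical toy stand-ins for ts and pred
ts_toy   <- data.frame(State = c("Florida", "Ohio"), stringsAsFactors = FALSE)
pred_toy <- c(0.9, 0.4)  # made-up probabilities, not model output

# Unnamed arguments get names derived from the expressions that produced them
df_toy <- data.frame(pred_toy, ts_toy$State)
print(names(df_toy))

# A column that does not exist is NULL, so the filter selects nothing
missing_col <- df_toy$pred_toy[df_toy$state == "Florida"]

# Using the generated column name works
by_name <- df_toy$pred_toy[df_toy$ts_toy.State == "Florida"]

# Naming the columns explicitly avoids the surprise altogether
df_named <- data.frame(pred = pred_toy, state = ts_toy$State)
explicit <- df_named$pred[df_named$state == "Florida"]

print(missing_col)
print(by_name)
print(explicit)
```

Calling names(df) on the real data frame is a quick way to confirm which column holds the state names before filtering.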
In this part, we’ll explore what the different parameter settings of geom_polygon do. Throughout the problem, use the help page for geom_polygon, which can be accessed by ?geom_polygon. To see more information about a certain parameter, just type a question mark and then the parameter name to get the help page for that parameter. Experiment with different parameter settings to try and replicate the plots!
We’ll be asking questions about the following three plots:
#1 linetype
ggplot(map, aes(x=long, y=lat, group=group, fill=pred_binary)) +
geom_polygon(color="black", linetype=3) +
scale_fill_gradient(low="blue", high="red", guide="legend", breaks=c(0,1), labels=c("Democrat","Republican"), name="Prediction 2012")+
ggtitle("plot1")
#2 size
ggplot(map, aes(x=long, y=lat, group=group, fill=pred_binary)) +
geom_polygon(color="black", size=3) +
scale_fill_gradient(low="blue", high="red", guide="legend", breaks=c(0,1), labels=c("Democrat","Republican"), name="Prediction 2012")+
ggtitle("plot2")
#3 alpha
ggplot(map, aes(x=long, y=lat, group=group, fill=pred_binary)) +
geom_polygon(color="black", alpha=0.3) +
scale_fill_gradient(low="blue", high="red", guide="legend", breaks=c(0,1), labels=c("Democrat","Republican"), name="Prediction 2012")+
ggtitle("plot3")