Main topic: Overlaying data on administrative boundary maps

Sys.setlocale('LC_ALL','C')
## [1] "C"
library(ggplot2)
library(maps)
library(ggmap)
library(caTools)


1. Drawing a Map of the US

1.1

If you look at the structure of the statesMap data frame using the str function, you should see that there are 6 variables. One of the variables, group, defines the different shapes or polygons on the map. Sometimes a state may have multiple groups, for example, if it includes islands. How many different groups are there?

statesMap = map_data("state")
table(statesMap$group)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##  202  149  312  516   79   91   94   10  872  381  233  329  257  256  113 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
##  397  650  399  566   36  220   30  460  370  373  382  315  238  208   70 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
##  125  205   78   16  290   21  168   37  733   12  105  238  284  236  172 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
##   66  304  166  289 1088   59  129   96   15  623   17   17   19   44  448 
##   61   62   63 
##  373  388   68
  • 63

The variable “order” defines the order to connect the points within each group, and the variable “region” gives the name of the state.
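The table lists every group with its point count, so the number of groups is simply the number of distinct values. Here is a minimal sketch on a made-up data frame; `toy`, `n_groups`, and `groups_per_region` are illustrative names, and with the real data `toy` would be replaced by `statesMap`.

```r
# Toy stand-in for statesMap: region "b" is split into two polygons (groups 2 and 3)
toy <- data.frame(
  region = c("a", "a", "a", "b", "b", "b", "b", "b", "b"),
  group  = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

# Number of distinct polygons on the map
n_groups <- length(unique(toy$group))

# Number of polygons per region (a region with islands has more than one)
groups_per_region <- tapply(toy$group, toy$region, function(g) length(unique(g)))

print(n_groups)
print(groups_per_region)
```

Counting distinct values directly avoids reading the answer off the last column header of a long table.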

1.2

You can draw a map of the United States by typing the following in your R console:

ggplot(statesMap, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "black")

# That is, the map is drawn from many polygons, with one or more polygons per state.

We specified two colors in geom_polygon – fill and color. Which one defined the color of the outline of the states? + color

2. Coloring the States by Predictions

2.1 Predictive Model

Now, let’s color the map of the US according to our 2012 US presidential election predictions from the Unit 3 Recitation. We’ll rebuild the model here, using the dataset PollingImputed.csv. Be sure to use this file so that you don’t have to redo the imputation to fill in the missing values, like we did in the Unit 3 Recitation.

Load the data using the read.csv function, and call it “polling”. Then split the data using the subset function into a training set called “Train” that has observations from 2004 and 2008, and a testing set called “Test” that has observations from 2012.

Note that we only have 45 states in our testing set, since we are missing observations for Alaska, Delaware, Alabama, Wyoming, and Vermont, so these states will not appear colored in our map.

Then, create a logistic regression model and make predictions on the test set using the following commands:

polling= read.csv("data/PollingImputed.csv")
tr = subset(polling, Year %in% c(2004, 2008)) # training set: 2004 and 2008
ts = subset(polling, Year == 2012)            # test set: 2012

mod = glm(Republican~SurveyUSA+DiffCount, data=tr, family="binomial")
pred = predict(mod, newdata=ts, type="response")

pred_binary = as.numeric(pred > 0.5)
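df = data.frame(pred, pred_binary, ts$State) # Combine the predicted probabilities, the binary predictions (probability > 0.5), and the state names into one data frame for plotting; the third column is named ts.State
table(ts$Republican) # actual 2012 outcomes, not predictions; use table(pred_binary) to count the predictions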
## 
##  0  1 
## 24 21
mean(pred)
## [1] 0.4852626

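To make sure everything went smoothly, answer the following questions. For how many states is our binary prediction 1 (for 2012), corresponding to Republican? + 22 (the number of 1s in table(pred_binary); table(ts$Republican) instead gives the 21 states the Republican actually won, and the two counts differ because Florida was mispredicted, as discussed in Section 3)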

What is the average predicted probability of our model (on the Test set, for 2012)? + 0.4853 (the value of mean(pred))
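The two questions rely on the difference between predicted and actual outcomes, which is easy to mix up. Here is a small sketch with invented numbers; `pred_toy` and `actual_toy` are hypothetical and are not taken from PollingImputed.csv.

```r
# Hypothetical predicted probabilities and actual outcomes for six states
pred_toy   <- c(0.02, 0.10, 0.96, 0.97, 0.99, 0.45)
actual_toy <- c(0,    0,    0,    1,    1,    0)

# Threshold at 0.5 to get binary predictions, as in the notebook
pred_binary_toy <- as.numeric(pred_toy > 0.5)

n_pred_rep   <- sum(pred_binary_toy)  # states predicted Republican
n_actual_rep <- sum(actual_toy)       # states actually won by the Republican

# Cross-tabulate actual outcomes (rows) against predictions (columns)
confusion <- table(actual = actual_toy, predicted = pred_binary_toy)
print(confusion)
```

The off-diagonal cell in row 0, column 1 counts states predicted Republican that were actually won by the Democrat. In the real data this is the kind of error discussed for Florida in Section 3.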

2.2 Merge Data into Map

Now, we need to merge the prediction data frame (“predictionDataFrame” in the assignment, df in this notebook) with the map data “statesMap”, like we did in lecture. Before doing so, we need to convert the state-name variable (Test.State in the assignment, ts.State here) to lowercase, so that it matches the region variable in statesMap. Do this by typing the following in your R console:

df$region = tolower(df$ts.State) #convert the ts.State variable to lowercase, so that it matches the region variable in statesMap.
map = merge(statesMap, df, by = "region") #merge the two data frames
map = map[order(map$order,map$group),] #make sure the observations are in order

How many observations are there in predictionMap (named map in this notebook)? + 15034

How many observations are there in statesMap? + 15537

2.3 The Rule of merge()

When we merged the data in the previous problem, it caused the number of observations to change. Why? Check out the help page for merge by typing ?merge to help you answer this question. + Because we only make predictions for 45 states, we no longer have observations for some of the states. These observations were removed in the merging process. + To keep all observations from both data frames, set all.x = TRUE and all.y = TRUE (or, equivalently, all = TRUE) in merge().
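The effect of the default merge and of all.x = TRUE can be seen on a tiny example. This is a sketch with made-up data frames (`shapes` and `preds` are illustrative stand-ins for statesMap and df), not the real data.

```r
# Toy map data: Vermont has a shape but no prediction
shapes <- data.frame(
  region = c("ohio", "ohio", "utah", "utah", "vermont"),
  order  = 1:5
)
# Toy predictions for two of the three states
preds <- data.frame(
  region      = c("ohio", "utah"),
  pred_binary = c(0, 1)
)

# Default merge keeps only rows whose region appears in both data frames
inner <- merge(shapes, preds, by = "region")

# all.x = TRUE keeps every row of shapes and fills missing predictions with NA
left <- merge(shapes, preds, by = "region", all.x = TRUE)

print(nrow(inner))
print(left[order(left$order), ])
```

With all.x = TRUE the unmatched state stays in the map data with an NA prediction, so its outline would still be drawn instead of leaving a gap.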

2.4 Plot the color map

Now we are ready to color the US map with our predictions! You can color the states according to our binary predictions by typing the following in your R console:

ggplot(map, aes(x = long, y = lat, group = group, fill = pred_binary)) + geom_polygon(color = "black")

# The blank areas are states with no prediction data.

The states appear light blue and dark blue in this map. Which color represents a Republican prediction? + Light blue

2.5

We see that the legend displays a blue gradient for outcomes between 0 and 1. However, when plotting the binary predictions there are only two possible outcomes: 0 or 1. Let’s replot the map with discrete outcomes. We can also change the color scheme to blue and red, to match the blue color associated with the Democratic Party in the US and the red color associated with the Republican Party in the US. This can be done with the following command:

# Course (MIT) version
ggplot(map, aes(x = long, y = lat, group = group, fill = pred_binary))+ geom_polygon(color = "black") + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

# My version, with a purple-to-yellow palette
ggplot(map, aes(x=long, y=lat, group=group, fill=pred_binary))+geom_polygon(color="black")+ scale_fill_gradient(low="purple", high="yellow", guide="legend", breaks=c(0,1), labels=c("Democrat","Republican"), name="Prediction 2012")
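The commands in this subsection keep a continuous fill scale and only restrict the legend to two breaks. An alternative, sketched here as an assumption rather than the course's method, is to convert the 0/1 variable to a factor and use scale_fill_manual. The `squares` data frame and `pred_binary_toy` column are made up so that the example runs without the polling data.

```r
library(ggplot2)

# Two toy square "states"; pred_binary_toy is an invented 0/1 prediction
squares <- data.frame(
  long  = c(0, 1, 1, 0,  1, 2, 2, 1),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
  group = rep(c(1, 2), each = 4),
  pred_binary_toy = rep(c(0, 1), each = 4)
)

# factor() makes the fill discrete, so each level gets exactly one color
p <- ggplot(squares, aes(x = long, y = lat, group = group,
                         fill = factor(pred_binary_toy))) +
  geom_polygon(color = "black") +
  scale_fill_manual(values = c("0" = "blue", "1" = "red"),
                    labels = c("Democrat", "Republican"),
                    name = "Prediction 2012")
```

Printing `p` draws the plot. A discrete scale suits the binary predictions, while the gradient scale remains the natural choice for the predicted probabilities.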

Alternatively, we could plot the probabilities instead of the binary predictions. Change the plot command above to instead color the states by the variable TestPrediction (pred in this notebook). You should see a gradient of colors ranging from red to blue. Do the colors of the states in the map for TestPrediction look different from the colors of the states in the map with TestPredictionBinary (pred_binary in this notebook)? Why or why not? + The two maps look very similar. This is because most of our predicted probabilities are close to 0 or close to 1.



3. Understanding the Predictions

3.1

In the 2012 election, the state of Florida ended up being a very close race. It was ultimately won by the Democratic party.

  • We incorrectly predicted this state by predicting that it would be won by the Republican party.

3.2

What was our predicted probability for the state of Florida?

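df$pred[df$ts.State == "Florida"] # predicted Republican probability for Florida; the state column in df is ts.State (df$state does not exist, so indexing by it returns numeric(0))
# Even though the predicted probability is very high, Florida was not won by the Republican.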

What does this imply? + Our prediction model did not do a very good job of correctly predicting the state of Florida, and we were very confident in our incorrect prediction.
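When looking up a single state, the column name must match exactly. The sketch below uses a made-up two-row data frame (`df_toy` and its values are illustrative only) to show why indexing by a column that does not exist returns numeric(0) instead of raising an error.

```r
# Toy data frame with the same state column name that data.frame() generates from ts$State
df_toy <- data.frame(pred_toy = c(0.10, 0.96),
                     ts.State = c("Ohio", "Florida"))

# A non-existent column is NULL, and NULL == "Florida" is logical(0)
missing_col <- df_toy$state
empty <- df_toy$pred_toy[df_toy$state == "Florida"]

# Using the real column name returns the matching row's value
found <- df_toy$pred_toy[df_toy$ts.State == "Florida"]

print(empty)
print(found)
```

Because R stays silent in the first case, an empty result after a lookup like this is worth checking against names(df).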

4. Parameter Settings

In this part, we’ll explore what the different parameter settings of geom_polygon do. Throughout the problem, use the help page for geom_polygon, which can be accessed by ?geom_polygon. To see more information about a certain parameter, just type a question mark and then the parameter name to get the help page for that parameter. Experiment with different parameter settings to try and replicate the plots!

We’ll be asking questions about the following three plots:

#1 linetype
ggplot(map, aes(x=long, y=lat, group=group, fill=pred_binary)) +
   geom_polygon(color="black", linetype=3) + 
   scale_fill_gradient(low="blue", high="red", guide="legend", breaks=c(0,1), labels=c("Democrat","Republican"), name="Prediction 2012")+
  ggtitle("plot1")

#2 size
ggplot(map, aes(x=long, y=lat, group=group, fill=pred_binary)) +
   geom_polygon(color="black", size=3) + # size sets the outline width; ggplot2 >= 3.4.0 prefers linewidth=3
   scale_fill_gradient(low="blue", high="red", guide="legend", breaks=c(0,1), labels=c("Democrat","Republican"), name="Prediction 2012")+
  ggtitle("plot2")

#3 alpha
ggplot(map, aes(x=long, y=lat, group=group, fill=pred_binary)) +
   geom_polygon(color="black", alpha=0.3) + 
   scale_fill_gradient(low="blue", high="red", guide="legend", breaks=c(0,1), labels=c("Democrat","Republican"), name="Prediction 2012")+
  ggtitle("plot3")