A classic use of maps in displaying data is coloring the regions of a map to indicate the value of some variable in each region. As an example, we will create maps showing information about the US Congress.
Our first step is to load the ggplot2 library and use its map_data() function to get the map of the US states. (map_data() draws on the maps package, which must be installed.)
library(ggplot2)
all_states <- map_data("state")
Next, we load a dataset containing the total number of female senators and House representatives in each state. To keep things simple, we’re just looking at the 48 contiguous states and the District of Columbia [1], so Alaska and Hawaii are excluded. Once we have this data loaded, we’ll rename the State variable to region so that it matches the variable name in our map data, then merge the two datasets by the region variable.
congress<-read.csv("womenincongress.csv")
names(congress)[2] <- "region"
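stateData <- merge(all_states,congress,by="region")
# merge() can reorder rows; restore the drawing order of the polygon vertices
stateData <- stateData[order(stateData$order),]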
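The merge only keeps states whose names match exactly in both tables, and the map data uses lower-case names (as in the subset later in this section). The sketch below shows one way to normalise and check the key before merging. It uses a small made-up table rather than the real file, and the names mapRegions, missingFromData, and missingFromMap are hypothetical.

```r
# Hypothetical stand-ins for the map regions and the congress table.
mapRegions <- c("alabama", "new york", "district of columbia")
congress <- data.frame(
  State = c("Alabama", "New York ", "District of Columbia"),
  senators = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# Normalise the key: trim stray whitespace and convert to lower case.
congress$region <- tolower(trimws(congress$State))

# Names present in one table but not the other would be dropped by merge().
missingFromData <- setdiff(mapRegions, congress$region)
missingFromMap <- setdiff(congress$region, mapRegions)
missingFromData
missingFromMap
```

If either vector is non-empty, the corresponding states would silently vanish from the merged data, and so from the map.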
One nice thing about the structure of ggplot2 is that we can build the graph up piece by piece. This makes it easier to create multiple versions of similar graphs, because we can save the commands that the versions have in common as a ggplot object, then add the varying elements separately. First, we’ll make a map in which the states are separated by grey borders and the fill color of each state is determined by the number of female senators representing that state. We’ll save this to an object named senatePlot. Note that this command does not print any plot to the screen. It simply saves the plot object.
senatePlot <- ggplot()+geom_polygon(data=stateData,aes(x=long, y=lat, group = group, fill=senators),color="grey50")+coord_map()
If you type senatePlot in the console, the plot will be generated.
senatePlot
All we’ve done so far is provide the coordinates needed to fill in the states’ shapes and provide some colors. The legend is not very helpful here. Since each state has two senators, the only possible values are 0, 1, and 2, and it would make more sense for our legend to show the three colors used to represent those values rather than a gradient scale. One way to make this happen is to ask R to treat the variable senators as a categorical variable by using the as.factor() function.
senatePlot <- ggplot()+geom_polygon(data=stateData,aes(x=long, y=lat, group = group, fill=as.factor(senators)),color="grey50")+coord_map()
senatePlot
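To see what as.factor() changes, it can help to look at a small vector on its own. This is only an illustration with made-up counts, not the real data.

```r
# Made-up senator counts for five hypothetical states.
senators <- c(0, 2, 1, 1, 0)

# As a numeric vector, ggplot2 would map these to a continuous gradient.
is.numeric(senators)

# As a factor, the distinct values become discrete levels,
# each of which gets its own legend entry.
senatorsFactor <- as.factor(senators)
levels(senatorsFactor)   # "0" "1" "2"
nlevels(senatorsFactor)  # 3
```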
Now the legend shows three distinct colors, but they don’t reflect the inherent numerical order of our values. To introduce a more appropriate scale, we need to add a scale command to our ggplot object. There are many different scales available (check the documentation, or type scale_ in RStudio to see them listed), but scale_fill_brewer contains color palettes designed to work well for displaying discrete values on maps, so we’ll use it. The scale command also controls the text above our legend. The type argument has three possible values: seq (sequential), div (diverging), or qual (qualitative). The palette argument has many options; see the documentation for a list, or colorbrewer2.org for demonstrations.
senatePlot <- senatePlot + scale_fill_brewer(name="Female Senators",type="seq",palette="Reds")
senatePlot
Let’s add a title and remove the latitude and longitude labels, since they don’t really add information in this setting. This can be done quickly with the labs() command.
senatePlot <- senatePlot + labs(x="", y="", title="Women in the Senate")
senatePlot
The axis tick marks and numbered labels aren’t helpful either. General aspects of the plot such as these can be set individually with the theme() command. (We could have used theme() to set the title and axis labels too, but labs() is faster if you don’t need to make as many modifications as we are in this example.) We’ll use element_blank() as the value for things we wish to remove.
senatePlot <- senatePlot + theme(axis.ticks.y = element_blank(),axis.text.y = element_blank(), axis.ticks.x = element_blank(),axis.text.x = element_blank())
senatePlot
That’s getting better, but the grey background and gridlines aren’t really adding any information, so we’d be better off without them too. We could change these settings by passing more arguments to the theme() command, but we can also use one of the many preset themes to make several changes at once. Take a look at the following graphs:
senatePlot+theme_classic()
senatePlot+theme_bw()
senatePlot+theme_light()
senatePlot+theme_dark()
Notice that when we added the command for one of the preset themes, the axis labels and tick marks returned. The preset theme command has overridden some of the choices we specified earlier using the theme() command. In ggplot2, the order in which we add commands can be important. The command that is added last is applied last and may override previously specified values.
Examine the difference between the following two plots, which are generated with the same commands, given in different orders.
senatePlot + theme(axis.ticks.y = element_blank(),axis.text.y = element_blank(), axis.ticks.x = element_blank(),axis.text.x = element_blank()) + theme_classic()
senatePlot + theme_classic() + theme(axis.ticks.y = element_blank(),axis.text.y = element_blank(), axis.ticks.x = element_blank(),axis.text.x = element_blank())
In the first plot, the theme() command precedes the theme_classic() command, so the classic theme, which includes axis tick marks, overrides the theme() command that removed those tick marks. In the second plot, the theme() command that removes the axis ticks comes after the theme_classic() command, so no axis tick marks appear.
With the Senate, there are only 3 possible values. Let’s look at the House, which has a variable number of representatives, depending on the population of the state. We’ll start by using all the same options as we used in the Senate plot above.
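Because the order matters, one option is to store the preset theme and the overrides together, in the right order, and reuse them. The sketch below is one possible way to do that and is not part of the original workflow. The names mapTheme, square, and demoPlot are hypothetical, and a toy polygon stands in for the state outlines.

```r
library(ggplot2)

# Hypothetical reusable bundle: the preset theme first, then the overrides,
# so the overrides are applied last and are not undone by the preset.
mapTheme <- list(
  theme_classic(),
  theme(axis.ticks = element_blank(), axis.text = element_blank())
)

# Toy polygon standing in for the state outlines.
square <- data.frame(
  long = c(0, 1, 1, 0),
  lat = c(0, 0, 1, 1),
  group = 1,
  value = 2
)

# Adding the list applies its elements to the plot in order.
demoPlot <- ggplot() +
  geom_polygon(data = square,
               aes(x = long, y = lat, group = group, fill = value),
               color = "grey50") +
  mapTheme
demoPlot
```

Setting axis.ticks and axis.text covers both the x and y axes, since the axis-specific elements inherit from them.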
housePlot <- ggplot()+geom_polygon(data=stateData,aes(x=long, y=lat, group = group, fill=as.factor(representatives)),color="grey50")+coord_map()+scale_fill_brewer(name="Female Representatives",type="seq",palette="Reds")+labs(x="",y="",title="Women in the House")+theme_classic()+ theme(axis.ticks.y = element_blank(),axis.text.y = element_blank(), axis.ticks.x = element_blank(),axis.text.x = element_blank())
housePlot
While treating the value as a factor made sense for the Senate, we might be better off treating it as a numerical value for the House, since we have a larger number of possible values. The scale_fill_brewer command only works with discrete values, so we’ll remove that command when we remove the as.factor() command.
housePlot <- ggplot()+geom_polygon(data=stateData,aes(x=long, y=lat, group = group, fill=representatives),color="grey50")+coord_map()+labs(x="",y="",title="Women in the House")+theme_classic()+ theme(axis.ticks.y = element_blank(),axis.text.y = element_blank(), axis.ticks.x = element_blank(),axis.text.x = element_blank())
housePlot
That’s fine, but if we want more control over the colors, we can use the scale_fill_gradient command, which allows us to specify the color of the bottom and the top of our scale.
housePlot <- housePlot + scale_fill_gradient(name="Female Representatives",low="whitesmoke",high="darkred")
housePlot
This plot gives the impression that California has far more female representatives than any other state. This is true, but it is a bit misleading, since California also has many more representatives (of any gender) than most states. It would be more appropriate to color the states according to the proportion of their representatives who are women. So let’s add a new variable to our data set which gives that proportion.
stateData$repProp <- stateData$representatives/stateData$total
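As a quick check of what this calculation produces, here is the same division on a small made-up table. The state names and counts are hypothetical, chosen only to show the extremes.

```r
# Made-up counts for three hypothetical states.
toy <- data.frame(
  region = c("state a", "state b", "state c"),
  representatives = c(1, 10, 0),  # female representatives
  total = c(1, 40, 5)             # all representatives
)

# Proportion of each state's representatives who are women.
toy$repProp <- toy$representatives / toy$total
toy$repProp  # 1.00 0.25 0.00
```

A state with a single representative can only score 0 or 1, which is worth keeping in mind when reading the map.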
housePlot <- ggplot()+geom_polygon(data=stateData,aes(x=long, y=lat, group = group, fill=repProp),color="grey50")+coord_map()+labs(x="",y="",title="Women in the House")+theme_classic()+ theme(axis.ticks.y = element_blank(),axis.text.y = element_blank(), axis.ticks.x = element_blank(),axis.text.x = element_blank()) + scale_fill_gradient(name="Female Representatives",low="whitesmoke",high="darkred")
housePlot
Now we can see that California isn’t such an outlier in terms of female representatives. Of course, now Wyoming and South Dakota appear to be outliers. This is because each of these states has one female representative, who is the only representative from that state. The top of our colorbar is overlapping with our legend label, so let’s move the label to the bottom of the colorbar by adding a guide argument to our scale command.
housePlot + scale_fill_gradient(name="Female Representatives",low="whitesmoke",high="darkred",guide=guide_colorbar(title.position="bottom"))
## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.
Now that we know how to make maps of state-level data, we can quickly make these types of plots for many kinds of data. For instance, we can show the results of the 2012 election. (The 2012 election data are taken from Wikipedia.)
electionData <- read.csv("2012.csv")
names(electionData)[1] <- "region"
totalVotes <- electionData$ObamaVotes+electionData$RomneyVotes+electionData$JohnsonVotes+electionData$SteinVotes
electionData$ObamaPerc <- electionData$ObamaVotes/totalVotes
electionData$RomneyPerc <- electionData$RomneyVotes/totalVotes
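electionData <- merge(all_states,electionData,by="region")
# merge() can reorder rows; restore the drawing order of the polygon vertices
electionData <- electionData[order(electionData$order),]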
electionPlot <- ggplot()+geom_polygon(data=electionData,aes(x=long, y=lat, group = group, fill=ObamaPerc),color="grey50")+coord_map()+labs(x="",y="",title="2012 Election Results")+theme_classic()+ theme(axis.ticks.y = element_blank(),axis.text.y = element_blank(), axis.ticks.x = element_blank(),axis.text.x = element_blank())
electionPlot
Rather than the default color scheme, we might want to show regions that favored Governor Romney, the Republican, in red and regions that favored President Obama, the Democrat, in blue.
electionPlot + scale_fill_gradient(name="Obama's Percentage",low="red",high="blue")
It’s a bit difficult to distinguish patterns when so many of the colors are near each other. To get a finer degree of control over the colors, we can use scale_fill_gradient2, which allows us to specify the color in the middle of our gradient as well as the low and high ends. It typically works best if you also supply the value you wish to be used as the midpoint. For election data, the most sensible midpoint value is 0.5.
electionPlot + scale_fill_gradient2(name="Obama's Percentage",low="red",mid="white",high="blue",midpoint=.5)
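One caveat about the 0.5 midpoint: ObamaPerc is a share of the votes for all four candidates, so a state that Obama narrowly carried can still fall slightly below 0.5. A possible alternative, sketched here with made-up vote counts, is a two-party share, for which 0.5 is exactly the tie point. The column name ObamaTwoParty is hypothetical and is not used elsewhere in this section.

```r
# Made-up vote counts for two hypothetical states.
votes <- data.frame(
  region = c("state a", "state b"),
  ObamaVotes = c(490, 300),
  RomneyVotes = c(480, 600),
  JohnsonVotes = c(20, 60),
  SteinVotes = c(10, 40)
)

# Share of all four candidates' votes, as computed in the text.
allVotes <- with(votes, ObamaVotes + RomneyVotes + JohnsonVotes + SteinVotes)
votes$ObamaPerc <- votes$ObamaVotes / allVotes

# Share of the two major candidates' votes only.
votes$ObamaTwoParty <- with(votes, ObamaVotes / (ObamaVotes + RomneyVotes))

votes[, c("region", "ObamaPerc", "ObamaTwoParty")]
```

In "state a" Obama leads Romney, yet ObamaPerc is 0.49, so that state would be tinted slightly red with midpoint=.5. Its two-party share is above 0.5, so it would be tinted blue.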
With this revised color scheme, we can clearly see which states had very tight contests (e.g., Florida, Virginia), which states were firmly for Obama (e.g., California, Vermont), and which were firmly for Romney (e.g., Utah, Wyoming). Of course, it is still difficult to see the values for the small states in the Northeastern part of the US. We could subset the data to create a map of those states by themselves.
northEast <- subset(electionData, region %in% c("connecticut", "maine", "massachusetts", "new hampshire", "rhode island", "vermont", "new jersey", "new york", "pennsylvania", "delaware", "maryland", "district of columbia"))
NEelectionPlot <- ggplot()+geom_polygon(data=northEast,aes(x=long, y=lat, group = group, fill=ObamaPerc),color="grey50")+coord_map()+labs(x="",y="",title="2012 Election Results")+theme_classic()+ theme(axis.ticks.y = element_blank(),axis.text.y = element_blank(), axis.ticks.x = element_blank(),axis.text.x = element_blank()) + scale_fill_gradient2(name="Obama's Percentage",low="red",mid="white",high="blue",midpoint=.5)
NEelectionPlot
Finally, we are able to see DC!
[1] While the district has no voting representation in Congress, it does have a non-voting representative, Eleanor Holmes Norton, who is included in this data, although you will need to zoom in to see her effect.