A few days ago Bradley Boehmke recreated a graphic from Edward Tufte’s classic book Visual Display of Quantitative Information, 2nd Ed. (page 30). The implementation looked like a lot of yak shaving (especially for the annotations) but it’s still (or precisely because of that) a great piece.
Then yesterday I stumbled over Lisa Charlotte Rost’s blog and really liked the article and the accompanying visualization about Styling Choropleth Maps. You should go read that article now if you are interested in mapping. I’ll wait for you here… Welcome back! Where were we? So, I asked her how she created it and upon learning that the final graphic was put together in Illustrator I started wondering if this could be done with R and ggplot2. Inspired by Bradley’s post I went to work.
This is Lisa’s original image:
require(RCurl)
require(ggplot2)
require(grid)
require(gridExtra)
require(maps)
require(mapproj)
# require(RColorBrewer) # not really used - I thought i'd need it
require(classInt) # Jenks natural breaks
There are really no suprises here. I am using RCurl to download the data, ggplot and maps to draw histograms and… well… maps. There is only one package that I hadn’t used before and that is classInt. QGIS’ natural breaks coloring uses an algorithm by George Jenks which is to my good fortune implemented in that package. I was also thinking that I’d need RColorBrewer but Lisa seems to be using a sligtly treaked version of the YlOrRd palette so I ended up just copying her colors into a manual color scale.
The first thing I did, was getting the state shapes from the map package and download the current US census data from the web.
states <- map_data("state")
population<- read.csv(textConnection(getURL('https://www.census.gov/popest/data/national/totals/2014/files/NST_EST2014_ALLDATA.csv')))[-(1:5),]
Next I calculated some descriptive statistics to help me with the task of cutting the range of population sizes into the different intervals needed for the QGIS styling modes described in Lisa’s article. I then computed the breaks of the intervals and added the information to which interval each state belongs to the population data frame via the very useful cut method.
min = min(population$CENSUS2010POP)
max = max(population$CENSUS2010POP)
diff <- max - min
std = sd(population$CENSUS2010POP)
equal.interval = seq(min, max, by = diff/6)
quantile.interval = quantile(population$CENSUS2010POP, probs=seq(0, 1, by = 1/6))
std.interval = c(seq(min, max, by=std), max)
natural.interval = classIntervals(population$CENSUS2010PO, n = 6, style = 'jenks')$brks
population$population.equal = cut(population$CENSUS2010POP, breaks=equal.interval, include.lowest = TRUE)
population$population.quantile = cut(population$CENSUS2010POP, breaks=quantile.interval, include.lowest = TRUE)
population$population.std = cut(population$CENSUS2010POP, breaks=std.interval, include.lowest = TRUE)
population$population.natural = cut(population$CENSUS2010POP, breaks=natural.interval, include.lowest = TRUE)
Now every state (represented as a row in the population data frame) had not only it’s population (in the CENSUS2010POP column) associated to it but also the information which bucket it belongs to for the different styles. Here are the relevant fields of the first few rows.
head(population[c('NAME', 'CENSUS2010POP', 'population.equal', 'population.quantile', 'population.std', 'population.natural')])
## NAME CENSUS2010POP population.equal population.quantile
## 6 Alabama 4779736 [5.64e+05,6.68e+06] (4.09e+06,5.99e+06]
## 7 Alaska 710231 [5.64e+05,6.68e+06] [5.64e+05,1.18e+06]
## 8 Arizona 6392017 [5.64e+05,6.68e+06] (5.99e+06,9.61e+06]
## 9 Arkansas 2915918 [5.64e+05,6.68e+06] (2.76e+06,4.09e+06]
## 10 California 37253956 (3.11e+07,3.73e+07] (9.61e+06,3.73e+07]
## 11 Colorado 5029196 [5.64e+05,6.68e+06] (4.09e+06,5.99e+06]
## population.std population.natural
## 6 [5.64e+05,7.33e+06] (2.06e+06,4.78e+06]
## 7 [5.64e+05,7.33e+06] [5.64e+05,2.06e+06]
## 8 [5.64e+05,7.33e+06] (4.78e+06,8e+06]
## 9 [5.64e+05,7.33e+06] (2.06e+06,4.78e+06]
## 10 (3.44e+07,3.73e+07] (2.51e+07,3.73e+07]
## 11 [5.64e+05,7.33e+06] (4.78e+06,8e+06]
Fortunately the states data frame had a field called region that closely matched the name field of our population data frame - just one of them was all lower case. Thus by adding another column using the right string transformation enabled me to merge (or you might be more familiar with the word join for this kind of operation) the two data frames. This gave me one big data frame with all the information I needed to create the histograms on the left of Lisa’s graphic and also the maps.
population$region = tolower(population$NAME)
choro <- merge(states, population, sort = FALSE, by = "region")
choro <- choro[order(choro$order), ]
Now that the data was available it was pretty easy to get started with the vizualization. ggplot2 is just awesome to get going fast. The next graph shows a plot of the population distribution:
ggplot(population, aes(x=CENSUS2010POP)) + geom_histogram()
It was easy to use one of the interval lists for the QGIS styles as breaks for the histogram with only minor modifications:
ggplot(population, aes(x=CENSUS2010POP)) + geom_histogram(breaks=natural.interval, stat='bin')
And now I had to add the vertical lines indicating the data points:
ggplot(population, aes(x=CENSUS2010POP)) + geom_vline(aes(xintercept=CENSUS2010POP), color='red') + geom_histogram(breaks=natural.interval)
The map was equally easy to generate in it’s basic form:
ggplot(choro) + geom_polygon(aes(x=long, y=lat, group=group, fill=population.natural)) + coord_map()
Now all I had to do was to combine all the bits and pieces into a unit and tweak the look until I was happy. I should have expected it: This was actually the most tedious part. Therefore I’ll guide you through my final solution now instead of taking you on the journey of my struggles.
The next snipped was actually one of the last bit of code that I wrote. It creates a stripped down theme without borders, grid and axes so that I am able to match Lisa’s style.
common.theme = theme(axis.line = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
plot.title=element_blank(),
axis.text.x=element_blank(),
axis.title.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_text(size=10, face='bold'),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
map.theme = common.theme + theme(axis.title.y=element_blank())
The color scheme was also created rather late in the process. Lisa uses a color scale that matches the YlOrRd theme from Cynthia Brewer’s color scales very closely but it’s a little darker. Therefore I simply used a color picker ad copied her colors as hex calues to create my own color scale.
colors <- c("#FDC25A", "#FD7834", "#E9151C", "#891E14", "#45150C", "#000000") # copied from lisa's maps - not exactly but close to YlOrRd
scale_fill_lisa <- scale_fill_manual(values=colors)
qplot(color, fill=color, data=data.frame(color=colors)) + scale_fill_lisa + guides(fill=FALSE) + common.theme + ylab(NULL)
To be able to join the different histograms in a way that the vertical lines look like the cover the whole graphic I had even more theming to do. Depending on where the graph was located in the layout I added a negative margin an the top, bottom or both.
top.graph = theme(plot.margin = unit(c(0,.5,-.5,.5), "line"))
middle.graph = theme(plot.margin = unit(c(-.5,.5,-.5,.5), "line"))
bottom.graph = theme(plot.margin = unit(c(-.5,.5,.5,.5), "line"))
To still get a reasonable amount of whitespace between the histograms I hardcoded the y-scale.
common.scale_y = scale_y_continuous(limits=c(0,60))
I created reusable fragments for the vertical lines indicating the states’ population data points and and the small black line segments at the bottom of each histogram.
common.vline = geom_vline(aes(xintercept=CENSUS2010POP), color=colors[1], size=0.5)
interval_rug <- function(breaks) { return(geom_linerange(aes(x=CENSUS2010POP), ymin=-5, ymax=0, data=data.frame(CENSUS2010POP=breaks))) }
Finally everything was coming together into the actual hitograms. To not repeat myself too much I refactored the code into a function that assembles the histograms from just a few parameters.
hist <- function(interval, position, label) {
return(
ggplot(population, aes(x=CENSUS2010POP)) +
common.vline +
geom_histogram(breaks=interval, fill='black', color='black') +
geom_line(y=0) + # faux x-axis
interval_rug(interval) +
common.scale_y +
position +
common.theme +
ylab(label)
)
}
hist.equal <- hist(equal.interval, top.graph, "Equal Interval")
hist.quantile <- hist(quantile.interval, middle.graph, "Quantile")
hist.std <- hist(std.interval, middle.graph, "Std Deviation")
hist.natural <- hist(natural.interval, bottom.graph, "Natural Breaks")
I created a similar function for the maps. One new thing that I learned here was aes_string – apparently you can’t put a variable into an aes(fill=...) call.
map <- function(cuts) {
return(
ggplot(choro) + geom_polygon(aes_string(y='lat', x='long', group='group', fill=cuts), color="white", size=.2) + scale_fill_lisa + guides(fill=FALSE) + map.theme + coord_map()
)
}
map.equal <- map('population.equal')
map.quantile <- map('population.quantile')
map.std <- map('population.std')
map.natural <- map('population.natural')
By far the most ugly part of the whole endeavour was creating the header annotation in a way that looked reasonably similar to what Lisa had built. This reminded me about the feelings I had when reading Bradley Boehmke’s article and it brought Bret Victor’s Drawing Dynamic Visualizations to my mind. Manipulating symbols is just not a nice way to do (information) design work - especially with a slow feedback loop like this. In the end I got reasonably close the the original.
hist.title <- ggplot(data.frame(minmax=c(min,max)), aes(x=minmax)) +
scale_y_continuous(limits=c(0,1)) +
geom_blank() +
geom_linerange(ymin=0, ymax=1, size=.5) +
geom_line(y=1, size=.5) +
common.theme +
theme(plot.margin = unit(c(.5,.5,0,.5), "line")) + ylab('') +
annotation_custom(grob=textGrob(label=paste(' ',format(min/1000000, digits=2),'m people'), just='left', gp=gpar(fontsize = 10, fontface='bold')), xmin=min, xmax=min) +
annotation_custom(grob=textGrob(label=paste(format(max/1000000, digits=2),'m people '), just='right', gp=gpar(fontsize = 10, fontface='bold')), xmin=max, xmax=max)
Then I had to assemble everything into a single chart. This last snipped was also a piece of constant experimentation. Getting the width and height ratios right wasn’t exactly a fun exercise.
This is my final result.
grid.arrange(hist.title, textGrob(label="Population in\nUS-american states", just='left', x= unit(0, "npc"), gp=gpar(fontsize = 10, fontface='bold')),
hist.equal, map.equal,
hist.quantile, map.quantile,
hist.std, map.std,
hist.natural, map.natural,
ncol=2, widths=c(4,1), heights=c(1.2,3,3,3,3))
I am quite happy with the result. I proved to myself that it is possible to recreate Lisa’s graphic in R. While this was a fun exercise it was also a confirmation for myself (which didn’t come unexpected) that R is not the tool to design graphics. It’s awesome to get results fast. Exploring datasets with ggplot2 is a bliss. Tweaking the visuals however took longer than getting the basic graphics done. Maybe preparing a basic graph in R and exporting to SVG is a viable solution? I haven’t done much research yet.
I hope you enjoyed the article and could learn a thing or two. Again a big shout out to Lisa for creating the original graphic and to Bradley to inspire me to recreate it.
The following sources proved very useful during creating this graphic.