In his paper A Layered Grammar of Graphics, Hadley Wickham rebuilt Minard’s infographic of Napoleon’s march on Russia to illustrate the layer mechanism in ggplot2. Since I’m the learning-by-doing type of person, in this post I will reproduce Hadley’s example from start to finish.

Prepare data with the dplyr package

The dataset is made available at this link.

minard <- read.csv("minard.csv", fileEncoding="UTF-8-BOM")

Let’s take a look at the original dataset. First, some column names are not good variable names: for example, surviv is the number of survivors, direc is the direction, and lonc, lont and lonp are different names for longitude. Second, the dataset is actually 3 tables merged together: the first 3 columns make up one table and the last 5 columns form another. I’ll use those 2 tables to reproduce Minard’s graph.

minard

There is no need to reshape or tidy the dataset in this situation, so the dplyr package is all we need to split the original dataset into those tables.

library(dplyr)
# select the relevant columns and rename them (new_name = old_name)
troops <- select(minard, long = lonp, lat = latp, survivors = surviv, direction = direc, division)
# the cities table is shorter than the troops table, so drop its rows with NA values
cities <- na.omit(select(minard, long = lonc, lat = latc, city))
# display both tables
troops
cities
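
To see the select-and-rename step in isolation, here is a minimal sketch on a made-up table. The data frame toy and its values are hypothetical and only mimic the shape of the real dataset: a short table stored next to a longer one, padded with NA.

```r
library(dplyr)

# Hypothetical toy data: a 2-row "cities" table stored beside a 3-row "troops" table
toy <- data.frame(
  lonc   = c(24.0, 25.3, NA),
  latc   = c(55.0, 54.7, NA),
  city   = c("A", "B", NA),
  lonp   = c(24.0, 24.5, 25.5),
  surviv = c(300, 200, 100)
)

# select() picks columns and renames them in one step (new_name = old_name)
toy_troops <- select(toy, long = lonp, survivors = surviv)

# na.omit() drops the rows that contain NA in the shorter table
toy_cities <- na.omit(select(toy, long = lonc, lat = latc, city))

names(toy_troops)  # "long" "survivors"
nrow(toy_cities)   # 2
```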

Plotting with the ggplot2 package

ggplot2’s layer mechanism enables its users to “divide and conquer” a wide range of graphics. As a user, you simply layer your way to the final graphic. Subsequent layers inherit the settings made in the call to ggplot() and can override those settings if needed.

First, we use ggplot() to create the “base layer”. The troops dataset will be passed to subsequent layers. Similarly, the long variable will be mapped to the x-axis and the lat variable to the y-axis in later layers unless overridden.

library(ggplot2)
layer1 <- ggplot(troops, aes(long, lat))
layer1

This is the default signature of geom_path(), which makes up the second layer:

geom_path(mapping = NULL, data = NULL, stat = "identity", position = "identity", ..., lineend = "butt", linejoin = "round", linemitre = 1, arrow = NULL, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)

In our call, the size of the path is mapped to survivors, the color to direction, and group = division makes each division its own path:

layer2 <- layer1 + geom_path(aes(size = survivors, color = direction, group = division), lineend = "round", linejoin = "round")
layer2

Now, on to the third layer. Here the troops dataset specified in the call to ggplot() is overridden with the cities dataset, for this layer only.

layer3 <- layer2 + geom_text(aes(label = city), size = 3, data = cities)
layer3
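
The override can be checked on a small plot. This is a sketch on hypothetical data (df_lines and df_labels are made up), and it assumes that a plot’s layers can be inspected through p$layers, which is how ggplot2 objects have usually been examined but is not something the example above depends on.

```r
library(ggplot2)

# Hypothetical data: one table for the path, one for the labels
df_lines  <- data.frame(x = c(1, 2, 3), y = c(1, 3, 2))
df_labels <- data.frame(x = 2, y = 3, name = "peak")

p <- ggplot(df_lines, aes(x, y)) +
  geom_path() +                                    # inherits df_lines from ggplot()
  geom_text(aes(label = name), data = df_labels)   # overrides the data for this layer

# Only the second layer stores a data frame of its own
path_has_own_data <- is.data.frame(p$layers[[1]]$data)
text_has_own_data <- is.data.frame(p$layers[[2]]$data)
path_has_own_data  # FALSE
text_has_own_data  # TRUE
```

Note that geom_text() still inherits the x and y mappings from ggplot(), which is why df_labels needs columns named x and y.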

Did you notice that I didn’t name the next variable layer4 but named it finalGraph instead? I did that not because finalGraph is a better name, but because scale_size(), scale_color_manual(), xlab(), ylab() and ggtitle() add no new layer to the graph. scale_size() and scale_color_manual() are scale functions: they control the mapping between data and aesthetics. xlab(), ylab() and ggtitle() only set the axis labels and the title. In other words, they all modify what is already there. But which layer do the scales affect? In this situation, layer2 (i.e., the geom_path() layer), because it is the only layer that maps the size and color aesthetics.
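
A quick way to convince yourself is to count the layers before and after adding a scale. The sketch below uses hypothetical data and geom_point() instead of geom_path() to keep it small, and it assumes the layers are reachable through the $layers element of the plot object.

```r
library(ggplot2)

# Hypothetical data for illustration only
df <- data.frame(x = c(1, 2, 3), y = c(1, 3, 2), w = c(10, 20, 30))

base_plot  <- ggplot(df, aes(x, y))                  # no geom yet
with_geom  <- base_plot + geom_point(aes(size = w))  # adds one layer
with_scale <- with_geom + scale_size(range = c(1, 10)) + xlab("x value")

n_base  <- length(base_plot$layers)   # 0
n_geom  <- length(with_geom$layers)   # 1
n_scale <- length(with_scale$layers)  # still 1: scales and labels add no layer
```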

The function scale_size() is very useful in this case to display the numeric values in the Survivors legend appropriately. The labels argument can take a function that takes the breaks as input and returns the labels as output. For example, the comma() function from the scales package turns the label 1e+05 into 100,000.
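
Here is what comma() does on its own, outside of any plot. It is a small sketch that only assumes the scales package is installed.

```r
library(scales)

# The same break values used for the Survivors legend
v <- c(1, 2, 3) * 10^5

# comma() formats numbers with a thousands separator and returns character labels
legend_labels <- comma(v)
legend_labels  # "100,000" "200,000" "300,000"
```

Because comma is itself a function, it should also be possible to hand it over directly, as in scale_size(breaks = v, labels = comma), and let ggplot2 apply it to the breaks.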

You might be tempted to think that comma(breaks) also works, but it doesn’t (I tried). Don’t let the = fool you: unlike v, breaks is the name of an argument in this case, not a variable. Thus, you have to pass the same vector assigned to breaks into comma(); here the vector is c(1, 2, 3) * 10^5, stored in v. Now I hope you realize why, in R, = and <- should not be used interchangeably.
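
The difference between = and <- inside a function call can be shown without ggplot2. In this sketch, total() and demo_breaks are hypothetical names chosen for illustration.

```r
# Hypothetical helper standing in for any function with a named argument
total <- function(demo_breaks) sum(demo_breaks)

# With '=', demo_breaks only names the argument; no variable is created
total(demo_breaks = c(1, 2, 3))
exists_after_equals <- exists("demo_breaks")  # FALSE

# With '<-', the vector is first assigned to a variable in the calling
# environment, and then passed to the function by position
total(demo_breaks <- c(1, 2, 3))
exists_after_arrow <- exists("demo_breaks")   # TRUE
```

This is why comma(breaks) fails: after breaks = v, there is no variable called breaks to pass along.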

library(scales) # provides comma()
v <- c(1, 2, 3) * 10^5
finalGraph <- layer3 + scale_size("Survivors", range = c(1, 10), breaks = v, labels = comma(v)) +
  scale_color_manual("Direction", values = c("grey50", "red")) +
  xlab("Longitude") + ylab("Latitude") + ggtitle("Napoleon's march to Russia")
finalGraph

That’s it, thanks for reading!

