---
title: "Re-build Minard's Napoleon's march on Russia with dplyr & ggplot2"
output: html_notebook
---

In his paper [*A Layered Grammar of Graphics*](http://vita.had.co.nz/papers/layered-grammar.pdf), Hadley Wickham rebuilt Minard's infographic of [Napoleon's march on Russia](https://commons.wikimedia.org/wiki/File:Minard.png#/media/File:Minard.png) to illustrate the **layers** mechanism in ggplot2. Since I'm a learning-by-doing type of person, in this post I will reproduce Hadley's example from start to finish.

# Prepare data with the `dplyr` package
The dataset is available at this [link](https://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/minard.txt). The code below reads it from a local file named `minard.csv`.

```{r}
minard <- read.csv("minard.csv", fileEncoding="UTF-8-BOM")
```

Let's take a look at the whole original dataset. First, some column names are not good variable names: for example, `surviv` is the number of survivors, `direc` is the direction, and `lonc`, `lont` and `lonp` are different names for longitude. Second, the dataset is actually 3 tables merged together: the first 3 columns make up one table and the last 5 columns form another. I'll use those 2 tables to reproduce Minard's graph.
```{r}
minard
```

There is no need to reshape or tidy the dataset in this situation, so the `dplyr` package is all we need to split the original dataset into its tables.
```{r}
library(dplyr)
# select the relevant columns, rename them, and remove rows with NA values
troops <- select(minard, long = lonp, lat = latp, survivors = surviv, direction = direc, division)
cities <- na.omit(select(minard, long = lonc, lat = latc, city))
#display tables
troops
cities
```
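
To make the renaming behaviour of `select()` more concrete, here is a minimal sketch on a made-up data frame. The `toy` table and its values are hypothetical and only mimic the shape of the Minard data: a short table padded with `NA` next to a longer one. `select()` keeps the listed columns and renames them with `new = old`, and `na.omit()` drops the padding rows.

```{r}
library(dplyr)

# Hypothetical toy data (made-up values): a 2-row "cities" table padded
# with NA so it can sit next to a 3-row "troops" table
toy <- data.frame(
  lonc   = c(24.0, 25.3, NA),
  city   = c("A", "B", NA),
  lonp   = c(24.0, 24.5, 25.5),
  surviv = c(300, 200, 100)
)

# Keep and rename the city columns, then drop the NA padding row
toy_cities <- na.omit(select(toy, long = lonc, city))

# Keep and rename the troop columns; there are no NA values to drop here
toy_troops <- select(toy, long = lonp, survivors = surviv)

names(toy_cities)
nrow(toy_cities)
names(toy_troops)
```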

# Plotting using the `ggplot2` package

`ggplot2`'s **layers mechanism** enables its users to "divide and conquer" a wide range of graphics. As a user, you simply *layer* your way to the final graphic. Subsequent layers inherit the settings declared in `ggplot()` and can override those settings if needed.

First, we use `ggplot()` to create the **"base layer"**. The `troops` dataset will be passed to subsequent layers. Similarly, the `long` variable will be mapped to the x-axis and the `lat` variable to the y-axis in later layers unless overridden.

```{r}
library(ggplot2)
layer1 <- ggplot(troops, aes(long, lat))
layer1
```

This is the [default](http://docs.ggplot2.org/current/geom_path.html) signature of `geom_path()`, which makes up the second layer:
```{r, eval=FALSE}
# Signature shown for reference only; `...` cannot be evaluated at top level
geom_path(mapping = NULL, data = NULL, stat = "identity", position = "identity",
          ..., lineend = "butt", linejoin = "round", linemitre = 1, arrow = NULL,
          na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
```

* Let's see how some important settings apply to our case:
    + **data = NULL**: `layer2` inherits the dataset (`troops`) specified in the call to `ggplot()` in `layer1`.
    + **inherit.aes = TRUE**: the layer inherits the aesthetics set in `ggplot()`. You don't have to change this to `FALSE` to override an aesthetic. `FALSE` means ignore all aesthetics set in `ggplot()`; use it if you want to start a new layer from scratch or if there are too many aesthetics to override. `TRUE` means keep the aesthetics set in `ggplot()` and combine them with, or override them by, the layer's own aesthetics. In our case, what is inherited from `ggplot()` is the mapping of `long` to the x-axis and `lat` to the y-axis.
    + **show.legend = NA**: includes a legend if any aesthetics are mapped. Here, path size is mapped to the number of survivors and color to direction, so there will be two legends corresponding to those two variables.
    + **na.rm = FALSE**: removes missing values with a warning. Since I removed missing values in the data preparation step, no warning will show up. If set to `TRUE`, missing values are removed without any warning. Unless I know the dataset extremely well, I will never set `na.rm = TRUE`; silent errors are among the worst types of errors.
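
The inheritance rules above can be checked on a small example. This is only a sketch on a made-up data frame (`toy_df` and the `toy_*` names are hypothetical, not part of the Minard data). It uses `layer_data()` from `ggplot2` to look at the data each layer actually ends up with.

```{r}
library(ggplot2)

# Hypothetical toy data, unrelated to the Minard dataset
toy_df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 3), g = c("a", "a", "a"))

# Base plot: the data and the x/y mappings are declared once
toy_base <- ggplot(toy_df, aes(x, y))

# Default inherit.aes = TRUE: x and y come from ggplot(), colour is added
toy_inherit <- toy_base + geom_path(aes(colour = g))

# inherit.aes = FALSE: the layer has to restate every aesthetic it needs,
# but it still inherits the data because data = NULL
toy_no_inherit <- toy_base + geom_path(aes(x = x, y = y), inherit.aes = FALSE)

# Both layers are drawn from the same x/y values taken from toy_df
layer_data(toy_inherit, 1)[, c("x", "y")]
layer_data(toy_no_inherit, 1)[, c("x", "y")]
```

Note that `inherit.aes` only concerns aesthetics; whether a layer inherits the data is decided by its `data` argument.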

```{r}
layer2 <- layer1 + geom_path(aes(size = survivors, color = direction, group = division), lineend = "round", linejoin = "round")
layer2
```

Now, to the third layer. For this layer only, the `troops` dataset specified in the call to `ggplot()` is overridden with the `cities` dataset through the `data` argument.
```{r}
layer3 <- layer2 + geom_text(aes(label = city), size = 3, data = cities)
layer3
```
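
Here is a small sketch of the same per-layer override on made-up data (`toy_path` and `toy_labels` are hypothetical names and values). The path layer keeps the dataset given to `ggplot()`, while the text layer swaps in its own; `layer_data()` shows how many rows each layer receives.

```{r}
library(ggplot2)

# Hypothetical toy data: a path with 4 points and 2 labelled positions
toy_path   <- data.frame(x = c(1, 2, 3, 4), y = c(1, 3, 2, 4))
toy_labels <- data.frame(x = c(1, 4), y = c(1, 4), name = c("start", "end"))

toy_plot <- ggplot(toy_path, aes(x, y)) +
  geom_path() +                                    # uses toy_path (inherited)
  geom_text(aes(label = name), data = toy_labels)  # uses toy_labels instead

# Number of rows seen by the path layer and by the text layer
nrow(layer_data(toy_plot, 1))
nrow(layer_data(toy_plot, 2))
```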
Did you notice that I didn't name the next variable `layer4` but named it `finalGraph` instead? That wasn't because `finalGraph` is a better name, but because [scale_size()](http://docs.ggplot2.org/current/scale_size.html), [scale_color_manual()](http://docs.ggplot2.org/current/scale_manual.html), and [xlab(), ylab(), ggtitle()](http://docs.ggplot2.org/current/labs.html) add no new layer to the graph. `scale_size()` and `scale_color_manual()` are scale functions: they control the mapping between data and aesthetics. `xlab()`, `ylab()` and `ggtitle()` set the axis labels and the title. In other words, they all modify how existing layers are displayed. But which layer do the scales modify? In this situation, they modify `layer2` (i.e., the `geom_path()` layer), because it is the only layer that maps size and color to data.
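
One way to check that scales and labels add no layer is to count the layers before and after adding them. The sketch below uses a made-up data frame (`toy_sc` is hypothetical) and assumes the layers of a plot can be read from its `layers` element, which is an implementation detail of `ggplot2` rather than a documented interface.

```{r}
library(ggplot2)

# Hypothetical toy data, unrelated to the Minard dataset
toy_sc <- data.frame(x = c(1, 2, 3), y = c(1, 2, 1), n = c(10, 20, 30))

# One layer: points whose size is mapped to n
toy_before <- ggplot(toy_sc, aes(x, y)) + geom_point(aes(size = n))

# Add a scale, an axis label and a title on top of it
toy_after <- toy_before +
  scale_size("Count", range = c(1, 10)) +
  xlab("X") +
  ggtitle("Toy plot")

# Assumption: `$layers` holds the list of layers of a plot object
length(toy_before$layers)
length(toy_after$layers)
```

Both counts should be the same, since only `geom_*()` and `stat_*()` calls add layers.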

The function `scale_size()` is very useful in this case to display the numeric values in the `Survivors` legend appropriately. The `labels` argument accepts either a vector of labels or a function that takes the `breaks` *values* as input and returns labels as output. Here, the `comma()` function from the `scales` package takes the `breaks` *values* as input and changes a label from `1e+05` to `100,000`.

You might be tempted to think that `comma(breaks)` also works, but it doesn't (I tried). Don't let the `=` fool you: unlike `v`, `breaks` is the name of an argument, not a variable. Thus, you have to pass the same vector assigned to `breaks` into `comma()`; here the vector is `c(1, 2, 3) * 10^5`. Now I hope you realize why, in R, `=` and `<-` should **not** be used interchangeably.

```{r}
library(scales)  # provides comma()
v <- c(1, 2, 3) * 10^5
finalGraph <- layer3 + scale_size("Survivors", range = c(1, 10), breaks = v, labels = comma(v)) +
  scale_color_manual("Direction", values = c("grey50", "red")) +
  xlab("Longitude") + ylab("Latitude") + ggtitle("Napoleon's march to Russia")
finalGraph
```
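
Since `labels` can also take a function, a possible variation is to pass `comma` itself instead of the pre-computed `comma(v)`. The sketch below uses a made-up data frame (`toy_sv` and `toy_breaks` are hypothetical) to keep it independent of the Minard data.

```{r}
library(ggplot2)
library(scales)

# The same break values as in the final graph
toy_breaks <- c(1, 2, 3) * 10^5

# comma() turns the numeric breaks into character labels
comma(toy_breaks)

# Hypothetical toy data with one point per break value
toy_sv <- data.frame(x = c(1, 2, 3), y = c(1, 2, 1), survivors = toy_breaks)

# Passing the function itself: ggplot2 calls it on the breaks for us
toy_legend <- ggplot(toy_sv, aes(x, y)) +
  geom_point(aes(size = survivors)) +
  scale_size("Survivors", range = c(1, 10), breaks = toy_breaks, labels = comma)
toy_legend
```

With the function form there is no need to repeat the break vector, so the labels cannot drift out of sync with the breaks.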

That's it, thanks for reading!
