I’ve been having a lot of fun recently exploring the power of ggplot2 to make custom charts. One type of chart that is very popular in the data visualization community right now is Edward Tufte’s slopegraph, which generally connects data at two points in time with minimal non-data elements. I’ve been using slopegraphs in a few projects recently, so I thought I’d share the ggplot2 code.

I’ll be preparing a chart that shows projected population change to the end of the century by major world region from the United Nations’ World Population Prospects, The 2012 Revision. I employ the regional definitions from the UN.

First, let’s load the required libraries and fetch the data from the UN website. It is a large CSV file, so it could take a couple of minutes depending on your internet speed, even with read_csv from the readr package. If you are reproducing this, you may want to download the data yourself and read it in locally, which is much faster.

library(readr)
library(dplyr)
library(ggplot2)
library(magrittr)

dat <- read_csv("http://esa.un.org/wpp/ASCII-Data/ASCII_FILES/WPP2012_DB02_POPULATIONS_ANNUAL.csv")
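
If you'd rather not re-download the file every time you run the script, one option is to cache it on disk. Here is a minimal sketch of that idea; read_cached_csv and the local file name are hypothetical names of my own, not part of readr.

```r
library(readr)

# Hypothetical helper: download the file only if no local copy exists yet,
# then read the local copy with read_csv
read_cached_csv <- function(url, path) {
  if (!file.exists(path)) {
    download.file(url, destfile = path, mode = "wb")
  }
  read_csv(path)
}

# Example call (commented out so the sketch runs without a network connection):
# dat <- read_cached_csv(
#   "http://esa.un.org/wpp/ASCII-Data/ASCII_FILES/WPP2012_DB02_POPULATIONS_ANNUAL.csv",
#   "WPP2012_DB02_POPULATIONS_ANNUAL.csv"
# )
```

The first call pays the download cost; later calls only read from disk.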

I’ll use dplyr to clean up the data and get it ready for visualization. As a slopegraph only requires two data points per region, I only need a starting point (2015) and an endpoint (2099). I’ll be using the UN’s medium projection variant. I’ll also create a couple of label columns, formatted so that they’ll appear nicely on the chart. For populations above 1 billion, I’ll show the values in billions; below 1 billion, I’ll show them in millions.

# These are the region names in the dataset
regions <- c("Africa", "Latin America and the Caribbean", "Northern America", "Europe", "Oceania", "Asia")

# Filter down to 2015 and 2099 and the medium projection variant, and calculate millions & billions
# (the divisors assume PopTotal is reported in thousands of people)
dat2 <- dat %>%
  filter(VarID == 2, Location %in% regions, Time %in% c(2015, 2099)) %>%
  mutate(millions = round(PopTotal / 1000, 0), billions = round(PopTotal / 1000000, 1)) %>%
  select(Location, Time, millions, billions)

# Shorten a couple region names which I'll be using for labels
dat2$Location <- gsub("Latin America and the Caribbean", "Latin America", dat2$Location)
dat2$Location <- gsub("Northern America", "North America", dat2$Location)

# Prepare some label columns for the slopegraph
dat2 %<>%
  mutate(label15 = ifelse(billions > 1, 
                          paste0(Location, " ", as.character(billions), "b  "),
                          paste0(Location, " ", as.character(millions), "m  ")), 
         label99 = ifelse(billions > 1, 
                          paste0("  ", as.character(billions), "b"), 
                          paste0("  ", as.character(millions), "m")))
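
To make the label logic easier to check, here is a small base R sketch of the same rounding and threshold rule in isolation. The helper name format_pop is hypothetical, the input values are made up for illustration, and I'm assuming (as the divisors above imply) that the input is a population count in thousands.

```r
# Hypothetical helper mirroring the label rule above.
# pop_thousands: population counts in thousands (assumed unit)
format_pop <- function(pop_thousands) {
  millions <- round(pop_thousands / 1000, 0)
  billions <- round(pop_thousands / 1e6, 1)
  # Use billions only when the rounded value is strictly above 1
  ifelse(billions > 1, paste0(billions, "b"), paste0(millions, "m"))
}

# Made-up inputs, not UN figures
format_pop(c(39000, 741000, 4390000))
# "39m" "741m" "4.4b"
```

One thing this makes visible: because the comparison uses the rounded value, an input of 1,040,000 rounds to 1.0 billion, fails the "> 1" test, and is labelled "1040m". If that matters for your data, comparing against the unrounded value is one way around it.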

Now, I’ll build the slopegraph. My design includes dots at the endpoints of the lines, which was inspired by a recent implementation by FiveThirtyEight, though I’ve seen similar charts by Pew and The Economist in recent weeks as well. Notice that I include multiple calls to geom_text, which allows me finer-grained control over the placement of the data labels so that they are not too close to one another. I also plot the data on a logarithmic scale so variation for all regions is visible, and I remove almost all non-data elements in the theme call. The result appears below the code.

locs_to_adjust1 <- c("Africa", "Europe", "North America")

locs_to_adjust2 <- c("Asia", "Latin America")

ggplot(dat2) + 
  # linewidth replaces size for lines as of ggplot2 3.4.0; use size = 2 on older versions
  geom_line(aes(x = as.factor(Time), y = millions, group = Location, color = Location), linewidth = 2) + 
  geom_point(aes(x = as.factor(Time), y = millions, color = Location), size = 5) + 
  theme_minimal(base_size = 18) + 
  scale_color_brewer(palette = "Dark2") + 
  xlab("") + 
  geom_text(data = subset(dat2, Time == 2015 & Location != "Latin America"), 
            aes(x = as.factor(Time), y = millions, color = Location, label = label15), 
            size = 6, hjust = 1) + 
  geom_text(data = subset(dat2, Time == 2015 & Location == "Latin America"), 
            aes(x = as.factor(Time), y = millions, color = Location, label = label15), 
            size = 6, hjust = 1, vjust = 0.8) + 
  geom_text(data = subset(dat2, Time == 2099 & Location %in% locs_to_adjust1), 
            aes(x = as.factor(Time), y = millions, color = Location, label = label99), 
            size = 6, hjust = 0, vjust = 0.8) + 
  geom_text(data = subset(dat2, Time == 2099 & Location %in% locs_to_adjust2), 
            aes(x = as.factor(Time), y = millions, color = Location, label = label99), 
            size = 6, hjust = 0, vjust = 0.2) + 
  geom_text(data = subset(dat2, Time == 2099 & Location == "Oceania"), 
            aes(x = as.factor(Time), y = millions, color = Location, label = label99), 
            size = 6, hjust = 0) + 
  scale_y_log10() + 
  theme(legend.position = "none", 
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(), 
        axis.ticks.y = element_blank(),
        axis.ticks.x = element_blank(), 
        axis.title.y = element_blank(), 
        axis.text.y = element_blank(), 
        plot.title = element_text(size = 18)) + 
  ggtitle("Projected population change by region, 2015-2099")
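
The repeated geom_text calls work, but they get long. A different approach, which is not what I did above, is to store the vertical offset in a column and map it to the vjust aesthetic in a single geom_text call. Below is a rough sketch on toy data; the region names, values, and the offsets lookup are all placeholders of mine.

```r
library(ggplot2)

# Toy end-of-century labels; names and values are placeholders, not UN data
labs99 <- data.frame(
  Location = c("Region A", "Region B", "Region C"),
  Time     = 2099,
  millions = c(4000, 700, 650),
  label99  = c("  4b", "  700m", "  650m"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup of per-region vertical offsets; 0.5 centers the label
offsets <- c("Region B" = 0.2, "Region C" = 0.8)
labs99$vj <- ifelse(labs99$Location %in% names(offsets),
                    offsets[as.character(labs99$Location)], 0.5)

# One geom_text call, with vjust taken from the data
p <- ggplot(labs99, aes(x = as.factor(Time), y = millions, color = Location)) +
  geom_point(size = 5) +
  geom_text(aes(label = label99, vjust = vj), size = 6, hjust = 0) +
  scale_y_log10() +
  theme(legend.position = "none")
```

The trade-off is that the offsets live in the data rather than in the plotting code, which is tidier but a little less obvious when you are nudging labels by eye.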

I’m happy with the result. The slopegraph shows the two major trends that I’m interested in: Africa’s projected dramatic population increase, and Europe’s projected loss of 100 million people by the end of the century. Some information is lost in this particular implementation, however. While the chart implies slow, steady population growth in Asia, Asia’s population is actually projected to peak in 2053 and then decline to the end of the century. An alternate implementation could include a third data point at mid-century to account for this.
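
As a rough sketch of that three-point variant, the only real change is the set of years passed to filter. The example below uses a toy data frame with the same column names as the UN file and made-up values, and 2050 is simply my assumption for the mid-century point.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the UN data: same column names, made-up values (in thousands)
toy <- data.frame(
  Location = rep(c("Region A", "Region B"), each = 3),
  Time     = rep(c(2015, 2050, 2099), times = 2),
  VarID    = 2,
  PopTotal = c(1000000, 1500000, 1400000, 500000, 450000, 400000)
)

# 2050 is an assumed mid-century point
years <- c(2015, 2050, 2099)

toy3 <- toy %>%
  filter(VarID == 2, Time %in% years) %>%
  mutate(millions = round(PopTotal / 1000, 0))

# With three x positions, a rise followed by a decline becomes visible
p <- ggplot(toy3, aes(x = as.factor(Time), y = millions,
                      group = Location, color = Location)) +
  geom_line() +
  geom_point(size = 3) +
  scale_y_log10() +
  theme_minimal()
```

The labels would need a third column (or a rethink of which points get labelled), so I've left them out of the sketch.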

Certainly, this is not the only way to do slopegraphs in ggplot2, and I’ve borrowed from other tutorials on the web to get this done. Credit is due to the following authors, whom I encourage you to check out:

If you have any questions or suggestions, please let me know!