Stacking and dodging in ggplot

Motivation

I had been happily using ggplot to produce column charts for our paper (coming out soon: Szkalisity & Vanharanta 2025; please consider citing it if you use the extension described below) when one day my supervisor asked: would it be possible to shade the columns based on a variable?

My initial thought was: of course it must be possible, ggplot is amazing! Our plot already mapped a variable to color and dodged on it, so the only thing we needed was a combination of dodge and stack (dodge the colors and stack the shades).

Then I found a related question on Stack Overflow, https://stackoverflow.com/questions/12715635/ggplot2-bar-plot-with-both-stack-and-dodge, with an essentially negative answer: https://stackoverflow.com/questions/12592041/plotting-a-stacked-bar-plot/12592235#12592235

A year later, this was still a freshly opened feature request on the ggplot2 GitHub page: https://github.com/tidyverse/ggplot2/issues/6324

An associated issue

The need for combined stacking and dodging is not just a request for an exotic use case. Its absence is also a source of errors, as outlined below. This section was motivated by the following source: https://rpubs.com/Mentors_Ubiqum/geom_col_1

Take the following dataset which contains the fate of passengers of the famous ocean liner RMS Titanic.

library(tidyverse)

df = as_tibble(Titanic)
df

## # A tibble: 32 × 5
##    Class Sex    Age   Survived     n
##    <chr> <chr>  <chr> <chr>    <dbl>
##  1 1st   Male   Child No           0
##  2 2nd   Male   Child No           0
##  3 3rd   Male   Child No          35
##  4 Crew  Male   Child No           0
##  5 1st   Female Child No           0
##  6 2nd   Female Child No           0
##  7 3rd   Female Child No          17
##  8 Crew  Female Child No           0
##  9 1st   Male   Adult No         118
## 10 2nd   Male   Adult No         154
## # ℹ 22 more rows

As you can see, the data frame contains the information stratified by travel class, sex, age group (adult or child) and survival. The variable n says how many people fell into the given category.

So let’s make a simple plot of the number of people who survived and who lost their lives:

df %>% ggplot() +
  aes(x = Survived, y = n) + geom_col()

Great, these numbers match up nicely with my childhood memories of the number of survivors.

So let’s now plot these stratified by travel class:

df %>% ggplot() +
  aes(x = Survived, y = n, fill = Class) + geom_col()

So far nothing strange. But this visualization makes it hard to read, e.g., the number of passengers lost in 3rd class. So it would be natural to try displaying the classes next to one another. This functionality is called dodging in ggplot2.

df %>% ggplot() +
  aes(x = Survived, y = n, fill = Class) + geom_col(position = position_dodge())

Nice and easy, right?

Well, as you probably guessed, no. There is a mistake here, one that is not easy to notice and that is closely related to the need for dodging and stacking together. So let’s double-check: what should we see in the plot above? The number of passengers who died and who survived in each class. So let’s calculate this with tidyverse:

df %>% 
  group_by(Class, Survived) %>% 
  summarize(sum(n)) %>% 
  arrange(Survived) # just for easier organization

## # A tibble: 8 × 3
## # Groups:   Class [4]
##   Class Survived `sum(n)`
##   <chr> <chr>       <dbl>
## 1 1st   No            122
## 2 2nd   No            167
## 3 3rd   No            528
## 4 Crew  No            673
## 5 1st   Yes           203
## 6 2nd   Yes           118
## 7 3rd   Yes           178
## 8 Crew  Yes           212

So there should be 528 people perishing in 3rd class, but the plot above shows fewer than 400. Also, there should be 203 people surviving in 1st class, and yet the plot shows only a bit above 100.

The reason for this discrepancy is the form of the data: remember that the passengers were stratified by sex and age group too. So when we add transparency and an outline to the columns displayed above, the truth is revealed:

df %>% ggplot() +
  aes(x = Survived, y = n, fill = Class) + 
  geom_col(position = position_dodge(), color = "black", alpha = 0.5)

The individual rows of the data frame were plotted on top of each other! Of course it’s our fault: we didn’t say how to aggregate the multiple rows in the data that are assigned to the same plotting category, so ggplot drew each row as-is, without any aggregation. This is what the identity statistic does. The potential for misinterpretation was raised before here: https://github.com/tidyverse/ggplot2/issues/4766
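The overlap can also be checked numerically. The following is a minimal sketch, not part of the original analysis: it counts how many rows land on each dodged bar and compares the tallest of them (which is what remains visible when all rectangles start at zero) with the true total. The column names rows_per_bar, visible_top and true_total are made up for illustration.

```r
library(tidyverse)

# Same data frame as above
df <- as_tibble(Titanic)

overlap_check <- df %>%
  group_by(Class, Survived) %>%
  summarize(
    rows_per_bar = n(),     # rows drawn at the same dodged position
    visible_top  = max(n),  # height of the tallest overlapping rectangle
    true_total   = sum(n),  # height the bar should have
    .groups = "drop"
  )

overlap_check
```

Each bar is built from four rows (two sexes times two age groups), and for 3rd class non-survivors the tallest row is 387 while the true total is 528.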

What we would intuitively want is to dodge based on color, but stack everything else. There is a workaround for this: explicitly telling ggplot to aggregate the multiple rows with the sum function:

df %>% ggplot() +
  aes(x = Survived, y = n, fill = Class) + 
  stat_summary(fun = sum, # explicitly sum the y values at each dodged position
               geom = "col", # use the same geom as before
               position = position_dodge())
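An alternative that should give the same picture is to aggregate the data before plotting, so that each dodged bar corresponds to exactly one row. This is only a sketch of that common pattern; the name totals is chosen here for illustration.

```r
library(tidyverse)

df <- as_tibble(Titanic)

# One row per (Class, Survived) combination, so nothing can overlap
totals <- df %>%
  group_by(Class, Survived) %>%
  summarize(n = sum(n), .groups = "drop")

p <- ggplot(totals) +
  aes(x = Survived, y = n, fill = Class) +
  geom_col(position = position_dodge())
p
```

Pre-aggregating keeps the plotting call simple, at the cost of maintaining a separate summary table.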

But this workaround does not allow shading the classes, e.g. by their age group. For that, you need combined dodging and stacking.
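With stock ggplot2, a commonly used approximation is to replace the dodge with facets: put the class on the x axis, stack the age groups within each column, and split the panels by survival. The sketch below assumes this layout is an acceptable substitute. It changes the look of the figure (separate panels instead of dodged groups), and the name by_age is made up for illustration.

```r
library(tidyverse)

df <- as_tibble(Titanic)

# Sum over Sex so that each (Survived, Class, Age) combination is one row
by_age <- df %>%
  group_by(Survived, Class, Age) %>%
  summarize(n = sum(n), .groups = "drop")

p <- ggplot(by_age) +
  aes(x = Class, y = n, fill = Class, alpha = Age) +
  geom_col() +              # the default position of geom_col is "stack"
  facet_wrap(~ Survived) +  # panels stand in for the dodged x groups
  scale_alpha_manual(values = c("Adult" = 1.0, "Child" = 0.5))
p
```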

Dodge-stack functionality

Adding this feature takes just a couple of extra lines of code in the implementation of position_dodge. The updated ggplot2 package is available here: https://github.com/szkabel/ggplot2

Until it is (hopefully) integrated into ggplot2 itself, you need to use the devtools package to load the modified version of ggplot2:

library(devtools)
load_all("./ggplot2") # Replace . with the directory that contains the downloaded ggplot2 folder

Then you can use it as follows to produce dodge-stacking directly within geom_col:

df %>% ggplot() +
  aes(x = Survived, y = n, fill = Class) + 
  geom_col(position = position_dodge(stack.overlap = "by_extent"), color = "black", alpha = 0.5)

Now you can stratify the dodged columns further, e.g. to display the age-group distribution within each class:

df %>% 
  arrange(Age) %>% # Just for the ordering of the stack
  ggplot() +
  aes(x = Survived, y = n, fill = Class, alpha = Age, group = Class) +
  geom_col(position = position_dodge(stack.overlap = "by_extent")) +
  scale_alpha_manual(values = c("Adult" = 1.0, "Child" = 0.5))

Although it is a bit more complicated and requires calculating the summaries in advance, you can also add error bars:

df %>% 
  group_by(Survived, Class, Age) %>%
  summarize(nofP = sum(n), 
            SE = sd(n) / sqrt(n())) %>% # Standard error: sd divided by the square root of the group size
  ggplot() + aes(x = Survived,
                 y = nofP,
                 ymin = nofP-SE,
                 ymax = nofP+SE, 
                 fill = Class, alpha = Age, group = Class) + 
  geom_col(position = position_dodge(stack.overlap = "by_extent")) +
  geom_errorbar(position = position_dodge(stack.overlap = "by_center", width = 0.9), width = 0.2) +
  scale_alpha_manual(values = c("Adult" = 1, "Child" = 0.6))

I hope you will find it useful.

Abel Szkalisity

2025-02-11
