I had been happily using ggplot2 to produce column charts for our paper (forthcoming: Szkalisity & Vanharanta 2025; please consider citing it if you use the extension described below) when one day my supervisor asked: would it be possible to shade the columns based on a variable?
My initial thought was: of course it must be possible, ggplot2 is amazing! Our plots already mapped a variable to colour and dodged on it, so all we needed was a combination of dodging and stacking (dodge the colours and stack the shades).
Then I found a related question on Stack Overflow, https://stackoverflow.com/questions/12715635/ggplot2-bar-plot-with-both-stack-and-dodge, whose answer was essentially negative: https://stackoverflow.com/questions/12592041/plotting-a-stacked-bar-plot/12592235#12592235
A year later, this was still a freshly opened feature request on the ggplot2 GitHub page: https://github.com/tidyverse/ggplot2/issues/6324
The need for combined stacking and dodging is not just a request for an exotic use case: the lack of it invites errors, as outlined below. This section was motivated by the following source: https://rpubs.com/Mentors_Ubiqum/geom_col_1
Take the following dataset which contains the fate of passengers of the famous ocean liner RMS Titanic.
library(tidyverse)
df = as_tibble(Titanic)
df
## # A tibble: 32 × 5
## Class Sex Age Survived n
## <chr> <chr> <chr> <chr> <dbl>
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
## 7 3rd Female Child No 17
## 8 Crew Female Child No 0
## 9 1st Male Adult No 118
## 10 2nd Male Adult No 154
## # ℹ 22 more rows
As you can see, the data frame contains counts stratified by travel class, sex, age group (adult or child) and survival. The variable n says how many people fell into the given category.
So let’s make a simple plot of the number of people who survived and who lost their lives:
df %>% ggplot() +
aes(x = Survived, y = n) + geom_col()
Great, these numbers match up nicely with my childhood memories of the number of survivors.
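The two bar heights can also be cross-checked numerically rather than from memory. A minimal sketch, rebuilding the tibble so that it runs on its own; the name `totals` is introduced here only for illustration:

```r
library(dplyr)
library(tibble)

df <- as_tibble(Titanic)

# Sum the counts within each survival outcome; these are the two bar heights
totals <- df %>%
  group_by(Survived) %>%
  summarize(n = sum(n))

totals
```

This should report 1490 people for No and 711 for Yes, i.e. 2201 people in total.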
So let’s now plot these stratified by travel class:
df %>% ggplot() +
aes(x = Survived, y = n, fill = Class) + geom_col()
So far nothing strange. But this visualization makes it hard to read, e.g., the number of passengers lost in 3rd class. So it would be natural to display the columns next to one another. This functionality is called dodging in ggplot2.
df %>% ggplot() +
aes(x = Survived, y = n, fill = Class) + geom_col(position = position_dodge())
Nice and easy, right?
Well, as you probably guessed, no. There is a mistake here that is not easy to notice, and it is closely related to the need for dodging and stacking together. So let’s double-check what we should see in the plot above: the number of passengers who died and who survived in each class. Let’s calculate this with the tidyverse:
df %>%
group_by(Class, Survived) %>%
summarize(sum(n)) %>%
arrange(Survived) # just for easier organization
## # A tibble: 8 × 3
## # Groups: Class [4]
## Class Survived `sum(n)`
## <chr> <chr> <dbl>
## 1 1st No 122
## 2 2nd No 167
## 3 3rd No 528
## 4 Crew No 673
## 5 1st Yes 203
## 6 2nd Yes 118
## 7 3rd Yes 178
## 8 Crew Yes 212
So 528 people in 3rd class should have perished, but the plot above shows fewer than 400. Likewise, 203 people in 1st class should have survived, yet the plot shows only a bit above 100.
The reason for this discrepancy is the shape of the data: remember that the passengers were stratified by sex and age group too. When we make the columns semi-transparent and give them a black outline, the truth is revealed:
df %>% ggplot() +
aes(x = Survived, y = n, fill = Class) +
geom_col(position = position_dodge(), color = "black", alpha = 0.5)
The individual rows of the data frame were plotted on top of each other! Of course, it’s our fault: we didn’t say how to aggregate the multiple rows that fall into the same plotting category, so ggplot2 did no aggregation at all and drew every row as its own bar (geom_col uses the identity statistic). The risk of misinterpretation has been raised before: https://github.com/tidyverse/ggplot2/issues/4766
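The overplotting can also be quantified without looking at the plot. The sketch below is only an illustration (the name `overlap` and its column names are made up here). For every dodged bar it counts how many rows land at the same position, and compares the height that remains visible (the tallest row) with the height the bar should have (the sum):

```r
library(dplyr)
library(tibble)

df <- as_tibble(Titanic)

# One row per dodged bar (Class x Survived)
overlap <- df %>%
  group_by(Class, Survived) %>%
  summarize(
    rows_overplotted = dplyr::n(), # rows drawn at the same position
    visible_height = max(n),       # the tallest row hides the others
    true_total = sum(n),           # the height the bar should have
    .groups = "drop"
  )

overlap %>% filter(Class == "3rd", Survived == "No")
```

For 3rd class non-survivors this gives 4 overplotted rows, a visible height of 387 and a true total of 528, which matches the "below 400" impression from the dodged plot.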
What we would intuitively want is to dodge by colour but stack everything else. There is a workaround: explicitly tell ggplot2 to aggregate the multiple rows with the sum function:
df %>% ggplot() +
aes(x = Survived, y = n, fill = Class) +
stat_summary(fun = sum, # explicitly tells ggplot2 to sum up the different values
geom = "col", # use the same geom as before
position = position_dodge())
But this does not allow shading the classes by, e.g., their age group. For that you’d need combined dodging and stacking.
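With stock ggplot2, one common approximation is to give up dodging altogether: put the dodged variable on the x axis, facet by the original x variable, and stack the shades within each bar. The sketch below is only such an approximation, not the dodge-stack position itself; the names `by_age` and `p` are introduced here for illustration:

```r
library(dplyr)
library(tibble)
library(ggplot2)

df <- as_tibble(Titanic)

# Collapse Sex so that each (Survived, Class, Age) combination is one row
by_age <- df %>%
  group_by(Survived, Class, Age) %>%
  summarize(n = sum(n), .groups = "drop")

# Class moves to the x axis (taking over the role of dodging),
# Survived becomes the facet, and Age is stacked within each bar
p <- ggplot(by_age) +
  aes(x = Class, y = n, fill = Class, alpha = Age) +
  geom_col(position = "stack") +
  facet_wrap(~ Survived, labeller = label_both) +
  scale_alpha_manual(values = c("Adult" = 1.0, "Child" = 0.5))

p
```

The bar heights are correct this way, but the layout changes: the survival outcomes become separate panels rather than positions on a shared x axis, which is why a true dodge-and-stack position is still preferable.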
Adding this feature takes just a couple of extra lines of code in the implementation of position_dodge. The updated ggplot2 package is available here: https://github.com/szkabel/ggplot2
Until it (hopefully) gets integrated into ggplot2 itself, you need the devtools package to load the modified version of ggplot2:
library(devtools)
load_all("./ggplot2") # Replace the . with the directory you downloaded the package into
Then you can use it as follows to produce dodge-stacking directly within geom_col:
df %>% ggplot() +
aes(x = Survived, y = n, fill = Class) +
geom_col(position = position_dodge(stack.overlap = "by_extent"), color = "black", alpha = 0.5)
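If the call above fails with an "unused argument" error, the stock ggplot2 is most likely still the one in use. A quick check is sketched below. It assumes the modified package exposes `stack.overlap` as a formal argument of `position_dodge()`, as in the call above; the variable names are made up for illustration:

```r
# Assumption: the modified package adds `stack.overlap` to the formal
# arguments of position_dodge(); the stock package does not have it.
dodge_args <- names(formals(ggplot2::position_dodge))
has_stack_overlap <- "stack.overlap" %in% dodge_args

if (has_stack_overlap) {
  message("Modified ggplot2 is loaded: position_dodge() accepts stack.overlap.")
} else {
  message("Stock ggplot2 is loaded: run load_all() on the modified sources first.")
}
```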
Now you can stratify the dodged columns further, e.g. to display the age-group distribution within each class:
df %>%
arrange(Age) %>% # Just for the ordering of the stack
ggplot() +
aes(x = Survived, y = n, fill = Class, alpha = Age, group = Class) +
geom_col(position = position_dodge(stack.overlap = "by_extent")) +
scale_alpha_manual(values = c("Adult" = 1.0, "Child" = 0.5))
You can also add error bars, although this is a bit more complicated and requires calculating the summaries in advance:
df %>%
group_by(Survived, Class, Age) %>%
summarize(nofP = sum(n),
SE = sd(n)/sqrt(n())) %>% # Calculate standard error: sd / sqrt(number of rows)
ggplot() + aes(x = Survived,
y = nofP,
ymin = nofP-SE,
ymax = nofP+SE,
fill = Class, alpha = Age, group = Class) +
geom_col(position = position_dodge(stack.overlap = "by_extent")) +
geom_errorbar(position = position_dodge(stack.overlap = "by_center", width = 0.9), width = 0.2) +
scale_alpha_manual(values = c("Adult" = 1, "Child" = 0.6))
In the hope that it will be useful.