I saw this tweet this morning and was reminded of how cool raincloud plots are. With all the rain we have had in the last little bit, I thought it would be fun to try making a penguin raincloud plot.
That moment when you review a journal submission and you see dynamite plots: rage and joy at the same time! #dataviz #datavisualization #barbarplot #DoBetter pic.twitter.com/ncuxquEoAF
— Cédric Scherer (@CedScherer) March 26, 2021
library(tidyverse)
library(palmerpenguins)
library(ggeasy)
library(harrypotter)
library(gt)
library(cowplot)
source("theme_jen.R") # a script containing my custom ggplot theme
penguins <- penguins
penguin_summary <- penguins %>%
group_by(species) %>%
summarise(mean = mean(body_mass_g, na.rm = TRUE),
sd = sd(body_mass_g, na.rm = TRUE),
n = n(),
stderr = sd/sqrt(n))
gt(penguin_summary)
| species | mean | sd | n | stderr |
|---|---|---|---|---|
| Adelie | 3700.662 | 458.5661 | 152 | 37.19462 |
| Chinstrap | 3733.088 | 384.3351 | 68 | 46.60747 |
| Gentoo | 5076.016 | 504.1162 | 124 | 45.27097 |
penguin_summary %>%
ggplot(aes(x = species, y = mean, fill = species)) +
geom_col() +
scale_fill_hp_d(option = "HermioneGranger") +
geom_errorbar(aes(ymin=mean-stderr, ymax=mean+stderr),
size=.5, # line thickness
width=.2) + # bar width
labs(title = "Mean body mass by penguin species",
x = "Species",
y = "Mean body mass (g)") +
theme_jen() +
scale_y_continuous(expand = c(0, 0)) +
easy_remove_legend()
It is really common in psychology to average the hell out of your data before you plot it. The downside of this practice is that you lose all the interesting information about variability. I think the power of R comes from how easy it is to plot raw data. When you plot scores of individual participants (or even individual participants scores on individual trials) you can see patterns in your data that would have been lost had you just plotted means with standard error bars.
AND it is super helpful for catching data entry errors and/or outliers. Here are a few options for plotting raw data…
geom_point()geom_point() is the simplest way to see each individual data point, but it will plot individual points on top of each other, losing important information.
penguins %>%
ggplot(aes(x = species, y = body_mass_g, colour = species)) +
geom_point() +
scale_colour_hp_d(option = "NewtScamander") +
labs(title = "Mean body mass by penguin species",
x = "Species",
y = "Mean body mass (g)") +
theme_jen() +
scale_y_continuous(expand = c(0, 0), limits = c(2000,6500)) +
easy_remove_legend()
geom_jitter()The geom_jitter() function adds a tiny bit of noise, so that you don’t get overlapping points
penguins %>%
ggplot(aes(x = species, y = body_mass_g, colour = species)) +
geom_jitter(width = 0.3) + # width controls how much noise/spread
scale_colour_hp_d(option = "HermioneGranger") +
labs(title = "Mean body mass by penguin species",
x = "Species",
y = "Mean body mass (g)") +
theme_jen() +
scale_y_continuous(expand = c(0, 0), limits = c(2000,6500)) +
easy_remove_legend()
geom_violin()You can add a geom_violin to your jittered points to see the distribution of the data.
penguins %>%
ggplot(aes(x = species, y = body_mass_g, fill = species)) +
geom_violin(alpha = 0.5) + # alpha controls transparency
geom_jitter(width = 0.3, alpha = 0.5) +
scale_fill_hp_d(option = "HermioneGranger") +
labs(title = "Mean body mass by penguin species",
x = "Species",
y = "Mean body mass (g)") +
theme_jen() +
scale_y_continuous(expand = c(0, 0), limits = c(2000,6500)) +
easy_remove_legend()
Boxplots are also great at visualising variability and outliers. You can layer jittered points on top of a box plot too.
penguins %>%
ggplot(aes(x = species, y = body_mass_g, fill = species)) +
geom_boxplot(alpha = 0.5) + # alpha controls transparency
geom_jitter(width = 0.3, alpha = 0.5) +
scale_fill_hp_d(option = "HermioneGranger") +
labs(title = "Mean body mass by penguin species",
x = "Species",
y = "Mean body mass (g)") +
theme_jen() +
scale_y_continuous(expand = c(0, 0), limits = c(2000,6500)) +
easy_remove_legend()
Raincloud plots are a particularly nice way to represent violins, boxplots AND points without all the overlap. The last time I tried them I was following this post and it involved loading a custom script from someone’s github gist to get the geom_flat_violin() function. Sadly the link in that blogpost no longer works…
But apparently there is a new raincloud plots package!
Lets try it!
raincloudplots packageInstall the package it using the remotes package.
# remotes::install_github('jorvlan/raincloudplots')
library(raincloudplots)
The paper that was published with the original code/tutorial has a V2 update, which I am assuming is the best place to find info about the new package? Seems like the paper now contains tutorials about how to make raincloud via the modifiable scripts AND how to use the new R package (as well as how to make raincloud plots with Python and Matlab- comprehensive).
The descriptions in the paper are mostly about using rainclouds to visualise repeated-measures data, which is not really what I want for the penguins.
Maybe the vignette is better…
https://github.com/jorvlan/raincloudplots
The first thing I always do when trying a new package is to copy the code from the vignette and make sure I can make it run. This example uses the iris data. Apparently the data_1x1() function makes the data into the format you need.
Here I am using sepal.length and width (rather than two different segments of sepal length data) to test whether it will work with different columns.
sepalLW <- data_1x1(
array_1 = iris$Sepal.Length[1:50],
array_2 = iris$Sepal.Width[1:50],
jit_distance = .09,
jit_seed = 321)
And then the raincloud_1x1() function makes the plot.
raincloud_sepalLW <- raincloud_1x1(
data = sepalLW,
colors = (c('dodgerblue','darkorange')),
fills = (c('dodgerblue','darkorange')),
size = 1,
alpha = .6,
ort = 'h') +
scale_x_continuous(breaks=c(1,2), labels=c("width", "length"), limits=c(0, 3)) +
xlab("sepal part") +
ylab("measurement") +
theme_classic()
raincloud_sepalLW
The arrays seem to need indexed values (they probably don’t but that it how it is presented in the vignette, so lets try that to begin with). First, sort the penguins by species so can refer to them using indexing. Then use the data_1x1 function to set up the data…
sorted_penguins <- penguins %>%
arrange(species)
species_bodymass <- data_1x1(
array_1 = sorted_penguins$body_mass_g[153:220], # chinstrap
array_2 = sorted_penguins$body_mass_g[221:344], # gentoo
jit_distance = .09,
jit_seed = 321)
head(species_bodymass)
## y_axis x_axis id jit
## 1 3500 1 1 1.0820609
## 2 3900 1 2 1.0787114
## 3 3650 1 3 0.9528797
## 4 3525 1 4 0.9559133
## 5 3725 1 5 0.9802922
## 6 3950 1 6 0.9714124
Make the plot …
raincloud_penguins_H <- raincloud_1x1(
data = species_bodymass,
colors = (c('dodgerblue','darkorange')),
fills = (c('dodgerblue','darkorange')),
size = 1,
alpha = .6,
ort = 'h') +
scale_x_continuous(breaks=c(1,2), labels=c("chinstrap", "gentoo"), limits=c(0, 3)) +
xlab("Species") +
ylab("Body Mass (g)") +
theme_classic()
raincloud_penguins_H
In the example, ort = was set to ‘h’, but you can turn the plot vertical with ort = ‘v’
raincloud_penguins_V <- raincloud_1x1(
data = species_bodymass,
colors = (c('dodgerblue','darkorange')),
fills = (c('dodgerblue','darkorange')),
size = 1,
alpha = .6,
ort = 'v') +
scale_x_continuous(breaks=c(1,2), labels=c("chinstrap", "gentoo"), limits=c(0, 3)) +
xlab("Species") +
ylab("Body Mass (g)") +
theme_classic()
raincloud_penguins_V
The package will connect points for repeated measures designs. This is less relevant in the penguin data, but I guess we could try it out by plotting bill length and depth for each penguin?
Copy the code from the vignette and sub in penguin variables.
species_bills <- data_2x2(
array_1 = sorted_penguins$bill_length_mm[153:220],
array_2 = sorted_penguins$bill_depth_mm[153:220],
array_3 = sorted_penguins$bill_length_mm[221:344],
array_4 = sorted_penguins$bill_depth_mm[221:344],
labels = (c('chinstrap','gentoo')),
jit_distance = .09,
jit_seed = 321,
spread_x_ticks = TRUE)
Plot 2x2_repmes with connected points.
speciesbills <- raincloud_2x2_repmes(
data = species_bills,
colors = (c('dodgerblue', 'darkorange', 'dodgerblue', 'darkorange')),
fills = (c('dodgerblue', 'darkorange', 'dodgerblue', 'darkorange')),
size = 1,
alpha = .6,
spread_x_ticks = FALSE) +
scale_x_continuous(breaks=c(1,2), labels=c("length", "width"), limits=c(0, 3)) +
xlab("Bill part") +
ylab("Measurement") +
theme_classic()
speciesbills
agghh… it seems to have lost the species info…bummer. I don’t have the energy to work out what is going on at the moment.
I think what I don’t like about this package is that there is some magic happening in the formatting of the data which isn’t very transparent. When I look at the species_bills dataframe, info about species is still there (as a group variable) but I don’t know how to get it into the plot.
head(species_bills)
## y_axis x_axis id group jit
## 1 46.5 1 1 chinstrap 1.0820609
## 2 50.0 1 2 chinstrap 1.0787114
## 3 51.3 1 3 chinstrap 0.9528797
## 4 45.4 1 4 chinstrap 0.9559133
## 5 52.7 1 5 chinstrap 0.9802922
## 6 45.2 1 6 chinstrap 0.9714124
I seem to remember that the old script version constructed the plot with layers within ggplot, which I understood how to customise much better. Lets try that instead.
The link to the github script in the blog post doesn’t work anymore. Maybe I can download the script and just source my own copy of it?
I copied the code from here into an R script and saved it in my project folder. Then I can run the script using source().
source("R_rainclouds.R")
The R_rainclouds.R script gives you access to geom_flat_violin(). Make a plot that layers geom_flat_violin + geom_point + geom_boxplot.
penguins %>% ggplot(aes(x=species,y=body_mass_g, fill = species)) +
geom_flat_violin(position = position_nudge(x = .2, y = 0),adjust =2, alpha = 0.5) +
geom_point(position = position_jitter(width = .15), size = .8) +
geom_boxplot(aes(x = species, y = body_mass_g, fill = species),outlier.shape = NA, alpha = .5, width = .1, colour = "black")+
theme_jen() +
labs(title = "Raincloud plot of body mass by species", x = 'Species', y = 'Body mass (g)') +
easy_remove_legend()
BEAUTIFUL!
I was expecting the raincloudplot package to make constructing raincloud plots EASIER… not more difficult. I was also expecting to like the aesthesics of the package version better. It is probably just that I am more familiar with the ggplot idea of adding layers to a plot and customising the look. I might get used to the code style of the raincloudplot package if i used it a lot, but I like that with ggplot there is nothing hidden under the hood. You add each layer, run the code and see the plot build. Makes troubleshooting much easier. The package seems to do a lot of set up at the level of the data, which isn’t so user friendly.
if you want to try raincloud plots, just copy the R_raincloud.R script from here and save it in your project. Run it using
source().
NOTE I played with the r setup chunk in this post and worked out how to make the warnings and messages not show for all chunks! woot!
The default RMarkdown set up chunk looks like this …
The default set up chunk is set to not display in your knitted document (include = FALSE) and the only setting that is specified is echo = TRUE- this makes both the code and the output display in your knitted document.
I found this post by Karl Broman that covers Global chunk options useful and now my new one for this post looks like this.
Now my setup chunk is set to control the width and height of the figures making them all consistently 8 x 6, the fig.path argument will auto-magically save each of the figures as a png in a folder called Figs each time you knit the document.
You can also get those png files named useful things by changing the label on on the chunk that generates the plot. After the {r} just put a short label and it will call the plot that when it saves.
echo is still TRUE, which means both my code and output will display but now warning and messages are FALSE, so R is no longer going to display the warnings about packages or missing values from my plots in my knitted document. Sweet!
You will notice that I’ve added theme_jen() to my plots. You can write your own custom theme by writing a function like mine below. Mine is pretty similar theme_classic() with bigger font and more white space. I keep it in a script file that I can source at the top of my file.
theme_jen <- function () {
# define font up front
font <- "Helvetica"
# this theme uses theme_bw as the base
theme_bw() %+replace%
theme(
#get rid of grid lines/borders
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
# add white space top, right, bottom, left
plot.margin = unit(c(1, 1, 1, 1), "cm"),
# custom axis title/text/lines
axis.title = element_text(
family = font,
size = 14),
axis.text = element_text(
family = font,
size = 12),
# margin pulls text away from axis
axis.text.x = element_text(
margin=margin(5, b = 10)),
# black lines
axis.line = element_line(colour = "black", size = rel(1)),
# custom plot titles, subtitles, captions
plot.title = element_text(
family = font,
size = 18,
hjust = -0.1,
vjust = 4),
# custom plot subtitles
plot.subtitle = element_text(
family = font,
size = 14,
hjust = 0,
vjust = 3),
# custom captions
plot.caption = element_text(
family = font,
size = 10,
hjust = 1,
vjust = 2),
# custom legend
legend.title = element_text(
family = font,
size = 10,
hjust = 0),
legend.text = element_text(
family = font,
size = 8,
hjust = 0),
#no background on legend
legend.key = element_blank(),
# white background on plot
strip.background = element_rect(fill = "white",
colour = "black",
size = rel(2)), complete = TRUE)
}