Khatri Lab Design Principles

Author

Clove Taylor

This document is designed to be a reference for common language in figure design in R, mostly using ggplot and related packages.

Getting started

All of the examples shown use R 4.2.2 and rely heavily on ggplot2 and ggpubr.

To make everything easier, this code snippet automatically applies basic theming that improves the look of every plot. All of its options can be overridden using theme(...), but avoid adding any theme_* preset to a plot if you want these defaults to apply:

library(ggplot2)
library(ggpubr)

set_theme_defaults <- function() {
  theme_set(theme_pubr())
  theme_update(
    axis.title.x = element_text(margin = margin(t = 8)),
    axis.title.y = element_text(margin = margin(r = 8))
  )
}

Calling set_theme_defaults() once overrides the default theme for the rest of your session. Now you’re ready to go!

🎨 Colors

This section gives suggestions for how to use color effectively, as well as quick resources for color scales and palettes to use in your own figures.

Discrete scales

Discrete colors should be used for factors that are not ordered. Ordered factors that are not continuous have their own section (see below).

Good rules of thumb

Keep color meanings consistent across panels/figures
- Once a color is defined for a particular group, keep that color the same each time the group is shown.
- Don’t use the same color for a different group or meaning in the same figure.
Use color to guide the reader to the takeaway
- Emphasize treatment groups/new data with color.
- Use desaturated colors for control groups/null comparisons.
Take advantage of color psychology
- If a color tends to evoke a meaning for the reader (e.g., blue = cold = low), then make use of that psychology rather than reversing it.
- Keep the implications of color consistent; once a color is used to represent something “bad”, that color and similar colors should be “bad” everywhere.
Follow standard accessibility practice, such as choosing palettes that remain distinguishable under color-vision deficiency.
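
One way to apply the consistency and emphasis rules above is to define a single named color vector and reuse it in every panel. The following is a minimal sketch: the group names are hypothetical, and the two accent colors are simply borrowed from the Stanford list later in this section.

```r
library(ggplot2)

# Hypothetical groups: a desaturated gray for the control,
# saturated colors for the groups we want to emphasize
group_colors <- c(
  "Control"     = "#B0B0B0",
  "Treatment A" = "#E04F39",
  "Treatment B" = "#007C92"
)

plot_df <- data.frame(
  group = factor(rep(names(group_colors), each = 30),
                 levels = names(group_colors)),
  value = rnorm(90, mean = rep(c(0, 1, 2), each = 30))
)

# A named vector ties each color to a group name, so the mapping
# stays the same even when a panel shows only some of the groups
p_color <- ggplot(plot_df, aes(x = group, y = value, fill = group)) +
  geom_boxplot() +
  scale_fill_manual(values = group_colors, guide = "none") +
  xlab("Group") + ylab("Value")

p_color
```

Keeping the vector in one shared script means a group's color only ever needs to be changed in one place.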

BioRender

ggplot palette developed from BioRender default colors:

Nature

ggplot palette developed from Nature Reviews suggested colors:

Stanford colors:

display_all(list(
  "Stanford" = c("#8C1515", "#175E54", "#279989", "#8F993E", "#6FA287"),
  "Stanford" = c("#4298B5", "#007C92", "#E98300", "#E04F39", "#FEDD5C"),
  "Stanford" = c("#620059", "#651C32", "#5D4B3C", "#7F7776", "#DAD7CB")
))

To quickly get a color in between two colors:

blender <- circlize::colorRamp2(breaks=c(0, 1), colors=c("red", "blue"))
blender(0.5)

[1] "#CA0088FF"

Continuous scales

The scico package has a number of useful (especially diverging) scales, as well as some awful ones:

library(scico)

scico::scico_palette_show()

Another good option is viridis, especially for binned data:

library(viridis)

plot_df = data.frame(frac = rbinom(100, 12, 0.3))

plot_viridis <- function(pal) {
  ggplot(plot_df,
       aes(x = sprintf("Option=%s", pal), fill = factor(frac))) +
  geom_bar(position="fill") + 
  scale_fill_viridis_d(guide="none", option=pal, direction=-1) + 
  scale_y_continuous(expand=expansion(mult=c(0, NULL))) +
  xlab(NULL) + ylab("Proportion")
}

ggarrange(
  plot_viridis("E"), 
  plot_viridis("A"),
  plot_viridis("D"),
  ncol=3, align="h"
)

Another simple choice is RColorBrewer:

These palettes can be used in ggplot with scale_*_brewer(palette = "...").
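
As a rough illustration of the Brewer scales, the sketch below uses scale_fill_brewer() for a discrete variable and scale_fill_distiller() for a continuous one. The data are simulated, and the palette names ("Set2", "RdBu") are just example choices from RColorBrewer.

```r
library(ggplot2)

plot_df <- data.frame(
  x = rnorm(300),
  y = rnorm(300),
  group = sample(c("A", "B", "C"), 300, replace = TRUE)
)

# Discrete data: a qualitative Brewer palette via scale_*_brewer()
p_discrete <- ggplot(plot_df, aes(x = x, y = y, fill = group)) +
  geom_point(pch = 21, size = 2) +
  scale_fill_brewer(name = "Group", palette = "Set2")

# Continuous data: scale_*_distiller() interpolates a Brewer palette
p_continuous <- ggplot(plot_df, aes(x = x, y = y, fill = x)) +
  geom_point(pch = 21, size = 2) +
  scale_fill_distiller(name = "Value", palette = "RdBu")

p_discrete
p_continuous
```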

As a fallback, continuous scales can also be generated from discrete colors.

Using scale_fill_gradient and scale_fill_gradient2:

library(dplyr)  # provides %>% and mutate()

plot_df = data.frame(
  y = rnorm(1000),
  x = rbeta(1000, 1, 2)
) %>% mutate(z = x^2)

ggarrange(
  ggplot(plot_df, aes(x = x, y = y, fill = z)) +
  geom_point(pch = 21, size = 2) + 
  scale_fill_gradient(
    name="Value\n",
    low = natrev("blue", 4),
    high = natrev("yellow", 4),
    limits=c(0, 0.5), oob=scales::squish
  ) +
  theme(aspect.ratio = 1),
  ggplot(data.frame(
    y = rnorm(1000),
    x = rnorm(1000)
  ) %>% mutate(z = x), aes(x = x, y = y, fill = z)) +
  geom_point(pch = 21, size = 2) + 
  scale_fill_gradient2(
    name="Value\n",
    low = natrev("blue", 4),
    mid = "white",
    high = natrev("yellow", 4),
    limits=c(-1, 1), oob=scales::squish
  ) +
  theme(aspect.ratio = 1)
)

📊 Figures

This section will describe certain practical considerations when making figures, divided up by single- or multi-panel.

Single-panel

Single-panel figure guide, separated by figure type.

Bar/column plot

When to use it:

Binary outcome data separated by group,
Biological replicates of a continuous measurement by group.

Best practices:

Always add error bars, such as those generated by binom::__ (see below).
Refit the y scale such that there is no gap between the x-axis and the data.

Examples:

Bar plot of frequency data with error bars:

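Note the use of expansion(mult=c(0, NULL)) to align the bars with the bottom axis. Because c(0, NULL) is the same as 0, this removes the padding at both ends of the axis; use expansion(mult=c(0, 0.05)) to keep some space above the bars.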
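
A minimal sketch of such a bar plot follows. The counts are made up for illustration, and the intervals come from base R's binom.test() (exact Clopper-Pearson intervals) as a stand-in for whichever binom:: function is preferred.

```r
library(ggplot2)

# Made-up binary outcomes: x responders out of n per group
counts <- data.frame(
  group = c("Control", "Low dose", "High dose"),
  x = c(12, 21, 33),
  n = c(50, 50, 50)
)
counts$group <- factor(counts$group, levels = counts$group)

# Exact 95% binomial confidence interval for each group (one row per group)
ci <- t(mapply(function(x, n) binom.test(x, n)$conf.int, counts$x, counts$n))
counts$frac  <- counts$x / counts$n
counts$lower <- ci[, 1]
counts$upper <- ci[, 2]

p_bar <- ggplot(counts, aes(x = group, y = frac)) +
  geom_col(fill = "gray70", color = "black") +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) +
  # No padding below the bars, 5% padding above the error bars
  scale_y_continuous(limits = c(0, 1),
                     expand = expansion(mult = c(0, 0.05))) +
  xlab("Group") + ylab("Fraction responding")

p_bar
```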

Box/violin plot

When to use it:

Continuous data separated into discrete groups (ordered or unordered).
Single observations per individual. For per-individual replicates in the same group, consider a column plot.
For timeseries data, consider other methods (scatterplot + line, for example)

Best practices:

When possible, use a combination like the following:
1. Boxplot on top of violin plot,
2. Jitterplot on top of boxplot.
- Think about boxplots as giving a quick visual reference, while violin/jitter plots provide an extra layer for data transparency.

Examples:

Box + violin plot:

geom_violin(scale="width", alpha = 0.2, color = "#ffffff00") + geom_boxplot(width = 0.4)

library(ggplot2)

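# Simulate three groups of 100 observations with means 2, 3, and 4
plot.df = data.frame(
  y = rnorm(300, mean = rep(2:4, each = 100)),
  g = rep(1:3, each = 100)
)

# Map the data frame columns (g, y) rather than vectors in the global environment
ggplot(plot.df,
       aes(x = factor(g), fill = factor(g), y = y)) +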
  scale_fill_biorender(guide="none") +
  
  #- \/ \/ -#
  geom_violin(scale="width", alpha = 0.2, color = "#ffffff00") +
  geom_boxplot(width = 0.4) +
  #- /\ /\ -#
  
  xlab("Group") + ylab("Value")

Heatmap/tile/dot plot

When to use it:

Comparing many values at once across groups (preferably normally distributed ones)

Best practices:

Be very careful about how data are scaled to produce a good color scale.
Consider adding labels to each cell annotating the value itself.
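For dot plots, remember that a dot's area scales with the square of its radius, so map values to area rather than radius; ggplot2's scale_size() and scale_size_area() do this, whereas scale_radius() does not.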
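
To illustrate the point about dot area, here is a small sketch of a dot plot using scale_size_area(). The gene and cluster names and both value columns are simulated placeholders.

```r
library(ggplot2)

# Hypothetical gene-by-cluster summary table
dot_df <- expand.grid(gene = paste0("G", 1:4), cluster = paste0("C", 1:3))
dot_df$pct  <- runif(nrow(dot_df), 0, 100)  # e.g., percent of cells expressing
dot_df$expr <- rnorm(nrow(dot_df))          # e.g., scaled mean expression

p_dot <- ggplot(dot_df, aes(x = cluster, y = gene, size = pct, fill = expr)) +
  geom_point(pch = 21) +
  # Dot area is proportional to the value, and a value of 0 maps to size 0
  scale_size_area(name = "Percent", max_size = 8, limits = c(0, 100)) +
  scale_fill_viridis_c(name = "Expression")

p_dot
```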

Examples:

Tile plot of a confusion matrix:

library(ggplot2)

plot_df = data.frame(
  group1 = factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C"), levels=c("A", "B", "C"), ordered=T),
  group2 = factor(c("A", "B", "C", "A", "B", "C", "A", "B", "C"), levels=c("C", "B", "A"), ordered=T),
  value = rbinom(9, 10, 0.5)
)

ggplot(plot_df, aes(x = group1, y = group2, fill = value)) +
  geom_tile(color = "black") + theme_minimal(accent="#ffffff00") + 
  xlab("Observed") + ylab("Expected") +
  geom_text(aes(label = value)) +
  scale_fill_viridis_c(name = "Value", begin = 0.4) +
  theme(aspect.ratio = 1, legend.position = "right",
        axis.ticks = element_blank())

Multi-panel

For multi-panel figures, heavy use of ggpubr::ggarrange is recommended for these reasons:

No need for manual manipulation of alignment,
Ability to make legends/scales common across plots,
Capable of more complex grid layouts.

Common scales

Using the common.legend=T flag is useful, but it simply copies the legend from the first plot. If the plots have different limits, the shared legend will misrepresent the values in the other panels. Good practice is to set limits explicitly in each plot’s color or fill scale like this:

scale_*_continuous(..., limits=c(-1, 1), oob=scales::squish)

so that the common legend has the same (correct) limits.
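
As a sketch of this pattern, the example below builds two panels whose values have different ranges and gives both the same explicit limits. The helper names shared_fill() and make_panel() are hypothetical, introduced only for this illustration.

```r
library(ggplot2)
library(ggpubr)

df_a <- data.frame(x = rnorm(200), y = rnorm(200), value = runif(200, -1, 1))
df_b <- data.frame(x = rnorm(200), y = rnorm(200), value = runif(200, -3, 3))

# One definition of the fill scale, so every panel gets identical limits;
# values outside [-1, 1] are squished to the nearest limit
shared_fill <- function() {
  scale_fill_viridis_c(name = "Value", limits = c(-1, 1), oob = scales::squish)
}

make_panel <- function(df) {
  ggplot(df, aes(x = x, y = y, fill = value)) +
    geom_point(pch = 21) +
    shared_fill()
}

p_common <- ggarrange(
  make_panel(df_a), make_panel(df_b),
  common.legend = TRUE, legend = "right", align = "h"
)

p_common
```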

Examples:

For horizontal layouts, use align="h" and set panel widths with widths=c(x1,x2):

library(ggplot2)
library(ggpubr)

plot_df = data.frame(
  x = rnorm(1000, 5, 2),
  y = rnorm(1000, 5, 1),
  value = rnorm(1000, 5, 1)
)

ggarrange(
  ggplot(plot_df, aes(x = x, y = y, fill = value)) +
    geom_point(pch=21) + 
    scale_fill_viridis_c(name = "Value"),
  ggplot(plot_df, aes(x = x, y = y, fill = value)) +
    geom_point(pch=21) + 
    scale_fill_viridis_c(name = "Value"),
  common.legend = T, align="h", widths=c(3, 2)
)

For more complex layouts, ggarrange calls can be nested to build non-square grids:

library(ggplot2)
library(ggpubr)

plot_df = data.frame(
  x = rnorm(1000, 5, 2),
  y = rnorm(1000, 5, 1),
  value = rnorm(1000, 5, 1)
)

important_plot = ggplot(plot_df, aes(x = x, y = y, fill = value)) +
      geom_point(pch=21) + 
      scale_fill_viridis_c(name = "Value")

ggarrange(
  ggarrange(
    important_plot,
    important_plot,
    common.legend = T, align="h", widths=c(3, 2)
  ),
  important_plot,
  ncol=1, common.legend=T, legend="none"
)

Axes and scales

Properly scaling continuous data is important—leaving plot defaults as is can lead to misleading interpretations and unnecessary complications. Here are some tips for effective scaling:

X and Y

When framing data, consider the effect of outliers and whether the extremes of the data should be filtered out if they distort interpretation. One way to do so is:

ggplot(
  plot_data %>% dplyr::filter(
    x > quantile(x, 0.025) & x < quantile(x, 0.975),
    ...
  )
)

Filtering out just the top 1-2.5% of values can dramatically alter the framing of the remaining 97.5-99%. Just be careful to state clearly that some data are masked.
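
A runnable version of this filter might look like the sketch below, which trims both tails of a simulated right-tailed variable and reports the masked points in the caption. The data and caption wording are illustrative assumptions.

```r
library(ggplot2)
library(dplyr)

plot_data <- data.frame(x = rexp(1000), y = rnorm(1000))  # right-tailed x

# Keep the central 95% of x
trimmed <- plot_data %>%
  filter(x > quantile(x, 0.025), x < quantile(x, 0.975))

n_removed <- nrow(plot_data) - nrow(trimmed)

p_trim <- ggplot(trimmed, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  # State the masking on the figure itself
  labs(caption = sprintf(
    "%d of %d points outside the 2.5-97.5%% quantiles of x are not shown",
    n_removed, nrow(plot_data)
  ))

p_trim
```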

Another approach to presenting more systematically tailed data is to transform the axes (or data) using a log transform such as scale_*_log10().
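
For example, assuming strictly positive, log-normally distributed values (simulated here), a sketch of the axis transform could be:

```r
library(ggplot2)

tail_df <- data.frame(
  x = rlnorm(500, meanlog = 2, sdlog = 1.5),
  y = rlnorm(500, meanlog = 0, sdlog = 1)
)

p_log <- ggplot(tail_df, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  # Transform the axes rather than the data: tick labels stay in original units
  scale_x_log10() +
  scale_y_log10() +
  xlab("x (log scale)") + ylab("y (log scale)")

p_log
```

Note that a log scale cannot display zero or negative values, so this only suits strictly positive data.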

Aesthetics

Scaling colors and fills correctly is critical to proper interpretation of data. Here are some tips for different types of aesthetics:

For diverging data, explicitly set the midpoint in the scale_fill_* or scale_color_* call. Otherwise it may be set to an arbitrary value, which can be misleading.
For all continuous data, consider whether outliers might make the range of colors misleading. One useful approach is to set limits explicitly and cap outliers at those limits using:
```
scale_fill_continuous(..., limits=c(0, 2), oob=scales::squish)
```
Caution: Setting limits manually can easily be misleading as well. Consider choosing limits based on upper and lower quantiles, or values that are defensible in writing.
Choose a color scale that matches the extent of variability in the data. Ideally, the same difference in hue or luminance should correspond to the same change in effect size. Existing continuous scales have often already been optimized for this, which is a good reason to use them.
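
The tips above can be combined in one scale. The sketch below fixes the midpoint at zero and derives symmetric limits from a quantile of the data; the 97.5th percentile and the two endpoint colors are arbitrary choices for illustration.

```r
library(ggplot2)

div_df <- data.frame(x = rnorm(500), y = rnorm(500))
div_df$z <- div_df$x + rnorm(500, sd = 0.3)  # diverging values centered near 0

# Symmetric limits taken from the 97.5th percentile of |z|
lim <- unname(quantile(abs(div_df$z), 0.975))

p_div <- ggplot(div_df, aes(x = x, y = y, fill = z)) +
  geom_point(pch = 21, size = 2) +
  scale_fill_gradient2(
    name = "Value",
    low = "#4298B5", mid = "white", high = "#E04F39",
    midpoint = 0,                                # white means exactly zero
    limits = c(-lim, lim), oob = scales::squish  # cap outliers at the limits
  ) +
  theme(aspect.ratio = 1)

p_div
```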

Significance layers

To add significance annotations to a plot, there are two main options:

Automatic p-values generated using ggpubr::geom_signif(),
Manual p-values annotated using geom_bracket or geom_signif.

Except in cases where a comparison is straightforward and does not need adjusting, manual annotation of P-values is recommended. geom_signif will default to unadjusted Wilcoxon P-values unless defaults are changed.
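
As one possible way to annotate manually, the sketch below computes Holm-adjusted pairwise Wilcoxon P-values in base R and draws them with ggpubr::geom_bracket(). The data are simulated, and the bracket heights and label format are arbitrary choices.

```r
library(ggplot2)
library(ggpubr)

sig_df <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 40)),
  value = rnorm(120, mean = rep(c(0, 0.2, 1.5), each = 40))
)

# Pairwise Wilcoxon tests with Holm adjustment (base R)
pw <- pairwise.wilcox.test(sig_df$value, sig_df$group,
                           p.adjust.method = "holm")
p_adj <- pw$p.value  # rows: B, C; columns: A, B

# One row per comparison to annotate
brackets <- data.frame(
  xmin = c("A", "B", "A"),
  xmax = c("B", "C", "C"),
  p    = c(p_adj["B", "A"], p_adj["C", "B"], p_adj["C", "A"])
)
brackets$label <- sprintf("p = %.2g", brackets$p)
brackets$y <- max(sig_df$value) + c(0.3, 0.8, 1.3)  # stagger bracket heights

p_sig <- ggplot(sig_df, aes(x = group, y = value)) +
  geom_boxplot() +
  geom_bracket(
    xmin = brackets$xmin, xmax = brackets$xmax,
    y.position = brackets$y, label = brackets$label
  ) +
  xlab("Group") + ylab("Value")

p_sig
```

Computing the P-values separately keeps the test and the adjustment method explicit, so both can be reported in the figure legend.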

🎛 Themes

This section aims to standardize the finishing touches & exporting of figures for publication readiness.

If you use the snippet in Getting started, you’re already 90% there!

Figures generated in ggplot2 should use a very simple theme, following these guidelines:

No gridlines,
Large-enough text,
Thin but legible lines,
etc.

The quickest shortcut to meet these requirements is to use ggpubr::theme_pubr():

library(ggplot2)
library(ggpubr)

plot_df = data.frame(
  x = rnorm(100, 5, 2),
  y = rnorm(100, 5, 1)
)
ggarrange(
  ggplot(plot_df, aes(x = x, y = y)) + 
    geom_point() + theme_bw() + 
    theme(aspect.ratio = 1) +
    ggtitle("BAD (gridlines)"),
  ggplot(plot_df, aes(x = x, y = y)) + 
    geom_point() + theme_pubr() + 
    theme(aspect.ratio = 1) +
    ggtitle("GOOD (no gridlines)"),
  ncol=2, align="h"
)

Following the less-is-more approach, certain figures (like UMAPs) can even use theme_void or similar (featuring an annotation layer created by Claude):

library(ggplot2)

plot_df = data.frame(
  x = c(rnorm(500, 5, 0.4), rnorm(1000, 3, 0.3)),
  y = c(rnorm(500, 5, 0.4), rnorm(1000, 3, 0.3)),
  values = c(rnorm(500, 5, 0.4), rnorm(1000, 3, 0.3))
)

add_umap_axes <- function(plot, length = 0.2, label_offset = 0.05, 
                           xlab = "UMAP1", ylab = "UMAP2", 
                           color = "black", text_size = 4) {
  
  # Extract plot ranges to position the guide in data coordinates
  built    <- ggplot_build(plot)
  x_range  <- built$layout$panel_params[[1]]$x.range
  y_range  <- built$layout$panel_params[[1]]$y.range
  
  x_span   <- diff(x_range)
  y_span   <- diff(y_range)
  
  # Anchor point (just outside the bottom-left corner of the panel range)
  ax <- x_range[1] + x_span * -0.05
  ay <- y_range[1] + y_span * -0.05
  
  # End points
  ax_end <- ax + x_span * length
  ay_end <- ay + y_span * length
  
  plot +
    # Horizontal arrow (UMAP1)
    annotate("segment",
             x = ax, xend = ax_end, y = ay, yend = ay,
             arrow = arrow(length = unit(0.15, "cm"), type = "closed"),
             color = color, linewidth = 0.5) +
    # Vertical arrow (UMAP2)
    annotate("segment",
             x = ax, xend = ax, y = ay, yend = ay_end,
             arrow = arrow(length = unit(0.15, "cm"), type = "closed"),
             color = color, linewidth = 0.5) +
    # Labels
    annotate("text",
             x = ax + x_span * (length / 2), 
             y = ay - y_span * label_offset,
             label = xlab, hjust = 0.5, vjust = 0.5,
             size = text_size, color = color) +
    annotate("text",
             x = ax - x_span * label_offset, 
             y = ay + y_span * (length / 2),
             label = ylab, hjust = 0.5, vjust = 0.5, angle = 90,
             size = text_size, color = color)
}

p <- ggplot(plot_df, aes(x = x, y = y, fill = values)) + 
    geom_point(pch=21) + theme_void() + 
    scale_fill_viridis_c(name = "Value") +
    theme(aspect.ratio = 1)
add_umap_axes(p)

📤 Exporting

When exporting figure outputs, a good approach is to keep a standardized pipeline so that each figure has a similar base size and is vectorized properly. Here is a quick script that optimizes this step:

save_plot <- function(base = 5, saveto = NULL) {
  
  last_plt <- ggplot2::last_plot()
  
  # cowplot's base_asp is width / height, whereas theme(aspect.ratio) is
  # height / width, so invert the theme value when it is set
  ratio <- if (!is.null(last_plt$theme$aspect.ratio)) {
    1 / last_plt$theme$aspect.ratio
  } else {
    dev.size()[[1]] / dev.size()[[2]]
  }
  
  # Path of the active RStudio document ("" if it has not been saved)
  doc_path <- rstudioapi::getActiveDocumentContext()$path
  
  if (is.null(saveto)) {
    saveto <- if (doc_path == "") getwd() else dirname(doc_path)
  }
  
  # Short name built from the first 10 characters of the x and y labels
  x_lab <- if (is.null(last_plt$labels$x)) {
    rlang::as_label(last_plt$mapping$x)
  } else {
    last_plt$labels$x
  }
  gnames <- paste0(substr(x_lab, 1, 10), "_by_", substr(last_plt$labels$y, 1, 10))
  gnames <- gsub(" ", "_", gnames)
  
  # Timestamp without spaces or colons, so the filename is valid on every OS
  stamp <- format(Sys.time(), "%Y%m%d_%H%M%S")
  fname <- file.path(saveto, paste0(basename(doc_path), "_", gnames, "_", stamp, ".pdf"))
  
  cowplot::save_plot(
    fname,
    last_plt,
    base_height = base,
    base_asp = ratio
  )
  
}

This function does the following:

Grabs the previously generated ggplot,
Extracts the aspect ratio either from the plot object (if set) or the size of the Plots window,
Sets a common size from the base argument (most important step!),
Saves it as PDF using an auto-generated filename based on the execution context.

This function helps significantly to unify the spacing and formatting of every plot you generate. For larger plots, increase base.

Example:

ggplot() + ...
save_plot(base = 5)

This exports the most recently generated ggplot to the script’s directory as a PDF.

Vectorization

An important consideration in exporting plots is vector formatting. Exports from the above step will be vectorized by default.

When working with large figures with many elements (more than a few thousand—think UMAPs), it is often best to rasterize the large layer while maintaining vector elements for the rest of the plot.
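
A minimal sketch of this approach is shown below, assuming the ggrastr package is installed. It wraps only the point layer in ggrastr::rasterise(), so the axes, text, and legend stay vector; the dpi value is an arbitrary example, and the code falls back to a plain vector layer if ggrastr is unavailable.

```r
library(ggplot2)

big_df <- data.frame(x = rnorm(20000), y = rnorm(20000), value = rnorm(20000))

# Start from an ordinary (vector) point layer
point_layer <- geom_point(size = 0.3)

# Assumption: ggrastr is installed; if not, the vector layer is kept as-is
if (requireNamespace("ggrastr", quietly = TRUE)) {
  # rasterise() wraps a single layer so only that layer is drawn as a bitmap
  point_layer <- ggrastr::rasterise(point_layer, dpi = 300)
}

p_rast <- ggplot(big_df, aes(x = x, y = y, color = value)) +
  point_layer +
  scale_color_viridis_c(name = "Value") +
  theme(aspect.ratio = 1)

p_rast
```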

Installing ggrastr

Some dependencies run into compilation errors when installing ggrastr — one workaround is to install using prebuilt binaries from CRAN.

✏️ External tools

For finishing touches or larger composition, using external software can be helpful. Some of the most useful tools are:

Adobe Illustrator, for generating multi-panel figures or editing parts of vectorized figure outputs;
BioRender, good for generating graphical abstracts with vector clip art;
Gemini — ask Andrew how to successfully prompt engineer graphical abstracts.