Introduction

Statistics gives us numbers. Visualization makes them mean something. A great chart can communicate more in two seconds than a regression table can in two minutes. A bad chart can mislead, distract, or simply waste your reader’s time.

Today’s lecture is about the craft of data visualization in R using ggplot2, the grammar of graphics. We will draw on the design philosophies of three influential practitioners:

Kieran Healy (Duke), author of Data Visualization: A Practical Introduction
Yan Holtz, creator of The R Graph Gallery and From Data to Viz
Cédric Scherer, an independent data visualization designer who has popularized the modern editorial ggplot style

Today’s roadmap:

Why visualization is a methods topic

The grammar of graphics

Healy’s principles: honesty, clarity, comparison

Holtz’s chart-type decision tree

Scherer’s editorial approach: from default to publication

A worked transformation: same data, four iterations

Color, type, and accessibility

Modern techniques: highlighting, animation, interactivity, distributions, heatmaps, dumbbell charts, and more

Visualizing regression models: coefficient plots, predicted probabilities, diagnostics

The #30DayChartChallenge

Part 1: Why Visualization Is a Methods Topic

We have spent the semester learning to estimate things: means, slopes, odds ratios, hazard ratios. Why end the course with visualization?

Three reasons.

1. Visualization is part of the analysis, not a decoration. A scatterplot reveals a non-linear relationship that a correlation coefficient hides. A residual plot exposes a violated assumption that an R² celebrates. A QQ plot finds a heavy tail that a t-test ignores. Every analysis you do should begin and end with looking at the data.

2. Visualization is how findings reach decision-makers. A clinician, a journalist, a policymaker, or a community partner is not going to read your beta coefficients. They will look at your figure. The figure is the one part of the paper that everyone reads.

3. Visualization is a discipline with its own theory. Bad charts are not just ugly; they are wrong. They make comparisons hard, they hide variation, they encode noise as signal. The grammar of graphics, perceptual research, and color theory are real things, and we should treat them the way we treat statistical theory.

“Above all else, show the data.” — Edward Tufte

The Same Data, Completely Different Stories

Anscombe’s Quartet is a classic example: four datasets with identical summary statistics (same mean, same variance, same correlation, same regression line) but radically different patterns. Without visualization, you would never know.

library(tidyverse)

anscombe_long <- anscombe |>
  pivot_longer(everything(),
               names_to = c(".value", "set"),
               names_pattern = "(.)(.)")

ggplot(anscombe_long, aes(x, y)) +
  geom_point(size = 3, color = "#2E86AB", alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE, color = "#e64173", linewidth = 1) +
  facet_wrap(~set, ncol = 4, scales = "free") +
  labs(title = "Anscombe's Quartet",
       subtitle = "Same mean, same variance, same correlation, same regression line. Very different data.") +
  theme_minimal(base_size = 12)
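
The claim of matching summary statistics is easy to check numerically. The following is a minimal sketch in base R (the anscombe data frame ships with R); the object name anscombe_stats is ours and purely illustrative.

```r
# Summary statistics for each of the four Anscombe sets, base R only.
anscombe_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)                      # simple linear regression per set
  data.frame(
    set       = i,
    mean_x    = mean(x),
    mean_y    = mean(y),
    var_x     = var(x),
    var_y     = var(y),
    cor_xy    = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2])
  )
}))

print(round(anscombe_stats, 3))
```

Across the four sets the means, variances, correlations, and fitted lines differ by no more than about 0.01, which is exactly why the plot, not the table, reveals the difference.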

Every analysis you do should begin and end with looking at the data.

Part 2: The Grammar of Graphics

In 1999, Leland Wilkinson published The Grammar of Graphics, which described a unified framework for thinking about statistical charts. Hadley Wickham began translating this framework into R in 2005, work that became ggplot2. The grammar is the reason ggplot2 feels different from base R plotting: instead of memorizing dozens of functions, you compose plots from a small set of building blocks.

The Seven Layers

The Seven Layers of the Grammar of Graphics
Layer	Question	Function
Data	What data am I plotting?	ggplot(data = …)
Aesthetics	Which variables map to which visual properties (x, y, color, size, shape)?	aes(x = …, y = …, color = …)
Geometries	What shape do I draw (point, line, bar, smooth)?	geom_*()
Facets	Do I split into small multiples?	facet_wrap() / facet_grid()
Statistics	Do I transform the data (mean, density, smooth)?	stat_*() or geom_smooth()
Coordinates	What coordinate system (Cartesian, polar, log)?	coord_*() / scale_*()
Theme	How does it look (fonts, gridlines, background)?	theme_*() / theme()

The genius of this framework is composability. To go from a scatterplot to a faceted scatterplot with a smoother, you add layers; you do not start over.

library(tidyverse)
library(palmerpenguins)

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ island) +
  scale_color_brewer(palette = "Dark2") +
  labs(title = "Bill Dimensions of Palmer Penguins",
       subtitle = "By species and island",
       x = "Bill length (mm)", y = "Bill depth (mm)",
       color = "Species") +
  theme_minimal(base_size = 12)
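
Composability is easiest to see when each step is stored as its own object. This is a small sketch, assuming only ggplot2 and its bundled mpg data; the object names (p_base, p_points, and so on) are illustrative.

```r
library(ggplot2)

# mpg ships with ggplot2, so this sketch needs no other packages.
p_base   <- ggplot(mpg, aes(x = displ, y = hwy))               # data + aesthetics, nothing drawn yet
p_points <- p_base + geom_point(alpha = 0.6)                   # add a geometry
p_smooth <- p_points + geom_smooth(method = "lm", se = FALSE)  # add a second layer
p_facets <- p_smooth + facet_wrap(~ drv)                       # split into small multiples

# Each object is a complete plot; adding a layer never modifies the earlier ones.
length(p_points$layers)  # 1
length(p_facets$layers)  # 2

p_facets
```

Because every intermediate object stays intact, you can print p_points, decide it needs a smoother, and keep going without retyping anything.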

The ggplot2 Template

Every ggplot follows the same skeleton:

ggplot(data = <DATA>,                           # 1. Data
       aes(x = <X>, y = <Y>, color = <Z>)) +    # 2. Aesthetics
  geom_<TYPE>(...) +                            # 3. Geometry
  facet_wrap(~ <VAR>) +                         # 4. Facets
  scale_<AES>_<TYPE>(...) +                     # 5. Scales
  labs(title = "...", x = "...", y = "...") +   # 6. Labels
  theme_minimal()                               # 7. Theme

You will use this template for every chart in this course and beyond.

Part 3: Kieran Healy’s Principles

In Data Visualization: A Practical Introduction (2018), Healy organizes good practice around three questions: is the chart substantively good, is it perceptually good, and is it aesthetically good?

Substantive Standards

Show the data. Whenever possible, show individual observations, not just summaries.
Compare like with like. Stratify intentionally; do not let groups be confounded by sample size or scale.
Quantify uncertainty. Always show confidence intervals or standard errors when reporting estimates.

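library(broom)
# mean_cl_normal() wraps Hmisc::smean.cl.normal(), so the Hmisc package must be installed

penguins |>
  drop_na(sex) |>
  ggplot(aes(x = species, y = body_mass_g, color = sex)) +
  # Dodge the raw points by sex so they line up with their summaries
  geom_point(position = position_jitterdodge(jitter.width = 0.15, dodge.width = 0.5),
             alpha = 0.4, size = 1.5) +
  stat_summary(fun.data = mean_cl_normal, geom = "pointrange",
               position = position_dodge(width = 0.5),
               size = 0.8, linewidth = 1) +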
  scale_color_manual(values = c("female" = "#D55E00", "male" = "#0072B2")) +
  labs(title = "Body Mass by Species and Sex",
       subtitle = "Individual observations with mean and 95% CI",
       x = NULL, y = "Body mass (g)", color = "Sex") +
  theme_minimal(base_size = 12)
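
The interval drawn by mean_cl_normal is a t-based confidence interval for the mean. To make that concrete, here is a hand-rolled sketch in base R; mean_ci is a hypothetical helper written for this illustration, not a function from ggplot2 or Hmisc.

```r
# t-based confidence interval for a mean (hypothetical helper)
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]                       # drop missing values
  n <- length(x)
  m <- mean(x)
  se <- sd(x) / sqrt(n)                   # standard error of the mean
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * se
  c(mean = m, lower = m - half_width, upper = m + half_width)
}

mean_ci(mtcars$mpg)

# Cross-check against the interval reported by t.test()
t.test(mtcars$mpg)$conf.int
```

The two results should agree, which is a quick way to confirm what the error bars on the chart actually represent.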

Perceptual Standards

Human perception is not uniform across visual encodings. Cleveland and McGill (1984) ranked encodings by accuracy:

Cleveland-McGill Hierarchy of Visual Encodings
Rank	Encoding	Use for
1	Position on a common scale	Most quantitative comparisons
2	Position on identical but non-aligned scales	Small multiples (faceting)
3	Length	Bar charts
4	Angle / slope	Pie chart slices (use sparingly)
5	Area	Bubble charts (use cautiously)
6	Volume	Almost never
7	Color hue	Categorical groups
8	Color saturation	Sequential / diverging variables

Practical implication: Prefer dot plots and bar charts to pie charts. Prefer faceting to stacking. Use color for categories, not for quantities (unless you use a perceptually uniform scale like viridis).
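
To see why faceting tends to beat stacking, compare the same counts drawn both ways. This is a sketch using ggplot2's bundled mpg data; the object names are illustrative.

```r
library(ggplot2)

# Stacked: only the bottom segment of each bar sits on a common baseline,
# so the other groups must be compared by length alone.
p_stacked <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  labs(title = "Stacked", x = NULL, y = "Count")

# Faceted: every bar starts at zero within its panel, so each group is
# read as a position on an identical (non-aligned) scale.
p_faceted <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  facet_wrap(~ drv, ncol = 1) +
  labs(title = "Faceted", x = NULL, y = "Count")

p_stacked
p_faceted
```

The stacked version relies on length (rank 3) for all but one group; the faceted version moves every comparison up to position (rank 2).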

Aesthetic Standards

Healy’s third pillar is what most beginners notice last but readers notice first: typography, white space, alignment, color harmony. A clean theme, a sans-serif font, restrained gridlines, and direct labeling will make a competent chart feel professional.

Part 4: Yan Holtz and Choosing the Right Chart Type

Yan Holtz built two enormously useful resources: The R Graph Gallery (https://r-graph-gallery.com) and From Data to Viz (https://www.data-to-viz.com). The latter is a decision tree: tell me what kind of data you have, and I will tell you which chart types make sense.

A Simplified Decision Guide

Chart Type Decision Guide (Holtz)
Data type	Good chart types	Avoid
One numeric variable	Histogram, density plot, boxplot, violin	Pie chart for distributions
One categorical variable	Bar chart, lollipop, treemap	3D pie chart, donut
Two numeric variables	Scatterplot, hexbin, 2D density	3D scatter, dual-axis
Numeric × categorical	Boxplot, violin, ridge plot, jitter + summary	Bar with no error
Two categorical variables	Mosaic, heatmap, grouped bar	Stacked bars when comparing groups
Time series	Line chart, area chart, slope chart	Bar charts of time series
Map data	Choropleth, dot map, cartogram	Choropleth without normalizing by area or population
Network / hierarchy	Network graph, dendrogram, sunburst	Force-directed network with > 200 nodes

The Anti-Pie-Chart Argument

Pie charts encode quantities as angles, which rank below position and length in the perceptual hierarchy. Worse, they make comparisons across groups nearly impossible. A dot plot or a horizontal bar chart almost always tells the same story more clearly.

library(patchwork)

dat <- tibble(group = LETTERS[1:6],
              value = c(22, 18, 17, 16, 14, 13))

p1 <- ggplot(dat, aes(x = "", y = value, fill = group)) +
  geom_col(width = 1) +
  coord_polar("y") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Pie chart") +
  theme_void()

p2 <- ggplot(dat, aes(x = value, y = fct_reorder(group, value))) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = value), hjust = -0.2) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(title = "Bar chart (sorted)", x = NULL, y = NULL) +
  theme_minimal()

p1 + p2

Which one lets you instantly say which group is third largest?

Part 5: Cédric Scherer and the Editorial Style

Cédric Scherer takes ggplot2 beyond the defaults and into the territory of editorial design — the kind of charts you see in The New York Times, Reuters, and Our World in Data. His work demonstrates that ggplot2 can produce publication-grade graphics with no post-processing, if you commit to learning the theme system.

The Scherer Approach

Start with the data, not the chart type. Ask what story the data tells.
Strip ruthlessly. Remove everything that does not carry information: redundant titles, gridlines that compete with the data, default gray backgrounds.
Direct label. Put labels next to the lines or points they describe, not in a legend off to the side.
Use type as a design element. A clean serif title, a small italic subtitle, a tiny grey caption with the data source.
Annotate. Pull the reader’s eye to the most important point with a curved arrow, a callout, a subtle shaded region.
Iterate. Cédric’s published charts often go through 20+ versions.

Tools Beyond Base ggplot2

ggtext — markdown and HTML inside titles, subtitles, and labels
patchwork — composing multi-panel figures
ggrepel — non-overlapping text labels
showtext / sysfonts — custom fonts (Google Fonts in your charts)
MetBrewer, paletteer, scico — color palettes from artists and scientific colormaps
ggdist — visualizing uncertainty and distributions

penguin_means <- penguins |>
  drop_na() |>
  group_by(species, year) |>
  summarise(mass = mean(body_mass_g), .groups = "drop")

ggplot(penguin_means, aes(x = year, y = mass, color = species)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  ggrepel::geom_text_repel(
    data = filter(penguin_means, year == max(year)),
    aes(label = species),
    hjust = 0, nudge_x = 0.1, direction = "y", segment.color = NA
  ) +
  scale_color_manual(values = c("Adelie" = "#FF6B35",
                                 "Chinstrap" = "#A23B72",
                                 "Gentoo" = "#2E86AB")) +
  scale_x_continuous(breaks = 2007:2009, expand = expansion(mult = c(0.05, 0.25))) +
  labs(title = "Average Body Mass of Palmer Penguins, 2007-2009",
       subtitle = "Direct labels replace the legend; gridlines are softened",
       x = NULL, y = "Body mass (g)",
       caption = "Source: palmerpenguins R package") +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "grey40"),
    plot.caption = element_text(color = "grey60", size = 9)
  )

Part 6: A Worked Transformation

Same data. Four versions. Watch the evolution.

Iteration 1: The default

ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

This is honest. It is also forgettable.

Iteration 2: Show the data

ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.4)

Now we can see how many penguins are in each species and where the outliers actually live.

Iteration 3: Add labels and theme

ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA, color = "grey20") +
  geom_jitter(width = 0.2, alpha = 0.4, color = "grey30") +
  scale_fill_manual(values = c("#FF6B35", "#A23B72", "#2E86AB")) +
  labs(title = "Body Mass of Palmer Penguins by Species",
       x = NULL, y = "Body mass (g)") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Iteration 4: Editorial polish

library(ggdist)

ggplot(drop_na(penguins, body_mass_g),
       aes(x = species, y = body_mass_g, fill = species)) +
  stat_halfeye(adjust = 0.5, width = 0.6, .width = 0,
               justification = -0.3, point_color = NA) +
  geom_boxplot(width = 0.15, outlier.shape = NA, alpha = 0.7) +
  geom_jitter(width = 0.05, alpha = 0.3, size = 1.2) +
  scale_fill_manual(values = c("#FF6B35", "#A23B72", "#2E86AB")) +
  coord_cartesian(xlim = c(1.2, NA), clip = "off") +
  labs(title = "Gentoo Penguins Are Substantially Heavier",
       subtitle = "Distribution, median, and individual observations of body mass by species",
       x = NULL, y = "Body mass (g)",
       caption = "Source: palmerpenguins R package") +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "grey40"),
    plot.caption = element_text(color = "grey60", size = 9)
  )

The final chart shows: the distribution (the half-violin “cloud”), the summary (boxplot), the raw data (jitter), and a conclusion baked into the title. That is the difference between a chart and a finding.

Part 7: Color, Type, and Accessibility

Color Principles

Categorical data: use a qualitative palette where each color is distinct but no color “ranks” higher than another. Examples: Set2, Dark2, the Okabe-Ito palette.
Sequential data: use a single-hue or multi-hue gradient that runs from light to dark. Examples: viridis, mako, Blues.
Diverging data: use two hues meeting at a neutral midpoint. Examples: RdBu, BrBG. Use only when there is a meaningful zero.
Always check colorblind safety. About 8% of men have some form of color vision deficiency. Use viridis-family palettes by default; check with colorBlindness::cvdPlot() or prismatic::clr_protan().

okabe_ito <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442",
               "#0072B2", "#D55E00", "#CC79A7", "#000000")

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point(size = 2, alpha = 0.8) +
  scale_color_manual(values = okabe_ito) +
  labs(title = "Okabe-Ito Palette: Designed for Colorblind Accessibility") +
  theme_minimal()
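
The example above covers the categorical case. For the sequential and diverging cases, here is a sketch using faithfuld, a density grid that ships with ggplot2; the deviation column is derived here purely to create a variable with a meaningful zero.

```r
library(ggplot2)

# Sequential: density is non-negative and has no meaningful midpoint.
# direction = -1 makes low values light and high values dark.
p_seq <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
  geom_raster() +
  scale_fill_viridis_c(option = "mako", direction = -1) +
  labs(title = "Sequential scale") +
  theme_minimal()

# Diverging: deviation from the mean density is centered on zero,
# so two hues meeting at a neutral midpoint are appropriate.
faithful_dev <- transform(as.data.frame(faithfuld),
                          deviation = density - mean(density))

p_div <- ggplot(faithful_dev, aes(waiting, eruptions, fill = deviation)) +
  geom_raster() +
  scale_fill_gradient2(low = "#B2182B", mid = "white", high = "#2166AC",
                       midpoint = 0) +
  labs(title = "Diverging scale") +
  theme_minimal()

p_seq
p_div
```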

Why Colorblind Safety Matters

About 8% of men and 0.5% of women have some form of color vision deficiency. Here is a side-by-side comparison of a problematic palette versus a safe one:

library(patchwork)

p_bad <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point(size = 2.5, alpha = 0.8) +
  scale_color_manual(values = c("red", "green", "blue")) +
  labs(title = "Red-Green-Blue: hard to tell apart for ~8% of men",
       x = "Bill length (mm)", y = "Bill depth (mm)") +
  theme_minimal(base_size = 12)

p_good <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point(size = 2.5, alpha = 0.8) +
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73")) +
  labs(title = "Okabe-Ito: designed to be colorblind-safe",
       x = "Bill length (mm)", y = "Bill depth (mm)") +
  theme_minimal(base_size = 12)

p_bad + p_good +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

Rule of thumb: use viridis-family palettes by default. Check with colorBlindness::cvdPlot() or coblis.com.
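
You can also simulate a color vision deficiency on the palette itself before it ever reaches a chart. The sketch below assumes the colorspace package is installed (it is not otherwise used in this lecture) and relies on its deutan() and swatchplot() functions.

```r
library(colorspace)

rgb_pal   <- c("#FF0000", "#00FF00", "#0000FF")   # pure red, green, blue
okabe_ito <- c("#E69F00", "#56B4E9", "#009E73")   # first three Okabe-Ito colors

# Simulate deuteranopia, a form of red-green color vision deficiency
rgb_deutan   <- deutan(rgb_pal)
okabe_deutan <- deutan(okabe_ito)

# Draw the original and simulated palettes as rows of swatches
swatchplot(
  "RGB (original)"        = rgb_pal,
  "RGB (simulated)"       = rgb_deutan,
  "Okabe-Ito (original)"  = okabe_ito,
  "Okabe-Ito (simulated)" = okabe_deutan
)
```

If two swatches in a simulated row look alike, the corresponding groups will be hard to separate in your chart for readers with that deficiency.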

Typography

Use one font family consistently.
The title should be bold and slightly larger, the subtitle regular grey, the caption small grey.
Avoid serifs in dense charts; reserve them for editorial titles.
Sans-serifs that work well: Inter, Source Sans Pro, Roboto, Lato, IBM Plex Sans.

Direct Labeling

Whenever possible, put the label next to the thing it labels. Legends force the reader’s eye to bounce between the chart and a key. Direct labels eliminate that bounce.

The Final Checklist

Before you submit a chart, ask:

Does the title state the finding, not the topic?
Does the y-axis start at zero (for bar charts), or is it otherwise appropriately scaled?
Are units labeled?
Is there a source caption?
Is the color palette colorblind-safe?
Are uncertainty intervals shown?
Could a reader explain the chart in one sentence?

Part 8: Modern Techniques and Advanced Aesthetics

The first seven parts gave you a foundation. This section is the “show me what’s possible” tour. These techniques are what separate a competent ggplot2 user from someone whose figures get retweeted.

8.1 Highlighting with `gghighlight`

A common problem: you have many groups but only a few matter to your story. Faceting is one answer; highlighting is the other. The gghighlight package fades out everything except the lines or points you want the reader to focus on.

library(gghighlight)

gapminder_like <- penguins |>
  drop_na() |>
  group_by(species, year) |>
  summarise(mean_mass = mean(body_mass_g), .groups = "drop")

ggplot(gapminder_like, aes(x = year, y = mean_mass, color = species)) +
  geom_line(linewidth = 1.4) +
  geom_point(size = 3) +
  gghighlight(species == "Gentoo",
              unhighlighted_params = list(color = "grey80", linewidth = 0.8)) +
  scale_color_manual(values = c("Gentoo" = "#2E86AB")) +
  labs(title = "Gentoo Penguins Stand Out",
       subtitle = "Other species are visible but recede into the background",
       x = NULL, y = "Mean body mass (g)") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Why this matters: the eye is drawn instantly to the highlighted line, but the reader still has the full context of how Gentoo compares to the other species. This is far more powerful than deleting the other groups.

8.2 Animation with `gganimate`

Time-series and longitudinal data come alive when animated. The gganimate package extends ggplot2 with a small set of transition_*(), enter_*(), and ease_aes() functions. The result is a GIF or MP4 that can be embedded in slide decks, dashboards, or HTML reports.

library(gganimate)
library(gapminder)

p <- ggplot(gapminder,
            aes(x = gdpPercap, y = lifeExp,
                size = pop, color = continent)) +
  geom_point(alpha = 0.7) +
  scale_x_log10() +
  scale_size(range = c(2, 12), guide = "none") +
  scale_color_brewer(palette = "Set2") +
  labs(title = "Year: {frame_time}",
       subtitle = "Hans Rosling's classic visualization, animated",
       x = "GDP per capita (log scale)",
       y = "Life expectancy (years)",
       color = "Continent") +
  theme_minimal(base_size = 14) +
  transition_time(year) +
  ease_aes("cubic-in-out")

animate(p, nframes = 100, fps = 10, width = 800, height = 500,
        renderer = gifski_renderer())

anim_save("gapminder.gif")

Three key animation transitions:

Function	What it animates
`transition_time()`	A continuous variable (e.g., year, day)
`transition_states()`	A discrete variable, with pauses on each state
`transition_reveal()`	Reveals the data progressively along an axis
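
The gapminder example uses transition_time(). For comparison, here is a sketch of transition_reveal(), assuming gganimate is installed and using the economics data bundled with ggplot2; rendering is left commented out because it is the slow step.

```r
library(ggplot2)
library(gganimate)

# A line that is drawn progressively along the date axis
p_reveal <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(color = "#2E86AB", linewidth = 0.8) +
  labs(title = "US unemployment, revealed over time",
       x = NULL, y = "Unemployed (thousands)") +
  theme_minimal(base_size = 12) +
  transition_reveal(date)

# Uncomment to render (requires the gifski package for GIF output):
# animate(p_reveal, nframes = 60, fps = 10, renderer = gifski_renderer())
```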

Tip from Cédric Scherer: animations should serve a narrative purpose. If the same insight is clearer in a static small-multiple, prefer the static version. Animation costs the reader attention; spend it deliberately.

8.3 Interactive Charts with `plotly` and `ggiraph`

For HTML reports, dashboards, and Shiny apps, interactivity lets the reader explore. Two approaches:

`plotly::ggplotly()` — instant interactivity from any ggplot

library(plotly)

p <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g,
                          color = species,
                          text = paste0("Island: ", island,
                                        "<br>Sex: ", sex,
                                        "<br>Year: ", year))) +
  geom_point(alpha = 0.7, size = 2) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

ggplotly(p, tooltip = "text")

`ggiraph` — finer control, supports animation, click events, and Shiny

library(ggiraph)

p <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_point_interactive(
    aes(tooltip = paste("Species:", species, "<br>Island:", island),
        data_id = species),
    size = 2.5
  ) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

girafe(ggobj = p, options = list(opts_hover(css = "stroke:black;stroke-width:2px;")))

8.4 Distributions Done Right with `ggdist` and `ggridges`

Boxplots hide bimodality. Histograms hide group differences. Modern distribution geoms show shape, summary, and uncertainty in one chart.

Raincloud plots (`ggdist`)

A raincloud combines a half-violin (the cloud), a boxplot (the umbrella), and jittered raw points (the rain). Because it shows shape, summary, and raw data at once, it is one of the most informative ways to show a distribution.

library(ggdist)

ggplot(drop_na(penguins, body_mass_g),
       aes(x = body_mass_g, y = species, fill = species)) +
  stat_halfeye(adjust = 0.6, .width = 0, justification = -0.2,
               point_color = NA, alpha = 0.7) +
  geom_boxplot(width = 0.15, outlier.shape = NA, alpha = 0.5) +
  geom_jitter(width = 0, height = 0.07, alpha = 0.3, size = 1.2) +  # jitter across species only, not along body mass
  scale_fill_manual(values = c("#FF6B35", "#A23B72", "#2E86AB")) +
  labs(title = "Raincloud Plot of Penguin Body Mass",
       subtitle = "Distribution + boxplot + raw observations",
       x = "Body mass (g)", y = NULL) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Ridge plots (`ggridges`)

When you want to compare distributions across many groups in a small space.

library(ggridges)

ggplot(diamonds, aes(x = price, y = cut, fill = cut)) +
  geom_density_ridges(alpha = 0.8, scale = 1.1) +
  scale_x_log10(labels = scales::dollar) +
  scale_fill_viridis_d(option = "mako") +
  labs(title = "Distribution of Diamond Prices by Cut",
       subtitle = "Ridge plots make many distributions comparable",
       x = "Price (log scale)", y = NULL) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

8.5 Uncertainty That You Can See

Statisticians love confidence intervals. Readers often miss them because they look like skinny error bars on top of bold point estimates. ggdist's stat_halfeye() and its sibling stats, combined with distribution objects from the distributional package, let you visualize the whole posterior or sampling distribution.

library(ggdist)

set.seed(1220)
estimates <- tibble(
  predictor = c("Smoking", "Exercise", "Income", "Sleep", "Age"),
  estimate  = c(0.65, -0.45, -0.20, -0.30, 0.05),
  se        = c(0.10, 0.08, 0.07, 0.09, 0.04)
)

ggplot(estimates, aes(y = fct_reorder(predictor, estimate),
                      xdist = distributional::dist_normal(estimate, se))) +
  stat_halfeye(.width = c(0.5, 0.95), fill = "#2E86AB",
               slab_alpha = 0.6, point_size = 3) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey40") +
  labs(title = "Coefficient Plot with Full Sampling Distributions",
       subtitle = "Thick band = 50% interval, thin line = 95% interval",
       x = "Log odds ratio", y = NULL) +
  theme_minimal(base_size = 12)

This is far more honest than a forest plot of point estimates with whiskers, because the reader can see the shape of the uncertainty.
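
For a normal sampling distribution, the 50% and 95% intervals in the chart are just the estimate plus or minus a multiple of the standard error. The base R sketch below recomputes them from the same made-up estimates; the column names (lower_95, excludes_zero, and so on) are illustrative.

```r
estimates <- data.frame(
  predictor = c("Smoking", "Exercise", "Income", "Sleep", "Age"),
  estimate  = c(0.65, -0.45, -0.20, -0.30, 0.05),
  se        = c(0.10, 0.08, 0.07, 0.09, 0.04)
)

z50 <- qnorm(0.75)    # half-width multiplier for a 50% interval
z95 <- qnorm(0.975)   # half-width multiplier for a 95% interval

intervals <- transform(
  estimates,
  lower_50 = estimate - z50 * se,
  upper_50 = estimate + z50 * se,
  lower_95 = estimate - z95 * se,
  upper_95 = estimate + z95 * se
)

# Flag coefficients whose 95% interval does not cross zero
intervals$excludes_zero <- intervals$lower_95 > 0 | intervals$upper_95 < 0

print(intervals[, c("predictor", "lower_95", "upper_95", "excludes_zero")],
      digits = 2)
```

With these illustrative numbers, Age is the only predictor whose 95% interval spans zero, which matches what the dashed reference line shows in the chart.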

8.6 Patchwork: Composing Figures

A single panel rarely tells a complete story. The patchwork package lets you compose multi-panel figures with arithmetic-like syntax: p1 + p2, p1 / p2, (p1 | p2) / p3.

library(patchwork)

p_scatter <- ggplot(penguins, aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c("#FF6B35", "#A23B72", "#2E86AB")) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none") +
  labs(x = "Bill length (mm)", y = "Body mass (g)")

p_top <- ggplot(penguins, aes(bill_length_mm, fill = species)) +
  geom_density(alpha = 0.6) +
  scale_fill_manual(values = c("#FF6B35", "#A23B72", "#2E86AB")) +
  theme_void() +
  theme(legend.position = "none")

p_right <- ggplot(penguins, aes(body_mass_g, fill = species)) +
  geom_density(alpha = 0.6) +
  scale_fill_manual(values = c("#FF6B35", "#A23B72", "#2E86AB")) +
  coord_flip() +
  theme_void() +
  theme(legend.position = "none")

(p_top + plot_spacer() + p_scatter + p_right +
    plot_layout(ncol = 2, widths = c(4, 1), heights = c(1, 4))) +
  plot_annotation(
    title = "Marginal Density + Scatterplot",
    subtitle = "Composed with patchwork",
    theme = theme(plot.title = element_text(face = "bold", size = 14))
  )

Patchwork Syntax Cheat Sheet

The arithmetic is intuitive:

p1 + p2              # side by side
p1 / p2              # stacked vertically
(p1 | p2) / p3       # two on top, one below
p1 + p2 + p3 + plot_layout(ncol = 2)  # grid layout
p1 + plot_annotation(title = "Combined Figure")  # annotation

8.7 Annotation as a First-Class Citizen

The single fastest way to make a chart “editorial” is to annotate the finding directly on the chart. ggtext lets you use markdown and HTML in titles, subtitles, and labels. geom_curve and annotate let you point at things.

library(ggtext)

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point(alpha = 0.7, size = 2.2) +
  # The arrow ends inside the Gentoo cluster (flippers of roughly 205-230 mm)
  annotate("curve", x = 175, y = 5500, xend = 210, yend = 4900,
           arrow = arrow(length = unit(0.2, "cm")),
           curvature = -0.3, color = "grey30") +
  annotate("text", x = 173, y = 5650,
           label = "Gentoos cluster\nin the heavy / long-flipper corner",
           hjust = 0, size = 3.4, color = "grey20", lineheight = 0.95) +
  scale_color_manual(values = c("Adelie" = "#FF6B35",
                                 "Chinstrap" = "#A23B72",
                                 "Gentoo" = "#2E86AB")) +
  labs(title = "Body mass scales with flipper length, but **species matter**",
       subtitle = "Highlighted with a callout instead of a legend explainer",
       x = "Flipper length (mm)", y = "Body mass (g)",
       color = NULL) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_markdown(face = "bold", size = 14),
        legend.position = "top")

8.8 Spatial Visualization with `sf`

Public health data are often spatial. The sf package, together with ggplot2's geom_sf(), gives you first-class support for shapefiles, projections, and choropleths.

library(sf)
library(tigris)
options(tigris_use_cache = TRUE)

ny_counties <- counties(state = "NY", cb = TRUE, class = "sf")

# Hypothetical FMD prevalence by county (seeded so the map is reproducible)
set.seed(1220)
ny_counties$fmd_prev <- runif(nrow(ny_counties), 8, 18)

ggplot(ny_counties) +
  geom_sf(aes(fill = fmd_prev), color = "white", linewidth = 0.2) +
  scale_fill_viridis_c(option = "rocket", direction = -1,
                       name = "FMD %") +
  labs(title = "Frequent Mental Distress Prevalence by NY County",
       subtitle = "Hypothetical data for demonstration") +
  theme_void(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

8.9 Tables That Are Visualizations: `gt` and `gtExtras`

Sometimes the right chart is a table. The gt package and its extension gtExtras let you build editorial-quality tables with inline sparklines, color-graded cells, and embedded plots.

library(gt)
library(gtExtras)

penguins |>
  drop_na() |>
  group_by(species) |>
  summarise(
    n = n(),
    mean_mass = mean(body_mass_g),
    masses = list(body_mass_g),
    .groups = "drop"
  ) |>
  gt() |>
  gt_plt_dist(masses, type = "density", fill_color = "#2E86AB") |>
  fmt_number(mean_mass, decimals = 0) |>
  cols_label(species = "Species", n = "N",
             mean_mass = "Mean Mass (g)", masses = "Distribution") |>
  tab_header(title = md("**Penguin Body Mass by Species**"),
             subtitle = "With inline density plots") |>
  gt_theme_538()

Penguin Body Mass by Species
With inline density plots
Species	N	Mean Mass (g)
Adelie	146	3,706
Chinstrap	68	3,733
Gentoo	119	5,092

8.10 Heatmaps and Correlation Matrices

Heatmaps turn a matrix of numbers into a pattern you can see. Essential for correlation tables, confusion matrices, and time-by-group summaries.

# Correlation matrix of numeric penguin variables
cor_data <- penguins |>
  drop_na() |>
  select(where(is.numeric)) |>
  cor() |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "cor")

ggplot(cor_data, aes(x = var1, y = var2, fill = cor)) +
  geom_tile(color = "white", linewidth = 0.8) +
  geom_text(aes(label = round(cor, 2),
                color = abs(cor) > 0.6),
            size = 4.5, fontface = "bold") +
  scale_fill_gradient2(low = "#D55E00", mid = "white", high = "#0072B2",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Correlation") +
  scale_color_manual(values = c("TRUE" = "white", "FALSE" = "grey20"),
                     guide = "none") +
  scale_x_discrete(labels = \(x) str_replace_all(x, "_", "\n")) +
  scale_y_discrete(labels = \(x) str_replace_all(x, "_", "\n")) +
  labs(title = "Correlation Heatmap of Penguin Measurements",
       subtitle = "Color intensity encodes strength; text encodes exact value",
       x = NULL, y = NULL) +
  coord_fixed() +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 10),
        axis.text.y = element_text(size = 10),
        panel.grid = element_blank())

8.11 Dumbbell Charts: Showing Change

When you need to show the difference between two time points or conditions, dumbbell charts are more effective than grouped bars. They encode direction, magnitude, and rank simultaneously.

# Simulated epi example: disease rates before/after intervention
intervention_data <- tibble(
  county = c("Albany", "Saratoga", "Rensselaer", "Schenectady",
             "Columbia", "Greene", "Warren", "Washington"),
  before = c(15.2, 12.8, 18.1, 16.5, 11.3, 14.7, 9.8, 13.2),
  after  = c(11.1, 10.2, 12.4, 13.8, 9.5, 11.0, 8.1, 10.9)
) |>
  mutate(change = after - before,
         county = fct_reorder(county, change))

ggplot(intervention_data) +
  geom_segment(aes(x = before, xend = after,
                   y = county, yend = county),
               color = "grey60", linewidth = 1.2) +
  geom_point(aes(x = before, y = county), color = "#D55E00",
             size = 4) +
  geom_point(aes(x = after, y = county), color = "#0072B2",
             size = 4) +
  annotate("text", x = 19, y = 8.3, label = "Before",
           color = "#D55E00", fontface = "bold", size = 4.5) +
  annotate("text", x = 19, y = 7.7, label = "After",
           color = "#0072B2", fontface = "bold", size = 4.5) +
  labs(title = "Every County Improved After the Intervention",
       subtitle = "Rate per 1,000 population, before vs. after community health program",
       x = "Rate per 1,000", y = NULL) +
  scale_x_continuous(limits = c(7, 20)) +
  theme_minimal(base_size = 12)

8.12 Fonts, Themes, and Reusable Style

Once you have a look you like, package it as a function and reuse it across every chart in your manuscript or thesis. This is what professional outlets do.

theme_epi553 <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 2,
                                margin = margin(b = 4)),
      plot.subtitle = element_text(color = "grey40",
                                   margin = margin(b = 12)),
      plot.caption = element_text(color = "grey60", size = base_size - 3,
                                  hjust = 0, margin = margin(t = 12)),
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(color = "grey92"),
      axis.title = element_text(color = "grey30"),
      axis.text = element_text(color = "grey30"),
      strip.text = element_text(face = "bold", color = "grey20"),
      legend.position = "top",
      legend.title = element_text(face = "bold", size = base_size - 1),
      plot.title.position = "plot",
      plot.caption.position = "plot"
    )
}

ggplot(penguins, aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point(alpha = 0.8, size = 2.2) +
  scale_color_manual(values = c("#FF6B35", "#A23B72", "#2E86AB")) +
  labs(title = "Reusable EPI 553 Theme in Action",
       subtitle = "Define once, apply everywhere",
       x = "Bill length (mm)", y = "Body mass (g)", color = "Species",
       caption = "Source: palmerpenguins") +
  theme_epi553()

Practical tip: drop theme_epi553() (or whatever you call yours) into an R/themes.R file in your project. Source it from every analysis. Your figures will look consistent without effort.
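
One way to follow this tip is to source the file once and then register the theme with theme_set(), so every later plot picks it up without an explicit + theme_...() call. This is a minimal sketch: theme_project() is a placeholder standing in for your own theme, and the R/themes.R path is assumed rather than created here.

```r
library(ggplot2)

# Placeholder for the contents of a hypothetical R/themes.R file.
# In a real project this would be loaded with: source("R/themes.R")
theme_project <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(legend.position = "top",
          panel.grid.minor = element_blank())
}

# Register the theme globally; theme_set() returns the previous theme
old_theme <- theme_set(theme_project())

# No explicit theme call needed: the plot uses the registered theme
p_default <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1,000 lb)", y = "Miles per gallon", color = "Cylinders")
p_default

# Restore the previous theme when finished
theme_set(old_theme)
```

Setting the theme globally keeps individual plot code shorter, at the cost of making each plot depend on session state. Adding the theme explicitly to every plot is the more self-contained choice.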

8.13 The Modern Workflow

Here is a workflow typical of editorial data viz teams such as those at The Pudding, FiveThirtyEight, and The Economist:

Sketch first. On paper. What is the story?
Prototype in ggplot2 with default themes. Get the geometry right.
Iterate the encoding. Try three chart types. Pick one.
Layer in annotations. Title is the finding. Direct labels. Callouts.
Polish the theme. Fonts, colors, spacing.
Export at the right resolution. Use ggsave() with dpi = 300 for print, dpi = 96 for web. SVG for vectors.
Show it to a colleague. If they can’t read the finding in 10 seconds, iterate.

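# ggsave() writes the most recently displayed plot unless plot = is supplied
ggsave("figure_1.png", width = 8, height = 5, dpi = 300, bg = "white")
ggsave("figure_1.svg", width = 8, height = 5)        # vector for editorial; needs the svglite package
ggsave("figure_1.pdf", width = 8, height = 5, device = cairo_pdf)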
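
Because ggsave() falls back to the last plot displayed, scripts that draw several figures can save the wrong one. A minimal sketch of a wrapper that always takes the plot object and writes a print and a web PNG at the same physical size follows. The save_figure() helper, the file stems, and the use of tempdir() are hypothetical choices for illustration.

```r
library(ggplot2)

# Hypothetical helper: write print and web PNGs of the same plot
save_figure <- function(plot, stem, dir = tempdir(), width = 8, height = 5) {
  print_path <- file.path(dir, paste0(stem, "_print.png"))
  web_path   <- file.path(dir, paste0(stem, "_web.png"))
  ggsave(print_path, plot = plot, width = width, height = height,
         dpi = 300, bg = "white")
  ggsave(web_path, plot = plot, width = width, height = height,
         dpi = 96, bg = "white")
  c(print = print_path, web = web_path)
}

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
paths <- save_figure(p, "figure_1")

# Same physical size, different pixel counts:
# 8 in x 300 dpi = 2400 px wide for print, 8 in x 96 dpi = 768 px wide for web
file.exists(paths)
```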

Part 9: Visualizing Regression Models

You have spent the semester building models. Now make them visible.

The Problem with Regression Tables

A regression table is a wall of numbers. A coefficient plot communicates direction, magnitude, uncertainty, and significance in one glance. A table requires row-by-row mental math.

Consider this table:

Term	Estimate	SE	p
(Intercept)	-1.23	0.41	0.003
Smoking	0.65	0.10	<0.001
Exercise	-0.45	0.08	<0.001
Income	-0.20	0.07	0.004
Sleep	-0.30	0.09	0.001
Age	0.05	0.04	0.211

Now compare it to the visual version:

library(ggdist)
library(distributional)

estimates <- tibble(
  predictor = c("Smoking", "Exercise", "Income", "Sleep", "Age"),
  estimate  = c(0.65, -0.45, -0.20, -0.30, 0.05),
  se        = c(0.10, 0.08, 0.07, 0.09, 0.04)
) |>
  mutate(
    lower = estimate - 1.96 * se,
    upper = estimate + 1.96 * se,
    significant = !(lower <= 0 & upper >= 0)
  )

ggplot(estimates, aes(x = estimate,
                      y = fct_reorder(predictor, estimate),
                      color = significant)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
  geom_pointrange(aes(xmin = lower, xmax = upper),
                  size = 0.8, linewidth = 1.1) +
  scale_color_manual(values = c("TRUE" = "#2E86AB", "FALSE" = "grey60"),
                     guide = "none") +
  labs(title = "Same Model, Instantly Readable",
       subtitle = "Blue = significant at p < 0.05; grey = non-significant",
       x = "Log odds ratio (95% CI)", y = NULL) +
  theme_minimal(base_size = 12)
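
The link between the table and the plot is plain arithmetic. The sketch below reworks two rows of the table by hand in base R: the 95% Wald interval, the check for whether it excludes zero, the two-sided p-value, and the conversion from log odds ratios to odds ratios. The variable names are illustrative.

```r
# Values copied from the Smoking and Age rows of the table above
est_smoking <- 0.65
se_smoking  <- 0.10
est_age     <- 0.05
se_age      <- 0.04

# 95% Wald interval on the log-odds scale: estimate +/- 1.96 * SE
ci_smoking <- est_smoking + c(-1.96, 1.96) * se_smoking
ci_age     <- est_age     + c(-1.96, 1.96) * se_age

# An interval that excludes 0 corresponds to p < 0.05 (two-sided Wald test)
excludes_zero <- function(ci) ci[1] > 0 || ci[2] < 0
excludes_zero(ci_smoking)  # TRUE
excludes_zero(ci_age)      # FALSE

# Two-sided Wald p-value for Age
p_age <- 2 * pnorm(-abs(est_age / se_age))
round(p_age, 3)            # 0.211, matching the table

# Exponentiate to move from log odds ratios to odds ratios
round(exp(c(est_smoking, ci_smoking)), 2)  # 1.92 1.57 2.33
```

This is the "row-by-row mental math" a table demands of the reader. The plot does it once, for every row.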

Forest Plots from Real Models

Fit a logistic regression on the penguins data and plot the odds ratios directly.

library(broom)

# Fit a logistic regression: predict heavy penguin (above median mass)
model_data <- penguins |>
  drop_na() |>
  mutate(heavy = as.integer(body_mass_g > median(body_mass_g)))

fit <- glm(heavy ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
             species + sex,
           data = model_data, family = binomial)

# Tidy the model output (use Wald CIs for stability)
model_tidy <- tidy(fit, exponentiate = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    conf.low = exp(log(estimate) - 1.96 * std.error),
    conf.high = exp(log(estimate) + 1.96 * std.error),
    term = case_match(term,
      "bill_length_mm" ~ "Bill length (mm)",
      "bill_depth_mm" ~ "Bill depth (mm)",
      "flipper_length_mm" ~ "Flipper length (mm)",
      "speciesChinstrap" ~ "Chinstrap vs. Adelie",
      "speciesGentoo" ~ "Gentoo vs. Adelie",
      "sexmale" ~ "Sex (male vs. female)"
    ),
    significant = p.value < 0.05
  )

ggplot(model_tidy, aes(x = estimate, y = fct_reorder(term, estimate),
                       color = significant)) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
                  size = 0.8, linewidth = 1.1) +
  geom_text(aes(label = sprintf("OR = %.2f", estimate)),
            vjust = -1, size = 3.8, show.legend = FALSE) +
  scale_color_manual(values = c("TRUE" = "#e64173", "FALSE" = "grey60"),
                     guide = "none") +
  scale_x_log10() +
  labs(title = "Predictors of Above-Median Body Mass (Logistic Regression)",
       subtitle = "Odds ratios with 95% CI on log scale; red = p < 0.05",
       x = "Odds Ratio (log scale)", y = NULL,
       caption = "Model: glm(heavy ~ bill + flipper + species + sex, family = binomial)") +
  theme_minimal(base_size = 12)

Predicted Probability Curves

Show what your model predicts across the range of a key variable, holding the other continuous covariates at their means and the categorical covariates at a chosen level (here, Adelie females).

library(scales)

# Generate predictions across flipper length range
pred_grid <- tibble(
  flipper_length_mm = seq(170, 235, length.out = 200),
  bill_length_mm = mean(model_data$bill_length_mm),
  bill_depth_mm = mean(model_data$bill_depth_mm),
  species = "Adelie",
  sex = "female"
)

preds <- augment(fit, newdata = pred_grid, type.predict = "response",
                 se_fit = TRUE) |>
  mutate(lower = pmax(.fitted - 1.96 * .se.fit, 0),
         upper = pmin(.fitted + 1.96 * .se.fit, 1))

ggplot(preds, aes(x = flipper_length_mm, y = .fitted)) +
  geom_ribbon(aes(ymin = lower, ymax = upper),
              fill = "#2E86AB", alpha = 0.2) +
  geom_line(color = "#2E86AB", linewidth = 1.3) +
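  # Rug marks: heavy = 0 along the bottom, heavy = 1 along the top.
  # (A single rug with sides = "tb" would draw every penguin on both edges.)
  geom_rug(data = filter(model_data, heavy == 0),
           aes(x = flipper_length_mm), inherit.aes = FALSE,
           sides = "b", alpha = 0.15, color = "grey40") +
  geom_rug(data = filter(model_data, heavy == 1),
           aes(x = flipper_length_mm), inherit.aes = FALSE,
           sides = "t", alpha = 0.15, color = "grey40") +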
  scale_y_continuous(labels = label_percent()) +
  labs(title = "Predicted Probability of Above-Median Mass by Flipper Length",
       subtitle = "Logistic regression (Adelie, female); rug marks show observed data",
       x = "Flipper length (mm)", y = "Predicted probability",
       caption = "Shaded band = approximate 95% confidence interval") +
  theme_minimal(base_size = 12)

Predicted probability curves are one of the most effective ways to communicate logistic regression results to non-statisticians.
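
The interval in the code above is computed on the probability scale and then clipped to [0, 1]. A common alternative is to build the interval on the link (log-odds) scale and back-transform with plogis(), which keeps the bounds inside [0, 1] without clipping. This minimal base-R sketch uses the built-in mtcars data as a stand-in, not the penguins model; fit_demo and grid are illustrative names.

```r
# Stand-in model: transmission type (am) as a function of weight (wt)
fit_demo <- glm(am ~ wt, data = mtcars, family = binomial)

grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100))

# Predict on the link (log-odds) scale, with standard errors
link_pred <- predict(fit_demo, newdata = grid, type = "link", se.fit = TRUE)

# Build the interval on the link scale, then back-transform with plogis()
grid$fitted <- plogis(link_pred$fit)
grid$lower  <- plogis(link_pred$fit - 1.96 * link_pred$se.fit)
grid$upper  <- plogis(link_pred$fit + 1.96 * link_pred$se.fit)

# The bounds stay within [0, 1] without pmax()/pmin() clipping
range(grid$lower, grid$upper)
```

The resulting band is asymmetric around the fitted curve near 0 and 1, which reflects the shape of the logistic function rather than an artifact of truncation.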

Marginal Effects with ggeffects

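The ggeffects package automates predicted-value plots for a wide range of model classes: one line computes the predictions, and a built-in plot() method gives a quick visualization. What ggpredict() returns are adjusted predictions, that is, predicted probabilities at chosen covariate values.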

library(ggeffects)

preds <- ggpredict(fit, terms = c("flipper_length_mm [170:235 by=1]", "sex"),
                   condition = c(species = "Adelie"))

ggplot(as.data.frame(preds),
       aes(x = x, y = predicted, color = group, fill = group)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.15,
              color = NA) +
  geom_line(linewidth = 1.2) +
  scale_color_manual(values = c("female" = "#D55E00", "male" = "#0072B2")) +
  scale_fill_manual(values = c("female" = "#D55E00", "male" = "#0072B2")) +
  scale_y_continuous(labels = label_percent()) +
  labs(title = "Marginal Effect of Flipper Length by Sex (Adelie)",
       subtitle = "ggeffects automates predicted values for any model",
       x = "Flipper length (mm)", y = "P(above-median mass)",
       color = "Sex", fill = "Sex") +
  theme_minimal(base_size = 12)

The plot above is built from as.data.frame(preds), which gives you full control over the ggplot. For a quick look, the built-in plot(preds) method produces a ready-made version.

Diagnostic Plots: the `performance` + `see` Packages

The performance package (from easystats) provides model diagnostics; the see package visualizes them with ggplot2. check_model() replaces the base R plot(model) with a modern, multi-panel diagnostic dashboard.

# Fit a linear model for diagnostics demo
lm_fit <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species,
             data = drop_na(penguins))

library(performance)
library(see)

check_model(lm_fit, check = c("linearity", "normality", "qq", "homogeneity"))
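
To see what a linearity panel is built from, here is a hand-rolled residuals-versus-fitted plot in plain ggplot2. It is a sketch of the idea, not a reproduction of check_model() output, and it uses mtcars as a stand-in dataset; lm_demo, diag_df, and p_resid are illustrative names.

```r
library(ggplot2)

lm_demo <- lm(mpg ~ wt + hp, data = mtcars)

diag_df <- data.frame(
  fitted   = fitted(lm_demo),
  residual = resid(lm_demo)
)

# A roughly flat smoother around zero is consistent with linearity;
# a funnel shape would point to non-constant variance
p_resid <- ggplot(diag_df, aes(x = fitted, y = residual)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE,
              color = "#2E86AB") +
  labs(x = "Fitted values", y = "Residuals") +
  theme_minimal(base_size = 12)

p_resid
```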

Model Comparison Visualization

When comparing nested or competing models, plot the estimate for a key coefficient from each model side by side. This shows how stable an effect is as covariates are added.

# Fit competing models
m1 <- glm(heavy ~ flipper_length_mm, data = model_data, family = binomial)
m2 <- glm(heavy ~ flipper_length_mm + species, data = model_data, family = binomial)
m3 <- glm(heavy ~ flipper_length_mm + species + sex, data = model_data, family = binomial)
m4 <- fit  # full model from earlier

# Helper to get Wald CIs
tidy_wald <- function(mod, label) {
  tidy(mod, exponentiate = TRUE) |>
    mutate(conf.low = exp(log(estimate) - 1.96 * std.error),
           conf.high = exp(log(estimate) + 1.96 * std.error),
           model = label)
}

# Compare coefficients across models
models_tidy <- bind_rows(
  tidy_wald(m1, "Model 1:\nFlipper only"),
  tidy_wald(m2, "Model 2:\n+ Species"),
  tidy_wald(m3, "Model 3:\n+ Sex"),
  tidy_wald(m4, "Model 4:\n+ Bill measures")
) |>
  filter(term == "flipper_length_mm")

ggplot(models_tidy, aes(x = estimate, y = model)) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
                  color = "#2E86AB", size = 1, linewidth = 1.1) +
  geom_text(aes(label = sprintf("OR = %.2f (%.2f, %.2f)",
                                estimate, conf.low, conf.high)),
            vjust = -1.2, size = 3.8, color = "grey30") +
  scale_x_log10() +
  labs(title = "How Stable Is the Flipper Length Effect Across Models?",
       subtitle = "Odds ratio for flipper_length_mm as covariates are added",
       x = "Odds Ratio (log scale)", y = NULL,
       caption = "Stable estimates across nested models suggest robust association") +
  theme_minimal(base_size = 12)
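
The plot above tracks a single coefficient. To compare overall fit across a nested sequence, information criteria can be tabulated with base R's AIC() and BIC() and plotted the same way. This minimal sketch uses mtcars as a stand-in for the penguins models; the model labels and object names are illustrative.

```r
library(ggplot2)

# Nested stand-in models: null, then weight, then weight + horsepower
models <- list(
  "Null"         = glm(am ~ 1,       data = mtcars, family = binomial),
  "+ Weight"     = glm(am ~ wt,      data = mtcars, family = binomial),
  "+ Horsepower" = glm(am ~ wt + hp, data = mtcars, family = binomial)
)

fit_stats <- data.frame(
  model = names(models),
  aic   = sapply(models, AIC),
  bic   = sapply(models, BIC),
  row.names = NULL
)

# Keep the models in the order they were fitted, not alphabetical order
fit_stats$model <- factor(fit_stats$model, levels = fit_stats$model)

p_aic <- ggplot(fit_stats, aes(x = aic, y = model)) +
  geom_point(size = 3, color = "#2E86AB") +
  labs(x = "AIC (lower is better)", y = NULL) +
  theme_minimal(base_size = 12)

fit_stats
p_aic
```

Coefficient stability and overall fit answer different questions, so the two plots complement each other rather than substitute for one another.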

The #30DayChartChallenge

Every April, the data visualization community participates in the #30DayChartChallenge: one prompt per day, one chart per day, shared on social media. It was created in 2021 by Cédric Scherer and Dominic Royé, inspired by the #30DayMapChallenge.

Why It Matters for You

Forces you to try chart types you would never pick (waffle charts, bump charts, slope graphs, treemaps)
Builds a public portfolio of your work
Exposes you to how the global community solves the same prompt differently
Many prompts are epi-relevant: uncertainty, distributions, time series, part-to-whole, relationships

Five Categories (based on “The Graphic Continuum”)

Comparisons (days 1-6)
Distributions (days 7-12)
Relationships (days 13-18)
Time series (days 19-24)
Uncertainties (days 25-30)

Techniques You Can Steal

Prompt	Chart Type	R Package or Function
Part-to-whole	Waffle chart	`waffle`
Ranking	Bump chart	`ggbump`
Slope	Slope chart	`geom_segment()`
Circular	Polar bar	`coord_polar()`
Uncertainty	Gradient intervals	`ggdist`
Relationships	Network	`ggraph` + `tidygraph`
Neo-geometric	Voronoi	`ggforce`
Storytelling	Annotated timeline	`ggtext` + `annotate()`
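
As one example from the table, a slope chart needs no extra package: geom_segment() connects each group's value at two time points. This minimal sketch uses hypothetical data (three groups, two years); every value and name in it is made up for illustration.

```r
library(ggplot2)

# Hypothetical values for illustration only
slope_df <- data.frame(
  group = c("A", "B", "C"),
  y2020 = c(12, 18, 9),
  y2025 = c(15, 14, 10)
)

p_slope <- ggplot(slope_df) +
  # One segment per group, from the 2020 value to the 2025 value
  geom_segment(aes(x = 1, xend = 2, y = y2020, yend = y2025, color = group),
               linewidth = 1.1) +
  geom_point(aes(x = 1, y = y2020, color = group), size = 3) +
  geom_point(aes(x = 2, y = y2025, color = group), size = 3) +
  # Direct labels replace the legend
  geom_text(aes(x = 2.05, y = y2025, label = group, color = group),
            hjust = 0, fontface = "bold") +
  scale_x_continuous(breaks = c(1, 2), labels = c("2020", "2025"),
                     limits = c(0.8, 2.3)) +
  labs(x = NULL, y = "Value") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none", panel.grid.minor = element_blank())

p_slope
```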

Where to Explore

github.com/30DayChartChallenge – all editions since 2021
Search #30DayChartChallenge on Twitter/X or Mastodon
Cédric Scherer’s contributions – R code for every entry
R Graph Gallery – reproducible R code for all chart types

The 2026 edition is at github.com/30DayChartChallenge/Edition2026.

Bonus: Waffle Chart

A waffle chart is a part-to-whole alternative to the pie chart and a regular sight in the #30DayChartChallenge. Each square represents a fixed number of units (5 penguins in the example below).

library(waffle)

penguin_counts <- penguins |>
  drop_na() |>
  count(species) |>
  mutate(n_scaled = round(n / 5))  # each square = 5 penguins

waffle(
  c("Adelie" = penguin_counts$n_scaled[1],
    "Chinstrap" = penguin_counts$n_scaled[2],
    "Gentoo" = penguin_counts$n_scaled[3]),
  rows = 5,
  size = 1,
  colors = c("#FF6B35", "#A23B72", "#2E86AB"),
  title = "Palmer Penguins by Species",
  xlab = "1 square = 5 penguins"
) +
  theme(plot.title = element_text(face = "bold", size = 16),
        legend.position = "bottom")
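
If the waffle package is not available, the same idea can be sketched in plain ggplot2 with geom_tile(): lay the squares out on a grid and fill them by category. The square counts below are hard-coded, assumed values used for illustration rather than computed from the data, and all object names are hypothetical.

```r
library(ggplot2)

# Illustrative square counts per category (assumed values, not computed here)
squares <- c(Adelie = 29, Chinstrap = 14, Gentoo = 24)
n_rows  <- 5

# One row per square; fill the grid column by column, bottom to top
waffle_df <- data.frame(
  species = rep(names(squares), times = squares),
  index   = seq_len(sum(squares)) - 1
)
waffle_df$col <- waffle_df$index %/% n_rows
waffle_df$row <- waffle_df$index %% n_rows

p_waffle <- ggplot(waffle_df, aes(x = col, y = row, fill = species)) +
  geom_tile(color = "white", linewidth = 1) +
  coord_equal() +
  scale_fill_manual(values = c(Adelie = "#FF6B35", Chinstrap = "#A23B72",
                               Gentoo = "#2E86AB")) +
  labs(fill = NULL, caption = "1 square = 5 penguins") +
  theme_void(base_size = 12) +
  theme(legend.position = "bottom")

p_waffle
```

Note that rounding each count to whole squares means the squares may not sum exactly to the total divided by the unit size. State the unit in the caption so readers do not over-read single squares.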

Challenge for you: pick a prompt from the #30DayChartChallenge and create a chart using a dataset from this semester. Share it!

Summary

Principle	Practical Rule
Grammar of graphics	Build plots in layers; do not start from scratch
Show the data	Prefer jitter + summary to bar charts of means
Quantify uncertainty	CIs, error bars, halfeye plots
Position > length > area > color	Use the highest-accuracy encoding for the comparison that matters most
Direct label	Replace legends with labels on the data
Strip ruthlessly	If a gridline does not help the reader, remove it
Title is a finding	“Gentoo Penguins Are Heavier” beats “Body Mass by Species”
Visualize models	Coefficient plots and predicted probabilities beat regression tables
Iterate	Your fourth draft is better than your first

Next Lecture (April 28)

Course Review: putting the entire semester together

References and Resources

Healy, K. (2018). Data Visualization: A Practical Introduction. Princeton University Press. https://socviz.co
Wilke, C. O. (2019). Fundamentals of Data Visualization. O’Reilly. https://clauswilke.com/dataviz/
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (2nd ed.). Springer. Third edition in progress at https://ggplot2-book.org
Holtz, Y. The R Graph Gallery. https://r-graph-gallery.com
Holtz, Y. & Healy, C. From Data to Viz. https://www.data-to-viz.com
Scherer, C. Personal portfolio and tutorials. https://www.cedricscherer.com
Tufte, E. (2001). The Visual Display of Quantitative Information (2nd ed.). Graphics Press.
Cleveland, W. S., & McGill, R. (1984). Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79(387), 531-554.

No lab activity for this lecture.

Data Visualization in R

EPI 553 — Principles of Statistical Inference II

Muntasir Masum

April 24, 2026