Statistics gives us numbers. Visualization makes them mean something. A great chart can communicate more in two seconds than a regression table can in two minutes. A bad chart can mislead, distract, or simply waste your reader’s time.
Today’s lecture is about the craft of data visualization in R using ggplot2, an implementation of the grammar of graphics. We will draw on the design philosophies of three influential practitioners: Kieran Healy, Yan Holtz, and Cédric Scherer.
Today’s roadmap:
- Why visualization is a methods topic
- The grammar of graphics
- Healy’s principles: honesty, clarity, comparison
- Holtz’s chart-type decision tree
- Scherer’s editorial approach: from default to publication
- A worked transformation: same data, four iterations
- Color, type, and accessibility
- Modern techniques: highlighting, animation, interactivity, distributions, heatmaps, dumbbell charts, and more
- Visualizing regression models: coefficient plots, predicted probabilities, diagnostics
- The #30DayChartChallenge
We have spent the semester learning to estimate things: means, slopes, odds ratios, hazard ratios. Why end the course with visualization?
Three reasons.
1. Visualization is part of the analysis, not a decoration. A scatterplot reveals a non-linear relationship that a correlation coefficient hides. A residual plot exposes a violated assumption that an R² celebrates. A QQ plot finds a heavy tail that a t-test ignores. Every analysis you do should begin and end with looking at the data.
2. Visualization is how findings reach decision-makers. A clinician, a journalist, a policymaker, or a community partner is not going to read your beta coefficients. They will look at your figure. The figure is the one part of the paper that everyone reads.
3. Visualization is a discipline with its own theory. Bad charts are not just ugly; they are wrong. They make comparisons hard, they hide variation, they encode noise as signal. The grammar of graphics, perceptual research, and color theory are real things, and we should treat them the way we treat statistical theory.
“Above all else, show the data.” — Edward Tufte
Anscombe’s Quartet is the classic example: four datasets with nearly identical summary statistics (mean, variance, correlation, and fitted regression line) but radically different patterns. Without visualization, you would never know.
library(tidyverse)
anscombe_long <- anscombe |>
pivot_longer(everything(),
names_to = c(".value", "set"),
names_pattern = "(.)(.)")
ggplot(anscombe_long, aes(x, y)) +
geom_point(size = 3, color = "#2E86AB", alpha = 0.8) +
geom_smooth(method = "lm", se = FALSE, color = "#e64173", linewidth = 1) +
facet_wrap(~set, ncol = 4, scales = "free") +
labs(title = "Anscombe's Quartet",
subtitle = "Same mean, same variance, same correlation, same regression line. Very different data.") +
theme_minimal(base_size = 12)
Every analysis you do should begin and end with looking at the data.
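To see the "same statistics" claim in numbers, the following minimal sketch computes the per-set summaries using only base R and the built-in anscombe data frame. The helper name summary_tbl is illustrative, not part of the lecture code.

```r
# Base R only: `anscombe` ships with the datasets package.
# Columns are x1..x4 and y1..y4; set i pairs x<i> with y<i>.
sets <- 1:4

summary_tbl <- data.frame(
  set    = sets,
  mean_x = sapply(sets, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(sets, function(i) mean(anscombe[[paste0("y", i)]])),
  var_x  = sapply(sets, function(i) var(anscombe[[paste0("x", i)]])),
  var_y  = sapply(sets, function(i) var(anscombe[[paste0("y", i)]])),
  cor_xy = sapply(sets, function(i) cor(anscombe[[paste0("x", i)]],
                                        anscombe[[paste0("y", i)]]))
)

# One row per dataset; compare the columns down the rows
print(round(summary_tbl, 2))
```

The rows agree closely (the means and correlations match to two decimal places), which is exactly why the faceted scatterplot is needed to tell the four sets apart.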
In 1999, Leland Wilkinson published The Grammar of Graphics, which described a unified framework for thinking about statistical charts. Hadley Wickham translated this into R as ggplot2 in 2005. The grammar is the reason ggplot2 feels different from base R plotting: instead of memorizing dozens of functions, you compose plots from a small set of building blocks.
| Layer | Question | Function |
|---|---|---|
| Data | What data am I plotting? | ggplot(data = …) |
| Aesthetics | Which variables map to which visual properties (x, y, color, size, shape)? | aes(x = …, y = …, color = …) |
| Geometries | What shape do I draw (point, line, bar, smooth)? | geom_*() |
| Facets | Do I split into small multiples? | facet_wrap() / facet_grid() |
| Statistics | Do I transform the data (mean, density, smooth)? | stat_*() or geom_smooth() |
| Coordinates | What coordinate system (Cartesian, polar, log)? | coord_*() / scale_*() |
| Theme | How does it look (fonts, gridlines, background)? | theme_*() / theme() |
The genius of this framework is composability. To go from a scatterplot to a faceted scatterplot with a smoother, you add layers; you do not start over.
library(tidyverse)
library(palmerpenguins)
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
geom_point(alpha = 0.7, size = 2) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ island) +
scale_color_brewer(palette = "Dark2") +
labs(title = "Bill Dimensions of Palmer Penguins",
subtitle = "By species and island",
x = "Bill length (mm)", y = "Bill depth (mm)",
color = "Species") +
theme_minimal(base_size = 12)
Every ggplot follows the same skeleton:
ggplot(data = <DATA>,                          # 1. Data
       aes(x = <X>, y = <Y>, color = <Z>)) +   # 2. Aesthetics
  geom_<TYPE>(...) +                           # 3. Geometry
  facet_wrap(~ <VAR>) +                        # 4. Facets
  scale_<AES>_<TYPE>(...) +                    # 5. Scales
  labs(title = "...", x = "...", y = "...") +  # 6. Labels
  theme_minimal()                              # 7. Theme
You will use this template for every chart in this course and beyond.
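As one possible filled-in instance of the skeleton, the sketch below substitutes the built-in mtcars data for the placeholders. The dataset, the variable choices, and the object name p are illustrative assumptions rather than part of the lecture; note that each + sits at the end of a line so R knows the expression continues.

```r
library(ggplot2)

# Each numbered comment matches a slot in the skeleton
p <- ggplot(data = mtcars,                                # 1. Data
            aes(x = wt, y = mpg, color = factor(cyl))) +  # 2. Aesthetics
  geom_point(size = 2, alpha = 0.8) +                     # 3. Geometry
  facet_wrap(~ am, labeller = label_both) +               # 4. Facets
  scale_color_brewer(palette = "Dark2") +                 # 5. Scales
  labs(title = "Fuel Economy by Weight",                  # 6. Labels
       x = "Weight (1,000 lbs)", y = "Miles per gallon",
       color = "Cylinders") +
  theme_minimal(base_size = 12)                           # 7. Theme

p  # printing the object draws the plot
```

Storing the plot in an object before printing makes it easy to add or swap a single layer later without retyping the rest.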
In Data Visualization: A Practical Introduction (2018), Kieran Healy organizes good practice around three questions: is the chart substantively good, is it perceptually good, and is it aesthetically good?
library(broom)
penguins |>
drop_na(sex) |>
ggplot(aes(x = species, y = body_mass_g, color = sex)) +
geom_jitter(width = 0.15, alpha = 0.4, size = 1.5) +
# mean_cl_normal() needs the Hmisc package to be installed
stat_summary(fun.data = mean_cl_normal, geom = "pointrange",
position = position_dodge(width = 0.5),
size = 0.8, linewidth = 1) +
scale_color_manual(values = c("female" = "#D55E00", "male" = "#0072B2")) +
labs(title = "Body Mass by Species and Sex",
subtitle = "Individual observations with mean and 95% CI",
x = NULL, y = "Body mass (g)", color = "Sex") +
theme_minimal(base_size = 12)
Human perception is not uniform across visual encodings. Cleveland and McGill (1984) ranked encodings by accuracy:
| Rank | Encoding | Use for |
|---|---|---|
| 1 | Position on a common scale | Most quantitative comparisons |
| 2 | Position on identical but non-aligned scales | Small multiples (faceting) |
| 3 | Length | Bar charts |
| 4 | Angle / slope | Pie chart slices (use sparingly) |
| 5 | Area | Bubble charts (use cautiously) |
| 6 | Volume | Almost never |
| 7 | Color hue | Categorical groups |
| 8 | Color saturation | Sequential / diverging variables |
Practical implication: Prefer dot plots and bar charts to pie charts. Prefer faceting to stacking. Use color for categories, not for quantities (unless you use a perceptually uniform scale like viridis).
Healy’s third pillar is what most beginners notice last but readers notice first: typography, white space, alignment, color harmony. A clean theme, a sans-serif font, restrained gridlines, and direct labeling will make a competent chart feel professional.
Yan Holtz built two enormously useful resources: The R Graph Gallery (https://r-graph-gallery.com) and From Data to Viz (https://www.data-to-viz.com). The latter is a decision tree: tell me what kind of data you have, and I will tell you which chart types make sense.
| Data type | Good chart types | Avoid |
|---|---|---|
| One numeric variable | Histogram, density plot, boxplot, violin | Pie chart for distributions |
| One categorical variable | Bar chart, lollipop, treemap | 3D pie chart, donut |
| Two numeric variables | Scatterplot, hexbin, 2D density | 3D scatter, dual-axis |
| Numeric × categorical | Boxplot, violin, ridge plot, jitter + summary | Bar with no error |
| Two categorical variables | Mosaic, heatmap, grouped bar | Stacked bars when comparing groups |
| Time series | Line chart, area chart, slope chart | Bar charts of time series |
| Map data | Choropleth, dot map, cartogram | Choropleth without normalizing by area or population |
| Network / hierarchy | Network graph, dendrogram, sunburst | Force-directed network with > 200 nodes |
Pie charts encode quantities as angles, which readers judge less accurately than position or length. Worse, they make comparisons across groups nearly impossible. A dot plot or a horizontal bar chart almost always tells the same story more clearly.
library(patchwork)
dat <- tibble(group = LETTERS[1:6],
value = c(22, 18, 17, 16, 14, 13))
p1 <- ggplot(dat, aes(x = "", y = value, fill = group)) +
geom_col(width = 1) +
coord_polar("y") +
scale_fill_brewer(palette = "Set2") +
labs(title = "Pie chart") +
theme_void()
p2 <- ggplot(dat, aes(x = value, y = fct_reorder(group, value))) +
geom_col(fill = "steelblue") +
geom_text(aes(label = value), hjust = -0.2) +
scale_x_continuous(expand = expansion(mult = c(0, 0.1))) +
labs(title = "Bar chart (sorted)", x = NULL, y = NULL) +
theme_minimal()
p1 + p2
Which one lets you instantly say which group is third largest?
Cédric Scherer takes ggplot2 beyond the defaults and into the territory of editorial design — the kind of charts you see in The New York Times, Reuters, and Our World in Data. His work demonstrates that ggplot2 can produce publication-grade graphics with no post-processing, if you commit to learning the theme system.
- ggtext: markdown and HTML inside titles, subtitles, and labels
- patchwork: composing multi-panel figures
- ggrepel: non-overlapping text labels
- showtext / sysfonts: custom fonts (Google Fonts in your charts)
- MetBrewer, paletteer, scico: color palettes from artists and scientific colormaps
- ggdist: visualizing uncertainty and distributions
penguin_means <- penguins |>
drop_na() |>
group_by(species, year) |>
summarise(mass = mean(body_mass_g), .groups = "drop")
ggplot(penguin_means, aes(x = year, y = mass, color = species)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3) +
ggrepel::geom_text_repel(
data = filter(penguin_means, year == max(year)),
aes(label = species),
hjust = 0, nudge_x = 0.1, direction = "y", segment.color = NA
) +
scale_color_manual(values = c("Adelie" = "#FF6B35",
"Chinstrap" = "#A23B72",
"Gentoo" = "#2E86AB")) +
scale_x_continuous(breaks = 2007:2009, expand = expansion(mult = c(0.05, 0.25))) +
labs(title = "Average Body Mass of Palmer Penguins, 2007-2009",
subtitle = "Direct labels replace the legend; gridlines are softened",
x = NULL, y = "Body mass (g)",
caption = "Source: palmerpenguins R package") +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "grey40"),
plot.caption = element_text(color = "grey60", size = 9)
)
Same data. Four versions. Watch the evolution.
This is honest. It is also forgettable.
ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.4)
Now we can see how many penguins are in each species and where the outliers actually live.
ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA, color = "grey20") +
geom_jitter(width = 0.2, alpha = 0.4, color = "grey30") +
scale_fill_manual(values = c("#FF6B35", "#A23B72", "#2E86AB")) +
labs(title = "Body Mass of Palmer Penguins by Species",
x = NULL, y = "Body mass (g)") +
theme_minimal(base_size = 12) +
theme(legend.position = "none")
library(ggdist)
ggplot(drop_na(penguins, body_mass_g),
aes(x = species, y = body_mass_g, fill = species)) +
stat_halfeye(adjust = 0.5, width = 0.6, .width = 0,
justification = -0.3, point_color = NA) +
geom_boxplot(width = 0.15, outlier.shape = NA, alpha = 0.7) +
geom_jitter(width = 0.05, alpha = 0.3, size = 1.2) +
scale_fill_manual(values = c("#FF6B35", "#A23B72", "#2E86AB")) +
coord_cartesian(xlim = c(1.2, NA), clip = "off") +
labs(title = "Gentoo Penguins Are Substantially Heavier",
subtitle = "Distribution, median, and individual observations of body mass by species",
x = NULL, y = "Body mass (g)",
caption = "Source: palmerpenguins R package") +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "grey40"),
plot.caption = element_text(color = "grey60", size = 9)
)
The final chart shows the distribution (half-eye density), the summary (boxplot), the raw data (jitter), and a conclusion baked into the title. That is the difference between a chart and a finding.
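- Categorical (qualitative) data: Set2, Dark2, the Okabe-Ito palette.
- Sequential data: viridis, mako, Blues.
- Diverging data: RdBu, BrBG. Use only when there is a meaningful zero.
- Check for colorblind safety with colorBlindness::cvdPlot() or prismatic::clr_protan().
okabe_ito <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442",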
"#0072B2", "#D55E00", "#CC79A7", "#000000")
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
geom_point(size = 2, alpha = 0.8) +
scale_color_manual(values = okabe_ito) +
labs(title = "Okabe-Ito Palette: Designed for Colorblind Accessibility") +
theme_minimal()
About 8% of men and 0.5% of women have some form of color vision deficiency. Here is a side-by-side comparison of a problematic palette versus a safe one:
library(patchwork)
p_bad <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
geom_point(size = 2.5, alpha = 0.8) +
scale_color_manual(values = c("red", "green", "blue")) +
labs(title = "Red-Green-Blue: hard to tell apart for ~8% of men",
x = "Bill length (mm)", y = "Bill depth (mm)") +
theme_minimal(base_size = 12)
p_good <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
geom_point(size = 2.5, alpha = 0.8) +
scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73")) +
labs(title = "Okabe-Ito: safe for all viewers",
x = "Bill length (mm)", y = "Bill depth (mm)") +
theme_minimal(base_size = 12)
p_bad + p_good +
plot_layout(guides = "collect") &
theme(legend.position = "bottom")
Rule of thumb: use viridis-family palettes by default. Check with colorBlindness::cvdPlot() or coblis.com.
Whenever possible, put the label next to the thing it labels. Legends force the reader’s eye to bounce between the chart and a key. Direct labels eliminate that bounce.
Before you submit a chart, ask:
The first seven parts gave you a foundation. This section is the “show me what’s possible” tour. These techniques are what separate a competent ggplot2 user from someone whose figures get retweeted.
A common problem: you have many groups but only a few matter to your story. Faceting is one answer; highlighting is the other. The gghighlight package fades out everything except the lines or points you want the reader to focus on.
library(gghighlight)
gapminder_like <- penguins |>
drop_na() |>
group_by(species, year) |>
summarise(mean_mass = mean(body_mass_g), .groups = "drop")
ggplot(gapminder_like, aes(x = year, y = mean_mass, color = species)) +
geom_line(linewidth = 1.4) +
geom_point(size = 3) +
gghighlight(species == "Gentoo",
unhighlighted_params = list(color = "grey80", linewidth = 0.8)) +
scale_color_manual(values = c("Gentoo" = "#2E86AB")) +
labs(title = "Gentoo Penguins Stand Out",
subtitle = "Other species are visible but recede into the background",
x = NULL, y = "Mean body mass (g)") +
theme_minimal(base_size = 12) +
theme(legend.position = "none")
Why this matters: the eye is drawn instantly to the highlighted line, but the reader still has the full context of how Gentoo compares to the other species. This is far more powerful than deleting the other groups.
Time-series and longitudinal data come alive when animated. The gganimate package extends ggplot2 with a small set of transition_*(), enter_*(), and ease_aes() functions. The result is a GIF or MP4 that can be embedded in slide decks, dashboards, or HTML reports.
library(gganimate)
library(gapminder)
p <- ggplot(gapminder,
aes(x = gdpPercap, y = lifeExp,
size = pop, color = continent)) +
geom_point(alpha = 0.7) +
scale_x_log10() +
scale_size(range = c(2, 12), guide = "none") +
scale_color_brewer(palette = "Set2") +
labs(title = "Year: {frame_time}",
subtitle = "Hans Rosling's classic visualization, animated",
x = "GDP per capita (log scale)",
y = "Life expectancy (years)",
color = "Continent") +
theme_minimal(base_size = 14) +
transition_time(year) +
ease_aes("cubic-in-out")
animate(p, nframes = 100, fps = 10, width = 800, height = 500,
renderer = gifski_renderer())
anim_save("gapminder.gif")
Three key animation transitions:
| Function | What it animates |
|---|---|
| transition_time() | A continuous variable (e.g., year, day) |
| transition_states() | A discrete variable, with pauses on each state |
| transition_reveal() | Reveals the data progressively along an axis |
Tip from Cédric Scherer: animations should serve a narrative purpose. If the same insight is clearer in a static small-multiple, prefer the static version. Animation costs the reader attention; spend it deliberately.
For HTML reports, dashboards, and Shiny apps, interactivity lets the reader explore. Two approaches, plotly and ggiraph:
- plotly::ggplotly(): instant interactivity from any ggplot
- ggiraph: finer control, supports animation, click events, and Shiny
library(ggiraph)
p <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g, color = species)) +
geom_point_interactive(
aes(tooltip = paste("Species:", species, "<br>Island:", island),
data_id = species),
size = 2.5
) +
scale_color_brewer(palette = "Dark2") +
theme_minimal()
girafe(ggobj = p, options = list(opts_hover(css = "stroke:black;stroke-width:2px;")))
Boxplots hide bimodality. Histograms hide group differences. Modern distribution geoms from ggdist and ggridges show shape, summary, and uncertainty in one chart.
A raincloud plot (built here with ggdist) combines a half-violin (the cloud), a boxplot (the umbrella), and jittered raw points (the rain). It is one of the most informative ways to show a distribution.
library(ggdist)
ggplot(drop_na(penguins, body_mass_g),
aes(x = body_mass_g, y = species, fill = species)) +
stat_halfeye(adjust = 0.6, .width = 0, justification = -0.2,
point_color = NA, alpha = 0.7) +
geom_boxplot(width = 0.15, outlier.shape = NA, alpha = 0.5) +
geom_jitter(height = 0.07, alpha = 0.3, size = 1.2) +
scale_fill_manual(values = c("#FF6B35", "#A23B72", "#2E86AB")) +
labs(title = "Raincloud Plot of Penguin Body Mass",
subtitle = "Distribution + boxplot + raw observations",
x = "Body mass (g)", y = NULL) +
theme_minimal(base_size = 12) +
theme(legend.position = "none")
Use ridgeline plots (ggridges) when you want to compare distributions across many groups in a small space.
library(ggridges)
ggplot(diamonds, aes(x = price, y = cut, fill = cut)) +
geom_density_ridges(alpha = 0.8, scale = 1.1) +
scale_x_log10(labels = scales::dollar) +
scale_fill_viridis_d(option = "mako") +
labs(title = "Distribution of Diamond Prices by Cut",
subtitle = "Ridge plots make many distributions comparable",
x = "Price (log scale)", y = NULL) +
theme_minimal(base_size = 12) +
theme(legend.position = "none")
Statisticians love confidence intervals. Readers often miss them because they look like skinny error bars on top of bold point estimates. ggdist’s slab-and-interval stats, such as stat_halfeye() with the xdist aesthetic (the older stat_dist_*() functions are deprecated), let you visualize the whole posterior or sampling distribution.
library(ggdist)
set.seed(1220)
estimates <- tibble(
predictor = c("Smoking", "Exercise", "Income", "Sleep", "Age"),
estimate = c(0.65, -0.45, -0.20, -0.30, 0.05),
se = c(0.10, 0.08, 0.07, 0.09, 0.04)
)
ggplot(estimates, aes(y = fct_reorder(predictor, estimate),
xdist = distributional::dist_normal(estimate, se))) +
stat_halfeye(.width = c(0.5, 0.95), fill = "#2E86AB",
slab_alpha = 0.6, point_size = 3) +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey40") +
labs(title = "Coefficient Plot with Full Sampling Distributions",
subtitle = "Thick band = 50% interval, thin line = 95% interval",
x = "Log odds ratio", y = NULL) +
theme_minimal(base_size = 12)
This is far more honest than a forest plot of point estimates with whiskers, because the reader can see the shape of the uncertainty.
A single panel rarely tells a complete story. The patchwork package lets you compose multi-panel figures with arithmetic-like syntax: p1 + p2, p1 / p2, (p1 | p2) / p3.
library(patchwork)
p_scatter <- ggplot(penguins, aes(bill_length_mm, body_mass_g, color = species)) +
geom_point(alpha = 0.7) +
scale_color_manual(values = c("#FF6B35", "#A23B72", "#2E86AB")) +
theme_minimal(base_size = 11) +
theme(legend.position = "none") +
labs(x = "Bill length (mm)", y = "Body mass (g)")
p_top <- ggplot(penguins, aes(bill_length_mm, fill = species)) +
geom_density(alpha = 0.6) +
scale_fill_manual(values = c("#FF6B35", "#A23B72", "#2E86AB")) +
theme_void() +
theme(legend.position = "none")
p_right <- ggplot(penguins, aes(body_mass_g, fill = species)) +
geom_density(alpha = 0.6) +
scale_fill_manual(values = c("#FF6B35", "#A23B72", "#2E86AB")) +
coord_flip() +
theme_void() +
theme(legend.position = "none")
(p_top + plot_spacer() + p_scatter + p_right +
plot_layout(ncol = 2, widths = c(4, 1), heights = c(1, 4))) +
plot_annotation(
title = "Marginal Density + Scatterplot",
subtitle = "Composed with patchwork",
theme = theme(plot.title = element_text(face = "bold", size = 14))
)
The single fastest way to make a chart “editorial” is to annotate the finding directly on the chart. ggtext lets you use markdown and HTML in titles, subtitles, and labels. geom_curve() and annotate() let you point at things.
library(ggtext)
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point(alpha = 0.7, size = 2.2) +
annotate("curve", x = 175, y = 5500, xend = 187, yend = 5000,
arrow = arrow(length = unit(0.2, "cm")),
curvature = -0.3, color = "grey30") +
annotate("text", x = 173, y = 5650,
label = "Gentoos cluster\nin the heavy / long-flipper corner",
hjust = 0, size = 3.4, color = "grey20", lineheight = 0.95) +
scale_color_manual(values = c("Adelie" = "#FF6B35",
"Chinstrap" = "#A23B72",
"Gentoo" = "#2E86AB")) +
labs(title = "Body mass scales with flipper length, but **species matter**",
subtitle = "Highlighted with a callout instead of a legend explainer",
x = "Flipper length (mm)", y = "Body mass (g)",
color = NULL) +
theme_minimal(base_size = 12) +
theme(plot.title = element_markdown(face = "bold", size = 14),
legend.position = "top")
Public health data are almost always spatial. The sf package gives ggplot2 first-class support for shapefiles, projections, and choropleths.
library(sf)
library(tigris)
options(tigris_use_cache = TRUE)
ny_counties <- counties(state = "NY", cb = TRUE, class = "sf")
# Hypothetical FMD prevalence by county
ny_counties$fmd_prev <- runif(nrow(ny_counties), 8, 18)
ggplot(ny_counties) +
geom_sf(aes(fill = fmd_prev), color = "white", linewidth = 0.2) +
scale_fill_viridis_c(option = "rocket", direction = -1,
name = "FMD %") +
labs(title = "Frequent Mental Distress Prevalence by NY County",
subtitle = "Hypothetical data for demonstration") +
theme_void(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
Sometimes the right chart is a table. The gt package and its extension gtExtras let you build editorial-quality tables with inline sparklines, color-graded cells, and embedded plots.
library(gt)
library(gtExtras)
penguins |>
drop_na() |>
group_by(species) |>
summarise(
n = n(),
mean_mass = mean(body_mass_g),
masses = list(body_mass_g),
.groups = "drop"
) |>
gt() |>
gt_plt_dist(masses, type = "density", fill_color = "#2E86AB") |>
fmt_number(mean_mass, decimals = 0) |>
cols_label(species = "Species", n = "N",
mean_mass = "Mean Mass (g)", masses = "Distribution") |>
tab_header(title = md("**Penguin Body Mass by Species**"),
subtitle = "With inline density plots") |>
gt_theme_538()
Rendered table: Penguin Body Mass by Species (with inline density plots)
| Species | N | Mean Mass (g) | Distribution |
|---|---|---|---|
| Adelie | 146 | 3,706 | (density plot) |
| Chinstrap | 68 | 3,733 | (density plot) |
| Gentoo | 119 | 5,092 | (density plot) |
Heatmaps turn a matrix of numbers into a pattern you can see. Essential for correlation tables, confusion matrices, and time-by-group summaries.
# Correlation matrix of numeric penguin variables
cor_data <- penguins |>
drop_na() |>
select(where(is.numeric)) |>
cor() |>
as.data.frame() |>
rownames_to_column("var1") |>
pivot_longer(-var1, names_to = "var2", values_to = "cor")
ggplot(cor_data, aes(x = var1, y = var2, fill = cor)) +
geom_tile(color = "white", linewidth = 0.8) +
geom_text(aes(label = round(cor, 2),
color = abs(cor) > 0.6),
size = 4.5, fontface = "bold") +
scale_fill_gradient2(low = "#D55E00", mid = "white", high = "#0072B2",
midpoint = 0, limits = c(-1, 1),
name = "Correlation") +
scale_color_manual(values = c("TRUE" = "white", "FALSE" = "grey20"),
guide = "none") +
scale_x_discrete(labels = \(x) str_replace_all(x, "_", "\n")) +
scale_y_discrete(labels = \(x) str_replace_all(x, "_", "\n")) +
labs(title = "Correlation Heatmap of Penguin Measurements",
subtitle = "Color intensity encodes strength; text encodes exact value",
x = NULL, y = NULL) +
coord_fixed() +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 10),
axis.text.y = element_text(size = 10),
panel.grid = element_blank())
When you need to show the difference between two time points or conditions, dumbbell charts are more effective than grouped bars. They encode direction, magnitude, and rank simultaneously.
# Simulated epi example: disease rates before/after intervention
intervention_data <- tibble(
county = c("Albany", "Saratoga", "Rensselaer", "Schenectady",
"Columbia", "Greene", "Warren", "Washington"),
before = c(15.2, 12.8, 18.1, 16.5, 11.3, 14.7, 9.8, 13.2),
after = c(11.1, 10.2, 12.4, 13.8, 9.5, 11.0, 8.1, 10.9)
) |>
mutate(change = after - before,
county = fct_reorder(county, change))
ggplot(intervention_data) +
geom_segment(aes(x = before, xend = after,
y = county, yend = county),
color = "grey60", linewidth = 1.2) +
geom_point(aes(x = before, y = county), color = "#D55E00",
size = 4) +
geom_point(aes(x = after, y = county), color = "#0072B2",
size = 4) +
annotate("text", x = 19, y = 8.3, label = "Before",
color = "#D55E00", fontface = "bold", size = 4.5) +
annotate("text", x = 19, y = 7.7, label = "After",
color = "#0072B2", fontface = "bold", size = 4.5) +
labs(title = "Every County Improved After the Intervention",
subtitle = "Rate per 1,000 population, before vs. after community health program",
x = "Rate per 1,000", y = NULL) +
scale_x_continuous(limits = c(7, 20)) +
theme_minimal(base_size = 12)
Once you have a look you like, package it as a function and reuse it across every chart in your manuscript or thesis. This is what professional outlets do.
theme_epi553 <- function(base_size = 12) {
theme_minimal(base_size = base_size) +
theme(
plot.title = element_text(face = "bold", size = base_size + 2,
margin = margin(b = 4)),
plot.subtitle = element_text(color = "grey40",
margin = margin(b = 12)),
plot.caption = element_text(color = "grey60", size = base_size - 3,
hjust = 0, margin = margin(t = 12)),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "grey92"),
axis.title = element_text(color = "grey30"),
axis.text = element_text(color = "grey30"),
strip.text = element_text(face = "bold", color = "grey20"),
legend.position = "top",
legend.title = element_text(face = "bold", size = base_size - 1),
plot.title.position = "plot",
plot.caption.position = "plot"
)
}
ggplot(penguins, aes(bill_length_mm, body_mass_g, color = species)) +
geom_point(alpha = 0.8, size = 2.2) +
scale_color_manual(values = c("#FF6B35", "#A23B72", "#2E86AB")) +
labs(title = "Reusable EPI 553 Theme in Action",
subtitle = "Define once, apply everywhere",
x = "Bill length (mm)", y = "Body mass (g)", color = "Species",
caption = "Source: palmerpenguins") +
theme_epi553()
Practical tip: drop theme_epi553() (or whatever you call yours) into an R/themes.R file in your project. Source it from every analysis. Your figures will look consistent without effort.
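One way to apply a project theme without adding it to every plot is ggplot2's theme_set(), sketched below. The function theme_project() is a hypothetical stand-in for whatever theme you keep in a file such as R/themes.R; in a real project you would load it with source() instead of defining it inline.

```r
library(ggplot2)

# Hypothetical stand-in for a project theme. In practice this definition
# would live in a file such as R/themes.R and be loaded with source().
theme_project <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(plot.title = element_text(face = "bold"),
          panel.grid.minor = element_blank())
}

# theme_set() makes the theme the session default and returns the old one
old_theme <- theme_set(theme_project())

# No explicit theme call needed: the default now applies
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  labs(title = "Styled by the session default theme")
p

# Restore the previous default when you are done
theme_set(old_theme)
```

Setting the default once at the top of a script keeps individual plots short, while an explicit theme call on a single plot still overrides it when needed.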
Here is the workflow used by the data viz teams at The Pudding, FiveThirtyEight, and The Economist:
- Export with ggsave(), using dpi = 300 for print and dpi = 96 for web. Use SVG for vector output.
ggsave("figure_1.png", width = 8, height = 5, dpi = 300, bg = "white")
ggsave("figure_1.svg", width = 8, height = 5) # vector for editorial; requires the svglite package
ggsave("figure_1.pdf", width = 8, height = 5, device = cairo_pdf)
You have spent the semester building models. Now make them visible.
A regression table is a wall of numbers. A coefficient plot communicates direction, magnitude, uncertainty, and significance in one glance. A table requires row-by-row mental math.
Consider this table:
| Term | Estimate | SE | p |
|---|---|---|---|
| (Intercept) | -1.23 | 0.41 | 0.003 |
| Smoking | 0.65 | 0.10 | <0.001 |
| Exercise | -0.45 | 0.08 | <0.001 |
| Income | -0.20 | 0.07 | 0.004 |
| Sleep | -0.30 | 0.09 | 0.001 |
| Age | 0.05 | 0.04 | 0.211 |
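The columns of such a table are linked by simple arithmetic, and working through it shows what the reader is being asked to do mentally. The sketch below assumes the estimates are log odds ratios tested with Wald z-tests (an assumption; the table itself does not state the scale) and uses base R only. The data frame name coefs is illustrative.

```r
# Estimates and standard errors copied from the table (intercept omitted)
coefs <- data.frame(
  term     = c("Smoking", "Exercise", "Income", "Sleep", "Age"),
  estimate = c(0.65, -0.45, -0.20, -0.30, 0.05),
  se       = c(0.10, 0.08, 0.07, 0.09, 0.04)
)

# Wald z statistic and two-sided p-value (assumed test; see lead-in)
coefs$z <- coefs$estimate / coefs$se
coefs$p <- 2 * pnorm(-abs(coefs$z))

# Odds ratio and 95% CI: build the interval on the log scale,
# then exponentiate the endpoints
coefs$or      <- exp(coefs$estimate)
coefs$or_low  <- exp(coefs$estimate - 1.96 * coefs$se)
coefs$or_high <- exp(coefs$estimate + 1.96 * coefs$se)

print(cbind(coefs["term"], round(coefs[, -1], 3)))
```

Under these assumptions the computed p-values line up with the table's p column, and only the interval for Age includes an odds ratio of 1.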
Now compare it to the visual version:
library(ggdist)
library(distributional)
estimates <- tibble(
predictor = c("Smoking", "Exercise", "Income", "Sleep", "Age"),
estimate = c(0.65, -0.45, -0.20, -0.30, 0.05),
se = c(0.10, 0.08, 0.07, 0.09, 0.04)
) |>
mutate(
lower = estimate - 1.96 * se,
upper = estimate + 1.96 * se,
significant = !(lower <= 0 & upper >= 0)
)
ggplot(estimates, aes(x = estimate,
y = fct_reorder(predictor, estimate),
color = significant)) +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
geom_pointrange(aes(xmin = lower, xmax = upper),
size = 0.8, linewidth = 1.1) +
scale_color_manual(values = c("TRUE" = "#2E86AB", "FALSE" = "grey60"),
guide = "none") +
labs(title = "Same Model, Instantly Readable",
subtitle = "Blue = significant at p < 0.05; grey = non-significant",
x = "Log odds ratio (95% CI)", y = NULL) +
theme_minimal(base_size = 12)
Fit a logistic regression on the penguins data and plot the odds ratios directly.
library(broom)
# Fit a logistic regression: predict heavy penguin (above median mass)
model_data <- penguins |>
drop_na() |>
mutate(heavy = as.integer(body_mass_g > median(body_mass_g)))
fit <- glm(heavy ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
species + sex,
data = model_data, family = binomial)
# Tidy the model output (use Wald CIs for stability)
model_tidy <- tidy(fit, exponentiate = TRUE) |>
filter(term != "(Intercept)") |>
mutate(
conf.low = exp(log(estimate) - 1.96 * std.error),
conf.high = exp(log(estimate) + 1.96 * std.error),
term = case_match(term,
"bill_length_mm" ~ "Bill length (mm)",
"bill_depth_mm" ~ "Bill depth (mm)",
"flipper_length_mm" ~ "Flipper length (mm)",
"speciesChinstrap" ~ "Chinstrap vs. Adelie",
"speciesGentoo" ~ "Gentoo vs. Adelie",
"sexmale" ~ "Sex (male vs. female)"
),
significant = p.value < 0.05
)
ggplot(model_tidy, aes(x = estimate, y = fct_reorder(term, estimate),
color = significant)) +
geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
size = 0.8, linewidth = 1.1) +
geom_text(aes(label = sprintf("OR = %.2f", estimate)),
vjust = -1, size = 3.8, show.legend = FALSE) +
scale_color_manual(values = c("TRUE" = "#e64173", "FALSE" = "grey60"),
guide = "none") +
scale_x_log10() +
labs(title = "Predictors of Above-Median Body Mass (Logistic Regression)",
subtitle = "Odds ratios with 95% CI on log scale; red = p < 0.05",
x = "Odds Ratio (log scale)", y = NULL,
caption = "Model: glm(heavy ~ bill + flipper + species + sex, family = binomial)") +
theme_minimal(base_size = 12)
Show what your model predicts across the range of a key variable, holding others at their means.
library(scales)
# Generate predictions across flipper length range
pred_grid <- tibble(
flipper_length_mm = seq(170, 235, length.out = 200),
bill_length_mm = mean(model_data$bill_length_mm),
bill_depth_mm = mean(model_data$bill_depth_mm),
species = "Adelie",
sex = "female"
)
preds <- augment(fit, newdata = pred_grid, type.predict = "response",
se_fit = TRUE) |>
mutate(lower = pmax(.fitted - 1.96 * .se.fit, 0),
upper = pmin(.fitted + 1.96 * .se.fit, 1))
ggplot(preds, aes(x = flipper_length_mm, y = .fitted)) +
geom_ribbon(aes(ymin = lower, ymax = upper),
fill = "#2E86AB", alpha = 0.2) +
geom_line(color = "#2E86AB", linewidth = 1.3) +
geom_rug(data = model_data,
aes(x = flipper_length_mm, y = heavy),
sides = "tb", alpha = 0.15, color = "grey40") +
scale_y_continuous(labels = label_percent()) +
labs(title = "Predicted Probability of Above-Median Mass by Flipper Length",
subtitle = "Logistic regression (Adelie, female); rug marks show observed data",
x = "Flipper length (mm)", y = "Predicted probability",
caption = "Shaded band = approximate 95% confidence interval") +
theme_minimal(base_size = 12)

Predicted probability curves are one of the most effective ways to communicate logistic regression results to non-statisticians.
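The band above is computed on the probability scale and clipped to [0, 1], which is why the caption calls it approximate. One common alternative is to build the interval on the link (log-odds) scale and back-transform it with plogis(), so the bounds stay inside (0, 1) without clipping. A minimal sketch, assuming a stand-in model on the built-in mtcars data (demo_fit and demo_grid are hypothetical names, not objects from this lecture):

```r
# Illustrative logistic model on built-in data
# (assumption: a stand-in, not the lecture's penguin model)
demo_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Prediction grid across the observed range of the predictor
demo_grid <- data.frame(
  wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100)
)

# Predict on the link (log-odds) scale, with standard errors
link_pred <- predict(demo_fit, newdata = demo_grid,
                     type = "link", se.fit = TRUE)

# Build the Wald interval on the link scale, then back-transform
demo_grid$fitted <- plogis(link_pred$fit)
demo_grid$lower  <- plogis(link_pred$fit - 1.96 * link_pred$se.fit)
demo_grid$upper  <- plogis(link_pred$fit + 1.96 * link_pred$se.fit)

head(demo_grid)
```

The fitted, lower, and upper columns can be passed to geom_line() and geom_ribbon() in the same way as in the plot above.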
The ggeffects package automates predicted value plots
for many model classes. One line gets the predicted values, and a built-in
plot() method gives a quick visualization.
library(ggeffects)
preds <- ggpredict(fit, terms = c("flipper_length_mm [170:235 by=1]", "sex"),
condition = c(species = "Adelie"))
ggplot(as.data.frame(preds),
aes(x = x, y = predicted, color = group, fill = group)) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.15,
color = NA) +
geom_line(linewidth = 1.2) +
scale_color_manual(values = c("female" = "#D55E00", "male" = "#0072B2")) +
scale_fill_manual(values = c("female" = "#D55E00", "male" = "#0072B2")) +
scale_y_continuous(labels = label_percent()) +
labs(title = "Predicted Probability by Flipper Length and Sex (Adelie)",
subtitle = "ggeffects automates predicted values for many model classes",
x = "Flipper length (mm)", y = "P(above-median mass)",
color = "Sex", fill = "Sex") +
theme_minimal(base_size = 12)

The example above extracts the predictions with as.data.frame(preds) to build a custom ggplot. For a quick default figure, call plot(preds) instead.
performance + see Packages

The performance package (from easystats) provides model diagnostics; see visualizes them with ggplot2.
check_model() replaces the base R plot(model)
with a modern, multi-panel diagnostic dashboard.
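For comparison, here is a rough sketch of the base R diagnostics that check_model() stands in for. It assumes a stand-in linear model on the built-in mtcars data (fit_demo and diag_df are hypothetical names), not the penguin model fitted in this lecture:

```r
# Stand-in linear model on built-in data (assumption: for illustration only)
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# The quantities behind the usual diagnostic panels
diag_df <- data.frame(
  fitted    = fitted(fit_demo),     # x-axis of the residuals-vs-fitted panel
  resid     = resid(fit_demo),      # raw residuals
  std_resid = rstandard(fit_demo)   # standardized residuals, used in the QQ plot
)

# Base R panels: residuals vs. fitted (linearity) and normal QQ (normality)
op <- par(mfrow = c(1, 2))
plot(fit_demo, which = 1:2)
par(op)
```

Building diag_df by hand is optional; it only makes explicit which quantities each panel displays.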
# Fit a linear model for diagnostics demo
lm_fit <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species,
data = drop_na(penguins))
library(performance)
library(see)
check_model(lm_fit, check = c("linearity", "normality", "qq", "homogeneity"))

When comparing nested or competing models, visualize the coefficient estimates side by side. This shows how stable an effect is as covariates are added.
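One detail worth spelling out for this comparison: tidy(exponentiate = TRUE) exponentiates the estimate but leaves std.error on the log-odds scale, so a Wald interval has to be built on the log scale and then exponentiated. The sketch below does the same arithmetic in base R on simulated data. The variables x, z, y and the helper wald_or() are hypothetical and only stand in for the penguin models:

```r
set.seed(1)
n <- 400
x <- rnorm(n)               # predictor of interest
z <- 0.6 * x + rnorm(n)     # covariate correlated with x
y <- rbinom(n, 1, plogis(0.8 * x + 0.7 * z))
sim <- data.frame(x, z, y)

# Wald interval for one odds ratio, built on the log-odds scale
wald_or <- function(mod, term, label) {
  est <- coef(summary(mod))[term, "Estimate"]     # log-odds estimate
  se  <- coef(summary(mod))[term, "Std. Error"]   # SE on the log-odds scale
  data.frame(model     = label,
             or        = exp(est),
             conf.low  = exp(est - 1.96 * se),
             conf.high = exp(est + 1.96 * se))
}

# Track the odds ratio for x as a covariate is added
or_table <- rbind(
  wald_or(glm(y ~ x,     data = sim, family = binomial), "x", "x only"),
  wald_or(glm(y ~ x + z, data = sim, family = binomial), "x", "x + z")
)
or_table
```

Each row of or_table corresponds to one point and interval in a coefficient-stability plot.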
# Fit competing models
m1 <- glm(heavy ~ flipper_length_mm, data = model_data, family = binomial)
m2 <- glm(heavy ~ flipper_length_mm + species, data = model_data, family = binomial)
m3 <- glm(heavy ~ flipper_length_mm + species + sex, data = model_data, family = binomial)
m4 <- fit # full model from earlier
# Helper to get Wald CIs
tidy_wald <- function(mod, label) {
tidy(mod, exponentiate = TRUE) |>
mutate(conf.low = exp(log(estimate) - 1.96 * std.error),
conf.high = exp(log(estimate) + 1.96 * std.error),
model = label)
}
# Compare coefficients across models
models_tidy <- bind_rows(
tidy_wald(m1, "Model 1:\nFlipper only"),
tidy_wald(m2, "Model 2:\n+ Species"),
tidy_wald(m3, "Model 3:\n+ Sex"),
tidy_wald(m4, "Model 4:\n+ Bill measures")
) |>
filter(term == "flipper_length_mm")
ggplot(models_tidy, aes(x = estimate, y = model)) +
geom_vline(xintercept = 1, linetype = "dashed", color = "grey50") +
geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
color = "#2E86AB", size = 1, linewidth = 1.1) +
geom_text(aes(label = sprintf("OR = %.2f (%.2f, %.2f)",
estimate, conf.low, conf.high)),
vjust = -1.2, size = 3.8, color = "grey30") +
scale_x_log10() +
labs(title = "How Stable Is the Flipper Length Effect Across Models?",
subtitle = "Odds ratio for flipper_length_mm as covariates are added",
x = "Odds Ratio (log scale)", y = NULL,
caption = "Stable estimates across nested models suggest a robust association") +
theme_minimal(base_size = 12)

Every April, the data visualization community participates in the #30DayChartChallenge: one prompt per day, one chart per day, shared on social media. It was created in 2021 by Cedric Scherer and Dominic Roye, inspired by the #30DayMapChallenge.
| Prompt | Chart Type | R Package or Function |
|---|---|---|
| Part-to-whole | Waffle chart | waffle |
| Ranking | Bump chart | ggbump |
| Slope | Slope chart | geom_segment() |
| Circular | Polar bar | coord_polar() |
| Uncertainty | Gradient intervals | ggdist |
| Relationships | Network | ggraph + tidygraph |
| Neo-geometric | Voronoi | ggforce |
| Storytelling | Annotated timeline | ggtext + annotate() |
#30DayChartChallenge on Twitter/X or Mastodon

The 2026 edition is at github.com/30DayChartChallenge/Edition2026.
A waffle chart is a part-to-whole alternative to pie charts, popularized by the #30DayChartChallenge. Each square represents a fixed number of units (here, 5 penguins).
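To make the unit-per-square idea concrete, here is a sketch of the arithmetic behind a waffle layout in base R. The species counts are an assumption (the complete-case counts from palmerpenguins), and the column-by-column fill order is just one possible choice, not a statement about how the waffle package lays out its squares:

```r
# Species counts after dropping incomplete rows
# (assumption: palmerpenguins, 333 complete cases)
counts <- c(Adelie = 146, Chinstrap = 68, Gentoo = 119)

per_square <- 5
squares <- round(counts / per_square)   # squares per species

# One row per square, filled column by column into a 5-row grid
n_rows <- 5
cells <- data.frame(
  species = rep(names(squares), times = squares),
  index   = seq_len(sum(squares)) - 1
)
cells$row <- cells$index %% n_rows + 1
cells$col <- cells$index %/% n_rows + 1

squares
sum(squares) * per_square - sum(counts)   # rounding error, in penguins
```

Because each species is rounded separately, the squares add up to 335 penguins rather than 333. Stating the scale in the caption helps readers avoid over-reading small differences.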
library(waffle)
penguin_counts <- penguins |>
drop_na() |>
count(species) |>
mutate(n_scaled = round(n / 5)) # each square = 5 penguins
waffle(
c("Adelie" = penguin_counts$n_scaled[1],
"Chinstrap" = penguin_counts$n_scaled[2],
"Gentoo" = penguin_counts$n_scaled[3]),
rows = 5,
size = 1,
colors = c("#FF6B35", "#A23B72", "#2E86AB"),
title = "Palmer Penguins by Species",
xlab = "1 square = 5 penguins"
) +
theme(plot.title = element_text(face = "bold", size = 16),
legend.position = "bottom")

Challenge for you: pick a prompt from the #30DayChartChallenge and create a chart using a dataset from this semester. Share it!
| Principle | Practical Rule |
|---|---|
| Grammar of graphics | Build plots in layers; do not start from scratch |
| Show the data | Prefer jitter + summary to bar charts of means |
| Quantify uncertainty | CIs, error bars, halfeye plots |
| Position > color > area | Use the highest-accuracy encoding for the comparison that matters most |
| Direct label | Replace legends with labels on the data |
| Strip ruthlessly | If a gridline does not help the reader, remove it |
| Title is a finding | “Gentoo Penguins Are Heavier” beats “Body Mass by Species” |
| Visualize models | Coefficient plots and predicted probabilities beat regression tables |
| Iterate | Your fourth draft is better than your first |
No lab activity for this lecture.