Perhaps the way to introduce this is to provide the dominant R
graphics package’s (ggplot2
) intellectual history.
Hadley Wickham starts grad school at Iowa State, eventually to be supervised by Diana Cook, an Australian statistician. In assessing the extant computational approaches to statistical graphics, Hadley observes that they tend to cluster as follows:
Which design traits characterize both software clusters?
Graph Zoo is a system which requires you to flick through a gallery of graphs, each of which is produced by a canned routine. Upon finding the sort of graph you intend, you inspect the relevant code. If your data are in exactly the format that was in mind of the routine’s developers, and your desired is exactly what you see in the zoo, then you’re fine.1 This is the type of graphical suite built into SPSS, Stata, SAS, Matplotlib, Lattice, etc. A consequence of this design philosophy is a large number of functions, each of which corresponds to a desired graph type, and each function has a innumerable arguments.
Microcontrol is a system which provides almost no graphical primitives, and instead requires that the user specify the precise location of each graphical element. This is apparent in D3 (the JavaScript graphing library), and a little like gnuplot.
So Graph Zoo lets you try out graphs quickly, but it’s difficult to
customize and adapt. Microcontrol provides you exceptional flexibility,
but iterating is laborious. ggplot2
’s genius is to combine
the best of both worlds–flexibility and efficiency.
Hadley began working in 2005 on what would become ggplot2, with the following two design ambitions
What does this mean?
ggplot will take care of the rest
This is a huge obstacle for people who have just started working in ggplot. People are used to passing raw data–untransformed microdata–to a graphing routine, and the routine knowing how to make one specific graph. When each function only makes a specific type of graph (ie, the very epitome of Graph Zoo), then it can be written in a way to do all the required computations for that graph. As we’ve said, this is a rejection of the ggplot approach, wherein they want to express all graphs types more fluently and generally, with as little graph syntax as possible.
Accordingly, as a heuristic, think of the graph making as the final part of a data analysis process, and something you’ll only undertake after you’ve finished computing your necessary quantities.
You should recognize a lovely trait of tidyverse design philosophy
–the common geom_
prefix providing the user the opportunity
to hit the TAB
button after the cursor is flashing on the
_
, and cycling through the list of available geoms.
We’ll load an updated version of the US box office data we used in an earlier lab
library(tidyverse)
library(magrittr)
t1 <- "https://github.com/thomasjwood/code_lab/raw/main/data/box_office_jan_97_feb_24.rds" %>%
url %>%
readRDS
t1 %>%
glimpse
## Rows: 323,662
## Columns: 17
## $ id <chr> "rl1917879809", "rl2656798209", "rl912098817", "rl91209881…
## $ title <chr> "Evita", "Independence Day", "Jerry Maguire", "Jerry Magui…
## $ date <date> 1997-01-01, 1997-01-01, 1997-01-01, 1997-01-01, 1997-01-0…
## $ dow <chr> "Wednesday", "Wednesday", "Wednesday", "Wednesday", "Thurs…
## $ daily <dbl> 408292, 1594, 3462509, 3462509, 292698, 1240, 1837884, 183…
## $ to_date <dbl> 868379, 306157684, 69138326, 69138326, 1161077, 306158924,…
## $ daily_adj <dbl> 744400.3400, 2906.1900, 6312866.4699, 6312866.4699, 533648…
## $ to_date_adj <dbl> 1583234, 558188463, 126053397, 126053397, 2116882, 5581907…
## $ percent_yd <dbl> 556.0, 43.1, 20.3, 20.3, -28.3, -22.2, -46.9, -46.9, 18.3,…
## $ percent_lw <dbl> 472.6, 75.7, 11.6, 11.6, 330.1, 18.7, -49.1, -49.1, 435.0,…
## $ theaters <dbl> 22, 23, 2531, 2531, 22, 23, 2531, 2531, 22, 21, 2531, 2531…
## $ avg <dbl> 18558, 69, 1368, 1368, 13304, 53, 726, 726, 15738, 136, 16…
## $ day <dbl> 8, 183, 20, 20, 9, 184, 21, 21, 10, 185, 22, 22, 11, 186, …
## $ rating <chr> "PG", "PG-13", "R", "R", "PG", "PG-13", "R", "R", "PG", "P…
## $ runtime <dbl> 135, 145, 139, 139, 135, 145, 139, 139, 135, 145, 139, 139…
## $ genre <chr> "Biography Drama History Musical", "Action Adventure Sci-F…
## $ studio <chr> "Walt Disney Studios Motion Pictures", "Twentieth Century …
What if we wanted to plot opening weekends adjusted yield against the number of theatres?
t2 <- t1 %>%
filter(
day < 7 &
dow %>%
is_in(
c("Thursday",
"Friday",
"Saturday",
"Sunday")
)
) %>%
group_by(
id, title
) %>%
summarise(
tote = daily_adj %>% sum,
theaters = theaters %>% max
)
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
p1 <- t2 %>%
ggplot(
aes(
x = theaters, y = tote
)
)
p1 +
geom_point()
Or we could use a smoother
p1 +
geom_smooth()
Or we could use a density map
p1 +
geom_density2d()
Or maybe we’re interested in day of the week differences?
t1 %>%
mutate(
dow = dow %>%
factor(
c(2:7, 1) %>%
wday(label = T, abbr = F)
)
) %>%
ggplot(
aes(
x = dow, y = daily_adj
)
) +
stat_boxplot() +
scale_y_log10()
Mmmmm maybe just means and standard errors would be simpler?
t1 %>%
mutate(
dow = dow %>%
factor(
c(2:7, 1) %>%
wday(label = T, abbr = F)
)
) %>%
group_by(dow) %>%
summarize(
mu = daily_adj %>% mean,
se = daily_adj %>% plotrix::std.error()
) %>%
mutate(
low = mu %>%
subtract(
se %>%
multiply_by(1.96)
),
hi = mu %>%
add(
se %>%
multiply_by(1.96)
)
) %>%
ggplot(
aes(
x = dow, y = mu, ymin = low, ymax = hi
)
) +
geom_pointrange()
Or maybe we exploit one the huge number of R extensions?
t1 %>%
mutate(
dow = dow %>%
factor(
c(2:7, 1) %>%
wday(label = T, abbr = F)
)
) %>%
ggplot() +
geom_violin(
aes(
x = dow, y = daily_adj
),
adjust = 200
)
We can make nice line graphs for some recent hits, and less than hits
t1 %>%
filter(
title %>%
is_in(
c("Barbie",
"Oppenheimer",
"The Super Mario Bros. Movie",
"Guardians of the Galaxy Vol. 3",
"Spider-Mar: Across the Spider-Verse",
"Taylor Swift: The Eras Tour")
)
) %>%
mutate(
title = title %>%
fct_reorder(
to_date_adj,
.desc = T
)
) %>%
ggplot() +
geom_step(
aes(
x = day, y = to_date_adj, color = title
)
)
And the 30 biggest films of last year.
Note–we do all the math first, with
dplyr
verbs, before we do the plotting.
t1 %>%
filter(
date %>%
year %>%
equals(2023)
) %>%
group_by(title) %>%
summarize(
tote = daily_adj %>% sum
) %>%
arrange(desc(tote)) %>%
slice(1:30) %>%
mutate(
title = title %>%
fct_inorder %>%
fct_rev,
lab = tote %>%
divide_by(1e6) %>%
round %>%
prettyNum(big.mark = ",")
) %>%
ggplot(
aes(
tote, title
)
) +
geom_col(
fill = "grey90",
color = "grey10",
linewidth = .2,
width = .6
) +
geom_text(
aes(
tote, label = lab
),
position = position_nudge(x = 6e7),
size = 3,
) +
scale_x_continuous(
breaks = NULL,
expand = expansion(add = c(1e5, 5e7))
) +
labs(
x = "Gross ($ millions)",
y = "",
title = "30 biggest US films, 2023"
) +
theme_minimal() +
theme(
panel.grid = element_blank(),
plot.title = element_text(face = "bold")
)
lubridate::yday()
might be useful.)A classic example of this is apparent when you suggest a graph revision to show some other relationship, print the outcome of a statistical result, etc. Often, the user of graph zoo writes back and says “I have no idea how to make such a graph, the graphs I included are the graphs I include in all my papers.”↩︎