Graphics

Perhaps the way to introduce this is to provide the dominant R graphics package’s (ggplot2) intellectual history.

Hadley Wickham starts grad school at Iowa State, eventually to be supervised by Diana Cook, an Australian statistician. In assessing the extant computational approaches to statistical graphics, Hadley observes that they tend to cluster as follows:

The continuum of control over statistical graphics.
The continuum of control over statistical graphics.



Which design traits characterize both software clusters?

Graph Zoo is a system which requires you to flick through a gallery of graphs, each of which is produced by a canned routine. Upon finding the sort of graph you intend, you inspect the relevant code. If your data are in exactly the format that was in mind of the routine’s developers, and your desired is exactly what you see in the zoo, then you’re fine.1 This is the type of graphical suite built into SPSS, Stata, SAS, Matplotlib, Lattice, etc. A consequence of this design philosophy is a large number of functions, each of which corresponds to a desired graph type, and each function has a innumerable arguments.

Microcontrol is a system which provides almost no graphical primitives, and instead requires that the user specify the precise location of each graphical element. This is apparent in D3 (the JavaScript graphing library), and a little like gnuplot.

So Graph Zoo lets you try out graphs quickly, but it’s difficult to customize and adapt. Microcontrol provides you exceptional flexibility, but iterating is laborious. ggplot2 ’s genius is to combine the best of both worlds–flexibility and efficiency.

The continuum of control over statistical graphics.
Where ggplot falls on this continuum.

Hadley began working in 2005 on what would become ggplot2, with the following two design ambitions

  1. Provide a grammar of graphics–a way to express (almost) any figure with recourse to its visual primitives. The user declares quantities to be mapped to x and y axes, declares which graphical objects (ie–lines, points, ribbons, rectangles, density smoothers, regression smoothers, etc) will appear in this n-dimensional space, and the statistical software does the rest

What does this mean?

ggplot will take care of the rest

  1. To the maximum extent possible, the user is required to perform all the statistical computations before the graphing routine is called.

This is a huge obstacle for people who have just started working in ggplot. People are used to passing raw data–untransformed microdata–to a graphing routine, and the routine knowing how to make one specific graph. When each function only makes a specific type of graph (ie, the very epitome of Graph Zoo), then it can be written in a way to do all the required computations for that graph. As we’ve said, this is a rejection of the ggplot approach, wherein they want to express all graphs types more fluently and generally, with as little graph syntax as possible.

Accordingly, as a heuristic, think of the graph making as the final part of a data analysis process, and something you’ll only undertake after you’ve finished computing your necessary quantities.

The available geoms

You should recognize a lovely trait of tidyverse design philosophy –the common geom_ prefix providing the user the opportunity to hit the TAB button after the cursor is flashing on the _ , and cycling through the list of available geoms.

Let’s work some examples

We’ll load an updated version of the US box office data we used in an earlier lab

library(tidyverse)
library(magrittr)

t1 <- "https://github.com/thomasjwood/code_lab/raw/main/data/box_office_jan_97_feb_24.rds" %>% 
  url %>% 
  readRDS

t1 %>% 
  glimpse
## Rows: 323,662
## Columns: 17
## $ id          <chr> "rl1917879809", "rl2656798209", "rl912098817", "rl91209881…
## $ title       <chr> "Evita", "Independence Day", "Jerry Maguire", "Jerry Magui…
## $ date        <date> 1997-01-01, 1997-01-01, 1997-01-01, 1997-01-01, 1997-01-0…
## $ dow         <chr> "Wednesday", "Wednesday", "Wednesday", "Wednesday", "Thurs…
## $ daily       <dbl> 408292, 1594, 3462509, 3462509, 292698, 1240, 1837884, 183…
## $ to_date     <dbl> 868379, 306157684, 69138326, 69138326, 1161077, 306158924,…
## $ daily_adj   <dbl> 744400.3400, 2906.1900, 6312866.4699, 6312866.4699, 533648…
## $ to_date_adj <dbl> 1583234, 558188463, 126053397, 126053397, 2116882, 5581907…
## $ percent_yd  <dbl> 556.0, 43.1, 20.3, 20.3, -28.3, -22.2, -46.9, -46.9, 18.3,…
## $ percent_lw  <dbl> 472.6, 75.7, 11.6, 11.6, 330.1, 18.7, -49.1, -49.1, 435.0,…
## $ theaters    <dbl> 22, 23, 2531, 2531, 22, 23, 2531, 2531, 22, 21, 2531, 2531…
## $ avg         <dbl> 18558, 69, 1368, 1368, 13304, 53, 726, 726, 15738, 136, 16…
## $ day         <dbl> 8, 183, 20, 20, 9, 184, 21, 21, 10, 185, 22, 22, 11, 186, …
## $ rating      <chr> "PG", "PG-13", "R", "R", "PG", "PG-13", "R", "R", "PG", "P…
## $ runtime     <dbl> 135, 145, 139, 139, 135, 145, 139, 139, 135, 145, 139, 139…
## $ genre       <chr> "Biography Drama History Musical", "Action Adventure Sci-F…
## $ studio      <chr> "Walt Disney Studios Motion Pictures", "Twentieth Century …

What if we wanted to plot opening weekends adjusted yield against the number of theatres?

t2 <- t1 %>% 
  filter(
    day < 7 &
    dow %>% 
      is_in(
        c("Thursday",
          "Friday",
          "Saturday",
          "Sunday")
      )
  ) %>% 
  group_by(
    id, title
    ) %>% 
  summarise(
    tote = daily_adj %>% sum,
    theaters = theaters %>% max
    )
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
p1 <- t2 %>% 
  ggplot(
    aes(
      x = theaters, y = tote
    )
  )

p1 +
  geom_point()

Or we could use a smoother

p1 +
  geom_smooth()

Or we could use a density map

p1 +
  geom_density2d()

Or maybe we’re interested in day of the week differences?

t1 %>% 
  mutate(
    dow = dow %>% 
      factor(
        c(2:7, 1) %>% 
          wday(label = T, abbr = F)
        )
    ) %>% 
  ggplot(
    aes(
      x = dow, y = daily_adj
    )
  ) +
  stat_boxplot() +
  scale_y_log10()

Mmmmm maybe just means and standard errors would be simpler?

t1 %>% 
  mutate(
    dow = dow %>% 
      factor(
        c(2:7, 1) %>% 
          wday(label = T, abbr = F)
        )
    ) %>% 
  group_by(dow) %>% 
  summarize(
    mu  = daily_adj %>% mean,
    se = daily_adj %>% plotrix::std.error()
  ) %>% 
  mutate(
    low = mu %>% 
      subtract(
        se %>% 
          multiply_by(1.96)
      ),
    hi = mu %>% 
      add(
        se %>% 
          multiply_by(1.96)
      ) 
  ) %>% 
  ggplot(
    aes(
      x = dow, y = mu, ymin = low, ymax = hi
    )
  ) +
  geom_pointrange()

Or maybe we exploit one the huge number of R extensions?

t1 %>% 
  mutate(
    dow = dow %>% 
      factor(
        c(2:7, 1) %>% 
          wday(label = T, abbr = F)
        )
    ) %>% 
  ggplot() +
  geom_violin(
     aes(
      x = dow, y = daily_adj
      ),
     adjust = 200
     )

We can make nice line graphs for some recent hits, and less than hits

t1 %>% 
  filter(
    title %>% 
      is_in(
        c("Barbie",
          "Oppenheimer",
          "The Super Mario Bros. Movie",
          "Guardians of the Galaxy Vol. 3",
          "Spider-Mar: Across the Spider-Verse",
          "Taylor Swift: The Eras Tour")
        )
    ) %>%
  mutate(
    title = title %>% 
      fct_reorder(
        to_date_adj, 
        .desc = T
      )
    ) %>% 
  ggplot() +
  geom_step(
     aes(
      x = day, y = to_date_adj, color = title
      )
     )

And the 30 biggest films of last year.

Note–we do all the math first, with dplyr verbs, before we do the plotting.

t1 %>% 
  filter(
    date %>% 
      year %>% 
      equals(2023)
    ) %>%
  group_by(title) %>% 
  summarize(
    tote = daily_adj %>% sum
  ) %>% 
  arrange(desc(tote)) %>% 
  slice(1:30) %>% 
  mutate(
    title = title %>% 
      fct_inorder %>% 
      fct_rev,
    lab = tote %>% 
      divide_by(1e6) %>% 
      round %>% 
      prettyNum(big.mark = ",")
    ) %>% 
  ggplot(
    aes(
      tote, title
    )
  ) +
  geom_col(
    fill = "grey90",
    color = "grey10",
    linewidth = .2,
    width = .6
  ) +
  geom_text(
    aes(
      tote, label = lab
    ),
    position = position_nudge(x = 6e7),
    size = 3,
  ) +
  scale_x_continuous(
    breaks = NULL,
    expand = expansion(add = c(1e5, 5e7))
    ) +
  labs(
    x = "Gross ($ millions)",
    y = "",
    title = "30 biggest US films, 2023"
  ) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(face = "bold")
  ) 

Some exercises

  1. Take the top grossing film released in every year between 2018 and 2023. For each film, plot the cumulative adjusted box office against days in release.
  2. Plot the cumulative annual box office for every year between 2018 and 2023.(Hint: the function lubridate::yday() might be useful.)
  3. Take the big 5 studios (Disney, Universal, Paramount, Warner Bros., Sony) and report their annual studio box office as a percentage of total receipts for the years 2018 though 2023. Group together all other studios in a “Other” category. Plot these percentages as a stacked bar, by year.

  1. A classic example of this is apparent when you suggest a graph revision to show some other relationship, print the outcome of a statistical result, etc. Often, the user of graph zoo writes back and says “I have no idea how to make such a graph, the graphs I included are the graphs I include in all my papers.”↩︎