How to Speak ggplot2 Like a Native

Harlan D. Harris, PhD
October 16, 2013

Background

ggplot's philosophy

  • Graphics are (should be!) created by combining a specification with data. (Wilkinson, 2005)

  • The specification makes you think about what you're trying to communicate.

  • The specification is not the name of the visual form: bar graph, scatterplot, histogram.

  • The specification is a collection of rules that together describe how to build a graph, a…

Grammar of Graphics

graphics as grammar

I think about the message:

  • \( x = year \)
  • \( y = r / g \)
  • visualize with bars
  • display \( lg \) separately

The computer translates it to an image:

spec + data.frame = plot

As we'll see later, a strength of ggplot is that you can do certain types of computations on your data from within the system. Here, it's computing a ratio of two things. In general, this power allows you to quickly explore your data, and different ways of presenting your data, without having to make changes to your data table.
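As a rough sketch of that idea, the specification above can be written so that the ratio is computed inside `aes()`. The data frame `toy` and its numbers are made up for illustration; only the column names (`year`, `lg`, `r`, `g`) mirror the baseball table used later in the talk.

```r
library(ggplot2)

# Made-up totals for illustration only (not real baseball figures).
toy <- data.frame(
  year = rep(2001:2003, times = 2),
  lg   = rep(c("AL", "NL"), each = 3),
  r    = c(900, 950, 1000, 800, 820, 880),  # runs
  g    = rep(162, 6)                        # games
)

# The specification: x = year, y = r / g, drawn as bars, one panel per league.
# The ratio is computed inside aes(); `toy` itself is never modified.
p_ratio <- ggplot(toy, aes(x = year, y = r / g)) +
  geom_bar(stat = "identity") +
  facet_wrap(~lg)
```

Swapping `r / g` for another expression changes the plot without touching the data table.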

advantages

  1. Flexible
    • can define new graph types by changing specifications
    • can combine many forms into single graphs
  2. Smart
    • compact: rules have useful defaults
    • graphs always have meaning
  3. Reusable
    • can plug new data into old specification
    • can explore many types of plots from a set of data

Canonical flexibility example: a pie chart is a stacked bar graph plotted in polar coordinates! In fact, that's how ggplot implements pie charts!
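A minimal sketch of the pie chart example, assuming a made-up data frame `shares` with hypothetical categories. The only difference between the two plots is the coordinate system.

```r
library(ggplot2)

# Made-up shares for illustration.
shares <- data.frame(category = c("a", "b", "c"), share = c(50, 30, 20))

# A single stacked bar: one x position, segments filled by category.
bar <- ggplot(shares, aes(x = factor(1), y = share, fill = category)) +
  geom_bar(stat = "identity", width = 1)

# The same specification in polar coordinates, with y mapped to the angle,
# is a pie chart.
pie <- bar + coord_polar(theta = "y")
```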

ggplot2

  1. Hadley Wickham (formerly Rice Univ., now RStudio)
    • also: reshape2, plyr, lubridate, stringr, etc.
  2. Implements & extends the Grammar of Graphics
    • Focus on layers; based on grid
    • Specification as R objects constructed by functions
    • Large library of components with good defaults
  3. ggplot2: Elegant Graphics for Data Analysis (Wickham, 2009)

Gripes

representations

  • Need to hold 3 things in your head at once
    1. ggplot syntax – left-to-right R expressions
    2. (some of) underlying data structure
    3. the visual display itself
  • Mappings are complex and nonlinear
  • Different syntax might have lessened the cognitive load

illegibility

  • Can't see the underlying data structure (completely)
  • Can't figure out mistakes you've made
  • Processing split between syntax \( \rightarrow \) data structure and data structure \( \rightarrow \) image steps
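Part of the hidden data structure can be inspected after the fact. The sketch below assumes a ggplot2 release that provides `layer_data()`, and uses a made-up vector of values. `summary()` describes the specification, while `layer_data()` returns the data a layer actually draws after its stat has run.

```r
library(ggplot2)

# Made-up values for illustration.
vals <- data.frame(v = c(1, 2, 2, 3, 3, 3, 8))

p_hist <- ggplot(vals, aes(x = v)) +
  geom_histogram(binwidth = 1)

# Text description of the specification (data, mapping, layers).
summary(p_hist)

# The computed data behind the first layer: one row per bin, not per input row.
binned <- layer_data(p_hist)
binned[, c("x", "count")]
```

This does not remove the split between the two processing steps, but it makes the intermediate result visible when a plot looks wrong.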

abuses R

  • Deep Magic with powerful idiosyncrasies of R
  • Do “operator overloading”, “lazy evaluation”, and “prototype-based OOP” mean anything to you?

    That's the problem.

introductions create misconceptions

  • Most tutorials start with the shortcuts, requiring relearning, rebuilding of mental models
  • To learn a language, start with the formal structure first

Rip it out!

  • Hadley's book is a particularly bad example of this

Let's Make A Graph! (slowly)

goal

[Figure: the graph we are aiming to build]

data

library(ggplot2); library(plyr)
data(baseball)
tail(baseball, 4)
             id year stint team lg   g  ab  r   h X2b X3b
89526 benitar01 2007     1  SFN NL  19   0  0   0   0   0
89530 ausmubr01 2007     1  HOU NL 117 349 38  82  16   3
89533  aloumo01 2007     1  NYN NL  87 328 51 112  19   1
89534 alomasa02 2007     1  NYN NL   8  22  1   3   1   0
      hr rbi sb cs bb so ibb hbp sh sf gidp
89526  0   0  0  0  0  0   0   0  0  0    0
89530  3  25  6  1 37 74   3   6  4  1   11
89533 13  49  3  0 27 30   5   2  0  3   13
89534  0   0  0  0  0  3   0   0  0  0    0

Craig Biggio from the Astros

ggplot likes "long" data

library(reshape2)
alnl <- subset(baseball, lg %in% c('AL', 'NL'))
bb_wide <- ddply(alnl, .(lg, year), summarise,
                 r=sum(r), h=sum(h), hr=sum(hr))
head(bb_wide, 3)

  lg year    r    h hr
1 AL 1901 1322 2330 61
2 AL 1902 1872 3648 88
3 AL 1903 1409 3050 68
bb_long <- melt(bb_wide, id=c('lg', 'year'), variable.name='stat')
head(bb_long, 3)
  lg year stat value
1 AL 1901    r  1322
2 AL 1902    r  1872
3 AL 1903    r  1409

simplest plot

bb_long_al <- subset(bb_long, lg=='AL')
p <- ggplot(data=bb_long_al, 
            mapping=aes(x=year, y=value, 
                        color=stat)) +
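    # recent ggplot2 releases require stat and position to be spelled out
    layer(geom='line', stat='identity', position='identity')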
print(p)


Notes on the code above:

  • p: we are building an object, not plotting on the screen.
  • ggplot(): creates a "core" ggplot object.
  • data: required; ggplot wants a data.frame.
  • mapping: all ggplots have an underlying x and y, and the mapping parameter describes how to map those underlying variables to the columns of the data.frame.
  • x and y: realized as horizontal and vertical position here, but might not be in other graphs. These are the obvious mappings. Note that ggplot uses the data.frame column names as the default axis labels.
  • color: the color of the lines is mapped to the value of the stat column. (The actual colors are set by default here; we will override them later.)
  • +: somewhat strangely, ggplot objects are built piece by piece using addition as the operator. Sometimes you're adding to an existing piece, sometimes you're adding a new piece.
  • layer(): things that are drawn are in layers. Layers have geoms, which are geometric shapes like points, lines, bars, etc.
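To make the "object, not picture" point concrete, here is a small sketch with a made-up data frame `toy` standing in for `bb_long_al`. The `layer()` call spells out `stat` and `position`, which recent ggplot2 releases require.

```r
library(ggplot2)

# Made-up long-format data for illustration.
toy <- data.frame(
  year  = rep(2001:2003, times = 2),
  stat  = rep(c("h", "r"), each = 3),
  value = c(30, 32, 31, 12, 15, 14)
)

p <- ggplot(data = toy,
            mapping = aes(x = year, y = value, color = stat)) +
  layer(geom = "line", stat = "identity", position = "identity")

# Nothing has been drawn so far; p is an ordinary R object.
class(p)
names(p$mapping)   # the aesthetics stored on the core object

# Drawing happens only when the object is printed.
if (interactive()) print(p)
```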

structure (not on the test!)

ggplot(data=bb_long_al, 
            mapping=aes(x=year, y=value, 
                        color=stat)) +
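    # recent ggplot2 releases require stat and position to be spelled out
    layer(geom='line', stat='identity', position='identity')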
ggplot
  data    layers  mapping                    scales  coords  facet  theme
  (copy)          x=year y=value color=stat  (default)
layer
  data  mapping  geom   stat      geom.params  stat.params
                 line   identity

add layers

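# recent ggplot2 releases take stat, position, and a params list explicitly
p2 <- p + layer(geom='point', stat='identity', position='identity',
                params=list(size=4))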
summary(p2)
data: lg, year, stat, value [321x4]
mapping:  x = year, y = value, colour = stat
faceting: facet_null() 
-----------------------------------
geom_line:  
stat_identity:  
position_identity: (width = NULL, height = NULL)

geom_point: size = 4 
stat_identity:  
position_identity: (width = NULL, height = NULL)


structure so far

ggplot
  data    layers  mapping                    scales  coords  facet  theme
  (copy)          x=year y=value color=stat  (default)
layers
  data  mapping  geom   stat      geom.params  stat.params
                 line   identity
                 point  identity  size=4
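The layer list in this structure can be checked directly on the object. This is a sketch with made-up data; the field names `layers`, `geom`, and `stat` are how recent ggplot2 releases expose the structure and may differ in older ones.

```r
library(ggplot2)

# Made-up long-format data for illustration.
toy <- data.frame(
  year  = rep(2001:2003, times = 2),
  stat  = rep(c("h", "r"), each = 3),
  value = c(30, 32, 31, 12, 15, 14)
)

p2 <- ggplot(toy, aes(x = year, y = value, color = stat)) +
  layer(geom = "line", stat = "identity", position = "identity") +
  layer(geom = "point", stat = "identity", position = "identity",
        params = list(size = 4))

# One entry per layer, in the order they were added.
length(p2$layers)

# The geom and stat attached to each layer.
sapply(p2$layers, function(l) class(l$geom)[1])
sapply(p2$layers, function(l) class(l$stat)[1])
```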

scales

Mappings of data to representations of displayable attributes.

p3 <- p2 + scale_y_continuous("Total", limits=c(0,15000), 
                        breaks=c(0,5000,10000,12000)) +
    scale_x_continuous("Year") +
    scale_color_manual("Stat", 
                       values=c('darkgreen','darkblue','darkred'))


coordinates and scales

  • Coordinates affect display of axes
    • cartesian, polar, map, etc.
  • Scales affect data mapping
    • colors, shapes, lines
  • Big source of confusion
    • set axis ticks/breaks and labels with scale_x_continuous(), but
    • set the visible axis range with coord_cartesian(xlim=c(1,10)), which zooms in without dropping any data
    • using scale_x_continuous(limits=c(1,10)) also sets the range, but it discards data outside the limits, which can break statistical calculations

Both are properties of the base object, not of layers, so they apply to the whole plot.
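The difference is easiest to see with a stat that computes something. In this sketch the data frame `obs` and the helper `mean_range()` are made up for illustration; the same summary layer is combined once with scale limits and once with coordinate limits.

```r
library(ggplot2)

# Made-up values with one large outlier.
obs <- data.frame(grp = "a", val = c(1, 2, 3, 4, 100))

# Hypothetical helper in the shape stat_summary(fun.data = ...) expects:
# it takes a vector and returns y, ymin, and ymax.
mean_range <- function(v) {
  data.frame(y = mean(v), ymin = min(v), ymax = max(v))
}

base <- ggplot(obs, aes(x = grp, y = val)) +
  stat_summary(fun.data = mean_range, geom = "pointrange")

# Scale limits: the outlier falls outside the limits and is dropped
# (with a warning) before the mean is computed.
p_scale <- base + scale_y_continuous(limits = c(0, 10))

# Coordinate limits: the mean uses all five values; only the view is zoomed.
p_coord <- base + coord_cartesian(ylim = c(0, 10))

layer_data(p_scale)$y   # mean of 1, 2, 3, 4
layer_data(p_coord)$y   # mean of all five values
```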

themes

Adjust properties of a graph as a whole.

p4 <- p3 + ggtitle("Baseball!") + 
    theme_bw() +
    theme(plot.title=element_text(color='#FF8888', size=40))


Note that although the theme calls come last in the series of function calls added together, they modify the base ggplot object, not any of the layers. This is unfortunate. I'm not going into themes in detail, but basically you can modify many of the underlying lines, boxes, and colors that make up a ggplot plot, either on a case-by-case basis like this, or by creating a theme object that you make the default in your environment.
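A minimal sketch of the second option, a reusable theme object made the session default. The name `talk_theme` is hypothetical; the settings are the ones from the slide above.

```r
library(ggplot2)

# A theme object: a complete theme plus a few overrides.
talk_theme <- theme_bw() +
  theme(plot.title = element_text(color = "#FF8888", size = 40))

# Make it the default for plots in this session; theme_set() returns the
# previous default so it can be restored later.
old_theme <- theme_set(talk_theme)

# ... plots drawn here pick up talk_theme without an explicit + theme_bw() ...

# Restore the previous default.
theme_set(old_theme)
```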

Idioms to Make Life Easier

"contractions" and "slang"

  • All those layer() calls are tedious!
  • geom_*() creates a layer with a specific geom (and various defaults, including a stat)
  • stat_*() creates a layer with a specific stat (and various defaults, including a geom)
  • qplot() creates a ggplot and a layer
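To see that a `geom_*()` call is shorthand for a `layer()` call, the two forms can be compared directly. This sketch uses made-up data and the `layer()` arguments that recent ggplot2 releases require; treat the exact argument list as version-dependent.

```r
library(ggplot2)

# Made-up data for illustration.
toy <- data.frame(year = 2001:2004, value = c(10, 12, 9, 14))

# Long form: every piece of the layer is spelled out.
long_form <- ggplot(toy, aes(x = year, y = value)) +
  layer(geom = "line", stat = "identity", position = "identity")

# Short form: geom_line() fills in the same stat and position by default.
short_form <- ggplot(toy, aes(x = year, y = value)) +
  geom_line()

# Both layers carry the same geom and stat.
class(long_form$layers[[1]]$geom)[1] == class(short_form$layers[[1]]$geom)[1]
class(long_form$layers[[1]]$stat)[1] == class(short_form$layers[[1]]$stat)[1]
```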

typical usage

p_typ <- ggplot(bb_long_al, aes(x=year, y=value, color=stat)) +
    geom_line() +
    geom_point(size=2)+
    ylab("Total") + xlab("Year") +
    scale_color_manual("Stat", values=c('darkgreen','darkblue','darkred')) +
    ggtitle("American League")  + theme_bw()


quick note on stats

  • stat="identity"
  • stat="smooth" with method="lm"
    • fit \( y=f(x) \) with lm(), generate new data to be plotted by geom_line(), CIs w/ geom_ribbon()
  • stat="smooth"
    • fit \( y=f(x) \) with loess() or gam()
  • stat="summary"
    • \( y=f(x) \) with arbitrary \( f() \)
  • stat="bin"
    • histograms
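As an illustration of a stat generating new data, the sketch below fits a straight line with `stat_smooth(method = "lm")` on simulated points and then looks at what the stat produced. The data are simulated, and `layer_data()` is assumed to be available in the installed ggplot2.

```r
library(ggplot2)

# Simulated points around a straight line.
set.seed(1)
pts <- data.frame(x = 1:30)
pts$y <- 2 * pts$x + rnorm(30)

p_lm <- ggplot(pts, aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)

# Data behind the second layer: fitted values on a grid of x, plus the
# lower and upper bounds of the confidence band.
fit <- layer_data(p_lm, 2)
head(fit[, c("x", "y", "ymin", "ymax")], 3)
```

The first layer still holds the original 30 points; only the smoothing layer replaces them with computed values.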

simplest faceted plot

p_typ <- ggplot(bb_long, aes(x=year, y=value, color=stat)) +
    geom_line() +
    facet_wrap(~lg)


stats, annotations and alpha

p_annot <- ggplot(bb_long, aes(x=year, y=value, color=stat)) +
    geom_line(alpha=.2, size=2) +
    stat_smooth() +
    facet_wrap(~lg) +
    annotate("text", x=1981, y=0, label="Strike!", 
             angle=90, hjust=0)


other fun examples

nicely formatted time series

library(scales)
df <- data.frame(
  date = seq(Sys.Date(), len=100, by="1 day")[sample(100, 50)],
  price = runif(50)
)
df <- df[order(df$date), ]
p_ts <- ggplot(df, aes(date, price)) + geom_line() +
    scale_x_date(breaks=date_breaks("2 weeks"))


you are my density

p_density <- ggplot(baseball, aes(ab, fill=lg)) + 
    geom_density(alpha=.4)


fizzy bubbly plots

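library(Hmisc)  # supplies the summary function behind 'median_hilow'
p_fb <- ggplot(baseball, aes(lg, ab)) + 
    geom_jitter() + 
    # recent ggplot2 releases pass summary-function arguments via fun.args
    stat_summary(fun.data='median_hilow', geom='crossbar', color='red',
                 fun.args=list(conf.int=.9))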


wrapping up

take homes

  • a ggplot graph is generated by a specification + data
  • ggplot specifications are a core object plus layers
  • mappings among data, x/y, scales, and other attributes are fundamental
  • geom and stat shortcuts allow smart/compact construction of graphs
  • ggplot encourages good graphs, with facets, good use of color, no chartjunk

resources

Thanks!