Learning goals

Teaser

Let’s have a look at the tips dataset. The help reads: one waiter recorded information about each tip he received over a period of a few months working in one restaurant…

data(tips, package="reshape2") # load tips data
help("tips", package="reshape2") # a bare ?tips finds nothing unless reshape2 is attached
head(tips)
##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4
summary(tips)
##    total_bill         tip             sex      smoker      day    
##  Min.   : 3.07   Min.   : 1.000   Female: 87   No :151   Fri :19  
##  1st Qu.:13.35   1st Qu.: 2.000   Male  :157   Yes: 93   Sat :87  
##  Median :17.80   Median : 2.900                          Sun :76  
##  Mean   :19.79   Mean   : 2.998                          Thur:62  
##  3rd Qu.:24.13   3rd Qu.: 3.562                                   
##  Max.   :50.81   Max.   :10.000                                   
##      time          size     
##  Dinner:176   Min.   :1.00  
##  Lunch : 68   1st Qu.:2.00  
##               Median :2.00  
##               Mean   :2.57  
##               3rd Qu.:3.00  
##               Max.   :6.00

Here are three plots of the tips dataset created using ggplot2.

library(ggplot2)
theme_set(theme_bw()) # change the default plot theme to B&W

qplot(total_bill, tip, data=tips, colour=time, shape=sex, size=size, alpha=I(0.5))

qplot(tip, fill=factor(size), data=subset(tips, size%in%2:4), facets=sex~time, geom="density", alpha=I(0.5))

qplot(total_bill, tip, data=tips, facets=sex~time, geom=c("density2d", "smooth"))

On the grammar of graphics

The grammar of graphics was conceived by L. Wilkinson to address the question: what is a statistical graphic? It lets you specify how variables in the data are mapped to aesthetic attributes that you can perceive.

Aesthetics (position, colour, size, etc.) are applied to geometric objects (points, lines, polygons, etc.). The plot may also contain statistical transformations of the data (e.g. binning for histograms, smoothing, etc.).

Grammar of graphics concepts were brought to R by H. Wickham in the ggplot2 package. Plots are built layer by layer from data stored in a data frame (beware of the type of your columns, e.g. numeric, factor, logical, since it determines how they are mapped). The main ggplot2 functions are:

The main aesthetics are x, y, colour, size, shape, alpha. The help (in particular all geoms and stats) is best browsed at http://docs.ggplot2.org/current/.

Building teaser plots layer by layer

In this section, you will learn how to draw the plots of the teaser section step by step.

Scatter plot

To get you started, here is the minimal code needed to draw a scatter plot:

ggplot() +
  geom_point(aes(x=total_bill, y=tip), data=tips)

Now, you can use the help to reproduce the first plot:

Histogram

Use geom_histogram to produce this plot:

Look at geom_histogram help to figure out how to draw these plots:

Look at facet_grid and facet_wrap help in order to reproduce these plots (the scales argument is a useful one in real life!):
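As a hint rather than a solution, the sketch below shows what the scales argument changes. The choice of day as the faceting variable and the bin width of 0.5 are assumptions made for illustration; they are not taken from the plots to reproduce.

```r
library(ggplot2)
data(tips, package="reshape2")

base <- ggplot(tips, aes(x=tip)) +
  geom_histogram(binwidth=0.5)

# Default: all panels share the same x and y axes
p_fixed <- base + facet_wrap(~day)

# scales="free_y": each panel gets its own y axis, which helps
# when the groups have very different numbers of observations
p_free <- base + facet_wrap(~day, scales="free_y")

# facet_grid lays panels out as rows ~ columns
p_grid <- base + facet_grid(sex~time, scales="free_y")

# Print a plot object to draw it, e.g. print(p_free)
```

With free scales the shapes of the distributions are easier to compare, but the bar heights are no longer comparable across panels.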

Now you can use geom_density to reproduce the second of the teaser plots:

Contour plot

Use stat_density2d to reproduce this plot:

Now you can add a layer with stat_smooth to reproduce this plot:

Use the appropriate faceting function to reproduce these plots. You might also need to use subset for the second plot.

ggplot2 in practice

Mind the default values

The data and aesthetics can be set in each layer. Alternatively, they can be specified when calling ggplot(): in this case, these values are passed as defaults to all layers.
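A minimal sketch of the two styles, plus a layer that overrides an inherited default. The use of method="lm" and the Dinner subset are illustrative assumptions.

```r
library(ggplot2)
data(tips, package="reshape2")

# Data and aesthetics given once in ggplot(): every layer inherits them
p_default <- ggplot(tips, aes(x=total_bill, y=tip)) +
  geom_point() +
  geom_smooth(method="lm")

# Same plot with data and aesthetics repeated in each layer
p_layer <- ggplot() +
  geom_point(aes(x=total_bill, y=tip), data=tips) +
  geom_smooth(aes(x=total_bill, y=tip), data=tips, method="lm")

# A layer can still override a default: here the smoother
# is fitted on dinners only, while the points show all rows
p_override <- ggplot(tips, aes(x=total_bill, y=tip)) +
  geom_point() +
  geom_smooth(data=subset(tips, time=="Dinner"), method="lm")
```

Setting defaults in ggplot() keeps the code short. Per-layer values are mostly useful when layers draw from different data frames.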

Each layer has both a geom and a stat. However, most of the time specifying only one is enough (mind the default value of the other!).
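One way to check these defaults is to inspect the layer object returned by a geom_xx or stat_xx function. This sketch assumes the layer exposes its geom and stat as the geom and stat fields, which holds in the ggplot2 versions I am aware of but is an implementation detail rather than a documented interface.

```r
library(ggplot2)

# geom_histogram() draws bars and defaults to the "bin" stat
h <- geom_histogram()
class(h$geom)[1]
class(h$stat)[1]

# stat_bin() bins the data and defaults to the "bar" geom,
# so both calls describe the same kind of layer
s <- stat_bin()
class(s$geom)[1]
class(s$stat)[1]

# geom_point() applies no transformation: its stat is "identity"
pt <- geom_point()
class(pt$stat)[1]
```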

“Quicker” plots

The teaser examples were created using the qplot syntax, a shortcut for the explicit ggplot syntax that we have used up to now. You will certainly come across it in tutorials and forums, so it is briefly introduced here. However, I don’t recommend spending too much time learning it: real-life plots often require controlling details that can only be achieved with the ggplot syntax!

The qplot function itself takes the following arguments (and is conveniently designed with appropriate defaults):

qplot(x, ..., data, facets = NULL, geom = "auto", stat = list(NULL), 
  position = list(NULL), xlim = c(NA, NA), ylim = c(NA, NA), log = "", main = NULL,
  xlab = deparse(substitute(x)), ylab = deparse(substitute(y)))

This qplot call:

qplot(total_bill, tip, data=tips, facets=sex~time, geom=c("density2d", "smooth"))

is equivalent to this ggplot one:

ggplot(aes(x=total_bill, y=tip), data=tips) +
  stat_density2d() +
  stat_smooth() +
  facet_grid(sex~time)

You can add additional layers to a plot created using qplot:

qplot(total_bill, tip, data=tips, geom="density2d") +
  stat_smooth() +
  facet_grid(sex~time)

Using statistical transformations in plots

For several geoms, the associated stat is not identity (e.g. bin, density). In this case, the stat produces one or more new variables (e.g. ..density.., ..count..) that you can map to the aesthetics of your choice.
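To see which variables a stat produces, you can inspect the data computed for a layer with layer_data(). The sketch below does this for a histogram; the bin width of 1 is an arbitrary choice.

```r
library(ggplot2)
data(tips, package="reshape2")

p_count <- ggplot(tips, aes(x=tip)) +
  geom_histogram(binwidth=1)

# One row per bin, with the variables computed by the bin stat
computed <- layer_data(p_count)
head(computed[, c("x", "count", "density")])

# Map the computed density (instead of the default count) to y.
# Recent ggplot2 versions prefer after_stat(density) and may warn
# about the ..density.. notation used in this document.
p_density <- ggplot(tips, aes(x=tip)) +
  geom_histogram(aes(y=..density..), binwidth=1)
```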

stat_summary is particularly useful to plot the average over all points at a given x:

ggplot(aes(x=size, y=tip), data=tips) +
  geom_jitter(alpha=0.5) +
  stat_summary(fun.y="mean", geom="line")

ggplot(aes(x=size, y=tip), data=tips) +
  geom_jitter(alpha=0.5) +
  stat_summary(fun.data="mean_cl_normal", geom="smooth")

ggplot(aes(x=size, y=tip, colour=sex), data=tips) +
  geom_point(alpha=0.5, position=position_jitter(width=0.2)) +
  geom_line(stat="summary", fun.y="mean") +
  geom_errorbar(stat="summary", fun.data="mean_cl_normal", width=0.2)
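Note that mean_cl_normal relies on the Hmisc package being installed. The sketch below instead uses mean_se, which ships with ggplot2, and checks the summary layer against means computed by hand. The pointrange geom and the red colour are illustrative choices.

```r
library(ggplot2)
data(tips, package="reshape2")

p_mean <- ggplot(tips, aes(x=size, y=tip)) +
  geom_jitter(alpha=0.5) +
  stat_summary(fun.data=mean_se, geom="pointrange", colour="red")

# Values computed by the second layer: one row per value of size,
# with y (mean), ymin and ymax (mean -/+ one standard error)
by_plot <- layer_data(p_mean, 2)
by_plot <- by_plot[order(by_plot$x), ]

# The same means computed directly from the data frame
by_hand <- aggregate(tip ~ size, data=tips, FUN=mean)

cbind(by_hand, plotted=by_plot$y)
```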

Setting vs mapping an aesthetic

Any aesthetic can be set to a constant value or mapped to a variable. In this example, colour is mapped to time but alpha is set to 0.5. In the qplot syntax, setting is achieved with the I() function (stands for “inhibit”).

ggplot(data=tips) +
  geom_point(aes(x=total_bill, y=tip, colour=time, shape=sex, size=size), alpha=0.5)

qplot(total_bill, tip, data=tips, colour=time, shape=sex, size=size, alpha=I(0.5))
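A common mistake is to put a constant inside aes(). The sketch below, using blue as an arbitrary example colour, compares the colours that each layer ends up with.

```r
library(ggplot2)
data(tips, package="reshape2")

# Setting: the constant is given outside aes(), points are drawn in blue
p_set <- ggplot(tips, aes(x=total_bill, y=tip)) +
  geom_point(colour="blue")

# Mapping by mistake: "blue" is treated as a one-level variable,
# so the default colour scale picks the colour and adds a legend
p_map <- ggplot(tips, aes(x=total_bill, y=tip)) +
  geom_point(aes(colour="blue"))

unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```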

Controlling scales

Scales define several important visual aspects of the plot, including its limits and transformation (e.g. log scale).

Calling the scale function

All parameters can be fine-tuned by calling the scale function. Scales can be of different types (mainly continuous, discrete, and manual):

ggplot(aes(x=total_bill, y=tip), data=tips) +
  geom_point() +
  geom_smooth() +
  scale_x_continuous(name='total bill', trans='log10') +
  scale_y_continuous(limits=c(1, 3))
## Warning: Removed 98 rows containing missing values (stat_smooth).
## Warning: Removed 98 rows containing missing values (geom_point).
## Warning: Removed 6 rows containing missing values (geom_path).

Compare with:

ggplot(aes(x=total_bill, y=tip), data=subset(tips, tip>=1 & tip<=3)) +
  geom_point() +
  geom_smooth() +
  scale_x_continuous(name='total bill', trans='log10')
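The two plots above behave alike because scale limits discard the rows outside the range before the smoother is fitted. If you want to zoom without discarding anything, coord_cartesian() is a possible alternative. It is not covered elsewhere in this document, so take the sketch below as a pointer rather than a recommendation.

```r
library(ggplot2)
data(tips, package="reshape2")

# Rows that fall outside the y limits used above
outside <- tips$tip < 1 | tips$tip > 3
n_outside <- sum(outside)
n_outside

# Zooming with the coordinate system: all rows are kept,
# so the smoother is fitted on the full dataset
p_zoom <- ggplot(tips, aes(x=total_bill, y=tip)) +
  geom_point() +
  geom_smooth() +
  scale_x_continuous(name='total bill', trans='log10') +
  coord_cartesian(ylim=c(1, 3))
```

n_outside should correspond to the number of removed rows reported in the warnings of the first plot.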

Scales are used for all aesthetics (if you don’t set one, a default is used):

ggplot(aes(x=total_bill, y=tip, colour=size), data=tips) +
  geom_point() +
  scale_colour_gradient(low="red", high="white", trans="log10", breaks=c(1, 3, 6))
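The example above uses a continuous scale. For a discrete variable, a manual scale lets you choose the value assigned to each level. A minimal sketch, with arbitrary colour choices:

```r
library(ggplot2)
data(tips, package="reshape2")

p_manual <- ggplot(tips, aes(x=total_bill, y=tip, colour=time)) +
  geom_point() +
  # Named values are matched to the levels of the mapped variable
  scale_colour_manual(values=c(Dinner="darkblue", Lunch="orange"))

# Colours actually assigned to the points
sort(unique(layer_data(p_manual)$colour))
```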

Shortcuts

Several shortcuts mimicking the syntax of base R are available:

  • xlim and ylim
  • xlab, ylab, and labs

ggplot(aes(x=total_bill, y=tip), data=tips) +
  geom_point() +
  geom_smooth() +
  xlim(10, 20) + xlab('total bill')
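These shortcuts build ordinary scale and label objects. The sketch below compares xlim with the explicit scale call. Reading the limits field of a scale object is an assumption about ggplot2 internals, used here only to show the equivalence.

```r
library(ggplot2)
data(tips, package="reshape2")

short <- xlim(10, 20)
long <- scale_x_continuous(limits=c(10, 20))

# Both objects carry the same limits
short$limits
long$limits

# The plot above written with explicit scale calls
p_long <- ggplot(tips, aes(x=total_bill, y=tip)) +
  geom_point() +
  geom_smooth() +
  scale_x_continuous(name='total bill', limits=c(10, 20))
```

Like any scale limits, xlim discards the rows outside the range, so the smoother only sees bills between 10 and 20.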

Position

Position adjustments apply minor tweaks to the position of elements within a layer. The main values of the position argument are:

  • identity (don’t adjust position: most common default),
  • stack (overlapping objects are shown on top of one another),
  • dodge (overlapping objects are shown side by side),
  • jitter (jitter points to avoid overplotting).

ggplot() +
  geom_histogram(aes(x=tip, fill=factor(size)), data=subset(tips, size%in%2:4)) # default position is 'stack'

ggplot() +
  geom_histogram(aes(x=tip, y=..density.., fill=factor(size)), data=subset(tips, size%in%2:4), position='identity', alpha=0.5)
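The dodge adjustment is easiest to see on a bar chart. The sketch below, which uses day and sex as an illustrative choice of variables, compares the bar coordinates produced by stack and dodge.

```r
library(ggplot2)
data(tips, package="reshape2")

# Default for bars: counts for each sex are stacked within a day
p_stack <- ggplot(tips, aes(x=day, fill=sex)) +
  geom_bar()

# dodge: bars for each sex are placed side by side within a day
p_dodge <- ggplot(tips, aes(x=day, fill=sex)) +
  geom_bar(position="dodge")

stacked <- layer_data(p_stack)
dodged <- layer_data(p_dodge)

# Stacked bars start where the previous one ends;
# dodged bars all start from zero
range(stacked$ymin)
range(dodged$ymin)
```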

Adjusting cosmetics

The general appearance of the plot is set by its theme. Themes can be set as default (for a given R session):

theme_set(theme_bw())

Themes can also be set (theme_xx) and customized (theme) for a given plot:

ggplot() +
  geom_point(aes(x=total_bill, y=tip, colour=time, shape=sex, size=size), data=tips, alpha=0.5) +
  theme_classic(base_size=18) + theme(legend.position='top')

You might also want to use a different discrete colour scale:

qplot(tip, fill=factor(size), data=subset(tips, size%in%2:4), geom="density", alpha=I(0.65)) +
  scale_fill_brewer(palette="Set1")

# to set it as a default for discrete colour scales
scale_colour_discrete <- function(...) scale_colour_brewer(..., palette="Set1")

In order to arrange several plots on a single page, the easiest way is to use the gridExtra package:

library(gridExtra)
p1 <- qplot(total_bill, tip, data=tips, colour=time, shape=sex, size=size, alpha=I(0.5))
p2 <- qplot(tip, fill=factor(size), data=subset(tips, size%in%2:4), facets=sex~., geom="density", alpha=I(0.5))
grid.arrange(p1, p2, ncol = 2)

The multiplot function is a nice (user contributed) alternative, allowing you to save plots in a list and to define the relative size of the plots: http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/

Going further