Figures using Grammar of Graphics plots



So far in the course we have mainly relied on the traditional base graphics plots that are an integral part of the R language. These plots are very useful for quick data visualisation. They can also be extended to make high quality figures for publication. In order to arrange multiple plots on a single page you can set graphics parameters. For example par(mfcol=c(2,2)) sets up a 2 x 2 grid. We also used the lattice package to produce more sophisticated grids of plots in which the scales coincided.

Both base graphics and lattice plots use a syntax that closely matches the syntax used for model building. So a boxplot written as boxplot(Length~Site) corresponds to the model anova(lm(Length~Site)). This is clearly an advantage when learning R. Lattice plots allow more complex data sets to be visualised using | as a conditioning notation. However, complex plots using lattice can be difficult to design.

Grammar of Graphics plots (ggplots) are a relatively recent addition to R. They were designed by Hadley Wickham, who also programmed the reshape package. The two packages share the same approach to high-level declarative programming. The syntax differs from R model building syntax in various ways and can take some time to get used to. However, it provides an extremely elegant framework for building really nice looking figures with comparatively few lines of code. Like lattice plots, ggplots are quite prescriptive and will make a lot of the decisions for you, although most of the default settings can be changed.

A very useful resource is provided by the R cookbook pages.


Typical histograms use only one variable at a time, although they may be “conditioned” by some grouping variable. The aim of a histogram is to show the distribution of the variable clearly. Let’s try histograms in ggplot2 using the mussels data introduced in the primer.

library(ggplot2)
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-3. For overview type 'help("mgcv-package")'.
library(repmis)  # provides source_data()
d <- source_data("",sep=",",head=T)
## Downloading data from: 
## SHA-1 hash of the downloaded data file is:
## f4d719a9b2581496346ff0655394589252826656
str(d)
## 'data.frame':    113 obs. of  3 variables:
##  $ Lshell  : num  122.1 100.1 100.7 102.3 94.9 ...
##  $ BTVolume: int  39 21 23 22 20 22 21 18 21 15 ...
##  $ Site    : Factor w/ 6 levels "Site_1","Site_2",..: 6 6 6 6 6 6 6 6 6 6 ...


The first step when building a ggplot is to decide how the data will be mapped onto the elements that make up the plot. The term for this in ggplot speak is “aesthetics”. Personally, I find the term rather odd and potentially misleading. I would instinctively assume that aesthetics refers to the colour scheme or some other visual aspect of the final plot. In fact, the aesthetics are the first thing to decide on, rather than the last.

The easiest way to build a plot is by first forming an invisible object which represents the mapping of data onto the page. The way these data are presented can then be changed. The only aesthetics (mappings) that you need to know about for basic usage are x, y, colour, fill and group. The x and y mappings correspond to the axes, so are simple enough. Remember that a histogram maps a single variable onto the x axis. The y axis shows either frequency or density, so it is computed from the data rather than mapped directly to a variable. So to produce a histogram we only need to provide ggplot2 with the name of the variable we are going to use. We do that like this.

g0 <- ggplot(data=d,aes(x=Lshell))

Default histogram

Once the mapping is established, producing a histogram, or any other figure, involves deciding on the geometry used to depict the aesthetics. The default is very simple.

g0 + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.


There are several things to notice here. One is that the default theme has a grey background and looks rather like some of the figures in Excel. For some purposes this can be useful. However, you may prefer a more traditional black and white theme. This is easy to change. You are also warned that the default binwidth may not be ideal.

My own default settings for a histogram would therefore look more like this.

g1 <- g0 + geom_histogram(fill="grey",colour="black",binwidth=10) + theme_bw()
g1


This should be self-explanatory. The colour argument refers to the outlines of the bars and fill to their interior. It is usually a good idea to set the binwidth manually anyway. Notice that this time I have assigned the result to another object. We can then work with this to produce conditional histograms.

You can set the theme to black and white for all subsequent plots with a single call to theme_set().
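The command itself is missing from this copy of the page. A minimal sketch, assuming the intention was to make theme_bw() the default for the rest of the session (the name old_theme is only illustrative):

```r
library(ggplot2)

# theme_set() changes the default theme for every plot drawn afterwards
# and invisibly returns the previous default, so it can be restored later.
old_theme <- theme_set(theme_bw())

# ... plots drawn from here on use the black and white theme ...

# To go back to the previous default:
# theme_set(old_theme)
```

After this call there is no need to add + theme_bw() to each individual plot.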


Facet wrapping

Conditioning the data on one grouping variable is very simple using facet_wrap. Facets are the term used in ggplot2 for what lattice calls panels. There are two facet functions, facet_wrap and facet_grid. facet_wrap simply wraps a one-dimensional set of panels into what should be a convenient number of columns and rows. You can set the number of columns and rows yourself if the results are not as you want.
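The plotting code for this step is missing from this copy. The sketch below shows one way a faceted histogram could be built. It uses simulated data as a stand-in for the mussels data frame: d_sim and its values are made up for illustration, and the choice of ncol = 3 is arbitrary. With the course data you would simply add the facet to g1.

```r
library(ggplot2)

# Simulated stand-in for the mussels data (made-up values, not the course data)
set.seed(1)
d_sim <- data.frame(
  Site   = factor(rep(paste0("Site_", 1:6), each = 20)),
  Lshell = rnorm(120, mean = rep(seq(95, 120, by = 5), each = 20), sd = 12)
)

g_hist <- ggplot(d_sim, aes(x = Lshell)) +
  geom_histogram(fill = "grey", colour = "black", binwidth = 10) +
  theme_bw()

# One panel per site; ncol (or nrow) overrides the automatic layout
g_facet <- g_hist + facet_wrap(~Site, ncol = 3)
g_facet
```

Because the histogram was stored as an object first, the facet is added without repeating any of the earlier settings.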



Density plots

Density plots are produced in a similar manner to histograms.

g1<-g0 +geom_density(fill="grey",colour="black")
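A second figure followed here, but its code is missing from this copy. A natural next step, and only an assumption about what the missing code did, is to condition the density plot on site in the same way as the histogram. The sketch uses simulated data (d_sim is made up, not the course data):

```r
library(ggplot2)

# Simulated stand-in for the mussels data (made-up values, not the course data)
set.seed(1)
d_sim <- data.frame(
  Site   = factor(rep(paste0("Site_", 1:6), each = 20)),
  Lshell = rnorm(120, mean = rep(seq(95, 120, by = 5), each = 20), sd = 12)
)

# Same mapping as for the histogram; only the geometry changes
g_dens <- ggplot(d_sim, aes(x = Lshell)) +
  geom_density(fill = "grey", colour = "black") +
  theme_bw()

# One density curve per site, each in its own panel
g_dens_facet <- g_dens + facet_wrap(~Site)
g_dens_facet
```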




Adding grouping aesthetics

Adding a grouping aesthetic allows subgroups to be plotted on the same figure.

g_group<-ggplot(d, aes(Lshell, group=Site)) + geom_density()


Colour and fill are also grouping aesthetics. So a nicer way of showing this would be to use them instead.

g_group<-ggplot(d, aes(Lshell, colour = Site,fill=Site)) + geom_density(alpha = 0.2)



A grouped boxplot uses the grouping variable on the x axis. So we need to change the aesthetic mapping to reflect this.

g0 <- ggplot(d,aes(x=Site,y=Lshell))
g_box<-g0 + geom_boxplot(fill="grey",colour="black")+theme_bw()


You should be able to work out that sets of boxplots could be conditioned on a third variable using faceting.

## Make up some data
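The code that followed this comment is missing from this copy. One possible version is sketched below. Everything in it is invented for illustration: the data frame d_fake, the factor Treatment with levels Control and Exposed, and the size of the effect.

```r
library(ggplot2)

# Entirely made-up data: shell length at three sites under two
# hypothetical treatments, 15 replicates per combination
set.seed(2)
d_fake <- expand.grid(rep = 1:15,
                      Site = paste0("Site_", 1:3),
                      Treatment = c("Control", "Exposed"))
d_fake$Lshell <- rnorm(nrow(d_fake),
                       mean = 100 + 8 * (d_fake$Treatment == "Exposed"),
                       sd = 10)

# Boxplots by site, conditioned on the third variable with a facet
g_box2 <- ggplot(d_fake, aes(x = Site, y = Lshell)) +
  geom_boxplot(fill = "grey", colour = "black") +
  facet_wrap(~Treatment) +
  theme_bw()
g_box2
```

The mapping is unchanged from the simple boxplot; the third variable only enters through the facet.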




Confidence interval plots

One important rule that you should try to follow when presenting data and carrying out any statistical test is to show confidence intervals for key parameters. Remember that boxplots show the actual data rather than estimated parameters. When the data are grouped by a factor, the parameters extracted from the data are the group means. When two or more numerical variables are combined, the parameters refer to the statistical model, as in the case of regression.

The grammar of graphics provides a convenient way of adding statistical summaries to figures. We can show the position of the mean mussel shell length for each site simply by asking for the mean of y to be plotted at each value of x, like this.

g0 <- ggplot(d,aes(x=Site,y=Lshell))
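The line that adds the summary layer is missing from this copy. A minimal sketch using stat_summary() follows, with simulated data standing in for the mussels data (d_sim is made up). It assumes a reasonably recent ggplot2, where the summary function is passed as fun; older versions used fun.y.

```r
library(ggplot2)

# Simulated stand-in for the mussels data (made-up values, not the course data)
set.seed(1)
d_sim <- data.frame(
  Site   = factor(rep(paste0("Site_", 1:6), each = 20)),
  Lshell = rnorm(120, mean = rep(seq(95, 120, by = 5), each = 20), sd = 12)
)

# stat_summary() applies mean() to the y values within each site
# and draws the result as a single point per site
g_mean <- ggplot(d_sim, aes(x = Site, y = Lshell)) +
  stat_summary(fun = mean, geom = "point")
g_mean
```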


This is not very useful on its own. However, you can easily add confidence intervals.
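The original code for this figure is missing from this copy. The sketch below shows one way to do it, using simulated data (d_sim is made up). The helper mean_ci() is hypothetical and written here for illustration; it computes a t-based 95% interval for the mean. ggplot2 also offers mean_cl_normal for the same purpose, but that function needs the Hmisc package to be installed.

```r
library(ggplot2)

# Simulated stand-in for the mussels data (made-up values, not the course data)
set.seed(1)
d_sim <- data.frame(
  Site   = factor(rep(paste0("Site_", 1:6), each = 20)),
  Lshell = rnorm(120, mean = rep(seq(95, 120, by = 5), each = 20), sd = 12)
)

# Hypothetical helper: mean with a t-based confidence interval,
# returned in the y / ymin / ymax form that stat_summary() expects
mean_ci <- function(y, level = 0.95) {
  n  <- length(y)
  m  <- mean(y)
  hw <- qt(1 - (1 - level) / 2, df = n - 1) * sd(y) / sqrt(n)
  data.frame(y = m, ymin = m - hw, ymax = m + hw)
}

g_ci <- ggplot(d_sim, aes(x = Site, y = Lshell)) +
  stat_summary(fun.data = mean_ci, geom = "errorbar", width = 0.2) +
  stat_summary(fun = mean, geom = "point") +
  theme_bw()
g_ci
```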



The traditional “dynamite” plots with a confidence interval over a bar can be formed in the same way.

g0 <- ggplot(d,aes(x=Site,y=Lshell))
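The layers for this figure are missing from this copy. A sketch of a dynamite plot follows, again with simulated data (d_sim is made up). As an assumption made for brevity, the interval uses ggplot2's mean_se() with a multiplier of 1.96, which gives a normal approximation to a 95% confidence interval rather than an exact t-based one.

```r
library(ggplot2)

# Simulated stand-in for the mussels data (made-up values, not the course data)
set.seed(1)
d_sim <- data.frame(
  Site   = factor(rep(paste0("Site_", 1:6), each = 20)),
  Lshell = rnorm(120, mean = rep(seq(95, 120, by = 5), each = 20), sd = 12)
)

g_dyn <- ggplot(d_sim, aes(x = Site, y = Lshell)) +
  # Bars whose height is the site mean
  stat_summary(fun = mean, geom = "bar", fill = "grey", colour = "black") +
  # Mean +/- 1.96 standard errors (approximate 95% interval)
  stat_summary(fun.data = mean_se, fun.args = list(mult = 1.96),
               geom = "errorbar", width = 0.2) +
  theme_bw()
g_dyn
```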


Most statisticians prefer that the means are shown as points rather than bars. You may want to look at the discussion on this provided by Ben Bolker.


Scatterplots can be built up in a similar manner. We first need to define the aesthetics. In this case there are clearly x and y coordinates that need to be mapped to the names of the variables.

g0 <- ggplot(d,aes(x=Lshell,y=BTVolume))
g0 + geom_point()


Adding a regression line

It is very easy to add a regression line with confidence intervals to the plot.

g0+geom_point()+geom_smooth(method = "lm", se = TRUE)
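To see what the smoother is doing, it can help to fit the same model directly. The sketch below uses simulated data (d_sim and the coefficients used to generate it are made up) and assumes that the line drawn by geom_smooth(method = "lm") is the ordinary least squares fit of y on x, with the shaded band showing a 95% confidence interval for the fitted mean.

```r
library(ggplot2)

# Simulated stand-in for the mussels data (made-up values, not the course data)
set.seed(3)
d_sim <- data.frame(Lshell = runif(100, min = 60, max = 130))
d_sim$BTVolume <- -20 + 0.45 * d_sim$Lshell + rnorm(100, sd = 4)

# Scatterplot with fitted line and confidence band
g_reg <- ggplot(d_sim, aes(x = Lshell, y = BTVolume)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)
g_reg

# The same model fitted directly
fit <- lm(BTVolume ~ Lshell, data = d_sim)
coef(fit)      # intercept and slope of the plotted line
confint(fit)   # confidence intervals for the two parameters
```

The figure shows uncertainty in the fitted line, while confint() reports uncertainty in the parameters themselves. Both come from the same model.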