ggplot2 Tutorial (by Alex Yakubovich, Cathia Badiere, Wei-Hao Hwang)


Introduction

ggplot2 is an R package that offers an elegant alternative to the base graphics and lattice plotting systems. ggplot2 has two complementary uses:

ggplot2 has a rich underlying theory: the Grammar of Graphics, proposed by Leland Wilkinson. The grammar is based on the composition of building blocks according to certain rules. Statistical graphics are viewed as layers, each consisting of 4 elements:

The user can explicitly specify these layers, and put them together according to the rules of the grammar. Layers can be saved or shared between plots, as they have a high-level representation in the code.

This is very different from base graphics in R (e.g. the engine behind plot). Base graphics has a 'pen and paper' model, which defines a graphic as an unstructured set of raw elements. This means you can add to an existing plot, but you can't delete or modify elements, or share them between plots. On the plus side, base graphics renders plots much faster than ggplot2.

The nice thing about ggplot2 is that it is incredibly useful even if you don't understand the rich underlying theory.

We can summarize the capabilities of ggplot2 as follows:

Strengths

Weaknesses


Code

Getting started

This section explains how to set up ggplot2 and make some basic plots using qplot (quick plot).

To download the package from CRAN and install it:

install.packages("ggplot2", repos = "https://cloud.r-project.org")  # name a CRAN mirror explicitly
library(ggplot2)

We will use the diamonds dataset included with ggplot2:

head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

qplot shares much of its syntax with the standard plot function in R. Both accept \( x \) and \( y \) arguments as values from the workspace, or fields from a data frame:

# these are equivalent:
# qplot(diamonds$carat, diamonds$price, data = diamonds)
# qplot(x = carat, y = price, data = diamonds)
qplot(carat, price, data = diamonds)

It's better to pass data frames, for a few reasons. It makes it easier to change the data in the plot in the future, and generates nicer x and y labels.

Unlike plot, qplot is not generic: it does not accept arbitrary objects (e.g. linear models) as arguments. This is where ggplot comes in (explained later).

qplot makes it very easy to use the colour or size aesthetics to display information about additional variables:

qplot(carat, price, data = diamonds, colour = clarity)

A legend is automatically created, with the colours of the points mapped to clarity as we want. We would have to do a lot more work to create this plot with base graphics.

To reduce overplotting (clutter), it sometimes helps to add transparency. This is specified by the alpha argument, wrapped in I() so that it is set as a constant rather than mapped to a variable. Specifying alpha as 1/2 means that roughly 2 points need to overlap to achieve an opacity of one (transparency of zero).

qplot(carat, price, data = diamonds, colour = clarity, alpha = I(1/2))

This visualization suggests that price depends on carat through a power law which is different for every level of clarity. We can use a log-log plot to see this more clearly.

qplot accepts transformations of variables as its arguments and, like plot, it also has a log parameter:

qplot(log(carat), log(price), data = diamonds, colour = clarity)
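
As an alternative to transforming the variables, the log parameter mentioned above puts the axes themselves on a log scale, so the tick labels stay in the original units. A minimal sketch, assuming a ggplot2 version in which qplot is still available (recent releases mark it as deprecated); the name p_log is ours:

```r
library(ggplot2)

# log = "xy" asks qplot for log10 scales on both axes,
# so the data are not transformed, only the axes
p_log <- qplot(carat, price, data = diamonds, colour = clarity, log = "xy")
p_log
```

The shape of the point cloud should match the plot produced by the explicit log() calls, up to the base of the logarithm used for the axes.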

Next, we will explore how we can use colour and size to visualize some regression diagnostics.
We will use some synthetic data on height, weight and health for 15 individuals:

cat("\nheight weight health\n1  0.6008 0.3355  1.280\n2  0.9440 0.6890  1.208\n3  0.6150 0.6980  1.036\n4  1.2340 0.7617  1.395\n5  0.7870 0.8910  0.912\n6  0.9150 0.9330  1.175\n7  1.0490 0.9430  1.237\n8  1.1840 1.0060  1.048\n9  0.7370 1.0200  1.003\n10 1.0770 1.2150  0.943\n11 1.1280 1.2230  0.912\n12 1.5000 1.2360  1.311\n13 1.5310 1.3530  1.411\n14 1.1500 1.3770  0.603\n15 1.9340 2.0734  1.073 ", 
    file = "height_weight.dat")

hw <- read.table("height_weight.dat", header = TRUE)

head(hw)
##   height weight health
## 1 0.6008 0.3355  1.280
## 2 0.9440 0.6890  1.208
## 3 0.6150 0.6980  1.036
## 4 1.2340 0.7617  1.395
## 5 0.7870 0.8910  0.912
## 6 0.9150 0.9330  1.175

We can visualize all the data on one plot by plotting health against weight, and scaling each point by the height:

qplot(x = weight, y = health, data = hw, size = height, colour = I("steelblue"))

This plot is simpler than a full 3d visualization, but it of course carries less information. In particular, we can't see the regression plane.

Let's consider the marginal regression of health on weight. We can easily generate a scatter plot showing the line of best fit and the 95% confidence intervals:

qplot(x = weight, y = health, data = hw) + geom_smooth(method = lm)

We can display the data, residuals and the leverage for the regression all on one plot:

fit <- lm(health ~ weight, data = hw)
hii <- hatvalues(fit)  # leverages
res <- residuals(fit)  # residuals

# point size shows leverage, colour shows absolute residual
qplot(x = weight, y = health, data = hw, size = hii, colour = abs(res)) + 
    geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2])  # regression line

The plot shows clearly that the leverage depends only on the x-values (on their z-scores, to be exact). It also makes it easy to pick out the different types of outliers. We see two points with a very high leverage but small residual - type III outliers.
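
The dependence of leverage on the x-values alone can be checked numerically. For a simple linear regression with an intercept, the leverage of point i is h_ii = 1/n + z_i^2/(n - 1), where z_i is the z-score of x_i. A small sketch using the weight and health columns from the data above (the names z and h_manual are ours):

```r
# weight and health columns of the height/weight data
weight <- c(0.3355, 0.6890, 0.6980, 0.7617, 0.8910, 0.9330, 0.9430, 1.0060,
            1.0200, 1.2150, 1.2230, 1.2360, 1.3530, 1.3770, 2.0734)
health <- c(1.280, 1.208, 1.036, 1.395, 0.912, 1.175, 1.237, 1.048,
            1.003, 0.943, 0.912, 1.311, 1.411, 0.603, 1.073)

fit <- lm(health ~ weight)
n <- length(weight)

# z-scores of the predictor (sd() uses the n - 1 denominator)
z <- (weight - mean(weight)) / sd(weight)

# leverage computed from the z-scores only; health does not enter
h_manual <- 1 / n + z^2 / (n - 1)

# compare with the leverages reported by R
all.equal(unname(hatvalues(fit)), h_manual)
```

Since the response does not appear in h_manual, replacing health by any other response would leave the leverages unchanged.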

Advanced use

Working with ggplot

Recall the different components of a plot:

qplot does a lot of things behind the scenes: it initializes a plot object with the given data frame, maps the axes to the data, adds the different layers (geom, stat etc.) and finally renders the graphic. If we want more control over this process, and really make use of the grammar, ggplot is a better function to work with.

Amongst other things, it allows us to explicitly specify the different components of a plot, and save and reuse them in future plots.


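p <- ggplot(data = diamonds, aes(x = carat, y = price, colour = cut))  # init. plot, specifying data and aes
# add a layer with the point geom; recent ggplot2 versions require stat and position to be given explicitly
p <- p + layer(geom = "point", stat = "identity", position = "identity")
p  # render plot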

Note that p is a gg object. We can get information about it using the standard summary function:

summary(p)
## data: carat, cut, color, clarity, depth, table, price, x, y, z
##   [53940x10]
## mapping:  x = carat, y = price, colour = cut
## faceting: facet_null() 
## -----------------------------------
## geom_point:  
## stat_identity:  
## position_identity: (width = NULL, height = NULL)

For a more complicated example, we can produce a histogram, explicitly specifying the geometry, its colour properties, the statistical transformation (binning) and its bin width property:

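p <- ggplot(diamonds, aes(x = carat))  # init
# ggplot2 >= 2.0.0: geom and stat parameters share one params list (formerly
# geom_params and stat_params), and position must be given explicitly
p <- p + layer(geom = "bar", stat = "bin", position = "stack", 
    params = list(fill = "steelblue", binwidth = 0.25))
p  # render the plot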

summary(p)
## data: carat, cut, color, clarity, depth, table, price, x, y, z
##   [53940x10]
## mapping:  x = carat
## faceting: facet_null() 
## -----------------------------------
## geom_bar: fill = steelblue 
## stat_bin: binwidth = 0.25 
## position_stack: (width = NULL, height = NULL)

There is a default stat for every geom and vice versa. If we are working with defaults, we only need to specify the geom or stat, not both:

# specify the geom, using default stat
ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 0.25, fill = "steelblue")


# specify the stat, using default geom
ggplot(diamonds, aes(x = carat)) + stat_bin(binwidth = 0.25, fill = "steelblue") + 
    geom_density()


# Note that the above plot can be created w/ qplot:
qplot(carat, data = diamonds, binwidth = 0.25)
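
The pairing of geoms and stats described above can be inspected directly. The sketch below assumes that the layer objects returned by geom_* and stat_* functions expose their stat and geom as fields; this is an implementation detail of ggplot2 rather than a documented interface, so treat it as a way to explore rather than something to rely on:

```r
library(ggplot2)

# Each geom_*/stat_* call returns a layer object; its stat and geom
# fields hold the defaults that were filled in for us.
class(geom_histogram()$stat)[1]  # default stat of geom_histogram
class(stat_bin()$geom)[1]        # default geom of stat_bin
class(geom_point()$stat)[1]      # default stat of geom_point
```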


# we can store layers as variables to reuse them later:

# We can reuse the same plot object, modifying only the data
p1 <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
p1


diamonds1 <- transform(diamonds, carat = log(carat), price = log(price))
p1 %+% diamonds1  #update the data

Note that the data is copied into the gg object, not just stored as a reference. This means that we can save the gg object and load it into another workspace, and it will have all the information necessary to produce a plot.
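
Individual layers can be stored and reused in the same way as plot objects. A short sketch of the idea (the names pts, p_carat and p_depth are ours, and the colour and alpha values are arbitrary):

```r
library(ggplot2)

# a layer is an ordinary R object, so it can be stored in a variable
pts <- geom_point(colour = "steelblue", alpha = 1/5)

# ...and added to any number of plots
p_carat <- ggplot(diamonds, aes(x = carat, y = price)) + pts
p_depth <- ggplot(diamonds, aes(x = depth, y = price)) + pts

p_carat
p_depth
```

Changing pts in one place then changes the appearance of every plot that is rebuilt from it.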

More on Aesthetic mappings

The function aes describes the mapping between variables and aesthetics (things we see in the plot). We can specify the aesthetic mappings, or update them later. We will explore aes using our height-weight data:

p2 <- ggplot(data = hw)  #initialize

p2 <- p2 + aes(x = height, y = health)  #specify a mapping
p2 + geom_point()  #render


p2 <- p2 + aes(x = weight, y = health)  #change mapping
p2 + geom_point()

summary(p2)
## data: height, weight, health [15x3]
## mapping:  x = weight, y = health
## faceting: facet_null()

# add another aesthetic (colour)
p2 + geom_point(aes(colour = height))


# we can also remove aesthetics
p2 + geom_point(aes(colour = NULL))


# instead of mapping aesthetics to a variable, we can set them to a
# constant
p2 + geom_point(colour = "darkblue")  #set col to darkblue

qplot(weight, health, data = hw, colour = I("darkblue"))  #equivalent


# This is different from an aesthetic mapping:
p2 + geom_point(aes(colour = "darkblue"))
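
The difference between setting and mapping can be made visible by looking at the colours ggplot2 actually computes for the points. In the sketch below, layer_data() returns the computed data for a layer; the small data frame d and the names set_col and map_col are made up for illustration:

```r
library(ggplot2)

d <- data.frame(weight = c(0.3, 0.7, 1.2), health = c(1.3, 1.0, 0.9))  # illustrative values
base <- ggplot(d, aes(x = weight, y = health))

# setting: the constant is used as-is and no legend is drawn
set_col <- layer_data(base + geom_point(colour = "darkblue"))$colour

# mapping: "darkblue" is treated as a one-level variable and passed
# through the default colour scale, so the drawn colour is whatever
# the scale assigns to that level, and a legend appears
map_col <- layer_data(base + geom_point(aes(colour = "darkblue")))$colour

unique(set_col)
unique(map_col)
```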

Examples of ggplot2

Categorical Data Analysis

We will now explore some examples of visualizations of categorical data. We will use the Arrests dataset from the effects package, which contains demographic data on 5226 individuals and information on whether they were arrested or released with a summons for possession of marijuana. First, we experiment with some basic bar graphs:

library(effects)  # makes the Arrests dataset available (recent versions load it from carData)
head(Arrests)
##   released colour year age    sex employed citizen checks
## 1      Yes  White 2002  21   Male      Yes     Yes      3
## 2       No  Black 1999  17   Male      Yes     Yes      3
## 3      Yes  White 2000  24   Male      Yes     Yes      3
## 4       No  Black 2000  46   Male      Yes     Yes      1
## 5      Yes  Black 1999  27 Female      Yes     Yes      1
## 6      Yes  Black 1998  16 Female      Yes     Yes      0

### Bar graph
dat <- data.frame(colour = factor(c("Black", "White"), levels = c("Black", "White")), 
    Percent_Released = c(0.74, 0.85))
# basic bar graph
ggplot(dat, aes(x = colour, y = Percent_Released)) + geom_bar(stat = "identity")

# Use a different fill colour for each bar
ggplot(dat, aes(x = colour, y = Percent_Released)) + geom_bar(aes(fill = colour), 
    stat = "identity")


# Add a black outline
ggplot(dat, aes(x = colour, y = Percent_Released, fill = colour)) + geom_bar(colour = "black", 
    stat = "identity")


# Removing the legend
ggplot(dat, aes(x = colour, y = Percent_Released, fill = colour)) + geom_bar(colour = "black", 
    stat = "identity") + guides(fill = "none")  # fill = FALSE is deprecated in recent ggplot2

We can use an overlaid histogram to visualize more dimensions:

# Overlaid histograms
ggplot(Arrests, aes(x = checks, fill = released)) + geom_histogram(binwidth = 1, 
    alpha = 0.5, position = "identity")

# conclusions from this plot: if you have more checks, you are much less
# likely to be released

Next, we can generate different kinds of box plots:

# Boxplot Using Arrests Data

# specify the theme
p <- ggplot(data = Arrests) + theme(plot.title = element_text(lineheight = 0.8, 
    face = "bold"))

p + geom_boxplot(mapping = aes(x = colour, y = unclass(checks))) + ggtitle("Prior Police Checks by Race")


p + geom_boxplot(mapping = aes(x = released, y = unclass(checks))) + ggtitle("Prior Police Checks by Released (Yes/No)")



# faceting
p + facet_wrap(~released) + geom_boxplot(mapping = aes(x = colour, y = unclass(checks), 
    color = colour)) + ggtitle("Prior Police Checks by Race and Released (Yes/No)")

Density Estimation

Generating histograms and nonparametric density estimates is easy with ggplot. Here are some examples taken from the R Cookbook by Winston Chang:

# Simulate data from the mixture 0.5 N(0,1) + 0.5 N(0.5,1)
df <- data.frame(cond = factor(rep(c("A","B"), each=200)), rating = c(rnorm(200), rnorm(200, mean=.5)))

# Basic histogram from the vector "rating". Each bin is .5 wide.
ggplot(df, aes(x=rating)) + geom_histogram(binwidth=.5)


# Draw with black outline, white fill
ggplot(df, aes(x=rating)) + geom_histogram(binwidth=.5, colour="black", fill="white")


# Density curve
ggplot(df, aes(x=rating)) + geom_density()


# Histogram overlaid with kernel density curve
ggplot(df, aes(x=rating)) +
    # Histogram with density instead of count on the y-axis
    # (after_stat() needs ggplot2 >= 3.3.0; older versions used ..density..)
    geom_histogram(aes(y=after_stat(density)), binwidth=.5, colour="black", fill="white") +
    geom_density(alpha=.2, fill="pink")  # Overlay with transparent density plot


Conclusion

We hope that this short tutorial gave you a sense of how ggplot2 functions as a powerful and flexible visualization package. For more information, we encourage you to consult some of the references below, which were used to create this tutorial. We made especially heavy use of the ggplot2 book by H. Wickham, which we strongly recommend.


Additional Resources

Tutorials

Reference Pages

Videos

Books

Other