ggplot2 Tutorial (by Alex Yakubovich, Cathia Badiere, Wei-Hao Hwang)

Introduction

ggplot2 is an R package that provides an elegant alternative to the base graphics and lattice plotting systems. ggplot2 has two complementary uses:

Producing publication-quality graphics using very simple syntax that is similar to that of base graphics. ggplot2 tends to make smart default choices for color, scale etc.
Making more sophisticated/customized plots that go beyond the defaults.

ggplot2 has a rich underlying theory: the Grammar of Graphics, proposed by Leland Wilkinson. The grammar is based on the composition of building blocks according to certain rules. Statistical graphics are viewed as layers, each consisting of 4 elements:

Data
Mapping between variables and aesthetics (e.g. color, shape, scale)
Geometric Objects (e.g. points, lines, polygons)
Statistical Transformation (e.g. smoothing, binning in a histogram)

The user can explicitly specify these layers, and put them together according to the rules of the grammar. Layers can be saved or shared between plots, as they have a high-level representation in the code.

This is very different from base graphics in R (e.g. the engine behind plot). Base graphics has a 'pen and paper' model, which defines a graphic as an unstructured set of raw elements. This means you can add to an existing plot, but you can't delete or modify elements, or share them between plots. On the plus side, base graphics renders plots much faster than ggplot2.

The nice thing about ggplot2 is that it is incredibly useful even if you don't understand the rich underlying theory.

We can summarize the capabilities of ggplot2 as follows:

Strengths

Can make beautiful graphics very fast
Easy to update/modify plots
Can make highly customized plots very efficiently.

Weaknesses

No 3d plotting capabilities
Only static plots
Slower processing time

Code

Getting started

This section explains how to set up ggplot2 and make some basic plots using qplot (quick plot).

To download the package from CRAN and install:

# naming a CRAN mirror avoids the "trying to use CRAN without setting a mirror" error in non-interactive sessions
install.packages("ggplot2", repos = "https://cloud.r-project.org")

library(ggplot2)

We will use the diamonds dataset included with ggplot2:

head(diamonds)

##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

qplot shares much of its syntax with the standard plot function in R. Both accept \( x \) and \( y \) arguments as values from the workspace, or fields from a data frame:

# these are equivalent:
# qplot(diamonds$carat, diamonds$price, data = diamonds)
# qplot(x = carat, y = price, data = diamonds)
qplot(carat, price, data = diamonds)

It's better to pass data frames, for a few reasons. It makes it easier to change the data in the plot in the future, and generates nicer x and y labels.

Unlike plot, qplot is not a generic function: it does not accept arbitrary objects (e.g. linear models) as arguments. This is where ggplot comes in (explained later).

qplot makes it very easy to change the colour or scale aesthetics to display information about additional variables:

qplot(carat, price, data = diamonds, colour = clarity)

A legend is automatically created, with the colours of the points mapping to the clarity as we want. We would have to do a lot more work to create this plot with base graphics.

To reduce overplotting (clutter), it sometimes helps to add transparency. This is specified by the alpha argument, wrapped in I() so that it is used as a fixed value rather than mapped to a scale. As a rule of thumb, specifying alpha as 1/2 means that about 2 points need to overlap before the colour looks solid.

qplot(carat, price, data = diamonds, colour = clarity, alpha = I(1/2))
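
The link between alpha and the number of overlapping points is best read as a rule of thumb. As a rough sketch, assuming the graphics device uses standard "over" compositing (an assumption, not something ggplot2 guarantees), n overlapping points drawn with the same alpha have a combined opacity of 1 - (1 - alpha)^n. The helper name stacked_opacity is made up for this illustration:

```r
# Combined opacity of n overlapping points that share the same alpha,
# assuming standard "over" compositing on the graphics device.
stacked_opacity <- function(alpha, n) 1 - (1 - alpha)^n

# opacity after 1, 2, 3 and 4 overlapping points with alpha = 1/2
stacked_opacity(1/2, 1:4)
```

Under this assumption the opacity approaches one quickly as points pile up, which is why dense regions of the plot look solid.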

This visualization suggests that price depends on carat through a power law which is different for every level of clarity. We can use a log-log plot to see this more clearly.
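
To see why a log-log plot helps, suppose (as notation introduced here, not in the original analysis) that for clarity level \( c \) the power law has constant \( a_c \) and exponent \( b_c \). Taking logarithms of both sides turns the curve into a straight line:

\[ \text{price} = a_c \, \text{carat}^{\,b_c} \quad\Longrightarrow\quad \log(\text{price}) = \log(a_c) + b_c \log(\text{carat}) \]

So on log-log axes each clarity level should appear as a roughly straight band with slope \( b_c \) and intercept \( \log(a_c) \).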

qplot accepts transformations of variables as its arguments and, like plot, it also has a log parameter:

qplot(log(carat), log(price), data = diamonds, colour = clarity)
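
The log parameter mentioned above gives a similar picture while keeping the axis labels in the original units. A minimal sketch, relying on qplot's documented log argument ("x", "y" or "xy"); the name p_log is only illustrative:

```r
library(ggplot2)

# same data as above, but the axes are log-scaled instead of the variables
p_log <- qplot(carat, price, data = diamonds, colour = clarity, log = "xy")
p_log
```

Transforming the variables changes the numbers on the axes to log values, whereas log = "xy" only changes the spacing of the axes.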

Next, we will explore how we can use colour and scale to visualize some regression diagnostics.
We will use some synthetic data on height and weight for 15 individuals:

cat("\nheight weight health\n1  0.6008 0.3355  1.280\n2  0.9440 0.6890  1.208\n3  0.6150 0.6980  1.036\n4  1.2340 0.7617  1.395\n5  0.7870 0.8910  0.912\n6  0.9150 0.9330  1.175\n7  1.0490 0.9430  1.237\n8  1.1840 1.0060  1.048\n9  0.7370 1.0200  1.003\n10 1.0770 1.2150  0.943\n11 1.1280 1.2230  0.912\n12 1.5000 1.2360  1.311\n13 1.5310 1.3530  1.411\n14 1.1500 1.3770  0.603\n15 1.9340 2.0734  1.073 ", 
    file = "height_weight.dat")

hw <- read.table("height_weight.dat", header = TRUE)

head(hw)

##   height weight health
## 1 0.6008 0.3355  1.280
## 2 0.9440 0.6890  1.208
## 3 0.6150 0.6980  1.036
## 4 1.2340 0.7617  1.395
## 5 0.7870 0.8910  0.912
## 6 0.9150 0.9330  1.175

We can visualize all the data on one plot by plotting health against weight, and scaling each point by the height:

qplot(x = weight, y = health, data = hw, size = height, colour = I("steelblue"))

This plot is simpler than a full 3d visualization, but it of course carries less information. In particular, we can't see the regression plane.

Let's consider the marginal regression of health on weight. We can easily generate a scatter plot showing the line of best fit and the 95% confidence intervals:

qplot(x = weight, y = health, data = hw) + geom_smooth(method = lm)
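
The confidence band can be adjusted through geom_smooth's level and se arguments. The sketch below is illustrative only: it repeats the weight and health columns of the synthetic data in a data frame called hw_demo (a name chosen here) so that it runs on its own.

```r
library(ggplot2)

# weight and health columns of the synthetic data, repeated so the sketch is self-contained
hw_demo <- data.frame(
    weight = c(0.3355, 0.6890, 0.6980, 0.7617, 0.8910, 0.9330, 0.9430, 1.0060,
               1.0200, 1.2150, 1.2230, 1.2360, 1.3530, 1.3770, 2.0734),
    health = c(1.280, 1.208, 1.036, 1.395, 0.912, 1.175, 1.237, 1.048,
               1.003, 0.943, 0.912, 1.311, 1.411, 0.603, 1.073))

p_base <- qplot(x = weight, y = health, data = hw_demo)
p_99 <- p_base + geom_smooth(method = "lm", level = 0.99)  # wider, 99% band
p_line <- p_base + geom_smooth(method = "lm", se = FALSE)  # fitted line only, no band
p_99
```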

We can display the data, residuals and the leverage for the regression all on one plot:

fit <- lm(health ~ weight, data = hw)
hii <- hatvalues(fit)  # leverages
res <- residuals(fit)  # residuals

# size shows the leverage, colour shows the absolute residual
qplot(x = weight, y = health, data = hw, size = hii, colour = abs(res)) +
    geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2])  # regression line

We can see clearly that the leverage depends only on the x-values (through their z-scores, to be exact). The plot makes it easy to pick out the different types of outliers. We see two points with a very high leverage but a small residual: type III outliers.
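
The dependence of leverage on the z-scores can be checked directly. For a simple regression with an intercept, the leverage of point i is 1/n + z_i^2 / (n - 1), where z_i is the z-score of x_i computed with the sample standard deviation. The sketch below repeats the weight and health columns so that it runs on its own; the names fit_check and h_manual are illustrative.

```r
# weight and health columns of the synthetic data, repeated so the sketch is self-contained
weight <- c(0.3355, 0.6890, 0.6980, 0.7617, 0.8910, 0.9330, 0.9430, 1.0060,
            1.0200, 1.2150, 1.2230, 1.2360, 1.3530, 1.3770, 2.0734)
health <- c(1.280, 1.208, 1.036, 1.395, 0.912, 1.175, 1.237, 1.048,
            1.003, 0.943, 0.912, 1.311, 1.411, 0.603, 1.073)

fit_check <- lm(health ~ weight)
n <- length(weight)
z <- (weight - mean(weight)) / sd(weight)  # z-scores of the x-values

h_manual <- 1 / n + z^2 / (n - 1)          # leverage from the z-scores alone
all.equal(h_manual, unname(hatvalues(fit_check)))
```

Note that the health values never enter h_manual, which is the sense in which leverage is a function of the x-values only.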

Advanced use

Working with `ggplot`

Recall the different components of a plot:

data: Data frame
geoms: Geometric Objects
aes: Mapping between variables (data) and aesthetics (visual properties of geoms)
stat: Statistical Transformation

qplot does a lot of things behind the scenes: it initializes a plot object with the given data frame, maps the axes to the data, adds the different layers (geom, stat etc.) and finally renders the graphic. If we want more control over this process, and want to really make use of the grammar, ggplot is a better function to work with.

Amongst other things, it allows us to explicitly specify the different components of a plot, and save and reuse them in future plots.


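p <- ggplot(data = diamonds, aes(x = carat, y = price, colour = cut))  # init. plot, specifying data and aes
# add a layer with the point geom; recent ggplot2 versions require stat and position to be given explicitly
p <- p + layer(geom = "point", stat = "identity", position = "identity")
p  # render plot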

Note that p is a gg object. We can get information about it using the standard summary function:

summary(p)

## data: carat, cut, color, clarity, depth, table, price, x, y, z
##   [53940x10]
## mapping:  x = carat, y = price, colour = cut
## faceting: facet_null() 
## -----------------------------------
## geom_point:  
## stat_identity:  
## position_identity: (width = NULL, height = NULL)

For a more complicated example, we can produce a histogram, explicitly specifying the geometry, its colour properties, the statistical transformation (binning) and its bin width property:

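p <- ggplot(diamonds, aes(x = carat))  # init
# recent ggplot2 versions take geom and stat parameters in a single params list
# (the older geom_params and stat_params arguments are no longer accepted)
p <- p + layer(geom = "bar", stat = "bin", position = "stack",
    params = list(fill = "steelblue", binwidth = 0.25))
p  # render the plot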

summary(p)

## data: carat, cut, color, clarity, depth, table, price, x, y, z
##   [53940x10]
## mapping:  x = carat
## faceting: facet_null() 
## -----------------------------------
## geom_bar: fill = steelblue 
## stat_bin: binwidth = 0.25 
## position_stack: (width = NULL, height = NULL)

There is a default stat for every geom and vice versa. If we are working with defaults, we only need to specify the geom or stat, not both:

# specify the geom, using default stat
ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 0.25, fill = "steelblue")


# specify the stat, using default geom
ggplot(diamonds, aes(x = carat)) + stat_bin(binwidth = 0.25, fill = "steelblue") + 
    geom_density()


# Note that the above plot can be created w/ qplot:
qplot(carat, data = diamonds, binwidth = 0.25)
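
The default pairing between geoms and stats can be inspected on the layer objects themselves. This is a small sketch that assumes layer objects expose their geom and stat as the fields $geom and $stat, which is an implementation detail rather than a documented interface:

```r
library(ggplot2)

# each layer object records the geom and the stat it will use
class(geom_histogram()$stat)[1]  # the stat paired with geom_histogram
class(stat_bin()$geom)[1]        # the geom paired with stat_bin
```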


# we can store layers as variables to reuse them later:

# We can reuse the same plot object, modifying only the data
p1 <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
p1


diamonds1 <- transform(diamonds, carat = log(carat), price = log(price))
p1 %+% diamonds1  #update the data

Note that the data is copied into the gg object, not just stored as a reference. This means that we can save the gg object and load it into another workspace, and it will have all the information necessary to produce a plot.
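
Layers can be stored in variables and added to more than one plot, in the same way that a plot object can be reused with new data. A minimal sketch; the names pts, trend, p_raw and p_loglog are chosen here for illustration:

```r
library(ggplot2)

# layers are ordinary R objects, so they can be stored and reused
pts <- geom_point(alpha = 1/10)
trend <- geom_smooth(method = "lm")

# the same point layer is added to two different plots
p_raw <- ggplot(diamonds, aes(x = carat, y = price)) + pts
p_loglog <- ggplot(diamonds, aes(x = log(carat), y = log(price))) + pts + trend

p_raw
p_loglog
```

Because a stored layer carries no data or axis mappings of its own here, it picks them up from whichever plot it is added to.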

More on Aesthetic mappings

The function aes describes the mapping between variables and aesthetics (things we see in the plot). We can specify the aesthetic mappings, or update them later. We will explore aes using our height-weight data:

p2 <- ggplot(data = hw)  #initialize

p2 <- p2 + aes(x = height, y = health)  #specify a mapping
p2 + geom_point()  #render