ggplot2 is an R package that provides an elegant alternative to the base graphics and lattice plotting systems. ggplot2 has two complementary uses:
Producing publication-quality graphics using very simple syntax that is similar to that of base graphics. ggplot2 tends to make smart default choices for color, scale etc.
Making more sophisticated/customized plots that go beyond the defaults.
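ggplot2 has a rich underlying theory: the Grammar of Graphics, proposed by Leland Wilkinson. The grammar is based on the composition of building blocks according to certain rules. Statistical graphics are viewed as layers, each consisting of 4 elements: the data, a mapping between variables and aesthetics (aes), a geometric object (geom), and a statistical transformation (stat).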
The user can explicitly specify these layers, and put them together according to the rules of the grammar. Layers can be saved or shared between plots, as they have a high-level representation in the code.
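As a minimal sketch of layers being saved and shared between plots, the example below stores two layers in variables and adds them to two different plots. The built-in mtcars data and the names pts, trend, p_wt and p_hp are illustrative assumptions, not part of the original tutorial.

```r
library(ggplot2)

# Layers are ordinary R objects, so they can be stored in variables.
pts   <- geom_point(colour = "steelblue")
trend <- geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# The same two layers are reused in two plots with different mappings.
p_wt <- ggplot(mtcars, aes(x = wt, y = mpg)) + pts + trend
p_hp <- ggplot(mtcars, aes(x = hp, y = mpg)) + pts + trend

p_wt
p_hp
```

Because each layer is defined once, a change to pts or trend carries over to every plot that is rebuilt from them.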
This is very different from base graphics in R (e.g. the engine behind plot). Base graphics has a 'pen and paper' model, which defines a graphic as an unstructured set of raw elements. This means you can add to an existing plot, but you can't delete or modify elements, or share them between plots. On the plus side, base graphics renders plots much faster than ggplot2.
The nice thing about ggplot2 is that it is incredibly useful even if you don't understand the rich underlying theory.
We can summarize the capabilities of ggplot2 as follows:
Strengths
Weaknesses
This section explains how to set up ggplot2 and make some basic plots using qplot (quick plot).
To download the package from CRAN and install:
install.packages("ggplot2", repos = "https://cloud.r-project.org")  # name a CRAN mirror explicitly
library(ggplot2)
We will use the diamonds dataset included with ggplot2:
head(diamonds)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
qplot shares much of its syntax with the standard plot function in R. Both accept \( x \) and \( y \) arguments as values from the workspace, or fields from a data frame:
# these are equivalent:
# qplot(diamonds$carat, diamonds$price, data = diamonds)
# qplot(x = carat, y = price, data = diamonds)
qplot(carat, price, data = diamonds)
It's better to pass data frames, for a few reasons. It makes it easier to change the data in the plot in the future, and generates nicer x and y labels.
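Recent ggplot2 releases (3.4.0 onward) mark qplot as deprecated, so the calls in this section may emit a warning. As a hedged sketch, the same scatter plot can be written with ggplot and geom_point, both of which are covered later in this tutorial; the name p_scatter is an illustrative assumption.

```r
library(ggplot2)

# Equivalent of qplot(carat, price, data = diamonds):
# the data frame is passed once, and columns are referred to by name in aes().
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()
p_scatter
```

As with qplot, passing the data frame means that swapping in another dataset only requires changing one argument.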
Unlike plot, qplot is not a generic function: it does not accept arbitrary objects (e.g. linear models) as arguments. This is where ggplot comes in (explained later).
qplot makes it very easy to change the colour or scale aesthetics to display information about additional variables:
qplot(carat, price, data = diamonds, colour = clarity)
A legend is automatically created, with the colours of the points mapping to the clarity as we want. We would have to do a lot more work to create this plot with base graphics.
To reduce overplotting (clutter), it sometimes helps to add transparency. This is specified by the alpha argument. Specifying alpha as 1/2 means that 2 points need to overlap to achieve an opacity of one (transparency of zero).
qplot(carat, price, data = diamonds, colour = clarity, alpha = I(1/2))
This visualization suggests that price depends on carat through a power law which is different for every level of clarity. We can use a log-log plot to see this more clearly.
qplot accepts transformations of variables as its arguments and, like plot, it also has a log parameter:
qplot(log(carat), log(price), data = diamonds, colour = clarity)
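The call above transforms the variables themselves, so the axes show logged values. A sketch of the alternative, assuming we would rather keep the original units on the axis labels: log scales (which is what the log parameter of qplot requests) leave the data untouched and transform the axes instead. Note that these scales use base 10 rather than the natural log; the shape of the plot is the same. The name p_log is an illustrative assumption.

```r
library(ggplot2)

# Log-log axes via scales: the data are unchanged, the axes are log10-transformed.
p_log <- ggplot(diamonds, aes(x = carat, y = price, colour = clarity)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()
p_log

# qplot shorthand for the same idea (may warn as deprecated in recent releases):
# qplot(carat, price, data = diamonds, colour = clarity, log = "xy")
```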
Next, we will explore how we can use colour and scale to visualize some regression diagnostics.
We will use some synthetic data on height and weight for 15 individuals:
cat("\nheight weight health\n1 0.6008 0.3355 1.280\n2 0.9440 0.6890 1.208\n3 0.6150 0.6980 1.036\n4 1.2340 0.7617 1.395\n5 0.7870 0.8910 0.912\n6 0.9150 0.9330 1.175\n7 1.0490 0.9430 1.237\n8 1.1840 1.0060 1.048\n9 0.7370 1.0200 1.003\n10 1.0770 1.2150 0.943\n11 1.1280 1.2230 0.912\n12 1.5000 1.2360 1.311\n13 1.5310 1.3530 1.411\n14 1.1500 1.3770 0.603\n15 1.9340 2.0734 1.073 ",
file = "height_weight.dat")
hw <- read.table("height_weight.dat", header = TRUE)
head(hw)
## height weight health
## 1 0.6008 0.3355 1.280
## 2 0.9440 0.6890 1.208
## 3 0.6150 0.6980 1.036
## 4 1.2340 0.7617 1.395
## 5 0.7870 0.8910 0.912
## 6 0.9150 0.9330 1.175
We can visualize all the data on one plot by plotting health against weight, and scaling each point by the height:
qplot(x = weight, y = health, data = hw, size = height, colour = I("steelblue"))
This plot is simpler than a full 3D visualization, but of course it carries less information. In particular, we can't see the regression plane.
Let's consider the marginal regression of health on weight. We can easily generate a scatter plot showing the line of best fit and the 95% confidence intervals:
qplot(x = weight, y = health, data = hw) + geom_smooth(method = lm)
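To make explicit what the smoother adds here: with method = lm, geom_smooth fits a linear model and, by default (level = 0.95), draws a pointwise 95% confidence band for the mean response. The sketch below builds the same line and band by hand with lm and predict. The data frame repeats the height-weight values above, and the names grid, band and p_manual are illustrative assumptions.

```r
library(ggplot2)

# The height-weight data from above, entered directly.
hw <- data.frame(
  height = c(0.6008, 0.9440, 0.6150, 1.2340, 0.7870, 0.9150, 1.0490, 1.1840,
             0.7370, 1.0770, 1.1280, 1.5000, 1.5310, 1.1500, 1.9340),
  weight = c(0.3355, 0.6890, 0.6980, 0.7617, 0.8910, 0.9330, 0.9430, 1.0060,
             1.0200, 1.2150, 1.2230, 1.2360, 1.3530, 1.3770, 2.0734),
  health = c(1.280, 1.208, 1.036, 1.395, 0.912, 1.175, 1.237, 1.048,
             1.003, 0.943, 0.912, 1.311, 1.411, 0.603, 1.073)
)

fit  <- lm(health ~ weight, data = hw)

# Evaluate the fitted line and its confidence limits on a grid of weights.
grid <- data.frame(weight = seq(min(hw$weight), max(hw$weight), length.out = 80))
band <- cbind(grid, predict(fit, newdata = grid,
                            interval = "confidence", level = 0.95))
# band now has the columns weight, fit, lwr and upr

p_manual <- ggplot(hw, aes(x = weight, y = health)) +
  geom_ribbon(data = band, aes(x = weight, ymin = lwr, ymax = upr),
              inherit.aes = FALSE, alpha = 0.3) +
  geom_line(data = band, aes(y = fit)) +
  geom_point()
p_manual
```

The band is for the fitted mean, not for new observations, which is why several points fall outside it.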
We can display the data, residuals and the leverage for the regression all on one plot:
fit <- lm(health ~ weight, data = hw)
hii <- hatvalues(fit)  # leverages
res <- residuals(fit)  # residuals
qplot(x = weight, y = health, data = hw, size = hii, colour = abs(res)) +
  geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2])  # regression line
We see clearly how the leverage depends only on the x-values (their z-scores, to be exact). The plot makes it easy to pick out the different types of outliers. We see two points with a very high leverage but small residual: type III outliers.
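The dependence on the x-values alone can be checked directly. For simple linear regression the leverage has the closed form \( h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2 \), which equals \( 1/n + z_i^2/(n-1) \) when \( z_i \) is the z-score computed with the sample standard deviation. The sketch below compares this formula with hatvalues, using the weight and health values from the data above; the names z and h_manual are illustrative assumptions.

```r
# Weight and health columns of the height-weight data from above.
weight <- c(0.3355, 0.6890, 0.6980, 0.7617, 0.8910, 0.9330, 0.9430, 1.0060,
            1.0200, 1.2150, 1.2230, 1.2360, 1.3530, 1.3770, 2.0734)
health <- c(1.280, 1.208, 1.036, 1.395, 0.912, 1.175, 1.237, 1.048,
            1.003, 0.943, 0.912, 1.311, 1.411, 0.603, 1.073)

fit <- lm(health ~ weight)
n   <- length(weight)

# z-scores of the predictor (sd() uses the n - 1 denominator)
z <- (weight - mean(weight)) / sd(weight)

# Leverage from the closed form: no response values are involved.
h_manual <- 1 / n + z^2 / (n - 1)

# Compare with the leverages reported by R.
all.equal(unname(hatvalues(fit)), h_manual)
```

The response health enters only through the call to lm; the leverages themselves would be identical for any other response measured at the same weights.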
ggplot
Recall the different components of a plot:
data: Data frame
geoms: Geometric objects
aes: Mapping between variables (data) and aesthetics (visual properties of geoms)
stat: Statistical transformation
qplot does a lot of things behind the scenes: it initializes a plot object with the given data frame, maps the axes to the data, adds the different layers (geom, stat etc.) and finally renders the graphic. If we want more control over this process, and want to really make use of the grammar, ggplot is a better function to work with.
Amongst other things, it allows us to explicitly specify the different components of a plot, and save and reuse them in future plots.
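p <- ggplot(data = diamonds, aes(x = carat, y = price, colour = cut))  # init. plot, specifying data and aes
# add a layer with the point geom; ggplot2 2.0.0 and later require stat and position to be given explicitly
p <- p + layer(geom = "point", stat = "identity", position = "identity")
p  # render plot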
Note that p is a gg object. We can get information about it using the standard summary function:
summary(p)
## data: carat, cut, color, clarity, depth, table, price, x, y, z
## [53940x10]
## mapping: x = carat, y = price, colour = cut
## faceting: facet_null()
## -----------------------------------
## geom_point:
## stat_identity:
## position_identity: (width = NULL, height = NULL)
For a more complicated example, we can produce a histogram, explicitly specifying the geometry, its colour properties, the statistical transformation (binning) and its bin width property:
p <- ggplot(diamonds, aes(x = carat))  # init
# in ggplot2 2.0.0 and later, geom and stat parameters are passed together via params
p <- p + layer(geom = "bar", stat = "bin", position = "stack",
               params = list(fill = "steelblue", binwidth = 0.25))
p  # render the plot
summary(p)
## data: carat, cut, color, clarity, depth, table, price, x, y, z
## [53940x10]
## mapping: x = carat
## faceting: facet_null()
## -----------------------------------
## geom_bar: fill = steelblue
## stat_bin: binwidth = 0.25
## position_stack: (width = NULL, height = NULL)
There is a default stat for every geom and vice versa. If we are working with defaults, we only need to specify the geom or stat, not both:
# specify the geom, using default stat
ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 0.25, fill = "steelblue")
# specify the stat, using default geom
ggplot(diamonds, aes(x = carat)) + stat_bin(binwidth = 0.25, fill = "steelblue") +
geom_density()
# Note that the above plot can be created w/ qplot:
qplot(carat, data = diamonds, binwidth = 0.25)
# we can store layers as variables to reuse them later:
# We can reuse the same plot object, modifying only the data
p1 <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
p1
diamonds1 <- transform(diamonds, carat = log(carat), price = log(price))
p1 %+% diamonds1 #update the data
Note that the data is copied into the gg object, not just stored as a reference. This means that we can save the gg object and load it into another workspace, and it will have all the information necessary to produce a plot.
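A sketch of what this self-containment allows, assuming a small stand-in data frame (toy, p_toy, path and p_loaded are illustrative names, and the values are made up): the plot object is written to disk, the original data frame is deleted, and the reloaded object still builds.

```r
library(ggplot2)

toy   <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 1))  # made-up values
p_toy <- ggplot(toy, aes(x = x, y = y)) + geom_point()

# Save the plot object to a temporary file.
path <- tempfile(fileext = ".rds")
saveRDS(p_toy, path)

# Remove the data frame and the plot from the workspace.
rm(toy, p_toy)

# Reload the plot; in a fresh session ggplot2 must be attached first.
p_loaded <- readRDS(path)
layer_data(p_loaded)[, c("x", "y")]  # the data travelled with the plot
p_loaded
```

The flip side is that a saved plot of a large data frame is roughly as large as the data itself.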
The function aes describes the mapping between variables and aesthetics (things we see in the plot). We can specify the aesthetic mappings, or update them later. We will explore aes using our height-weight data:
p2 <- ggplot(data = hw) #initialize
p2 <- p2 + aes(x = height, y = health) #specify a mapping
p2 + geom_point() #render
p2 <- p2 + aes(x = weight, y = health) #change mapping
p2 + geom_point()
summary(p2)
## data: height, weight, health [15x3]
## mapping: x = weight, y = health
## faceting: facet_null()
# add another aesthetic (colour)
p2 + geom_point(aes(colour = height))
# we can also remove aesthetics
p2 + geom_point(aes(colour = NULL))
# instead of mapping aesthetics to a variable, we can set them to a
# constant
p2 + geom_point(colour = "darkblue") #set col to darkblue
qplot(weight, health, data = hw, colour = I("darkblue")) #equivalent
# This is different from an aesthetic mapping:
p2 + geom_point(aes(colour = "darkblue"))
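To see why the last call differs, here is a sketch using the first six rows of the height-weight data (the names hw6, p_base, p_set and p_map are illustrative assumptions). Inside aes, "darkblue" is treated as a data value, a constant variable with a single level, so ggplot2 assigns it a colour from its default discrete palette and adds a legend. Outside aes, it is taken literally as the colour to draw.

```r
library(ggplot2)

# First six rows of the height-weight data (weight and health only).
hw6 <- data.frame(
  weight = c(0.3355, 0.6890, 0.6980, 0.7617, 0.8910, 0.9330),
  health = c(1.280, 1.208, 1.036, 1.395, 0.912, 1.175)
)

p_base <- ggplot(hw6, aes(x = weight, y = health))
p_set  <- p_base + geom_point(colour = "darkblue")       # setting a constant
p_map  <- p_base + geom_point(aes(colour = "darkblue"))  # mapping a one-level variable

# Colours actually used when each plot is built:
unique(layer_data(p_set)$colour)  # the literal colour that was set
unique(layer_data(p_map)$colour)  # a palette colour chosen by the colour scale
```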
We will now explore some examples of visualizations of categorical data. We will use the Arrests dataset from the effects package, which contains demographic data on 5226 individuals and information on whether they were arrested or released with a summons for possession of marijuana. First, we experiment with some basic bar graphs:
library(effects)  # load the package that provides the Arrests data
## Loading required package: lattice
## Loading required package: grid
## Loading required package: MASS
## Loading required package: nnet
## Loading required package: colorspace
## Attaching package: 'effects'
## The following object(s) are masked from 'package:datasets':
##
## Titanic
head(Arrests)
## released colour year age sex employed citizen checks
## 1 Yes White 2002 21 Male Yes Yes 3
## 2 No Black 1999 17 Male Yes Yes 3
## 3 Yes White 2000 24 Male Yes Yes 3
## 4 No Black 2000 46 Male Yes Yes 1
## 5 Yes Black 1999 27 Female Yes Yes 1
## 6 Yes Black 1998 16 Female Yes Yes 0
### Bar graph
dat <- data.frame(colour = factor(c("Black", "White"), levels = c("Black", "White")),
Percent_Released = c(0.74, 0.85))
# basic bar graph
ggplot(dat, aes(x = colour, y = Percent_Released)) + geom_bar(stat = "identity")
# Use different fill colours
ggplot(dat, aes(x = colour, y = Percent_Released)) +
  geom_bar(aes(fill = colour), stat = "identity")
# Add a black outline
ggplot(dat, aes(x = colour, y = Percent_Released, fill = colour)) +
  geom_bar(colour = "black", stat = "identity")
# Remove the legend
ggplot(dat, aes(x = colour, y = Percent_Released, fill = colour)) +
  geom_bar(colour = "black", stat = "identity") +
  guides(fill = "none")  # guides(fill = FALSE) is deprecated in recent ggplot2
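Since stat = "identity" is needed whenever the bar heights are already computed, ggplot2 (2.2.0 and later) also provides geom_col as a shorthand for it. A brief sketch, rebuilding the dat data frame from above; the names p_bar and p_col are illustrative assumptions.

```r
library(ggplot2)

dat <- data.frame(
  colour = factor(c("Black", "White"), levels = c("Black", "White")),
  Percent_Released = c(0.74, 0.85)
)

# geom_bar with stat = "identity": bar heights are taken from the y aesthetic.
p_bar <- ggplot(dat, aes(x = colour, y = Percent_Released, fill = colour)) +
  geom_bar(stat = "identity", colour = "black")

# geom_col: the same plot without having to name the stat.
p_col <- ggplot(dat, aes(x = colour, y = Percent_Released, fill = colour)) +
  geom_col(colour = "black")

p_col
```

Leaving out stat = "identity" in geom_bar would instead ask for counts of rows per category, which is not what we want for precomputed percentages.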
We can use an overlaid histogram to visualize more dimensions:
# Overlaid histograms
ggplot(Arrests, aes(x = checks, fill = released)) + geom_histogram(binwidth = 1,
alpha = 0.5, position = "identity")
# conclusions from this plot: if you have more checks, you are much less
# likely to be released
Next, we can generate different kinds of box plots:
# Boxplot Using Arrests Data
# specify the theme
p <- ggplot(data = Arrests) + theme(plot.title = element_text(lineheight = 0.8,
face = "bold"))
p + geom_boxplot(mapping = aes(x = colour, y = unclass(checks))) + ggtitle("Prior Police Checks by Race")
p + geom_boxplot(mapping = aes(x = released, y = unclass(checks))) + ggtitle("Prior Police Checks by Released (Yes/No)")
# faceting
p + facet_wrap(~released) + geom_boxplot(mapping = aes(x = colour, y = unclass(checks),
color = colour)) + ggtitle("Prior Police Checks by Race and Released (Yes/No)")
Generating histograms and nonparametric density estimates is easy with ggplot. Here are some examples taken from the R Cookbook by Winston Chang:
# Simulate data from the mixture 0.5 N(0,1) + 0.5 N(0.5,1)
df <- data.frame(cond = factor(rep(c("A", "B"), each = 200)),
                 rating = c(rnorm(200), rnorm(200, mean = .5)))
# Basic histogram from the vector "rating". Each bin is .5 wide.
ggplot(df, aes(x = rating)) + geom_histogram(binwidth = .5)
# Draw with black outline, white fill
ggplot(df, aes(x=rating)) + geom_histogram(binwidth=.5, colour="black", fill="white")
# Density curve
ggplot(df, aes(x=rating)) + geom_density()
# Histogram overlaid with kernel density curve
# (after_stat(density) replaces the deprecated ..density.. notation; needs ggplot2 3.3.0 or later)
ggplot(df, aes(x = rating)) +
  geom_histogram(aes(y = after_stat(density)),  # density instead of count on y-axis
                 binwidth = .5, colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "pink")  # overlay with transparent density plot
We hope that this short tutorial gave you a sense of how ggplot2 functions as a powerful and flexible visualization package. For more information, we encourage you to consult some of the references below, which were used to create this tutorial. We made especially heavy use of the ggplot2 book by H. Wickham, which we strongly recommend.