In the early 1990s, Richard Becker and William Cleveland (two researchers at Bell Labs) built a revolutionary new system for displaying data called Trellis graphics. The lattice package is an implementation of Trellis graphics in R.
The lattice package provides a different way to plot graphics in R. Lattice graphics are created with different functions, and have different options. These functions make it easy to do some things that are hard to do with standard graphics, such as plotting multiple plots on the same page or superimposing plots. Additionally, most lattice functions can produce clean, readable output by default.
The real strength of the lattice package is in splitting a chart into different panels (shown in a grid), or groups (shown with different colors or symbols) using a conditioning or grouping variable.
To display a lattice object, call print.trellis() or plot.trellis() with the lattice object as an argument. (This typically happens automatically on the R console.) plot.trellis() sets up the matrix of panels, assigns packets to different panels, and then calls the panel function specified in the lattice object to draw the individual panels.
Data sets in lattice package:
Dataset | Description |
---|---|
USMortality | Mortality Rates in US by Cause and Gender |
USRegionalMortality | Mortality Rates in US by Cause and Gender |
barley | Yield data from a Minnesota barley trial |
environmental | Atmospheric environmental conditions in New York City |
ethanol | Engine exhaust fumes from burning ethanol |
melanoma | Melanoma skin cancer incidence |
singer | Heights of New York Choral Society singers |
As you may have noticed, arguments within the lattice package are much more consistent than those in the graphics package. (For example, data for barplot() is specified with the height argument, while data for plot() is specified with x and y.) With lattice, you can always specify the data to plot using a formula and a data frame.
Here is a simple example using xyplot():
library(lattice)
d <- data.frame(x = c(0:9), y = c(1:10), z = c(rep(c("a", "b"), times = 5)))
d
x | y | z |
---|---|---|
0 | 1 | a |
1 | 2 | b |
2 | 3 | a |
3 | 4 | b |
4 | 5 | a |
5 | 6 | b |
6 | 7 | a |
7 | 8 | b |
8 | 9 | a |
9 | 10 | b |
# Figure 1
xyplot(y~x, data = d)
To plot this data frame, we’ll use the formula y~x and specify the data frame d. The first argument given is the formula. Formulas in the lattice package can also specify a conditioning variable. The conditioning variable is used to assign data points to different panels.
# Figure 2
xyplot(y~x|z, data = d)
As you can see, the data is now split into two panels. If you would prefer to see the two data series superimposed on the same plot, you can use the argument groups to specify the grouping variable(s).
# Figure 3
xyplot(y~x, groups = z, data = d)
As shown in the figure above, the two data series are represented by different symbols.
The easiest way to use lattice graphics is by calling a high-level plotting function. Most of these functions are the equivalent of a similar function in the graphics package. Here is a table showing how standard graphics functions map to lattice functions:
Graphics package function | Lattice package function | Description |
---|---|---|
barplot | barchart | Bar and column charts |
dotchart | dotplot | Cleveland dot plots |
hist | histogram | Histograms |
density | densityplot | Kernel density plots |
plot.density | densityplot | Kernel density plots |
stripchart | stripplot | Strip charts |
plot | xyplot | Scatter plots |
pairs | splom | Scatter plot matrices |
image | levelplot | Image plots |
contour | contourplot | Contour plots |
persp | cloud, wireframe | Perspective charts of three-dimensional data |
– | qqmath | Quantile-quantile plots |
– | qq | Quantile-quantile plots |
When you call a high-level lattice function, it does not actually plot the data. Instead, each of these functions returns a lattice object. To actually show the graphic, you need to use a print or plot command.
# Figure 4
obj <- xyplot(y~x, data = d)
plot(obj)
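Because the result is an ordinary R object, it can be stored, inspected, and modified before anything is drawn. The following is a minimal sketch of that idea; the data frame is recreated here so the block is self-contained, and the title text and the variable name obj2 are only illustrative.

```r
library(lattice)

# Small illustrative data frame (same shape as the one used above)
d <- data.frame(x = 0:9, y = 1:10, z = rep(c("a", "b"), times = 5))

# Nothing is drawn here: the call only builds and returns an object
obj <- xyplot(y ~ x, data = d)
class(obj)   # "trellis"

# update() returns a modified copy; the original object is left unchanged
obj2 <- update(obj, main = "Updated title", pch = 19)

# Drawing happens only when the object is printed (or plotted)
print(obj2)
```

Calling print() explicitly is also what you need inside functions, loops, or scripts run with source(), where the automatic printing of the console does not take place.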
For some (but not all) lattice functions, it is possible to specify the source data in multiple forms. For example, histogram() can accept data arguments as data frames, factors or numeric vectors.
# Figure 5
x <- rnorm(100)
histogram(x, data = NULL)
Here is a table of data types accepted by different lattice functions:
Lattice function | Data types |
---|---|
barchart | Array, formula, matrix, numeric vector, table |
dotplot | Array, formula, matrix, numeric vector, table |
histogram | Factor, formula, numeric vector |
densityplot | Formula, numeric vector |
stripplot | Formula, numeric vector |
qqmath | Formula, numeric vector |
xyplot | Formula |
qq | Formula |
splom | Data frame formula, matrix |
levelplot | Array, formula, matrix, table |
contourplot | Array, formula, matrix, table |
cloud | Formula, matrix, table |
wireframe | Formula, matrix |
For more details on arguments to lattice functions, see the later section “Customizing Lattice Graphics”.
With standard graphics, you could easily superimpose points, lines, text, and other objects on existing charts. It’s possible to do the same thing with lattice graphics.
In order to add extra graphical elements to a lattice plot, you need to use a custom panel function.
# Figure 6
xyplot(y~x|z,
data = d,
panel = function(...){panel.abline(a = 1, b = 1)
panel.xyplot(...)}
)
We create a new custom panel function that calls both panel.xyplot() and panel.abline(). The new panel function passes its arguments along to panel.xyplot(). We specify a line that crosses the y-axis at 1 (through the a = 1 argument to panel.abline()) and has slope 1 (through the b = 1 argument to panel.abline()).
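A panel function can also name x and y explicitly, which gives it access to the data for the panel being drawn. The sketch below is one possible variation, not part of the original example: it draws a dashed horizontal line at the mean of y in each panel. The data frame is recreated so the block runs on its own.

```r
library(lattice)

d <- data.frame(x = 0:9, y = 1:10, z = rep(c("a", "b"), times = 5))

p <- xyplot(y ~ x | z, data = d,
            panel = function(x, y, ...) {
              # x and y hold only the packet of data for the current panel
              panel.abline(h = mean(y), lty = 2)  # dashed line at the panel mean
              panel.xyplot(x, y, ...)             # then draw the points as usual
            })
print(p)
```

Since each panel receives only its own packet, the reference line differs between the "a" and "b" panels.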
In this section, we use the same data set for most of the examples: births in the United States during 2006. The version that is included in the nutshell package only contains a 10% sample from the original data file. Each record includes the following variables:
Variable | Description |
---|---|
DOB_MM | Month of date of birth |
DOB_WK | Day of week of birth |
MAGER | Mother’s age |
TBO_REC | Total birth order |
WTGAIN | Weight gain by mother |
SEX | A factor with levels F M, representing the sex of the child |
APGAR5 | APGAR score |
DMEDUC | Mother’s education level |
UPREVIS | Number of prenatal visits |
ESTGEST | Estimated weeks of gestation |
DMETH_REC | Delivery Method |
DPLURAL | “Plural Births;” levels include 1 Single, 2 Twin, 3 Triplet or higher |
DBWT | Birth weight, in grams |
You can load this data set with the following code:
library(nutshell)
data(births2006.smpl)
# Figure 7
births.dow <- table(births2006.smpl$DOB_WK)
barchart(births.dow)
Notice that many more babies are born on weekdays than on weekends. That’s a little surprising.
You might wonder if there is a difference in the number of births because of the delivery method; maybe doctors just schedule a lot of cesarean sections on weekdays, and natural births occur all the time. This is the type of question that the lattice package is great for answering.
# Figure 8
births2006.dm <- transform(births2006.smpl[births2006.smpl$DMETH_REC != "Unknown", ],
DMETH_REC = as.factor(as.character(DMETH_REC)))
dob.dm.tbl <- table(WK = births2006.dm$DOB_WK, MM = births2006.dm$DMETH_REC)
barchart(dob.dm.tbl)
By default, barchart prints stacked bars with no legend and the different colors show different groups. But notice that the different shades aren’t labeled, so it’s not immediately obvious what each shade represents. Let’s try to change the way the chart is displayed.
Let’s unstack the bars (stack = FALSE) and add a legend (auto.key = TRUE):
# Figure 9
barchart(dob.dm.tbl, stack = FALSE, auto.key = TRUE)
It’s a little easier to see that both types of births decrease on weekends, but it’s still a little difficult to compare values within each group. So let’s try a different approach.
Let’s turn off grouping so that each delivery method gets its own panel (groups = FALSE) and change to columns (horizontal = FALSE):
# Figure 10
barchart(dob.dm.tbl, groups = FALSE, horizontal = FALSE)
The two different charts are in different panels. Now, we can more clearly see what’s going on. The number of vaginal births decreases on weekends, by maybe 25 to 30%. However, C-sections drop by 50 to 60%. As you can see, lattice graphics let you quickly try different ways to present information, helping you zero in on the method that best illustrates what is happening in the data.
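If the nutshell package is not installed, the same three presentations can be tried on any two-way table. The sketch below uses the mtcars data set that ships with R; the table and object names are chosen only for illustration.

```r
library(lattice)

# mtcars ships with R; cyl = number of cylinders, gear = number of gears
cyl.gear <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# Default: stacked horizontal bars, here with a legend added
stacked <- barchart(cyl.gear, auto.key = TRUE)

# Side-by-side bars within each category
unstacked <- barchart(cyl.gear, stack = FALSE, auto.key = TRUE)

# One panel per level of the second variable, drawn as columns
panels <- barchart(cyl.gear, groups = FALSE, horizontal = FALSE)

print(panels)
```

Storing each variant in its own object makes it easy to print them one after another and compare.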
Like bar charts, dot plots are useful for showing data where there is a single point for each category, especially when we’re going to summarize larger data tables. For example, let’s look at a chart of data on births by day of week. Is the pattern we saw above a seasonal pattern?
# Figure 11
dob.dm.tbl.alt <- table(WEEK = births2006.dm$DOB_WK,
MONTH = births2006.dm$DOB_MM,
METHOD = births2006.dm$DMETH_REC)
dotplot(dob.dm.tbl.alt,
stack = FALSE,
auto.key = TRUE,
groups = TRUE
)
In this plot, we keep grouping turned on, so that different delivery methods are shown in different colors (groups = TRUE). To help highlight differences, we disable stacking (stack = FALSE). Finally, we print a key so that it’s obvious what each symbol represents (auto.key = TRUE).
As you can see, there are slight seasonal differences, but the overall pattern remains the same.
As another example of dot plots, let’s look at the tire failure data in the nutshell package. In 2003, the National Highway Traffic Safety Administration (NHTSA) began a study into the durability of radial tires on light trucks. (See this for links to this study.) Tests were carried out on six different types of tires. Here is a table of the characteristics of the tires:
Tire | Size | Load Index | Speed Rating | Brand | Model | OE Vehicle | OE Model |
---|---|---|---|---|---|---|---|
B | P195/65R15 | 89 | S | BF Goodrich | Touring T/A | Chevy | Cavalier |
C | P205/65R15 | 92 | V | Goodyear | Eagle GA | Lexus | ES300 |
D | P235/75R15 | 108 | S | Michelin | LTX M/S | Ford,Dodge | E 150 Van, Ram Van 1500 |
E | P265/75R16 | 114 | S | Firestone | Wilderness AT | Chevy/GMC | Silverado, Tahoe, Yukon |
H | LT245/75R16/E | 120/116 | Q | Pathfinder | ATR A/S OWL | NA | NA |
L | 255/65R16 | 109 | H | General | Grabber ST A/S | Mercedes | ML320 |
We focus on only three variables. Time_To_Failure is the time before each tire failed (in hours), Speed_At_Failure_km_h is the testing speed at which the tire failed, and Tire_Type is the type of tire tested. We know that tests were only run at certain stepped speeds; despite the fact that speed is a numeric variable, we can treat it as a factor.
# Figure 12
library(nutshell)
data(tires.sus)
dotplot(as.factor(Speed_At_Failure_km_h)~Time_To_Failure|Tire_Type, data = tires.sus)
This diagram lets us clearly see how quickly tires failed in each of the tests. For example, all type D tires failed quickly at the testing speed of 180 km/h, but some type H tires lasted a long time before failure.
The histogram is a very popular chart for showing the distribution of a variable. As an example of histograms, let’s look at average birth weights, grouped by number of births.
# Figure 13
histogram(~DBWT|DPLURAL, data = births2006.smpl)
This format helps make each chart readable by itself, but makes it difficult to compare the different groups.
# Figure 14
histogram(~DBWT|DPLURAL, data = births2006.smpl, layout = c(1, 5))
It’s easy to see that birth weights are roughly normally distributed within each group, but the mean weight drops as the number of births increases.
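The layout argument takes the number of columns first and the number of rows second. As a rough illustration on data that ships with lattice, the sketch below stacks the eight voice parts of the singer data set in a single column so that all panels share one x-axis.

```r
library(lattice)

# singer ships with lattice: singer heights (in inches) by voice part
n.parts <- nlevels(singer$voice.part)

h <- histogram(~ height | voice.part, data = singer,
               layout = c(1, n.parts))   # 1 column, one row per voice part
print(h)
```

With a single column, shifts in the center of each distribution line up vertically and are easier to compare.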
If you’d like to see a single line showing the distribution, instead of a set of columns representing bins, you can use kernel density plots.
# Figure 15
densityplot(~DBWT|DPLURAL,
data = births2006.smpl,
layout = c(1, 5),
plot.points = FALSE
)
By default, densityplot draws a strip chart under each chart, showing every data point. Because this data set is so big, we suppress the points by specifying plot.points = FALSE.
One advantage of density plots over histograms is that you can stack them on top of each other and still read the results.
# Figure 16
densityplot(~DBWT,
groups = DPLURAL,
data = births2006.smpl,
plot.points = FALSE,
auto.key = TRUE
)
As you can see, it’s easier to compare distribution shapes (and centers) by superimposing the charts.
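The same superimposing pattern works with any grouping factor. Here is a small sketch using the iris data set that ships with R (an illustrative choice, not the births data):

```r
library(lattice)

# One density curve per species, drawn in a single panel
dp <- densityplot(~ Sepal.Length,
                  groups = Species,
                  data = iris,
                  plot.points = FALSE,   # skip the strip of raw points
                  auto.key = TRUE)       # legend mapping colors to species
print(dp)
```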
Strip plots are a good alternative to histograms, especially when there isn’t much data to plot. Strip plots look similar to dot plots, but they show different information. Dot plots are designed to show one value per category (often a mean or a sum), while strip plots show many values. You can think of strip plots as one-dimensional scatter plots.
As an example of a strip plot, let’s look at the weights of babies born in sets of 4 or more. There were only 44 observations in our data set that match this description.
We’ll add some random vertical noise to make the points easier to read (jitter.data = TRUE):
# Figure 17
stripplot(~DBWT,
data = births2006.smpl,
subset = (DPLURAL == "5 Quintuplet or higher" | DPLURAL == "4 Quadruplet"),
jitter.data = TRUE
)
A quantile-quantile plot is a useful plot that can compare the distribution of actual data values to a theoretical distribution. It plots quantiles of the observed data against quantiles of a theoretical distribution. If the plotted points form a straight diagonal line (from top right to bottom left), then it is likely that the observed data comes from the theoretical distribution. Quantile-quantile plots are a very powerful technique for seeing how closely a data set matches a theoretical distribution (or how much it deviates from it).
# Figure 18
qqmath(rnorm(100000))
By default, the function qqmath compares the sample data to a normal distribution. If the sample data is really normally distributed, you’ll see a straight diagonal line.
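The reference distribution can be changed through the distribution argument, which expects a quantile function. The sketch below uses simulated exponential data (an assumption made purely for illustration) and compares it first with the default normal quantiles and then with exponential quantiles.

```r
library(lattice)

set.seed(1)
x <- rexp(500)   # simulated, deliberately non-normal data

# Default: compare the sample with normal quantiles (points should curve)
q.norm <- qqmath(~ x)

# Compare with exponential quantiles instead (points should be nearly straight)
q.exp <- qqmath(~ x, distribution = qexp)

print(q.exp)
```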
# Figure 19
qqmath(~DBWT|DPLURAL,
data = births2006.smpl[sample(1:nrow(births2006.smpl), 50000), ],
pch = 19,
cex = 0.25,
subset = (DPLURAL != "5 Quintuplet or higher")
)
As you can see from the figure above, the distribution of birth weights is not quite normal.
As another example, let’s look at real estate prices in San Francisco in 2008 and 2009. This data set is included in the nutshell package as sanfrancisco.home.sales.
# Figure 20
library(nutshell)
data(sanfrancisco.home.sales)
qqmath(~price, data = sanfrancisco.home.sales)
As expected, real estate prices are not normally distributed. Intuitively, it doesn’t make sense for real estate prices to be normally distributed. There are far more people with below-average incomes than above-average incomes. The lowest recorded price in the data set is $100,000; the highest is $9,500,000.
But it looks exponential, so let’s try a log transform.
# Figure 21
qqmath(~log(price), data = sanfrancisco.home.sales)
A log transform yields a distribution that looks pretty close to normally distributed.
Next, let’s take a look at how the distribution changes based on the number of bedrooms. We’ll draw smooth lines instead of individual points (type = "smooth") and group the data by the number of bedrooms (groups = bedrooms):
# Figure 22
qqmath(~log(price),
groups = bedrooms,
data = subset(sanfrancisco.home.sales,
!is.na(bedrooms) & bedrooms>0 & bedrooms<7),
auto.key = TRUE,
drop.unused.levels = TRUE,
type = "smooth"
)
In this call, we pass an explicit subset of the data as the data argument instead of using the subset argument. Notice that the lines are separate, with higher values for higher numbers of bedrooms. drop.unused.levels is a logical flag indicating whether unused levels of factors will be dropped.
We can do the same thing for square footage.
# Figure 23
library(Hmisc)  # provides cut2()
qqmath(~log(price),
groups = cut2(squarefeet, g = 6),
data = subset(sanfrancisco.home.sales, !is.na(squarefeet)),
auto.key = TRUE,
drop.unused.levels = TRUE,
type = "smooth"
)
The function cut2 from the Hmisc package divides the square footage values into six groups containing roughly equal numbers of observations.
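If Hmisc is not available, a similar grouping can be approximated in base R with cut() and quantile(). This is a substitute technique, not what the example above uses, and the square footage values below are simulated for illustration.

```r
set.seed(42)
sqft <- runif(600, min = 400, max = 5000)   # made-up square footage values

# Seven break points (the minimum, five interior quantiles, the maximum)
breaks <- quantile(sqft, probs = seq(0, 1, length.out = 7))

# include.lowest = TRUE keeps the smallest value in the first group
size.group <- cut(sqft, breaks = breaks, include.lowest = TRUE)

table(size.group)
```

The resulting factor can be passed to groups in the same way as the output of cut2. With heavily tied data the quantile breaks may not be unique, in which case cut() will fail and the breaks need to be deduplicated first.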
This section describes Trellis plots for plotting two variables. Many real data sets (for example, financial data) record relationships between multiple numeric variables.
As an example of a scatter plot, let’s take a look at the relationship between house size and price.
# Figure 24
xyplot(price~squarefeet, data = sanfrancisco.home.sales)
It looks like there is a rough correspondence between size and price (the plot looks vaguely cone shaped). Let’s analyze it further.
table(subset(sanfrancisco.home.sales, !is.na(squarefeet), select = zip))
##
## 94100 94102 94103 94104 94105 94107 94108 94109 94110 94111 94112 94114
## 2 52 62 4 44 147 21 115 161 12 192 143
## 94115 94116 94117 94118 94121 94122 94123 94124 94127 94131 94132 94133
## 101 124 114 92 92 131 71 85 108 136 82 47
## 94134 94158
## 105 13
# Figure 25
xyplot(price~squarefeet|zip,
data = sanfrancisco.home.sales,
subset = (zip!=94100 & zip!=94104 & zip!=94108 & zip!=94111 &
zip!=94133 & zip!=94158 & price<4000000 &
ifelse(is.na(squarefeet), FALSE, squarefeet<6000)),
strip = strip.custom(strip.levels = TRUE)
)
The subset argument picks a subset of zip codes to plot. A few parts of the city are sparsely populated (like the financial district, 94104) and don’t have enough data to make plotting interesting. strip.custom() returns a function that draws the strips according to the arguments you specify. Its strip.levels argument is a logical vector of length 2, indicating whether or not the level of the conditioning variable is to be written on the strip.
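To see the effect of these strip arguments without the home sales data, here is a small sketch on the iris data set. The separator string and text size are arbitrary choices for illustration.

```r
library(lattice)

# strip.custom() returns a strip-drawing function with these settings baked in
my.strip <- strip.custom(strip.names = TRUE,    # write the variable name
                         strip.levels = TRUE,   # and the level of the factor
                         sep = " = ",
                         par.strip.text = list(cex = 0.8))

p <- xyplot(Sepal.Length ~ Sepal.Width | Species, data = iris,
            strip = my.strip)
print(p)
```

Each strip should then show both the conditioning variable and its level, rather than the level alone.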
Now, the linear relationship is much more pronounced. Notice the different slopes in different neighborhoods. We can make this slightly more readable by using neighborhood names. We’ll also draw the points as solid circles (pch = 19) and shrink them to a smaller size (cex = .2):
# Figure 26
dollars.per.squarefoot <- mean(
sanfrancisco.home.sales$price / sanfrancisco.home.sales$squarefeet,
na.rm = TRUE)
xyplot(price~squarefeet|neighborhood,
data = sanfrancisco.home.sales,
pch = 19,
cex = .2,
subset = (zip!=94100 & zip!=94104 & zip!=94108 & zip!=94111 &
zip!=94133 & zip!=94158 & price<4000000 &
ifelse(is.na(squarefeet), FALSE, squarefeet<6000)),
strip = strip.custom(strip.levels = TRUE,
horizontal = TRUE,
par.strip.text = list(cex = .8)),
panel = function(...) {panel.abline(a = 0, b = dollars.per.squarefoot)
panel.xyplot(...)}
)
Box plots in the lattice package are just like box plots drawn with the graphics package. The boxes represent prices from the 25th through the 75th percentiles (the interquartile range), the dots represent median prices, and the whiskers represent the minimum or maximum values. (When there are values that stretch beyond 1.5 times the length of the interquartile range, the whiskers are truncated at those extremes.)
Let’s take a look at how the San Francisco home prices changed over time. We can use box plots to watch how the whole distribution changed in this period.
# Figure 27
table(cut(sanfrancisco.home.sales$date, "month"))
##
## 2008-02-01 2008-03-01 2008-04-01 2008-05-01 2008-06-01 2008-07-01
## 139 230 267 253 237 198
## 2008-08-01 2008-09-01 2008-10-01 2008-11-01 2008-12-01 2009-01-01
## 253 223 272 118 181 114
## 2009-02-01 2009-03-01 2009-04-01 2009-05-01 2009-06-01 2009-07-01
## 123 142 116 180 150 85
bwplot(price~cut(date, "month"), data = sanfrancisco.home.sales)
Unfortunately, this doesn’t produce an easily readable plot, and a large number of outliers make the plot hard to read. Let’s try plotting the box plots again, this time with the log-transformed values. To make it more readable, we can change to vertical box plots and rotate the text at the bottom:
# Figure 28
bwplot(log(price)~cut(date, "month"),
data = sanfrancisco.home.sales,
scales = list(x = list(rot = 90))
)
We can more clearly see some trends in this plot. Median prices moved around a little during this period, though the interquartile range moved a lot. Moreover, the basic distribution appears pretty stable from month to month.
If you would like to generate a matrix of scatter plots for many different pairs of variables, you can use the splom function.
head(iris)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
# Figure 29
super.sym <- trellis.par.get("superpose.symbol")
splom(~iris[1:4], groups = Species, data = iris,
panel = panel.superpose,
key = list(title = "Three Varieties of Iris",
columns = 3,
points = list(pch = super.sym$pch[1:3],
col = super.sym$col[1:3]),
text = list(c("Setosa", "Versicolor", "Virginica")))
)
If you would like to generate quantile-quantile plots for comparing two distributions, you can use the function qq.
# library(lattice)
head(singer)
height | voice.part |
---|---|
64 | Soprano 1 |
62 | Soprano 1 |
66 | Soprano 1 |
65 | Soprano 1 |
60 | Soprano 1 |
61 | Soprano 1 |
# Figure 30
qq(voice.part ~ height,
data = singer,
aspect = 1,
subset = (voice.part == "Bass 2" | voice.part == "Tenor 1")
)
aspect = 1 means the length of the x-axis is equal to that of the y-axis. As we can see, there is a small difference between the two distributions, because the plotted points don’t form a straight diagonal line.
If you would like to plot three-dimensional data with Trellis graphics, there are several functions available.
The levelplot function can plot three-dimensional data in flat grids, with colors showing different values for the third dimension. As an example of level plots, we again look at the San Francisco home sales data set.
# Figure 31
attach(sanfrancisco.home.sales)
levelplot(table(cut(longitude, breaks = 40), cut(latitude, breaks = 40)),
scales = list(y = list(cex = .5), x = list(rot = 90, cex = .5)),
xlab = "longitude",
ylab = "latitude"
)
The table and cut functions break the longitude and latitude data into bins and count the number of homes within each bin. The xlab and ylab arguments are characters or expressions (or “grobs”) giving the labels for the x-axis and y-axis.
If we were interested in looking at the average sales price by area, we could use a similar strategy. Instead of table, you can use the tapply function to aggregate observations.
# Figure 32
levelplot(tapply(price,
INDEX = list(cut(longitude, breaks = 40), cut(latitude, breaks = 40)),
FUN = mean),
scales = list(draw = FALSE),
xlab = "longitude",
ylab = "latitude"
)
scales is generally a list determining how the x- and y-axes (tick marks and labels) are drawn. draw is a logical flag that determines whether to draw the axis (i.e., tick marks and labels) at all.
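The binning step can be tried in isolation on simulated data. In the sketch below, the coordinates and prices are randomly generated stand-ins (not the San Francisco data), and the bin count of 10 is arbitrary.

```r
library(lattice)

set.seed(7)
lon   <- rnorm(2000, mean = -122.43, sd = 0.03)     # simulated longitudes
lat   <- rnorm(2000, mean = 37.76, sd = 0.02)       # simulated latitudes
price <- rlnorm(2000, meanlog = 13.5, sdlog = 0.5)  # simulated prices

lon.bin <- cut(lon, breaks = 10)
lat.bin <- cut(lat, breaks = 10)

# Number of observations in each grid cell
counts <- table(lon.bin, lat.bin)

# Mean price in each grid cell; cells with no observations are NA
means <- tapply(price, INDEX = list(lon.bin, lat.bin), FUN = mean)

# Rows of the matrix are drawn along the x-axis, columns along the y-axis
lp <- levelplot(means, scales = list(draw = FALSE),
                xlab = "longitude", ylab = "latitude")
print(lp)
```

Empty cells are simply left uncolored, which is why sparse regions of the grid appear blank.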
Of course, you can use conditioning values with level plots.
# Figure 33
bedrooms.capped <- ifelse(bedrooms<5, bedrooms, 5)
levelplot(table(cut(longitude, breaks = 25),cut(latitude, breaks = 25), bedrooms.capped),
scales = list(draw = FALSE)
)
The contourplot function draws contour plots with lattice (which resemble topographic maps). We can use ?contourplot to see more details.
# Figure 34
contourplot(volcano)
The cloud function plots points in three dimensions (technically, projections into two dimensions of the points in three dimensions). You can use ?cloud to see more details. The data set volcano is in the datasets package.
# Figure 35
cloud(volcano, zlab = list("volcano", rot = 90))
If you would like to show a three-dimensional surface, you could use the function wireframe.
# Figure 36
wireframe(volcano, zlab = list("volcano", rot = 90))
If you have fitted a model to a data set, the rfs function can help you visualize how well the model fits the data. The rfs function plots residual and fit-spread (RFS) plots.
# library(nutshell)
data(team.batting.00to08)
head(team.batting.00to08)
teamID | yearID | runs | singles | doubles | triples | homeruns | walks | stolenbases | caughtstealing | hitbypitch | sacrificeflies | atbats |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ANA | 2000 | 864 | 995 | 309 | 34 | 236 | 608 | 93 | 52 | 47 | 43 | 5628 |
BAL | 2000 | 794 | 992 | 310 | 22 | 184 | 558 | 126 | 65 | 49 | 54 | 5549 |
BOS | 2000 | 792 | 988 | 316 | 32 | 167 | 611 | 43 | 30 | 42 | 48 | 5630 |
CHA | 2000 | 978 | 1041 | 325 | 33 | 216 | 591 | 119 | 42 | 53 | 61 | 5646 |
CLE | 2000 | 950 | 1078 | 310 | 30 | 221 | 685 | 113 | 34 | 51 | 52 | 5683 |
DET | 2000 | 823 | 1028 | 307 | 41 | 177 | 562 | 83 | 38 | 43 | 49 | 5644 |
# Figure 37
attach(team.batting.00to08)
rfs(lm(formula = runs~singles+doubles+triples+homeruns+walks+
hitbypitch+sacrificeflies+stolenbases+caughtstealing,
data = team.batting.00to08),
aspect = 1
)
Notice that the two curves are S shaped. The residual plot is a quantile-quantile plot of the residuals. Because the default distribution for rfs is a uniform distribution, we set distribution = qnorm to compare the residuals with a normal distribution instead.
# Figure 38
rfs(lm(formula = runs~singles+doubles+triples+homeruns+walks+
hitbypitch+sacrificeflies+stolenbases+caughtstealing,
data = team.batting.00to08),
aspect = 1,
distribution = qnorm
)
Notice that the plots are roughly linear. We expect a normally distributed error function for a linear regression model, and this is a good thing.
Most lattice functions share common arguments; the same argument has a similar effect in multiple functions. Let’s describe what each of those arguments does and explain how to fine-tune the output of lattice functions.
Rather than explaining these arguments separately for each function, we list them in a single table.
Argument | Description |
---|---|
x | The object to plot. May be a formula, array, numeric vector, or table. |
data | When x is a formula, data is a data frame in which the function is evaluated. |
allow.multiple | Specifies how to interpret formulas of the form y1 + y2 ~ X | Z (where X is a function of multiple variables and Z may also be a function of multiple variables) |
outer | Specifies whether to superimpose plots or not when allow.multiple=TRUE and multiple dependent variables are specified. |
box.ratio | For plots that show data in rectangles (bwplot, barchart, and stripplot), a numeric value that specifies the ratio of the width of the rectangles to the inter-rectangle space. |
horizontal | For plots that can be laid out vertically or horizontally (bwplot, dotplot, barchart and stripplot), a logical value that specifies the direction to plot. |
panel | The panel function used to actually draw the plots. |
aspect | Specifies the aspect ratio to use for different panels. |
groups | Specifies a variable (or expression of variables) describing groups of data to pass to the panel function. |
auto.key | A logical value specifying whether to automatically draw a key showing the names of groups corresponding to different colors or symbols. |
prepanel | A function that takes the same arguments as panel and returns a list containing values xlim, ylim, dx, and dy (and, less frequently, xat and yat). |
strip | A logical value specifying whether strips (that label panels) should be drawn, or a function that draws the strips (such as one returned by strip.custom). |
xlab | A character value specifying the label for the x-axis. |
ylab | A character value specifying the label for the y-axis. |
scales | A list that specifies how the x- and y-axes should be drawn. |
subscripts | A logical value specifying whether a vector named subscripts should be passed to the panel function. |
subset | Specifies the subset of values from data to plot. |
xlim | Specifies the minimum and maximum values for the x-axis. |
ylim | Specifies the minimum and maximum values for the y-axis. |
drop.unused.levels | A logical value (or a list outlining what to do for different components of x) specifying whether to drop unused levels of factors. |
default.scales | A list giving the default value of scales. |
lattice.options | A list of plotting parameters, similar to par values for standard R graphics. |
If you would like to get more information about these arguments, see the help files.
You can control how axes are drawn in the lattice package by named values in the argument scales. Here is a table of the available arguments.
Argument | Description |
---|---|
rot | Angle to rotate axis labels. Can specify a vector of length 2 to separately control left/bottom and right/top. |
cex | A numeric value that controls the size of axis labels (“character expansion” factor). Can specify a vector of length 2 to separately control left/bottom and right/top. |
limits | Limits for each axis; equivalent to xlim and ylim. |
axs | Use axs="r" to pad data values on each side, axs="i" to use exact values. |
at | A numeric vector describing where to plot tick marks (in native coordinates) or a list describing where to plot tick marks for each panel. |
labels | Labels to accompany at, specified as a vector (or list of vectors). |
tck | A numeric value specifying the length of the tick marks. |
For more details on other arguments, see page 310 of the book R in a Nutshell.
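As a rough illustration of how these named values fit together, the sketch below sets several of them at once. The tick positions, labels, and sizes are arbitrary values chosen for the example.

```r
library(lattice)

d <- data.frame(x = 0:9, y = 1:10)

p <- xyplot(y ~ x, data = d,
            scales = list(
              # Settings that apply to the x-axis only
              x = list(at = c(0, 3, 6, 9),                          # tick positions
                       labels = c("zero", "three", "six", "nine"),  # tick labels
                       rot = 45),                                   # rotate labels
              # Settings that apply to the y-axis only
              y = list(cex = 0.7,    # smaller axis labels
                       tck = 0.5)))  # shorter tick marks
print(p)
```

Values placed directly in the scales list (outside x and y) apply to both axes.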
In the graphics package, we use the par function to set or query default parameters.
par("cex")
## [1] 1
There is a similar mechanism for lattice graphics. To check the value of a setting, use the function trellis.par.get. To change a setting, use the function trellis.par.set. For example, let’s look at the axis.text parameter, which controls the look of text printed on axes:
trellis.par.get("axis.text")
## $alpha
## [1] 1
##
## $cex
## [1] 0.8
##
## $col
## [1] "#000000"
##
## $font
## [1] 1
##
## $lineheight
## [1] 1
To change axis.text$cex to 0.5:
trellis.par.set(list(axis.text = list(cex = 0.5)))
show.settings()
Called with no arguments, trellis.par.get() returns all of the settings, so we can list their names:
names(trellis.par.get())
## [1] "grid.pars" "fontsize" "background"
## [4] "panel.background" "clip" "add.line"
## [7] "add.text" "plot.polygon" "box.dot"
## [10] "box.rectangle" "box.umbrella" "dot.line"
## [13] "dot.symbol" "plot.line" "plot.symbol"
## [16] "reference.line" "strip.background" "strip.shingle"
## [19] "strip.border" "superpose.line" "superpose.symbol"
## [22] "superpose.polygon" "regions" "shade.colors"
## [25] "axis.line" "axis.text" "axis.components"
## [28] "layout.heights" "layout.widths" "box.3d"
## [31] "par.xlab.text" "par.ylab.text" "par.zlab.text"
## [34] "par.main.text" "par.sub.text"
There are 35 high-level groups of parameters describing how different components are drawn. If you want to know the details of what each of these groups of parameters controls, please refer to page 313 of the book R in a Nutshell.
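Because these settings persist for the current graphics device, it can be helpful to save a value before changing it and restore it afterwards. A minimal sketch of that pattern follows; the variable names are illustrative.

```r
library(lattice)

# Remember the current axis.text settings
old <- trellis.par.get("axis.text")

# Shrink the axis text; other components of axis.text are left as they were
trellis.par.set(list(axis.text = list(cex = 0.5)))
new.cex <- trellis.par.get("axis.text")$cex

# ... draw plots with the smaller axis text here ...

# Restore the original settings
trellis.par.set(list(axis.text = old))
restored.cex <- trellis.par.get("axis.text")$cex
```

Note that querying or changing settings opens a graphics device if none is open yet.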
plot.trellis
As we noted above, lattice functions do not plot results; they return lattice objects. To plot a lattice object, you need to call print.trellis() or plot.trellis() on the lattice object. You can control how lattice objects are printed by changing the arguments to these functions.
For details on the arguments to plot.trellis(), see its help page:
?plot.trellis
To change the way strips are drawn, you can specify your own strip function as an argument to a lattice function. The simplest way to modify the appearance of the strips is by using the function strip.custom. This function accepts the same arguments as strip.default and returns a new function that can be specified as an argument to a lattice function. We used this function when plotting Figure 25.
For details on the arguments to strip.default, see its help page:
?strip.default
The lattice package includes a variety of different panel functions that you can use to customize your charts. For example, you can use these functions to add lines, text, and other graphical elements to lattice graphics.
Function(s) | Description |
---|---|
llines, panel.lines | Plots lines |
lpoints, panel.points | Plots points |
ltext, panel.text | Plots text |
panel.axis | Plots axes |
panel.abline | Adds a line to the chart area of a panel. |
panel.curve | Adds a curve (defined by a mathematical expression) to the chart area of a panel. |
panel.mathdensity | Plots a probability distribution given by a distribution function. |
panel.lmline | Plots a line fitted to the underlying data by a linear regression. |
For more details on other functions, see the corresponding help files and pages 318-319 of the book R in a Nutshell.
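Several of these functions can be combined inside one custom panel function. The sketch below, which uses the mtcars data set that ships with R as an illustrative choice, draws the points, adds a least-squares line per panel, and annotates each panel with its number of observations.

```r
library(lattice)

p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)        # the points themselves
              panel.lmline(x, y, lty = 2)    # per-panel linear regression line
              panel.text(x = max(x), y = max(y),
                         labels = paste("n =", length(x)),
                         adj = c(1, 1), cex = 0.8)  # label anchored at top right
            })
print(p)
```

The order of the calls matters: elements drawn later are painted on top of earlier ones.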
ggplot2 is an R package for producing statistical, or data, graphics, but it is unlike most other graphics packages because it has a deep underlying grammar. This layered grammar, based on the Grammar of Graphics (Wilkinson 2005), focuses on the primacy of layers and is adapted for embedding within R. In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The grammar is made up of a set of independent components, which makes ggplot2 very powerful: you are not limited to a set of pre-specified graphics, but can create new graphics that are precisely tailored to your problem.
This package is not part of a standard R installation, so it must first be installed (for example with install.packages("ggplot2")). It can then be loaded into R as follows.
library(ggplot2)
In ggplot2
, all plots are composed of:
Data
Layers made up of geometric elements (geom), including points, lines and polygons, and statistical transformation (stat), such as histogram or a 2d relationship with a linear model.
Scales (scale) map values in the data space to values in an aesthetic space, whether it be colour, or size, or shape.
A coordinate system (coord) describes how data coordinates are mapped to the plane of the graphic.
A faceting specification (facet) describes how to break up the data into subsets and how to display those subsets as small multiples.
A theme which controls the finer points of display, like the font size and background colour.
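The sketch below is one way to see all of these components in a single call. It uses the mpg data set bundled with ggplot2; the particular geoms, palette, limits and theme are arbitrary choices made for illustration.

```r
library(ggplot2)

p_parts <- ggplot(mpg, aes(displ, hwy)) +                    # data + aesthetic mappings
  geom_point(aes(colour = drv)) +                            # layer: point geom
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # layer: linear-model stat
  scale_colour_brewer(palette = "Dark2") +                   # scale for the colour aesthetic
  coord_cartesian(ylim = c(10, 45)) +                        # coordinate system
  facet_wrap(~year) +                                        # faceting specification
  theme_bw()                                                 # theme
print(p_parts)
```

Each component is added with + and can be swapped independently of the others.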
Data sets in ggplot2 package:
Dataset | Description |
---|---|
diamonds | Prices of 50,000 round cut diamonds |
economics | US economic time series |
economics_long | US economic time series |
faithfuld | 2d density estimate of Old Faithful data |
luv_colours | ‘colors()’ in Luv space |
midwest | Midwest demographics |
mpg | Fuel economy data from 1999 and 2008 for 38 popular models of car |
msleep | An updated and expanded version of the mammals sleep dataset |
presidential | Terms of 11 presidents from Eisenhower to Obama |
seals | Vector field of seal movements |
txhousing | Housing sales in TX |
In this section, we’ll mostly use one data set that’s bundled with ggplot2: mpg. It includes information about the fuel economy of popular car models in 1999 and 2008, collected by the US Environmental Protection Agency.
head(mpg)
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
Variable | Description |
---|---|
cty and hwy | record miles per gallon (mpg) for city and highway driving. |
displ | the engine displacement in litres. |
drv | the drivetrain: front wheel (f), rear wheel (r) or four wheel (4). |
model | the model of car. |
class | a categorical variable describing the “type” of car: two seater, SUV, compact, etc. |
Every ggplot2 plot has three key components:
data
A set of aesthetic mappings between variables in the data and visual properties, and
At least one layer which describes how to render each observation. Layers are usually created with a geom function.
# Figure 1
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
This produces a scatterplot defined by:
Data: mpg.
Aesthetic mapping: engine size mapped to x position, fuel economy to y position.
Layer: points.
Pay attention to the structure of this function call: data and aesthetic mappings are supplied in ggplot()
, then layers are added on with +.
To add additional variables to a plot, we can use other aesthetics like colour, shape, and size. These work in the same way as the x and y aesthetics, and are added into the call to aes()
:
# Figure 2
ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point()
This gives each point a unique colour corresponding to its class. The legend allows us to read data values from the colour. You can also use aes(displ, hwy, shape = class)
and aes(displ, hwy, size = class)
to distinguish classes.
# Figure 3
ggplot(mpg, aes(displ, hwy)) +
geom_point(colour = "blue", shape = 17, size = 2)
Another technique for displaying additional categorical variables on a plot is facetting. Facetting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset.
There are two types of facetting: grid and wrapped. Wrapped is the most useful, so we use it as the example here.
# Figure 4
ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point() +
facet_wrap(~class)
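For comparison, here is a small sketch of the other type, grid facetting, which lays panels out in rows and columns defined by two variables. The choice of drv for rows and cyl for columns is only an illustration.

```r
library(ggplot2)

# Rows are split by drive train, columns by number of cylinders
p_grid <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)
print(p_grid)
```

Combinations of drv and cyl that do not occur in the data appear as empty panels, which is one reason wrapped facets are often more compact.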
If you substitute a different geom function for geom_point()
, you get a different type of plot. The table below lists some of the other important geoms provided in ggplot2. This isn’t an exhaustive list, but it should cover the most commonly used plot types.
Geoms | Description | Aesthetics |
---|---|---|
geom_point() | Data symbols | x, y, shape, fill, alpha, stroke |
geom_line() | Line (ordered on x) | x, y, alpha, linetype |
geom_path() | Line (original order) | x, y, alpha, linetype |
geom_text() | Text labels | x, y, label, angle, hjust, vjust |
geom_rect() | Rectangles | xmin, xmax, ymin, ymax, fill, linetype |
geom_polygon() | Polygons | x, y, fill, linetype |
geom_segment() | Line segments | x, y, xend, yend, linetype |
geom_bar() | Bars | x, y, alpha, fill, linetype |
geom_histogram() | Histogram | x, y, alpha, fill, linetype |
geom_boxplot() | Boxplots | x, lower, upper, middle, ymin, ymax, alpha, fill, weight, shape |
geom_density() | Density | x, y, fill, linetype, weight |
geom_contour() | Contour lines | x, y, alpha, fill, linetype, weight |
geom_smooth() | Smoothed line | x, y, alpha, fill, linetype, weight |
ALL | | color, size, group |
Here are some common aesthetics:
Aesthetics | Explanation |
---|---|
shape | takes four types of values: an integer in [0, 25], a single character, a “.”, or NA |
fill | sets the fill colour according to its value (works well in histograms and boxplots) |
alpha | makes the points transparent (very useful for larger datasets with more overplotting) |
stroke | modifies the width of the border |
linetype | the line pattern, e.g. solid (default), dotted or dashed |
label | the text to display, used by geom_text() and geom_label() |
angle | the rotation of the text in degrees |
hjust | controls horizontal justification, defined between 0 and 1 |
vjust | controls vertical justification, defined between 0 and 1 |
# Figure 5
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth()
This overlays the scatterplot
with a smooth curve, including an assessment of uncertainty in the form of point-wise confidence intervals shown in grey. If you’re not interested in the confidence interval, turn it off with geom_smooth(se = FALSE)
.
An important argument to geom_smooth()
is the method, which allows us to choose which type of model is used to fit the smooth curve:
method = "loess"
, the default for small n, uses a smooth local regression. The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly). Note that loess does not work well for large datasets (n > 1,000).
# Figure 6
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(span = 0.2)
method = "gam"
fits a generalised additive model provided by the mgcv package. You need to first load mgcv, then use a formula like formula = y ~ s(x)
or y ~ s(x, bs = "cs")
(for large data). This is what ggplot2
uses when there are more than 1,000 points.
# Figure 7
library(mgcv)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "gam", formula = y ~ s(x))
method = "lm"
fits a linear model, giving the line of best fit.
# Figure 8
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "lm")
When a set of data includes a categorical variable and one or more continuous variables, you will probably be interested to know how the values of the continuous variables vary with the levels of the categorical variable.
# Figure 9
ggplot(mpg, aes(drv, hwy)) +
geom_point()
Because there are few unique values of both drv and hwy, there is a lot of overplotting. Many points are plotted in the same location, and it’s difficult to see the distribution. There are three useful techniques that help alleviate the problem:
geom_jitter()
: adds a little random noise to the data, which can help avoid overplotting.
# Figure 10
ggplot(mpg, aes(drv, hwy)) +
geom_jitter()
geom_boxplot()
: summarises the shape of the distribution with a handful of summary statistics.
# Figure 11
ggplot(mpg, aes(drv, hwy)) +
geom_boxplot()
geom_violin()
: shows a compact representation of the density of the distribution, highlighting the areas where more points are found.
# Figure 12
ggplot(mpg, aes(drv, hwy)) +
geom_violin()
Histograms and frequency polygons show the distribution of a single numeric variable. They provide more information about the distribution of a single group than boxplots do, at the expense of needing more space.
# Figure 13
ggplot(mpg, aes(hwy)) +
geom_histogram()
# Figure 14
ggplot(mpg, aes(hwy)) +
geom_freqpoly()
Both of them bin the data, then count the number of observations in each bin. The only difference is the display: histograms use bars and frequency polygons use lines. However, the default just splits your data into 30 bins, which is unlikely to be the best choice. You should usually choose the width yourself with the binwidth argument.
# Figure 15
ggplot(mpg, aes(hwy)) +
geom_freqpoly(binwidth = 2.5)
# Figure 16
ggplot(mpg, aes(displ, fill = drv)) +
geom_histogram(binwidth = 0.5) +
facet_wrap(~drv, ncol = 1)
The discrete analogue of the histogram is the bar chart, geom_bar()
.
# Figure 17
ggplot(mpg, aes(manufacturer)) +
geom_bar(aes(fill = drv))
Line and path plots are typically used for time series data. Line plots join the points from left to right, while path plots join them in the order that they appear in the dataset. Line plots usually have time on the x-axis, showing how a single variable has changed over time. Path plots show how two variables have simultaneously changed over time, with time encoded in the way that observations are connected.
We’ll show some time series plots using the economics dataset, which contains economic data on the US measured over the last 40 years.
# Figure 18
ggplot(economics, aes(date, unemploy / pop)) +
geom_line()
Now we would like to examine the relationship between unemployment rate and length of unemployment. Meanwhile, we also need to see the evolution over time. The solution is to join points adjacent in time with line segments, forming a path plot.
# Figure 19
ggplot(economics, aes(unemploy / pop, uempmed)) +
geom_path(colour = "grey50") +
geom_point(aes(colour = date))
In the plot, we colour the points to make it easier to see the direction of time. Pay attention to the difference between geom_point(colour = ...)
, which sets the colour to a fixed value, and geom_point(aes(colour = ...))
, which maps the colour to a variable in the data.
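The difference between the two forms can be checked directly. The sketch below, using mpg rather than economics for brevity, inspects the colours that ggplot2 computes for each layer with layer_data(); the object names are illustrative.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(displ, hwy))

# Outside aes(): the colour is a constant applied to every point
p_set <- base_plot + geom_point(colour = "blue")
# Inside aes(): the string "blue" is treated as data and sent through a colour scale
p_map <- base_plot + geom_point(aes(colour = "blue"))

set_cols <- unique(layer_data(p_set)$colour)
map_cols <- unique(layer_data(p_map)$colour)
set_cols  # the literal colour that was requested
map_cols  # a colour picked by the default discrete scale, not blue
```

In the mapped version a legend with the single key "blue" also appears, which is usually a sign that a constant was placed inside aes() by mistake.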
Two families of useful helpers let you make the most common modifications. xlab()
and ylab()
modify the x- and y-axis labels, while xlim()
and ylim()
modify the limits of axes.
# Figure 20
ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = 0.25, size = 2) +
xlim("f", "r") +
ylim(20, 30) +
xlab("drive train") +
ylab("highway driving (mpg)")
xlab(NULL)
and ylab(NULL)
can remove the axis labels.
Most of the time we create a plot object and immediately plot it, but we can also save a plot to a variable and manipulate it:
p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
geom_point()
Once we have a plot object, there are a few things we can do with it:
Render it on screen with print()
:
# Figure 21
print(p)
Save it to disk with ggsave()
:
ggsave("plot.png", width = 5, height = 5)
Briefly describe its structure with summary()
:
summary(p)
## data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy,
## fl, class [234x11]
## mapping: x = ~displ, y = ~hwy, colour = ~factor(cyl)
## faceting: <ggproto object: Class FacetNull, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetNull, Facet, gg>
## -----------------------------------
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
Save a cached copy of it to disk with saveRDS()
. This saves a complete copy of the plot object, so you can easily re-create it with readRDS()
:
saveRDS(p, "plot.rds")
q <- readRDS("plot.rds")
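As a small sketch of the saving steps, the code below passes the plot to ggsave() explicitly through its plot argument instead of relying on the last plot displayed, and round-trips the object through an .rds file. The temporary file paths and the PDF format are assumptions chosen so the example leaves nothing behind in the working directory.

```r
library(ggplot2)

p_save <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_point()

# Save a specific plot object; the file type is taken from the extension
pdf_path <- tempfile(fileext = ".pdf")
ggsave(pdf_path, plot = p_save, width = 5, height = 5)

# Cache the plot object itself and read it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(p_save, rds_path)
p_restored <- readRDS(rds_path)
print(p_restored)
```

Naming the plot in ggsave() avoids accidentally saving a different plot that happened to be drawn more recently.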
In some cases, you will want to create a quick plot with a minimum of typing. In these cases you may prefer to use qplot()
over ggplot()
. qplot()
lets you define a plot in a single call, picking a geom by default if you don’t supply one.
# Figure 22
qplot(displ, data = mpg)
# Figure 23
qplot(displ, hwy, data = mpg)
qplot()
tries to pick a sensible geometry and statistic based on the arguments provided. For example, if you give qplot()
x and y variables, it’ll create a scatterplot. If you just give it an x, it’ll create a histogram or bar chart depending on the type of variable.
qplot()
assumes that all variables should be scaled by default. If you want to set an aesthetic to a constant, you need to use I()
:
# Figure 24
qplot(displ, hwy, data = mpg, colour = "blue")
or
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = "blue"))
# Figure 25
qplot(displ, hwy, data = mpg, colour = I("blue"))
or
ggplot(mpg, aes(displ, hwy)) +
geom_point(colour = "blue")
This section describes in more detail the process of building up a more complicated plot. I hope it helps you to produce graphics using the same structured thinking that you use to design an analysis, reducing the distance between a plot in your head and one on the page.
geom_text()
is the main tool to add labels at the specified x and y positions. It has the most aesthetics of any geom, because there are so many ways to control the appearance of a text.
# Figure 26
df <- data.frame(x = 1, y = 3:1, z = c("sans", "serif", "mono"))
ggplot(df, aes(x, y)) +
geom_text(aes(label = z, family = z))
# Figure 27
df <- data.frame(x = 1, y = 3:1, z = c("plain", "bold", "italic"))
ggplot(df, aes(x, y)) +
geom_text(aes(label = z, fontface = z))
# Figure 28
df <- data.frame(x = c(1, 1, 2, 2, 1.5), y = c(1, 2, 1, 2, 1.5),
z = c("bottom-left", "top-left", "bottom-right", "top-right", "center"))
ggplot(df, aes(x, y)) +
geom_text(aes(label = z), vjust = "inward", hjust = "inward")
size controls the font size.
angle specifies the rotation of the text in degrees.
nudge_x and nudge_y parameters allow you to nudge the text a little horizontally or vertically.
# Figure 29
df <- data.frame(trt = c("a", "b", "c"), resp = c(1.2, 3.4, 2.5))
ggplot(df, aes(resp, trt)) +
geom_point() +
geom_text(aes(label = paste0("(", resp, ")")), nudge_y = -0.25) +
xlim(1, 3.6)
# Figure 30
ggplot(mpg, aes(displ, hwy)) +
geom_text(aes(label = model)) +
xlim(1, 8)
# Figure 31
ggplot(mpg, aes(displ, hwy)) +
geom_text(aes(label = model), check_overlap = TRUE) +
xlim(1, 8)
geom_label()
is a variation on geom_text()
: it draws a rounded rectangle behind the text.
# Figure 32
z <- data.frame(waiting = c(55, 80), eruptions = c(2, 4.3), peak = c("peak one", "peak two"))
ggplot(faithfuld, aes(waiting, eruptions)) +
geom_tile(aes(fill = density)) +
geom_label(data = z, aes(label = peak))
# Figure 33
ggplot(mpg, aes(displ, hwy, colour = drv)) +
geom_label(aes(label = class), show.legend = FALSE) +
geom_text(aes(x = 6.5, y = 42, label = "A SIMPLE EXAMPLE"),
vjust = "inward", hjust = "inward",
fontface = "italic",
family = "mono",
size = 5,
colour = "darkred")
Annotations add metadata to our plot. But metadata is just data, so we can use:
geom_rect()
to highlight interesting rectangular regions of the plot. geom_rect()
has aesthetics xmin
, xmax
, ymin
and ymax
.
head(presidential)
name | start | end | party |
---|---|---|---|
Eisenhower | 1953-01-20 | 1961-01-20 | Republican |
Kennedy | 1961-01-20 | 1963-11-22 | Democratic |
Johnson | 1963-11-22 | 1969-01-20 | Democratic |
Nixon | 1969-01-20 | 1974-08-09 | Republican |
Ford | 1974-08-09 | 1977-01-20 | Republican |
Carter | 1977-01-20 | 1981-01-20 | Democratic |
presidential <- subset(presidential, start > economics$date[1])
# Figure 34
p <- ggplot(economics) +
geom_rect(aes(xmin = start, xmax = end, fill = party),
ymin = -Inf, ymax = Inf, alpha = 0.2,
data = presidential)
print(p)
We annotate this plot with which president was in power at the time. There is one special thing to note: the use of -Inf
and Inf
as positions. These refer to the bottom and top (or left and right) limits of the plot.
geom_line()
, geom_path()
and geom_segment()
to add lines.
# Figure 35
p <- p + geom_line(aes(date, unemploy))
print(p)
geom_vline()
, geom_hline()
and geom_abline()
allow you to add reference lines (sometimes called rules) that span the full range of the plot.
# Figure 36
p <- p + geom_vline(aes(xintercept = as.numeric(start)),
data = presidential,
colour = "grey50",
alpha = 0.5)
print(p)
geom_text()
to add text descriptions or to label points. Most plots will not benefit from adding text to every single observation on the plot, but labelling outliers and other important points is very useful.
# Figure 37
p <- p + geom_text(aes(x = start, y = 2500, label = name),
data = presidential,
size = 4,
vjust = 0, hjust = 0,
nudge_x = 50)
print(p)
annotate()
to add a single annotation to a plot.
# Figure 38
yrng <- range(economics$unemploy)
xrng <- range(economics$date)
caption <- paste(strwrap("Unemployment rates in the US have varied a lot over the years", 40),
collapse = "\n")
ggplot(economics, aes(date, unemploy)) +
geom_line() +
annotate("text", x = xrng[1], y = yrng[2],
label = caption,
hjust = 0, vjust = 1,
size = 4)
geom_abline()
is useful when comparing groups across facets. In the following plot, it’s much easier to see the subtle differences if we add a reference line.
head(diamonds)
carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|
0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
0.24 | Very Good | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 |
# Figure 39
mod_coef <- coef(lm(log10(price) ~ log10(carat), data = diamonds))
ggplot(diamonds, aes(log10(carat), log10(price))) +
geom_bin2d() +
geom_abline(intercept = mod_coef[1], slope = mod_coef[2],
colour = "white", size = 1) +
facet_wrap(~cut, nrow = 1)
geom_bin2d()
divides the plane into rectangles, counts the number of cases in each rectangle, and then (by default) maps the number of cases to the rectangle’s fill.
Scales control the mapping from data to aesthetics. They take your data and turn it into something that you can see, like size, colour, position or shape. Scales also provide the tools that let you read the plot: the axes and legends. Formally, each scale is a function from a region in data space (the domain of the scale) to a region in aesthetic space (the range of the scale). The axis or legend is the inverse function: it allows you to convert visual properties back to data.
A scale is required for every aesthetic used on the plot. When you write:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
What actually happens is this:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
It would be tedious to manually add a scale every time you used a new aesthetic, so ggplot2 does it for you. But if you want to override the defaults, you’ll need to add the scale yourself, like this:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous("A really awesome x axis label") +
scale_y_continuous("An amazingly great y axis label")
When we +
a scale, we’re not actually adding it to the plot, but overriding the existing scale. This means that the following two specifications are equivalent:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous("Label 1") +
scale_x_continuous("Label 2")
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous("Label 2")
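One way to confirm that the second scale replaces the first is to build the plot and look at the x scale that is actually in use. This sketch relies on ggplot2's exported layer_scales() helper; the object names are illustrative.

```r
library(ggplot2)

p_twice <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous("Label 1") +
  scale_x_continuous("Label 2")   # replaces the scale added on the previous line

# layer_scales() returns the x and y scales used by the first layer
x_name <- layer_scales(p_twice)$x$name
x_name
```

ggplot2 also prints a message when the second scale is added, noting that a scale for x is already present and will be replaced.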
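You’ve probably already figured out the naming scheme for scales. Each scale function name is made up of three pieces separated by “_”: scale, the name of the aesthetic, and the name of the scale (for example, scale_colour_brewer()):
Scale | The name of the aesthetic | The name of the scale |
---|---|---|
scale | colour | continuous |
 | shape | discrete |
 | x | brewer |
 | y | |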
The component of a scale that you’re most likely to want to modify is the guide, the axis or legend associated with the scale. There are many natural correspondences between the two:
Axis | Legend | Argument name |
---|---|---|
Label | Title | name |
Ticks & grid line | Key | breaks |
Tick label | Key label | labels |
The first argument to the scale function, name, is the axes/legend title. You can supply text strings (using \n
for line breaks) or mathematical expressions in quote():
# Figure 40
df <- data.frame(x = 1:2, y = 1, z = "a")
ggplot(df, aes(x, y)) +
geom_point() +
scale_x_continuous(quote(a + mathematical ^ expression))
Because tweaking these labels is such a common task, there are three helpers that save you some typing: xlab()
, ylab()
and labs()
:
# Figure 41
ggplot(df, aes(x, y)) + geom_point(aes(colour = z)) +
labs(x = "X axis", y = "Y axis", colour = "Colour\nlegend")
There are two ways to remove the axis label.
labs(x = "", y = "")
: omits the label, but still allocates space;
labs(x = NULL, y = NULL)
: removes the label and its space.
The breaks argument controls which values appear as tick marks on axes and keys on legends. Each break has an associated label, controlled by the labels argument. If you set labels, you must also set breaks; otherwise, if the data changes, the breaks will no longer align with the labels.
# Figure 42
df <- data.frame(x = c(1, 3, 5) * 1000, y = 0.5)
ggplot(df, aes(x, y)) +
geom_point() +
labs(x = NULL, y = NULL) +
scale_x_continuous(breaks = c(2000, 4000), labels = c("2k", "4k")) +
scale_y_continuous(breaks = c(0.25, 0.75), labels = c("25%", "75%"))
The scales package provides a number of useful labelling functions:
scales::comma_format()
adds commas to make it easier to read large numbers.
scales::unit_format(unit, scale)
adds a unit suffix, optionally scaling.
scales::dollar_format(prefix, suffix)
displays currency values, rounding to two decimal places and adding a prefix
or suffix
.
scales::wrap_format()
wraps long labels into multiple lines.
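Each of these functions returns another function, which ggplot2 then applies to the break values. The sketch below calls two of the returned functions directly to show the idea; the input values and the wrap width of 10 are arbitrary.

```r
library(scales)

# comma_format() returns a function that formats a numeric vector
comma_labeller <- comma_format()
comma_labeller(c(1000, 1000000))

# wrap_format(10) returns a function that wraps text at about 10 characters
wrap_labeller <- wrap_format(10)
wrapped <- wrap_labeller("a fairly long axis label")
cat(wrapped, "\n")
```

Passing comma_labeller (or the call comma_format() itself) to the labels argument of a scale has the same effect on the axis or legend labels.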
# Figure 43
df <- data.frame(x = c(1, 3, 5) * 1000, y = 0.5)
ggplot(df, aes(x, y)) + geom_point() + labs(x = NULL, y = NULL) +
scale_y_continuous(labels = scales::dollar_format(prefix = "$"))
You can adjust the minor breaks (the faint grid lines that appear between the major grid lines) by supplying a numeric vector of positions to the minor_breaks
argument. This is particularly useful for log scales:
# Figure 44
df <- data.frame(x = c(2, 3, 5, 10, 200, 3000), y = 1)
mb <- as.numeric(1:10 %o% 10 ^ (0:4))
ggplot(df, aes(x, y)) +
geom_point() +
scale_x_log10(minor_breaks = mb)
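The expression used for mb above is an outer product, which may be easier to follow when inspected on its own. This base R sketch repeats the computation under different names so it can be run in isolation.

```r
# Outer product of 1..10 with 1, 10, 100, 1000, 10000: a 10 x 5 matrix
mb_matrix <- 1:10 %o% 10^(0:4)
dim(mb_matrix)

# as.numeric() flattens the matrix column by column into a plain vector
mb_check <- as.numeric(mb_matrix)
length(mb_check)
head(mb_check, 12)
range(mb_check)
```

Every decade therefore gets minor breaks at 1, 2, ..., 10 times its power of ten, which is the familiar look of log-scale graph paper.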
A legend may need to draw symbols from multiple layers. For example, if you’ve mapped colour to both points and lines, the keys will show both points and lines.
# Figure 45
df <- data.frame(x = 1:3, y = 1:3, z = c("a", "b", "c"))
ggplot(df, aes(x, y)) +
geom_point(size = 8, colour = "grey20", alpha = 0.5, show.legend = TRUE) +
geom_point(aes(colour = z, shape = z), size = 4) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
guides(fill = guide_legend(reverse=TRUE))
Sometimes you want the geoms in the legend to display differently from the geoms in the plot. This is particularly useful when you’ve used transparency or size to deal with moderate overplotting and also used colour in the plot. You can do this using the override.aes parameter of guide_legend()
.
A number of settings that affect the overall display of the legends are controlled through the theme system. You can modify theme settings with the theme()
function.
The position and justification of legends are controlled by the theme setting legend.position, which takes values “right”, “left”, “top”, “bottom”, or “none” (no legend).
# Figure 46
df <- data.frame(x = 1:3, y = 1:3, z = c("a", "b", "c"))
ggplot(df, aes(x, y)) +
geom_point(aes(colour = z), size = 3) +
xlab(NULL) +
ylab(NULL) +
theme(legend.position = "bottom")
Alternatively, if there’s a lot of blank space in your plot you might want to place the legend inside the plot. You can do this by setting legend.position
to a numeric vector of length two. The numbers represent a relative location in the panel area: c(0, 1)
is the top-left corner and c(1, 0)
is the bottom-right corner. You control which corner of the legend the legend.position
refers to with legend.justification
, which is specified in a similar way.
# Figure 47
ggplot(df, aes(x, y)) +
geom_point(aes(colour = z), size = 3) +
theme(legend.position = c(0.8, 0.2), legend.justification = c(0.2, 0.8),
legend.direction = "horizontal")
# Figure 48
df <- data.frame(x = rnorm(1000), y = rnorm(1000))
df$z <- cut(df$x, 4, labels = c("a", "b", "c", "d"))
ggplot(df, aes(x, y)) +
geom_point(aes(colour = z, shape = z), alpha = 0.7) +
guides(colour = guide_legend(ncol = 2, byrow = TRUE, override.aes = list(alpha = 1))) +
guides(shape = guide_legend(ncol = 2, byrow = TRUE, override.aes = list(alpha = 1))) +
scale_colour_discrete("colour") +
scale_shape_discrete("shape")
Date and date/time data are continuous variables with special labels. ggplot2 works with Date (for dates) and POSIXct (for date/times) classes: if your dates are in a different format you will need to convert them with as.Date()
or as.POSIXct()
. scale_x_date()
and scale_x_datetime()
work similarly to scale_x_continuous()
but have special date_breaks and date_labels arguments that work in date-friendly units:
The date_breaks and date_minor_breaks arguments allow you to position breaks by date units (years, months, weeks, days, hours, minutes, and seconds). For example, date_breaks = "2 weeks"
will place a major tick mark every two weeks.
The date_labels argument
controls the display of the labels using the same formatting strings as in strptime()
and format():
String | Meaning |
---|---|
%S | second (00-59) |
%M | minute (00-59) |
%l | hour, in 12-hour clock (1-12) |
%I | hour, in 12-hour clock (01-12) |
%p | am/pm |
%H | hour, in 24-hour clock (00-23) |
%a | day of week, abbreviated (Mon-Sun) |
%A | day of week, full (Monday-Sunday) |
%e | day of month (1-31) |
%d | day of month (01-31) |
%m | month, numeric (01-12) |
%b | month, abbreviated (Jan-Dec) |
%B | month, full (January-December) |
%y | year, without century (00-99) |
%Y | year, with century (0000-9999) |
For example, if you wanted to display dates like 14/10/1979, you would use the string “%d/%m/%Y”.
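Because these are the same strings that format() understands, a format string can be tried out in base R before it is passed to date_labels. A minimal sketch, using the date from the example above:

```r
d_example <- as.Date("1979-10-14")

format(d_example, "%d/%m/%Y")  # "14/10/1979"
format(d_example, "%Y-%m-%d")  # ISO-style year-month-day
format(d_example, "%y")        # two-digit year
```

Strings that produce names, such as %b or %A, depend on the current locale, so their output can differ between machines.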
head(economics)
date | pce | pop | psavert | uempmed | unemploy |
---|---|---|---|---|---|
1967-07-01 | 507.4 | 198712 | 12.5 | 4.5 | 2944 |
1967-08-01 | 510.5 | 198911 | 12.5 | 4.7 | 2945 |
1967-09-01 | 516.3 | 199113 | 11.7 | 4.6 | 2958 |
1967-10-01 | 512.9 | 199311 | 12.5 | 4.9 | 3143 |
1967-11-01 | 518.1 | 199498 | 12.5 | 4.7 | 3066 |
1967-12-01 | 525.8 | 199657 | 12.1 | 4.8 | 3018 |
base <- ggplot(economics, aes(date, psavert)) +
geom_line(na.rm = TRUE) +
labs(x = NULL, y = NULL)
# Figure 49
print(base)
# Figure 50
base + scale_x_date(date_labels = "%Y", date_breaks = "5 years")
# Figure 51
base + scale_x_date(limits = as.Date(c("2004-01-01", "2005-01-01")),
date_labels = "%b %y",
date_minor_breaks = "1 month")
# Figure 52
base + scale_x_date(limits = as.Date(c("2004-01-01", "2004-06-01")),
date_labels = "%m/%d",
date_minor_breaks = "2 weeks")
The ggplot2 theme system does not affect how the data is rendered by geoms, or how it is transformed by scales. Themes don’t change the perceptual properties of the plot, but they do help you make the plot aesthetically pleasing or match an existing style guide. Themes give you control over things like fonts, ticks, panel strips, and backgrounds.
The theming system is composed of four main components:
Theme elements specify the non-data elements that you can control. For example, the plot.title
element controls the appearance of the plot title;
Each element is associated with an element function, which describes the visual properties of the element. For example, element_text()
sets the font size, colour and face of text elements like plot.title
.
The theme()
function which allows you to override the default theme elements by calling element functions, like theme(plot.title = element_text(colour = "red"))
.
Complete themes, like theme_grey()
set all of the theme elements to values designed to work together harmoniously.
# Figure 53
base <- ggplot(mpg, aes(cty, hwy, colour = factor(cyl))) +
geom_jitter() +
geom_abline(colour = "black", size = 1, alpha = 0.8) +
labs(x = "City mileage/gallon",
y = "Highway mileage/gallon",
colour = "Cylinders",
title = "Highway and city mileage are highly correlated") +
scale_colour_brewer(type = "seq", palette = "Spectral")
print(base)
Next, you need to make sure the plot matches the style guidelines of your journal:
1. The background should be white, not pale grey.
2. The legend should be placed inside the plot if there’s room.
3. Major gridlines should be a pale grey and minor gridlines should be removed.
4. The plot title should be 12pt bold text and centered.
# Figure 54
style <- theme(plot.title = element_text(face = "bold", size = 12, hjust = 0.5),
legend.background = element_rect(fill = "white", size = 2, colour = "white"),
legend.justification = c(0.6, 0.1),
legend.position = c(0.1, 0.6),
axis.ticks = element_line(colour = "grey70", size = 0.2),
panel.grid.major = element_line(colour = "grey70", size = 0.2),
panel.grid.minor = element_blank())
base + theme_bw() + style
Besides the default theme_grey(), there are seven other themes built in to ggplot2 1.1.0:
theme_grey()
: a light grey background and white gridlines.
theme_bw()
: a white background and thin grey grid lines.
theme_linedraw()
: A theme with only black lines of various widths on white backgrounds, reminiscent of a line drawing
theme_light()
: similar to theme_linedraw() but with light grey lines and axes, to direct more attention towards the data.
theme_dark()
: the dark cousin of theme_light(), with similar line sizes but a dark background. Useful to make thin coloured lines pop out.
theme_minimal()
: A minimalistic theme with no background annotations.
theme_classic()
: A classic-looking theme, with x and y axis lines and no gridlines.
theme_void()
: A completely empty theme.
As well as applying themes a plot at a time, you can change the default theme with theme_set()
. For example, if you really hate the default grey background, run theme_set(theme_bw())
to use a white background for all plots.
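Since theme_set() changes a global default, it can be useful to keep the previous theme and restore it afterwards. A minimal sketch, with illustrative object names:

```r
library(ggplot2)

# theme_set() returns the previous default theme invisibly
old_theme <- theme_set(theme_bw())

p_default <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()
print(p_default)       # drawn with theme_bw(), the current default

theme_set(old_theme)   # restore whatever default was active before
```

The default theme is looked up when a plot is drawn, not when it is created, so printing p_default again after the restore would use the earlier default.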
To modify an individual theme component you use code like plot + theme(element.name = element_function())
. There are four basic types of built-in element functions: text, lines, rectangles, and blank. Each element function has a set of parameters that control the appearance:
element_text()
draws labels and headings. You can control the font family, face, colour, size (in points), hjust, vjust, angle (in degrees) and lineheight (as ratio of fontsize).
element_line()
draws lines parameterised by colour, size and linetype.
element_rect()
draws rectangles, mostly used for backgrounds, parameterised by fill colour and border colour, size and linetype.
element_blank()
draws nothing. Use this if you don’t want anything drawn, and no space allocated for that element.
# Figure 55
df <- data.frame(x = rnorm(300), y = rnorm(300))
df$z <- cut(df$x, 6, labels = c("a", "b", "c", "d", "e", "f"))
ggplot(df, aes(x, y)) +
geom_point(aes(colour = z, shape = z), size = 2) +
labs(title = "This is a ggplot") +
xlab(NULL) +
ylab(NULL) +
theme(plot.title = element_text(face = "bold", colour = "#91003F", size = 16, hjust = 0.5),
panel.grid.major = element_line(colour = "black", linetype = "dotted", size = 1),
plot.background = element_rect(fill = "grey80", colour = NA),
panel.background = element_rect(fill = "#EFEDF5"),
axis.line = element_line(colour = "black"))
Element | Setter | Description |
---|---|---|
legend.background | element_rect() | legend background |
legend.key | element_rect() | background of legend keys |
legend.key.size | unit() | legend key size |
legend.key.height | unit() | legend key height |
legend.key.width | unit() | legend key width |
legend.margin | unit() | legend margin |
legend.text | element_text() | legend labels |
legend.text.align | 0 to 1 | legend label alignment (0 = left, 1 = right) |
legend.title | element_text() | legend name |
legend.title.align | 0 to 1 | legend name alignment (0 = left, 1 = right) |
The legend elements control the appearance of all legends. We can restyle the legends of a plot by modifying these elements.
# Figure 56
df <- data.frame(x = rnorm(300), y = rnorm(300))
df$z <- cut(df$x, 6, labels = c("a", "b", "c", "d", "e", "f"))
ggplot(df, aes(x, y)) +
geom_point(aes(colour = z, shape = z), size = 2) +
theme_minimal() +
scale_colour_brewer(type = "qual", palette = "Dark2") +
theme(legend.key = element_rect(color = "grey50"),
legend.key.width = unit(0.9, "cm"),
legend.key.height = unit(0.75, "cm"),
legend.text = element_text(size = 15),
legend.title = element_text(size = 15, face = "bold"),
legend.title.align = 0.3)
Element | Setter | Description |
---|---|---|
panel.background | element_rect() | panel background (under data) |
panel.border | element_rect() | panel border (over data) |
panel.grid.major | element_line() | major grid lines |
panel.grid.major.x | element_line() | vertical major grid lines |
panel.grid.major.y | element_line() | horizontal major grid lines |
panel.grid.minor | element_line() | minor grid lines |
panel.grid.minor.x | element_line() | vertical minor grid lines |
panel.grid.minor.y | element_line() | horizontal minor grid lines |
aspect.ratio | numeric | plot aspect ratio |
Panel elements control the appearance of the plotting panels. Modify them to restyle the panel background, border and grid lines.
# Figure 57
df <- data.frame(x = rnorm(300), y = rnorm(300))
df$z <- cut(df$x, 6, labels = c("a", "b", "c", "d", "e", "f"))
ggplot(df, aes(x, y)) +
geom_point(aes(colour = z, shape = z), size = 2) +
theme_linedraw() +
scale_colour_brewer(type = "qual", palette = "Dark2") +
theme(panel.background = element_rect(fill = "#C7EAE5"),
panel.grid.major.x = element_line(color = "gray60", size = 0.8),
plot.background = element_rect(colour = "black", size = 2),
aspect.ratio = 12 / 16)
Element | Setter | Description |
---|---|---|
strip.background | element_rect() | background of panel strips |
strip.text | element_text() | strip text |
strip.text.x | element_text() | horizontal strip text |
strip.text.y | element_text() | vertical strip text |
panel.spacing | unit() | spacing between facets (called panel.margin in older ggplot2 versions) |
panel.spacing.x | unit() | horizontal spacing between facets |
panel.spacing.y | unit() | vertical spacing between facets |
The above theme elements are associated with faceted ggplots.
# Figure 58
df <- data.frame(x = rnorm(300), y = rnorm(300))
df$z <- cut(df$x, 6, labels = c("a", "b", "c", "d", "e", "f"))
ggplot(df, aes(x, y)) +
geom_point(aes(colour = z, shape = z), alpha = 0.7) +
theme_linedraw() +
scale_colour_brewer(type = "qual", palette = "Dark2") +
facet_wrap(~z) +
theme(panel.spacing = unit(0.5, "in"),
strip.background = element_rect(fill = "grey20", color = "grey80", size = 1),
strip.text = element_text(colour = "white"))
In R, a colour is represented as a string. As in HTML/CSS, a colour is defined by the hexadecimal values (00 to FF) for red, green and blue, concatenated into a string and prefixed with “#”. Pure red is therefore written as “#FF0000”.
Besides “#RRGGBB” strings, you can also use one of R’s predefined colour names (listed by colors()) or a palette from the RColorBrewer package.
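The correspondence between hexadecimal strings, RGB components and colour names can be checked in base R. This is a small sketch using rgb() and col2rgb() from grDevices; the variable name red_hex is only illustrative.

```r
# Build a hex string from red, green and blue components (0 to 255)
red_hex <- rgb(255, 0, 0, maxColorValue = 255)
red_hex                     # "#FF0000"

# Convert a hex string or a colour name back to its RGB components
col2rgb("#FF0000")
col2rgb("red")

# The named colour and the hex string describe the same colour
all(col2rgb("red") == col2rgb("#FF0000"))
```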
First, let’s see some key ggplot2 functions for changing plot colours.
scale_fill_manual()
for box plot, bar plot, violin plot, dot plot, etc
scale_color_manual()
or scale_colour_manual()
for lines and points
scale_fill_brewer()
for box plot, bar plot, violin plot, dot plot, etc
scale_color_brewer()
or scale_colour_brewer()
for lines and points
scale_fill_grey()
for box plot, bar plot, violin plot, dot plot, etc
scale_color_grey()
or scale_colour_grey()
for points, lines, etc
scale_color_gradient()
, scale_fill_gradient()
for sequential gradients between two colors
scale_color_gradient2()
, scale_fill_gradient2()
for diverging gradients
scale_color_gradientn()
, scale_fill_gradientn()
for gradient between n colors
This is the default color:
# Figure 59
ggplot(diamonds, aes(carat, price, colour = cut)) +
geom_point(size = 0.5)
Set custom color palettes:
# Figure 60
ggplot(diamonds, aes(carat, price, colour = cut)) +
geom_point(size = 0.5) +
scale_colour_manual(values = c("#6794a7", "#014d64", "#7ad2f6", "#01a2d9", "#76c0c1"))
Using palette in RColorBrewer:
# Figure 61
library(RColorBrewer)
ggplot(diamonds, aes(carat, price, colour = cut)) +
geom_point(size = 0.5) +
scale_colour_brewer(palette = "Greens")
Using grey color scales:
# Figure 62
ggplot(diamonds, aes(carat, price, colour = cut)) +
geom_point(size = 0.5) +
scale_colour_grey()
When colour is mapped to a continuous variable:
Sequential gradients between two colors
# Figure 63
ggplot(diamonds, aes(carat, price, colour = depth)) +
geom_point(size = 0.5)+
scale_colour_gradient(low = "orange", high = "Darkred")
Diverging gradients
# Figure 64
ggplot(diamonds, aes(carat, price, colour = depth)) +
geom_point(size = 0.5) +
scale_colour_gradient2(low = "#8E0F2E", mid = "#BFBEBE", high = "#0E4E75",
midpoint = median(diamonds$depth)) # centre the scale on the data; the default midpoint is 0
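The list of scale functions also mentions scale_colour_gradientn(), which interpolates between any number of colours. A minimal sketch, assuming ggplot2 is loaded; the three colours are simply reused from the diverging example and the name p is illustrative.

```r
library(ggplot2)

# Gradient through three colours, spread evenly over the range of depth
p <- ggplot(diamonds, aes(carat, price, colour = depth)) +
  geom_point(size = 0.5) +
  scale_colour_gradientn(colours = c("#0E4E75", "#BFBEBE", "#8E0F2E"))
p
```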
Second, we’ll look at some predefined colour palettes. The most commonly used colour scales are the ColorBrewer palettes [RColorBrewer package] and the grey palettes [ggplot2 package].
display.brewer.all(colorblindFriendly = TRUE)
will display only colorblind friendly palettes.
# Figure 65
display.brewer.pal(11, "Spectral")
brewer.pal(11, "Spectral")
## [1] "#9E0142" "#D53E4F" "#F46D43" "#FDAE61" "#FEE08B" "#FFFFBF" "#E6F598"
## [8] "#ABDDA4" "#66C2A5" "#3288BD" "#5E4FA2"
Base R also provides five predefined palette functions (rainbow(), heat.colors(), terrain.colors(), topo.colors() and cm.colors()) and 657 built-in colour names.
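As a quick check, the base palette functions and the number of built-in colour names can be inspected directly. This sketch uses only grDevices; the list name palettes is illustrative.

```r
# Five colours from each of the classic base R palette functions
n <- 5
palettes <- list(
  rainbow = rainbow(n),
  heat    = heat.colors(n),
  terrain = terrain.colors(n),
  topo    = topo.colors(n),
  cm      = cm.colors(n)
)
palettes$rainbow

# Number of colour names R knows about
length(colors())
```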
show_col()
in scales package will give you a quick and dirty way to show colours in a plot.
# Figure 66
library(scales)
show_col(rainbow(16), labels = TRUE)
# Figure 67
ggplot(diamonds, aes(carat, price, colour = cut)) +
geom_point(size = 0.5) +
scale_colour_manual(values = rainbow(25))
Because cut has 5 levels, in values = rainbow(n)
, n must be greater than or equal to 5. The scale uses only the first five colours, so changing n changes how far apart the hues are.
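To avoid relying on the order of the values, the colours can be named after the factor levels. A minimal sketch, assuming ggplot2 is loaded; cut_cols and p are illustrative names.

```r
library(ggplot2)

# One colour per level of cut, named so each level gets a fixed colour
cut_cols <- setNames(rainbow(nlevels(diamonds$cut)), levels(diamonds$cut))
cut_cols

p <- ggplot(diamonds, aes(carat, price, colour = cut)) +
  geom_point(size = 0.5) +
  scale_colour_manual(values = cut_cols)
p
```

With a named vector, the mapping from level to colour stays the same even if the levels are reordered.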
The function colors()
returns the color names, which R knows about.
# Figure 68
r_color <- colors()
head(r_color)
## [1] "white" "aliceblue" "antiquewhite" "antiquewhite1"
## [5] "antiquewhite2" "antiquewhite3"
# Figure 69
show_col(r_color, labels = FALSE, border = "white")
Finally, the ggthemes package provides a large number of high-quality themes and color schemes. The most common schemes are economist, wsj, stata, excel, tableau and solarized.
library(ggthemes)
m<-excel_pal()(6)
show_col(m)
# Figure 70
ggplot(diamonds, aes(carat, price, colour = cut)) +
geom_point(size = 0.5) +
scale_colour_excel()
The R package ggsci also contains a collection of high-quality color palettes inspired by colors used in scientific journals, data visualization libraries, and more. The color palettes are provided as ggplot2 scale functions:
scale_color_npg()
and scale_fill_npg()
: Nature Publishing Group color palettes
scale_color_aaas()
and scale_fill_aaas()
: American Association for the Advancement of Science color palettes
scale_color_lancet()
and scale_fill_lancet()
: Lancet journal color palettes
scale_color_jco()
and scale_fill_jco()
: Journal of Clinical Oncology color palettes
scale_color_tron()
and scale_fill_tron()
: This palette is inspired by the colors used in Tron Legacy. It is suitable for displaying data when using a dark theme.
You can find more examples in the ggsci package vignettes.
The following plot shows the change in the unemployment rate over fifty years. We mark the start of each president’s term (which is also the end of the previous one) with a vertical line. Moreover, the highest rate is labeled in this plot.
# Figure 71
rate <- economics$unemploy / economics$pop
ggplot(economics) +
geom_line(aes(date, unemploy/pop)) +
labs(title = "The change in the unemployment rate over the last five decades") +
theme_linedraw() +
theme(plot.title = element_text(face = "bold", size = 15, hjust = 0.5),
panel.grid.major = element_line(colour = "grey90", size = 0.3),
panel.background = element_rect(fill = "white"),
plot.background = element_rect(fill = "#FEE0B6", colour = "black"),
axis.text.x = element_text(angle = -30, vjust = 0.5)) +
geom_vline(xintercept = presidential$start, colour = "darkred", alpha = 0.5) +
# annotate() draws the label once; bare column names are not visible outside aes()
annotate("text", x = economics$date[which.max(rate)], y = max(rate),
label = paste0("(", round(max(rate), 4), ")"),
vjust = 0.8, hjust = -0.1)
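The presidential data set also records the end date and party of each term, so whole terms can be shaded instead of marking only their boundaries. This is a sketch under the assumption that ggplot2 is loaded; the name terms and the fill colours are illustrative choices.

```r
library(ggplot2)

# Keep only the terms that start within the range of the economics data
terms <- subset(presidential, start > economics$date[1])

p <- ggplot(economics) +
  # One translucent rectangle per term, spanning the full height of the panel
  geom_rect(data = terms,
            aes(xmin = start, xmax = end, fill = party),
            ymin = -Inf, ymax = Inf, alpha = 0.2) +
  geom_line(aes(date, unemploy / pop)) +
  scale_fill_manual(values = c(Republican = "red", Democratic = "blue"))
p
```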
The Line_Data.csv file contains the stock prices of Apple (AAPL) and Amazon (AMZN) over the past 10 years. This plot briefly shows how to handle date data and set appropriate axes.
library(reshape2)
LineData <- read.csv("Line_Data.csv", stringsAsFactors = FALSE)
head(LineData)
date | AMZN | AAPL |
---|---|---|
2000/1/1 | 69 | 25.94 |
2000/2/1 | 67 | 28.66 |
2000/3/1 | 55 | 33.95 |
2000/4/1 | 48 | 31.01 |
2000/5/1 | 36 | 21.00 |
2000/6/1 | 30 | 26.19 |
# Figure 72
LineData$date <- as.Date(LineData$date)
LineData <- melt(LineData, id = "date")
ggplot(LineData, aes(x = date, y = value, group = variable)) +
geom_area(aes(fill = variable), alpha = 0.5, position = "identity") +
geom_line(aes(color = variable), size = 0.75) +
scale_x_date(date_labels = "%Y", date_breaks = "2 year") +
xlab("Year") +
ylab("Value") +
labs(title = "The Stock Price Trend of AAPL and AMZN") +
theme_linedraw() +
theme( plot.title = element_text(size = 15, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10, face = "plain", color = "black"),
axis.text = element_text(size = 10, face = "plain", color = "black"),
legend.position = c(0.15,0.8),
legend.background = element_blank()) +
scale_colour_brewer(type = "qual", palette = "Set1")
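The two preparation steps, parsing the dates and reshaping from wide to long, can be tried without the CSV file. The sketch below types in the first three rows of the table above by hand; the name prices is illustrative and reshape2 is assumed to be installed.

```r
library(reshape2)

# First rows of the table above, entered manually for illustration
prices <- data.frame(
  date = c("2000/1/1", "2000/2/1", "2000/3/1"),
  AMZN = c(69, 67, 55),
  AAPL = c(25.94, 28.66, 33.95),
  stringsAsFactors = FALSE
)

# An explicit format makes the parsing rule visible
prices$date <- as.Date(prices$date, format = "%Y/%m/%d")

# Wide to long: one row per date and ticker, in columns variable and value
long <- melt(prices, id.vars = "date")
str(long)
```

The long layout is what lets a single aes(color = variable) mapping draw one line per stock.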
This example describes how to create a scatter plot and add regression lines using geom_point()
and geom_smooth()
functions. Note that we remove confidence intervals and extend the regression lines. We also use geom_rug()
to add marginal rugs.
# Figure 73
mtcars$cyl <- as.factor(mtcars$cyl)
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl, shape = cyl)) +
geom_point(size = 2) +
geom_smooth(method = lm, se = FALSE, fullrange = TRUE) +
scale_shape_manual(values = c(3, 16, 17)) +
scale_color_manual(values = c('#999999', '#E69F00', '#56B4E9')) +
labs(title = "An example of adding regression lines") +
theme_light() +
theme(plot.title = element_text(face = "bold", size = 15, hjust = 0.6),
axis.title = element_text(face = "bold", size = 12),
axis.text = element_text(size = 10, face= "bold"),
legend.text = element_text(face = "bold", size = 8),
legend.title = element_text(face = "bold"),
legend.title.align = 0.4) +
geom_rug()
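To see the numbers behind the three fitted lines, the same per-group linear models can be fitted directly with lm(). This is a sketch of the idea rather than the code geom_smooth() runs internally; fits and coefs are illustrative names.

```r
# Fit mpg ~ wt separately for each number of cylinders
fits <- lapply(split(mtcars, mtcars$cyl),
               function(d) coef(lm(mpg ~ wt, data = d)))

# One row per group: intercept and slope of the regression line
coefs <- do.call(rbind, fits)
coefs
```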
The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. Here we compare city fuel economy across car manufacturers. Outliers are highlighted in red and the individual observations are overlaid in black to make the chart clearer.
# Figure 74
ggplot(mpg, aes(manufacturer, cty, fill = manufacturer)) +
geom_boxplot(outlier.colour = "red",
outlier.shape = 16,
outlier.size = 2,
notch = FALSE) +
xlab(NULL) +
labs(title = "City miles per gallon for different brands of cars") +
guides(fill = guide_legend(ncol = 2, byrow = TRUE, override.aes = list(alpha = 1))) +
geom_jitter(shape = 16, position = position_jitter(0.2), size=1) +
theme_bw() +
theme(plot.title = element_text(face = "bold", colour = "black",
hjust = 0, vjust = 1,
size = 15),
axis.text.x = element_text(angle = 90, vjust = 0.2),
axis.text = element_text(size = 10, face= "bold"),
axis.title.y= element_text(face= "bold", size = 12),
legend.text = element_text(face = "bold", size = 8),
legend.title = element_text(face = "bold"),
legend.title.align = 0.4,
legend.key.width = unit(0.5, "cm"),
plot.background = element_rect(fill = "#ECE2F0", colour = "darkred"))
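The red points are the observations that fall more than 1.5 times the interquartile range beyond the quartiles, which is the default rule for box plot outliers. The helper below is a hypothetical sketch of that rule in base R, applied to a made-up vector; box_stats is not a ggplot2 function.

```r
# Quartiles, fences and outliers under the 1.5 * IQR rule
box_stats <- function(y) {
  q <- quantile(y, c(0.25, 0.5, 0.75), names = FALSE)
  iqr <- q[3] - q[1]
  fences <- c(q[1] - 1.5 * iqr, q[3] + 1.5 * iqr)
  list(quartiles = q,
       fences = fences,
       outliers = y[y < fences[1] | y > fences[2]])
}

# Toy data: 100 lies far above the upper fence
res <- box_stats(c(1, 2, 3, 4, 100))
res
```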
In the economics data, we can consider the relationship between the unemployment rate and the median duration of unemployment from the 1960s to the 2010s. As we can see, there may be a linear relationship in each panel. However, the impact of duration on the unemployment rate may differ between decades.
For faceting, we need to derive the decade from the date (“year-month-day”).
# Figure 75
economics$year <- paste0(substr(economics$date,1,3),"0s")
ggplot(economics, aes(uempmed, unemploy / pop, color = year)) +
geom_point() +
facet_wrap(~year) +
xlab("median duration of unemployment in weeks") +
ylab("unemployment rate") +
labs(title = "Analysis of unemployment data in each decade") +
theme_bw() +
scale_colour_brewer(type = "qual", palette = "Set1") +
theme(plot.title = element_text(face = "bold", colour = "black",
hjust = 0.5, vjust = 1, size = 15),
panel.spacing = unit(0.2, "in"),
strip.background = element_rect(fill = "#D94801", color = "grey", size = 1),
strip.text = element_text(colour = "white"),
legend.position = "right",
panel.background = element_rect(fill = "white"),
axis.title = element_text(face = "bold", size = 12),
legend.title.align = 0.7)
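The decade label above is built by keeping the first three characters of the date and appending "0s". An equivalent numeric approach is sketched below in base R on two example dates; the names d, yr and decade are illustrative.

```r
d <- as.Date(c("1967-07-01", "2015-04-01"))

# Extract the year, round it down to the decade and add the "s" suffix
yr <- as.integer(format(d, "%Y"))
decade <- paste0(yr %/% 10 * 10, "s")
decade

# Same result as the substr() approach used for the facets
identical(decade, paste0(substr(d, 1, 3), "0s"))
```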
The following bar chart shows the diamond quality for each color. The labels in the plot give the percentage of ideal diamonds in each group. The code requires library(dplyr).
# Figure 76
library(dplyr)
idealnumb <- diamonds %>% group_by(color) %>% filter(cut == "Ideal") %>% dplyr::summarise(n())
groupnumb <- diamonds %>% group_by(color) %>% dplyr::summarise(n())
idealrate <- idealnumb$`n()`/groupnumb$`n()`
ggplot(diamonds, aes(color, fill = cut)) +
geom_bar(position = position_dodge(), stat = "count") +
ylim(0,5500) +
coord_flip() +
labs(title = "The diamond quality of different colors") +
guides(fill = guide_legend(reverse = TRUE)) +
theme_classic() +
theme(plot.title = element_text(face = "bold", size = 15, hjust = 0.9),
legend.position = "right",
aspect.ratio = 10 / 15) +
scale_fill_brewer(palette = "Spectral") +
annotate("text", x = c("D", "E", "F", "G", "H", "I", "J"), y = idealnumb$`n()`,
label = paste0(round(idealrate,3)*100, "%"),
hjust = -0.2,
vjust = -1.7,
color = "black",
size = 3)
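The share of ideal diamonds per color can also be computed from a contingency table, which makes a handy cross-check for the labels. A sketch in base R, assuming ggplot2 is loaded for the diamonds data; tab and ideal_share are illustrative names.

```r
library(ggplot2)

# Counts of each cut within each color
tab <- table(diamonds$color, diamonds$cut)

# Row proportions: share of each cut within a color, then keep "Ideal"
ideal_share <- prop.table(tab, margin = 1)[, "Ideal"]
round(ideal_share * 100, 1)
```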
There is no specific function for pie charts in ggplot2. A pie chart is a bar chart drawn in polar coordinates.
# Figure 77
ggplot(diamonds, aes(color)) +
geom_bar(aes(fill = cut), position=position_dodge()) +
coord_polar(theta = "x") +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
labs(title = "The diamond quality of different colors") +
theme(plot.title = element_text(face = "bold", size = 15, hjust = 1))
# Figure 78
ggplot(diamonds, aes(color)) +
geom_bar(aes(fill = cut), position = position_dodge()) +
coord_polar(theta = "y") +
scale_fill_brewer(palette = "Blues") +
theme_minimal()+
labs(title = "The diamond quality of different colors") +
theme(plot.title = element_text(face = "bold", size = 15, hjust = 1))
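The two figures above use dodged bars, so they produce rose-like and ring-like charts. For a conventional pie chart, one common approach is a single stacked bar mapped to the angle. A minimal sketch, assuming ggplot2 is loaded; p is an illustrative name.

```r
library(ggplot2)

# One stacked bar (x is a constant) whose segments become the pie slices
p <- ggplot(diamonds, aes(x = "", fill = cut)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  scale_fill_brewer(palette = "Set1") +
  theme_void()
p
```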
Histogram and density can be shown in the same graphic. The following code uses ddply() from the plyr package.
# prepare data
library(plyr)
set.seed(1000)
df <- data.frame(sex = factor(rep(c("F", "M"), each = 200)),
weight = round(c(rnorm(200, mean = 55, sd = 5), rnorm(200, mean = 65, sd = 5))))
head(df)
sex | weight |
---|---|
F | 53 |
F | 49 |
F | 55 |
F | 58 |
F | 51 |
F | 53 |
mu <- ddply(df, "sex", summarise, grp.mean = mean(weight))
head(mu)
sex | grp.mean |
---|---|
F | 55.285 |
M | 64.885 |
# Figure 79
ggplot(df, aes(x = weight, color = sex)) +
geom_histogram(aes(y = after_stat(density)), position = "identity", size = 1, fill = "white") +
geom_vline(data = mu, aes(xintercept = grp.mean, color = sex),
linetype = "dashed", size = 0.8) +
theme_classic() +
scale_color_manual(values = c("#999999", "#E69F00", "#56B4E9")) +
scale_fill_manual(values = c("#999999", "#E69F00", "#56B4E9")) +
labs(title = "The histogram of female and male's weight") +
theme(plot.title = element_text(face = "bold", size = 15, hjust = 0.6),
axis.title = element_text(face = "bold", size = 12),
axis.text = element_text(size = 10, face= "bold"),
legend.text = element_text(face = "bold", size = 8),
legend.title = element_text(face = "bold"),
legend.title.align = 0.4) +
geom_density(data = df[1:200,], alpha = 0.2, fill="#FC9272", color = "#EF3B2C") +
geom_density(data = df[201:400,], alpha = 0.2, fill="#9ECAE1", color = "#4292C6")
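The group means drawn as dashed lines can also be computed in base R, which avoids loading plyr just for this step. The sketch below rebuilds the same simulated data and uses aggregate(); it is an alternative to the ddply() call, not the code used for the figure.

```r
set.seed(1000)
df <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = round(c(rnorm(200, mean = 55, sd = 5),
                   rnorm(200, mean = 65, sd = 5)))
)

# Mean weight per sex, with the column name expected by the plotting code
mu <- aggregate(weight ~ sex, data = df, FUN = mean)
names(mu)[2] <- "grp.mean"
mu
```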
This example describes how to create a QQ plot (or quantile-quantile plot) using R and the ggplot2 package. QQ plots are used to check whether given data follow a normal distribution.
# Figure 80
mtcars$cyl <- as.factor(mtcars$cyl)
ggplot(mtcars, aes(sample = mpg, color = cyl, size = cyl)) +
stat_qq() +
labs(title = "Normal QQ plot of miles per gallon \n by number of cylinders",
y = "Miles/(US) gallon") +
scale_color_manual(values = rainbow(15)) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 15, hjust = 0.6),
axis.title = element_text(face = "bold", size = 12),
axis.text = element_text(size = 10, face= "bold"),
legend.text = element_text(face = "bold", size = 8),
legend.title = element_text(face = "bold"),
legend.title.align = 0.4)
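The points of a normal QQ plot pair the sorted sample with the corresponding quantiles of a standard normal distribution. The sketch below computes those pairs by hand for mpg; it illustrates the idea and is not a copy of the stat_qq() implementation. The name qq is illustrative.

```r
x <- mtcars$mpg
n <- length(x)

# Theoretical normal quantiles at evenly spread probabilities vs. sorted data
qq <- data.frame(theoretical = qnorm(ppoints(n)),
                 sample = sort(x))
head(qq)

# If the data are close to normal, the two columns are nearly linearly related
cor(qq$theoretical, qq$sample)
```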
ggplot2 cannot draw true 3d surfaces, but you can use geom_contour()
and geom_tile()
to visualise 3d surfaces in 2d.
# Figure 81
ggplot(faithfuld, aes(waiting, eruptions, z = density)) +
geom_contour(aes(colour = after_stat(level))) +
theme_bw() +
labs(title = "The 2d contours of the faithful data") +
theme(plot.title = element_text(face = "bold", size = 15, hjust = 0.6),
axis.title = element_text(face = "bold", size = 12),
axis.text = element_text(size = 10, face= "bold"),
legend.text = element_text(face = "bold", size = 8),
legend.title = element_text(face = "bold"),
legend.title.align = 0.4)
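The same surface can be shown with geom_tile(), where the density is mapped to the fill colour of each grid cell instead of contour lines. A minimal sketch, assuming ggplot2 is loaded; the fill colours and the name p are illustrative.

```r
library(ggplot2)

# faithfuld is already a regular grid, so each row becomes one tile
p <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkred") +
  theme_bw()
p
```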
This example describes how to compute and visualize a correlation matrix using R software and ggplot2
package. Take the mtcars data as an example.
# Figure 82
mydata <- mtcars[, c(1,3,4,5,6,7)]
cormat <- round(cor(mydata), 2)
reorder_cormat <- function(cormat){
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <-cormat[hc$order, hc$order]}
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)}
# Reorder the correlation matrix
cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)
# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Create a heatmap
ggplot(melted_cormat, aes(Var2, Var1, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limits = c(-1,1), space = "Lab", name = "Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1),
axis.text.y = element_text(vjust = 1, size = 12, hjust = 1)) +
coord_fixed() +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal") +
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
We can draw a heart shape by using ‘coord_polar’ as follows.
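a <- 2
theta <- seq(0, pi, by = 0.01)
rho <- a * (1 - sin(theta))
heart <- data.frame(theta, rho)
# set the colour outside aes() so the line is drawn in red and no legend appears
pie <- ggplot(heart, aes(x = theta, y = rho)) + geom_line(colour = "red")
pie + coord_polar(theta = "x") + ggtitle("My heart")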
# ggplot(heart,aes(theta,rho,fill="blue"))+geom_rect(aes(xmin=min(theta),xmax=max(theta),ymin=min(rho),ymax=max(rho)))
# pie + coord_polar(theta="x") + ggtitle("My heart")
library(ggvis)
mtcars %>%
ggvis(~wt, ~mpg) %>%
layer_smooths(span = input_slider(0.5, 1, value = 1, step=0.1)) %>%
layer_points(size := input_slider(100, 1000, value = 100, ticks = FALSE,
pre="pre_", post="_post"))
library(animation)
library(plyr)
oopt = ani.options(interval = 0.3, nmax = 101)
a <- sort(rnorm(100, 2))
b <- sort(rnorm(100, 7))
out <- vector("list", 101)
for (i in 1:ani.options("nmax")) {
ji <- seq(from = 0, to = 5, by = .05)
a <- jitter(a, factor = 1, amount = ji[i])
fab1 <- lm(a ~ b)
coe <- summary(fab1)$coefficients
r2 <- summary(fab1)$r.squared
p_val <- coe[2, 4]
# label the p-value; the final branch also covers values between .001 and .01
p <- if (p_val < .0001) " < .0001" else if (p_val < .001) " < .001" else round(p_val, 3)
plot(a ~ b, main = "Linear model")
abline(fab1, col = "red", lwd = 2)
text(x = min(b) + 2, y = max(a) - 1,
labels = paste("t = ", round(coe[2, 3], 3), ", p = ", p, ", R2 = ", round(r2, 3)))
out[[i]] <- c(coe[2, 3], coe[2, 4], r2)
ani.pause()
}
ani.options(oopt)
# library(rgl)
# library(scatterplot3d)
x1=seq(-3,3,by = 0.1)
a1=1
a2=1
x2=sqrt((9-a1*x1^2)/a2)
x3=seq(-4,4,by = 0.1)
x4=sqrt((16-a1*x3^2)/a2)
plot(x3,x4)
points(x1,x2)
xy=rbind(cbind(x1,x2),cbind(x1,-x2),cbind(x3,x4),cbind(x3,-x4))
plot(xy[c(123:284),1],xy[c(123:284),2],col=2,pch = 16)
points(xy[c(1:122),1],xy[c(1:122),2],col=3,pch = 16)
z1=xy[,1]^2
z2=xy[,2]^2
z3=sqrt(2)*xy[,1]*xy[,2]
library(scatterplot3d)
scatterplot3d(z1,z2,z3,pch = 3)
library(rgl)
open3d()
plot3d(z1[c(1:122)], z2[c(1:122)], z3[c(1:122)],col = 3,size = 6)
plot3d(z1[c(123:284)], z2[c(123:284)], z3[c(123:284)],col = 2,size = 6,add = TRUE)
######
# install.packages("caTools") # install external package
library(caTools) # external package providing write.gif function
jet.colors <- colorRampPalette(c("red", "blue", "#007FFF", "cyan", "#7FFF7F",
"yellow", "#FF7F00", "red", "#7F0000"))
dx <- 1500 # define width
dy <- 1400 # define height
C <- complex(real = rep(seq(-2.2, 1.0, length.out = dx), each = dy),
imag = rep(seq(-1.2, 1.2, length.out = dy), dx))
C <- matrix(C, dy, dx) # reshape as a dy x dx matrix of complex numbers
Z <- 0 # initialize Z to zero
X <- array(0, c(dy, dx, 20)) # initialize output 3D array
for (k in 1:20) { # loop with 20 iterations
Z <- Z^2 + C # the central difference equation
X[, , k] <- exp(-abs(Z)) # capture results
}
write.gif(X, "Mandelbrot.gif", col = jet.colors, delay = 100)
tips:
Wickham, H. (2015). ggplot2–Elegant Graphics for Data Analysis, Springer.
Murrell, P. (2012). R Graphics (Second Edition), CRC Press. Taylor & Francis Group.
ggplot2 : Quick correlation matrix heatmap - R software and data visualization