  • 1 Lattice
    • 1.1 An Overview of the Lattice Package
      • 1.1.1 How Lattice Works
      • 1.1.2 A Simple Example
      • 1.1.3 Using Lattice Functions
      • 1.1.4 Custom Panel Functions
    • 1.2 High-Level Lattice Plotting Functions
      • 1.2.1 Univariate Trellis Plots
      • 1.2.2 Bivariate Trellis Plots
      • 1.2.3 Trivariate Trellis Plots
      • 1.2.4 Other Plots
      • 1.2.5 Customizing Lattice Graphics
      • 1.2.6 Low-Level Functions
  • 2 ggplot2
    • 2.1 An Overview of ggplot2 Package
    • 2.2 Getting Started with ggplot2
      • 2.2.1 Key Components
      • 2.2.2 Aesthetic Attributes
      • 2.2.3 Facetting
      • 2.2.4 Plot Geoms
      • 2.2.5 Modifying the Axes
      • 2.2.6 Output
      • 2.2.7 Quick Plots
    • 2.3 Advanced Learning
      • 2.3.1 Labels
      • 2.3.2 Annotations
      • 2.3.3 Scales
      • 2.3.4 Axes
      • 2.3.5 Legends
      • 2.3.6 Date
      • 2.3.7 Themes
      • 2.3.8 Colour
    • 2.4 Plot Examples
      • 2.4.1 Time Series 1
      • 2.4.2 Time Series 2
      • 2.4.3 Scatter plot
      • 2.4.4 Box Plot
      • 2.4.5 Faceted Plot
      • 2.4.6 Bar Chart
      • 2.4.7 Pie Chart
      • 2.4.8 Histogram plot
      • 2.4.9 QQ plot
      • 2.4.10 2D plot
      • 2.4.11 Heatmap
      • 2.4.12 Polar
  • 3 Other plots
    • 3.1 ggvis plot
    • 3.2 animation plot
    • 3.3 3D plot (install XQuartz on MacOS)
    • 3.4 More amazing plots:
  • 4 Exercises
  • 5 Reference

1 Lattice

1.1 An Overview of the Lattice Package

  • In the early 1990s, Richard Becker and William Cleveland (two researchers at Bell Labs) built a revolutionary new system for displaying data called Trellis graphics. The lattice package is an implementation of Trellis graphics for R.

  • The lattice package provides a different way to plot graphics in R. Lattice graphics are created with different functions, and have different options. These functions make it easy to do some things that are hard to do with standard graphics, such as plotting multiple plots on the same page or superimposing plots. Additionally, most lattice functions can produce clean, readable output by default.

  • The real strength of the lattice package is in splitting a chart into different panels (shown in a grid), or groups (shown with different colors or symbols) using a conditioning or grouping variable.

1.1.1 How Lattice Works

  • Lattice graphics consist of one or more rectangular drawing areas called panels. The data assigned to each panel is referred to as a packet. Lattice functions work by calling one or more panel functions, which actually plot the packets within panels. Here is what typically happens in a lattice session:
    1. The end user calls a high-level lattice plotting function.
    2. The lattice function examines the calling arguments and default parameters, assembles a lattice object, and returns the object.
    3. The user calls print() or plot() on the lattice object, which dispatches to print.trellis() or plot.trellis() because lattice objects have class "trellis". (This typically happens automatically on the R console.)
    4. The function plot.trellis() sets up the matrix of panels, assigns packets to different panels, and then calls the panel function specified in the lattice object to draw the individual panels.
  • Lattice graphics are extremely modular. They share many high-level functions (like plot.trellis) and low-level functions (like panel.axis, which draws axes). This means that they share many common arguments. It also means that you can customize the appearance of lattice graphics by creating substitute components.
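The steps above can be sketched in a few lines. This is a minimal illustration with a made-up data frame (demo is a hypothetical name, not part of the tutorial's data); it relies only on the fact that high-level lattice functions return an object of class "trellis".

```r
library(lattice)

# Hypothetical toy data, used only for illustration
demo <- data.frame(x = 1:10, y = (1:10)^2)

# Steps 1-2: the high-level call assembles and returns an object; nothing is drawn yet
p <- xyplot(y ~ x, data = demo)
class(p)

# Steps 3-4: printing the object sets up the panels and calls the panel function
print(p)
```

At the console, typing p on its own has the same effect as print(p), which is why plots usually appear without an explicit print call.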

Data sets in lattice package:

Dataset Description
USMortality Mortality Rates in US by Cause and Gender
USRegionalMortality Mortality Rates in US by Cause and Gender
barley Yield data from a Minnesota barley trial
environmental Atmospheric environmental conditions in New York City
ethanol Engine exhaust fumes from burning ethanol
melanoma Melanoma skin cancer incidence
singer Heights of New York Choral Society singers

1.1.2 A Simple Example

As you may have noticed, arguments within the lattice package are much more consistent than those in the graphics package. (For example, data for barplot() is specified with the height argument, while data for plot() is specified with x and y.) In lattice, you can always specify the data to plot using a formula and a data frame.

  • Create a simple data set and plot a scatter plot with xyplot():
library(lattice)
d <- data.frame(x = c(0:9), y = c(1:10), z = c(rep(c("a", "b"), times = 5)))
d
x y z
0 1 a
1 2 b
2 3 a
3 4 b
4 5 a
5 6 b
6 7 a
7 8 b
8 9 a
9 10 b
# Figure 1
xyplot(y~x, data = d)

To plot this data frame, we’ll use the formula y~x and specify the data frame d. The first argument given is the formula. Formulas in the lattice package can also specify a conditioning variable. The conditioning variable is used to assign data points to different panels.

  • Plot the same data shown above in two panels, split by the conditioning variable z:
# Figure 2
xyplot(y~x|z, data = d)

As you can see, the data is now split into two panels. If you would prefer to see the two data series superimposed on the same plot, you can use the argument groups to specify the grouping variable(s).

  • Plot the two data series superimposed on the same plot, distinguished by the grouping variable z:
# Figure 3
xyplot(y~x, groups = z, data = d)

As shown in the figure above, the two data series are represented by different symbols.
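Conditioning and grouping can also be combined in one call. The following is a minimal sketch with a hypothetical data frame (the column names g and site are invented for illustration): each level of site gets its own panel, and within a panel each level of g gets its own symbol. Adding auto.key = TRUE draws a legend for the groups.

```r
library(lattice)

# Hypothetical data: two series (g) observed at two sites
demo <- data.frame(
  x    = rep(1:5, times = 4),
  y    = c(1:5, 2 * (1:5), 5:1, 10 - 2 * (1:5)),
  g    = rep(rep(c("a", "b"), each = 5), times = 2),
  site = rep(c("north", "south"), each = 10)
)

# One panel per level of site; one symbol/color per level of g inside each panel
p <- xyplot(y ~ x | site, groups = g, data = demo, auto.key = TRUE)
print(p)
```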

1.1.3 Using Lattice Functions

The easiest way to use lattice graphics is by calling a high-level plotting function. Most of these functions are the equivalent of a similar function in the graphics package. Here is a table showing how standard graphics functions map to lattice functions:

Graphics package function   Lattice package function   Description
barplot                     barchart                   Bar and column charts
dotchart                    dotplot                    Cleveland dot plots
hist                        histogram                  Histograms
density                     densityplot                Kernel density plots
plot.density                densityplot                Kernel density plots
stripchart                  stripplot                  Strip charts
plot                        xyplot                     Scatter plots
pairs                       splom                      Scatter plot matrices
image                       levelplot                  Image plots
contour                     contourplot                Contour plots
persp                       cloud, wireframe           Perspective charts of three-dimensional data
(none listed)               qqmath                     Quantile-quantile plots
(none listed)               qq                         Quantile-quantile plots

When you call a high-level lattice function, it does not actually plot the data. Instead, each of these functions returns a lattice object. To actually show the graphic, you need to use a print or plot command.

  • The lattice object
# Figure 4
obj <- xyplot(y~x, data = d)
plot(obj)
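Because the result is an ordinary R object, it can be stored and adjusted before it is drawn. The sketch below recreates the data frame d from the earlier example so that it runs on its own, and uses lattice's update() method to add a title and an axis label; the label text is only an illustration.

```r
library(lattice)

# Same toy data as in "A Simple Example"
d <- data.frame(x = 0:9, y = 1:10, z = rep(c("a", "b"), times = 5))

obj <- xyplot(y ~ x, data = d)   # builds the object; nothing is drawn here

# update() returns a modified copy and leaves obj unchanged
obj_titled <- update(obj, main = "y against x", xlab = "x value")

print(obj_titled)                # drawing happens when the object is printed
```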

For some (but not all) lattice functions, it is possible to specify the source data in multiple forms.

  • The function histogram can accept data arguments as data frames, factors or numeric vectors.
# Figure 5
x <- rnorm(100)
histogram(x, data = NULL)
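For comparison, here is a small sketch of the same histogram written two ways: once from a bare numeric vector and once with the formula interface. The seed and the data frame are assumptions made only so the example is reproducible and self-contained.

```r
library(lattice)

set.seed(1)   # arbitrary seed, chosen only for reproducibility
x <- rnorm(100)

h_vec     <- histogram(x)                               # numeric vector
h_formula <- histogram(~ x, data = data.frame(x = x))   # one-sided formula plus data frame

print(h_formula)
```

The formula form is the one that extends to conditioning (for example ~ x | g), so it is usually the more flexible choice.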

Here is a table of data types accepted by different lattice functions:

Lattice function Data types
barchart Array, formula, matrix, numeric vector, table
dotplot Array, formula, matrix, numeric vector, table
histogram Factor, formula, numeric vector
densityplot Formula, numeric vector
stripplot Formula, numeric vector
qqmath Formula, numeric vector
xyplot Formula
qq Formula
splom Data frame, formula, matrix
levelplot Array, formula, matrix, table
contourplot Array, formula, matrix, table
cloud Formula, matrix, table
wireframe Formula, matrix

For more details on arguments to lattice functions, see the later section “Customizing Lattice Graphics”.

1.1.4 Custom Panel Functions

With standard graphics, you could easily superimpose points, lines, text, and other objects on existing charts. It’s possible to do the same thing with lattice graphics.

In order to add extra graphical elements to a lattice plot, you need to use a custom panel function.

  • Add a diagonal line into Figure 2:
# Figure 6
xyplot(y~x|z, 
       data  = d,
       panel = function(...){panel.abline(a = 1, b = 1)
                             panel.xyplot(...)}
)

We create a new custom panel function that calls both panel.xyplot() and panel.abline(). The new panel function will pass along its arguments to panel.xyplot(). We specify a line that crosses the y-axis at 1 (through the a = 1 argument to panel.abline()) and has slope 1 (through the b = 1 argument to panel.abline()).
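A panel function can also name the x and y arguments explicitly, which gives it access to the packet for the current panel. The sketch below is one possible variation on Figure 6 (not the tutorial's own code): it uses lattice's panel.lmline() to add a least-squares line fitted separately to the data in each panel.

```r
library(lattice)

# Same toy data as in "A Simple Example"
d <- data.frame(x = 0:9, y = 1:10, z = rep(c("a", "b"), times = 5))

p <- xyplot(y ~ x | z, data = d,
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)       # draw the points of this packet
              panel.lmline(x, y, lty = 2)   # dashed regression line for this packet only
            })
print(p)
```

Here x and y contain only the observations assigned to the panel being drawn, so each panel gets its own fitted line.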

1.2 High-Level Lattice Plotting Functions

1.2.1 Univariate Trellis Plots

In this section, we use the same data set for most of the examples: births in the United States during 2006. The version that is included in the nutshell package only contains a 10% sample from the original data file. Each record includes the following variables:

Variable Description
DOB_MM Month of date of birth
DOB_WK Day of week of birth
MAGER Mother’s age
TBO_REC Total birth order
WTGAIN Weight gain by mother
SEX A factor with levels F M, representing the sex of the child
APGAR5 APGAR score
DMEDUC Mother’s education level
UPREVIS Number of prenatal visits
ESTGEST Estimated weeks of gestation
DMETH_REC Delivery Method
DPLURAL “Plural Births;” levels include 1 Single, 2 Twin, 3 Triplet or higher
DBWT Birth weight, in grams

You can load this data set with the following code:

library(nutshell)
data(births2006.smpl)

1.2.1.1 Bar Charts

  • Calculating a table of the number of births by day of week and then printing a bar chart to show the number of births by day of week:
# Figure 7
births.dow <- table(births2006.smpl$DOB_WK)
barchart(births.dow)

Notice that many more babies are born on weekdays than on weekends. That’s a little surprising.

You might wonder if there is a difference in the number of births because of the delivery method; maybe doctors just schedule a lot of cesarean sections on weekdays, and natural births occur all the time. This is the type of question that the lattice package is great for answering.

  • Eliminate records where the delivery method was unknown and then tabulate the number of births by day of week and method:
# Figure 8
births2006.dm <- transform(births2006.smpl[births2006.smpl$DMETH_REC != "Unknown", ], 
                           DMETH_REC = as.factor(as.character(DMETH_REC)))
dob.dm.tbl <- table(WK = births2006.dm$DOB_WK, MM = births2006.dm$DMETH_REC)
barchart(dob.dm.tbl)

By default, barchart prints stacked bars with no legend; different colors show different groups. Because the shades aren’t labeled, it’s not immediately obvious what each one represents. Let’s try to change the way the chart is displayed.

  • Unstack the bars (stack = FALSE) and add a legend (auto.key = TRUE):
# Figure 9
barchart(dob.dm.tbl, stack = FALSE, auto.key = TRUE)

It’s a little easier to see that both types of births decrease on weekends, but it’s still a little difficult to compare values within each group. So let’s try a different approach.

  • Split bars into two different panels by telling barchart not to group by color (groups = FALSE) and change to columns (horizontal=FALSE)
# Figure 10
barchart(dob.dm.tbl, groups = FALSE, horizontal = FALSE)

The two different charts are in different panels. Now, we can more clearly see what’s going on. The number of vaginal births decreases on weekends, by maybe 25 to 30%. However, C-sections drop by 50 to 60%. As you can see, lattice graphics let you quickly try different ways to present information, helping you zero in on the method that best illustrates what is happening in the data.
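If the nutshell data is not at hand, the same options can be tried on the barley data set that ships with lattice. This is a sketch using the formula interface rather than a table; the choice of variables is only an illustration.

```r
library(lattice)

# barley ships with lattice: yield by variety, year, and site
p_stacked <- barchart(yield ~ variety | site, data = barley,
                      groups = year, stack = TRUE, auto.key = TRUE)

# Side-by-side bars, with rotated x-axis labels so the variety names fit
p_side <- barchart(yield ~ variety | site, data = barley,
                   groups = year, stack = FALSE, auto.key = TRUE,
                   scales = list(x = list(rot = 45)))
print(p_side)
```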

1.2.1.2 Dot plots

Like bar charts, dot plots are useful for showing data where there is a single point for each category, especially when we’re going to summarize larger data tables. For example, let’s look at a chart of data on births by day of week. Is the pattern we saw above a seasonal pattern?

  • Create a new table counting births by month, day of week, and delivery method and then plot the results using a dot plot:
# Figure 11
dob.dm.tbl.alt <- table(WEEK = births2006.dm$DOB_WK, 
                        MONTH = births2006.dm$DOB_MM, 
                        METHOD = births2006.dm$DMETH_REC)
dotplot(dob.dm.tbl.alt, 
        stack    = FALSE, 
        auto.key = TRUE, 
        groups   = TRUE
        )

In this plot, we keep on grouping, so that different delivery methods are shown in different colors (groups = TRUE). To help highlight differences, we’ll disable stacking values (stack = FALSE). Finally, we’ll print a key so that it’s obvious what each symbol represents (auto.key = TRUE).

As you can see, there are slight seasonal differences, but the overall pattern remains the same.

As another example of dot plots, let’s look at the tire failure data in the nutshell package. In 2003, the National Highway Traffic Safety Administration (NHTSA) began a study into the durability of radial tires on light trucks. Tests were carried out on six different types of tires. Here is a table of the characteristics of the tires:

Tire  Size           Load Index  Speed Rating  Brand        Model           OE Vehicle   OE Model
B     P195/65R15     89          S             BF Goodrich  Touring T/A     Chevy        Cavalier
C     P205/65R15     92          V             Goodyear     Eagle GA        Lexus        ES300
D     P235/75R15     108         S             Michelin     LTX M/S         Ford, Dodge  E 150 Van, Ram Van 1500
E     P265/75R16     114         S             Firestone    Wilderness AT   Chevy/GMC    Silverado, Tahoe, Yukon
H     LT245/75R16/E  120/116     Q             Pathfinder   ATR A/S OWL     NA           NA
L     255/65R16      109         H             General      Grabber ST A/S  Mercedes     ML320

We focus on only three variables. Time_To_Failure is the time before each tire failed (in hours), Speed_At_Failure_km_h is the testing speed at which the tire failed, and Tire_Type is the type of tire tested. We know that tests were only run at certain stepped speeds; despite the fact that speed is a numeric variable, we can treat it as a factor.

  • Show the one continuous variable (time to failure) by the speed at failure for each different type of tire
# Figure 12
library(nutshell)
data(tires.sus)
dotplot(as.factor(Speed_At_Failure_km_h)~Time_To_Failure|Tire_Type, data = tires.sus)

This diagram lets us clearly see how quickly tires failed in each of the tests. For example, all type D tires failed quickly at the testing speed of 180 km/h, but some type H tires lasted a long time before failure.

1.2.1.3 Histograms

The histogram is a very popular chart for showing the distribution of a variable. As an example of histograms, let’s look at birth weights, grouped by plurality (the number of babies delivered in one birth).

  • Show the distribution of birth weights, split by plurality:
# Figure 13
histogram(~DBWT|DPLURAL, data = births2006.smpl)

This format helps make each chart readable by itself, but makes it difficult to compare the different groups.

  • Stack the charts on top of each other using the layout argument:
# Figure 14
histogram(~DBWT|DPLURAL, data = births2006.smpl, layout = c(1, 5))

It’s easy to see that birth weights are roughly normally distributed within each group, but the mean weight drops as the number of births increases.
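The layout argument is given as c(columns, rows). As a self-contained sketch, the same stacking can be tried on the singer data set that ships with lattice; computing the number of rows from the factor levels (n_parts is a name introduced here) avoids hard-coding it.

```r
library(lattice)

# singer ships with lattice; voice.part is a factor
n_parts <- nlevels(singer$voice.part)

# One column and one row per voice part, so all panels share a common x-axis
p <- histogram(~ height | voice.part, data = singer, layout = c(1, n_parts))
print(p)
```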

1.2.1.4 Density plots

If you’d like to see a single line showing the distribution, instead of a set of columns representing bins, you can use kernel density plots.

  • Redraw the example above and replace the histogram with a density plot:
# Figure 15
densityplot(~DBWT|DPLURAL, 
            data   = births2006.smpl, 
            layout = c(1, 5), 
            plot.points = FALSE
            )

By default, densityplot will draw a strip chart under each chart, showing every data point. Because the data set is so big, we suppress this by specifying plot.points = FALSE.

One advantage of density plots over histograms is that you can stack them on top of each other and still read the results.

  • Change the conditioning variable (DPLURAL) to a grouping variable so that we can stack these charts on top of each other:
# Figure 16
densityplot(~DBWT, 
            groups   = DPLURAL, 
            data     = births2006.smpl, 
            plot.points = FALSE, 
            auto.key = TRUE
            )

As you can see, it’s easier to compare distribution shapes (and centers) by superimposing the charts.
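The same conditioning-versus-grouping contrast can be reproduced on a built-in data set. The sketch below uses iris purely as a stand-in for the births data.

```r
library(lattice)

# One panel per species, stacked vertically
p_panels <- densityplot(~ Sepal.Length | Species, data = iris,
                        layout = c(1, 3), plot.points = FALSE)

# One panel, with the three densities superimposed and a legend
p_groups <- densityplot(~ Sepal.Length, groups = Species, data = iris,
                        plot.points = FALSE, auto.key = TRUE)
print(p_groups)
```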

1.2.1.5 Strip plots

Strip plots are a good alternative to histograms, especially when there isn’t much data to plot. Strip plots look similar to dot plots, but they show different information. Dot plots are designed to show one value per category (often a mean or a sum), while strip plots show many values. You can think of strip plots as one-dimensional scatter plots.

As an example of a strip plot, let’s look at the weights of babies born in sets of 4 or more. There were only 44 observations in our data set that match this description.

  • Use the subset argument to specify two sets of observations, and add some random vertical noise to make the points easier to read (jitter.data = TRUE):
# Figure 17
stripplot(~DBWT, 
          data   = births2006.smpl, 
          subset = (DPLURAL == "5 Quintuplet or higher"  | DPLURAL == "4 Quadruplet"), 
          jitter.data = TRUE
          )

1.2.1.6 Univariate quantile-quantile plots

A quantile-quantile plot is a useful plot that can compare the distribution of actual data values to a theoretical distribution. It plots quantiles of the observed data against quantiles of a theoretical distribution. If the plotted points form a straight diagonal line (from top right to bottom left), then it is likely that the observed data comes from the theoretical distribution. Quantile-quantile plots are a very powerful technique for seeing how closely a data set matches a theoretical distribution (or how much it deviates from it).

  • Plot 100,000 random values from a normal distribution to show what qqmath does:
# Figure 18
qqmath(rnorm(100000))

By default, the function qqmath compares the sample data to a normal distribution. If the sample data is really normally distributed, you’ll see a straight diagonal line.
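The reference distribution can be changed through qqmath's distribution argument, which takes a quantile function. As a rough sketch with simulated data (the seed and sample size are arbitrary), exponential values look curved against the default normal quantiles but close to a straight line against exponential quantiles.

```r
library(lattice)

set.seed(42)   # arbitrary seed for reproducibility
sim <- data.frame(x = rexp(1000))

p_norm <- qqmath(~ x, data = sim)                       # default: normal quantiles
p_exp  <- qqmath(~ x, data = sim, distribution = qexp)  # exponential quantiles

print(p_exp)
```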

  • Plot a set of quantile-quantile plots for a random sample of 50,000 points from the birth weight data:
# Figure 19
qqmath(~DBWT|DPLURAL, 
       data   = births2006.smpl[sample(1:nrow(births2006.smpl), 50000), ], 
       pch    = 19, 
       cex    = 0.25, 
       subset = (DPLURAL != "5 Quintuplet or higher")
       )

As you can see from the figure above, the distribution of birth weights is not quite normal.

As another example, let’s look at real estate prices in San Francisco in 2008 and 2009. This data set is included in the nutshell package as sanfrancisco.home.sales.

  • Compare the distribution of real estate prices with a normal distribution:
# Figure 20
library(nutshell)
data(sanfrancisco.home.sales)
qqmath(~price, data = sanfrancisco.home.sales)

As expected, real estate prices are not normally distributed. Intuitively, it doesn’t make sense for real estate prices to be normally distributed. There are far more people with below-average incomes than above-average incomes. The lowest recorded price in the data set is $100,000; the highest is $9,500,000.

But it looks exponential, so let’s try a log transform.

  • Determine whether log-transformed data is normally distributed or not:
# Figure 21
qqmath(~log(price), data = sanfrancisco.home.sales)

A log transform yields a distribution that looks pretty close to normally distributed.
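One way to put a number on "pretty close" is the correlation between the sorted sample and the matching normal quantiles, which approaches 1 when the quantile-quantile plot is a straight line. The sketch below uses simulated log-normal values as a stand-in for prices; price_sim and its parameters are assumptions, not the San Francisco data.

```r
library(lattice)

set.seed(123)   # arbitrary seed for reproducibility
sim <- data.frame(price_sim = rlnorm(2000, meanlog = 13, sdlog = 0.6))

# Theoretical normal quantiles for a sample of this size
q_theory <- qnorm(ppoints(nrow(sim)))

# Straightness of the QQ plot before and after the log transform
r_raw <- cor(sort(sim$price_sim), q_theory)
r_log <- cor(sort(log(sim$price_sim)), q_theory)
c(raw = r_raw, log = r_log)

p <- qqmath(~ log(price_sim), data = sim)
print(p)
```

A visual check remains the main tool; the correlation is just a convenient summary when comparing several transformations.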

Then let’s take a look at how the distribution changes based on the number of bedrooms.

  • Plot smooth lines (type = "smooth") to show how the distribution changes based on the number of bedrooms (groups = bedrooms):
# Figure 22
qqmath(~log(price), 
       groups   = bedrooms, 
       data     = subset(sanfrancisco.home.sales, 
                         !is.na(bedrooms) & bedrooms>0 & bedrooms<7), 
       auto.key = TRUE, 
       drop.unused.levels = TRUE, 
       type     = "smooth"
       )

In this call, we pass an explicitly subsetted data frame as the data argument instead of using the subset argument. Notice that the lines are separate, with higher values for higher numbers of bedrooms. drop.unused.levels is a logical flag indicating whether the unused levels of factors will be dropped.

We can do the same thing for square footage.

  • Show how the distribution changes based on square footage.
# Figure 23
library(Hmisc)  # provides cut2()
qqmath(~log(price), 
       groups   = cut2(squarefeet, g = 6), 
       data     = subset(sanfrancisco.home.sales, !is.na(squarefeet)), 
       auto.key = TRUE, 
       drop.unused.levels = TRUE, 
       type     = "smooth"
       )

The function cut2 from the Hmisc package divides square footage into six quantile groups of roughly equal size.
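If Hmisc is not installed, a similar grouping can be approximated in base R with cut() and quantile(). This is not cut2 itself, only a sketch of the same idea on simulated values; sqft and its range are assumptions made for illustration.

```r
set.seed(7)   # arbitrary seed for reproducibility
sqft <- round(runif(600, min = 400, max = 5000))   # hypothetical square footages

# Six groups need seven break points: the 0%, 16.7%, ..., 100% quantiles
breaks <- quantile(sqft, probs = seq(0, 1, length.out = 7))

# include.lowest = TRUE keeps the minimum value in the first group
sqft_group <- cut(sqft, breaks = breaks, include.lowest = TRUE)

table(sqft_group)
```

The resulting factor can be passed to groups in the same way as the cut2 result. Note that cut() stops with an error if two quantiles coincide, which can happen with heavily tied data.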

1.2.2 Bivariate Trellis Plots

This section describes Trellis plots for plotting two variables. Many real data sets (for example, financial data) record relationships between multiple numeric variables.

1.2.2.1 Scatter plots

As an example of a scatter plot, let’s take a look at the relationship between house size and price.

  • Show size and price:
# Figure 24
xyplot(price~squarefeet, data = sanfrancisco.home.sales)

It looks like there is a rough correspondence between size and price (the plot looks vaguely cone shaped). Let’s analyze it further.

  • Show how this relationship varies by zip code. Trim outliers (sales prices over 4,000,000 and properties over 6,000 square feet) using the subset argument.
table(subset(sanfrancisco.home.sales, !is.na(squarefeet), select = zip))
## 
## 94100 94102 94103 94104 94105 94107 94108 94109 94110 94111 94112 94114 
##     2    52    62     4    44   147    21   115   161    12   192   143 
## 94115 94116 94117 94118 94121 94122 94123 94124 94127 94131 94132 94133 
##   101   124   114    92    92   131    71    85   108   136    82    47 
## 94134 94158 
##   105    13
# Figure 25
xyplot(price~squarefeet|zip, 
       data = sanfrancisco.home.sales, 
       subset = (zip!=94100 & zip!=94104 & zip!=94108 & zip!=94111 & 
                 zip!=94133 & zip!=94158 & price<4000000 & 
                 ifelse(is.na(squarefeet), FALSE, squarefeet<6000)), 
       strip = strip.custom(strip.levels = TRUE)
       )

The subset expression picks the zip codes to plot. A few parts of the city are sparsely populated (like the financial district, 94104) and don’t have enough data to make plotting interesting. strip.custom() creates the function that draws the strips, with certain arguments preset. Its strip.levels argument is a logical vector of length 2 indicating whether or not the level of the conditioning variable is to be written on the strip.

Now, the linear relationship is much more pronounced. Notice the different slopes in different neighborhoods. We can make this slightly more readable by using neighborhood names.

  • Rerun the code, conditioning by neighborhood. Add a diagonal line to each plot (through a custom panel function). Change the default points plotted to be solid (pch = 19) and shrink them to a smaller size (cex=.2):
# Figure 26
dollars.per.squarefoot <- mean(
  sanfrancisco.home.sales$price / sanfrancisco.home.sales$squarefeet,
  na.rm = TRUE)
xyplot(price~squarefeet|neighborhood,
       data   = sanfrancisco.home.sales,
       pch    = 19,
       cex    = .2,
       subset = (zip!=94100 & zip!=94104 & zip!=94108 & zip!=94111 & 
                 zip!=94133 & zip!=94158 & price<4000000 & 
                 ifelse(is.na(squarefeet), FALSE, squarefeet<6000)),
       strip  = strip.custom(strip.levels   = TRUE,
                             horizontal     = TRUE,
                             par.strip.text = list(cex = .8)),
       panel  = function(...) {panel.abline(a = 0, b = dollars.per.squarefoot)
                               panel.xyplot(...)}
       )
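The strip.custom() settings used above can be tried on a built-in data set. In this sketch, iris stands in for the home sales data; strip.names and strip.levels control whether the name of the conditioning variable and its current level appear on each strip, and par.strip.text sets the text size.

```r
library(lattice)

p <- xyplot(Sepal.Length ~ Sepal.Width | Species, data = iris,
            pch = 19, cex = 0.5,
            strip = strip.custom(strip.names    = TRUE,   # show "Species"
                                 strip.levels   = TRUE,   # show the level, e.g. "setosa"
                                 par.strip.text = list(cex = 0.8)))
print(p)
```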

1.2.2.2 Box plots

Box plots in the lattice package are just like box plots drawn with the graphics package. The boxes represent prices from the 25th through the 75th percentiles (the interquartile range), the dots represent median prices, and the whiskers extend to the minimum and maximum values. (When some values lie more than 1.5 times the interquartile range beyond the box, the whiskers stop at the most extreme value within that distance, and the remaining values are drawn as individual points.)
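The whisker rule can be checked numerically with boxplot.stats() from grDevices, which is the default statistics function of lattice's panel.bwplot (with coef = 1.5). The values below are hypothetical and chosen so that one point falls outside the whiskers.

```r
library(lattice)

vals <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # hypothetical values with one extreme point

s <- boxplot.stats(vals, coef = 1.5)
s$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out     # values drawn as individual points beyond the whiskers

p <- bwplot(~ vals, data = data.frame(vals = vals))
print(p)
```

Here the hinges are 3 and 8, so the upper fence is 8 + 1.5 * 5 = 15.5. The upper whisker therefore stops at 9 and the value 30 is plotted separately.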

Let’s take a look at how the San Francisco home prices changed over time. We can use box plots to watch how the whole distribution changed in this period.

  • Show the distribution of sales prices by month
# Figure 27
table(cut(sanfrancisco.home.sales$date, "month"))
## 
## 2008-02-01 2008-03-01 2008-04-01 2008-05-01 2008-06-01 2008-07-01 
##        139        230        267        253        237        198 
## 2008-08-01 2008-09-01 2008-10-01 2008-11-01 2008-12-01 2009-01-01 
##        253        223        272        118        181        114 
## 2009-02-01 2009-03-01 2009-04-01 2009-05-01 2009-06-01 2009-07-01 
##        123        142        116        180        150         85
bwplot(price~cut(date, "month"), data = sanfrancisco.home.sales)

Unfortunately, this doesn’t produce an easily readable plot: a large number of outliers make the plot hard to read. Let’s try plotting the box plots again, this time with the log-transformed values. To make it more readable, we can change to vertical box plots and rotate the text at the bottom:

# Figure 28
bwplot(log(price)~cut(date, "month"),
       data   = sanfrancisco.home.sales,
       scales = list(x = list(rot = 90))
       )

We can more clearly see some trends in this plot. Median prices moved around a little during this period, though the interquartile range moved a lot. Moreover, the basic distribution appears pretty stable from month to month.

1.2.2.3 Scatter plot matrices

If you would like to generate a matrix of scatter plots for many different pairs of variables, you can use the splom function.

  • Show the relationships between the four measurement variables in the iris data set, grouped by species (you will see a 4 × 4 grid of 16 subpanels):
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
# Figure 29
super.sym <- trellis.par.get("superpose.symbol")
splom(~iris[1:4], groups = Species, data = iris,
      panel = panel.superpose,
      key   = list(title   = "Three Varieties of Iris",
                   columns = 3, 
                   points  = list(pch = super.sym$pch[1:3],
                                  col = super.sym$col[1:3]),
                   text    = list(c("Setosa", "Versicolor", "Virginica")))
      )

1.2.2.4 Bivariate quantile-quantile plots

If you would like to generate quantile-quantile plots for comparing two distributions, you can use the function qq.

  • Compare the height distributions of the “Bass 2” and “Tenor 1” voice parts:
# library(lattice)
head(singer)
height voice.part
64 Soprano 1
62 Soprano 1
66 Soprano 1
65 Soprano 1
60 Soprano 1
61 Soprano 1
# Figure 30
qq(voice.part ~ height, 
   data   = singer,
   aspect = 1,
   subset = (voice.part == "Bass 2" | voice.part == "Tenor 1")
   )

aspect = 1 makes the x-axis and the y-axis the same length. As we can see, there is a small difference between the two distributions, because the plotted points don’t fall exactly on a straight diagonal line.

1.2.3 Trivariate Trellis Plots

If you would like to plot three-dimensional data with Trellis graphics, there are several functions available.

1.2.3.1 Level plots

The levelplot function plots three-dimensional data on a flat grid, with colors showing the values of the third dimension. As an example of level plots, we again look at the San Francisco home sales data set.

  • Show the number of home sales in different parts of the city. You can use that coordinate data in the San Francisco home sales data set.
# Figure 31
attach(sanfrancisco.home.sales)
levelplot(table(cut(longitude, breaks = 40), cut(latitude, breaks = 40)),
          scales = list(y = list(cex = .5), x = list(rot = 90, cex = .5)),
          xlab   = "longitude",  # the first table dimension (longitude) is drawn on the x-axis
          ylab   = "latitude"
          )

The table and cut functions break the longitude and latitude data into bins and count the number of homes within each bin. The xlab and ylab arguments are characters or expressions (or “grobs”) giving the labels for the x-axis and y-axis.

If we were interested in looking at the average sales price by area, we could use a similar strategy. Instead of table, you can use the tapply function to aggregate observations.

# Figure 32
levelplot(tapply(price, 
                 INDEX = list(cut(longitude, breaks = 40), cut(latitude, breaks = 40)),
                 FUN   = mean),
          scales = list(draw = FALSE),
          xlab = "longitude",  # the first index (longitude) is drawn on the x-axis
          ylab = "latitude"
          )

scales is generally a list determining how the x- and y-axes (tick marks and labels) are drawn. Its draw component is a logical flag that determines whether to draw the axes (i.e., tick marks and labels) at all.
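The binning and aggregation steps can be separated from the plotting to make them easier to follow. The sketch below uses simulated coordinates and values (lon, lat, and val are hypothetical stand-ins for the home sales columns) and avoids attach() by working with plain vectors.

```r
library(lattice)

set.seed(99)   # arbitrary seed for reproducibility
n   <- 500
lon <- runif(n, -122.5, -122.4)               # hypothetical longitudes
lat <- runif(n, 37.7, 37.8)                   # hypothetical latitudes
val <- rlnorm(n, meanlog = 13, sdlog = 0.5)   # hypothetical prices

# Step 1: turn each coordinate into a factor of 10 equal-width bins
lon_bin <- cut(lon, breaks = 10)
lat_bin <- cut(lat, breaks = 10)

# Step 2: aggregate per grid cell
counts <- table(lon_bin, lat_bin)                                   # observations per cell
means  <- tapply(val, INDEX = list(lon_bin, lat_bin), FUN = mean)   # mean per cell, NA if empty

# Step 3: rows of the matrix go on the x-axis, columns on the y-axis
p <- levelplot(means, scales = list(draw = FALSE),
               xlab = "longitude bin", ylab = "latitude bin")
print(p)
```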

Of course, you can use conditioning values with level plots.

  • Show the number of home sales by number of bedrooms. Simplify the data slightly by keeping houses with zero to four bedrooms as separate categories and combining houses with five bedrooms or more into one category.
# Figure 33
bedrooms.capped <- ifelse(bedrooms<5, bedrooms, 5)
levelplot(table(cut(longitude, breaks = 25),cut(latitude, breaks = 25), bedrooms.capped),
          scales = list(draw = FALSE)
          )

1.2.3.2 Contour plots

The contourplot function draws contour plots with lattice (which resemble topographic maps). Use ?contourplot to see more details.

  • A simple example:
# Figure 34
contourplot(volcano)
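contourplot also accepts a formula of the form z ~ x * y when the data is stored as one row per grid point. A minimal sketch, assuming a made-up surface evaluated on a regular grid (grid and its columns are hypothetical names):

```r
library(lattice)

# Regular 50 x 50 grid with a smooth bump centered at the origin
grid <- expand.grid(x = seq(-2, 2, length.out = 50),
                    y = seq(-2, 2, length.out = 50))
grid$z <- with(grid, exp(-(x^2 + y^2)))

# cuts controls the number of contour levels
p <- contourplot(z ~ x * y, data = grid, cuts = 8)
print(p)
```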

1.2.3.3 Cloud plots

The cloud function plots points in three dimensions (technically, projections into two dimensions of the points in three dimensions). Use ?cloud to see more details. The volcano data set used below is in the datasets package.

  • A simple example:
# Figure 35
cloud(volcano, zlab = list("volcano", rot = 90))

1.2.3.4 Wire-frame plots

If you would like to show a three-dimensional surface, you could use the function wireframe.

  • A simple example:
# Figure 36
wireframe(volcano, zlab = list("volcano", rot = 90))
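The viewing angle and surface rendering can be adjusted as well. The following sketch shades the surface and rotates the view with the screen argument; the particular angles are arbitrary choices for illustration.

```r
library(lattice)

p <- wireframe(volcano,
               shade  = TRUE,                       # shaded surface instead of a bare mesh
               zlab   = list("volcano", rot = 90),
               screen = list(z = 30, x = -60))      # rotate about the z-axis, then the x-axis
print(p)
```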

1.2.4 Other Plots

If you have fitted a model to a data set, the rfs function can help you visualize how well the model fits the data. The rfs function plots residual and fit-spread (RFS) plots.

# library(nutshell)
data(team.batting.00to08)
head(team.batting.00to08)
teamID yearID runs singles doubles triples homeruns walks stolenbases caughtstealing hitbypitch sacrificeflies atbats
ANA 2000 864 995 309 34 236 608 93 52 47 43 5628
BAL 2000 794 992 310 22 184 558 126 65 49 54 5549
BOS 2000 792 988 316 32 167 611 43 30 42 48 5630
CHA 2000 978 1041 325 33 216 591 119 42 53 61 5646
CLE 2000 950 1078 310 30 221 685 113 34 51 52 5683
DET 2000 823 1028 307 41 177 562 83 38 43 49 5644
# Figure 37
attach(team.batting.00to08)
rfs(lm(formula = runs~singles+doubles+triples+homeruns+walks+
                      hitbypitch+sacrificeflies+stolenbases+caughtstealing,
       data = team.batting.00to08), 
    aspect = 1
    )

Notice that the two curves are S shaped. The residual plot is a quantile-quantile plot of the residuals. Because the default distribution choice for rfs is a uniform distribution, we need to modify certain arguments.

  • Show the fitted results and determine if the residuals fit the normal distribution:
# Figure 38
rfs(lm(formula = runs~singles+doubles+triples+homeruns+walks+
                      hitbypitch+sacrificeflies+stolenbases+caughtstealing,
       data = team.batting.00to08), 
    aspect = 1,
    distribution = qnorm
    )

Notice that the plots are roughly linear. We expect normally distributed errors for a linear regression model, so this is a good sign.
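The same diagnostic can be tried without the nutshell data. This sketch fits a small linear model to the built-in mtcars data (the choice of predictors is only an example) and draws the RFS plot against normal quantiles.

```r
library(lattice)

# Hypothetical example model: fuel economy explained by weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Left panel: fitted values minus their mean; right panel: residuals
p <- rfs(fit, aspect = 1, distribution = qnorm)
print(p)
```

If the spread of the fitted values is large compared with the spread of the residuals, the model explains much of the variation in the response.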

1.2.5 Customizing Lattice Graphics

Most lattice functions share common arguments; the same argument has a similar effect in multiple functions. Let’s describe what each of those arguments does and explain how to fine-tune the output of lattice functions.

1.2.5.1 Common Arguments to Lattice Functions

Instead of explaining these arguments separately for each function, we describe them in a single table.

Argument Description
x The object to plot. May be a formula, array, numeric vector, or table.
data When x is a formula, data is a data frame in which the function is evaluated.
allow.multiple Specifies how to interpret formulas of the form y1 + y2 ~ X | Z (where X is a function of multiple variables and Z may also be a function of multiple variables)
outer Specifies whether to superimpose plots or not when allow.multiple=TRUE and multiple dependent variables are specified.
box.ratio For plots that show data in rectangles (bwplot, barchart, and stripplot), a numeric value that specifies the ratio of the width of the rectangles to the inter-rectangle space.
horizontal For plots that can be laid out vertically or horizontally (bwplot, dotplot, barchart and stripplot), a logical value that specifies the direction to plot.
panel The panel function used to actually draw the plots.
aspect Specifies the aspect ratio to use for different panels.
groups Specifies a variable (or expression of variables) describing groups of data to pass to the panel function.
auto.key A logical value specifying whether to automatically draw a key showing the names of groups corresponding to different colors or symbols.
prepanel A function that takes the same arguments as panel and returns a list containing values xlim, ylim, dx, and dy (and, less frequently, xat and yat).
strip A logical value specifying whether strips (that label panels) should be drawn.
xlab A character value specifying the label for the x-axis.
ylab A character value specifying the label for the y-axis.
scales A list that specifies how the x- and y-axes should be drawn.
subscripts A logical value specifying whether a vector named subscripts should be passed to the panel function.
subset Specifies the subset of values from data to plot.
xlim Specifies the minimum and maximum values for the x-axis.
ylim Specifies the minimum and maximum values for the y-axis.
drop.unused.levels A logical value (or a list outlining what to do for different components of x) specifying whether to drop unused levels of factors.
default.scales A list giving the default value of scales.
lattice.options A list of plotting parameters, similar to par values for standard R graphics.

If you would like more information about these arguments, see the help files.
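To see several of these arguments working together, here is a minimal sketch. It assumes the built-in mtcars data set; the choice of variables, the subset condition and the axis labels are illustrative only.

```r
library(lattice)

p_common <- xyplot(
  mpg ~ wt | factor(cyl),         # x: a formula with a conditioning variable
  data     = mtcars,              # data frame in which the formula is evaluated
  groups   = am,                  # grouping variable passed to the panel function
  auto.key = TRUE,                # draw a key for the groups automatically
  subset   = hp < 250,            # plot only a subset of the rows
  aspect   = 1,                   # square panels
  xlab     = "Weight (1000 lbs)",
  ylab     = "Miles per gallon"
)
print(p_common)
```

Because the arguments are shared, the same data, groups, subset and label arguments could be passed to other high-level functions such as dotplot or bwplot.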

1.2.5.2 Controlling How Axes Are Drawn

You can control how axes are drawn in the lattice package by passing named values in the scales argument. Here is a table of the available components.

Argument Description
rot Angle to rotate axis labels. Can specify a vector of length 2 to separately control left/bottom and right/top.
cex A numeric value that controls the size of axis labels (“character expansion” factor). Can specify a vector of length 2 to separately control left/bottom and right/top.
limits Limits for each axis; equivalent to xlim and ylim.
axs Use axs=“r” to pad data values on each side, axs=“i” to use exact values.
at A numeric vector describing where to plot tick marks (in native coordinates) or a list describing where to plot tick marks for each panel.
labels Labels to accompany at, specified as a vector (or list of vectors).
tck A numeric value specifying the length of the tick marks.

For more details on other arguments, see page 310 of the book R in a Nutshell.
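As a rough illustration of the scales argument, the sketch below sets different components for each axis. It assumes the built-in mtcars data set, and the tick positions and labels are arbitrary choices.

```r
library(lattice)

p_scales <- xyplot(
  mpg ~ wt, data = mtcars,
  scales = list(
    x = list(at     = c(2, 3, 4, 5),               # where to draw tick marks
             labels = c("2k", "3k", "4k", "5k"),   # labels to accompany at
             rot    = 45),                         # rotate the x-axis labels
    y = list(cex = 0.7, tck = 0.5)                 # smaller labels, shorter ticks
  )
)
print(p_scales)
```

Components placed directly in the scales list apply to both axes, while components inside the x or y sub-list apply to that axis only.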

1.2.5.3 Parameters

In the graphics package, we use the par function to set or query default parameters.

  • Check the value of the parameter cex:
par("cex")
## [1] 1

There is a similar mechanism for lattice graphics. To check the value of a setting, use the function trellis.par.get. To change a setting, use the function trellis.par.set.

  • Check the values of the axis.text parameter, which controls the look of text printed on axes:
trellis.par.get("axis.text")
## $alpha
## [1] 1
## 
## $cex
## [1] 0.8
## 
## $col
## [1] "#000000"
## 
## $font
## [1] 1
## 
## $lineheight
## [1] 1
  • Change the parameter axis.text$cex to 0.5
trellis.par.set(list(axis.text = list(cex = 0.5)))
  • Show a list of all settings:
show.settings()

We can also list the names of all settings with trellis.par.get():

names(trellis.par.get())
##  [1] "grid.pars"         "fontsize"          "background"       
##  [4] "panel.background"  "clip"              "add.line"         
##  [7] "add.text"          "plot.polygon"      "box.dot"          
## [10] "box.rectangle"     "box.umbrella"      "dot.line"         
## [13] "dot.symbol"        "plot.line"         "plot.symbol"      
## [16] "reference.line"    "strip.background"  "strip.shingle"    
## [19] "strip.border"      "superpose.line"    "superpose.symbol" 
## [22] "superpose.polygon" "regions"           "shade.colors"     
## [25] "axis.line"         "axis.text"         "axis.components"  
## [28] "layout.heights"    "layout.widths"     "box.3d"           
## [31] "par.xlab.text"     "par.ylab.text"     "par.zlab.text"    
## [34] "par.main.text"     "par.sub.text"

There are 35 high-level groups of parameters describing how different components are drawn. If you want to know the details of what each of these groups controls, please refer to page 313 of the book R in a Nutshell.
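Since trellis.par.set changes the settings for the current device, it can be convenient to save the old values first and restore them afterwards. The following sketch shows one way to do this; the variable names are hypothetical.

```r
library(lattice)

old_axis_text <- trellis.par.get("axis.text")        # remember the current values
trellis.par.set(list(axis.text = list(cex = 0.5)))   # change one component
new_cex <- trellis.par.get("axis.text")$cex          # now 0.5

trellis.par.set(list(axis.text = old_axis_text))     # restore the saved values
restored_cex <- trellis.par.get("axis.text")$cex
```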

1.2.5.4 plot.trellis

As we noted above, lattice functions do not plot results; they return lattice objects (of class "trellis"). To plot a lattice object, call print() or plot() on it; these dispatch to print.trellis() and plot.trellis(). You can control how lattice objects are drawn by passing arguments to these methods.

  • Get the list of arguments for plot.trellis()
?plot.trellis
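For example, the split and more arguments place several lattice objects on one page. This is a minimal sketch that assumes the built-in mtcars data set.

```r
library(lattice)

p_hist <- histogram(~ mpg, data = mtcars)
p_xy   <- xyplot(mpg ~ wt, data = mtcars)

# split = c(column, row, number of columns, number of rows)
print(p_hist, split = c(1, 1, 2, 1), more = TRUE)  # left half; more plots follow
print(p_xy,   split = c(2, 1, 2, 1))               # right half; last plot on the page
```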

1.2.5.5 strip.default

To change the way strips are drawn, you can specify your own strip function as an argument to a lattice function. The simplest way to modify the appearance of the strips is by using the function strip.custom. This function accepts the same arguments as strip.default and returns a new function that can be specified as an argument to a lattice function. We used this function when plotting Figure 25.

  • Get the list of arguments for strip.default
?strip.default
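A short sketch of strip.custom in use follows. It assumes the built-in mtcars data set, and the variable name and background colour are arbitrary.

```r
library(lattice)

p_strip <- xyplot(
  mpg ~ wt | factor(cyl), data = mtcars,
  strip = strip.custom(
    var.name    = "cylinders",  # name shown for the conditioning variable
    strip.names = TRUE,         # show the variable name as well as the level
    bg          = "grey90"      # strip background colour
  )
)
print(p_strip)
```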

1.2.6 Low-Level Functions

The lattice package includes a variety of different panel functions that you can use to customize your charts. For example, you can use these functions to add lines, text, and other graphical elements to lattice graphics.

Function(s) Description
llines, panel.lines Plots lines
lpoints, panel.points Plots points
ltext, panel.text Plots text
panel.axis Plots axes
panel.abline Adds a line to the chart area of a panel.
panel.curve Adds a curve (defined by a mathematical expression) to the chart area of a panel.
panel.mathdensity Plots a probability distribution given by a distribution function.
panel.lmline Plots a line fitted to the underlying data by a linear regression.

For more details on other functions, see the corresponding help files and pages 318-319 of the book R in a Nutshell.
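These functions are typically called from inside a custom panel function. The sketch below assumes the built-in mtcars data set and adds a regression line and a horizontal reference line to a scatterplot; the colours and line types are arbitrary.

```r
library(lattice)

p_panel <- xyplot(
  mpg ~ wt, data = mtcars,
  panel = function(x, y, ...) {
    panel.xyplot(x, y, ...)              # draw the points as usual
    panel.lmline(x, y, col = "red")      # line fitted by linear regression
    panel.abline(h = mean(y), lty = 2)   # dashed line at the mean of y
  }
)
print(p_panel)
```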

2 ggplot2

2.1 An Overview of ggplot2 Package

ggplot2 is an R package for producing statistical, or data, graphics, but it is unlike most other graphics packages because it has a deep underlying grammar. This layered grammar, based on the Grammar of Graphics (Wilkinson 2005), focuses on the primacy of layers and is adapted for embedding within R. In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The grammar is made up of a set of independent components, which makes ggplot2 very powerful: you are not limited to a set of pre-specified graphics, and you can create new graphics that are precisely tailored to your problem.

This package is not part of a standard R installation, so it must first be installed (for example with install.packages("ggplot2")); it can then be loaded into R as follows.

library(ggplot2)

In ggplot2, all plots are composed of:

  • Data

  • Layers made up of geometric elements (geom), such as points, lines and polygons, and statistical transformations (stat), such as binning and counting observations for a histogram or summarising a 2d relationship with a linear model.

  • Scales (scale) map values in the data space to values in an aesthetic space, whether it be colour, or size, or shape.

  • A coordinate system (coord) describes how data coordinates are mapped to the plane of the graphic.

  • A faceting specification (facet) describes how to break up the data into subsets and how to display those subsets as small multiples.

  • A theme which controls the finer points of display, like the font size and background colour.
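To make the components concrete, the following sketch builds one plot that names each of them explicitly. It uses the mpg data set bundled with ggplot2; the particular scale, coordinate limits and theme are arbitrary choices for illustration.

```r
library(ggplot2)

p_components <- ggplot(mpg, aes(displ, hwy, colour = drv)) +    # data and mapping
  geom_point() +                                                # layer: geom
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +     # layer: stat (linear model)
  scale_colour_brewer(palette = "Set1") +                       # scale
  coord_cartesian(ylim = c(10, 45)) +                           # coordinate system
  facet_wrap(~year) +                                           # faceting specification
  theme_bw()                                                    # theme
print(p_components)
```

Each component is added with +, and any of them can be left out, in which case ggplot2 falls back to a default.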

Data sets in ggplot2 package:

Dataset Description
diamonds Prices of 50,000 round cut diamonds
economics US economic time series
economics_long US economic time series
faithfuld 2d density estimate of Old Faithful data
luv_colours ‘colors()’ in Luv space
midwest Midwest demographics
mpg Fuel economy data from 1999 and 2008 for 38 popular models of car
msleep An updated and expanded version of the mammals sleep dataset
presidential Terms of 11 presidents from Eisenhower to Obama
seals Vector field of seal movements
txhousing Housing sales in TX

In this section, we’ll mostly use one data set that’s bundled with ggplot2: mpg. It includes information about the fuel economy of popular car models in 1999 and 2008, collected by the US Environmental Protection Agency.

head(mpg)
manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
Variable Description
cty, hwy Miles per gallon (mpg) for city and highway driving.
displ the engine displacement in litres.
drv the drivetrain: front wheel (f), rear wheel (r) or four wheel (4).
model the model of car.
class a categorical variable describing the “type” of car: two seater, SUV, compact, etc.

2.2 Getting Started with ggplot2

2.2.1 Key Components

Every ggplot2 plot has three key components:

  1. Data,

  2. A set of aesthetic mappings between variables in the data and visual properties, and

  3. At least one layer which describes how to render each observation. Layers are usually created with a geom function.

  • Use a scatterplot to show the relationship between engine size and fuel economy:
# Figure 1
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

This produces a scatterplot defined by:

  1. Data: mpg.

  2. Aesthetic mapping: engine size mapped to x position, fuel economy to y position.

  3. Layer: points.

Pay attention to the structure of this function call: data and aesthetic mappings are supplied in ggplot(), then layers are added on with +.

2.2.2 Aesthetic Attributes

To add additional variables to a plot, we can use other aesthetics like colour, shape, and size. These work in the same way as the x and y aesthetics, and are added into the call to aes():

  • Redraw the scatterplot, colouring the points by car class:
# Figure 2
ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()

This gives each point a unique colour corresponding to its class. The legend allows us to read data values from the colour. You can also use aes(displ, hwy, shape = class) and aes(displ, hwy, size = class) to distinguish classes.

  • Change the colour, shape and size of points
# Figure 3
ggplot(mpg, aes(displ, hwy)) + 
  geom_point(colour = "blue", shape = 17, size = 2)

2.2.3 Facetting

Another technique for displaying additional categorical variables on a plot is facetting. Facetting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset.

There are two types of facetting: grid and wrapped. Wrapped is the most useful, so we take it as an example.

  • Split the data into subsets by class and display the same graph for each subset:
# Figure 4
ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point() +
  facet_wrap(~class)
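For comparison, the grid type lays panels out in rows and columns defined by two variables. This is a minimal sketch with facet_grid(); the choice of drv for rows and cyl for columns is only an illustration.

```r
library(ggplot2)

# Rows are split by drivetrain, columns by number of cylinders
p_grid <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)
print(p_grid)
```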

2.2.4 Plot Geoms

If you replace geom_point() with a different geom function, you get a different type of plot. Below are some of the other important geoms provided in ggplot2. This isn’t an exhaustive list, but it should cover the most commonly used plot types.

Geoms Description Aesthetics
geom_point() Data symbols x, y, shape, fill, alpha, stroke
geom_line() Line (ordered on x) x, y, alpha, linetype
geom_path() Line (original order) x, y, alpha, linetype
geom_text() Text labels x, y, label, angle, hjust, vjust
geom_rect() Rectangles xmin, xmax, ymin, ymax, fill, linetype
geom_polygon() Polygons x, y, fill, linetype
geom_segment() Line segments x, y, xend, yend, linetype
geom_bar() Bars x, y, alpha, fill, linetype
geom_histogram() Histogram x, y, alpha, fill, linetype
geom_boxplot() Boxplots x, lower, upper, middle, ymin, ymax, alpha, fill, weight, shape
geom_density() Density x, y, fill, linetype, weight
geom_contour() Contour lines x, y, alpha, fill, linetype, weight
geom_smooth() Smoothed line x, y, alpha, fill, linetype, weight
ALL color, size, group

Here are some common aesthetics:

Aesthetics Explanation
shape takes four types of values: an integer in [0, 25], a single character, a “.”, an NA
fill fills different colours according to its value (do well in histogram and boxplot)
alpha makes the points transparent (very useful for larger datasets with more overplotting)
stroke modifies the width of the border
linetype solid (default), dotted and dashed
label the text to display (used by geom_text() and geom_label())
angle controls the rotation of the text, in degrees
hjust controls horizontal justification, defined between 0 and 1
vjust controls vertical justification, defined between 0 and 1

2.2.4.1 Adding a Smoother to a Plot

  • Add a smoothed line to the scatterplot:
# Figure 5
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth()

This overlays the scatterplot with a smooth curve, including an assessment of uncertainty in the form of point-wise confidence intervals shown in grey. If you’re not interested in the confidence interval, turn it off with geom_smooth(se = FALSE).

An important argument to geom_smooth() is the method, which allows us to choose which type of model is used to fit the smooth curve:

  • method = "loess", the default for small n, uses a smooth local regression. The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly). Notice that loess does not work well for large datasets (n > 1,000).
# Figure 6
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(span = 0.2)

  • method = "gam" fits a generalised additive model provided by the mgcv package. You need to first load mgcv, then use a formula like formula = y ~ s(x) or y ~ s(x, bs = "cs") (for large data). This is what ggplot2 uses when there are more than 1,000 points.
# Figure 7
library(mgcv)
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x))

  • method = "lm" fits a linear model, giving the line of best fit.
# Figure 8
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm")

2.2.4.2 Boxplots and Jittered Points

When a set of data includes a categorical variable and one or more continuous variables, you will probably be interested to know how the values of the continuous variables vary with the levels of the categorical variable.

  • See how fuel economy varies with drivetrain:
# Figure 9
ggplot(mpg, aes(drv, hwy)) +
  geom_point()

Because there are few unique values of both drv and hwy, there is a lot of overplotting. Many points are plotted in the same location, and it’s difficult to see the distribution. There are three useful techniques that help alleviate the problem:

  • Jittering, geom_jitter(), adds a little random noise to the data which can help avoid overplotting.
# Figure 10
ggplot(mpg, aes(drv, hwy)) + 
  geom_jitter()

  • Boxplots, geom_boxplot(), summarise the shape of the distribution with a handful of summary statistics.
# Figure 11
ggplot(mpg, aes(drv, hwy)) + 
  geom_boxplot()

  • Violin plots, geom_violin(), show a compact representation of the density of the distribution, highlighting the areas where more points are found.
# Figure 12
ggplot(mpg, aes(drv, hwy)) + 
  geom_violin()
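These techniques can also be combined in one plot. The sketch below overlays jittered points on boxplots; the jitter width and transparency are arbitrary, and the boxplot outliers are hidden so that they are not drawn twice.

```r
library(ggplot2)

p_box_jitter <- ggplot(mpg, aes(drv, hwy)) +
  geom_boxplot(outlier.shape = NA) +                  # summary; outliers hidden
  geom_jitter(width = 0.2, height = 0, alpha = 0.4)   # all points, jittered horizontally only
print(p_box_jitter)
```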

2.2.4.3 Histograms and Frequency Polygons

Histograms and frequency polygons show the distribution of a single numeric variable. They provide more information about the distribution of a single group than boxplots do, at the expense of needing more space.

# Figure 13
ggplot(mpg, aes(hwy)) + 
  geom_histogram()

# Figure 14
ggplot(mpg, aes(hwy)) + 
  geom_freqpoly()

Both of them bin the data, then count the number of observations in each bin. The only difference is the display: histograms use bars and frequency polygons use lines. However, the default just splits your data into 30 bins, which is unlikely to be the best choice.

  • Change the width of the bins with the binwidth argument:
# Figure 15
ggplot(mpg, aes(hwy)) +
  geom_freqpoly(binwidth = 2.5)

  • Use facetting to compare the distribution across drivetrains:
# Figure 16
ggplot(mpg, aes(displ, fill = drv)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~drv, ncol = 1)

2.2.4.4 Bar Charts

The discrete analogue of the histogram is the bar chart, geom_bar().

  • Plot a bar chart
# Figure 17
ggplot(mpg, aes(manufacturer)) +
  geom_bar(aes(fill = drv))
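By default geom_bar() counts the rows in each category. When the bar heights are already present in the data, the counting step can be switched off with stat = "identity". The data frame below is made up purely for illustration.

```r
library(ggplot2)

# Hypothetical pre-summarised data: one row per bar
drugs <- data.frame(drug = c("a", "b", "c"), effect = c(4.2, 9.7, 6.1))

p_drugs <- ggplot(drugs, aes(drug, effect)) +
  geom_bar(stat = "identity")   # use the values in effect as bar heights
print(p_drugs)
```

Recent versions of ggplot2 also provide geom_col() as a shortcut for the same thing.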

2.2.4.5 Time Series with Line and Path Plots

Line and path plots are typically used for time series data. Line plots join the points from left to right, while path plots join them in the order that they appear in the dataset. Line plots usually have time on the x-axis, showing how a single variable has changed over time. Path plots show how two variables have simultaneously changed over time, with time encoded in the way that observations are connected.

We’ll show some time series plots using the economics dataset, which contains economic data on the US measured over the last 40 years.

  • Show how the unemployment rate changes over these years.
# Figure 18
ggplot(economics, aes(date, unemploy / pop)) +
  geom_line()

Now we would like to examine the relationship between unemployment rate and length of unemployment. Meanwhile, we also need to see the evolution over time. The solution is to join points adjacent in time with line segments, forming a path plot.

  • Show the relationship between unemployment rate and length of unemployment in each year.
# Figure 19
ggplot(economics, aes(unemploy / pop, uempmed)) +
  geom_path(colour = "grey50") +
  geom_point(aes(colour = date))

In the plot, we colour the points to make it easier to see the direction of time. Pay attention to the difference between geom_point(colour = ...), which sets a constant colour, and geom_point(aes(colour = ...)), which maps a variable to colour.
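The difference between the two forms can be checked by inspecting the data that ggplot2 computes for the layer. This is a small sketch using layer_data(); the object names are hypothetical.

```r
library(ggplot2)

# Mapping: the string "blue" is treated as data and sent through the colour scale
p_mapped <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = "blue"))

# Setting: every point is drawn in the colour blue
p_set <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(colour = "blue")

mapped_colours <- unique(layer_data(p_mapped, 1)$colour)  # a colour chosen by the scale
set_colours    <- unique(layer_data(p_set, 1)$colour)     # "blue"
```

In the mapped version the points are not blue, and a legend with the single entry "blue" appears.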

2.2.5 Modifying the Axes

Two families of useful helpers let you make the most common modifications. xlab() and ylab() modify the x- and y-axis labels, while xlim() and ylim() modify the limits of axes.

  • Change the labels and limits of axes:
# Figure 20
ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(width = 0.25, size = 2) +
  xlim("f", "r") +
  ylim(20, 30) +
  xlab("drivetrain") +
  ylab("highway driving (mpg)")

xlab(NULL) and ylab(NULL) can remove the axis labels.
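One point worth knowing is that xlim() and ylim() turn values outside the limits into NA before any statistics are computed, whereas coord_cartesian() only zooms the view. The sketch below compares the boxplot medians under both approaches; the object names are hypothetical.

```r
library(ggplot2)

base <- ggplot(mpg, aes(drv, hwy)) + geom_boxplot()

p_ylim <- base + ylim(20, 30)                        # drops values outside 20-30 (with a warning)
p_zoom <- base + coord_cartesian(ylim = c(20, 30))   # keeps all values, zooms the view

med_ylim <- layer_data(p_ylim, 1)$middle   # medians of the remaining values only
med_zoom <- layer_data(p_zoom, 1)$middle   # medians of all values
```

The two sets of medians differ, so the choice matters whenever a layer summarises the data.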

2.2.6 Output

Most of the time we create a plot object and immediately plot it, but we can also save a plot to a variable and manipulate it:

p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_point()

Once we have a plot object, there are a few things we can do with it:

  • Render it on screen with print():
# Figure 21
print(p)

  • Save it to disk with ggsave():
ggsave("plot.png", plot = p, width = 5, height = 5)
  • Briefly describe its structure with summary():
summary(p)
## data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy,
##   fl, class [234x11]
## mapping:  x = ~displ, y = ~hwy, colour = ~factor(cyl)
## faceting: <ggproto object: Class FacetNull, Facet, gg>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map_data: function
##     params: list
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetNull, Facet, gg>
## -----------------------------------
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
  • Save a cached copy of it to disk, with saveRDS(). This saves a complete copy of the plot object, so you can easily re-create it with readRDS():
saveRDS(p, "plot.rds")
q <- readRDS("plot.rds")
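When experimenting, it can be tidier to write output to a temporary directory. The following sketch saves a plot as a PDF there and checks that the file was created; the file name is arbitrary, and the plot is passed explicitly rather than relying on the last plot displayed.

```r
library(ggplot2)

p_out <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_point()

out_file <- file.path(tempdir(), "plot.pdf")   # hypothetical output location
ggsave(out_file, plot = p_out, width = 5, height = 5)
saved_ok <- file.exists(out_file)
```

ggsave() picks the graphics device from the file extension, so changing the name to end in .png would produce a PNG file instead.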

2.2.7 Quick Plots

In some cases, you will want to create a quick plot with a minimum of typing. In these cases you may prefer to use qplot() over ggplot(). qplot() lets you define a plot in a single call, picking a geom by default if you don’t supply one.

# Figure 22
qplot(displ, data = mpg)

# Figure 23
qplot(displ, hwy, data = mpg)

qplot() tries to pick a sensible geometry and statistic based on the arguments provided. For example, if you give qplot() x and y variables, it’ll create a scatterplot. If you just give it an x, it’ll create a histogram or bar chart depending on the type of variable.

qplot() assumes that all variables should be scaled by default. If you want to set an aesthetic to a constant, you need to use I():

# Figure 24
qplot(displ, hwy, data = mpg, colour = "blue")

or

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = "blue"))
# Figure 25
qplot(displ, hwy, data = mpg, colour = I("blue"))

or

ggplot(mpg, aes(displ, hwy)) +
  geom_point(colour = "blue")

2.3 Advanced Learning

This section begins by describing the details in the process of drawing a more complicated plot. I hope they can help you to produce graphics using the same structured thinking that you use to design an analysis, reducing the distance between a plot in your head and one on the page.

2.3.1 Labels

geom_text() is the main tool to add labels at the specified x and y positions. It has the most aesthetics of any geom, because there are so many ways to control the appearance of a text.

  • family gives the name of a font: “sans” (the default), “serif”, or “mono”.
# Figure 26
df <- data.frame(x = 1, y = 3:1, z = c("sans", "serif", "mono"))
ggplot(df, aes(x, y)) +
  geom_text(aes(label = z, family = z))

  • fontface specifies the face: “plain” (the default), “bold” or “italic”.
# Figure 27
df <- data.frame(x = 1, y = 3:1, z = c("plain", "bold", "italic"))
ggplot(df, aes(x, y)) +
  geom_text(aes(label = z, fontface = z))

  • hjust (“left”, “center”, “right”, “inward”, “outward”) and vjust (“bottom”, “middle”, “top”, “inward”, “outward”) aesthetics can adjust the alignment of the text. The default alignment is centered.
# Figure 28
df <- data.frame(x = c(1, 1, 2, 2, 1.5), y = c(1, 2, 1, 2, 1.5), 
                 z = c("bottom-left", "top-left", "bottom-right", "top-right", "center"))
ggplot(df, aes(x, y)) +
  geom_text(aes(label = z), vjust = "inward", hjust = "inward")

  • size controls the font size.

  • angle specifies the rotation of the text in degrees.

  • nudge_x and nudge_y parameters allow you to nudge the text a little horizontally or vertically.

# Figure 29
df <- data.frame(trt = c("a", "b", "c"), resp = c(1.2, 3.4, 2.5))
ggplot(df, aes(resp, trt)) +
  geom_point() +
  geom_text(aes(label = paste0("(", resp, ")")), nudge_y = -0.25) +
  xlim(1, 3.6)

  • With check_overlap = TRUE, overlapping labels are automatically removed.
# Figure 30
ggplot(mpg, aes(displ, hwy)) +
  geom_text(aes(label = model)) +
  xlim(1, 8)

# Figure 31
ggplot(mpg, aes(displ, hwy)) +
  geom_text(aes(label = model), check_overlap = TRUE) +
  xlim(1, 8)

  • geom_label() is a variation on geom_text(): it draws a rounded rectangle behind the text.
# Figure 32
z <- data.frame(waiting = c(55, 80), eruptions = c(2, 4.3), peak = c("peak one", "peak two"))
ggplot(faithfuld, aes(waiting, eruptions)) +
  geom_tile(aes(fill = density)) +
  geom_label(data = z, aes(label = peak))

  • A simple example that combines these options:
# Figure 33
ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_label(aes(label = class), show.legend = FALSE) + 
  geom_text(aes(x = 6.5, y = 42, label = "A SIMPLE EXAMPLE"), 
            vjust = "inward", hjust = "inward", 
            fontface = "italic", 
            family = "mono",
            size = 5, 
            colour = "darkred")

2.3.2 Annotations

Annotations add metadata to our plot. But metadata is just data, so we can use:

  • geom_rect() to highlight interesting rectangular regions of the plot. geom_rect() has aesthetics xmin, xmax, ymin and ymax.
head(presidential)
name start end party
Eisenhower 1953-01-20 1961-01-20 Republican
Kennedy 1961-01-20 1963-11-22 Democratic
Johnson 1963-11-22 1969-01-20 Democratic
Nixon 1969-01-20 1974-08-09 Republican
Ford 1974-08-09 1977-01-20 Republican
Carter 1977-01-20 1981-01-20 Democratic
presidential <- subset(presidential, start > economics$date[1])
# Figure 34
p <- ggplot(economics) +
  geom_rect(aes(xmin = start, xmax = end, fill = party),
            ymin = -Inf, ymax = Inf, alpha = 0.2,
            data = presidential)
print(p)

The shaded rectangles show which party held the presidency at each point in time. There is one special thing to note: the use of -Inf and Inf as positions. These refer to the bottom and top (or left and right) limits of the plot.

  • geom_line(), geom_path() and geom_segment() to add lines.
# Figure 35
p <- p + geom_line(aes(date, unemploy))
print(p)

  • geom_vline(), geom_hline() and geom_abline() allow you to add reference lines (sometimes called rules), that span the full range of the plot.
# Figure 36
p <- p + geom_vline(aes(xintercept = as.numeric(start)),
                    data = presidential,
                    colour = "grey50",
                    alpha = 0.5)
print(p)

  • geom_text() to add text descriptions or to label points. Most plots will not benefit from adding text to every single observation on the plot, but labelling outliers and other important points is very useful.
# Figure 37
p <- p + geom_text(aes(x = start, y = 2500, label = name),
                   data = presidential,
                   size = 4, 
                   vjust = 0, hjust = 0, 
                   nudge_x = 50)
print(p)

  • We can use annotate() to add a single annotation to a plot:
# Figure 38
yrng <- range(economics$unemploy)
xrng <- range(economics$date)
caption <- paste(strwrap("Unemployment rates in the US have varied a lot over the years", 40), 
                 collapse = "\n")
ggplot(economics, aes(date, unemploy)) +
  geom_line() +
  annotate("text", x = xrng[1], y = yrng[2], 
           label = caption,
           hjust = 0, vjust = 1, 
           size = 4)

  • geom_abline() is useful when comparing groups across facets. In the following plot, it’s much easier to see the subtle differences if we add a reference line.
head(diamonds)
carat cut color clarity depth table price x y z
0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
# Figure 39
mod_coef <- coef(lm(log10(price) ~ log10(carat), data = diamonds))
ggplot(diamonds, aes(log10(carat), log10(price))) +
  geom_bin2d() +
  geom_abline(intercept = mod_coef[1], slope = mod_coef[2],
              colour = "white", size = 1) +
  facet_wrap(~cut, nrow = 1)

geom_bin2d() divides the plane into rectangles, counts the number of cases in each rectangle, and then (by default) maps the number of cases to the rectangle’s fill.

2.3.3 Scales

Scales control the mapping from data to aesthetics. They take your data and turn it into something that you can see, like size, colour, position or shape. Scales also provide the tools that let you read the plot: the axes and legends. Formally, each scale is a function from a region in data space (the domain of the scale) to a region in aesthetic space (the range of the scale). The axis or legend is the inverse function: it allows you to convert visual properties back to data.

A scale is required for every aesthetic used on the plot. When you write:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class))

What actually happens is this:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()

It would be tedious to manually add a scale every time you used a new aesthetic, so ggplot2 does it for you. But if you want to override the defaults, you’ll need to add the scale yourself, like this:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  scale_x_continuous("A really awesome x axis label") +
  scale_y_continuous("An amazingly great y axis label")

When we + a scale, we’re not actually adding it to the plot, but overriding the existing scale. This means that the following two specifications are equivalent:

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous("Label 1") +
  scale_x_continuous("Label 2")
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous("Label 2")

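You’ve probably already figured out the naming scheme for scales. A scale function name, such as scale_x_continuous(), is made up of three pieces separated by “_”:

  1. scale
  2. The name of the aesthetic (e.g. colour, shape, x or y)
  3. The name of the scale (e.g. continuous, discrete or brewer)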
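Following this naming pattern, a scale can be chosen for each aesthetic in a plot. The sketch below is only an illustration: the palette, the shape codes and the axis title are arbitrary choices.

```r
library(ggplot2)

p_named_scales <- ggplot(mpg, aes(displ, hwy, colour = drv, shape = drv)) +
  geom_point() +
  scale_x_continuous("Engine displacement (litres)") +   # scale + x + continuous
  scale_colour_brewer(palette = "Dark2") +               # scale + colour + brewer
  scale_shape_manual(values = c(16, 17, 15))             # scale + shape + manual
print(p_named_scales)
```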

2.3.4 Axes

The component of a scale that you’re most likely to want to modify is the guide, the axis or legend associated with the scale. There are many natural correspondences between the two:

Axis | Legend | Argument name
Label | Title | name
Ticks & grid line | Key | breaks
Tick label | Key label | labels

2.3.4.1 Name

The first argument to the scale function, name, is the axis or legend title. You can supply text strings (using \n for line breaks) or mathematical expressions in quote():

# Figure 40
df <- data.frame(x = 1:2, y = 1, z = "a")
ggplot(df, aes(x, y)) + 
  geom_point() +
  scale_x_continuous(quote(a + mathematical ^ expression))

Because tweaking these labels is such a common task, there are three helpers that save you some typing: xlab(), ylab() and labs():

# Figure 41
ggplot(df, aes(x, y)) + geom_point(aes(colour = z)) +
  labs(x = "X axis", y = "Y axis", colour = "Colour\nlegend")

There are two ways to remove the axis label.

  1. labs(x = "", y = "") : omits the label, but still allocates space;
  2. labs(x = NULL, y = NULL): removes the label and its space.

2.3.4.2 Breaks and Labels

The breaks argument controls which values appear as tick marks on axes and keys on legends. Each break has an associated label, controlled by the labels argument. If you set labels, you must also set breaks; otherwise, if data changes, the breaks will no longer align with the labels.

# Figure 42
df <- data.frame(x = c(1, 3, 5) * 1000, y = 0.5)
ggplot(df, aes(x, y)) +
  geom_point() +
  labs(x = NULL, y = NULL) +
  scale_x_continuous(breaks = c(2000, 4000), labels = c("2k", "4k")) +
  scale_y_continuous(breaks = c(0.25, 0.75), labels = c("25%", "75%"))

The scales package provides a number of useful labelling functions:

  1. scales::comma_format() adds commas to make it easier to read large numbers.

  2. scales::unit_format(unit, scale) adds a unit suffix, optionally scaling.

  3. scales::dollar_format(prefix, suffix) displays currency values, rounding to two decimal places and adding a prefix or suffix.

  4. scales::wrap_format() wraps long labels into multiple lines.

# Figure 43
df <- data.frame(x = c(1, 3, 5) * 1000, y = 0.5)
ggplot(df, aes(x, y)) +  geom_point() +  labs(x = NULL, y = NULL) + 
  scale_y_continuous(labels = scales::dollar_format(prefix = "$"))
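Each of these helpers returns a function, which is why it can be passed to the labels argument. A quick way to see what a labeller does is to call it directly on a few numbers, as in this sketch (the object names are hypothetical):

```r
library(ggplot2)
library(scales)

comma_labeller <- comma_format()              # returns a labelling function
comma_labels <- comma_labeller(c(1000, 1000000))
comma_labels                                  # "1,000" "1,000,000"

# The same labeller used on an axis
df_labels <- data.frame(x = c(1, 3, 5) * 1000, y = 0.5)
p_comma <- ggplot(df_labels, aes(x, y)) +
  geom_point() +
  scale_x_continuous(labels = comma_labeller)
print(p_comma)
```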

You can adjust the minor breaks (the faint grid lines that appear between the major grid lines) by supplying a numeric vector of positions to the minor_breaks argument. This is particularly useful for log scales:

# Figure 44
df <- data.frame(x = c(2, 3, 5, 10, 200, 3000), y = 1)
mb <- as.numeric(1:10 %o% 10 ^ (0:4))
ggplot(df, aes(x, y)) +
  geom_point() +
  scale_x_log10(minor_breaks = mb)
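The expression used to build mb may look cryptic. It takes the outer product of 1:10 and the powers of ten from 10^0 to 10^4, then flattens the resulting matrix into a vector. The short sketch below inspects the result.

```r
# Outer product: a 10 x 5 matrix whose columns are 1:10 times 1, 10, 100, 1000, 10000
mb <- as.numeric(1:10 %o% 10 ^ (0:4))

n_breaks   <- length(mb)   # 50 positions
mb_range   <- range(mb)    # from 1 to 1e5
first_next <- mb[11:12]    # 10 and 20: the start of the second decade
```

The value at the end of each decade is repeated at the start of the next one (10, 100, and so on), which is harmless for drawing grid lines.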

2.3.5 Legends

A legend may need to draw symbols from multiple layers. For example, if you’ve mapped colour to both points and lines, the keys will show both points and lines.

# Figure 45
df <- data.frame(x = 1:3, y = 1:3, z = c("a", "b", "c"))
ggplot(df, aes(x, y)) +
  geom_point(size = 8, colour = "grey20", alpha = 0.5, show.legend = TRUE) +
  geom_point(aes(colour = z, shape = z), size = 4) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  guides(fill = guide_legend(reverse=TRUE))

Sometimes you want the geoms in the legend to display differently from the geoms in the plot. This is particularly useful when you’ve used transparency or size to deal with moderate overplotting and also used colour in the plot. You can do this using the override.aes parameter of guide_legend().

A number of settings that affect the overall display of the legends are controlled through the theme system. You can modify theme settings with the theme() function.

The position and justification of legends are controlled by the theme setting legend.position, which takes values “right”, “left”, “top”, “bottom”, or “none” (no legend).

# Figure 46
df <- data.frame(x = 1:3, y = 1:3, z = c("a", "b", "c"))
ggplot(df, aes(x, y)) +
  geom_point(aes(colour = z), size = 3) +
  xlab(NULL) +
  ylab(NULL) +
  theme(legend.position = "bottom")

Alternatively, if there’s a lot of blank space in your plot you might want to place the legend inside the plot. You can do this by setting legend.position to a numeric vector of length two. The numbers represent a relative location in the panel area: c(0, 1) is the top-left corner and c(1, 0) is the bottom-right corner. You control which corner of the legend the legend.position refers to with legend.justification, which is specified in a similar way.

# Figure 47
ggplot(df, aes(x, y)) +
  geom_point(aes(colour = z), size = 3) +
  theme(legend.position = c(0.8, 0.2), legend.justification = c(0.2, 0.8), 
        legend.direction = "horizontal")

  • A simple example
# Figure 48
df <- data.frame(x = rnorm(1000), y = rnorm(1000))
df$z <- cut(df$x, 4, labels = c("a", "b", "c", "d"))
ggplot(df, aes(x, y)) + 
  geom_point(aes(colour = z, shape = z), alpha = 0.7) +
  guides(colour = guide_legend(ncol = 2, byrow = TRUE, override.aes = list(alpha = 1))) +
  guides(shape = guide_legend(ncol = 2, byrow = TRUE, override.aes = list(alpha = 1))) +
  scale_colour_discrete("colour") +
  scale_shape_discrete("shape")

2.3.6 Date

Date and date/time data are continuous variables with special labels. ggplot2 works with Date (for dates) and POSIXct (for date/times) classes: if your dates are in a different format you will need to convert them with as.Date() or as.POSIXct(). scale_x_date() and scale_x_datetime() work similarly to scale_x_continuous() but have special date_breaks and date_labels arguments that work in date-friendly units:

  • date_breaks and date_minor_breaks allow you to position breaks by date units (years, months, weeks, days, hours, minutes, and seconds). For example, date_breaks = "2 weeks" will place a major tick mark every two weeks.

  • date_labels controls the display of the labels using the same formatting strings as in strptime() and format():

String Meaning
%S second (00-59)
%M minute (00-59)
%l hour, in 12-hour clock (1-12)
%I hour, in 12-hour clock (01-12)
%p am/pm
%H hour, in 24-hour clock (00-23)
%a day of week, abbreviated (Mon-Sun)
%A day of week, full (Monday-Sunday)
%e day of month (1-31)
%d day of month (01-31)
%m month, numeric (01-12)
%b month, abbreviated (Jan-Dec)
%B month, full (January-December)
%y year, without century (00-99)
%Y year, with century (0000-9999)

For example, if you wanted to display dates like 14/10/1979, you would use the string “%d/%m/%Y”.
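
The formatting strings can be tried out with format() in base R before they are used in a scale. This is a minimal sketch using the date from the example above; only numeric codes are shown, because the month and weekday names produced by codes such as %b depend on the locale.

```r
d <- as.Date("1979-10-14")

format(d, "%d/%m/%Y")   # day/month/four-digit year: "14/10/1979"
format(d, "%Y-%m")      # year and month only: "1979-10"
format(d, "%y")         # two-digit year: "79"
```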

head(economics)
date pce pop psavert uempmed unemploy
1967-07-01 507.4 198712 12.5 4.5 2944
1967-08-01 510.5 198911 12.5 4.7 2945
1967-09-01 516.3 199113 11.7 4.6 2958
1967-10-01 512.9 199311 12.5 4.9 3143
1967-11-01 518.1 199498 12.5 4.7 3066
1967-12-01 525.8 199657 12.1 4.8 3018
base <- ggplot(economics, aes(date, psavert)) +
  geom_line(na.rm = TRUE) +
  labs(x = NULL, y = NULL)
# Figure 49
print(base)

# Figure 50
base + scale_x_date(date_labels = "%Y", date_breaks = "5 years")

# Figure 51
base + scale_x_date(limits = as.Date(c("2004-01-01", "2005-01-01")),
                    date_labels = "%b %y",
                    date_minor_breaks = "1 month")

# Figure 52
base + scale_x_date(limits = as.Date(c("2004-01-01", "2004-06-01")),
                    date_labels = "%m/%d",
                    date_minor_breaks = "2 weeks")

2.3.7 Themes

The ggplot2 theme system does not affect how the data is rendered by geoms, or how it is transformed by scales. Themes don’t change the perceptual properties of the plot, but they do help you make the plot aesthetically pleasing or match an existing style guide. Themes give you control over things like fonts, ticks, panel strips, and backgrounds.

The theming system is composed of four main components:

  • Theme elements specify the non-data elements that you can control. For example, the plot.title element controls the appearance of the plot title.

  • Each element is associated with an element function, which describes the visual properties of the element. For example, element_text() sets the font size, colour and face of text elements like plot.title.

  • The theme() function allows you to override the default theme elements by calling element functions, like theme(plot.title = element_text(colour = "red")).

  • Complete themes, like theme_grey(), set all of the theme elements to values designed to work together harmoniously.

# Figure 53
base <- ggplot(mpg, aes(cty, hwy, colour = factor(cyl))) +
  geom_jitter() +
  geom_abline(colour = "black", size = 1, alpha = 0.8) +
  labs(x = "City mileage/gallon",
       y = "Highway mileage/gallon",
       colour = "Cylinders",
       title = "Highway and city mileage are highly correlated") +
  scale_colour_brewer(type = "seq", palette = "Spectral")
print(base)

Next, you need to make sure the plot matches the style guidelines of your journal:

  1. The background should be white, not pale grey.
  2. The legend should be placed inside the plot if there’s room.
  3. Major gridlines should be a pale grey and minor gridlines should be removed.
  4. The plot title should be 12pt bold text and centered.

# Figure 54
style <- theme(plot.title = element_text(face = "bold", size = 12, hjust = 0.5),
        legend.background = element_rect(fill = "white", size = 2, colour = "white"),
        legend.justification = c(0.6, 0.1),
        legend.position = c(0.1, 0.6),
        axis.ticks = element_line(colour = "grey70", size = 0.2),
        panel.grid.major = element_line(colour = "grey70", size = 0.2),
        panel.grid.minor = element_blank())
base + theme_bw() + style

ggplot2 includes the following complete themes, of which theme_grey() is the default:

  • theme_grey(): a light grey background and white gridlines.

  • theme_bw() : a white background and thin grey grid lines.

  • theme_linedraw(): a theme with only black lines of various widths on white backgrounds, reminiscent of a line drawing.

  • theme_light(): similar to theme_linedraw() but with light grey lines and axes, to direct more attention towards the data.

  • theme_dark(): the dark cousin of theme_light(), with similar line sizes but a dark background. Useful to make thin coloured lines pop out.

  • theme_minimal(): A minimalistic theme with no background annotations.

  • theme_classic(): A classic-looking theme, with x and y axis lines and no gridlines.

  • theme_void(): A completely empty theme.

As well as applying themes a plot at a time, you can change the default theme with theme_set(). For example, if you really hate the default grey background, run theme_set(theme_bw()) to use a white background for all plots.
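
One detail worth knowing is that theme_set() returns the theme that was active before the call, which makes it easy to switch the default temporarily and put it back afterwards. The sketch below assumes ggplot2 is loaded; the object names `old` and `p` are only illustrative.

```r
library(ggplot2)

# theme_set() returns the previously active theme, so it can be restored later
old <- theme_set(theme_bw())

# The default theme is looked up when a plot is drawn, so printing p
# while theme_bw() is the default would give it a white background
p <- ggplot(mpg, aes(cty, hwy)) + geom_point()

# Restore the previous default when done
theme_set(old)
```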

To modify an individual theme component you use code like plot + theme(element.name = element_function()). There are four basic types of built-in element functions: text, lines, rectangles, and blank. Each element function has a set of parameters that control the appearance:

  • element_text() draws labels and headings. You can control the font family, face, colour, size (in points), hjust, vjust, angle (in degrees) and lineheight (as ratio of fontcase).

  • element_line() draws lines parameterised by colour, size and linetype.

  • element_rect() draws rectangles, mostly used for backgrounds, parameterised by fill colour and border colour, size and linetype.

  • element_blank() draws nothing. Use this if you don’t want anything drawn, and no space allocated for that element.

# Figure 55
df <- data.frame(x = rnorm(300), y = rnorm(300))
df$z <- cut(df$x, 6, labels = c("a", "b", "c", "d", "e", "f"))
ggplot(df, aes(x, y)) + 
  geom_point(aes(colour = z, shape = z), size = 2) +
  labs(title = "This is a ggplot") + 
  xlab(NULL) + 
  ylab(NULL) +
  theme(plot.title = element_text(face = "bold", colour = "#91003F", size = 16, hjust = 0.5),
        panel.grid.major = element_line(colour = "black", linetype = "dotted", size = 1),
        plot.background = element_rect(fill = "grey80", colour = NA),
        panel.background = element_rect(fill = "#EFEDF5"),
        axis.line = element_line(colour = "black"))

2.3.7.1 Legend Elements

Element Setter Description
legend.background element_rect() legend background
legend.key element_rect() background of legend keys
legend.key.size unit() legend key size
legend.key.height unit() legend key height
legend.key.width unit() legend key width
legend.margin unit() legend margin
legend.text element_text() legend labels
legend.text.align 0, 1 legend label alignment (0 = left, 1 = right)
legend.title element_text() legend name
legend.title.align 0, 1 legend name alignment (0 = left, 1 = right)

The legend elements control the appearance of all legends. We can also modify the appearance of individual legends by modifying these elements.

# Figure 56
df <- data.frame(x = rnorm(300), y = rnorm(300))
df$z <- cut(df$x, 6, labels = c("a", "b", "c", "d", "e", "f"))
ggplot(df, aes(x, y)) + 
  geom_point(aes(colour = z, shape = z), size = 2) +
  theme_minimal() +
  scale_colour_brewer(type = "seq", palette = "Dark2") +
  theme(legend.key = element_rect(color = "grey50"),
        legend.key.width = unit(0.9, "cm"),
        legend.key.height = unit(0.75, "cm"),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15, face = "bold"),
        legend.title.align = 0.3)

2.3.7.2 Panel Elements

Element Setter Description
panel.background element_rect() panel background (under data)
panel.border element_rect() panel border (over data)
panel.grid.major element_line() major grid lines
panel.grid.major.x element_line() vertical major grid lines
panel.grid.major.y element_line() horizontal major grid lines
panel.grid.minor element_line() minor grid lines
panel.grid.minor.x element_line() vertical minor grid lines
panel.grid.minor.y element_line() horizontal minor grid lines
aspect.ratio numeric plot aspect ratio

Panel elements control the appearance of the plotting panels. You can change how the panels look by modifying these elements.

# Figure 57
df <- data.frame(x = rnorm(300), y = rnorm(300))
df$z <- cut(df$x, 6, labels = c("a", "b", "c", "d", "e", "f"))
ggplot(df, aes(x, y)) + 
  geom_point(aes(colour = z, shape = z), size = 2) +
  theme_linedraw() +
  scale_colour_brewer(type = "seq", palette = "Dark2") +
  theme(panel.background = element_rect(fill = "#C7EAE5"),
        panel.grid.major.x = element_line(color = "gray60", size = 0.8),
        plot.background = element_rect(colour = "black", size = 2),
        aspect.ratio = 12 / 16)

2.3.7.3 Facetting Elements

Element Setter Description
strip.background element_rect() background of panel strips
strip.text element_text() strip text
strip.text.x element_text() horizontal strip text
strip.text.y element_text() vertical strip text
panel.spacing unit() spacing between facets
panel.spacing.x unit() horizontal spacing between facets
panel.spacing.y unit() vertical spacing between facets

The above theme elements are associated with faceted ggplots.

# Figure 58
df <- data.frame(x = rnorm(300), y = rnorm(300))
df$z <- cut(df$x, 6, labels = c("a", "b", "c", "d", "e", "f"))
ggplot(df, aes(x, y)) + 
  geom_point(aes(colour = z, shape = z), alpha = 0.7) +
  theme_linedraw() +
  scale_colour_brewer(type = "seq", palette = "Dark2") +
  facet_wrap(~z) +
  theme(panel.spacing = unit(0.5, "in"),
        strip.background = element_rect(fill = "grey20", color = "grey80", size = 1),
        strip.text = element_text(colour = "white"))

2.3.8 Colour

In R, a colour is represented as a string. Basically, a colour is defined, like in HTML/CSS, using the hexadecimal values (00 to FF) for red, green, and blue, concatenated into a string, prefixed with a “#”. Pure red, for example, is represented as “#FF0000”.

Besides “#RRGGBB” colour strings, you can also use one of R’s predefined colour names (such as “red”) or a palette from a package such as RColorBrewer.
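
The hexadecimal notation can be explored with two base R functions, rgb() and col2rgb(). This is a small sketch; the name `pure_red` and the choice of "steelblue" are only for illustration.

```r
# Build a colour string from red, green and blue components (0-255)
pure_red <- rgb(255, 0, 0, maxColorValue = 255)
pure_red

# Convert a colour name or hex string back into its components
col2rgb("steelblue")
col2rgb(pure_red)
```

col2rgb() returns a matrix with one row each for red, green and blue, so it is a quick way to check what a named colour or hex string stands for.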

2.3.8.1 Key Color Functions

First, let’s see some key ggplot2 R functions for changing a plot color.

  1. Set ggplot color manually:
  • scale_fill_manual() for box plot, bar plot, violin plot, dot plot, etc.

  • scale_color_manual() or scale_colour_manual() for lines and points

  2. Use colorbrewer palettes:
  • scale_fill_brewer() for box plot, bar plot, violin plot, dot plot, etc.
  • scale_color_brewer() or scale_colour_brewer() for lines and points

  3. Use grey color scales:
  • scale_fill_grey() for box plot, bar plot, violin plot, dot plot, etc.

  • scale_color_grey() or scale_colour_grey() for points, lines, etc.

  4. Change the default ggplot gradient color:
  • scale_color_gradient(), scale_fill_gradient() for sequential gradients between two colors

  • scale_color_gradient2(), scale_fill_gradient2() for diverging gradients

  • scale_color_gradientn(), scale_fill_gradientn() for gradient between n colors

This is the default color:

# Figure 59
ggplot(diamonds, aes(carat, price, colour = cut)) +
  geom_point(size = 0.5)

Set custom color palettes:

# Figure 60
ggplot(diamonds, aes(carat, price, colour = cut)) +
  geom_point(size = 0.5) +
  scale_colour_manual(values = c("#6794a7", "#014d64", "#7ad2f6", "#01a2d9", "#76c0c1"))

Using palette in RColorBrewer:

# Figure 61
library(RColorBrewer)
ggplot(diamonds, aes(carat, price, colour = cut)) +
  geom_point(size = 0.5) +
  scale_colour_brewer(palette = "Greens")

Using grey color scales:

# Figure 62
ggplot(diamonds, aes(carat, price, colour = cut)) +
  geom_point(size = 0.5) +
  scale_colour_grey()

When colour is mapped to a continuous variable:

Sequential gradients between two colors

# Figure 63
ggplot(diamonds, aes(carat, price, colour = depth)) +
  geom_point(size = 0.5)+
  scale_colour_gradient(low = "orange", high = "Darkred")

Diverging gradients

# Figure 64
ggplot(diamonds, aes(carat, price, colour = depth)) +
  geom_point(size = 0.5) +
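  # midpoint defaults to 0, far outside the range of depth, so centre it on the data
  scale_colour_gradient2(low="#8E0F2E", mid="#BFBEBE", high="#0E4E75",
                         midpoint = median(diamonds$depth))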

2.3.8.2 Predefined Color Palettes in RColorBrewer and ggplot2

Second, we’ll see some predefined color palettes in R package. The most commonly used color scales are Colorbrewer palettes [RColorBrewer package] and Grey color palettes [ggplot2 package].

  • Show all colorbrewer palettes in RColorBrewer package:

display.brewer.all() shows every palette, and display.brewer.all(colorblindFriendly = TRUE) will display only colorblind friendly palettes.

  • Visualize a single RColorBrewer palette by specifying its name:
# Figure 65
display.brewer.pal(11, "Spectral")

  • Return the hexadecimal color code of the palette:
brewer.pal(11, "Spectral")
##  [1] "#9E0142" "#D53E4F" "#F46D43" "#FDAE61" "#FEE08B" "#FFFFBF" "#E6F598"
##  [8] "#ABDDA4" "#66C2A5" "#3288BD" "#5E4FA2"
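
Besides plotting the palettes, RColorBrewer exposes their properties as a data frame, brewer.pal.info, which can be queried like any other table. The following is a brief sketch assuming the package is installed; the object names `info`, `cb_friendly` and `five` are only illustrative.

```r
library(RColorBrewer)

# brewer.pal.info is a data frame with one row per palette
info <- brewer.pal.info
info["Spectral", ]           # maxcolors, category and colorblind flag

# Names of the palettes flagged as colorblind friendly
cb_friendly <- rownames(info)[info$colorblind]
head(cb_friendly)

# Request fewer colours than the maximum when fewer groups are plotted
five <- brewer.pal(5, "Spectral")
five
```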

2.3.8.3 Predefined Color in R

Then, there are five predefined color palettes and 657 built-in color names available in R.

show_col() in scales package will give you a quick and dirty way to show colours in a plot.

# Figure 66
library(scales)
show_col(rainbow(16), labels = TRUE)

# Figure 67
ggplot(diamonds, aes(carat, price, colour = cut)) +
  geom_point(size = 0.5) +
  scale_colour_manual(values = rainbow(25))

Because cut has 5 levels, in values = rainbow(n), n must be greater than or equal to 5. values takes the first five values. Change n and you will get different beautiful plots.
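
Rather than hard-coding n, the number of colours can be taken from the data. This is a minimal sketch assuming ggplot2 is loaded (for the diamonds data); `n_levels`, `pal` and `p` are names chosen for this example.

```r
library(ggplot2)

n_levels <- nlevels(diamonds$cut)   # number of groups that need a colour
pal <- rainbow(n_levels)            # exactly one colour per level

# Build the plot; call print(p) to draw it
p <- ggplot(diamonds, aes(carat, price, colour = cut)) +
  geom_point(size = 0.5) +
  scale_colour_manual(values = pal)
```

With exactly one colour per level, the palette spans the whole rainbow instead of only its first few hues.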

The function colors() returns the color names, which R knows about.

# Figure 68
r_color <- colors()
head(r_color)
## [1] "white"         "aliceblue"     "antiquewhite"  "antiquewhite1"
## [5] "antiquewhite2" "antiquewhite3"
# Figure 69
show_col(r_color, labels = FALSE, border = "white")

2.3.8.4 Other Color Schemes

Finally, ggthemes package provides a large number of high-quality themes and color schemes. The most common schemes are economist, wsj, stata, excel, tableau and solarized.

library(ggthemes)
m<-excel_pal()(6)
show_col(m)

# Figure 70
ggplot(diamonds, aes(carat, price, colour = cut)) +
  geom_point(size = 0.5) +
  scale_colour_excel() 

The R package ggsci also contains a collection of high-quality color palettes inspired by colors used in scientific journals, data visualization libraries, and more. The color palettes are provided as ggplot2 scale functions:

  • scale_color_npg() and scale_fill_npg(): Nature Publishing Group color palettes

  • scale_color_aaas() and scale_fill_aaas(): American Association for the Advancement of Science color palettes

  • scale_color_lancet() and scale_fill_lancet(): Lancet journal color palettes

  • scale_color_jco() and scale_fill_jco(): Journal of Clinical Oncology color palettes

  • scale_color_tron() and scale_fill_tron(): This palette is inspired by the colors used in Tron Legacy. It is suitable for displaying data when using a dark theme.

You can find more examples in the ggsci package vignettes.

2.4 Plot Examples

2.4.1 Time Series 1

The following plot shows the change in the unemployment rate over fifty years. We mark the start date of each president’s term with a vertical line. Moreover, the highest rate is labeled in this plot.

# Figure 71
ggplot(economics) +
  geom_line(aes(date, unemploy/pop)) +
  labs(title = "The change of unemployed rate in last five decades") +
  theme_linedraw() +
  theme(plot.title = element_text(face = "bold", size = 15, hjust = 0.5),
        panel.grid.major = element_line(colour = "grey90", size = 0.3),
        panel.background = element_rect(fill = "white"),
        plot.background = element_rect(fill = "#FEE0B6", colour = "black"),
        axis.text.x = element_text(angle = -30, vjust = 0.5)) +
  geom_vline(xintercept = presidential$start, colour = "darkred", alpha = 0.5) +
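  # Columns are not visible outside aes(), so look them up in economics explicitly;
  # annotate() draws the label once instead of once per row
  annotate("text",
           x = with(economics, date[which.max(unemploy / pop)]),
           y = with(economics, max(unemploy / pop)),
           label = paste0("(", with(economics, round(max(unemploy / pop), 4)), ")"),
           vjust = 0.8, hjust = -0.1)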

2.4.2 Time Series 2

The file Line_Data.csv contains the stock prices of Apple (AAPL) and Amazon (AMZN) over the past 10 years. This plot briefly shows how to deal with date data and set appropriate axes.

library(reshape2)
LineData <- read.csv("Line_Data.csv", stringsAsFactors = FALSE) 
head(LineData)
date AMZN AAPL
2000/1/1 69 25.94
2000/2/1 67 28.66
2000/3/1 55 33.95
2000/4/1 48 31.01
2000/5/1 36 21.00
2000/6/1 30 26.19
# Figure 72
LineData$date <- as.Date(LineData$date)
LineData <- melt(LineData, id = "date")
ggplot(LineData, aes(x = date, y = value, group = variable)) +
  geom_area(aes(fill = variable), alpha = 0.5, position = "identity") + 
  geom_line(aes(color = variable), size = 0.75) +
  scale_x_date(date_labels = "%Y", date_breaks = "2 year") +
  xlab("Year") + 
  ylab("Value") +
  labs(title = "The Stock Price Trend of AAPL and AMZN") +
  theme_linedraw() +
  theme( plot.title = element_text(size = 15, face = "bold", hjust = 0.5),
         axis.title = element_text(size = 10, face = "plain", color = "black"),
         axis.text = element_text(size = 10, face = "plain", color = "black"),
         legend.position = c(0.15,0.8),
         legend.background = element_blank()) +
  scale_colour_brewer(type = "seq", palette = "Set1")

2.4.3 Scatter plot

This example describes how to create a scatter plot and add regression lines using the geom_point() and geom_smooth() functions. Note that we remove the confidence intervals and extend the regression lines. We also use geom_rug() to add marginal rugs.

# Figure 73
mtcars$cyl <- as.factor(mtcars$cyl)
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl, shape = cyl)) +
  geom_point(size = 2) + 
  geom_smooth(method = lm, se = FALSE, fullrange = TRUE) +
  scale_shape_manual(values = c(3, 16, 17)) + 
  scale_color_manual(values = c('#999999', '#E69F00', '#56B4E9')) +
  labs(title = "An example of adding regression lines") + 
  theme_light() +
  theme(plot.title = element_text(face = "bold", size = 15, hjust = 0.6),
        axis.title = element_text(face = "bold", size = 12),
        axis.text = element_text(size = 10, face= "bold"),
        legend.text = element_text(face = "bold", size = 8),
        legend.title = element_text(face = "bold"),
        legend.title.align = 0.4) +
  geom_rug()

2.4.4 Box Plot

The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. Here we compare city fuel economy across car manufacturers. We highlight the outliers (red) and overlay the individual observations (black) to make the chart clearer.

# Figure 74
ggplot(mpg, aes(manufacturer, cty, fill = manufacturer)) +
  geom_boxplot(outlier.colour = "red", 
               outlier.shape = 16,
               outlier.size = 2, 
               notch = FALSE) +
  xlab(NULL) +
  labs(title = "City miles per gallon for different brands of cars") +
  guides(fill = guide_legend(ncol = 2, byrow = TRUE, alpha = 1)) +
  geom_jitter(shape = 16, position = position_jitter(0.2), size=1) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", colour = "black", 
                                  hjust = 0, vjust = 1, 
                                  size = 15),
        axis.text.x = element_text(angle = 90, vjust = 0.2),
        axis.text = element_text(size = 10, face= "bold"),
        axis.title.y= element_text(face= "bold", size = 12),
        legend.text = element_text(face = "bold", size = 8),
        legend.title = element_text(face = "bold"),
        legend.title.align = 0.4,
        legend.key.width = unit(0.5, "cm"),
        plot.background = element_rect(fill = "#ECE2F0", colour = "darkred"))
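
The five number summary behind a box plot can be computed directly in base R. The sketch below uses a made-up vector with one extreme value; note that fivenum() reports Tukey's hinges, which can differ slightly from the quartiles that geom_boxplot() computes, although both use a 1.5 times spread rule for outliers.

```r
# Toy data with one extreme value (made up for illustration)
x <- c(1, 2, 3, 4, 100)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
five

# Points lying more than 1.5 times the hinge spread beyond the hinges
outliers <- boxplot.stats(x)$out
outliers
```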

2.4.5 Faceted Plot

In the economics data, we can consider the relationship between the unemployment rate and the median duration of unemployment from the 1960s to the 2010s. As we can see, there may be a linear relationship in each panel. However, the impact of duration on the unemployment rate may differ between decades.

For faceting, we need to derive the decade from each date (stored as year-month-day).

# Figure 75
economics$year <- paste0(substr(economics$date,1,3),"0s")
ggplot(economics, aes(uempmed, unemploy / pop, color = year)) +
  geom_point() +
  facet_wrap(~year) +
  xlab("median duration of unemployment in weeks") + 
  ylab("unemployment rate") +
  labs(title = "Analysis of unemployment data in each decade") +
  theme_bw() +
  scale_colour_brewer(type = "seq", palette = "Set1") +
  theme(plot.title = element_text(face = "bold", colour = "black", 
                                  hjust = 0.5, vjust = 1, size = 15),
        panel.spacing = unit(0.2, "in"),
        strip.background = element_rect(fill = "#D94801", color = "grey", size = 1),
        strip.text = element_text(colour = "white"),
        legend.position = "right",
        panel.background = element_rect(fill = "white"),
        axis.title = element_text(face = "bold", size = 12),
        legend.title.align = 0.7)
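
The decade label used for faceting is built by keeping the first three characters of each date. The sketch below shows that step on a few hypothetical dates, next to an arithmetic alternative that gives the same labels; none of the object names come from the original code.

```r
# A few example dates (hypothetical, not taken from the economics data)
d <- as.Date(c("1967-07-01", "1985-03-01", "2009-12-01"))

# Character approach, as in the plot code: keep the first three digits of the year
decade_chr <- paste0(substr(as.character(d), 1, 3), "0s")

# Arithmetic approach: round the year down to a multiple of ten
yr <- as.integer(format(d, "%Y"))
decade_num <- paste0((yr %/% 10) * 10, "s")

decade_chr
identical(decade_chr, decade_num)
```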

2.4.6 Bar Chart

The following bar chart shows the diamond quality (cut) for each color. The labels in the plot give the percentage of ideal diamonds in each color group. The code requires the dplyr package.

# Figure 76
library(dplyr)
idealnumb <- diamonds %>% group_by(color) %>% filter(cut == "Ideal") %>% dplyr::summarise(n())
groupnumb <- diamonds %>% group_by(color) %>% dplyr::summarise(n())
idealrate <- idealnumb$`n()`/groupnumb$`n()`
ggplot(diamonds, aes(color, fill = cut)) +
  geom_bar(position = position_dodge(), stat = "count") + 
  ylim(0,5500) +
  coord_flip() +
  labs(title = "The diamond quality of different colors") +
  guides(fill = guide_legend(reverse = TRUE)) +
  theme_classic() +
  theme(plot.title = element_text(face = "bold", size = 15, hjust = 0.9),
        legend.position = "right", 
        aspect.ratio = 10 / 15) +
  scale_fill_brewer(palette = "Spectral") +
  annotate("text", x = c("D", "E", "F", "G", "H", "I", "J"), y = idealnumb$`n()`, 
           label = paste0(round(idealrate,3)*100, "%"),
           hjust = -0.2, 
           vjust = -1.7,
           color = "black",
           size = 3)

2.4.7 Pie Chart

There is no dedicated function for pie charts in ggplot2. A pie chart can be drawn as a bar chart in polar coordinates.

# Figure 77
ggplot(diamonds, aes(color)) +
  geom_bar(aes(fill = cut), position=position_dodge()) + 
  coord_polar(theta = "x") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  labs(title = "The diamond quality of different colors") +
  theme(plot.title = element_text(face = "bold", size = 15, hjust = 1))

# Figure 78
ggplot(diamonds, aes(color)) +
  geom_bar(aes(fill = cut), position = position_dodge()) + 
  coord_polar(theta = "y") +
  scale_fill_brewer(palette = "Blues") +
  theme_minimal()+
  labs(title = "The diamond quality of different colors") +
  theme(plot.title = element_text(face = "bold", size = 15, hjust = 1))

2.4.8 Histogram plot

Histogram and density can be shown in the same graphic. The following code uses ddply() from the plyr package to compute the group means.

# prepare data
library(plyr)  # provides ddply()
set.seed(1000)
df <- data.frame(sex = factor(rep(c("F", "M"), each = 200)),
                 weight = round(c(rnorm(200, mean = 55, sd = 5), rnorm(200, mean = 65, sd = 5))))
head(df)
sex weight
F 53
F 49
F 55
F 58
F 51
F 53
mu <- ddply(df, "sex", summarise, grp.mean = mean(weight))
head(mu)
sex grp.mean
F 55.285
M 64.885
# Figure 79
ggplot(df, aes(x = weight, color = sex)) +
  geom_histogram(aes(y =..density..), position = "identity", size = 1, fill = "white") +
  geom_vline(data = mu, aes(xintercept = grp.mean, color = sex), 
             linetype = "dashed", size = 0.8) +
  theme_classic() +
  scale_color_manual(values = c("#999999", "#E69F00", "#56B4E9")) +
  scale_fill_manual(values = c("#999999", "#E69F00", "#56B4E9")) +
  labs(title = "The histogram of female and male's weight") +
  theme(plot.title = element_text(face = "bold", size = 15, hjust = 0.6),
        axis.title = element_text(face = "bold", size = 12),
        axis.text = element_text(size = 10, face= "bold"),
        legend.text = element_text(face = "bold", size = 8),
        legend.title = element_text(face = "bold"),
        legend.title.align = 0.4) +
  geom_density(data = df[1:200,], alpha = 0.2, fill="#FC9272", color = "#EF3B2C") +
  geom_density(data = df[201:400,], alpha = 0.2, fill="#9ECAE1", color = "#4292C6")
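
If you would rather not depend on plyr for the group means, base R's aggregate() can produce an equivalent table. This is only a sketch of that alternative, not the approach used above; it rebuilds the same simulated data and uses the hypothetical name `mu_base` for the result.

```r
set.seed(1000)
df <- data.frame(sex = factor(rep(c("F", "M"), each = 200)),
                 weight = round(c(rnorm(200, mean = 55, sd = 5),
                                  rnorm(200, mean = 65, sd = 5))))

# Mean weight per sex using only base R
mu_base <- aggregate(weight ~ sex, data = df, FUN = mean)
names(mu_base)[2] <- "grp.mean"   # match the column name used in the plot code
mu_base
```

The resulting data frame has the columns sex and grp.mean, so it can be passed to geom_vline() in the same way as mu.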

2.4.9 QQ plot

This example describes how to create a QQ plot (quantile-quantile plot) using R and the ggplot2 package. QQ plots are used to check whether the data follow a normal distribution.

# Figure 80
mtcars$cyl <- as.factor(mtcars$cyl)
ggplot(mtcars, aes(sample = mpg, color = cyl, size = cyl)) +
  stat_qq() +
  labs(title = "Miles per gallon \n according to the weight",
       y = "Miles/(US) gallon") +
  scale_color_manual(values = rainbow(15)) + 
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 15, hjust = 0.6),
        axis.title = element_text(face = "bold", size = 12),
        axis.text = element_text(size = 10, face= "bold"),
        legend.text = element_text(face = "bold", size = 8),
        legend.title = element_text(face = "bold"),
        legend.title.align = 0.4)

2.4.10 2D plot

ggplot2 can not draw true 3d surfaces, but you can use geom_contour() and geom_tile() to visualise 3d surfaces in 2d.

# Figure 81
ggplot(faithfuld, aes(waiting, eruptions, z = density)) +
  geom_contour(aes(colour = stat(level))) +
  theme_bw() +
  labs(title = "The 2d contours of the faithful data") +
  theme(plot.title = element_text(face = "bold", size = 15, hjust = 0.6),
        axis.title = element_text(face = "bold", size = 12),
        axis.text = element_text(size = 10, face= "bold"),
        legend.text = element_text(face = "bold", size = 8),
        legend.title = element_text(face = "bold"),
        legend.title.align = 0.4)

2.4.11 Heatmap

This example describes how to compute and visualize a correlation matrix using R and the ggplot2 package. Take the mtcars data as an example.

# Figure 82
mydata <- mtcars[, c(1,3,4,5,6,7)]
cormat <- round(cor(mydata), 2)
reorder_cormat <- function(cormat){
  dd <- as.dist((1-cormat)/2)
  hc <- hclust(dd)
  cormat <-cormat[hc$order, hc$order]}
get_upper_tri <- function(cormat){
    cormat[lower.tri(cormat)]<- NA
    return(cormat)}
# Reorder the correlation matrix
cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)
# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Create a heatmap
ggplot(melted_cormat, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
  midpoint = 0, limit = c(-1,1), space = "Lab", name = "Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1),
        axis.text.y = element_text(vjust = 1, size = 12, hjust = 1)) +
  coord_fixed() +
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
  theme(axis.title.x = element_blank(),
       axis.title.y = element_blank(),
       panel.grid.major = element_blank(),
       panel.border = element_blank(),
       panel.background = element_blank(),
       axis.ticks = element_blank(),
       legend.justification = c(1, 0),
       legend.position = c(0.6, 0.7),
       legend.direction = "horizontal") +
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))
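
The step that keeps only the upper triangle of the correlation matrix is easy to inspect on a smaller example. The sketch below uses three mtcars columns and base R only; `cm` and `upper` are names introduced for this illustration.

```r
# Correlations between three mtcars columns (base R only)
cm <- round(cor(mtcars[, c("mpg", "disp", "hp")]), 2)

# Blank out the entries below the diagonal; the matrix is symmetric,
# so they only repeat the entries above it
upper <- cm
upper[lower.tri(upper)] <- NA
upper
```

Melting such a matrix with na.rm = TRUE then drops the blanked cells, which is why the heatmap shows each pair of variables only once.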

2.4.12 Polar

We can draw a heart shape by using ‘coord_polar’ as follows.

a=2
theta = seq(0,pi,by=0.01)
rho = a*(1-sin(theta))

heart = data.frame(theta,rho)
pie <- ggplot(heart,aes(x=theta,y=rho,colour="red")) + geom_line()
pie + coord_polar(theta="x") + ggtitle("My heart") 

# ggplot(heart,aes(theta,rho,fill="blue"))+geom_rect(aes(xmin=min(theta),xmax=max(theta),ymin=min(rho),ymax=max(rho)))
# pie + coord_polar(theta="x") + ggtitle("My heart") 

3 Other plots

3.1 ggvis plot

library(ggvis)
mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_smooths(span = input_slider(0.5, 1, value = 1, step=0.1)) %>%
  layer_points(size := input_slider(100, 1000, value = 100, ticks=F, 
                                    pre="pre_", post="_post"))
[Interactive ggvis output: scatter plot of mpg against wt with a smoothed line; sliders control the smoothing span and the point size.]

3.2 animation plot

library(animation)
library(plyr)
oopt = ani.options(interval = 0.3, nmax = 101)
a <- sort(rnorm(100, 2))
b <- sort(rnorm(100, 7))
out <- vector("list", 101)
for (i in 1:ani.options("nmax")) {
  ji <- seq(from = 0, to = 5, by = .05)
  a <- jitter(a, factor = 1, amount = ji[i])
  fab1 <- lm(a ~ b)
  coe <- summary(fab1)$coefficients
  r2 <- summary(fab1)$r.squared
  # Format the slope's p-value so that every range is handled exactly once
  pval <- coe[2, 4]
  p <- if (pval < .0001) " < .0001" else if (pval < .001) " < .001" else round(pval, 3)
  plot(a ~ b, main = "Linear model")
  abline(fab1, col = "red", lwd = 2)
  text(x = min(b) + 2, y = max(a) - 1, 
       labels = paste("t = ", round(coe[2, 3], 3), ", p = ", p, ", R2 = ", round(r2, 3)))
  out[[i]] <- c(coe[2, 3], coe[2, 4], r2)
  ani.pause()
  }
ani.options(oopt)

3.3 3D plot (install XQuartz on MacOS)

# library(rgl)
# library(scatterplot3d)
x1=seq(-3,3,by = 0.1)
a1=1
a2=1
x2=sqrt((9-a1*x1^2)/a2)
x3=seq(-4,4,by = 0.1)
x4=sqrt((16-a1*x3^2)/a2)
plot(x3,x4)
points(x1,x2)

xy=rbind(cbind(x1,x2),cbind(x1,-x2),cbind(x3,x4),cbind(x3,-x4))

plot(xy[c(123:284),1],xy[c(123:284),2],col=2,pch = 16)
points(xy[c(1:122),1],xy[c(1:122),2],col=3,pch = 16)

z1=xy[,1]^2
z2=xy[,2]^2
z3=sqrt(2)*xy[,1]*xy[,2]
library(scatterplot3d)
scatterplot3d(z1,z2,z3,pch = 3)
library(rgl)
open3d()
plot3d(z1[c(1:122)], z2[c(1:122)], z3[c(1:122)],col = 3,size = 6)
plot3d(z1[c(123:284)], z2[c(123:284)], z3[c(123:284)],col = 2,size = 6,add = TRUE)

######
# install.packages("caTools")  # install external package
library(caTools)             # external package providing write.gif function
jet.colors <- colorRampPalette(c("red", "blue", "#007FFF", "cyan", "#7FFF7F",
                                 "yellow", "#FF7F00", "red", "#7F0000"))
dx <- 1500                    # define width
dy <- 1400                    # define height
C  <- complex(real = rep(seq(-2.2, 1.0, length.out = dx), each = dy),
              imag = rep(seq(-1.2, 1.2, length.out = dy), dx))
C <- matrix(C, dy, dx)       # reshape as square matrix of complex numbers
Z <- 0                       # initialize Z to zero
X <- array(0, c(dy, dx, 20)) # initialize output 3D array
for (k in 1:20) {            # loop with 20 iterations
  Z <- Z^2 + C               # the central difference equation
  X[, , k] <- exp(-abs(Z))   # capture results
}
write.gif(X, "Mandelbrot.gif", col = jet.colors, delay = 100)

3.4 More amazing plots:

4 Exercises

  1. Recreate the scatter plot using the mtcars data set and the lattice package. The two lines displayed in the graphic are smooth curves (fitted by loess).

  2. Recreate the following graphic (you can refer to Plot Examples - Time Series 2):

  3. Recreate the following graphic in ggplot2. You can use the first six colors of the palette “Set3”.

tips:

5 Reference