Toolbox

Introduction

The layered structure of ggplot2 encourages you to design and construct graphics in a structured manner. You have learned what a layer is and how to add one to your graphic, but not what geoms and statistics are available to help you build revealing plots. This chapter lists some of the many geoms and stats included in ggplot2, broken down by their purpose. This chapter will provide a good overview of the available options, but it does not describe each geom and stat in detail. For more information about individual geoms, along with many more examples illustrating their use, see the online documentation. You may also want to consult the documentation to learn more about the datasets used in this chapter.

This chapter is broken up into the following sections, each of which deals with a particular graphical challenge. This is not an exhaustive or exclusive categorisation, and there are many other possible ways to break up graphics into different categories. Each geom can be used for many different purposes, especially if you are creative. However, this breakdown should cover many common tasks and help you learn about some of the possibilities.

The examples do not go into much depth, but hopefully if you flick through this chapter, you’ll be able to see a plot that looks like the one you’re trying to create, and modify it to meet your needs.

Overall layering strategy

It is useful to think about the purpose of each layer before it is added. In general, there are three purposes for a layer:

  • To display the data. We plot the raw data for many reasons, relying on our skills at pattern detection to spot gross structure, local structure, and outliers. This layer appears on virtually every graphic. In the earliest stages of data exploration, it is often the only layer.

  • To display a statistical summary of the data. As we develop and explore models of the data, it is useful to display model predictions in the context of the data. We learn from the data summaries and we evaluate the model. Showing the data helps us improve the model, and showing the model helps reveal subtleties of the data that we might otherwise miss. Summaries are usually drawn on top of the data.

    If you review the examples in the preceding chapter, you’ll see many examples of plots of data with an added layer displaying a statistical summary.

  • To add additional metadata, context and annotations. A metadata layer displays background context or annotations that help to give meaning to the raw data. Metadata can be useful in the background and foreground.

    A map is often used as a background layer with spatial data. Background metadata should be rendered so that it doesn’t interfere with your perception of the data, so is usually displayed underneath the data and formatted so that it is minimally perceptible. That is, if you concentrate on it, you can see it with ease, but it doesn’t jump out at you when you are casually browsing the plot.

    Other metadata is used to highlight important features of the data. If you have added explanatory labels to a couple of inflection points or outliers, then you want to render them so that they pop out at the viewer. In that case, you want this to be the very last layer drawn.
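For example, a single plot often combines all three kinds of layer. The following sketch (using the built-in mpg data, purely for illustration) plots the raw data, overlays a statistical summary, and finishes with a text annotation drawn on top:

library(ggplot2)
ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +                              # the raw data
  geom_smooth(method = "loess", se = FALSE) + # a statistical summary
  annotate("text", x = 6.2, y = 27,           # annotation/metadata, drawn last
    label = "2-seaters", colour = "grey30")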

Diamonds data

The diamonds dataset consists of price and quality information for about 54,000 diamonds. It’s built into ggplot2:

library(ggplot2)
dim(diamonds)
#> [1] 53940    10
head(diamonds)
#>   carat       cut color clarity depth table price    x    y    z
#> 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
#> 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
#> 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
#> 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
#> 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
#> 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

The data contains the four C’s of diamond quality: carat, cut, colour and clarity; and five physical measurements: depth, table, x, y and z.

The dataset has not been well cleaned, so as well as demonstrating interesting facts about diamonds, it also shows some data quality problems.

Basic plot types

These geoms are the fundamental building blocks of ggplot2. They are useful in their own right, but are also used to construct more complex geoms. Most of these geoms are associated with a named plot: when that geom is used by itself in a plot, that plot has a special name.

Each of these geoms is two dimensional and requires both x and y aesthetics. All understand colour (or color) and size aesthetics, and the filled geoms (bar, tile and polygon) also understand fill. The point geom understands shape and the line and path geoms understand linetype. The geoms are used for displaying data, summaries computed elsewhere, and metadata.

  • geom_area() draws an area plot, which is a line plot filled to the y-axis (filled lines). Multiple groups will be stacked on top of each other.

  • geom_bar(stat = "identity") makes a barchart. We need stat = "identity" because the default stat automatically counts values (so it is essentially a 1d geom; see displaying distributions). The identity stat leaves the data unchanged.

    By default, multiple bars in the same location will be stacked on top of one another.

  • geom_line() makes a line plot. The group aesthetic determines which observations are connected; see grouping for more details. geom_path() is similar to a geom_line(), but lines are connected in the order they appear in the data, not from left to right.

  • geom_point() produces a scatterplot.

  • geom_polygon() draws polygons, which are filled paths. Each vertex of the polygon requires a separate row in the data. It is often useful to merge a data frame of polygon coordinates with the data just prior to plotting. Drawing maps illustrates this concept in more detail for map data.

  • geom_text() adds labels at the specified points. This is the only geom in this group that requires another aesthetic: label. It also has optional aesthetics hjust and vjust that control the horizontal and vertical position of the text; and angle which controls the rotation of the text. See specifications for more details.

  • geom_tile() and geom_raster() make an image plot or level plot. The tiles form a regular tessellation of the plane and typically have the fill aesthetic mapped to another variable. geom_raster() is a faster version of geom_tile() when all the tiles are the same size.

Each of these geoms is illustrated below. Observe the different axis ranges for the bar, area and tile plots: these geoms take up space outside the range of the data, and so push the axes out.

df <- data.frame(
  x = c(3, 1, 5), 
  y = c(2, 4, 6), 
  label = c("a","b","c")
)
p <- ggplot(df, aes(x, y, label = label)) + labs(x = NULL, y = NULL)
p + geom_point() + labs(title = "geom_point")
p + geom_bar(stat = "identity") + labs(title = "geom_bar(stat=\"identity\")")
p + geom_line() + labs(title = "geom_line")
p + geom_area() + labs(title = "geom_area")
p + geom_path() + labs(title = "geom_path")
p + geom_text() + labs(title = "geom_text")
p + geom_raster() + labs(title = "geom_raster")
p + geom_polygon() + labs(title = "geom_polygon")
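These examples use only the x, y and label aesthetics. As noted above, the basic geoms also understand aesthetics like colour, fill and shape; a quick sketch reusing the same df:

p + geom_point(aes(colour = label), size = 4) + labs(title = "colour")
p + geom_bar(aes(fill = label), stat = "identity") + labs(title = "fill")
p + geom_point(aes(shape = label), size = 4) + labs(title = "shape")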

Exercises

  1. What geoms would you use to draw each of the following named plots?

    1. Scatterplot
    2. Line chart
    3. Histogram
    4. Bar chart
    5. Pie chart
  2. What’s the difference between geom_path() and geom_polygon()?

  3. What low-level geoms are used to draw geom_smooth()? geom_boxplot()?

Grouped data

Geoms can be roughly divided into individual and collective geoms. An individual geom has a distinct graphical object for each observation (row). For example, the point geom draws one point per row. A collective geom displays multiple observations with one geometric object. This may be a result of a statistical summary, like a boxplot, or may be fundamental to the display of the geom, like a polygon. Lines and paths fall somewhere in between: each line is composed of a set of straight segments, but each segment represents two points. How do we control which observations are displayed in each graphical element? This is the job of the group aesthetic.

By default, the group is set to the interaction of all discrete variables in the plot. This often partitions the data correctly, but when it does not, or when no discrete variable is used in the plot, you will need to explicitly define the grouping structure, by mapping group to a variable that has a different value for each group.

There are three common cases where the default is not enough, and we will consider each one below. In the following examples, we will use a simple longitudinal dataset, Oxboys, from the nlme package. It records the heights (height) and centered ages (age) of 26 boys (Subject), measured on nine occasions (Occasion).

data(Oxboys, package = "nlme")
head(Oxboys)
#>   Subject     age height Occasion
#> 1       1 -1.0000    140        1
#> 2       1 -0.7479    143        2
#> 3       1 -0.4630    145        3
#> 4       1 -0.1643    147        4
#> 5       1 -0.0027    148        5
#> 6       1  0.2466    150        6

Multiple groups, one aesthetic

In many situations, you want to separate your data into groups, but render them in the same way. In other words, you want to be able to distinguish individual subjects, but not identify them. This is common in longitudinal studies with many subjects, where the plots are often descriptively called spaghetti plots. For example, the following plot shows the growth trajectory for each boy (each Subject):

ggplot(Oxboys, aes(age, height, group = Subject)) + 
  geom_line()

If you incorrectly specify the grouping variable, you’ll get a characteristic sawtooth appearance:

ggplot(Oxboys, aes(age, height)) + 
  geom_line()

If a group isn’t defined by a single variable, but instead by a combination of multiple variables, use interaction() to combine them, e.g. aes(group = interaction(school_id, student_id)).
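A minimal sketch of this idea, using a small invented data set (the variable names school_id and student_id are hypothetical, chosen only for illustration):

df2 <- data.frame(
  school_id  = rep(c("A", "B"), each = 4),
  student_id = rep(1:2, each = 2, times = 2),
  time       = rep(1:2, 4),
  score      = c(5, 6, 4, 5, 7, 8, 6, 7)
)
# One line per student within each school
ggplot(df2, aes(time, score, group = interaction(school_id, student_id))) + 
  geom_line()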

Different groups on different layers

Sometimes we want to plot summaries based on different levels of aggregation. Different layers might have different group aesthetics, so that some display individuals while others display summaries.

Building on the previous example, suppose we want to add a single smooth line to the plot just created, based on the ages and heights of all the boys. If we use the same grouping in both layers, we get one smooth per boy:

ggplot(Oxboys, aes(age, height, group = Subject)) + 
  geom_line() + 
  geom_smooth(method = "lm", se = FALSE)

This is not what we wanted; we have inadvertently added a smoothed line for each boy. Instead of setting the grouping aesthetic in ggplot(), where it will apply to all layers, we instead set it in geom_line() so it affects the lines, but not the smooths:

ggplot(Oxboys, aes(age, height)) + 
  geom_line(aes(group = Subject)) + 
  geom_smooth(method = "lm", size = 2, se = FALSE)

Overriding the default grouping

Some plots have a discrete x scale, but you still want to draw lines that connect across groups. This is the strategy used in interaction plots, profile plots, and parallel coordinate plots, among others. For example, imagine we’ve drawn boxplots of height at each measurement occasion:

ggplot(Oxboys, aes(Occasion, height)) + 
  geom_boxplot()

There is no need to specify the group aesthetic here; the default grouping works because occasion is a factor variable. Now we want to overlay lines that connect each individual boy. Simply adding geom_line() does not work: the lines are drawn within each occasion, not across each subject:

ggplot(Oxboys, aes(Occasion, height)) + 
  geom_boxplot() +
  geom_line(colour = "#3366FF", alpha = 0.5)

The default grouping uses Occasion because it’s a factor. To get the plot we want, we need to override the grouping:

ggplot(Oxboys, aes(Occasion, height)) + 
  geom_boxplot() +
  geom_line(aes(group = Subject), colour = "#3366FF", alpha = 0.5)

Matching aesthetics to graphical objects

A final important issue with collective geoms is how the aesthetics of the individual observations are mapped to the aesthetics of the complete entity. This isn’t a problem for individual geoms, because each observation is represented by a single graphical element. What happens when differing aesthetics are mapped to one geometric element?

Lines and paths operate on an off-by-one principle: there is one more observation than line segment, and so the aesthetic for the first observation is used for the first segment, the second observation for the second segment and so on. This means that the aesthetic for the last observation is not used:

df <- data.frame(x = 1:3, y = 1:3, colour = c(1,3,5))

ggplot(df, aes(x, y, colour = factor(colour))) + 
  geom_line(aes(group = 1), size = 2) +
  geom_point(size = 5)

ggplot(df, aes(x, y, colour = colour)) + 
  geom_line(aes(group = 1), size = 2) +
  geom_point(size = 5)

You could imagine a more complicated system where segments smoothly blend from one aesthetic to another. This would work for continuous variables like size or colour, but not for line type, and is not used in ggplot2. If this is the behaviour you want, you can perform the linear interpolation yourself:

xgrid <- with(df, seq(min(x), max(x), length = 50))
interp <- data.frame(
  x = xgrid,
  y = approx(df$x, df$y, xout = xgrid)$y,
  colour = approx(df$x, df$colour, xout = xgrid)$y  
)
ggplot(interp, aes(x, y, colour = colour)) + 
  geom_line(size = 2) +
  geom_point(data = df, size = 5)

An additional limitation for paths and lines is that the line type must be constant over each individual line: in R there is no way to draw a line with a varying line type.

For all other collective geoms, like polygons, the aesthetics from the individual components are only used if they are all the same, otherwise the default value is used. It’s particularly clear why this makes sense for fill: how would you colour a polygon that had a different fill colour for each point on its border?

These issues are most relevant when mapping aesthetics to continuous variables, because, as described above, when you introduce a mapping to a discrete variable, it will by default split apart collective geoms into smaller pieces. This works particularly well for bar and area plots, because stacking the individual pieces produces the same shape as the original ungrouped data:

ggplot(mpg, aes(class)) + 
  geom_bar()
ggplot(mpg, aes(class, fill = drv)) + 
  geom_bar()

If you want to display a similar bar chart with the fill of each bar mapped to a continuous variable, you also need to override the default grouping:

ggplot(mpg, aes(class, fill = hwy)) + 
  geom_bar()
ggplot(mpg, aes(class, fill = hwy, group = hwy)) + 
  geom_bar()

(The bars will be stacked in the order in which they appear in the data frame, so you may need to sort your data beforehand to get the display that you want.)

Exercises

  1. When illustrating the difference between mapping continuous and discrete colours to a line, the discrete example needed aes(group = 1). Why? What happens if that is omitted?

  2. Draw a boxplot of hwy for each value of cyl, without turning cyl into a factor. What extra aesthetic do you need to set?

Displaying distributions

There are a number of geoms that can be used to display distributions, depending on the dimensionality of the distribution, whether it is continuous or discrete, and whether you are interested in the conditional or joint distribution.

For 1d continuous distributions the most important geom is the histogram. The code below shows the distribution of depth with two histograms:

ggplot(diamonds, aes(depth)) + 
  geom_histogram()
ggplot(diamonds, aes(depth)) + 
  geom_histogram(binwidth = 0.1) + 
  xlim(55, 70)

It is important to experiment with bin placement to find a revealing view. You can change the binwidth, or specify the exact location of the breaks. Never rely on the default parameters to get a revealing view of the distribution. Zooming in on the x axis, xlim(55, 70), and selecting a smaller bin width, binwidth = 0.1, reveals far more detail. When publishing figures, don’t forget to include information about important parameters (like bin width) in the caption. If you want to compare the distribution between groups, you have a few options:

  • Show small multiples of the histogram, facet_wrap(~ var).
  • Use a frequency polygon, geom_freqpoly().
  • Use a “conditional density plot”, geom_histogram(position = "fill").

The frequency polygon and conditional density plots are shown below. The conditional density plot uses position_fill() to stack each bin, scaling it to the same height. This plot is perceptually challenging because you need to compare bar heights, not positions, but you can see the strongest patterns.

ggplot(diamonds, aes(depth)) + 
  geom_freqpoly(aes(colour = cut), binwidth = 0.1) +
  xlim(58, 68)
#> Warning: Removed 2 rows containing missing values (geom_path).
#> Warning: Removed 2 rows containing missing values (geom_path).
#> Warning: Removed 2 rows containing missing values (geom_path).
#> Warning: Removed 2 rows containing missing values (geom_path).
#> Warning: Removed 2 rows containing missing values (geom_path).
ggplot(diamonds, aes(depth)) + 
  geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill") +
  xlim(58, 68)

Both the histogram and frequency polygon geoms use stat_bin(). This statistic produces two output variables, count and density. By default, count is mapped to y-position, because it’s most interpretable. The density is the count divided by the total count and by the bin width, and is useful when you want to compare the shape of the distributions, not the overall size, or when you have variable binwidths (set with the breaks argument).
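For example, here is a sketch that maps density, rather than count, to the y axis so that the shapes of the distributions can be compared across levels of cut, even though the groups have very different sizes:

ggplot(diamonds, aes(depth)) + 
  geom_freqpoly(aes(y = ..density.., colour = cut), binwidth = 0.1) +
  xlim(58, 68)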

An alternative to a bin-based display of a distribution is a density-estimate-based display. geom_density() is effectively a smoothed version of the frequency polygon, built using kernel density estimates. It has desirable theoretical properties, but is more difficult to relate back to the data. Use a density plot when you know that the underlying density is smooth, continuous and unbounded. You can use the adjust parameter to make the density more or less smooth.

ggplot(diamonds, aes(depth)) +
  geom_density() + 
  xlim(55, 70)
#> Warning: Removed 45 rows containing non-finite values (stat_density).
ggplot(diamonds, aes(depth, fill = cut)) +
  geom_density(alpha = 0.2) + 
  xlim(55, 70)
#> Warning: Removed 43 rows containing non-finite values (stat_density).
#> Warning: Removed 1 rows containing non-finite values (stat_density).
#> Warning: Removed 1 rows containing non-finite values (stat_density).

Note that the area of each density estimate is standardised to one so that you lose information about the relative size of each group.
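If you want to preserve that information, one option is to map the count computed by stat_density() (the density scaled by the number of observations in each group) to the y axis instead; a sketch:

ggplot(diamonds, aes(depth, colour = cut)) +
  geom_density(aes(y = ..count..)) + 
  xlim(55, 70)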

The histogram, frequency polygon and density plot display a detailed view of the distribution. However, sometimes you want to compare many distributions, and a summary that sacrifices quality for quantity can be useful. There are three options:

  • geom_boxplot(): the box-and-whisker plot shows five summary statistics along with individual “outliers”. It displays far less information than a histogram, but also takes up much less space.

    You can use boxplots with both categorical and continuous x. For continuous x, you’ll also need to set the group aesthetic to define how the x variable is broken up into bins. A useful helper function is plyr::round_any():

    ggplot(diamonds, aes(clarity, depth)) + 
      geom_boxplot()
    ggplot(diamonds, aes(carat, depth)) + 
      geom_boxplot(aes(group = plyr::round_any(carat, 0.1))) + 
      xlim(NA, 2.05)
    #> Warning: Removed 997 rows containing non-finite values (stat_boxplot).

  • geom_violin(): the violin plot is a compact version of the density plot. The underlying computation is the same, but the results are displayed in a similar fashion to the boxplot:

    ggplot(diamonds, aes(clarity, depth)) + 
      geom_violin()
    ggplot(diamonds, aes(carat, depth)) + 
      geom_violin(aes(group = plyr::round_any(carat, 0.1))) + 
      xlim(NA, 2.05)
    #> Warning: Removed 997 rows containing non-finite values (stat_ydensity).

A final option for displaying the distribution of smaller datasets is the dotplot. It draws one point for each observation, carefully adjusted in space to avoid overlaps and show the distribution.

ggplot(mpg, aes(displ)) + 
  geom_dotplot()
ggplot(mpg, aes(drv, displ, fill = drv)) + 
  geom_dotplot(binaxis = "y", stackdir = "center")

Dealing with overplotting

The scatterplot is a very important tool for assessing the relationship between two continuous variables. However, when the data is large, often points will be plotted on top of each other, obscuring the true relationship. In extreme cases, you will only be able to see the extent of the data, and any conclusions drawn from the graphic will be suspect. This problem is called overplotting and there are a number of ways to deal with it.

The first set of techniques involves tweaking aesthetic properties. These tend to be most effective for smaller datasets.

  • Very small amounts of overplotting can sometimes be alleviated by making the points smaller, or using hollow glyphs. The data is 2000 points sampled from two independent normal distributions.

    df <- data.frame(x = rnorm(2000), y = rnorm(2000))
    norm <- ggplot(df, aes(x, y))
    norm + geom_point()
    norm + geom_point(shape = 1)
    norm + geom_point(shape = ".") # Pixel sized

  • For larger datasets with more overplotting, you can use alpha blending (transparency) to make the points transparent. If you specify alpha as a ratio, the denominator gives the number of points that must be overplotted to give a solid colour. In R, the lowest amount of transparency you can use is 1/256, so it will not be effective for heavy overplotting.

    norm + geom_point(alpha = 1 / 3)
    norm + geom_point(alpha = 1 / 5)
    norm + geom_point(alpha = 1 / 10)

  • If there is some discreteness in the data, you can randomly jitter the points to alleviate some overlaps. This can be particularly useful in conjunction with transparency. By default, the amount of jitter added is 40% of the resolution of the data, which leaves a small gap between adjacent regions.

    In the following code, table is recorded to the nearest integer, so we set a jitter width of half of that.

    td <- ggplot(diamonds, aes(table, depth)) + 
      xlim(50, 70) + 
      ylim(55, 70)
    td + geom_point()
    #> Warning: Removed 55 rows containing missing values (geom_point).
    td + geom_jitter()
    #> Warning: Removed 60 rows containing missing values (geom_point).
    td + geom_jitter(width = 0.5)
    #> Warning: Removed 62 rows containing missing values (geom_point).
    td + geom_jitter(width = 0.5, alpha = 1 / 10)
    #> Warning: Removed 65 rows containing missing values (geom_point).
    td + geom_jitter(width = 0.5, alpha = 1 / 50)
    #> Warning: Removed 64 rows containing missing values (geom_point).
    td + geom_jitter(width = 0.5, alpha = 1 / 200)
    #> Warning: Removed 59 rows containing missing values (geom_point).

  • An alternative approach to dealing with discrete data is to count the number of unique points at each location. We can use either geom_count() or stat_sum(); using the stat directly lets you pair the counts with an alternative geom, such as tiles.

    td + geom_count(alpha = 1/3)
    #> Warning: Removed 26 rows containing missing values (geom_point).
    td + geom_count(aes(x = as.integer(table)), alpha = 1/3)
    #> Warning: Removed 26 rows containing missing values (geom_point).
    td + stat_sum(
      aes(x = as.integer(table), fill = ..n.., size = NULL), 
      geom = "tile")

Alternatively, we can think of overplotting as a 2d density estimation problem, which gives rise to two more approaches:

  • Bin the points and count the number in each bin, then visualise that count (the 2d generalisation of the histogram). Breaking the plot into many small squares can produce distracting visual artefacts. D. B. Carr et al. (1987) suggest using hexagons instead, and this is implemented with geom_hex(), using the hexbin package (D. Carr, Lewin-Koh, and Maechler 2008).

    The code below compares square and hexagonal bins, using parameters bins and binwidth to control the number and size of the bins.

    diamonds13 <- subset(diamonds, carat >= 1 & carat <= 3)
    d <- ggplot(diamonds13, aes(carat, price)) + 
      theme(legend.position = "none")
    d + stat_bin2d()
    d + stat_bin2d(bins = 10)
    d + stat_bin2d(binwidth = c(0.02, 200))
    d + stat_binhex()
    d + stat_binhex(bins = 10)
    d + stat_binhex(binwidth = c(0.02, 200))

  • Estimate the 2d density with stat_density2d(), and overlay contours from this distribution on the scatterplot, or display the density by itself as coloured tiles, or points with size proportional to density. The code below shows a few of these options.

    d + 
      geom_point() + 
      geom_density2d()
    d + 
      stat_density2d(geom = "point", aes(size = ..density..), contour = FALSE) +
      scale_size(range = c(0.2, 1.5))
    d + 
      stat_density2d(geom = "tile", aes(fill = ..density..), contour = FALSE) 

  • If you are interested in the conditional distribution of y given x, then the techniques of displaying distributions will also be useful.

Another approach to dealing with overplotting is to add data summaries to help guide the eye to the true shape of the pattern within the data. For example, you could add a smooth line showing the centre of the data with geom_smooth(). Statistical summaries has more ideas.
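For example, a sketch that combines heavy alpha blending with a simple linear summary, reusing the td plot defined above:

td + 
  geom_point(alpha = 1 / 100) + 
  geom_smooth(method = "lm", se = FALSE, colour = "red")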

Surface plots

ggplot2 does not support true 3d surfaces (and likely never will). However, it does support many common tools for representing 3d surfaces in 2d: contours, coloured tiles and bubble plots. These were used to illustrate the 2d density surfaces in the previous section. For interactive 3d plots, including true 3d surfaces, see RGL, http://rgl.neoscientists.org/about.shtml.
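As a concrete sketch, the code below evaluates a simple function on a regular grid (both invented purely for illustration) and displays the resulting surface with contours and with coloured tiles:

grid <- expand.grid(x = seq(-2, 2, length = 50), y = seq(-2, 2, length = 50))
grid$z <- with(grid, exp(-(x^2 + y^2)))

ggplot(grid, aes(x, y, z = z)) + 
  geom_contour()
ggplot(grid, aes(x, y, fill = z)) + 
  geom_raster()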

Drawing maps

ggplot2 provides some tools to make it easy to draw maps. Often the biggest challenge is finding a good source of map data. There are two particularly useful packages:

World data:

US data:

There are two basic reasons you might want to use map data: to add reference outlines to a plot of spatial data, or to construct a choropleth map by filling regions with colour.

Adding map borders is performed by the borders() function. The first two arguments select the map and the region within the map to display. The remaining arguments control the appearance of the borders: their colour and size. If you’d prefer filled polygons instead of just borders, you can set the fill colour. The following code uses borders() to add state and county outlines to plots of US city locations:

library("maps")
data(us.cities)
big_cities <- subset(us.cities, pop > 500000)
qplot(long, lat, data = big_cities) + borders("state", size = 0.5)
tx_cities <- subset(us.cities, country.etc == "TX")
ggplot(tx_cities, aes(long, lat)) +
  borders("county", "texas", colour = "grey70") +
  geom_point(alpha = 0.5)

Choropleth maps are a little trickier and a lot less automated, because it is challenging to match the identifiers in your data to the identifiers in the map data. The following example shows how to use map_data() to convert a map into a data frame, which can then be merge()d with your data to produce a choropleth map. The details for your data will probably be different, but the key is to have a column in your data and a column in the map data that can be matched.

library("maps")
states <- map_data("state")
arrests <- USArrests
names(arrests) <- tolower(names(arrests))
arrests$region <- tolower(rownames(USArrests))

choro <- merge(states, arrests, by = "region")
# Reorder the rows because order matters when drawing polygons
# and merge destroys the original ordering
choro <- choro[order(choro$order), ]
qplot(long, lat, data = choro, group = group, 
  fill = assault, geom = "polygon")
qplot(long, lat, data = choro, group = group, 
  fill = assault / murder, geom = "polygon")

The map_data() function is also useful if you’d like to process the map data in some way. In the following example we compute the (approximate) centre of each county in Iowa and then use those centres to label the map.

library(dplyr) # for %>%, group_by() and summarise() used below
ia <- map_data("county", "iowa")
mid_range <- function(x) mean(range(x, na.rm = TRUE))
centres <- ia %>% tbl_df %>% group_by(subregion) %>%
  summarise(lat = mid_range(lat), long = mid_range(long))
ggplot(ia, aes(long, lat)) + 
  geom_polygon(aes(group = group), 
    fill = NA, colour = "grey60") +
  geom_text(aes(label = subregion), data = centres, 
    size = 2, angle = 45)

Revealing uncertainty

If you have information about the uncertainty present in your data, whether it be from a model or from distributional assumptions, it’s a good idea to display it. There are four basic families of geoms that can be used for this job, depending on whether the x values are discrete or continuous, and whether or not you want to display the middle of the interval, or just the extent:

  • Discrete x, range: geom_errorbar(), geom_linerange()
  • Discrete x, range & center: geom_crossbar(), geom_pointrange()
  • Continuous x, range: geom_ribbon()
  • Continuous x, range & center: geom_smooth(stat = "identity")

These geoms assume that you are interested in the distribution of y conditional on x and use the aesthetics ymin and ymax to determine the range of the y values. If you want the opposite, see cartesian coordinate systems.

y <- c(18, 11, 16)
df <- data.frame(x = 1:3, y = y, se = c(1.2, 0.5, 1.0))

base <- ggplot(df, aes(x, y, ymin = y - se, ymax = y + se))
base + geom_crossbar()
base + geom_pointrange()
base + geom_smooth(stat = "identity")
base + geom_errorbar()
base + geom_linerange()
base + geom_ribbon()

Because there are so many different ways to calculate standard errors, the calculation is up to you. For very simple cases, ggplot2 provides some tools in the form of the summary functions described below; otherwise you will have to do it yourself. The modelling chapter contains more advice on extracting confidence intervals from more sophisticated models.
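To illustrate, here is a sketch that computes a mean and standard error by hand (using dplyr purely as an example) and displays the result with geom_pointrange():

library(dplyr)
mpg_sum <- mpg %>% 
  group_by(class) %>% 
  summarise(mean_hwy = mean(hwy), se = sd(hwy) / sqrt(n()))

ggplot(mpg_sum, aes(class, mean_hwy, ymin = mean_hwy - se, ymax = mean_hwy + se)) + 
  geom_pointrange()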

Statistical summaries

It’s often useful to be able to summarise the y values for each unique x value. In ggplot2, this role is played by stat_summary(), which provides a flexible way of summarising the conditional distribution of y with the aesthetics ymin, y and ymax.

The following example uses movie ratings data from the movies dataset (available in the ggplot2movies package), since it includes a useful continuous variable: year.

data(movies, package = "ggplot2movies")
m <- ggplot(movies, aes(year, rating))
m + stat_summary(fun.y = "median", geom = "line")
m + stat_summary(fun.data = "median_hilow", geom = "ribbon")
m + stat_summary(fun.data = "median_hilow", geom = "smooth")
m2 <- ggplot(movies, aes(round(rating), log10(votes)))
m2 + stat_summary(fun.y = "median", geom = "point")
m2 + stat_summary(fun.data = "median_hilow", geom = "errorbar")
m2 + stat_summary(fun.data = "median_hilow", geom = "crossbar")

When using stat_summary() you can either supply the summary functions individually, or supply a single function that produces all three summaries at once. These alternatives are described below.

Individual summary functions

The arguments fun.y, fun.ymin and fun.ymax accept simple numeric summary functions. You can use any summary function that takes a vector of numbers and returns a single numeric value: mean(), median(), min(), max().

midm <- function(x) mean(x, trim = 0.5)
m2 + 
  stat_summary(aes(colour = "trimmed"), fun.y = midm, geom = "point") +
  stat_summary(aes(colour = "raw"), fun.y = mean, geom = "point") + 
  scale_colour_hue("Mean")
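You can also supply fun.ymin and fun.ymax to display a range. For example, a sketch showing the median together with the minimum and maximum at each x:

m2 + stat_summary(fun.y = median, fun.ymin = min, fun.ymax = max, 
  geom = "pointrange")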

Single summary function

fun.data can be used with more complex summary functions such as one of the summary functions from the Hmisc package (Harrell 2008):

Function            Hmisc original      Middle   Range
mean_cl_normal()    smean.cl.normal()   Mean     Standard error from normal approximation
mean_cl_boot()      smean.cl.boot()     Mean     Standard error from bootstrap
mean_sdl()          smean.sdl()         Mean     Multiple of standard deviation
median_hilow()      smedian.hilow()     Median   Outer quantiles with equal tail areas
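For example, a sketch using one of these wrappers (the Hmisc package must be installed):

m2 + stat_summary(fun.data = "mean_cl_boot", geom = "crossbar")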

You can also write your own summary function. It should take a vector as input and return a named vector with some or all of the components “y”, “ymin” and “ymax” as output:

iqr <- function(x, ...) {
  qs <- quantile(as.numeric(x), c(0.25, 0.75), na.rm = TRUE)
  setNames(qs, c("ymin", "ymax"))
}
m + stat_summary(fun.data = "iqr", geom = "ribbon")

Weighted data

When you have aggregated data where each row in the dataset represents multiple observations, you need some way to take into account the weighting variable. We will use some data collected on Midwest states in the 2000 US census in the built-in midwest data frame. The data consists mainly of percentages (e.g., percent white, percent below poverty line, percent with college degree) and some information for each county (area, total population, population density).

There are a few different things we might want to weight by:

  • Nothing, to look at numbers of counties.
  • Total population, to work with absolute numbers.
  • Area, to investigate geographic effects. (This isn’t useful for midwest, but would be if we had variables like percentage of farmland.)

The choice of a weighting variable profoundly affects what we are looking at in the plot and the conclusions that we will draw. There are two aesthetic attributes that can be used to adjust for weights. Firstly, for simple geoms like lines and points, use the size aesthetic:

# Unweighted
ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point()

# Weight by population
ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point(aes(size = poptotal / 1e6)) + 
  scale_size_area("Population\n(millions)", breaks = c(0.5, 1, 2, 4))

For more complicated grobs which involve some statistical transformation, we specify weights with the weight aesthetic. These weights will be passed on to the statistical summary function. Weights are supported for every case where it makes sense: smoothers, quantile regressions, boxplots, histograms, and density plots. You can’t see this weighting variable directly, and it doesn’t produce a legend, but it will change the results of the statistical summary. The following code shows how weighting by population density affects the relationship between percent white and percent below the poverty line.

# Unweighted
ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point() + 
  geom_smooth(method = lm, size = 1)

# Weighted by population
ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point(aes(size = poptotal / 1e6)) + 
  geom_smooth(aes(weight = poptotal), method = lm, size = 1) +
  scale_size_area(guide = "none")

When we weight a histogram or density plot by total population, we change from looking at the distribution of the number of counties, to the distribution of the number of people. The following code shows the difference this makes for a histogram of the percentage below the poverty line:

ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(binwidth = 1) + 
  ylab("Counties")

ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(aes(weight = poptotal), binwidth = 1) +
  ylab("Population (1000s)")

Add-on packages

If the built-in tools in ggplot2 don’t do what you need, you might want to use a special purpose tool from one of the packages built on top of ggplot2:

  • cowplot [cran]: publication ready theme, better tools for arranging multiple ggplots on one page (https://github.com/wilkelab/cowplot)

  • GGally [cran]: plot matrices, including scatterplot matrices.

  • ggbio [bioconductor]: visualisation of genomic data.

  • ggdendro [cran]: dendrograms and tree diagrams.

  • ggenealogy [cran]: genealogies.

  • ggmap [cran]: map images from online sources (e.g. Google Maps) as plot backgrounds.

  • ggmcmc [cran]: graphical tools for analysing MCMC output.

  • ggparallel [cran]: parallel coordinates plots, hammock plots & common angle plots.

  • ggsubplot [cran]: embed subplots within larger graphics

  • ggtern [cran]: ternary plots, http://www.ggtern.com

  • ggthemes [cran]: additional themes, scales and geoms.

  • granovaGG [cran]: visualisation of ANOVA results.

  • plotluck [github]: ggplot2 version of “I’m feeling lucky”. Automatically creates plots for one, two or three variables ( https://github.com/stefan-schroedl/plotluck).

Carr, D. B., R. J. Littlefield, W. L. Nicholson, and J. S. Littlefield. 1987. “Scatterplot Matrix Techniques for Large N.” Journal of the American Statistical Association 82 (398): 424–36.

Carr, Dan, Nicholas Lewin-Koh, and Martin Maechler. 2008. Hexbin: Hexagonal Binning Routines.

Harrell, Jr, Frank E. 2008. Hmisc: Harrell Miscellaneous. http://biostat.mc.vanderbilt.edu/s/Hmisc.