In the video you saw 9 visible aesthetics. Let’s apply them to a categorical variable - the cylinders in mtcars, cyl.
(You’ll consider line type when you encounter line plots in the next chapter).
These are the aesthetics you can consider within aes() in this chapter: x, y, color, fill, size, alpha, label and shape.
In the following exercise you can assume that the cyl column is categorical. It has already been transformed into a factor for you.
mtcars$cyl <- as.factor(mtcars$cyl)
# 1 - Map mpg to x and cyl to y
ggplot(mtcars, aes(x = mpg, y = cyl)) +
geom_point()
# 2 - Reverse: Map cyl to x and mpg to y
ggplot(mtcars, aes(x = cyl, y = mpg)) +
geom_point()
# 3 - Map wt to x, mpg to y and cyl to col
ggplot(mtcars, aes(x = wt, y = mpg, col = as.factor(cyl))) +
geom_point()
# Change shape and size of the points in the above plot
ggplot(mtcars, aes(x = wt, y = mpg, col = as.factor(cyl))) +
geom_point(shape = 1, size = 4)
The color aesthetic typically changes the outline of an object, and the fill aesthetic typically changes the inside shading. However, as you saw in the last exercise, geom_point() is an exception: here you use color, instead of fill, for the inside of the point. But it’s a bit subtler than that.
Which shape to use? The default geom_point() uses shape = 19 (a solid circle with an outline the same colour as the inside). Good alternatives are shape = 1 (hollow) and shape = 16 (solid, no outline). These all use the col aesthetic (don’t forget to set alpha for solid points).
A really nice alternative is shape = 21 which allows you to use both fill for the inside and col for the outline! This is a great little trick for when you want to map two aesthetics to a dot.
What happens when you use the wrong aesthetic mapping? This is a very common mistake! The code from the previous exercise is in the editor. Using this as your starting point, complete the instructions.
# Given from the previous exercise
ggplot(mtcars, aes(x = wt, y = mpg, col = as.factor(cyl))) +
geom_point(shape = 1, size = 4)
# 1 - Map cyl to fill
ggplot(mtcars, aes(x = wt, y = mpg, fill = as.factor(cyl))) +
geom_point(shape = 1, size = 4)
# 2 - Change shape and alpha of the points in the above plot
ggplot(mtcars, aes(x = wt, y = mpg, fill = as.factor(cyl))) +
geom_point(shape = 21, size = 4, alpha = 0.6)
# 3 - Map am to col in the above plot
ggplot(mtcars, aes(x = wt, y = mpg, fill = as.factor(cyl), col = as.factor(am))) +
geom_point(shape = 21, size = 4, alpha = 0.6)
Now that you’ve got some practice with incrementally building up plots, you can try to do it from scratch! The mtcars dataset is pre-loaded in the workspace.
# Map cyl to size
ggplot(mtcars, aes(x = wt, y = mpg, size = as.factor(cyl))) +
geom_point()
# Map cyl to alpha
ggplot(mtcars, aes(x = wt, y = mpg, alpha = as.factor(cyl))) +
geom_point()
# Map cyl to shape
ggplot(mtcars, aes(x = wt, y = mpg, shape = as.factor(cyl))) +
geom_point()
# Map cyl to labels
ggplot(mtcars, aes(x = wt, y = mpg, label = as.factor(cyl))) +
geom_text()
In the video you saw that you can use all the aesthetics as attributes. Let’s see how this works with the aesthetics you used in the previous exercises: x, y, color, fill, size, alpha, label and shape.
This time you’ll use these arguments to set attributes of the plot, not aesthetics. However, there are some pitfalls you’ll have to watch out for: these attributes can overwrite the aesthetics of your plot!
A word about shapes: In the exercise “All about aesthetics, part 2”, you saw that shape = 21 results in a point that has a fill and an outline. Shapes in R can have a value from 0-25. Shapes 0-20 can only accept a color aesthetic, but shapes 21-25 have both a color and a fill aesthetic. See the documentation of the pch argument in points() for further discussion.
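To see the difference between the two groups of shapes, it can help to draw them all at once. The following is a minimal sketch, assuming ggplot2 is installed; the data frame shapes and its grid columns gx and gy are made up purely for layout. Every point gets the same colour and fill, and only shapes 21-25 should show the red fill.

```r
library(ggplot2)

# One row per shape code; gx and gy are hypothetical grid positions for layout
shapes <- data.frame(code = 0:25,
                     gx   = (0:25) %% 7,
                     gy   = -((0:25) %/% 7))

shape_chart <- ggplot(shapes, aes(x = gx, y = gy)) +
  # colour applies to every shape; fill is only visible for shapes 21-25
  geom_point(aes(shape = code), size = 5, colour = "black", fill = "red") +
  # print the shape code just below each symbol
  geom_text(aes(label = code), nudge_y = -0.35, size = 3) +
  # use the numbers in `code` directly as shape codes
  scale_shape_identity() +
  theme_void()

shape_chart
```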
A word about hexadecimal colours: Hexadecimal, literally “related to 16”, is a base-16 alphanumeric counting system. Individual digits come from the ranges 0-9 and A-F. This means there are 256 possible two-digit values (i.e. 00 - FF). Hexadecimal colours use this system to specify a six-digit code for the Red, Green and Blue values (“#RRGGBB”) of a colour (e.g. pure blue: “#0000FF”, black: “#000000”, white: “#FFFFFF”). R can accept hex codes as valid colours.
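If you want to check the arithmetic yourself, base R can convert between hexadecimal and decimal. This short sketch uses only base functions (strtoi(), col2rgb() and rgb()); the colour is the same my_color value used in the exercise code.

```r
# Convert two-digit hexadecimal values to decimal
strtoi("FF", base = 16L)   # the largest two-digit value
strtoi("4A", base = 16L)

# Split a hex colour into its red, green and blue components (0-255 each)
my_color  <- "#4ABEFF"
rgb_parts <- col2rgb(my_color)
rgb_parts

# Rebuild the hex code from the decimal components
rebuilt <- rgb(rgb_parts["red", 1], rgb_parts["green", 1], rgb_parts["blue", 1],
               maxColorValue = 255)
rebuilt
```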
# Define a hexadecimal color
my_color <- "#4ABEFF"
# Draw a scatter plot with color *aesthetic*
ggplot(mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) +
geom_point()
# Same, but set color *attribute* in geom layer
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(color = my_color)
# Set the fill aesthetic; color, size and shape attributes
ggplot(mtcars, aes(x = wt, y = mpg, fill = as.factor(cyl))) +
geom_point(size = 10, shape = 23, color = my_color)
In the videos you saw that you can use all the aesthetics as attributes. Let’s see how this works with the aesthetics you used in the previous exercises: x, y, color, fill, size, alpha, label and shape.
In this exercise you will set all kinds of attributes of the points!
You will continue to work with mtcars.
# Expand to draw points with alpha 0.5
ggplot(mtcars, aes(x = wt, y = mpg, fill = as.factor(cyl))) +
geom_point(alpha = 0.5)
# Expand to draw points with shape 24 and color yellow
ggplot(mtcars, aes(x = wt, y = mpg, fill = as.factor(cyl))) +
geom_point(shape = 24, color = 'yellow')
# Expand to draw text with label rownames(mtcars) and color red
ggplot(mtcars, aes(x = wt, y = mpg, fill = as.factor(cyl))) +
geom_text(label = rownames(mtcars), color = 'red')
Note: In this chapter you saw aesthetics and attributes. Variables in a data frame are mapped to aesthetics in aes() (e.g. aes(col = cyl)) within ggplot(). Visual elements are set by attributes in specific geom layers (e.g. geom_point(col = "red")). Don’t confuse these two things - here you’re focusing on aesthetic mappings.
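One way to convince yourself of the difference is to look at the data a layer actually draws. The sketch below assumes ggplot2 is installed and uses its layer_data() helper; the object names mapped, set_attr and overridden are only illustrative. It also shows the pitfall mentioned earlier: an attribute set in the geom layer wins over a mapped aesthetic.

```r
library(ggplot2)

# Mapping: cyl is translated to colours through a scale
mapped <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point()

# Setting: every point gets the same fixed colour
set_attr <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(col = "red")

# Pitfall: the attribute in the geom layer overrides the mapped aesthetic
overridden <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point(col = "red")

# layer_data() returns the data frame used to draw a layer
unique(layer_data(mapped)$colour)      # one colour per cyl level
unique(layer_data(set_attr)$colour)    # a single colour
unique(layer_data(overridden)$colour)  # a single colour again
```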
Draw a scatter plot of mtcars with mpg on the x-axis, qsec on the y-axis and factor(cyl) as colors. Copy the previous plot and expand to include factor(am) as the shape of the points. Copy the previous plot and expand to include the ratio of horsepower to weight (i.e. (hp/wt)) as the size of the points.
# Map mpg onto x, qsec onto y and factor(cyl) onto col (3 aesthetics):
ggplot(mtcars, aes(x = mpg, y = qsec, col = factor(cyl))) +
geom_point()
# Add mapping: factor(am) onto shape (now 4 aesthetics):
ggplot(mtcars, aes(x = mpg, y = qsec, col = factor(cyl), shape = factor(am))) +
geom_point()
# Add mapping: (hp/wt) onto size (now 5 aesthetics):
ggplot(mtcars, aes(x = mpg, y = qsec, col = factor(cyl), shape = factor(am),
size = (hp/wt))) +
geom_point()
Position
You saw how jittering worked in the video, but bar plots suffer from their own issues of overplotting, as you’ll see here. Use the “stack”, “fill” and “dodge” positions to reproduce the plot in the viewer.
The ggplot2 base layers (data and aesthetics) have already been coded; they’re stored in a variable cyl.am. It looks like this:
cyl.am <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))
# The base layer, cyl.am, is available for you
# Add geom (position = "stack" by default)
cyl.am +
geom_bar()
# Fill - show proportion
cyl.am +
geom_bar(position = "fill")
# Dodging - principles of similarity and proximity
cyl.am +
geom_bar(position = "dodge")
# Clean up the axes with scale_ functions
# am is coded 0 = automatic, 1 = manual, so the labels follow that order
val <- c("#E41A1C", "#377EB8")
lab <- c("Automatic", "Manual")
cyl.am +
geom_bar(position = "dodge") +
scale_x_discrete("Cylinders") +
scale_y_continuous("Number") +
scale_fill_manual("Transmission",
values = val,
labels = lab)
In the last chapter you saw that all the visible aesthetics can serve as attributes and aesthetics, but I very conveniently left out x and y. That’s because although you can make univariate plots (such as histograms, which you’ll get to in the next chapter), a y-axis will always be provided, even if you didn’t ask for it.
In the base package you can make univariate plots with stripchart() (shown in the viewer) directly and it will take care of a fake y axis for us. Since this is univariate data, there is no real y axis.
You can get the same thing in ggplot2, but it’s a bit more cumbersome. The only reason you’d really want to do this is if you were making many plots and you wanted them to be in the same style, or you wanted to take advantage of an aesthetic mapping (e.g. colour).
# 1 - Create jittered plot of mtcars, mpg onto x, 0 onto y
ggplot(mtcars, aes(x = mpg, y = 0)) +
geom_jitter()
# 2 - Add function to change y axis limits
ggplot(mtcars, aes(x = mpg, y = 0)) +
geom_jitter() +
scale_y_continuous(limits = c(-2,2))
In the previous section you saw that there are lots of ways to use aesthetics. Perhaps too many, because although they are possible, they are not all recommended. Let’s take a look at what works and what doesn’t.
So far you’ve focused on scatter plots since they are intuitive, easily understood and very common. A major consideration in any scatter plot is dealing with overplotting. You’ll encounter this topic again in the geometries layer, but you can already make some adjustments here.
You’ll have to deal with overplotting when you have:
1. Large datasets,
2. Imprecise data, so points are not clearly separated on your plot (you saw this in the video with the iris dataset),
3. Interval data (i.e. data appears at fixed values), or
4. Aligned data values on a single axis.
One very common technique that I’d recommend you always use when you have solid shapes is alpha blending (i.e. adding transparency). An alternative is to use hollow shapes. These are adjustments to make before even worrying about positioning. This addresses the first point above, which you’ll see again in the next exercise.
# Basic scatter plot of wt on x-axis and mpg on y-axis; map cyl to col
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 4)
# Hollow circles - an improvement
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 4, shape = 1)
# Add transparency - very nice
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 4, alpha = 0.6)
In a previous exercise we defined four situations in which you’d have to adjust for overplotting. You’ll consider the last two here with the diamonds dataset:
# Scatter plot: carat (x), price (y), clarity (color)
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
geom_point()
# Adjust for overplotting
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
geom_point(alpha = 0.5)
# Scatter plot: clarity (x), carat (y), price (color)
ggplot(diamonds, aes(x = clarity, y = carat, color = price)) +
geom_point(alpha = 0.5)
# Dot plot with jittering
ggplot(diamonds, aes(x = clarity, y = carat, color = price)) +
geom_point(alpha = 0.5, position = "jitter")
You already saw a few examples using geom_point() where the result was not a scatter plot. For example, in the plot shown in the viewer a continuous variable, wt, is mapped to the y aesthetic, and a categorical variable, cyl, is mapped to the x aesthetic. This also leads to overplotting, since the points are arranged on a single x position. You previously dealt with overplotting by setting position = "jitter" inside geom_point(). Let’s look at some other solutions here.
# Shown in the viewer:
ggplot(mtcars, aes(x = cyl, y = wt)) +
geom_point()
# Solutions:
# 1 - With geom_jitter()
ggplot(mtcars, aes(x = cyl, y = wt)) +
geom_jitter()
# 2 - Set width in geom_jitter()
ggplot(mtcars, aes(x = cyl, y = wt)) +
geom_jitter(width = 0.1)
# 3 - Set position = position_jitter() in geom_point()
ggplot(mtcars, aes(x = cyl, y = wt)) +
geom_point(position = position_jitter(0.1))
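By default, jittering adds a small amount of noise in both directions, so the wt values on the y axis are nudged slightly as well. If you only want to spread the points sideways, one option is to set height = 0. This is a minimal sketch, assuming a ggplot2 version in which position_jitter() accepts a seed argument; the object name posn_j and the seed value are arbitrary.

```r
library(ggplot2)

mtcars$cyl <- as.factor(mtcars$cyl)

# height = 0 keeps the wt values exact; seed makes the jitter reproducible
posn_j <- position_jitter(width = 0.1, height = 0, seed = 136)

p <- ggplot(mtcars, aes(x = cyl, y = wt)) +
  geom_point(position = posn_j)
p
```

Because the position is stored in an object, the same jitter can be reused across several layers or plots.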
In the chapter on aesthetics you saw different ways in which you will have to compensate for overplotting. In the video you saw a dataset that suffered from overplotting because of the precision of the dataset.
Another example you saw is when you have integer data. This can be continuous data measured on an integer scale (i.e. 1, 2, 3, …), as opposed to a numeric scale (i.e. 1.1, 1.4, 1.5, …), or two categorical (e.g. factor) variables, which are just type integer under the hood.
In such a case you’ll have a small, defined number of intersections between the two variables.
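A quick way to see how few positions are actually available is to count the distinct combinations. The sketch below uses two integer-valued columns of mtcars (cyl and gear) as a stand-in, since it needs nothing beyond base R; the same idea applies to any pair of integer or factor variables.

```r
# Count how many observations share each (cyl, gear) combination in mtcars
combos <- as.data.frame(table(cyl = mtcars$cyl, gear = mtcars$gear))

# Keep only the combinations that actually occur
combos <- combos[combos$Freq > 0, ]

nrow(mtcars)   # number of observations
nrow(combos)   # number of distinct positions they can occupy

# Most crowded positions first
combos[order(-combos$Freq), ]
```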
You will be using the Vocab dataset. The Vocab dataset contains information about the years of education and integer score on a vocabulary test for over 21,000 individuals based on US General Social Surveys from 1972-2004.
# Examine the structure of Vocab
str(Vocab)
'data.frame': 30351 obs. of 4 variables:
$ year : num 1974 1974 1974 1974 1974 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 2 2 2 1 1 ...
$ education : num 14 16 10 10 12 16 17 10 12 11 ...
$ vocabulary: num 9 9 9 5 8 8 9 5 3 5 ...
- attr(*, "na.action")= 'omit' Named int 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "names")= chr "19720001" "19720002" "19720003" "19720004" ...
# Basic scatter plot of vocabulary (y) against education (x). Use geom_point()
ggplot(Vocab, aes(x = education, y = vocabulary)) +
geom_point()
# Use geom_jitter() instead of geom_point()
ggplot(Vocab, aes(x = education, y = vocabulary)) +
geom_jitter()
# Using the above plotting command, set alpha to a very low 0.2
ggplot(Vocab, aes(x = education, y = vocabulary)) +
geom_jitter(alpha = 0.2)
# Using the above plotting command, set the shape to 1
ggplot(Vocab, aes(x = education, y = vocabulary)) +
geom_jitter(alpha = 0.2, shape = 1)
Histograms are one of the most common and intuitive ways of showing distributions. In this exercise you’ll use the mtcars data frame to explore typical variations of simple histograms. But first, some background:
The x axis/aesthetic: The documentation for geom_histogram() states the argument stat = “bin” as a default. Recall that histograms cut up a continuous variable into discrete bins - that’s what the stat “bin” is doing. You get 30 evenly sized bins by default, which corresponds to a bin width of range/30. This is a pretty good starting point if you don’t know anything about the variable being plotted and want to start exploring.
The y axis/aesthetic: geom_histogram() only requires one aesthetic: x. But there is clearly a y axis on your plot, so where does it come from? Actually, there is a variable mapped to the y aesthetic; it’s called ..count... When geom_histogram() executed the binning statistic (see above), it not only cut up the data into discrete bins, but it also counted how many values are in each bin. So there is an internal data frame where this information is stored. The .. notation calls the variable count from this internal data frame. This is what appears on the y aesthetic. But it gets better! The density has also been calculated. This is the proportional frequency of this bin in relation to the whole data set, scaled by the bin width so that the bar areas sum to 1. You use ..density.. to access this information.
# 1 - Make a univariate histogram
ggplot(mtcars, aes(mpg)) +
geom_histogram()
# 2 - Plot 1, plus set binwidth to 1 in the geom layer
ggplot(mtcars, aes(mpg)) +
geom_histogram(binwidth = 1)
# 3 - Plot 2, plus MAP ..density.. to the y aesthetic (i.e. in a second aes() function)
ggplot(mtcars, aes(mpg)) +
geom_histogram(aes(y = ..density..), binwidth = 1)
# 4 - plot 3, plus SET the fill attribute to "#377EB8"
ggplot(mtcars, aes(mpg)) +
geom_histogram(aes(y = ..density..), binwidth = 1, fill = "#377EB8")
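Newer ggplot2 releases offer after_stat() as an alternative spelling of the double-dot notation. The sketch below assumes ggplot2 3.3.0 or later and uses layer_data() to peek at the internal data frame described above; the object names p and bins are only illustrative.

```r
library(ggplot2)

# after_stat(density) refers to the same computed variable as ..density..
p <- ggplot(mtcars, aes(mpg)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "#377EB8")

# Inspect the internal data frame produced by the binning statistic
bins <- layer_data(p)
head(bins[, c("xmin", "xmax", "count", "density")])

# Density is count / (total count * bin width), so the bar areas sum to 1
sum(bins$density * (bins$xmax - bins$xmin))
```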
In the previous chapter you saw that there are lots of ways to position scatter plots. Likewise, the geom_bar() and geom_histogram() geoms also have a position argument, which you can use to specify how to draw the bars of the plot.
Three position arguments will be introduced here:
stack: place the bars on top of each other. Counts are used. This is the default position.
fill: place the bars on top of each other, but this time use proportions.
dodge: place the bars next to each other. Counts are used.
In this exercise you’ll draw the total count of cars having a given number of cylinders (cyl), according to manual or automatic transmission type (am) - as shown in the viewer.
Since, in the built-in mtcars data set, cyl and am are numeric, they have already been converted to factor variables for you.
# Draw a bar plot of cyl, filled according to am
ggplot(mtcars, aes(x = cyl, fill = factor(am))) +
geom_bar()
# Change the position argument to "stack"
ggplot(mtcars, aes(x = cyl, fill = factor(am))) +
geom_bar(position = "stack")
# Change the position argument to "fill"
ggplot(mtcars, aes(x = cyl, fill = factor(am))) +
geom_bar(position = "fill")
# Change the position argument to "dodge"
ggplot(mtcars, aes(x = cyl, fill = factor(am))) +
geom_bar(position = "dodge")
So far you’ve seen three different positions for bar plots: stack (the default), dodge (preferred), and fill (to show proportions).
However, you can go one step further by adjusting the dodging, so that your bars partially overlap each other. For this example you’ll again use the mtcars dataset. Like last time cyl and am are already available as factors inside mtcars.
Instead of using position = “dodge” you’re going to use position_dodge(), like you did with position_jitter() in the Scatter plots and jittering (1) exercise. Here, you’ll save this as an object, posn_d, so that you can easily reuse it.
Remember, the reason you want to use position_dodge() (and position_jitter()) is to specify how much dodging (or jittering) you want.
# 1 - The last plot from the previous exercise
ggplot(mtcars, aes(x = cyl, fill = factor(am))) +
geom_bar(position = "dodge")
# 2 - Define posn_d with position_dodge()
posn_d <- position_dodge(0.2)
# 3 - Change the position argument to posn_d
ggplot(mtcars, aes(x = cyl, fill = factor(am))) +
geom_bar(position = posn_d)
# 4 - Use posn_d as position and adjust alpha to 0.6
ggplot(mtcars, aes(x = cyl, fill = factor(am))) +
geom_bar(position = posn_d, alpha = 0.6)
Overlapping histograms pose similar problems to overlapping bar plots, but there is a unique solution here: a frequency polygon.
This is a geom specific to binned data that draws a line connecting the value of each bin. Like geom_histogram(), it takes a binwidth argument and by default stat = “bin” and position = “identity”.
# A basic histogram, add coloring defined by cyl
ggplot(mtcars, aes(mpg, fill = factor(cyl))) +
geom_histogram(binwidth = 1)
# Change position to identity
ggplot(mtcars, aes(mpg, fill = factor(cyl))) +
geom_histogram(binwidth = 1, position = "identity")
# Change geom to freqpoly (position is identity by default)
ggplot(mtcars, aes(mpg, color = factor(cyl))) +
geom_freqpoly(binwidth = 1)
As a last example of bar plots, you’ll return to histograms (which you now see are just a special type of bar plot). You saw a nice trick in a previous exercise of how to slightly overlap bars, but now you’ll see how to overlap them completely. This would be nice for multiple histograms, as long as there are not too many different overlaps!
You’ll make a histogram using the mpg variable in the mtcars data frame.
# 1 - Basic histogram plot command
ggplot(mtcars, aes(mpg)) +
geom_histogram(binwidth = 1)
# 2 - Plot 1, Expand aesthetics: am onto fill
ggplot(mtcars, aes(mpg, fill = factor(am))) +
geom_histogram(binwidth = 1)
# 3 - Plot 2, change position = "dodge"
ggplot(mtcars, aes(mpg, fill = factor(am))) +
geom_histogram(binwidth = 1, position = "dodge")
# 4 - Plot 3, change position = "fill"
ggplot(mtcars, aes(mpg, fill = factor(am))) +
geom_histogram(binwidth = 1, position = "fill")
# 5 - Plot 4, plus change position = "identity" and alpha = 0.4
ggplot(mtcars, aes(mpg, fill = factor(am))) +
geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)
# 6 - Plot 5, plus change mapping: cyl onto fill
ggplot(mtcars, aes(mpg, fill = factor(cyl))) +
geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)
In this example of a bar plot, you’ll fill each segment according to an ordinal variable. The best way to do that is with a sequential color series.
You’ll be using the Vocab dataset from earlier. Since this is a much larger dataset with more categories, you’ll also compare it to a simpler dataset, mtcars. Both datasets contain ordinal variables.
# Example of how to use a brewed color palette
ggplot(mtcars, aes(x = cyl, fill = factor(am))) +
geom_bar() +
scale_fill_brewer(palette = "Set1")
# Use str() on Vocab to check out the structure
# str(Vocab)
# Plot education on x and vocabulary on fill
# Use the default brewed color palette
ggplot(Vocab, aes(x = education, fill = factor(vocabulary))) +
geom_bar(position = "fill") +
scale_fill_brewer()
In the previous exercise, you ended up with an incomplete bar plot. This was because the default RColorBrewer palette that scale_fill_brewer() uses is the sequential “Blues” palette. There are only 9 colours in the palette, and since you have 11 categories, your plot looked strange.
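You can check the size limit of a palette before plotting. This small sketch assumes the RColorBrewer package is installed and reads its brewer.pal.info table; n_categories simply restates the 11 categories mentioned above.

```r
library(RColorBrewer)

# brewer.pal.info lists every ColorBrewer palette and its maximum size
max_blues <- brewer.pal.info["Blues", "maxcolors"]
max_blues

# Number of fill categories in the bar plot, as stated in the text
n_categories <- 11

# TRUE means the palette runs out of colours before the categories do
n_categories > max_blues
```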
In this exercise, you’ll manually create a color palette that can generate all the colours you need. To do this you’ll use a function called colorRampPalette().
The input is a character vector of 2 or more colour values, e.g. “#FFFFFF” (white) and “#0000FF” (pure blue). (See the earlier discussion of hexadecimal colours.)
The output is itself a function! So when you assign it to an object, that object should be used as a function. To see what we mean, execute the following three lines in the console:
new_col <- colorRampPalette(c("#FFFFFF", "#0000FF"))
new_col(4) # the newly interpolated colours
[1] "#FFFFFF" "#AAAAFF" "#5555FF" "#0000FF"
munsell::plot_hex(new_col(4)) # Quick and dirty plot
new_col() is a function that takes one argument: the number of colours you want to interpolate. You want to use nicer colours, so we’ve assigned the entire “Blues” colour palette from the RColorBrewer package to the character vector blues.
# Final plot of last exercise
ggplot(Vocab, aes(x = education, fill = factor(vocabulary))) +
geom_bar(position = "fill") +
scale_fill_brewer()
# Definition of a set of blue colors
library(RColorBrewer)
blues <- brewer.pal(9, "Blues")
# Make a color range using colorRampPalette() and the set of blues
blue_range <- colorRampPalette(blues)
# Use blue_range to adjust the color of the bars, use scale_fill_manual()
ggplot(Vocab, aes(x = education, fill = factor(vocabulary))) +
geom_bar(position = "fill") +
scale_fill_manual(values = blue_range(11))
In the video you saw how to make line plots using time series data. To explore this topic, you’ll use the economics data frame, which contains time series for unemployment and population statistics from the Federal Reserve Bank of St. Louis in the US. The data is contained in the ggplot2 package.
To begin with, you can look at how the median unemployment time and the unemployment rate (the number of unemployed people as a proportion of the population) change over time.
In the next exercises, you’ll explore how to add embellishments to the line plots, such as recession periods.
# Print out head of economics
# head(economics)
# Plot unemploy as a function of date using a line plot
ggplot(economics, aes(x = date, y = unemploy)) +
geom_line()
# Adjust plot to represent the fraction of total population that is unemployed
ggplot(economics, aes(x = date, y = unemploy/pop)) +
geom_line()
By themselves, time series often contain enough valuable information, but you always want to maximize the number of variables you can show in a plot. This allows you (and your viewers) to begin making comparisons between those variables that would otherwise be difficult or impossible.
Here, you’ll add shaded regions to the background to indicate recession periods. How do unemployment rate and recession period interact with each other?
In addition to the economics dataset from before, you’ll also use the recess dataset for the periods of recession. The recess data frame contains 2 variables: the begin period of the recession and the end. It’s already available in your workspace.
# Basic line plot
ggplot(economics, aes(x = date, y = unemploy/pop)) +
geom_line()
recess <- read.csv2("../data/recess.csv")
names(recess) <- c("begin","end")
recess$begin <- as.Date(recess$begin, format = c("%d/%m/%Y"))
recess$end <- as.Date(recess$end, format = c("%d/%m/%Y"))
# Expand the following command with geom_rect() to draw the recess periods
ggplot(economics, aes(x = date, y = unemploy/pop)) +
geom_rect(data = recess,
aes(xmin = begin, xmax = end, ymin = -Inf, ymax = +Inf),
inherit.aes = FALSE, fill = "red", alpha = 0.2) +
geom_line()
In the data chapter we discussed how the form of your data affects how you can plot it. Here, you’ll explore that topic in the context of multiple time series.
The dataset you’ll use contains the global capture rates of seven salmon species from 1950 - 2010.
In your workspace, the following dataset is available:
fish.species: Each variable (column) is a Salmon Species and each observation (row) is one Year.
To get a multiple time series plot, however, both Year and Species should be in their own column. You need tidy data: one variable per column. Once you have that you can get the plot shown in the viewer by mapping Year to the x aesthetic and Species to the color aesthetic.
You’ll use the gather() function of the tidyr package, which is already loaded for you.
fish.species <- read.csv2("../data/fish.data.csv")
names(fish.species) <- c("Year","Pink","Chum","Sockeye","Coho","Rainbow",
"Chinook","Atlantic")
library(tidyr)
# Use gather to go from fish.species to fish.tidy
fish.tidy <- gather(fish.species, Species, Capture, -Year)
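If you are working with a recent tidyr, pivot_longer() is the newer interface for the same reshaping. The sketch below assumes tidyr 1.0.0 or later and uses a tiny stand-in data frame, toy.species, whose capture numbers are made up purely for illustration.

```r
library(tidyr)

# Toy stand-in for fish.species; the capture numbers are made up
toy.species <- data.frame(Year = c(1950, 1951),
                          Pink = c(100, 110),
                          Chum = c(200, 190))

# Older interface, as used above
toy.gathered <- gather(toy.species, Species, Capture, -Year)

# Newer interface: every column except Year becomes a Species/Capture pair
toy.long <- pivot_longer(toy.species, cols = -Year,
                         names_to = "Species", values_to = "Capture")

toy.gathered
toy.long
```

Both calls return one row per Year and Species combination; only the row order differs, which does not matter for plotting.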
Now that you have tidy data, you’re ready to make your plot! The data frame fish.tidy is already available in the workspace, so you can start right away!
# Recreate the plot shown on the right
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
geom_line()
# A scatter plot with an ordinary least squares linear model
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm")
# A scatter plot with LOESS smooth
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth()
# The previous plot, without CI ribbon
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
# The previous plot, without points
ggplot(mtcars, aes(x = wt, y = mpg)) +
stat_smooth(method = "lm", se = FALSE)
# 1 - Define cyl as a factor variable
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
geom_point() +
stat_smooth(method = "lm", se = FALSE)
# 2 - Plot 1, plus another stat_smooth() containing a nested aes()
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
geom_point() +
stat_smooth(method = "lm", se = FALSE) +
stat_smooth(method = "lm", se = FALSE, aes(group = 1))
In the previous exercise we used se = FALSE in stat_smooth() to remove the 95% Confidence Interval. Here we’ll consider another argument, span, used in LOESS smoothing, and we’ll take a look at a nice scenario of properly mapping different models.
Recall that LOESS smoothing is a non-parametric form of regression that uses a weighted, sliding-window, average to calculate a line of best fit. We can control the size of this window with the span argument.
# Plot 1: change the LOESS span
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
# Add span below
geom_smooth(se = FALSE, span = 0.7)
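To get a feel for what span does, it can help to overlay two LOESS fits with different window sizes. This is only a sketch, assuming ggplot2 is installed; the span values 0.5 and 1 and the legend labels are arbitrary choices for illustration. A smaller span follows the points more closely, while a larger span gives a smoother curve.

```r
library(ggplot2)

# Two LOESS fits on the same data; the colour labels only feed the legend
span_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, span = 0.5,
              aes(col = "span = 0.5")) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, span = 1,
              aes(col = "span = 1"))
span_plot
```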
In this plot, we set a linear model for the entire dataset as well as each subgroup, defined by cyl. In the second stat_smooth(), set method to “loess” and add a span of 0.7.
# Plot 2: Set the second stat_smooth() to use LOESS with a span of 0.7
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
geom_point() +
stat_smooth(method = "lm", se = FALSE) +
# Change method and add span below
stat_smooth(method = "loess", aes(group = 1),
se = FALSE, col = "black", span = 0.7)
Plot 2 presents a problem because there is a black line on our plot that is not included in the legend. To get this, we need to map something to col as an aesthetic, not just set col as an attribute.
# Plot 3: Set col to "All", inside the aes layer of stat_smooth()
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
geom_point() +
stat_smooth(method = "lm", se = FALSE) +
stat_smooth(method = "loess",
# Add col inside aes()
aes(group = 1, col = "All"),
# Remove the col argument below
se = FALSE, span = 0.7)
Now we should see our “All” model in the legend, but it’s not black anymore.
library(RColorBrewer)
# Plot 4: Add scale_color_manual to change the colors
myColors <- c(brewer.pal(3, "Dark2"), "black")
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
geom_point() +
stat_smooth(method = "lm", se = FALSE) +
stat_smooth(method = "loess",
aes(group = 1, col="All"),
se = FALSE, span = 0.7) +
# Add correct arguments to scale_color_manual
scale_color_manual("Cylinders", values = myColors)
This code produces a jittered plot of vocabulary against education, variables from the Vocab data frame.
# Plot 1: Jittered scatter plot, add a linear model (lm) smooth
ggplot(Vocab, aes(x = education, y = vocabulary)) +
geom_jitter(alpha = 0.2) +
stat_smooth(method = "lm", se = FALSE) # smooth
Color by year.
# Plot 2: points, colored by year
ggplot(Vocab, aes(x = education, y = vocabulary, col = year)) +
geom_jitter(alpha = 0.2)
We need to specify year as a factor variable if we want to use it as a grouping variable for our linear models. Add the col = factor(year) aesthetic to the nested ggplot(aes()) function.
# Plot 3: lm, colored by year
ggplot(Vocab, aes(x = education, y = vocabulary, col = factor(year))) +
stat_smooth(method = "lm", se = FALSE) # smooth
Years are ordered, so use a sequential color palette.
# Plot 4: Set a color brewer palette
ggplot(Vocab, aes(x = education, y = vocabulary, col = factor(year))) +
stat_smooth(method = "lm", se = FALSE) + # smooth
scale_color_brewer() # colors
To get the proper colors, we can use col = year, because the variable year is numeric and we want a continuous scale. However, we’ll need to specify the invisible group aesthetic so that our linear models are still calculated appropriately. The scale layer, scale_color_gradientn(), has been provided for you - this allows us to map a continuous variable onto a colour scale.
# Plot 5: Add the group aes, specify alpha and size
ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
stat_smooth(method = "lm", se = FALSE, alpha = 0.6, size = 2) +
scale_color_gradientn(colors = brewer.pal(9, "YlOrRd"))
The previous example used the Vocab dataset and applied linear models describing vocabulary by education for different years. Here we’ll continue with that example by using stat_quantile() to apply a quantile regression (method rq).
By default, the 1st, 2nd (i.e. median), and 3rd quartiles are modeled as a response to the predictor variable, in this case education. Specific quantiles can be specified with the quantiles argument.
If you want to specify many quantiles and color according to year, then things get too busy. We’ll explore ways of dealing with this in the next chapter.
Update the plotting code.
# Use stat_quantile instead of stat_smooth
ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
stat_quantile(alpha = 0.6, size = 2) +
scale_color_gradientn(colors = brewer.pal(9,"YlOrRd"))
The resulting plot will be a mess, because there are three quartiles drawn by default.
# Set quantile to 0.5
ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
stat_quantile(alpha = 0.6, size = 2, quantiles = 0.5) +
scale_color_gradientn(colors = brewer.pal(9,"YlOrRd"))
Another useful stat function is stat_sum(). This function calculates the total number of overlapping observations and is another good way to deal with overplotting.
# Plot 1: Jittering only
p <- ggplot(Vocab, aes(x = education, y = vocabulary)) +
geom_jitter(alpha = 0.2)
p
Add stat_sum() to this plotting object p. This maps the overall count of each dot onto size. You don’t have to set any arguments; the aesthetics will be inherited from the base plot!
# Plot 2: Add stat_sum
p +
stat_sum() # sum statistic
Add the size scale with the generic scale_size() function. Use range to set the minimum and maximum dot sizes as c(1,10).
# Plot 3: Set size range
p +
stat_sum() + # sum statistic
scale_size(range = c(1, 10)) # set size scale
# Plot with linear and loess model
p <- ggplot(Vocab, aes(x = education, y = vocabulary)) +
stat_smooth(method = "loess", aes(col = "red"), se = F) +
stat_smooth(method = "lm", aes(col = "blue"), se = F) +
scale_color_discrete("Model", labels = c("red" = "LOESS", "blue" = "lm"))
# Add stat_sum
p + stat_sum()
# Add stat_sum and set size range
p + stat_sum() + scale_size(range = c(1,10))
Here we’ll look at stat_summary() in action. We’ll build up various plots one-by-one.
In this exercise we’ll consider the preparations. That means we’ll make sure the data is in the right format and that all the positions that we might use in our plots are defined. Lastly, we’ll set the base layer for our plot. ggplot2 is already loaded, so you can get started straight away!
Let’s prepare the data.
# Play vector of values for testing the summary functions
set.seed(123)
xx <- rnorm(100)
# Convert cyl and am to factors
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)
# Define positions
posn.d <- position_dodge(0.1)
posn.jd <- position_jitterdodge(jitter.width = 0.1, dodge.width = 0.2)
posn.j <- position_jitter(0.2)
# Base layers
wt.cyl.am <- ggplot(mtcars, aes(x = cyl,
y = wt,
col = am,
fill = am,
group = am))
Now that the preparation work is done, let’s have a look at stat_summary().
ggplot2 is already loaded, as is wt.cyl.am, which is defined as
wt.cyl.am <- ggplot(mtcars, aes(x = cyl, y = wt, col = am, fill = am, group = am))
Also all the position objects of the previous exercise, posn.d, posn.jd and posn.j, are available. For starters, Plot 1 is already coded for you.
# Plot 1: Jittered, dodged scatter plot with transparent points
wt.cyl.am +
geom_point(position = posn.jd, alpha = 0.6)
Add a stat_summary() layer to wt.cyl.am and calculate the mean and standard deviation as we did in the video: set fun.data to mean_sdl and specify fun.args to be list(mult = 1). Set the position argument to posn.d.
# Plot 2: Mean and SD - the easy way
wt.cyl.am +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), position = posn.d)
Repeat the previous plot, but use the 95% confidence interval instead of the standard deviation. You can use mean_cl_normal instead of mean_sdl this time. There’s no need to specify fun.args in this case. Again, set position to posn.d.
# # Plot 3: Mean and 95% CI - the easy way
# wt.cyl.am +
# stat_summary(fun.data = mean_cl_normal, position = posn.d)
# Plot 4: Mean and SD - with T-tipped error bars
wt.cyl.am +
stat_summary(geom = "point", fun.y = mean,
position = posn.d) +
stat_summary(geom = "errorbar", fun.data = mean_sdl,
position = posn.d, fun.args = list(mult = 1), width = 0.1)
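The two kinds of interval used in Plots 2 to 4 can be reproduced by hand. The sketch below is plain base R and works on all car weights at once rather than per group; it assumes that mean_cl_normal uses the usual t-based interval, as Hmisc::smean.cl.normal() does.

```r
# Base R sketch of mean +/- 1 SD and a t-based 95% confidence interval
wt <- mtcars$wt
n  <- length(wt)
m  <- mean(wt)
s  <- sd(wt)

# Mean +/- 1 standard deviation (what mean_sdl with mult = 1 reports)
sd_limits <- c(ymin = m - s, ymax = m + s)

# 95% CI from the t distribution with n - 1 degrees of freedom
half_width <- qt(0.975, df = n - 1) * s / sqrt(n)
ci_limits  <- c(ymin = m - half_width, ymax = m + half_width)

sd_limits
ci_limits
```

The confidence interval describes the uncertainty of the mean and shrinks with the sample size, whereas the standard deviation describes the spread of the data, so the two error bars answer different questions.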
In the video we saw that the only difference between ggplot2::mean_sdl() and Hmisc::smean.sdl() is the naming convention. In order to use the results of a function directly in ggplot2 we need to ensure that the names of the variables match the aesthetics needed for our respective geoms.
Here we’ll create two new functions in order to create the plot shown in the viewer. One function will measure the full range of the dataset and the other will measure the interquartile range.
A play vector, xx, has been created for you. Execute
mean_sdl(xx, mult = 1)
y ymin ymax
1 0.09040591 -0.82241 1.003222
in the R Console and consider the format of the output. You’ll have to produce functions which return similar outputs.
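The naming point can be made concrete with a small wrapper. In the sketch below, mean_sd_hmisc_style() is a hypothetical stand-in that mimics the Mean/Lower/Upper names used by Hmisc, and as_gg_summary() is an equally hypothetical helper that renames the result to y, ymin and ymax.

```r
# Hypothetical summary function using Hmisc-style names
mean_sd_hmisc_style <- function(x) {
  c(Mean = mean(x), Lower = mean(x) - sd(x), Upper = mean(x) + sd(x))
}

# Hypothetical wrapper that renames the result to the aesthetics
# expected by geoms such as errorbar or pointrange
as_gg_summary <- function(x) {
  s <- mean_sd_hmisc_style(x)
  data.frame(y = s[["Mean"]], ymin = s[["Lower"]], ymax = s[["Upper"]])
}

# For c(2, 4, 6) the mean is 4 and the SD is 2
as_gg_summary(c(2, 4, 6))
```

A function shaped like as_gg_summary() could be passed to fun.data, because its output has the same column names as mean_sdl().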
# Function to save range for use in ggplot
gg_range <- function(x) {
# Change x below to return the instructed values
data.frame(ymin = min(x), # Min
ymax = max(x)) # Max
}
gg_range(xx)
ymin ymax
1 -2.309169 2.187333
# Required output in the exercise (these values match xx <- 1:100, not the rnorm() vector above)
# ymin ymax
# 1 1 100
Creating and checking the function:
# Custom function returning the median and interquartile range for use in ggplot
med_IQR <- function(x) {
# Change x below to return the instructed values
data.frame(y = median(x), # Median
ymin = quantile(x)[2], # 1st quartile
ymax = quantile(x)[4]) # 3rd quartile
}
med_IQR(xx)
y ymin ymax
25% 0.06175631 -0.4938542 0.6918192
# Required output in the exercise (these values match xx <- 1:100, not the rnorm() vector above)
# y ymin ymax
# 25% 50.5 25.75 75.25
In the last exercise we created functions that will allow us to plot the so-called five-number summary (the minimum, 1st quartile, median, 3rd quartile, and the maximum). Here, we’ll implement that into a unique plot type.
All the functions and objects from the previous exercise are available including the updated mtcars data frame, the position object posn.d, the base layers wt.cyl.am and the functions med_IQR() and gg_range().
The plot you’ll end up with at the end of this exercise is shown on the right. When using stat_summary() recall that the fun.data argument requires a properly labelled 3-element long vector, which we saw in the previous exercises. The fun.y argument requires only a 1-element long vector.
# The base ggplot command; you don't have to change this
wt.cyl.am <- ggplot(mtcars, aes(x = cyl,y = wt, col = am, fill = am, group = am))
Add three stat_summary calls to wt.cyl.am:
# Add three stat_summary calls to wt.cyl.am
wt.cyl.am +
stat_summary(geom = "linerange", fun.data = med_IQR,
position = posn.d, size = 3) +
stat_summary(geom = "linerange", fun.data = gg_range,
position = posn.d, size = 3,
alpha = 0.4) +
stat_summary(geom = "point", fun.y = median,
position = posn.d, size = 3,
col = "black", shape = "X")
Complete the given stat_summary() functions, don’t change the predefined arguments:
The first stat_summary() layer should have geom set to “linerange”. The fun.data argument should be set to med_IQR, the function you used in the previous exercise.
The second stat_summary() layer also uses the “linerange” geom. This time fun.data should be gg_range, the other function you created. Also set alpha = 0.4.
For the last stat_summary() layer, use geom = “point”. The points should have col “black” and shape “X”.
It helps to control the plot dimensions:
In the video, you saw different ways of using the coordinates layer to zoom in. In this exercise, we’ll compare some of the techniques again.
As usual, you’ll be working with the mtcars dataset, which is already cleaned up for you (cyl and am are categorical variables). Also p, a ggplot object you coded in the previous chapter, is already available. Execute p in the console to check it out.
# Basic ggplot() command, coded for you
p <- ggplot(mtcars, aes(x = wt, y = hp, col = am)) +
geom_point() + geom_smooth()
p
Extend p with a scale_x_continuous() with limits = c(3, 6) and expand = c(0, 0). What do you see?
# Add scale_x_continuous()
p +
scale_x_continuous(limits = c(3, 6), expand = c(0, 0))
Try again, this time with coord_cartesian(): Set the xlim argument equal to c(3, 6). Compare the two plots.
# Add coord_cartesian(): the proper way to zoom in
p +
coord_cartesian(xlim = c(3, 6))
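The difference between the two approaches is easier to see in the layer data than in the plots. The sketch below leaves out the smoother to keep things simple and counts how many points survive in each case; the object names are made up for this illustration.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = hp)) +
  geom_point()

# Setting limits on the scale turns out-of-range values into NA
with_scale <- base_plot + scale_x_continuous(limits = c(3, 6))
# Zooming with the coordinate system leaves the data untouched
with_coord <- base_plot + coord_cartesian(xlim = c(3, 6))

n_scale <- sum(!is.na(layer_data(with_scale)$x))
n_coord <- sum(!is.na(layer_data(with_coord)$x))

c(scale_limits = n_scale, coord_zoom = n_coord)
```

Because the scale version discards observations before any statistics are computed, a smoother added to it is fitted to fewer points, which is why coord_cartesian() is described as the proper way to zoom in.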
We can set the aspect ratio of a plot with coord_fixed() or coord_equal(). Both use ratio = 1 as a default. A 1:1 aspect ratio is most appropriate when two continuous variables are on the same scale, as with the iris dataset.
All variables are measured in centimeters, so one unit on the plot should cover the same physical distance on each axis. This gives a more truthful depiction of the relationship between the two variables: a different aspect ratio changes the angle of the smoothing line and can give an erroneous impression of the data.
Of course the underlying linear models don’t change, but our perception can be influenced by the angle drawn.
# Complete basic scatter plot function
base.plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
geom_jitter() +
geom_smooth(method = "lm", se = FALSE)
# Plot base.plot: default aspect ratio
base.plot
# Fix aspect ratio (1:1) of base.plot
base.plot + coord_equal()
# or
base.plot + coord_fixed()
The coord_polar() function converts a planar x-y Cartesian plot to polar coordinates. This can be useful if you are producing pie charts.
We can imagine two forms for pie charts - the typical filled circle, or a colored ring.
As an example, consider the stacked bar chart shown in the viewer. Imagine that we just take the y axis on the left and bend it until it loops back on itself, while expanding the right side as we go along. We’d end up with a pie chart - it’s simply a bar chart transformed onto a polar coordinate system.
Typical pie charts omit all of the non-data ink, which we’ll learn about in the next chapter. Pie charts are not really better than stacked bar charts, but we’ll come back to this point in the fourth chapter on best practices.
The mtcars data frame is available, with cyl converted to a factor for you.
# Create a stacked bar plot: wide.bar
wide.bar <- ggplot(mtcars, aes(x = 1, fill = cyl)) +
geom_bar()
wide.bar
# Convert wide.bar to pie chart
wide.bar +
coord_polar(theta = "y")
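Since the pie chart is just the stacked bar bent into a circle, the angle of each slice follows directly from the bar heights. A minimal base R sketch of that arithmetic (the object names are made up for this illustration):

```r
# Counts per cylinder class, i.e. the bar heights in the stacked bar
cyl_counts <- table(mtcars$cyl)

# Share of the full circle that each slice covers, in degrees
slice_angles <- 360 * cyl_counts / sum(cyl_counts)
slice_angles
```

With theta = "y", the count axis is the one mapped onto the angle, so these values correspond to the slices in the pie chart.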
# Create stacked bar plot: thin.bar
thin.bar <- ggplot(mtcars, aes(x = 1, fill = cyl)) +
geom_bar(width = .1) +
scale_x_continuous(limits = c(0.5,1.5))
thin.bar
# Convert thin.bar to "ring" type pie chart
thin.bar +
coord_polar(theta = "y")
The most straightforward way of using facets is facet_grid(). Here we just need to specify the categorical variable to use on rows and columns using standard R formula notation (rows ~ columns).
Notice that we can also take advantage of ordinal variables by positioning them in the correct order as columns or rows, as is the case with the number of cylinders. Get some hands-on practice in this exercise; ggplot2 is already loaded for you and mtcars is available. The variables cyl and am are factors. However, this is not necessary for facets; ggplot2 will coerce variables to factors in this case.
# Basic scatter plot
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point()
p
# 1 - Separate rows according to transmission type, am
p +
facet_grid(am ~ .)
# 2 - Separate columns according to cylinders, cyl
p +
facet_grid(. ~ cyl)
# 3 - Separate by both columns and rows
p +
facet_grid(am ~ cyl)
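Besides the formula notation, recent versions of ggplot2 (3.0.0 and later) also accept the rows and cols arguments together with vars(). The sketch below should produce the same layout as plot 3; p_facets is a name chosen for this illustration.

```r
library(ggplot2)

p_facets <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Same layout as facet_grid(am ~ cyl), written with vars()
  facet_grid(rows = vars(am), cols = vars(cyl))

p_facets
```

Either notation works; the formula version is shorter, while the vars() version makes it explicit which variable goes on rows and which on columns.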
Facets are another way of presenting categorical variables. Recall that we saw all the ways of combining variables, both categorical and continuous, in the aesthetics chapter. Sometimes it’s possible to overdo it. Here we’ll present a plot with 6 variables and see if we can add even more.
Let’s begin by using a trick to map two variables onto two color scales - hue and lightness. We combine cyl and am into a single variable cyl_am. To accommodate this we also make a new color palette with alternating red and blue of increasing darkness. This is saved as myCol. If you are not familiar with these steps, execute the code piece-by-piece.
# Code to create the cyl_am col and myCol vector
mtcars$cyl_am <- paste(mtcars$cyl, mtcars$am, sep = "_")
myCol <- rbind(brewer.pal(9, "Blues")[c(3,6,8)],
brewer.pal(9, "Reds")[c(3,6,8)])
# Map cyl_am onto col
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl_am)) +
geom_point() +
# Add a manual colour scale
scale_color_manual(values = myCol)
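The alternating palette works because of how R stores matrices. The sketch below uses stand-in labels instead of real colours so that the ordering is visible; blues, reds and interleaved are names made up for this illustration.

```r
# Stand-in labels instead of real colours, to make the ordering visible
blues <- c("blue_light", "blue_mid", "blue_dark")
reds  <- c("red_light", "red_mid", "red_dark")

# rbind() gives a 2 x 3 matrix; R stores matrices column by column,
# so flattening it interleaves the two palettes
interleaved <- as.vector(rbind(blues, reds))
interleaved
```

The levels of cyl_am sort as 4_0, 4_1, 6_0, 6_1, 8_0, 8_1, so each cylinder class gets one lightness step and the two transmission types get the blue and red hue within it.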
# Grid facet on gear vs. vs
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl_am)) +
geom_point() +
scale_color_manual(values = myCol) +
facet_grid(gear ~ vs)
# Also map disp to size
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl_am, size = disp)) +
geom_point() +
scale_color_manual(values = myCol) +
facet_grid(gear ~ vs)
When you have a categorical variable with many levels which are not all present in each sub-group of another variable, it may be desirable to drop the unused levels. As an example let’s return to the mammalian sleep dataset. The code below uses msleep, the version that ships with ggplot2.
The variables of interest here are name, which contains the full popular name of each animal, and vore, the eating behavior. Each animal can only be classified under one eating habit, so if we facet according to vore, we don’t need to repeat the full list in each sub-plot.
# Basic scatter plot
p <- ggplot(msleep, aes(x = sleep_total, y = name, col = conservation)) +
geom_point()
# Execute to display plot
p
# Facet rows according to vore
p +
facet_grid(vore ~ .)
# Specify scales and space arguments to free up rows
p +
facet_grid(vore ~ ., scales = "free_y", space = "free_y")
To understand all the arguments for the themes, you’ll modify an existing plot over the next series of exercises.
Here you’ll focus on the rectangles of the plotting object z that has already been created for you. If you type z in the console, you can check it out. The goal is to turn z into the plot in the viewer. Do this by following the instructions step by step.
z <- ggplot(mtcars, aes(wt, mpg, col = cyl)) +
geom_point() +
stat_smooth(method = "lm", se = F) +
facet_grid(. ~ cyl)
z
myPink <- "#FEE0D2"
# Plot 1: Change the plot background fill to myPink
z +
theme(plot.background = element_rect(fill = myPink))
# Plot 2: Adjust the border to be a black line of size 3
z +
theme(plot.background = element_rect(fill = myPink, color = "black", size = 3)) # expanded from plot 1
# Theme to remove all rectangles
no_panels <- theme(rect = element_blank())
# Plot 3: Combine custom themes
z +
no_panels +
theme(plot.background = element_rect(fill = myPink, color = "black", size = 3)) # from plot 2
To change the appearance of lines use the element_line() function.
The plot you created in the last exercise, with the fancy pink background, is available as the plotting object z. Your goal is to produce the plot in the viewer - no grid lines, but red axes and tick marks.
For each of the arguments that specify lines, use element_line() to modify attributes. e.g. element_line(color = “red”).
Remember, to remove a non-data element, use element_blank().
# Extend z with theme() function and 3 args
z +
theme(panel.grid = element_blank(),
axis.line = element_line(color = "red"),
axis.ticks = element_line(color = "red"))
Next we can make the text on your plot prettier and easier to spot. You can do this through the element_text() function and by passing the appropriate arguments inside the theme() function.
As before, the plot you’ve created in the previous exercise is available as z. The plot you should end up with after successfully completing this exercise is shown in the viewer.
# Original plot, color provided
z
# Extend z with theme() function and 3 args
myRed <- "#99000D"
z +
theme(strip.text = element_text(size = 16, color = myRed),
axis.title = element_text(color = myRed, hjust = 0, face = "italic"),
axis.text = element_text(color = "black"))
The themes layer also allows you to specify the appearance and location of legends.
The plot you’ve coded up to now is available as z. It’s also displayed in the viewer. Follow the instructions and compare the resulting plots with the plot you started with.
# Move legend by position
z +
theme(legend.position = c(0.85, 0.85))
# Change direction
z +
theme(legend.direction = "horizontal")
# Change location by name
z +
theme(legend.position = "bottom")
# Remove legend entirely
z +
theme(legend.position = "none")
The different rectangles of your plot have spacing between them. There’s spacing between the facets, between the axis labels and the plot rectangle, between the plot rectangle and the entire panel background, etc. Let’s experiment!
The last plot you created in the previous exercise, without a legend, is available as z.
# Increase spacing between facets
library(grid)
z + theme(panel.spacing.x = unit(2, "cm"))
# Adjust the plot margin
z + theme(panel.spacing.x = unit(2, "cm"),
plot.margin = unit(c(1,2,1,1), "cm"))
There are many themes available by default in ggplot2: theme_bw(), theme_classic(), theme_gray(), etc. In the previous exercise, you saw that you can apply these themes to all following plots, with theme_set():
theme_set(theme_bw())
But you can also apply them on an individual plot, with:
... + theme_bw()
You can also extend these themes with your own modifications. In this exercise, you’ll experiment with this and use some preset templates available from the ggthemes package. The workspace already contains the same basic plot from before under the name z2.
# Original plot
z2 <- ggplot(mtcars, aes(wt, mpg, col = cyl)) +
geom_point() +
stat_smooth(method = "lm", se = F) +
facet_grid(. ~ cyl)
z2
# Load ggthemes
library(ggthemes)
# Apply theme_tufte, plot additional modifications
custom_theme <- theme_tufte() +
theme(legend.position = c(0.9, 0.9),
legend.title = element_text(face = "italic", size = 12),
axis.title = element_text(face = "bold", size = 14))
# Draw the customized plot
z2 + custom_theme
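When a theme is applied globally with theme_set(), it is often useful to be able to undo that later. theme_set() returns the previously active theme, so a pattern along the following lines can be used; old_theme is a name chosen for this sketch.

```r
library(ggplot2)

# theme_set() returns the theme that was active before the call
old_theme <- theme_set(theme_bw())

# ... plots created here use theme_bw() by default ...

# Restore the previous default when done
theme_set(old_theme)
```

Adding a theme to a single plot with + does not touch the global default, so this save-and-restore step is only needed after theme_set().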
In the video we saw why “dynamite plots” (bar plots with error bars) are not well suited for their intended purpose of depicting distributions. If you really want error bars on bar plots, you can still get that. However, you’ll need to set the positions manually. A point geom will typically serve you much better.
We saw an example of a dynamite plot earlier in this course. Let’s return to that code and make sure you know how to handle it. We’ll use the mtcars dataset for examples. The first part of this exercise will just be a refresher, then we’ll get into some details.
# Base layers
m <- ggplot(mtcars, aes(x = cyl, y = wt))
# Draw dynamite plot
m +
stat_summary(fun.y = mean, geom = "bar", fill = "skyblue") +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1)
In the previous exercise we used the mtcars dataset to draw a dynamite plot about the weight of the cars per cylinder type.
In this exercise we will add a distinction between transmission type, am, for the dynamite plots.
# Base layers
m <- ggplot(mtcars, aes(x = cyl,y = wt, col = am, fill = am))
# Plot 1: Draw dynamite plot
m +
stat_summary(fun.y = mean, geom = "bar") +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1)
# Plot 2: Set position dodge in each stat function
m +
stat_summary(fun.y = mean, geom = "bar", position = "dodge") +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
geom = "errorbar", width = 0.1, position = "dodge")
# Set your dodge posn manually
posn.d <- position_dodge(0.9)
# Plot 3: Redraw dynamite plot
m +
stat_summary(fun.y = mean, geom = "bar", position = posn.d) +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
geom = "errorbar", width = 0.1, position = posn.d)
If it is appropriate to use bar plots (see the video for a discussion!), then it would also be nice to give an impression of the number of values in each group.
stat_summary() doesn’t keep track of the count. stat_sum() does (that’s the whole point), but it’s difficult to access. In this case, the most straightforward thing to do is calculate exactly what we want to plot beforehand. For this exercise we’ve created a summary data frame called mtcars.cyl which contains the average (wt.avg), standard deviations (sd) and count (n) of car weights, according to cylinders, cyl. It also contains the proportion (prop) of each cylinder represented in the entire dataset. Use the console to familiarize yourself with the mtcars.cyl data frame.
# Base layers
library(dplyr, quietly = T)
mtcars.cyl <- mtcars %>%
group_by(cyl) %>%
summarise(wt.avg = mean(wt), sd = sd(wt), n = n()) %>%
mutate(prop = n / sum(n)) %>% ungroup()
m <- ggplot(mtcars.cyl, aes(x = cyl, y = wt.avg))
# Plot 1: Draw bar plot with geom_bar
m + geom_bar(stat = "identity", fill = "skyblue")
# Plot 2: Draw bar plot with geom_col
m + geom_col(fill = "skyblue")
# Plot 3: geom_col with variable widths.
m + geom_col(fill = "skyblue", width = mtcars.cyl$prop)
# Plot 4: Add error bars
m +
geom_col(fill = "skyblue", width = mtcars.cyl$prop) +
geom_errorbar(aes(ymin = wt.avg - sd, ymax = wt.avg + sd), width = 0.1)
In this example we’re going to consider a typical use of pie charts - a categorical variable as the proportion of another categorical variable. For example, the proportion of each transmission type am, in each cylinder, cyl class.
The first plotting function in the editor should be familiar to you by now. It’s a straightforward bar chart with position = “fill”, as shown in the viewer. This is already a good solution to the problem at hand! Let’s take it one step further and convert this plot into a pie chart.
# Bar chart
ggplot(mtcars, aes(x = cyl, fill = am)) +
geom_bar(position = "fill")
# Convert bar chart to pie chart
ggplot(mtcars, aes(x = factor(1), fill = am)) +
geom_bar(position = "fill", width = 1) +
facet_grid(. ~ cyl) +
coord_polar(theta = "y") +
theme_void()
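The proportions that both the filled bar chart and the pie charts display can be computed directly. The sketch below is plain base R; counts and props are names made up for this illustration.

```r
# Counts of transmission type (columns) within each cylinder class (rows)
counts <- table(cyl = mtcars$cyl, am = mtcars$am)

# margin = 1 rescales each row to sum to 1, like position = "fill"
props <- prop.table(counts, margin = 1)
round(props, 2)
```

Each row of props corresponds to one bar in the first plot and to one pie in the second.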
In the previous example, we looked at one categorical variable (am) as a proportion of another (cyl). Here, we’re interested in two or more categorical variables, independent of each other. The many pie charts in the viewer are an unsatisfactory visualization. We’re interested in the relationship between all these variables (e.g. where are 8 cylinder cars represented on the Transmission, Gear and Carburetor variables?). Perhaps we also want continuous variables, such as weight. How can we combine all this information?
The trick is to use a parallel coordinates plot, like this one. Each variable is plotted on its own parallel axis. Individual observations are connected with lines, colored according to a variable of interest. This is a surprisingly useful visualization since we can combine many variables, even if they are on entirely different scales.
A word of caution though: typically it is very taboo to draw lines in this way. It’s the reason why we don’t draw lines across levels of a nominal variable - the order, and thus the slope of the line, is meaningless. Parallel plots are a (very useful) exception to the rule!
# Parallel coordinates plot using GGally
library(GGally, quietly = T)
# All columns except am
group_by_am <- 9
my_names_am <- (1:11)[-group_by_am]
# Basic parallel plot - each variable plotted as a z-score transformation
ggparcoord(mtcars, my_names_am, groupColumn = group_by_am,
alpha = 0.8)
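The z-score transformation mentioned in the code comment is what makes variables on very different scales comparable. A minimal base R sketch of that transformation, assuming the default scaling in ggparcoord() is a plain z-score as the comment states:

```r
# Two variables on very different scales
two_vars <- mtcars[, c("mpg", "disp")]

# scale() subtracts each column's mean and divides by its standard deviation
z <- scale(two_vars)

round(colMeans(z), 10)  # both columns are centred on 0
apply(z, 2, sd)         # both columns have standard deviation 1
```

After this step every axis in the parallel plot is expressed in standard deviations from the mean of its variable.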
Two different examples:
mtcars2 <- mtcars %>%
select(mpg, disp, drat, wt, qsec)
GGally::ggpairs(mtcars2)
mtcars3 <- mtcars %>%
select(mpg, cyl, disp, hp, drat)
GGally::ggpairs(mtcars3)
In the video you saw reasons for not using heat maps. Nonetheless, you may encounter a case in which you really do want to use one. Luckily, they’re fairly straightforward to produce in ggplot2.
We begin by specifying two categorical variables for the x and y aesthetics. At the intersection of each category we’ll draw a box, except here we call it a tile, using the geom_tile() layer. Then we will fill each tile with a continuous variable.
We’ll produce the heat map we saw in the video with the built-in barley dataset. The barley dataset is in the lattice package and has already been loaded for you. Begin by exploring the structure of the data in the console using str().
library(lattice)
# Create color palette
myColors <- brewer.pal(9, "Reds")
# Build the heat map from scratch
ggplot(barley, aes(x = year, y = variety, fill = yield)) +
geom_tile() + # Geom layer
facet_wrap( ~ site, ncol = 1) + # Facet layer
scale_fill_gradientn(colors = myColors) # Adjust colors
There are several alternatives to heat maps. The best choice really depends on the data and the story you want to tell with this data. If there is a time component, the most obvious choice is a line plot like what we see in the viewer. Can you come up with the correct commands to create a similar looking plot?
The barley dataset is already available in the workspace. Feel free to check out its structure before you start!
# Line plot; set the aes, geom and facet
ggplot(barley, aes(x = year, y = yield, color = variety, group = variety)) +
geom_line() +
facet_wrap( ~ site, nrow = 1)
In the videos we saw two methods for depicting overlapping measurements of spread. You can use dodged error bars or you can use overlapping transparent ribbons (shown in the viewer). In this exercise we’ll try to recreate the second option, the transparent ribbons.
The barley dataset is available. You can use str(barley) to refresh its structure before heading over to the instructions.
# Create overlapping ribbon plot from scratch
ggplot(barley, aes(x = year, y = yield, col = site, group = site, fill = site)) +
stat_summary(fun.y = mean, geom = "line") +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "ribbon", alpha = 0.1, col = NA)
You’re now going to prepare your data set for producing the faceted scatter plot in the next exercise, as mentioned in the video. For this, the data set needs to contain only the years 1996 and 2006, because your plot will only have two facets. ilo_data has been pre-loaded for you.
Use facet_grid() in order to add horizontal facets for 1996 and 2006 each, as detailed in the video.
ilo_data <- read.csv2("../data/ilo_data.csv")
names(ilo_data) <- c("country","year","hourly_compensation","working_hours")
var_num <- c("hourly_compensation","working_hours")
ilo_data[,var_num] <- apply(ilo_data[,var_num], 2, as.numeric)
# Filter ilo_data to retain the years 1996 and 2006
ilo_data <- ilo_data %>%
filter(year == "1996" | year == "2006")
# Again, you save the plot object into a variable so you can save typing later on
ilo_plot <- ggplot(ilo_data, aes(x = working_hours, y = hourly_compensation)) +
geom_point() +
labs(
x = "Working hours per week",
y = "Hourly compensation",
title = "The more people work, the less compensation they seem to receive",
subtitle = "Working hours and hourly compensation in European countries, 2006",
caption = "Data source: ILO, 2017"
) +
# Add facets here
facet_grid(facets = . ~ year)
ilo_plot
In the video you saw how a lot of typing can be saved by replacing code chunks with function calls. You saw how a function is usually defined; now you will apply this knowledge to make your previous two theme() calls reusable.
# For a starter, let's look at what you did before: adding various theme calls to your plot object
ilo_plot +
theme_minimal() +
theme(
text = element_text(family = "Bookman", color = "gray25"),
plot.subtitle = element_text(size = 12),
plot.caption = element_text(color = "gray30"),
plot.background = element_rect(fill = "gray95"),
plot.margin = unit(c(5, 10, 5, 10), units = "mm")
)
# Define your own theme function below
theme_ilo <- function() {
theme_minimal() +
theme(
text = element_text(family = "Bookman", color = "gray25"),
plot.subtitle = element_text(size = 12),
plot.caption = element_text(color = "gray30"),
plot.background = element_rect(fill = "gray95"),
plot.margin = unit(c(5, 10, 5, 10), units = "mm")
)
}
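A theme function can also take arguments, which makes it reusable across different output sizes. The following is a hypothetical variant, not part of the exercise: theme_ilo_sized() and its parameters are made up for this sketch, and the font family is left out so that it runs on any system.

```r
library(ggplot2)

# Hypothetical variant of theme_ilo() with adjustable text size and fill
theme_ilo_sized <- function(base_size = 11, background = "gray95") {
  theme_minimal(base_size = base_size) +
    theme(
      plot.caption    = element_text(color = "gray30"),
      plot.background = element_rect(fill = background)
    )
}

# Usage: a larger variant, e.g. for slides
slide_theme <- theme_ilo_sized(base_size = 16)
```

The result is an ordinary theme object, so it can be added to a plot with + just like theme_ilo().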
Once you have created your own theme_ilo() function, it is time to apply it to a plot object. In the video you saw that theme() calls can be chained. You’re going to make use of this and add another theme() call to adjust some peculiarities of the faceted plot.
# Apply your theme function
ilo_plot <- ilo_plot +
theme_ilo()
# Examine ilo_plot
ilo_plot
ilo_plot +
# Add another theme call
theme(
# Change the background fill and color
strip.background = element_rect(fill = "gray60", color = "gray95"),
# Change the color of the text
strip.text = element_text(color = "white")
)
As shown in the video, use only geom_path() to create the basic structure of the dot plot.
# Create the dot plot
ggplot(ilo_data) +
geom_path(aes(x = working_hours, y = country))
Instead of labeling years, use the arrow argument of the geom_path() call to show the direction of change. The arrows will point from 1996 to 2006, because that’s how the data set is ordered. The arrow() function takes several arguments, two of which are used here: length, which can be specified with a unit() call, which you might remember from previous exercises, and type, which defines how the arrow head will look.
ggplot(ilo_data) +
geom_path(aes(x = working_hours, y = country),
# Add an arrow to each path
arrow = arrow(length = unit(1.5, "mm"), type = "closed"))
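The arrow() function comes from the grid package and can be tried out on its own. The sketch below builds two arrow specifications; head_closed matches the one used above, while head_open is an arbitrary variation for comparison.

```r
library(grid)

# A closed (filled) arrow head, 1.5 mm long, at the end of each path
head_closed <- arrow(length = unit(1.5, "mm"), type = "closed")

# An open arrow head at both ends, with a narrower angle
head_open <- arrow(angle = 20, length = unit(1.5, "mm"),
                   ends = "both", type = "open")
```

Either object can be passed to the arrow argument of geom_path().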
Annotations or labels are a nice addition to plots, because readers then see the value of each data point displayed in the plot panel. This often makes axes obsolete, an advantage you’re going to use in the last exercise of this chapter. These labels are usually added with geom_text() or geom_label(). The latter adds a background to each label, which is not needed here.
ggplot(ilo_data) +
geom_path(aes(x = working_hours, y = country),
arrow = arrow(length = unit(1.5, "mm"), type = "closed")) +
# Add a geom_text() geometry
geom_text(
aes(x = working_hours,
y = country,
label = round(working_hours, 1))
)
As shown in the video, use mutate() and fct_reorder() to change the factor level ordering of a variable.
library(forcats)
# Reorder country factor levels
ilo_data <- ilo_data %>%
# Arrange data frame
arrange(year) %>%
# Reorder countries by working hours in 2006
mutate(country = fct_reorder(country,
working_hours,
last))
# Plot again
ggplot(ilo_data) +
geom_path(aes(x = working_hours, y = country),
arrow = arrow(length = unit(1.5, "mm"), type = "closed")) +
geom_text(
aes(x = working_hours,
y = country,
label = round(working_hours, 1))
)
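How fct_reorder() picks the new level order is easiest to see on a tiny data frame. The sketch below uses made-up data (toy) and an anonymous function that returns the last value per country, standing in for the last function used above.

```r
library(forcats)

# Made-up data, already ordered by year as after arrange(year)
toy <- data.frame(
  country = factor(c("A", "B", "C", "A", "B", "C")),
  year    = c(1996, 1996, 1996, 2006, 2006, 2006),
  hours   = c(40, 35, 30, 32, 34, 28)
)

# Order the levels by the last value of hours within each country,
# which is the 2006 value because of the row order
toy$country <- fct_reorder(toy$country, toy$hours,
                           .fun = function(v) v[length(v)])

levels(toy$country)  # "C" "A" "B"
```

The 2006 values are 28 for C, 32 for A and 34 for B, so the levels end up in that ascending order. This also shows why the arrange(year) step matters: the last value is only the 2006 value if the rows are sorted by year.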
The labels still kind of overlap with the lines in the dot plot. Use a conditional hjust aesthetic in order to better place them, and change their appearance.
# Save plot into an object for reuse
ilo_dot_plot <- ggplot(ilo_data) +
geom_path(aes(x = working_hours, y = country),
arrow = arrow(length = unit(1.5, "mm"), type = "closed")) +
# Specify the hjust aesthetic with a conditional value
geom_text(
aes(x = working_hours,
y = country,
label = round(working_hours, 1),
hjust = ifelse(year == "2006", 1.4, -0.4)
),
# Change the appearance of the text
size = 3,
family = "Bookman",
color = "gray25"
)
ilo_dot_plot
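The conditional hjust is an ordinary vectorised ifelse() call evaluated once per row. A minimal base R sketch with a made-up year vector:

```r
# One entry per row of the data, as in ilo_data
year <- c("1996", "2006", "1996", "2006")

# 2006 labels get hjust = 1.4, all others get -0.4
hjust_values <- ifelse(year == "2006", 1.4, -0.4)
hjust_values
```

An hjust above 1 moves a label to the left of its anchor point and a value below 0 moves it to the right, so the two labels end up on opposite sides of the arrow.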
Use a function introduced in the previous video to change the viewport of the plotting area. Also apply your custom theme.
# Reuse ilo_dot_plot
ilo_dot_plot <- ilo_dot_plot +
# Add labels to the plot
labs(
x = "Working hours per week",
y = "Country",
title = "People work less in 2006 compared to 1996",
subtitle = "Working hours in European countries, development since 1996",
caption = "Data source: ILO, 2017"
) +
# Apply your theme
theme_ilo() +
# Change the viewport
coord_cartesian(xlim = c(25, 41))
# View the plot
ilo_dot_plot
The x-axis title is already quite superfluous because you’ve added labels for both years. You’ll now add country labels to the plot, so all of the axes can be removed.
In this exercise, you’re going to encounter something that is probably new to you: New data sets can be given to single geometries like geom_text(), so these geometries don’t use the data set given to the initial ggplot() call. In this exercise, you are going to need this because you only want to add one label to each arrow. If you were to use the original data set ilo_data, two labels would be added because there are two observations for each country in the data set, one for 1996 and one for 2006.
# Compute temporary data set for optimal label placement
median_working_hours <- ilo_data %>%
group_by(country) %>%
summarize(median_working_hours_per_country = median(working_hours)) %>%
ungroup()
# Have a look at the structure of this data set
str(median_working_hours)
Classes 'tbl_df', 'tbl' and 'data.frame': 17 obs. of 2 variables:
$ country : Factor w/ 17 levels "Netherlands",..: 1 2 3 4 5 6 7 8 9 10 ...
$ median_working_hours_per_country: num 27 27.8 28.4 31 30.9 ...
ilo_dot_plot +
# Add label for country
geom_text(data = median_working_hours,
aes(y = country,
x = median_working_hours_per_country,
label = country),
vjust = 1.5,
family = "Bookman",
color = "gray25") +
# Remove axes and grids
theme(
axis.ticks = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank(),
# Also, let's reduce the font size of the subtitle
plot.subtitle = element_text(size = 9)
)
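The idea of giving a single geometry its own data set can also be tried on a built-in data set. The sketch below is an illustration on mtcars with made-up object names: the points use all cars, while the text layer uses a small summary data frame with one row per cylinder class.

```r
library(ggplot2)

# One label per cylinder class, placed at the group medians
label_df <- aggregate(cbind(wt, mpg) ~ cyl, data = mtcars, FUN = median)

p_labels <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.5) +
  # This layer reads label_df instead of mtcars
  geom_text(data = label_df, aes(label = cyl))

nrow(layer_data(p_labels, 1))  # one row per car
nrow(layer_data(p_labels, 2))  # one row per label
```

The x and y mappings are inherited from ggplot(), so the layer-specific data frame needs columns with the same names.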