The mtcars dataset contains information on 32 cars from a 1973 issue of Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables.
# Load the ggplot2 package
library(ggplot2)
package 㤼㸱ggplot2㤼㸲 was built under R version 3.6.3
library(dplyr)
Attaching package: 㤼㸱dplyr㤼㸲
The following objects are masked from 㤼㸱package:stats㤼㸲:
filter, lag
The following objects are masked from 㤼㸱package:base㤼㸲:
intersect, setdiff, setequal, union
# Explore the mtcars data frame with str()
str(mtcars)
'data.frame': 32 obs. of 13 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
$ fcyl: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
$ fam : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
# Execute the following command
ggplot(mtcars, aes(cyl, mpg)) +
geom_point()
Notice that ggplot2 treats cyl as a continuous variable. We get a plot, but it’s not quite right, because it gives the impression that there is such a thing as a 5 or 7-cylinder car, which there is not.
Although cyl (the number of cylinders) is categorical, you probably noticed that it is classified as numeric in mtcars. This is really misleading because the representation in the plot doesn’t match the actual data type. You’ll have to explicitly tell ggplot2 that cyl is a categorical variable.
# Load the ggplot2 package
library(ggplot2)
# Change the command below so that cyl is treated as factor
ggplot(mtcars, aes(factor(cyl), mpg)) +
geom_point()
Notice that ggplot2 treats cyl as a factor. This time the x-axis does not contain variables like 5 or 7, only the values that are present in the dataset.
Let’s dive a little deeper into the three main topics in this course: The data, aesthetics, and geom layers.
# Edit to add a color aesthetic mapped to disp
ggplot(mtcars, aes(wt, mpg, color = disp)) +
geom_point()
# Change the color aesthetic to a size aesthetic
ggplot(mtcars, aes(wt, mpg, size = disp)) +
geom_point()
In the previous exercise you saw that disp can be mapped onto a color gradient or onto a continuous size scale.
Another argument of aes() is the shape of the points. There are a finite number of shapes which ggplot() can automatically assign to the points.
The diamonds dataset contains details of 1,000 diamonds. Among the variables included are carat (a measurement of the diamond’s size) and price.
We’ll use two common geom layer functions: - geom_point() adds points (as in a scatter plot). - geom_smooth() adds a smooth trend curve.
# Explore the diamonds data frame with str()
str(diamonds)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 53940 obs. of 10 variables:
$ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num 55 61 65 58 58 57 57 55 61 61 ...
$ price : int 326 326 327 334 335 336 336 337 337 338 ...
$ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
# Add geom_point() with +
ggplot(diamonds, aes(carat, price)) +
geom_point()
# Add geom_smooth() with +
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_smooth()
If we have multiple geoms, then mapping an aesthetic to data variable inside the call to ggplot() will change all the geoms. It is also possible to make changes to individual geoms by passing arguments to the geom_*() functions.
geom_point() has an alpha argument that controls the opacity of the points. A value of 1 (the default) means that the points are totally opaque; a value of 0 means the points are totally transparent (and therefore invisible). Values in between specify transparency.
# Map the color aesthetic to clarity
ggplot(diamonds, aes(carat, price, color = clarity)) +
geom_point() +
geom_smooth()
# Make the points 40% opaque
ggplot(diamonds, aes(carat, price, color = clarity)) +
geom_point(alpha = 0.4) +
geom_smooth()
Plots can be saved as variables, which can be added two later on using the + operator. This is really useful if you want to make multiple related plots from a common base.
# Draw a ggplot
plt_price_vs_carat <- ggplot(
# Use the diamonds dataset
diamonds,
# For the aesthetics, map x to carat and y to price
aes(carat, price)
)
# Add a point layer to plt_price_vs_carat
plt_price_vs_carat + geom_point()
# Edit this to make points 20% opaque: plt_price_vs_carat_transparent
plt_price_vs_carat_transparent <- plt_price_vs_carat + geom_point(alpha = 0.2)
# See the plot
plt_price_vs_carat_transparent
# Edit this to map color to clarity,
# Assign the updated plot to a new object
plt_price_vs_carat_by_clarity <- plt_price_vs_carat + geom_point(aes(color = clarity))
# See the plot
plt_price_vs_carat_by_clarity
By assigning parts of plots to a variable then reusing that variable in other plots, it makes it really clear how much those plots have in common.
These are the aesthetics we can consider within aes() in this chapter: x, y, color, fill, size, alpha, labels and shape.
One common convention is that you don’t name the x and y arguments to aes(), since they almost always come first, but you do name other arguments.
#create factor fcyl
mtcars <- mtcars %>%
mutate(fcyl = as.factor(cyl),
fam = as.factor(am))
library(forcats)
mtcars <- mtcars %>%
mutate(fam = fct_recode(fam,
"manual" = "1",
"automatic" = "0"))
# Map x to mpg and y to fcyl
ggplot(mtcars, aes(mpg, fcyl)) +
geom_point()
# Swap mpg and fcyl
ggplot(mtcars, aes(fcyl, mpg)) +
geom_point()
# Map x to wt, y to mpg and color to fcyl
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
geom_point()
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Set the shape and size of the points
geom_point(shape = 1, size = 4)
Typically, the color aesthetic changes the outline of a geom and the fill aesthetic changes the inside. geom_point() is an exception: you use color (not fill) for the point color. However, some shapes have special behavior.
The default geom_point() uses shape = 19: a solid circle. An alternative is shape = 21: a circle that allow you to use both fill for the inside and color for the outline. This is lets you to map two aesthetics to each point.
All shape values are described on the points() help page.
# Map fcyl to fill
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
geom_point(shape = 1, size = 4)
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
# Change point shape; set alpha
geom_point(shape = 21, size = 4, alpha = 0.6)
# Map color to fam
ggplot(mtcars, aes(wt, mpg, fill = fcyl, color = fam)) +
geom_point(shape = 21, size = 4, alpha = 0.6)
Notice that mapping a categorical variable onto fill doesn’t change the colors, although a legend is generated! This is because the default shape for points only has a color attribute and not a fill attribute! Use fill when you have another shape (such as a bar), or when using a point that does have a fill and a color attribute, such as shape = 21, which is a circle with an outline. Any time you use a solid color, make sure to use alpha blending to account for over plotting.
Be careful of a major pitfall: these attributes can overwrite the aesthetics of your plot!
# Establish the base layer
plt_mpg_vs_wt <- ggplot(mtcars, aes(wt, mpg))
# Map fcyl to size
plt_mpg_vs_wt +
geom_point(aes(size = fcyl))
plt_mpg_vs_wt +
geom_point(aes(alpha = fcyl))
# Map fcyl to shape, not alpha
plt_mpg_vs_wt +
geom_point(aes(shape = fcyl))
# Use text layer and map fcyl to label
plt_mpg_vs_wt +
geom_text(aes(label = fcyl))
Label and shape are only applicable to categorical data.
This time we’ll use these arguments to set attributes of the plot, not map variables onto aesthetics.
We can specify colors in R using hex codes: a hash followed by two hexadecimal numbers each for red, green, and blue (“#RRGGBB”). Hexadecimal is base-16 counting. We have 0 to 9, and A representing 10 up to F representing 15. Pairs of hexadecimal numbers give you a range from 0 to 255. “#000000” is “black” (no color), “#FFFFFF” means “white”, and `“#00FFFF” is cyan (mixed green and blue).
# A hexadecimal color
my_blue <- "#4ABEFF"
ggplot(mtcars, aes(wt, mpg)) +
# Set the point color and alpha
geom_point(color = my_blue, alpha = 0.6)
# Change the color mapping to a fill mapping
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
# Set point size and shape
geom_point(color = my_blue, size = 10, shape = 1)
ggplot2 lets you control these attributes in many ways to customize your plots.
We can use all the aesthetics as attributes. Let’s see how this works with the aesthetics you used in the previous exercises: x, y, color, fill, size, alpha, label and shape.
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Add point layer with alpha 0.5
geom_point(alpha = 0.5)
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Add text layer with label rownames(mtcars) and color red
geom_text(label = rownames(mtcars), color = "red")
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Add points layer with shape 24 and color yellow
geom_point(shape = 24, color = "yellow")
Now, we will gradually add more aesthetics layers to the plot. We’re still working with the mtcars dataset, but this time we’re using more features of the cars. Each of the columns is described on the mtcars help page.
Notice that adding more aesthetic mappings to our plot is not always a good idea! We may just increase complexity and decrease readability.
# 3 aesthetics: qsec vs. mpg, colored by fcyl
ggplot(mtcars, aes(mpg, qsec, color = fcyl)) +
geom_point()
# 4 aesthetics: add a mapping of shape to fam
ggplot(mtcars, aes(mpg, qsec, color = fcyl, shape = fam)) +
geom_point()
# 5 aesthetics: add a mapping of size to hp / wt
ggplot(mtcars, aes(mpg, qsec, color = fcyl, shape = fam, size = hp/wt)) +
geom_point()
Between the x and y dimensions, the color, shape, and size of the points, your plot displays five dimensions of the dataset!
We’ll modify some aesthetics to make a bar plot of the number of cylinders for cars with different types of transmission.
We’ll also make use of some functions for improving the appearance of the plot.
ggplot(mtcars, aes(fcyl, fill = fam)) +
geom_bar() +
# Set the axis labels
labs(x = "Number of Cylinders", y = "Count")
palette <- c(automatic = "#377EB8", manual = "#E41A1C")
ggplot(mtcars, aes(fcyl, fill = fam)) +
geom_bar() +
labs(x = "Number of Cylinders", y = "Count") +
# Set the fill color scale
scale_fill_manual("Transmission", values = palette)
# Set the position
ggplot(mtcars, aes(fcyl, fill = fam)) +
geom_bar(position = "dodge") +
labs(x = "Number of Cylinders", y = "Count") +
scale_fill_manual("Transmission", values = palette)
Choosing the right position argument is an important part of making a good plot.
We saw all the visible aesthetics can serve as attributes and aesthetics, but we left out x and y. That’s because although we can make univariate plots (such as histograms, which you’ll get to in the next chapter), a y-axis will always be provided, even if you didn’t ask for it.
We can make univariate plots in ggplot2, but we will need to add a fake y axis by mapping y to zero.
When using setting y-axis limits, we can specify the limits as separate arguments, or as a single numeric vector. That is, ylim(lo, hi) or ylim(c(lo, hi)).
# Plot 0 vs. mpg
ggplot(mtcars, aes(mpg, 0)) +
# Add jitter
geom_point(position = "jitter")
ggplot(mtcars, aes(mpg, 0)) +
geom_jitter() +
# Set the y-axis limits
ylim(-2, 2)
The best way to make your plot depends on a lot of different factors and sometimes ggplot2 might not be the best choice.
Incorrect aesthetic mapping causes confusion or misleads the audience.
Typically, the dependent variable is mapped onto the the y-axis and the independent variable is mapped onto the x-axis.
Primary:
Secondary:
Never:
Always:
Effficient
Accurate
Scatter plots (using geom_point()) are intuitive, easily understood, and very common, but we must always consider overplotting, particularly in the following four situations:
Typically, alpha blending (i.e. adding transparency) is recommended when using solid shapes. Alternatively, you can use opaque, hollow shapes.
Small points are suitable for large datasets with regions of high density (lots of overlapping).
# Plot price vs. carat, colored by clarity
plt_price_vs_carat_by_clarity <- ggplot(diamonds, aes(carat, price, color = clarity))
# Add a point layer with tiny points
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape = ".")
# Set transparency to 0.5, set shape to 16
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape = 16)
Let’s take a look at another case where we should be aware of overplotting: Aligning values on a single axis.
This occurs when one axis is continuous and the other is categorical, which can be overcome with some form of jittering.
# Plot base
plt_mpg_vs_fcyl_by_fam <- ggplot(mtcars, aes(fcyl, mpg, color = fam))
# Default points are shown for comparison
plt_mpg_vs_fcyl_by_fam + geom_point()
# Default points are shown for comparison
plt_mpg_vs_fcyl_by_fam + geom_point()
# Alter the point positions by jittering, width 0.3
plt_mpg_vs_fcyl_by_fam + geom_point(position = position_jitter(width = 0.3))
# Default points are shown for comparison
plt_mpg_vs_fcyl_by_fam + geom_point()
# Now jitter and dodge the point positions
plt_mpg_vs_fcyl_by_fam + geom_point(position = position_jitterdodge(jitter.width = 0.3, dodge.width = 0.3))
Overplotting 3: Low-precision data We already saw how to deal with overplotting when using geom_point() in two cases:
We used position = ‘jitter’ inside geom_point() or geom_jitter().
Let’s take a look at another case:
This results from low-resolution measurements like in the iris dataset, which is measured to 1mm precision. It’s similar to case 2, but in this case we can jitter on both the x and y axis.
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
# Swap for jitter layer with width 0.1
geom_jitter(alpha = 0.5,width = 0.1)
#jitter within geom_point
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
# Set the position to jitter
geom_point(alpha = 0.5, position = "jitter")
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
# Use a jitter position function with width 0.1
geom_point(alpha = 0.5, position = position_jitter(width = 0.1))
Notice that jitter can be a geom itself (i.e. geom_jitter()), an argument in geom_point() (i.e. position = “jitter”), or a position function, (i.e. position_jitter()).
Let’s take a look at the last case of dealing with overplotting:
This can be type integer (i.e. 1 ,2, 3…) or categorical (i.e. class factor) variables. factor is just a special class of type integer.
We’ll typically have a small, defined number of intersections between two variables, which is similar to case 3, but you may miss it if you don’t realize that integer and factor data are the same as low precision data.
The Vocab dataset provided contains the years of education and vocabulary test scores from respondents to US General Social Surveys from 1972-2004.
library(carData)
Attaching package: 㤼㸱carData㤼㸲
The following object is masked _by_ 㤼㸱.GlobalEnv㤼㸲:
Vocab
# Examine the structure of Vocab
str(Vocab)
'data.frame': 30351 obs. of 4 variables:
$ year : num 1974 1974 1974 1974 1974 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 2 2 2 1 1 ...
$ education : num 14 16 10 10 12 16 17 10 12 11 ...
$ vocabulary: Factor w/ 11 levels "0","1","2","3",..: 10 10 10 6 9 9 10 6 4 6 ...
# Plot vocabulary vs. education
ggplot(Vocab, aes(education, vocabulary)) +
# Add a point layer
geom_point()
ggplot(Vocab, aes(education, vocabulary)) +
# Change to a jitter layer
geom_jitter()
ggplot(Vocab, aes(education, vocabulary)) +
# Set the transparency to 0.2
geom_jitter(alpha=0.2)
ggplot(Vocab, aes(education, vocabulary)) +
# Set the shape to 1
geom_jitter(alpha = 0.2, shape = 1)
Notice how jittering and alpha blending serves as a great solution to the overplotting problem here. Setting the shape to 1 didn’t really help, but it was useful in the previous exercises when you had less data. We need to consider each plot individually.
Histograms cut up a continuous variable into discrete bins and, by default, maps the internally calculated count variable (the number of observations in each bin) onto the y aesthetic. An internal variable called density can be accessed by using the .. notation, i.e. ..density… Plotting this variable will show the relative frequency, which is the height times the width of each bin.
# Plot mpg
ggplot(mtcars, aes(mpg)) +
# Add a histogram layer
geom_histogram()
ggplot(mtcars, aes(mpg)) +
# Set the binwidth to 1
geom_histogram(binwidth = 1)
# Map y to ..density..
ggplot(mtcars, aes(mpg, ..density..)) +
geom_histogram(binwidth = 1)
datacamp_light_blue <- "#51A8C9"
ggplot(mtcars, aes(mpg, ..density..)) +
# Set the fill color to datacamp_light_blue
geom_histogram(binwidth = 1, fill = datacamp_light_blue)
Histograms are one of the most common exploratory plots for continuous data. If you want to use density on the y-axis be sure to set your binwidth to an intuitive value.
Here, we’ll examine the various ways of applying positions to histograms. geom_histogram(), a special case of geom_bar(), has a position argument that can take on the following values:
# Update the aesthetics so the fill color is by fam
ggplot(mtcars, aes(mpg, fill = fam)) +
geom_histogram(binwidth = 1)
ggplot(mtcars, aes(mpg, fill = fam)) +
# Change the position to dodge
geom_histogram(binwidth = 1, position = "dodge")
ggplot(mtcars, aes(mpg, fill = fam)) +
# Change the position to fill
geom_histogram(binwidth = 1, position = "fill")
ggplot(mtcars, aes(mpg, fill = fam)) +
# Change the position to identity, with transparency 0.4
geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)
Let’s see how the position argument changes geom_bar().
We have three position options:
While we will be using geom_bar() here, note that the function geom_col() is just geom_bar() where both the position and stat arguments are set to “identity”. It is used when we want the heights of the bars to represent the exact values in the data.
# Plot fcyl, filled by fam
ggplot(mtcars, aes(fcyl, fill = fam)) +
# Add a bar layer
geom_bar()
ggplot(mtcars, aes(fcyl, fill = fam)) +
# Set the position to "fill"
geom_bar(position = "fill")
ggplot(mtcars, aes(fcyl, fill = fam)) +
# Change the position to "dodge"
geom_bar(position = "dodge")
Different kinds of plots need different position arguments, so it’s important to be familiar with this attribute.
We can customize bar plots further by adjusting the dodging so that our bars partially overlap each other. Instead of using position = “dodge”, we’re going to use position_dodge(), like we did with position_jitter() in the the previously. Here, we’ll save this as an object, posn_d, so that we can easily reuse it.
Remember, the reason we want to use position_dodge() (and position_jitter()) is to specify how much dodging (or jittering) you want.
ggplot(mtcars, aes(cyl, fill = fam)) +
# Change position to use the functional form, with width 0.2
geom_bar(position = position_dodge(width = 0.2))
ggplot(mtcars, aes(cyl, fill = fam)) +
# Set the transparency to 0.6
geom_bar(position = position_dodge(width = 0.2), alpha = 0.6)
By using these position functions, we can customize your plot to suit your needs.
We’ll fill each segment according to an ordinal variable. The best way to do that is with a sequential color palette.
#example of using a sequential color palette
ggplot(mtcars, aes(fcyl, fill = fam)) +
geom_bar() +
scale_fill_brewer(palette = "Set1")
Vocab = Vocab %>%
mutate(vocabulary = as.factor(vocabulary))
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
geom_bar()
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
# Add a bar layer with position "fill"
geom_bar(position = "fill")
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
# Add a bar layer with position "fill"
geom_bar(position = "fill") +
# Add a brewer fill scale with default palette
scale_fill_brewer()
We’ll use the economics dataset to make some line plots. The dataset contains a time series for unemployment and population statistics from the Federal Reserve Bank of St. Louis in the United States. The data is contained in the ggplot2 package.
To begin with, we can look at how the median unemployment time and the unemployment rate (the number of unemployed people as a proportion of the population) change over time.
# Print the head of economics
head(economics)
# Using economics, plot unemploy vs. date
ggplot(economics, aes(x = date, y = unemploy)) +
# Make it a line plot
geom_line()
# Change the y-axis to the proportion of the population that is unemployed
ggplot(economics, aes(date, unemploy/pop)) +
geom_line()
We already saw how the form of your data affects how you can plot it. Let’s explore that further with multiple time series. Here, it’s important that all lines are on the same scale, and if possible, on the same plot.
fish.species contains the global capture rates of seven salmon species from 1950–2010. Each variable (column) is a Salmon species and each observation (row) is one year. fish.tidy contains the same data, but in three columns: Species, Year, and Capture (i.e. one variable per column).
load("fish.RData")
library(tidyr)
# Use gather to go from fish.species to fish.tidy
fish.tidy <- gather(fish.species, Species, Capture, -Year)
str(fish.species)
'data.frame': 61 obs. of 8 variables:
$ Year : int 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
$ Pink : int 100600 259000 132600 235900 123400 244400 203400 270119 200798 200085 ...
$ Chum : int 139300 155900 113800 99800 148700 143700 158480 125377 132407 113114 ...
$ Sockeye : int 64100 51200 58200 66100 83800 72000 84800 69676 100520 62472 ...
$ Coho : int 30500 40900 33600 32400 38300 45100 40000 39900 39200 32865 ...
$ Rainbow : int 0 100 100 100 100 100 100 100 100 100 ...
$ Chinook : int 23200 25500 24900 25300 24500 27700 25300 21200 20900 20335 ...
$ Atlantic: int 10800 9701 9800 8800 9600 7800 8100 9000 8801 8700 ...
str(fish.tidy)
'data.frame': 427 obs. of 3 variables:
$ Year : int 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
$ Species: chr "Pink" "Pink" "Pink" "Pink" ...
$ Capture: int 100600 259000 132600 235900 123400 244400 203400 270119 200798 200085 ...
# Plot the Rainbow Salmon time series
ggplot(fish.species, aes(x = Year, y = Rainbow)) +
geom_line()
# Plot the Pink Salmon time series
ggplot(fish.species, aes(x = Year, y = Pink)) +
geom_line()
# Plot multiple time-series by grouping by species
ggplot(fish.tidy, aes(Year, Capture)) +
geom_line(aes(group = Species))
# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(x = Year, y = Capture, color = Species)) +
geom_line(aes(group = Species))
As we can see in the the last couple of plots, a grouping aesthetic was vital here. If you don’t specify color = Species, you’ll get a mess of lines.
To change stylistic elements of a plot, call theme() and set plot properties to a new value. For example, the following changes the legend position.
p + theme(legend.position = new_value)
Here, the new value can be
recess <- data.frame(
begin = c("1969-12-01","1973-11-01","1980-01-01","1981-07-01","1990-07-01","2001-03-01", "2007-12-01"),
end = c("1970-11-01","1975-03-01","1980-07-01","1982-11-01","1991-03-01","2001-11-01", "2009-07-30"),
event = c("Fiscal & Monetary\ntightening", "1973 Oil crisis", "Double dip I","Double dip II", "Oil price shock", "Dot-com bubble", "Sub-prime\nmortgage crisis"),
y = c(.01415981, 0.02067402, 0.02951190, 0.03419201, 0.02767339, 0.02159662,0.02520715),
stringsAsFactors = F
)
library(lubridate)
Attaching package: 㤼㸱lubridate㤼㸲
The following object is masked from 㤼㸱package:base㤼㸲:
date
recess$begin <- ymd (recess$begin)
recess$end <- ymd (recess$end)
plt_prop_unemployed_over_time = ggplot(economics, aes(x = date, y = unemploy/pop)) +
ggtitle(c("The percentage of unemployed Americans \n increases sharply during recessions")) +
geom_line() +
geom_rect(data = recess,
aes(xmin = begin, xmax = end, ymin = -Inf, ymax = +Inf, fill = "Recession"),
inherit.aes = FALSE, alpha = 0.2) +
geom_label(data = recess, aes(x = end, y = y, label=event), size = 3) +
scale_fill_manual(name = "", values="red", label="Recessions")
plt_prop_unemployed_over_time
# View the default plot
plt_prop_unemployed_over_time
# Remove legend entirely
plt_prop_unemployed_over_time +
theme(legend.position = "none")
# Position the legend at the bottom of the plot
plt_prop_unemployed_over_time +
theme(legend.position = "bottom")
# Position the legend inside the plot at (0.6, 0.1)
plt_prop_unemployed_over_time +
theme(legend.position = c(0.6,0.1))
But be careful when placing a legend inside your plotting space. You could end up obscuring data.
Many plot elements have multiple properties that can be set. For example, line elements in the plot such as axes and gridlines have a color, a thickness (size), and a line type (solid line, dashed, or dotted). To set the style of a line, you use element_line(). For example, to make the axis lines into red, dashed lines, you would use the following.
p + theme(axis.line = element_line(color = "red", linetype = "dashed"))
Similarly, element_rect() changes rectangles and element_text() changes text. You can remove a plot element using element_blank().
plt_prop_unemployed_over_time +
theme(
# For all rectangles, set the fill color to grey92
rect = element_rect(fill = "grey92"),
# For the legend key, turn off the outline
legend.key = element_rect(color = NA)
)
plt_prop_unemployed_over_time +
theme(
rect = element_rect(fill = "grey92"),
legend.key = element_rect(color = NA),
# Turn off axis ticks
axis.ticks = element_blank(),
# Turn off the panel grid
panel.grid = element_blank()
)
plt_prop_unemployed_over_time +
theme(
rect = element_rect(fill = "grey92"),
legend.key = element_rect(color = NA),
axis.ticks = element_blank(),
panel.grid = element_blank(),
# Add major y-axis panel grid lines back
panel.grid.major.y = element_line(
# Set the color to white
color = "white",
# Set the size to 0.5
size = 0.5,
# Set the line type to dotted
linetype = "dotted"
)
)
plt_prop_unemployed_over_time +
theme(
rect = element_rect(fill = "grey92"),
legend.key = element_rect(color = NA),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.grid.major.y = element_line(
color = "white",
size = 0.5,
linetype = "dotted"
),
# Set the axis text color to grey25
axis.text = element_text(color ="grey25"),
# Set the plot title font face to italic and font size to 16
plot.title = element_text(size = 16, face = "italic")
)
Excellent Explanatory Plot! This plot is ready for prime time – it’s pretty AND informative. Make sure that all your text is legible for the context in which it will be viewed.
Whitespace means all the non-visible margins and spacing in the plot.
To set a single whitespace value, use unit(x, unit), where x is the amount and unit is the unit of measure.
Borders require you to set 4 positions, so use margin(top, right, bottom, left, unit). To remember the margin order, think TRouBLe.
The default unit is “pt” (points), which scales well with text. Other options include “cm”, “in” (inches) and “lines” (of text).
# View the original plot
plt_mpg_vs_wt_by_cyl <- ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
ylab("Miels per gallon") +
xlab("weight (1000/lbs)") +
geom_point()
plt_mpg_vs_wt_by_cyl
plt_mpg_vs_wt_by_cyl +
theme(
# Set the axis tick length to 2 lines
axis.ticks.length = unit(2, "lines")
)
plt_mpg_vs_wt_by_cyl +
theme(
# Set the legend key size to 3 centimeters
legend.key.size = unit(3, "cm")
)
plt_mpg_vs_wt_by_cyl +
theme(
# Set the legend margin to (20, 30, 40, 50) points
legend.margin = margin(20, 30, 40, 50, "pt")
)
plt_mpg_vs_wt_by_cyl +
theme(
# Set the plot margin to (10, 30, 50, 70) millimeters
plot.margin = margin(10, 30, 50, 70, "mm")
)
Changing the whitespace can be useful if you need to make your plot more compact, or if you want to create more space to reduce “business”.
Built-in themes In addition to making your own themes, there are several out-of-the-box solutions that may save you lots of time.
# Add a black and white theme
plt_prop_unemployed_over_time +
theme_bw()
# Add a classic theme
plt_prop_unemployed_over_time +
theme_classic()
# Add a void theme
plt_prop_unemployed_over_time +
theme_void()
The black and white theme works really well if you use transparency in your plot.
Outside of ggplot2, another source of built-in themes is the ggthemes package.
library(ggthemes)
package 㤼㸱ggthemes㤼㸲 was built under R version 3.6.3
# Use the fivethirtyeight theme
plt_prop_unemployed_over_time +
theme_fivethirtyeight()
# Use Tufte's theme
plt_prop_unemployed_over_time +
theme_tufte()
# Use the Wall Street Journal theme
plt_prop_unemployed_over_time +
theme_wsj()
ggthemes has over 20 themes for you to try.
Reusing a theme across many plots helps to provide a consistent style. You have several options for this.
A good strategy that we’ll use here is to begin with a built-in theme then modify it.
# Save the theme as theme_recession
theme_recession <- theme(
rect = element_rect(fill = "grey92"),
legend.key = element_rect(color = NA),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.grid.major.y = element_line(color = "white", size = 0.5, linetype = "dotted"),
axis.text = element_text(color = "grey25"),
plot.title = element_text(face = "italic", size = 16),
legend.position = c(0.6, 0.1)
)
# Combine the Tufte theme with theme_recession
theme_tufte_recession <- theme_tufte() + theme_recession
# Add the Tufte recession theme to the plot
plt_prop_unemployed_over_time + theme_tufte_recession
theme_recession <- theme(
rect = element_rect(fill = "grey92"),
legend.key = element_rect(color = NA),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.grid.major.y = element_line(color = "white", size = 0.5, linetype = "dotted"),
axis.text = element_text(color = "grey25"),
plot.title = element_text(face = "italic", size = 16),
legend.position = c(0.6, 0.1)
)
theme_tufte_recession <- theme_tufte() + theme_recession
# Set theme_tufte_recession as the default theme
theme_set(theme_tufte_recession)
# Draw the plot (without explicitly adding a theme)
plt_prop_unemployed_over_time
plt_prop_unemployed_over_time +
# Add Tufte's theme
theme_tufte()
plt_prop_unemployed_over_time +
theme_tufte() +
# Add individual theme elements
theme(
# Turn off the legend
legend.position = "none",
# Turn off the axis ticks
axis.ticks = element_blank()
)
plt_prop_unemployed_over_time +
theme_tufte() +
theme(
legend.position = "none",
axis.ticks = element_blank(),
# Set the axis title's text color to grey60
axis.title = element_text(color = "grey60"),
# Set the axis text's text color to grey60
axis.text = element_text(color = "grey60")
)
plt_prop_unemployed_over_time +
theme_tufte() +
theme(
legend.position = "none",
axis.ticks = element_blank(),
axis.title = element_text(color = "grey60"),
axis.text = element_text(color = "grey60"),
# Set the panel gridlines major y values
panel.grid.major.y = element_line(
# Set the color to grey60
color = "grey60",
# Set the size to 0.25
size = 0.25,
# Set the linetype to dotted
linetype = "dotted"
)
)
Let’s focus on producing beautiful and effective explanatory plots. In the next couple of exercises, we’ll create a plot that is similar to the one shown in the video using gm2007, a filtered subset of the gapminder dataset.
This type of plot will be in an info-viz style, meaning that it would be similar to something you’d see in a magazine or website for a mostly lay audience.
library(gapminder)
package 㤼㸱gapminder㤼㸲 was built under R version 3.6.3
gapminder
gm2007 <- gapminder %>%
filter(year == 2007) %>%
select(country, lifeExp, continent) %>%
filter(lifeExp > 80.6 | lifeExp <46) %>%
arrange(lifeExp)
gm2007
gm2007_full <- gapminder %>%
filter(year == 2007) %>%
select(country, lifeExp, continent)
# Add a geom_segment() layer
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2) +
theme(legend.position="right")
# Add a geom_text() layer
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2) +
geom_text(aes(label = lifeExp), color = "white", size = 1.5) +
theme(legend.position="right")
library(RColorBrewer)
# Set the color scale
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]
# Modify the scales
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2) +
geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
scale_x_continuous("", expand = c(0, 0), limits = c(30,90), position = "top") +
scale_color_gradientn(colors = palette) +
theme(legend.position="right")
# Set the color scale
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]
# Add a title and caption
plt_country_vs_lifeExp <- ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2) +
geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
scale_x_continuous("", expand = c(0,0), limits = c(30,90), position = "top") +
scale_color_gradientn(colors = palette) +
labs(title = "Highest and lowest life expectancies, 2007", caption = "Source: gapminder") +
theme(legend.position="right")
plt_country_vs_lifeExp
We completed our basic plot. Now let’s polish it by playing with the theme and adding annotations. In this exercise, we’ll use annotate() to add text and a curve to the plot.
The following values have been calculated for you to assist with adding embellishments to the plot:
global_mean <- mean(gm2007_full$lifeExp)
x_start <- global_mean + 4
y_start <- 5.5
x_end <- global_mean
y_end <- 7.5
# Define the theme
plt_country_vs_lifeExp <- plt_country_vs_lifeExp +
theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(color = "black"),
axis.title = element_blank(),
legend.position = "none")
plt_country_vs_lifeExp
# Add a vertical line
plt_country_vs_lifeExp <- plt_country_vs_lifeExp +
geom_vline(xintercept = global_mean, color = "grey40", linetype = 3)
plt_country_vs_lifeExp
plt_country_vs_lifeExp <- plt_country_vs_lifeExp +
annotate(
"text",
x = x_start, y = y_start,
label = "The\nglobal\naverage",
vjust = 1, size = 3, color = "grey40"
)
plt_country_vs_lifeExp
plt_country_vs_lifeExp <- plt_country_vs_lifeExp +
annotate(
"curve",
x = x_start, y = y_start,
xend = x_end, yend = y_end,
arrow = arrow(length = unit(0.2, "cm"), type = "closed"),
color = "grey40"
)
plt_country_vs_lifeExp
Your explanatory plot clearly shows the countries with the highest and lowest life expectancy and would be great for a lay audience.