Today’s tutorial

Today we will be using graphical methods to do some basic exploratory data analysis.

Starting off

Remember to set your working directory before you start. (See last week’s tutorial if you have forgotten how.) You also need to load today’s packages. If they are not already installed, you will need to install them first. (Ask me how.)

library(ggplot2)
library(ggthemes)
library(GGally)
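If you are not sure whether the packages are installed, a minimal sketch like the following can check for you. The helper name missing_packages() is made up for this example; it is not part of any package.

```r
# Hypothetical helper (not from any package): returns the names in `pkgs`
# that are not installed on this computer.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("ggplot2", "ggthemes", "GGally"))
print(to_install)  # character(0) means everything is already installed

# Uncomment to install whatever is missing (needs an internet connection):
# install.packages(to_install)
```

Once nothing is reported as missing, the library() calls above should load without errors.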

For example, imagine I am loading a CSV file full of (fake but realistic) information on polar bears, including information about sex, age, and body measurements such as head length, body length, etc. We can rename some of the values to make them more attractive when plotting.

PolarBearData <- read.csv("PolarBearData.csv")
PolarBearData$AgeClass <- factor(PolarBearData$AgeClass,
                                 levels = c("ADULT", "SUBADULT"),
                                 labels = c("Adult", "Subadult"))
PolarBearData$Sex <- factor(PolarBearData$Sex,
                            levels = c("F", "M"),
                            labels = c("Female", "Male"))
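To see what factor() is doing here, the sketch below relabels a small made-up vector (not the real AgeClass column). One thing to watch for: any value that is not listed in levels becomes NA.

```r
# Made-up values standing in for the raw AgeClass column
raw_age <- c("ADULT", "SUBADULT", "ADULT", "CUB")

age <- factor(raw_age,
              levels = c("ADULT", "SUBADULT"),
              labels = c("Adult", "Subadult"))

levels(age)                  # display names, in plotting order
table(age, useNA = "ifany")  # "CUB" was not in `levels`, so it is now NA
```

If you see unexpected NA values after relabelling, check the spelling and capitalisation of the entries in levels.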

Let’s get plotting in ggplot!

There are three major systems for plotting graphics in R. They are:

  1. base graphics
  2. lattice
  3. ggplot2

For basic plots, base graphics (such as the plot() and hist() functions) are easy and attractive. But for more complicated plots, or for data exploration where you are exploring multiple variables, it is easiest to use ggplot().

For the remainder of today’s lab, we will be working with the ggplot2 package. This package includes powerful methods for visual data exploration.

Scatterplot (continuous x, continuous y)

ggplot is based on the “grammar of graphics”. In this system, you build a plot one layer at a time. For example, suppose I want to create this scatter plot, which also shows linear regressions of head length on body length for both males and females:

The code that I used to create this graph was:

ggplot(data = PolarBearData,
       aes(x = BodyLength.cm, y = HeadLength.mm, colour = Sex, fill = Sex, shape = Sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_fill_manual(values = c("red", "blue")) +
  scale_colour_manual(values = c("red", "blue")) +
  labs(x = "Body Length (cm)", y = "Head Length (mm)") +
  theme_few()

Let’s break this down. Every ggplot2 plot must include three things:

  1. The dataset.
  2. “Aesthetic mappings” between columns of data and elements of the plot. (For example, which column do you want to be the x-axis and which column do you want to be the y-axis?) These are specified using the aes() function.
  3. One or more layers describing how to display the data. These are specified using geom_ functions. There are many geoms, depending on whether your x- and y-axes are discrete or continuous. [1]

The first step to creating ggplot graphics is to call the ggplot() function. Inside the ggplot() function, you define all of the common elements of your plot. Generally, this means you want to define the dataset and specify the axes. The data can be defined as data = PolarBearData. The x- and y-axes are specified as aesthetics inside a function called aes(). “Aesthetics” are various aspects of the plot (such as axes, colours, etc.) that will be mapped to columns of your data.

This command creates a new (empty) plot with BodyLength.cm on the x-axis and HeadLength.mm on the y-axis.

ggplot(data = PolarBearData, aes(x = BodyLength.cm, y = HeadLength.mm))

The next step is to plot the data as a new layer on top of the graph we set up. We can do this using one of ggplot’s geom (i.e., geometry) functions. We will use geom_point() to add points to our plot. Note that you add this element to the plot with a +:

ggplot(data = PolarBearData, aes(x = BodyLength.cm, y = HeadLength.mm)) +
  geom_point()
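Because a plot is built in layers, you can also store the base plot in an object and add layers to it later. The sketch below uses simulated data (the data frame bears and its values are made up, not the tutorial's CSV file).

```r
library(ggplot2)

# Simulated stand-in for the polar bear data (values are made up)
set.seed(1)
bears <- data.frame(BodyLength.cm = runif(40, 150, 250))
bears$HeadLength.mm <- 150 + 0.9 * bears$BodyLength.cm + rnorm(40, sd = 10)

# Step 1: dataset + aesthetic mappings; no data are drawn yet
p <- ggplot(data = bears, aes(x = BodyLength.cm, y = HeadLength.mm))

# Step 2: add a layer with +; the object p itself is left unchanged
p_points <- p + geom_point()

print(p_points)
```

Storing the base plot this way is handy when you want to try several geoms on the same data and axes.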

If we want to colour our data points by another variable, we can do it easily with an aesthetic. For example, to colour data points by Sex:

ggplot(data = PolarBearData, aes(x = BodyLength.cm, y = HeadLength.mm, colour = Sex)) +
  geom_point()

To add a line of best fit to our plot, we can add geom_smooth().

By default, geom_smooth() adds a line produced using a smoothing technique called LOESS. LOESS is useful for exploring patterns without assuming that the relationship between variables is necessarily a straight line. It fits a weighted regression within a sliding window, so it behaves much like a moving (weighted) average. To change the width of the sliding window, use the span argument (the proportion of the data used in each window; smaller values give a wigglier line). When datasets are large (i.e., \(n \ge 1000\)), geom_smooth() uses a generalized additive model (GAM) rather than LOESS.

ggplot(data = PolarBearData, aes(x = BodyLength.cm, y = HeadLength.mm)) +
  geom_point() +
  geom_smooth()
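To get a feel for span, the following sketch draws two LOESS lines over the same simulated (made-up) data: a narrow window in red and a wide window in blue. The data frame wiggly is an assumption for illustration only.

```r
library(ggplot2)

# Simulated wavy data (made up for this example)
set.seed(2)
wiggly <- data.frame(x = seq(0, 10, length.out = 100))
wiggly$y <- sin(wiggly$x) + rnorm(100, sd = 0.3)

p_span <- ggplot(wiggly, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  # Narrow window: follows local wiggles closely
  geom_smooth(method = "loess", formula = y ~ x, span = 0.2,
              se = FALSE, colour = "red") +
  # Wide window: much smoother, may miss real curvature
  geom_smooth(method = "loess", formula = y ~ x, span = 1,
              se = FALSE, colour = "blue")

print(p_span)
```

Neither setting is "correct"; the point of exploratory plotting is to try a few spans and see which patterns persist.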

To force geom_smooth() to use a linear regression, add the method = "lm" option:

ggplot(data = PolarBearData, aes(x = BodyLength.cm, y = HeadLength.mm)) +
  geom_point() +
  geom_smooth(method = "lm")

We can change the colour of a line and the shading of its confidence interval by adding aesthetics. If we add Sex as an aesthetic, ggplot will display males and females distinctly, and conduct regressions on them separately:

ggplot(data = PolarBearData, aes(x = BodyLength.cm, y = HeadLength.mm,
                                 colour = Sex, fill = Sex)) +
  geom_point() +
  geom_smooth(method = "lm")
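If you want to check what "separate regressions" means, you can fit one lm() per sex yourself. This is a sketch on simulated data (the values and the difference in slopes are made up); the fitted lines should correspond to what geom_smooth(method = "lm") draws for each group.

```r
# Simulated stand-in for the polar bear data (values are made up)
set.seed(3)
bears <- data.frame(Sex = factor(rep(c("Female", "Male"), each = 30)),
                    BodyLength.cm = runif(60, 150, 250))

# Made-up relationship: males get a steeper slope than females
slope <- ifelse(bears$Sex == "Male", 1.2, 0.8)
bears$HeadLength.mm <- 150 + slope * bears$BodyLength.cm + rnorm(60, sd = 8)

# Split the data by sex and fit one linear regression to each piece
fits <- lapply(split(bears, bears$Sex),
               function(d) lm(HeadLength.mm ~ BodyLength.cm, data = d))

# One column of coefficients (intercept, slope) per sex
coefs <- sapply(fits, coef)
coefs
```

Each column of coefs holds the intercept and slope for one sex, so two groups give two lines on the plot.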

We can also add a shape aesthetic to help colour-blind readers. The ggthemes package also includes colour-blind-friendly palettes, which we can specify with scale_color_colorblind() and scale_fill_colorblind().

ggplot(data = PolarBearData, aes(x = BodyLength.cm, y = HeadLength.mm,
           colour = Sex, fill = Sex, shape = Sex)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_colorblind() +
  scale_fill_colorblind()

To change the colours of ggplot from their defaults (a salmon red and a teal when there are two groups), you can use scale_colour_manual() for points and lines, and scale_fill_manual() for filled areas such as confidence bands. (Note: I have placed a cheat sheet for colours in R on D2L.) We can also use alpha to set the transparency of the data points. (Note: alpha = 0 is completely transparent; alpha = 1 is completely opaque.)

ggplot(data = PolarBearData,
       aes(x = BodyLength.cm, y = HeadLength.mm, colour = Sex,
           fill = Sex, shape = Sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_fill_manual(values = c("red", "blue")) +
  scale_colour_manual(values = c("red", "blue"))
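In the code above, colours are matched to factor levels by position ("red" goes to the first level, Female). A slightly safer variant, sketched here on simulated (made-up) data, is to name the colours so that the order no longer matters.

```r
library(ggplot2)

# Simulated stand-in for the polar bear data (values are made up)
set.seed(4)
bears <- data.frame(Sex = factor(rep(c("Female", "Male"), each = 20)),
                    BodyLength.cm = runif(40, 150, 250))
bears$HeadLength.mm <- 150 + 0.9 * bears$BodyLength.cm + rnorm(40, sd = 10)

# Named vector: each colour is tied to a factor level by name, not position
sex_colours <- c(Male = "blue", Female = "red")

p_named <- ggplot(bears, aes(x = BodyLength.cm, y = HeadLength.mm,
                             colour = Sex)) +
  geom_point(alpha = 0.5) +
  scale_colour_manual(values = sex_colours)

print(p_named)
```

The same named vector can be passed to scale_fill_manual() so that fills and outlines stay consistent.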

We can use the labs() function to change the x- and y-axis labels:

ggplot(data = PolarBearData,
       aes(x = BodyLength.cm, y = HeadLength.mm, colour = Sex, fill = Sex, shape = Sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_fill_manual(values = c("red", "blue")) +
  scale_colour_manual(values = c("red", "blue")) +
  labs(x = "Body Length (cm)", y = "Head Length (mm)")

Our last step will be to change the default ggplot theme. Although ggplot is very powerful, its default settings (i.e., grey background with white lines) look terrible when printed. The ggthemes package contains many ggplot themes. I like theme_few():

ggplot(data = PolarBearData,
       aes(x = BodyLength.cm, y = HeadLength.mm, colour = Sex, fill = Sex, shape = Sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_fill_manual(values = c("red", "blue")) +
  scale_colour_manual(values = c("red", "blue")) +
  labs(x = "Body Length (cm)", y = "Head Length (mm)") +
  theme_few()

Scatterplots can be further improved by showing the univariate distribution of each variable in the margins of the figure (e.g., using the ggMarginal() function from the ggExtra package).

Histogram (continuous x)

A histogram plots the sample distribution as binned counts or proportions. Histograms are useful for determining the shape of a distribution, especially when you have moderate to large sample sizes (i.e., \(n \ge 20\)). To define a histogram, use your variable of interest as the x-axis. Do not define a y-axis aesthetic: the count data of your binned variable becomes the y-axis automatically:

ggplot(data = PolarBearData, aes(x = HeadLength.mm)) +
  geom_histogram(fill = "lightgrey") +
  labs(x = "Head Length (mm)", y = "Count") +
  theme_few()

You can change the width of a bin using the binwidth option. By default, ggplot uses a total of 30 bins. To change the bin width to 10 mm:

ggplot(data = PolarBearData, aes(x = HeadLength.mm)) +
  geom_histogram(fill = "lightgrey", binwidth = 10) +
  labs(x = "Head Length (mm)", y = "Count") +
  theme_few()
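To confirm what binwidth does, you can pull the computed bins out of a plot with layer_data(). This is a sketch on simulated (made-up) head lengths; bin_widths() is a hypothetical helper written for this example.

```r
library(ggplot2)

# Simulated head lengths in mm (values are made up)
set.seed(5)
heads <- data.frame(HeadLength.mm = rnorm(200, mean = 350, sd = 30))

p_width <- ggplot(heads, aes(x = HeadLength.mm)) +
  geom_histogram(binwidth = 10)  # every bin is 10 mm wide

# Hypothetical helper: the distinct bin widths that ggplot computed
bin_widths <- function(p) {
  d <- layer_data(p, 1)
  unique(round(d$xmax - d$xmin, 6))
}

bin_widths(p_width)               # a single value: 10
sum(layer_data(p_width, 1)$count) # every observation lands in some bin
```

Try a few bin widths when exploring: very narrow bins look noisy, while very wide bins can hide features such as a second peak.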

Alternatives to histograms include frequency polygons (geom_freqpoly()) and, when you have large sample sizes, kernel density plots (geom_density()):

ggplot(data = PolarBearData, aes(x = HeadLength.mm)) +
  geom_density(fill = "lightgrey") +
  labs(x = "Head Length (mm)", y = "Density") +
  theme_few()

As usual, you can colour each sex separately using colour and/or fill.

ggplot(data = PolarBearData, aes(x = HeadLength.mm, colour = Sex, fill = Sex)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("red", "blue")) +
  scale_colour_manual(values = c("red", "blue")) +
  labs(x = "Head Length (mm)", y = "Density") +
  theme_few()

Or you can create separate plots for males and females by faceting with the facet_wrap() function:

ggplot(data = PolarBearData, aes(x = HeadLength.mm)) +
  geom_density(fill = "lightgrey") +
  facet_wrap(~Sex, nrow = 2) +
  labs(x = "Head Length (mm)", y = "Density") +
  theme_few()

Box plot (discrete x, continuous y)

geom_boxplot() creates a box plot. It is most useful when you have sample sizes of \(n \ge 10\). By default, a box plot shows a central line for the median; the hinges of the box mark the first and third quartiles, so the box spans the interquartile range (IQR; i.e., the central 50% of the data); and the whiskers extend to the most extreme observations that lie within \(1.5 \times IQR\) of the hinges. Data points that lie more than \(1.5 \times IQR\) beyond the hinges are classified as outliers and shown as dots.
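The rules above can be worked through by hand. This sketch uses ten made-up numbers and base R only; it follows the default rules as I understand them (quartiles from quantile(), fences at 1.5 × IQR beyond the hinges).

```r
# Ten made-up measurements; 25 is deliberately extreme
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 25)

q <- quantile(x, c(0.25, 0.5, 0.75))  # lower hinge, median, upper hinge
iqr <- q[[3]] - q[[1]]                # 8.75 - 4.25 = 4.5

# Fences: 1.5 x IQR beyond each hinge (measured from the box, not the median)
lower_fence <- q[[1]] - 1.5 * iqr     # -2.5
upper_fence <- q[[3]] + 1.5 * iqr     # 15.5

is_outlier <- x < lower_fence | x > upper_fence
whiskers <- range(x[!is_outlier])     # 2 and 10
outliers <- x[is_outlier]             # 25
```

Note that the whiskers stop at actual data values (2 and 10), not at the fences themselves.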

If you have small sample sizes, it can be helpful to show the raw data points by adding a dotplot (i.e., geom_jitter(height = 0)). With larger sample sizes, it can be helpful to add a violin plot (i.e., geom_violin()) to show the distribution of the underlying data.

ggplot(data = PolarBearData, aes(x = Sex, y = HeadLength.mm)) +
  geom_violin(fill = "lightgrey") + 
  geom_boxplot(width = 0.25) +
  labs(y = "Head Length (mm)") +
  theme_few()

Bar plot (discrete x)

geom_bar() will create a bar plot. The y-axis is the number of observations that fall into each category of x. To remove the legend, we can add theme(legend.position = "none").

ggplot(data = PolarBearData, aes(x = Sex, fill = Sex)) +
  geom_bar() +
  scale_fill_manual(values = c("red", "blue")) +
  labs(y = "Count") +
  theme_few() +
  theme(legend.position = "none")

Bubble plot (discrete x, discrete y)

geom_count() creates a bubble plot. It shows the number of observations that fall into each combination of categories.

ggplot(data = PolarBearData, aes(x = Sex, y = AgeClass)) +
  geom_count(shape = "square") +
  scale_size_area() +
  labs(x = "Sex", y = "Age Class", size = "Count") +
  theme_few()
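The bubble sizes correspond to a two-way table of counts, which you can also compute directly. This sketch uses six made-up bears rather than the real dataset.

```r
# Six made-up bears
bears <- data.frame(
  Sex      = factor(c("Female", "Female", "Male", "Male", "Male", "Female")),
  AgeClass = factor(c("Adult", "Subadult", "Adult", "Adult", "Subadult", "Adult"))
)

# Number of observations in each Sex x AgeClass combination
counts <- with(bears, table(Sex, AgeClass))
counts

# Long format: one row per combination, with its count in the Freq column
as.data.frame(counts)
```

Comparing the table with the plot is a quick way to check that the bubbles show what you expect.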

ggpairs

The ggpairs() function (in the GGally package) is an easy way to produce a scatterplot matrix that shows every two-way combination of your variables. Here, we will look at every two-way combination of age class, sex, body length, and head length. What information can you take from this plot?

SelectedData <- PolarBearData[,c("AgeClass", "Sex", "BodyLength.cm", "HeadLength.mm")]
ggpairs(SelectedData) +
  theme_few()

Saving a plot

To save a plot to a file, use ggsave(). By default, this will save the last plot that you created. For example:

  • ggsave("plot.pdf") would save the last plot as a PDF file (vector graphics)
    • If submitting your figure to a print journal, you should save it as a PDF using the greyscale (colormodel = "grey") or CMYK (colormodel = "cmyk") options to ensure that the colours appear as you expect when printed
  • ggsave("plot.jpeg") would save the last plot as a JPEG file (raster graphics)
    • To change image resolution and image size, use the dpi, width, height, and units options.
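As a sketch of these options, the code below saves a small made-up plot to a temporary folder. Passing the plot explicitly with plot = avoids relying on "the last plot". The file names and sizes are arbitrary choices for this example.

```r
library(ggplot2)

# A small made-up plot to save
p_save <- ggplot(data.frame(x = 1:10, y = (1:10)^2), aes(x = x, y = y)) +
  geom_point()

# Vector graphics: size is set by width, height, and units
pdf_path <- file.path(tempdir(), "plot.pdf")
ggsave(pdf_path, plot = p_save, width = 12, height = 8, units = "cm")

# Raster graphics: dpi also matters (skipped if this R build lacks JPEG support)
if (capabilities("jpeg")) {
  jpeg_path <- file.path(tempdir(), "plot.jpeg")
  ggsave(jpeg_path, plot = p_save, width = 12, height = 8, units = "cm",
         dpi = 300)
}
```

In your own work, replace the temporary paths with a file name in your working directory.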

This week’s task

On D2L, I have added two CSV files. Explore each of these datasets graphically. What can you learn about the data? Try to create a variety of different plots (e.g., 2-D and 1-D scatterplots, bar plots, box plots, etc.) to appropriately visualize the data.

  1. MammalData.csv: This is the mammal dataset that you worked with last week. Please see last week’s tutorial for a description of the data and its source.
  2. milk.csv: This dataset (from https://onlinelibrary.wiley.com/doi/abs/10.1002/evan.20289) contains information about the composition of milk of various primates. The dataset contains eight columns and 29 data rows (plus the header). The columns are:
    • clade: Broad taxonomic group
    • species: Species name
    • kcal.per.g: Kilocalories per gram of milk
    • perc.fat: Percent fat
    • perc.protein: Percent protein
    • perc.lactose: Percent lactose
    • mass: Body mass of mother, in kilograms
    • neocortex.perc: Percent of brain mass that is neocortex

  1. See the Data Visualization Cheatsheet in the Resources folder on D2L.