Today we will be using graphical methods to do some basic exploratory data analysis.
Remember to set your working directory before you start. (See last week’s tutorial if you have forgotten how.) You also need to load today’s packages. If they are not already installed, you will need to install them first. (Ask me how.)
library(ggplot2)
library(ggthemes)
library(GGally)
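If you are not sure whether these packages are installed, one common way to check is sketched below. The object names (pkgs, installed, missing_pkgs) are made up for this illustration, and the install line is left commented out so that nothing is installed without your say-so.

```r
# Packages used in this lab
pkgs <- c("ggplot2", "ggthemes", "GGally")

# TRUE/FALSE for each package: is it installed on this machine?
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # Uncomment the next line to install them from CRAN
  # install.packages(missing_pkgs)
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
}
```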
For today’s examples, I will load a CSV file of (fake but realistic) data on polar bears, including sex, age class, and body measurements such as head length and body length. We can relabel some of the values so that they look better on plots.
PolarBearData <- read.csv("PolarBearData.csv")
PolarBearData$AgeClass <- factor(PolarBearData$AgeClass,
levels = c("ADULT", "SUBADULT"),
labels = c("Adult", "Subadult"))
PolarBearData$Sex <- factor(PolarBearData$Sex,
levels = c("F", "M"),
labels = c("Female", "Male"))
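To see what factor() does with levels and labels, here is a small sketch on a made-up vector (raw_sex is hypothetical and stands in for the Sex column of the CSV file). Note that any value not listed in levels is turned into NA, so it is worth checking the result with table().

```r
# Hypothetical raw codes, standing in for the Sex column of PolarBearData.csv
raw_sex <- c("F", "M", "M", "F", "f")

# Map the raw codes (levels) to nicer display names (labels)
sex <- factor(raw_sex, levels = c("F", "M"), labels = c("Female", "Male"))

# The new display labels
levels(sex)

# Counts per label; the lower-case "f" is not a listed level, so it becomes NA
table(sex, useNA = "ifany")
```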
There are three major systems for plotting graphics in R: base graphics, lattice, and ggplot2.
For basic plots, base graphics (such as the plot() and
hist() functions) are easy and attractive. But for more
complicated plots, or for data exploration where you are exploring
multiple variables, it is easiest to use ggplot().
For the remainder of today’s lab, we will be working with the ggplot2 package. This package includes powerful methods for visual data exploration.
ggplot is based on the “grammar of graphics”. In this system, you build a plot one layer at a time. For example, suppose I want to create this scatter plot, which also shows linear regressions of head length on body length for both males and females:
The code that I used to create this graph was:
ggplot(data = PolarBearData,
aes(x = BodyLength.cm, y = HeadLength.mm, colour = Sex, fill = Sex, shape = Sex)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
scale_fill_manual(values = c("red", "blue")) +
scale_colour_manual(values = c("red", "blue")) +
labs(x = "Body Length (cm)", y = "Head Length (mm)") +
theme_few()
Let’s break this down. Every ggplot2 plot must include three things:
- Data: the dataset that you want to plot.
- Aesthetics: mappings from columns of your data to aspects of the plot, defined with the aes() function.
- Geometries: the shapes that are drawn, defined with the geom_ functions. There are many geoms, depending on whether your x- and y-axes are discrete or continuous.[1]
The first step to creating ggplot graphics is to call the ggplot() function. Inside the ggplot() function, you define all of the common elements of your plot. Generally, this means you want to define the dataset and specify the axes. The data can be defined as data = PolarBearData. The x- and y-axes are specified as aesthetics inside a function called aes(). “Aesthetics” are various aspects of the plot (such as axes, colours, etc.) that will be mapped to columns of your data.
This command creates a new (empty) plot with BodyLength.cm on the x-axis and HeadLength.mm on the y-axis.
ggplot(data = PolarBearData, aes(x = BodyLength.cm, y = HeadLength.mm))
The next step is to plot the data as a new layer on top of the graph
we set up. We can do this using one of ggplot’s geom
(i.e., geometry) functions. We will use geom_point() to add
points to our plot. Note that you add this element to the plot with a
+:
ggplot(data = PolarBearData, aes(x = BodyLength.cm, y = HeadLength.mm)) +
geom_point()
If we want to colour our data points by another variable, we can do it easily with an aesthetic. For example, to colour data points by Sex:
ggplot(data = PolarBearData, aes(x = BodyLength.cm, y = HeadLength.mm, colour = Sex)) +
geom_point()
To add a line of best fit to our plot, we can add
geom_smooth().
By default, geom_smooth() adds a line produced using a
smoothing technique called LOESS. LOESS is useful for exploring patterns
without assuming that the relationship between variables is necessarily
a straight line. LOESS is essentially a moving (weighted) average
calculated within a sliding window. To change the width of the sliding
window, you can use the span option. When datasets are
large (i.e., \(n \ge 1000\)),
geom_smooth() uses a generalized additive model (GAM)
rather than LOESS.
ggplot(data = PolarBearData, aes(x = BodyLength.cm, y = HeadLength.mm)) +
geom_point() +
geom_smooth()
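To get a feel for the span option, here is a rough sketch on simulated data (the sim data frame is made up and is not the polar bear dataset). It fits base R's loess() with a narrow and a wide window, compares how closely each follows the data, and then passes the same span to geom_smooth().

```r
library(ggplot2)

# Simulated data (not the polar bear dataset): a noisy curve
set.seed(1)
sim <- data.frame(x = seq(0, 10, length.out = 200))
sim$y <- sin(sim$x) + rnorm(200, sd = 0.3)

# Base R loess() with a narrow and a wide sliding window
fit_narrow <- loess(y ~ x, data = sim, span = 0.3)
fit_wide   <- loess(y ~ x, data = sim, span = 0.9)

# Residual sum of squares: smaller means the curve tracks the data more closely
rss_narrow <- sum(residuals(fit_narrow)^2)
rss_wide   <- sum(residuals(fit_wide)^2)

# The same span option, used inside geom_smooth()
p_narrow <- ggplot(data = sim, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", span = 0.3)
p_narrow
```

A narrow window gives a wigglier line that follows local bumps; a wide window gives a smoother line that can miss real curvature.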
To force geom_smooth() to use a linear regression, add
the method = "lm" option:
ggplot(data = PolarBearData, aes(x = BodyLength.cm, y = HeadLength.mm)) +
geom_point() +
geom_smooth(method = "lm")
We can change the colour of a line and the shading of its confidence interval by adding aesthetics. If we add Sex as an aesthetic, ggplot will display males and females distinctly, and conduct regressions on them separately:
ggplot(data = PolarBearData, aes(x = BodyLength.cm, y = HeadLength.mm,
colour = Sex, fill = Sex)) +
geom_point() +
geom_smooth(method = "lm")
We can also add a shape aesthetic so that the groups remain distinguishable for colour-blind readers. The ggthemes package also includes colour-blind-friendly palettes, which we can specify with scale_color_colorblind() and scale_fill_colorblind().
ggplot(data = PolarBearData, aes(x = BodyLength.cm, y = HeadLength.mm,
colour = Sex, fill = Sex, shape = Sex)) +
geom_point() +
geom_smooth(method = "lm") +
scale_color_colorblind() +
scale_fill_colorblind()
To change ggplot’s default colours (a salmon pink and a turquoise for a two-level variable such as Sex), use scale_colour_manual() to choose the colours of points and lines, and scale_fill_manual() to choose the colours of filled areas such as the confidence bands. (Note: I have placed a cheat sheet for colours in R on D2L.)
We can also use alpha to set the transparency of the data
points. (Note: alpha = 0 is completely transparent; alpha = 1 is
completely opaque.)
ggplot(data = PolarBearData,
aes(x = BodyLength.cm, y = HeadLength.mm, colour = Sex,
fill = Sex, shape = Sex)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
scale_fill_manual(values = c("red", "blue")) +
scale_colour_manual(values = c("red", "blue"))
We can use the labs() function to change the x-
and y-axis labels:
ggplot(data = PolarBearData,
aes(x = BodyLength.cm, y = HeadLength.mm, colour = Sex, fill = Sex, shape = Sex)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
scale_fill_manual(values = c("red", "blue")) +
scale_colour_manual(values = c("red", "blue")) +
labs(x = "Body Length (cm)", y = "Head Length (mm)")
Our last step will be to change the default ggplot theme. Although
ggplot is very powerful, its default settings (i.e., grey background
with white lines) look terrible when printed. The ggthemes package
contains many ggplot themes. I like theme_few():
ggplot(data = PolarBearData,
aes(x = BodyLength.cm, y = HeadLength.mm, colour = Sex, fill = Sex, shape = Sex)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
scale_fill_manual(values = c("red", "blue")) +
scale_colour_manual(values = c("red", "blue")) +
labs(x = "Body Length (cm)", y = "Head Length (mm)") +
theme_few()
Scatterplots can be further improved by showing the univariate
distribution of each variable in the margins of the figure (e.g., using
the ggMarginal() function from the ggExtra package).
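As a rough sketch of the marginal-distribution idea, the code below uses simulated measurements (the sim data frame and its body and head columns are made up). ggExtra is not one of the packages loaded at the top of this lab, so the example checks for it before calling ggMarginal().

```r
library(ggplot2)

# Simulated measurements (not the real polar bear data)
set.seed(2)
sim <- data.frame(body = rnorm(100, mean = 200, sd = 15))
sim$head <- 150 + sim$body + rnorm(100, sd = 10)

# An ordinary scatterplot
p <- ggplot(data = sim, aes(x = body, y = head)) +
  geom_point(alpha = 0.5)

# Add a histogram of each variable in the margins, if ggExtra is installed
if (requireNamespace("ggExtra", quietly = TRUE)) {
  p_marginal <- ggExtra::ggMarginal(p, type = "histogram")
  print(p_marginal)
}
```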
A histogram plots the sample distribution as binned counts or proportions. Histograms are useful for determining the shape of a distribution, especially when you have moderate to large sample sizes (i.e., \(n \ge 20\)). To define a histogram, map your variable of interest to the x-axis. Do not define a y-axis aesthetic: the count in each bin becomes the y-axis automatically:
ggplot(data = PolarBearData, aes(x = HeadLength.mm)) +
geom_histogram(fill = "lightgrey") +
labs(x = "Head Length (mm)", y = "Count") +
theme_few()
You can change the width of a bin using the binwidth
option. By default, ggplot uses a total of 30 bins. To change the bin
width to 10 mm:
ggplot(data = PolarBearData, aes(x = HeadLength.mm)) +
geom_histogram(fill = "lightgrey", binwidth = 10) +
labs(x = "Head Length (mm)", y = "Count") +
theme_few()
Alternatives to histograms include frequency polygons (geom_freqpoly()) and, when you have large sample sizes, kernel density plots (geom_density()):
ggplot(data = PolarBearData, aes(x = HeadLength.mm)) +
geom_density(fill = "lightgrey") +
labs(x = "Head Length (mm)", y = "Density") +
theme_few()
As usual, you can colour each sex separately using
colour and/or fill.
ggplot(data = PolarBearData, aes(x = HeadLength.mm, colour = Sex, fill = Sex)) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("red", "blue")) +
scale_colour_manual(values = c("red", "blue")) +
labs(x = "Head Length (mm)", y = "Density") +
theme_few()
Or you can create separate plots for males and females by faceting with the facet_wrap() function:
ggplot(data = PolarBearData, aes(x = HeadLength.mm)) +
geom_density(fill = "lightgrey") +
facet_wrap(~Sex, nrow = 2) +
labs(x = "Head Length (mm)", y = "Density") +
theme_few()
geom_boxplot() creates a box plot. It is most useful when you have sample sizes of \(n \ge 10\). By default, a box plot shows a central line for the median. The hinges of the box mark the first and third quartiles, so the box spans the interquartile range (IQR; i.e., the central 50% of the data). Each whisker extends to the most extreme observation that lies within \(1.5 \times IQR\) of the nearest hinge. Data points that lie more than \(1.5 \times IQR\) beyond a hinge are classified as outliers and shown as dots.
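The box plot rules can be checked by hand. This sketch uses a small made-up sample and measures the \(1.5 \times IQR\) distance from the hinges of the box, which is how the ggplot2 documentation describes the whiskers. The variable names are invented for the illustration.

```r
# Made-up sample: nine typical values and one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Lower and upper hinge (first and third quartiles)
hinges <- unname(quantile(x, probs = c(0.25, 0.75)))

# Interquartile range: the height of the box
iqr <- hinges[2] - hinges[1]

# Fences: 1.5 * IQR beyond each hinge
lower_fence <- hinges[1] - 1.5 * iqr
upper_fence <- hinges[2] + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

# Anything outside the fences is drawn as an outlier dot
outliers <- x[x < lower_fence | x > upper_fence]

whiskers  # 1 and 9
outliers  # 30
```

Here the hinges are 3.25 and 7.75, so the IQR is 4.5 and the fences sit at -3.5 and 14.5. The value 30 falls outside the upper fence and would be drawn as a dot.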
If you have small sample sizes, it can be helpful to show the raw
data points by adding a dotplot (i.e.,
geom_jitter(height = 0)). With larger sample sizes, it can
be helpful to add a violin plot (i.e., geom_violin()) to
show the distribution of the underlying data.
ggplot(data = PolarBearData, aes(x = Sex, y = HeadLength.mm)) +
geom_violin(fill = "lightgrey") +
geom_boxplot(width = 0.25) +
labs(y = "Head Length (mm)") +
theme_few()
geom_bar() will create a bar plot. The y-axis
is the number of observations that fall into category x. To
remove the legend, we can add
theme(legend.position = "none").
ggplot(data = PolarBearData, aes(x = Sex, fill = Sex)) +
geom_bar() +
scale_fill_manual(values = c("red", "blue")) +
labs(y = "Count") +
theme_few() +
theme(legend.position = "none")
geom_count() creates a bubble plot. It shows the number
of observations that fall into each combination of categories.
ggplot(data = PolarBearData, aes(x = Sex, y = AgeClass)) +
geom_count(shape = "square") +
scale_size_area() +
labs(x = "Sex", y = "Age Class", size = "Count") +
theme_few()
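The counts behind a bubble plot are just a cross-tabulation, which you can inspect directly with table(). The sketch below uses a tiny made-up data frame (bears is hypothetical and stands in for two columns of PolarBearData).

```r
# Hypothetical records, standing in for two columns of PolarBearData
bears <- data.frame(
  Sex      = c("Female", "Female", "Male", "Male", "Male", "Female"),
  AgeClass = c("Adult", "Subadult", "Adult", "Adult", "Subadult", "Adult")
)

# Cross-tabulate: one cell per combination of Sex and AgeClass
counts <- table(bears$Sex, bears$AgeClass)
counts

# The same counts in long format, one row per combination
as.data.frame(counts)
```

Each cell of the table corresponds to one symbol in the geom_count() plot, with the symbol's area scaled to the count.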
The ggpairs() function (in the GGally package) is an
easy way to produce a scatterplot matrix that shows every two-way
combination of your variables. Here, we will look at every two-way
combination of age class, sex, body length, and head length. What
information can you take from this plot?
SelectedData <- PolarBearData[,c("AgeClass", "Sex", "BodyLength.cm", "HeadLength.mm")]
ggpairs(SelectedData) +
theme_few()
To save a plot to a file, use ggsave(). By default, this will save the last plot that you created. For example:
- ggsave("plot.pdf") would save the last plot as a PDF file (vector graphics).
  You can use the greyscale (colormodel = "grey") or CMYK (colormodel = "cmyk") options to ensure that the colours appear as you expect when printed.
- ggsave("plot.jpeg") would save the last plot as a JPEG file (raster graphics).
  You can set the resolution and size using the dpi, width, height, and units options.
On D2L, I have added two CSV files. Explore each of these datasets graphically. What can you learn about the data? Try to create a variety of different plots (e.g., 2-D and 1-D scatterplots, bar plots, box plots, etc.) to appropriately visualize the data.
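When you are ready to save your plots, the ggsave() options described above can be combined as in the sketch below. It uses simulated data and temporary file paths (sim, pdf_file, and jpeg_file are made up for this illustration; use your own plot and file names). Passing plot = p explicitly avoids relying on "the last plot".

```r
library(ggplot2)

# Simulated data, only so that there is something to save
set.seed(3)
sim <- data.frame(x = rnorm(50), y = rnorm(50))
p <- ggplot(data = sim, aes(x = x, y = y)) +
  geom_point()

# Temporary file paths (hypothetical; use your own file names in practice)
pdf_file  <- tempfile(fileext = ".pdf")
jpeg_file <- tempfile(fileext = ".jpeg")

# Vector output: 15 cm x 10 cm, with the CMYK colour model passed to the PDF device
ggsave(pdf_file, plot = p, width = 15, height = 10, units = "cm",
       colormodel = "cmyk")

# Raster output at 300 dpi, attempted only if a JPEG device is available
if (capabilities("jpeg") || requireNamespace("ragg", quietly = TRUE)) {
  ggsave(jpeg_file, plot = p, width = 15, height = 10, units = "cm", dpi = 300)
}
```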
[1] See the Data Visualization Cheatsheet in the Resources folder on D2L.