Part 2 is here.
Goal: understand the principles that ggplot is built on, and the steps needed to create a wide variety of basic plots.
We assume you’re familiar with the basic mechanics of R:
So not at this level :)
This is intended to be a hands-on workshop, so we also assume:
Visualizing data is critical:
The x and y mean, standard deviation, and x-y correlation are unchanged throughout this animation.
Another example of this is Anscombe’s Quartet:
All four of these datasets have nearly identical mean(x), mean(y), var(x), var(y), cor(x, y), and regression (intercept, slope, r-squared). 🤯
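You can check this yourself, since R ships the quartet as the built-in anscombe data frame (columns x1 to x4 and y1 to y4). The following is a minimal sketch; the row labels and the stats name are just illustrative choices:

```r
# Summary statistics for each of the four Anscombe datasets
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)  # simple linear regression of y on x
  c(mean_x    = mean(x),
    mean_y    = mean(y),
    var_x     = var(x),
    var_y     = var(y),
    cor_xy    = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared)
})
colnames(stats) <- paste("dataset", 1:4)

# One column per dataset; compare across each row
round(stats, 2)
```

Reading across any row of the result shows how little the four datasets differ in these summaries, even though their plots look nothing alike.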
Lots of research has been done on effective data visualization with respect to science communication. Read a bit of it. For example, here are one author’s ten principles of effective data visualization:
To these I would only add “know your audience”.
Remember, data visualization can have consequences!
One of the simplest datasets included with R is cars:
cars
plot(cars)
That seems pretty good! What’s the problem?
Well, what about iris? This is a famous dataset; from the help (?iris):
This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
iris
Note that each row of iris is an individual flower; there are four observations per row. We’ll come back to this structural point later.
Let’s plot two of its columns against each other, coloring by species:
plot(iris$Sepal.Length, iris$Sepal.Width, col = iris$Species)
# One legend entry, and one color, per species
legend(7, 4.3,
       levels(iris$Species),
       col = seq_along(levels(iris$Species)),
       pch = 1)
This is a bunch of code for such a simple plot; note that:
- Called this way, plot() takes bare numeric vectors, so we need to repeatedly specify iris$<column>
Things quickly get worse if we want more complexity or features. What’s the underlying problem?
Without a grammar, there is no underlying theory, so most graphics packages are just a big collection of special cases.
From the ggplot2 book.
Above we made some scatterplots, perhaps the simplest graph type.
What precisely is a scatterplot? You have seen many before and have probably even drawn some by hand. A scatterplot represents each observation as a point, positioned according to the value of two variables. As well as a horizontal and vertical position, each point also has a size, a colour and a shape. These attributes are called aesthetics, and are the properties that can be perceived on the graphic. Each aesthetic can be mapped to a variable, or set to a constant value.
This insight had been made before Hadley Wickham’s original paper, but in the context of R it laid the ground for ggplot2:
To be precise, the layered grammar defines the components of a plot as:
- a default dataset and set of mappings from variables to aesthetics,
- one or more layers, with each layer having one geometric object, one statistical transformation, one position adjustment, and optionally, one dataset and set of aesthetic mappings,
- one scale for each aesthetic mapping used,
- a coordinate system,
- the facet specification.
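To make these components concrete, here is a sketch that spells out each one explicitly for a scatterplot of iris. In everyday use most of these are filled in by defaults, so this verbose form is for illustration only; the variable name p is arbitrary:

```r
library(ggplot2)

p <- ggplot(iris,                                   # default dataset
            aes(x = Sepal.Length,                   # default mappings from
                y = Sepal.Width,                    # variables to aesthetics
                color = Species)) +
  layer(geom = "point",                             # one layer: a geometric object,
        stat = "identity",                          # a statistical transformation,
        position = "identity") +                    # and a position adjustment
  scale_x_continuous() +                            # one scale per mapped aesthetic
  scale_y_continuous() +
  scale_color_discrete() +
  coord_cartesian() +                               # a coordinate system
  facet_null()                                      # the facet specification (none)

p
```

With the defaults left implicit, the same plot is usually written as ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point().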
We are learning about (a subset of) these steps today.
Say we have a plot we want to make, a slightly more complicated version of Wickham (2010) Figure 2 above:
In the grammar of graphics / ggplot2 system, plots are built up from sequential layers: these are procedural steps, but also literal visual layers, the net result of which is the final plot. Later steps can modify and override what’s ‘presented’ by previous layers.
Visually:
We’re going to walk through these layers, one by one.
The first (or in back-to-front numbering, as in the image above, the seventh) step involves our data.
As noted above, the structure of our data has implications for how we plot it; more precisely, to effectively use ggplot2 we want our data to be structured a certain way. But again 😄 let’s come back to that point.
Generally, our data for plotting should be in tabular format, with rows and named columns. In R this is typically a data.frame or a tibble.
Hey, iris is a data frame. Let’s call ggplot() on it!
library(ggplot2)
ggplot(iris)
Well, that was disappointing.
Remember how easy plot(cars) was above…why didn’t anything happen here? Well, ggplot() doesn’t know how to map our plot aesthetics to our data, and it doesn’t know what geom to use for subsequent visualization.
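As a rough sketch of what supplying those two missing pieces looks like (the choice of columns and of geom_point() here is just an example):

```r
library(ggplot2)

# Data plus a mapping: ggplot2 can now draw axes, but there is still no geom
p_axes <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))
p_axes

# Adding a geom layer finally puts something on the canvas
p_points <- p_axes + geom_point()
p_points
```

The first plot shows labelled axes and an empty panel; only the second one draws the observations.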
As we said above, the aesthetics of each layer in our plot can either be
- constant, or
- mapped to a column of data
Inverting this statement means that
- any non-constant aesthetic has to be its own column in the data
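The difference between the two cases can be sketched as follows. Inside aes() an aesthetic is mapped to a column; outside aes() it is set to a constant. The color "steelblue" and the object names are arbitrary choices for illustration:

```r
library(ggplot2)

base <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))

# Mapped: color varies with the Species column, and a legend is drawn
p_mapped <- base + geom_point(aes(color = Species))

# Constant: every point gets the same color, and no legend is needed
p_constant <- base + geom_point(color = "steelblue")

p_mapped
p_constant
```

Note that the mapped version only works because Species exists as a column of the data.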
This idea of mapping aesthetics to columns thus has implications for the structure of our data.
Remember what iris looks like:
This is problematic. What if we wanted an aesthetic like color to depend on what dimension or organ we’re measuring?
iris is structured in a form convenient for humans, but not one particularly handy for computers.
In general it’s best to start with your data in “tidy” form, a.k.a. long form, when preparing to use ggplot2. This means that every row contains exactly one observation; specifically:
With all this in mind, it’s clear we need to reshape our data. Let’s assume, for the rest of this workshop, that we’re particularly interested in comparing observations of petals versus those of sepals:
# Here we use base R's "reshape" function
# There are many alternatives; in particular, check out
# the powerful "tidyr" package
iris_long <- reshape(iris,
                     varying = c("Sepal.Length",
                                 "Sepal.Width",
                                 "Petal.Length",
                                 "Petal.Width"),
                     timevar = "dimension",
                     direction = "long")
iris_long
Note that this is not strictly “tidy data”, per the definition above. Why not?
With this reshaping, we can proceed to map aesthetics to columns.
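As a quick sketch of what the reshaping buys us: dimension is now its own column, so it can be mapped to an aesthetic such as color. The block repeats the reshape() call so that it runs on its own, and the particular choice of axes is only an illustration:

```r
library(ggplot2)

# Same reshaping as above: one row per flower per dimension (Length / Width)
iris_long <- reshape(iris,
                     varying = c("Sepal.Length", "Sepal.Width",
                                 "Petal.Length", "Petal.Width"),
                     timevar = "dimension",
                     direction = "long")

# Sepal and Petal measurements on the axes, with color mapped to the new column
p_long <- ggplot(iris_long, aes(x = Sepal, y = Petal, color = dimension)) +
  geom_point()

p_long
```

This mapping was not possible with the original iris, where length and width lived in separate columns rather than being values of a single variable.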