Looking at the numbers and character strings that define a dataset is rarely useful. To convince yourself, print and stare at this data table:
library(tidyverse)
library(dslabs)
data(murders)
head(murders)
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
What do you learn from staring at this table? How quickly can you determine which states have the largest populations? Which states have the smallest? How large is a typical state? Is there a relationship between population size and total murders? How do murder rates vary across regions of the country? For most human brains it is quite difficult to extract this information just from looking at the numbers. In contrast, the answers to all the questions above are readily available from examining this plot.
We are reminded of the saying “a picture is worth a thousand words”. Data visualization provides a powerful way to communicate a data-driven finding. In some cases, the visualization is so convincing that no follow-up analysis is required. We also note that many widely used data analysis tools were initiated by discoveries made via exploratory data analysis (EDA). EDA is perhaps the most important part of data analysis, yet is often overlooked.
With the talks New Insights on Poverty and The Best Stats You’ve Ever Seen, Hans Rosling forced us to notice the unexpected with a series of plots related to world health and economics. In his videos, he used animated graphs to show us how the world was changing and that old narratives are no longer true. We will use this data as an example to learn about ggplot2 and data visualization.
It is also important to note that mistakes, biases, systematic errors and other unexpected problems often lead to data that should be handled with care. Failure to discover these problems often leads to flawed analyses and false discoveries. As an example, consider that measurement devices sometimes fail and that most data analysis procedures are not designed to detect these failures. Yet, these data analysis procedures will still give you an answer. The fact that it can be hard or impossible to notice an error just from the reported results makes data visualization particularly important.
Today we will learn the basics of the ggplot2 package, the software we will use to learn data visualization and exploratory data analysis. We will use motivating examples and start by reproducing the murders by state example to learn the basics of ggplot2. Then we will cover world health and economics and infectious disease trends in the United States.
Note that there is much more to data visualization than what we cover here. More references include:
We will cover the basics of interactive graphics later in this course. If you want to check out interactive graphs now, below are some useful resources for learning more.
We have learned several data visualization techniques and are ready to learn how to create them in R. We will be using the ggplot2 package. We can load it, along with dplyr, as part of the tidyverse:
library(tidyverse)
One reason ggplot2 is generally more intuitive for beginners is that it uses a grammar of graphics, the gg in ggplot2. This is analogous to the way learning grammar can help a beginner construct hundreds of different sentences by learning just a handful of verbs, nouns and adjectives without having to memorize each specific sentence. Similarly, by learning a handful of ggplot2 building blocks and its grammar, you will be able to create hundreds of different plots.
Another reason ggplot2 is easier for beginners is that its default behavior is carefully chosen to satisfy the great majority of cases and to be aesthetically pleasing. As a result, it is possible to create informative and elegant graphs with relatively simple and readable code.
One limitation is that ggplot2 is designed to work exclusively with data tables in which rows are observations and columns are variables. However, a substantial percentage of datasets that beginners work with are, or can be converted into, this format. An advantage of this approach is that assuming our data follows this format simplifies both the code and the learning of the grammar.
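As a rough illustration of the format described above, the sketch below reshapes a small wide table into one row per observation. The `wide` data frame and its values are made up for this example and are not part of the murders data; pivot_longer comes from the tidyr package, which is loaded with the tidyverse.

```r
library(tidyr)

# Hypothetical wide table: one row per state, one column per year (made-up values)
wide <- data.frame(
  state = c("A", "B"),
  `2010` = c(10, 20),
  `2011` = c(12, 18),
  check.names = FALSE
)

# Reshape so each row is one observation (a state in a year)
# and each column is one variable
long <- pivot_longer(wide, cols = -state, names_to = "year", values_to = "total")
long
```

After reshaping, `year` and `total` are ordinary columns, so they can be used directly in an aesthetic mapping.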
To use ggplot2 you will have to learn several functions and arguments. These are hard to memorize, so we highly recommend you have a ggplot2 cheat sheet handy.
We construct a graph that summarizes the US murders dataset.
library(dslabs)
data(murders)
We can clearly see how much states vary across population size and the total number of murders. Not surprisingly, we also see a clear relationship between murder totals and population size. A state falling on the dashed grey line has the same murder rate as the US average. The four geographic regions are denoted with color, which shows that most southern states have murder rates above the average.
This data visualization shows us pretty much all the information in the data table. The code needed to make this plot is relatively simple. We will learn to create the plot part by part.
The first step in learning ggplot2 is to be able to break a graph apart into components. Let’s break down this plot and introduce some of the ggplot2 terminology. The three main components to note are:
We also note that:
We will now construct the plot piece by piece.
ggplot object
The first step in creating a ggplot2 graph is to define a ggplot object. We do this with the function ggplot, which initializes the graph. If we read the help file for this function, we see that the first argument is used to specify which data is associated with this object:
ggplot(data = murders)
We can also pipe the data. So this line of code is equivalent to the one above:
murders %>% ggplot()
Note that it renders a plot, in this case a blank slate since no geometry has been defined. The only style choice we see is a grey background.
What has happened above is that the object was created and because it was not assigned, it was automatically evaluated. But note that we can define an object, for example like this:
p <- ggplot(data=murders)
class(p)
## [1] "gg" "ggplot"
To render the plot associated with this object we simply print the object p. The following two lines of code produce the same plot we see above:
print(p)
p
In ggplot2 we create graphs by adding layers. Layers can define geometries, compute summary statistics, define what scales to use, or even change styles. To add layers, we use the symbol +. In general a line of code will look like this:
DATA %>% ggplot() + LAYER 1 + LAYER 2 + … + LAYER N
Usually, the first added layer defines the geometry. We want to make a scatter plot. So what geometry do we use?
Taking a quick look at the cheat sheet we see that the function used to create plots with this geometry is geom_point.
We will see that geometry function names follow this pattern: geom and the name of the geometry connected by an underscore. For geom_point to know what to do, we need to provide data and a mapping. We have already connected the object p with the murders data table, and if we add geom_point as a layer it will default to using this data. To find out what mappings are expected we read the Aesthetics section of the geom_point help file:
Aesthetics
geom_point understands the following aesthetics:
x
y
alpha
colour
and, as expected, we see that at least two arguments are required: x and y.
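The same information can also be inspected from the console. The sketch below assumes that querying the GeomPoint ggproto object, which ggplot2 exports and which sits behind geom_point, is an acceptable shortcut; the text itself only relies on the help file, which remains the authoritative reference.

```r
library(ggplot2)

# GeomPoint is the ggproto object behind geom_point()
required <- GeomPoint$required_aes   # aesthetics that must be supplied
accepted <- GeomPoint$aesthetics()   # every aesthetic the geometry understands

required
accepted
```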
aes
aes will be one of the functions you use most. The function connects data with what we see on the graph. We refer to this connection as the aesthetic mappings. The outcome of this function is often used as the argument of a geometry function. This example produces a scatter plot of total murders versus population in millions:
murders %>% ggplot() + geom_point(aes(x= population/10^6 , y = total))
Note that we can drop the x = and y = if we want to, as these are the first and second expected arguments, as seen on the help page.
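A quick way to convince yourself of this is to compare the names of the two mappings. This is only a sketch; the object names `named` and `positional` are made up for the comparison.

```r
library(ggplot2)

# aes() captures its arguments without evaluating them, so no data is needed here
named      <- aes(x = population/10^6, y = total)
positional <- aes(population/10^6, total)

# Both mappings assign the first expression to x and the second to y
names(named)
names(positional)
```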
Also note that we can add a layer to the p object that was defined above as p <- ggplot(data = murders):
p <- murders %>% ggplot()
p + geom_point(aes(population/10^6, total))
Note that the scale and labels are defined by default when adding this layer. Also notice that we use the variable names from the data component: population and total.
Keep in mind that the behavior of recognizing the variables from the data component is quite specific to aes. With most functions, if you try to access the values of population or total outside of aes you receive an error.
A second layer in the plot we wish to make involves adding a label to each point to identify the state. The geom_label and geom_text functions permit us to add text to the plot, with and without a rectangle behind the text respectively.
Because each state (each point) has a label we need an aesthetic mapping to make the connection. By reading the help file we learn that we supply the mapping between point and label through the label argument of aes. So the code looks like this:
p + geom_point(aes(population/10^6, total)) + geom_text(aes(population/10^6, total, label = abb))
We have successfully added a second layer to the plot.
As an example of the unique behavior of aes mentioned above, note that this call
p_test <- p + geom_text(aes(population/10^6, total, label = abb))
is fine, whereas this call
p_test <- p + geom_text(aes(population/10^6, total), label = abb)
will give you an error, as abb is not found once it is outside of the aes function and geom_text does not know where to find abb because it is not a global variable.
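The difference can be checked without stopping the session by wrapping the failing call in tryCatch. This is a minimal sketch; the names `ok` and `bad` and the placeholder string returned by the error handler are made up for illustration.

```r
library(ggplot2)
library(dslabs)
data(murders)

p <- ggplot(data = murders)

# Inside aes(), abb is looked up in the murders table when the plot is built
ok <- p + geom_text(aes(population/10^6, total, label = abb))

# Outside aes(), abb is treated as an ordinary R variable, which does not exist
bad <- tryCatch(
  p + geom_text(aes(population/10^6, total), label = abb),
  error = function(e) "error: abb not found"
)
bad
```

If a constant label outside aes is really what you want, you have to pass an actual value, for example a string.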
Note that each geometry function has many arguments other than aes and data. They tend to be specific to the function. For example, in the plot we wish to make, the points are larger than the default ones. In the help file we see that size is an aesthetic and we can change it like this:
p + geom_point(aes(population/10^6, total), size=3) + geom_text(aes(population/10^6, total, label = abb))
Note that here size is not a mapping: it is a single value that affects all the points, so we do not need to include it inside aes.
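To make the distinction concrete, the sketch below compares a constant size with a size mapped to a variable and inspects the values ggplot2 computes for each point using layer_data. Mapping size to total is just an illustrative choice and is not part of the plot we are building.

```r
library(ggplot2)
library(dslabs)
data(murders)

base <- ggplot(murders, aes(population/10^6, total))

# Constant: every point gets size 3
p_const <- base + geom_point(size = 3)

# Mapping: size now depends on a column of the data (illustrative only)
p_mapped <- base + geom_point(aes(size = total))

sizes_const  <- unique(layer_data(p_const)$size)
sizes_mapped <- unique(layer_data(p_mapped)$size)

sizes_const            # a single value
length(sizes_mapped)   # many different values
```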
Now that the points are larger, it is hard to see the labels. If we read the help file for geom_text we learn of the nudge_x argument, which moves the text slightly to the right:
p + geom_point(aes(population/10^6, total), size=3) + geom_text(aes(population/10^6, total, label = abb), nudge_x =3)
This is preferred as it makes it easier to read the text.
Note that in the previous line of code, we define the mapping aes(population/10^6, total) twice, once in each geometry. We can avoid this by using a global aesthetic mapping. We can do this when we define the blank slate ggplot object. Remember that the function ggplot contains an argument that permits us to define aesthetic mappings:
args(ggplot)
## function (data = NULL, mapping = aes(), ..., environment = parent.frame())
## NULL
If we define a mapping in ggplot, then all the geometries that are added as layers will default to this mapping. We redefine p:
p <- murders %>%
ggplot(aes(x = population/10^6, y = total, label = abb))
and then we can simply use code like this:
p + geom_point(size = 3) +
geom_text(nudge_x = 1.5)
We keep the size and nudge_x arguments in geom_point and geom_text respectively because we only want to increase the size of the points and nudge only the labels. Also note that the geom_point function does not need a label argument and therefore ignores it.
If we need to, we can override the global mapping by defining a new mapping within each layer. These local definitions override the global. Here is an example:
p + geom_point(size=3) + geom_text(aes(x=10, y = 800, label= "SKO BUFFS"))
Clearly, the call to geom_text does not use population and total for the x and y axes.
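One detail worth checking is that a constant placed inside aes is still evaluated once per row of the data, so the label above is drawn once for every row of murders, all at the same position. The sketch below inspects this with layer_data and shows annotate as one possible alternative for a single label; using annotate here is a suggestion and not something the original example does.

```r
library(ggplot2)
library(dslabs)
data(murders)

p <- ggplot(murders, aes(x = population/10^6, y = total, label = abb))

# Local mapping overrides the global one, but is recycled to every row
overridden <- p + geom_point(size = 3) +
  geom_text(aes(x = 10, y = 800, label = "SKO BUFFS"))
text_data <- layer_data(overridden, 2)   # data used by the second layer
nrow(text_data)

# annotate() adds a single text element that is not tied to the data
single <- p + geom_point(size = 3) +
  annotate("text", x = 10, y = 800, label = "SKO BUFFS")
nrow(layer_data(single, 2))
```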
Recall that our desired plot uses a log scale on both axes. This is not the default, so this change needs to be added through a scales layer. A quick look at the cheat sheet reveals that the scale_x_continuous function lets us edit the behavior of scales. We use it like this:
p + geom_point(size = 3) +
geom_text(nudge_x =0.05) +
scale_x_continuous(trans = "log10") +
scale_y_continuous(trans = "log10")
Because we are in the log-scale now, the nudge must be made smaller.
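The reason is that the nudge is now applied in log10 units, so it acts as a multiplicative shift on the original scale. The arithmetic below is a small sketch of that idea; the example positions of 1 and 10 million are arbitrary.

```r
nudge <- 0.05

# Adding 0.05 on the log10 scale multiplies the original value by 10^0.05
shift_factor <- 10^nudge
shift_factor            # about 1.12, i.e. roughly a 12% shift to the right

# Where labels for points at 1 and 10 million would be drawn
x_millions <- c(1, 10)
nudged <- 10^(log10(x_millions) + nudge)
nudged

# The earlier nudge of 1.5 would instead multiply positions by about 31.6
10^1.5
```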
This particular transformation is so common that ggplot2 provides specialized functions:
p + geom_point(size = 3) +
geom_text(nudge_x =0.05) +
scale_x_log10() +
scale_y_log10()
Similarly, the cheat sheet quickly reveals that to change labels and add a title we use the following functions: xlab, ylab and ggtitle.
p + geom_point(size = 3) +
geom_text(nudge_x = 0.05) +
scale_x_log10() +
scale_y_log10() +
xlab("Populations in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in 2010") +
theme(plot.title = element_text(hjust = 0.5))
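As an aside, ggplot2 also provides the labs function, which sets several labels in a single call. The sketch below is one possible equivalent of the three separate label functions used above, not a required change; the object name `p_labs` is made up.

```r
library(ggplot2)
library(dslabs)
data(murders)

p_labs <- ggplot(murders, aes(population/10^6, total, label = abb)) +
  geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_log10() +
  scale_y_log10() +
  # labs() sets the axis labels and the title together
  labs(
    x = "Populations in millions (log scale)",
    y = "Total number of murders (log scale)",
    title = "US Gun Murders in 2010"
  ) +
  theme(plot.title = element_text(hjust = 0.5))  # center the title
p_labs
```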
We are almost there! All we have to do is add color, a legend and optional changes to the style.
Note that we can change the color of the points using the color argument in the geom_point function. To facilitate exposition we will redefine p to be everything except the points layer:
p <- murders %>%
ggplot(aes(population/10^6, total, label = abb)) +
geom_text(nudge_x = 0.05) +
scale_x_log10() +
scale_y_log10() +
xlab("Populations in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in 2010")
and then test out what happens by adding different calls to geom_point. We can make all the points blue by adding the color argument:
p + geom_point(size = 3, color = "blue")
This, of course, is not what we want. We want to assign color depending on the geographical region. A nice default behavior of ggplot2 is that if we assign a categorical variable to color, it automatically assigns a different color to each category. It also adds a legend!
To map each point to a color, we need to use aes since this is a mapping. So we use the following code:
p + geom_point(aes(color = region), size = 3)
The x and y mappings are inherited from those already defined in p, so we do not redefine them. We also move aes to the first argument since that is where the mappings are expected in this call.
Here we see yet another useful default behavior: ggplot2 has automatically added a legend that maps color to region.
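To see what the mapping does behind the scenes, the sketch below compares the colours ggplot2 assigns in the two cases using layer_data. The plot objects are rebuilt from scratch so the example stands on its own; the names `p_blue` and `p_region` are made up.

```r
library(ggplot2)
library(dslabs)
data(murders)

base <- ggplot(murders, aes(population/10^6, total))

# Constant colour: one value shared by all points, no legend
p_blue <- base + geom_point(color = "blue", size = 3)

# Mapped colour: one colour per level of the categorical variable, plus a legend
p_region <- base + geom_point(aes(color = region), size = 3)

unique(layer_data(p_blue)$colour)
length(unique(layer_data(p_region)$colour))
nlevels(murders$region)   # same count as the number of colours used
```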
We want to add a line that represents the average murder rate for the entire country. Note that once we determine the per million rate to be \(r\), this line is defined by the formula: \(y = r x\) with \(y\) and \(x\) our axes: total murders and population in millions respectively. In the log-scale this line turns into: \(\log(y) = \log(r) + \log(x)\), where the logarithms are base 10 to match our axes. So in our plot it’s a line with slope 1 and intercept \(\log(r)\). To compute \(r\) we use our dplyr skills:
r <- murders %>%
summarize(murder_rate = sum(total) / sum(population) *10^6) %>%
.$murder_rate
r
## [1] 30.34555
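The same number can be obtained in base R, which also makes it easy to check two points from the text: that \(r\) is the rate for the whole country rather than the average of the state rates, and that \(y = r x\) becomes a line with slope 1 on the log scale. This is a sketch; the name `mean_of_state_rates` and the example values of `x` are made up.

```r
library(dslabs)
data(murders)

# National rate per million: all murders divided by the total population
r <- with(murders, sum(total) / sum(population) * 10^6)
r

# Not the same as averaging the 51 state-level rates, which weights
# small and large states equally
mean_of_state_rates <- with(murders, mean(total / population * 10^6))
mean_of_state_rates

# On the log10 scale, y = r * x has intercept log10(r) and slope 1
x <- c(1, 10, 30)   # populations in millions (arbitrary example values)
all.equal(log10(r * x), log10(r) + 1 * log10(x))
```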
To add a line we use the geom_abline function. ggplot2 uses ab in the name to remind us we are supplying the intercept (a) and slope (b). The default line has slope 1 and intercept 0 so we only have to define the intercept:
p + geom_point(aes(col=region), size=3) + geom_abline(intercept = log10(r))
We can change the line type and color of the lines using arguments. We also draw it first so it doesn’t go over our points.
p <- p + geom_abline(intercept =log10(r), lty=2, color = "darkred") +
geom_point(aes(color = region), size = 3)
p
Note that we redefined p.
The default plots created by ggplot2 are already very useful. But often, we need to make minor tweaks to the default behavior. Although it is not always obvious how to make these even with the cheat sheet, ggplot2 is very flexible.
For example, we can make changes to the legend via the scale_color_discrete function. In our plot the word region is not capitalized. We can change that like this:
p <- p + scale_color_discrete(name= "Region")
p
The power of ggplot2 is augmented further due to the availability of add-on packages. The remaining changes required to put the finishing touches on our plot require the ggthemes and ggrepel packages.
The style of a ggplot graph can be changed using the theme functions. Several themes are included as part of the ggplot2 package. In fact, for most of the plots in this course we use a function in the dslabs package that automatically sets a default theme:
ds_theme_set()
Many other themes are added by the package ggthemes. Among those is the theme_economist theme that we used. After installing the package, you can change the style of the plot by adding a layer:
library(ggthemes)
p + theme_economist()
You can see what some of the other themes look like by simply changing the function. For example, you might try the theme_fivethirtyeight() theme instead.
The final difference has to do with the position of the labels. Note that in our plot, some of the labels fall on top of each other. The add-on package ggrepel includes a geometry that adds labels while ensuring that they don’t fall on top of each other. We simply replace geom_text with geom_text_repel.
So now that we are done testing we can write one piece of code that produces our desired plot from scratch.
library(ggthemes)
library(ggrepel)
# First compute the national murder rate per million; log10(r) is the intercept of the line
r <- murders %>%
summarize(murder_rate= sum(total) / sum(population) * 10^6) %>% .$murder_rate
# Now make the plot
murders %>% ggplot(aes(population/10^6, total, label = abb)) +
geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
geom_point(aes(col=region), size = 3) +
geom_text_repel() +
scale_x_log10() +
scale_y_log10() +
xlab("Populations in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in 2010") +
scale_color_discrete(name = "Region") +
theme_economist()