After completing this worksheet, you should be able to use the ggplot2 package to make basic plots using the grammar of graphics. You may find the ggplot2 documentation or the R Graph Catalog helpful.
In addition to the ggplot2 package, we will use three packages with sample data, and we will load dplyr to get nice printing of data frames. Let’s load them now, and also bring some of the data frames into the global environment.
library(ggplot2)
library(dplyr)
library(gapminder)
data("gapminder")
library(historydata)
data("paulist_missions")
data("naval_promotions")
library(europop) # https://github.com/mdlincoln/europop
data("europop")
# Make the Paulist missions data a little more manageable
library(lubridate)
# Bucket a duration in days into an ordered factor of week ranges.
# Note: this definition masks lubridate's own weeks() function.
weeks <- function(x) {
  w <- ifelse(x < 7, "< 1 weeks",
              ifelse(x <= 14, "1-2 weeks",
                     ifelse(x <= 21, "2-3 weeks",
                            "> 3 weeks")))
  factor(w, levels = c("< 1 weeks", "1-2 weeks", "2-3 weeks", "> 3 weeks"),
         ordered = TRUE)
}
paulist_missions <- paulist_missions %>%
  mutate(start_date = mdy(start_date),
         end_date = mdy(end_date),
         year = year(start_date),
         # Ask difftime() for days explicitly instead of converting units by hand
         days = as.numeric(difftime(end_date, start_date, units = "days")),
         duration = weeks(days))
## Warning: 1 failed to parse.
paulists_by_year <- paulist_missions %>%
  group_by(year) %>%
  summarize(converts = sum(converts, na.rm = TRUE),
            confessions = sum(confessions, na.rm = TRUE))
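To see how the duration buckets behave at their boundaries, here is a minimal sketch. The function name `duration_bucket` and the sample values are hypothetical; the function simply mirrors the logic of the `weeks()` helper above so that the example runs on its own.

```r
# Hypothetical stand-in that mirrors the weeks() helper defined above
duration_bucket <- function(x) {
  labels <- c("< 1 weeks", "1-2 weeks", "2-3 weeks", "> 3 weeks")
  w <- ifelse(x < 7, labels[1],
              ifelse(x <= 14, labels[2],
                     ifelse(x <= 21, labels[3], labels[4])))
  factor(w, levels = labels, ordered = TRUE)
}

# Made-up durations in days, including boundary values and a missing value
sample_days <- c(3, 7, 14, 15, 21, 30, NA)
buckets <- duration_bucket(sample_days)
print(data.frame(days = sample_days, duration = buckets))
```

Note that a mission of exactly 7 days lands in the "1-2 weeks" bucket, because the first test is a strict `x < 7`, and a missing duration stays missing.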
The fundamental insight of the grammar of graphics is that the variables in the data can be mapped to aesthetics in the visualization. A variable in a data frame will be found in a column. An aesthetic in ggplot2 can take many forms, depending on the kinds of marks (glyphs) that you are going to make in the plot. The most common aesthetics are x and y position, size, color, fill, shape, and weight. Some less common but still useful ones are label and linetype. The ggplot2 package lets us explicitly set which variables are mapped to which aesthetics using the aes() function.
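As a small illustration of mapping columns to aesthetics, here is a sketch using a made-up data frame. The data frame `toy` and its column names are assumptions for the example, not part of the worksheet's data.

```r
library(ggplot2)

# Hypothetical data: one row per observation, one column per variable
toy <- data.frame(
  attendance = c(120, 340, 560, 800),
  joined     = c(2, 5, 9, 4),
  region     = c("north", "north", "south", "south")
)

# Three variables mapped to three aesthetics: x position, y position, color
p <- ggplot(toy, aes(x = attendance, y = joined, color = region)) +
  geom_point()

# The mapping is stored on the plot object and can be inspected
print(p$mapping)
p
```

Printing the mapping is a handy way to check which column drives which aesthetic before worrying about how the plot looks.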
The three basic parts of a call to ggplot2 are these:
1. The data set, passed to the ggplot() function as its first argument.
2. The mapping of variables to aesthetics, specified with the aes() function. The aes() function is normally passed as the second argument to ggplot() (though it can also be specified in the various geoms).
3. At least one layer that draws the glyphs, such as geom_point().
Consider this basic plot. First, let’s look at the data.
paulist_missions
## Source: local data frame [841 x 14]
##
## mission_number church city state
## (int) (chr) (chr) (chr)
## 1 1 St. Joseph's Church New York NY
## 2 2 St. Michael's Church Loretto PA
## 3 3 St. Mary's Church Hollidaysburg PA
## 4 4 Church of St. John Evangelist Johnstown PA
## 5 5 St. Peter's Church New York NY
## 6 6 St. Patrick's Cathedral New York NY
## 7 7 St. Patrick's Church Erie PA
## 8 8 St. Philip Benizi Church Cussewago PA
## 9 9 St. Vincent's Church (Benedictine) Youngstown PA
## 10 10 St. Peter's Church Saratoga NY
## .. ... ... ... ...
## Variables not shown: start_date (time), end_date (time), confessions
## (int), converts (int), order (chr), lat (dbl), long (dbl), year (dbl),
## days (dbl), duration (fctr)
Now let’s make a scatter plot.
ggplot(paulist_missions, aes(x = confessions, y = converts)) +
geom_point()
## Warning: Removed 6 rows containing missing values (geom_point).
Data set: paulist_missions
aesthetics: x = confessions, y = converts
graph layer: geom_point()
Each glyph on the plot is an individual record or row from the data set. Each row’s confessions and converts values determine its position on the x axis and the y axis, respectively.
ggplot(paulist_missions, aes(x = converts, y = confessions)) +
geom_point()
## Warning: Removed 6 rows containing missing values (geom_point).
We can specify more than two variables and aesthetics if we wish. Here we map the duration (notice: a categorical variable) to color.
ggplot(paulist_missions, aes(x = confessions, y = converts,
color = duration)) +
geom_point()
## Warning: Removed 6 rows containing missing values (geom_point).
We can also specify static properties. These can go either in the call to ggplot() if they affect the entire plot, or in a specific layer (one of the geom_*() functions) if they affect just that layer.
We might notice that our chart suffers from overplotting: the points are on top of each other and we can’t distinguish between them. Let’s try changing the shape of each point and making each point slightly transparent to see if this helps. Notice that in the code below, those properties are specified with static values outside of the aes() function.
ggplot(paulist_missions, aes(x = confessions, y = converts,
color = duration)) +
geom_point(alpha = 0.5, shape = 1)
## Warning: Removed 6 rows containing missing values (geom_point).
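The difference between a mapped aesthetic and a static property can be sketched side by side. The data frame `toy` below is made up for illustration.

```r
library(ggplot2)

# Hypothetical data for illustration only
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 5),
                  group = c("a", "a", "b", "b"))

# Mapped: color varies with the data, so a legend is drawn
mapped <- ggplot(toy, aes(x = x, y = y, color = group)) +
  geom_point()

# Static: every point gets the same fixed values, and no legend is needed
static <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(color = "blue", alpha = 0.5, shape = 1)

# Static values are kept on the layer, separately from the aesthetic mappings
print(names(static$layers[[1]]$aes_params))
```

A common slip is to write a static value inside aes(), for example aes(color = "blue"). ggplot2 then treats "blue" as a one-level categorical variable, picks its own default color, and adds a legend for it.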
Make a plot that uses the days, converts, and confessions variables. Try using the x, y, and size properties.
ggplot(paulist_missions, aes(x = days, y = confessions, color = converts)) +
geom_count()
## Warning: Removed 7 rows containing non-finite values (stat_sum).
We can change the labels of the plot using the labs() function as below. (Alternatively, you can use the xlab(), ylab(), and ggtitle() functions.)
ggplot(paulist_missions, aes(x = confessions, y = converts,
color = duration)) +
geom_point(alpha = 0.5, shape = 1) +
labs(title = "Paulist missions",
x = "Confessions (= attendance)",
y = "Converts (to Roman Catholicism)",
color = "Duration of mission")
## Warning: Removed 6 rows containing missing values (geom_point).
ggplot(paulist_missions, aes(x = days, y = confessions, color = converts)) +
geom_count() +
labs(title = "The Paulist Missions",
x = "Number of Days",
y = "Total number of confessions",
color = "Number of Converts")
## Warning: Removed 7 rows containing non-finite values (stat_sum).
So far we have only used points (with geom_point()) as the meaningful glyphs in our plot. Now we will take a tour of different kinds of glyphs that are available to us in ggplot2. Not every variable is suited to every kind of glyph, and sometimes we have to aggregate our data to make certain kinds of plots. (The data aggregation will be covered in a later worksheet.)
A histogram shows the distribution of values in a dataset by “binning” the data: in other words, it takes the domain of the data, splits it into different bins, then counts how many values fall into each bin. One bar is drawn for each bin. Here we count the kinds o
ggplot(paulist_missions, aes(x = converts)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
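To make the binning step concrete, the following sketch builds a histogram from a handful of made-up values and then inspects the bins that ggplot2 computed. The data frame `toy` is hypothetical, and the `binwidth` and `boundary` values are chosen only to keep the bin edges easy to read.

```r
library(ggplot2)

# Hypothetical values, not taken from the missions data
toy <- data.frame(value = c(1, 2, 2, 3, 7, 8, 12, 13, 14, 19))

# Bins are 5 units wide, with edges aligned to multiples of 5
p <- ggplot(toy, aes(x = value)) +
  geom_histogram(binwidth = 5, boundary = 0)

# ggplot_build() exposes the computed bins: their edges and their counts
bins <- ggplot_build(p)$data[[1]][, c("xmin", "xmax", "count")]
print(bins)
```

Every value falls into exactly one bin, so the counts add up to the number of rows in the data.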
ggplot(paulist_missions, aes(x = confessions)) +
geom_histogram(bins = 5)
## Warning: Removed 6 rows containing non-finite values (stat_bin).
(Hint: set bins = or binwidth =. See ?geom_histogram.)
Lines are good for showing trends.
ggplot(paulists_by_year, aes(x = year, y = converts)) +
geom_line()
ggplot(paulists_by_year, aes(x = year, y = confessions)) +
  geom_line() +
  # The point layer inherits the data and aesthetics from ggplot()
  geom_point()
(Hint: you will need more than one call to geom_line(). And instead of specifying the y value in the call to ggplot(), you will do it in the functions for each layer. For instance: geom_line(aes(y = converts)).)
ggplot(paulists_by_year, aes(x = year)) +
geom_line(data = paulists_by_year, aes(y = converts)) +
geom_line(data = paulists_by_year, aes(y = confessions))
(Hint: converts / confessions.)
ggplot() +
geom_line(data = paulists_by_year, aes(x = year, y = converts/confessions), color = "blue")
If you map color = to a categorical value, you will get a different colored line for each category.
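Here is a minimal sketch of that behavior. The data frame `toy` and its categories are made up; the point is only that one line is drawn per distinct value of the variable mapped to color.

```r
library(ggplot2)

# Hypothetical yearly totals for two made-up categories
toy <- data.frame(
  year  = rep(c(1850, 1860, 1870), times = 2),
  total = c(10, 30, 20, 5, 15, 40),
  kind  = rep(c("A", "B"), each = 3)
)

# Mapping color to the categorical column splits the data into one line each
p <- ggplot(toy, aes(x = year, y = total, color = kind)) +
  geom_line()

# The built plot data records which group (line) each row belongs to
built <- ggplot_build(p)$data[[1]]
print(unique(built$group))
p
```

The data need to be in "long" form for this to work: one column for the value and one column naming the category, rather than one column per category.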
Bar plots can be used in much the same way as a line plot if you specify stat = "identity". That call tells ggplot to use a y value that is present in the data.
ggplot(paulists_by_year, aes(x = year, y = converts)) +
geom_bar(stat = "identity")
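In recent versions of ggplot2, geom_col() is a shorthand for the same idea: bars whose heights come straight from a y column. The sketch below, using a made-up data frame `toy`, compares the two approaches.

```r
library(ggplot2)

# Hypothetical yearly totals for illustration
toy <- data.frame(year = c(1850, 1860, 1870), converts = c(12, 40, 25))

with_identity <- ggplot(toy, aes(x = year, y = converts)) +
  geom_bar(stat = "identity")

with_col <- ggplot(toy, aes(x = year, y = converts)) +
  geom_col()

# Both draw bars whose heights are the y values found in the data
heights_identity <- ggplot_build(with_identity)$data[[1]]$y
heights_col <- ggplot_build(with_col)$data[[1]]$y
print(heights_identity)
print(heights_col)
```

Which form to use is a matter of taste; geom_col() simply makes the intent a little more explicit.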
But bar plots are better used for counts of categorical variables. Here we count the number of missions done by the Paulists and the Redemptorists.
ggplot(paulist_missions, aes(x = order)) +
geom_bar(stat = "count")
ggplot(paulist_missions, aes(x = state)) +
geom_bar(stat = "count")
Faceting is not a geom like the examples above, but it can create a separate panel in a plot for different categories in the data. For instance, in the plot below, we have created a separate panel for each
ggplot(paulist_missions, aes(x = converts, y = confessions)) +
geom_count(shape = 1, alpha = 0.6) +
facet_wrap(~ order)
## Warning: Removed 6 rows containing non-finite values (stat_sum).
ggplot(paulist_missions, aes(x = converts, y = confessions)) +
geom_count(shape = 1, alpha = 0.6) +
facet_wrap(~ state)
## Warning: Removed 6 rows containing non-finite values (stat_sum).
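As a quick check of how faceting splits a plot, this sketch facets a made-up data frame and then lists the panels that were created. The data frame `toy` and its `category` column are hypothetical.

```r
library(ggplot2)

# Hypothetical data with three categories
toy <- data.frame(
  x = c(1, 2, 3, 1, 2, 3, 1, 2),
  y = c(3, 1, 2, 2, 2, 5, 4, 1),
  category = c("a", "a", "a", "b", "b", "b", "c", "c")
)

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ category)

# The layout table has one row per panel, i.e. one per distinct category
panels <- ggplot_build(p)$layout$layout
print(panels[, c("PANEL", "category")])
p
```

By default all panels share the same x and y scales, which makes it easy to compare categories directly.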
The plots above use geom_count(). What does it do? (Hint: ?geom_count.)
It counts the number of observations at each location and then scales the size of the point at that location in proportion to the count.
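The counting can be inspected directly. In this sketch, the made-up data frame `toy` has repeated (x, y) pairs, and the built plot data shows the count that geom_count() computed for each location.

```r
library(ggplot2)

# Hypothetical data: (1, 1) appears three times, (3, 3) twice, (2, 2) once
toy <- data.frame(x = c(1, 1, 1, 2, 3, 3), y = c(1, 1, 1, 2, 3, 3))

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_count()

# The computed variable n holds the number of observations at each location
counts <- ggplot_build(p)$data[[1]][, c("x", "y", "n")]
print(counts)
p
```

Six rows of data collapse into three points, and the point size follows n.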
There are a number of data sets available to you. You may try using early_colleges, catholic_dioceses, naval_promotions, quasi_war, sarna, us_national_population, or us_state_populations (all from the historydata package), gapminder (from the gapminder package), or europop (from the europop package).
Create three plots below, using one or more of those datasets. Your three plots should try to make some kind of historical observation. For each plot, include no more than three sentences explaining what you think the plot means. You should try to make each plot as informative as possible by using different geoms and including as many variables as is reasonable in each plot. Be sure to add good titles and labels.
You may wish to look at the R Graph Catalog to find examples of what you can do with ggplot along with sample code.
ggplot(early_colleges, aes(x = state)) +
geom_bar(stat = "count", fill="blue", colour="red") +
labs(title = "Colleges per State prior to 1848")
Explanation of plot 1. This bar plot shows the total count of colleges within each US state prior to 1848.
ggplot() +
  # alpha is a static property, so it belongs outside aes()
  geom_point(data = gapminder, aes(x = year, y = lifeExp, size = gdpPercap), alpha = 0.5)
Explanation of plot 2. This is a point graph showing the life expectancy for 142 countries from 1952 to 2007. The size of each point is proportional to the GDP per capita of that country. The graph shows that life expectancy generally increased over time and that countries with a higher GDP per capita tend to have a higher life expectancy.
ggplot(catholic_dioceses, aes(x = date, y = rite)) +
geom_count(alpha = 0.3, shape = 1, stroke = 1.5, color = "purple")
Explanation of plot 3. This final graph was a somewhat poor attempt at illustrating categorical or nominal data over time. It shows the frequency at which new Catholic dioceses were founded, according to the rite that they oversaw. You can see a high concentration for Latin beginning at roughly 1800 and continuing to the present, and then an increase in Byzantine from the 1950s to about the 1990s.