The functions in the ggplot2 package build up a graph in layers. We’ll build a complex graph by starting with a simple graph and adding additional elements, one at a time.
The example uses data from the 1985 Current Population Survey to explore the relationship between wages (wage) and experience (exper).
# load package
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# load data
data(CPS85, package = "mosaicData")
glimpse(CPS85)
## Observations: 534
## Variables: 11
## $ wage <dbl> 9.00, 5.50, 3.80, 10.50, 15.00, 9.00, 9.57, 15.00, 11.0…
## $ educ <int> 10, 12, 12, 12, 12, 16, 12, 14, 8, 12, 17, 17, 14, 14, …
## $ race <fct> W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, NW, …
## $ sex <fct> M, M, F, F, M, F, F, M, M, F, M, M, M, M, M, M, M, M, M…
## $ hispanic <fct> NH, NH, NH, NH, NH, NH, NH, NH, NH, NH, Hisp, NH, Hisp,…
## $ south <fct> NS, NS, NS, NS, NS, NS, NS, NS, NS, NS, NS, NS, NS, NS,…
## $ married <fct> Married, Married, Single, Married, Married, Married, Ma…
## $ exper <int> 27, 20, 4, 29, 40, 27, 5, 22, 42, 14, 18, 3, 4, 14, 35,…
## $ union <fct> Not, Not, Not, Not, Union, Not, Union, Not, Not, Not, N…
## $ age <int> 43, 38, 22, 47, 58, 49, 23, 42, 56, 32, 41, 26, 24, 34,…
## $ sector <fct> const, sales, sales, clerical, const, clerical, service…
The first function in building a graph is the ggplot function. It specifies the data frame containing the data to be plotted and the mapping of the variables to visual properties of the graph. The mappings are placed within the aes function (where aes stands for aesthetics).
# specify dataset and mapping
library(ggplot2)
ggplot(data = CPS85,
       mapping = aes(x = exper, y = wage))
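At this stage the plot holds only the data and the aesthetic mapping; no geometric objects have been added. A minimal sketch, assuming the plot is stored in a variable (the name p is ours, not part of the example), lets you inspect this:

```r
library(ggplot2)
data(CPS85, package = "mosaicData")

# store the plot specification instead of printing it
p <- ggplot(data = CPS85,
            mapping = aes(x = exper, y = wage))

# no geoms have been added yet, so the list of layers is empty
length(p$layers)

# the mapping records which aesthetics have variables assigned to them
names(p$mapping)
```

Printing p draws the axes for exper and wage, but nothing appears inside them until a geom is added.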
Geoms are the geometric objects (points, lines, bars, etc.) that can be placed on a graph. They are added using functions that start with geom_. In this example, we’ll add points using the geom_point function, creating a scatterplot.
In ggplot2 graphs, functions are chained together using the + sign to build a final plot.
The graph indicates that there is an outlier. One individual has a wage much higher than the rest. We’ll delete this case before continuing.
A number of parameters (options) can be specified in a geom_ function. Options for the geom_point function include color, size, and alpha. These control the point color, size, and transparency, respectively. Transparency ranges from 0 (completely transparent) to 1 (completely opaque). Adding a degree of transparency can help visualize overlapping points.
Next, let’s add a line of best fit. We can do this with the geom_smooth function. Options control the type of line (linear, quadratic, nonparametric), the thickness of the line, the line’s color, and the presence or absence of a confidence interval. Here we request a linear regression (method = lm) line (where lm stands for linear model).
# add points
ggplot(data = CPS85,
       mapping = aes(x = exper, y = wage)) +
  geom_point()
# delete outlier
library(dplyr)
plotdata <- filter(CPS85, wage < 40)
# redraw scatterplot
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage)) +
  geom_point()
# make points blue, larger, and semi-transparent
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue",
             alpha = .7,
             size = 3)
# add a line of best fit.
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue",
             alpha = .7,
             size = 3) +
  geom_smooth(method = "lm")
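Because layers are added with +, a partially built plot can be saved and extended in different directions. The sketch below is one possible illustration of the other line types mentioned above (quadratic and nonparametric); the variable names base, quad_fit, and loess_fit are ours, and subset() stands in for the filter() call so that the block depends only on ggplot2.

```r
library(ggplot2)
data(CPS85, package = "mosaicData")

# base-R equivalent of filter(CPS85, wage < 40)
plotdata <- subset(CPS85, wage < 40)

# save the scatterplot once, then reuse it
base <- ggplot(data = plotdata,
               mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue",
             alpha = .7,
             size = 3)

# quadratic fit: still a linear model, but with a polynomial formula
quad_fit <- base +
  geom_smooth(method = "lm",
              formula = y ~ poly(x, 2))

# nonparametric (loess) fit, drawn in red without a confidence band
loess_fit <- base +
  geom_smooth(method = "loess",
              formula = y ~ x,
              se = FALSE,
              color = "indianred3")
```

Printing quad_fit or loess_fit draws the corresponding graph; base itself is left unchanged and can be reused.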
In addition to mapping variables to the x and y axes, variables can be mapped to the color, shape, size, transparency, and other visual characteristics of geometric objects. This allows groups of observations to be superimposed in a single graph.
Let’s add sex to the plot and represent it by color. The color = sex option is placed in the aes function, because we are mapping a variable to an aesthetic. The geom_smooth option (se = FALSE) was added to suppress the confidence intervals.
It appears that men tend to make more money than women. Additionally, there may be a stronger relationship between experience and wages for men than for women.
# indicate sex using color
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7,
             size = 3) +
  geom_smooth(method = "lm",
              se = FALSE,
              size = 1.5)
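The difference between mapping a variable to color (inside aes) and setting color to a fixed value (inside the geom) can be seen by comparing two stored plots. This is a small sketch; the names mapped and fixed are ours, and subset() is used in place of filter() to keep the block self-contained.

```r
library(ggplot2)
data(CPS85, package = "mosaicData")
plotdata <- subset(CPS85, wage < 40)

# mapped: color depends on the value of sex, and a legend is generated
mapped <- ggplot(data = plotdata,
                 mapping = aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .7, size = 3)

# set: every point gets the same constant color, and there is no legend
fixed <- ggplot(data = plotdata,
                mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue", alpha = .7, size = 3)

# aes() stores the aesthetic under the spelling "colour"
"colour" %in% names(mapped$mapping)
"colour" %in% names(fixed$mapping)
```

A common mistake is writing color = "cornflowerblue" inside aes; ggplot2 then treats the string as a one-level variable rather than as a color name.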
Scales control how variables are mapped to the visual characteristics of the plot. Scale functions (which start with scale_) allow you to modify this mapping. In the next plot, we’ll change the x and y axis scaling, and the colors employed.
We’re getting there. The numbers on the x and y axes are better, the y axis uses dollar notation, and the colors are more attractive (IMHO).
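To see what the arguments handed to the scale functions evaluate to on their own, the following sketch builds them separately. The variable names are ours, and the named color vector is a variation on the unnamed one used in this example: it assumes the levels of sex are labeled F and M, as the glimpse output suggests.

```r
# tick positions passed to the breaks argument of each axis scale
x_breaks <- seq(0, 60, 10)   # every 10 years of experience
y_breaks <- seq(0, 30, 5)    # every 5 dollars of hourly wage

# scales::dollar converts numbers to currency strings for the axis labels
y_labels <- scales::dollar(y_breaks)
y_labels

# an unnamed vector assigns colors in the order of the factor levels;
# naming the entries ties each color to a level explicitly
# (the names "F" and "M" are assumed to match the levels of sex)
sex_colors <- c(F = "indianred3", M = "cornflowerblue")
color_scale <- ggplot2::scale_color_manual(values = sex_colors)
```

The resulting color_scale object can be added to a plot with + in the same way as any other scale.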
Here is a question. Is the relationship between experience, wages and sex the same for each job sector? Let’s repeat this graph once for each job sector in order to explore this.
# modify the x and y axes and specify the colors to be used
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7,
             size = 3) +
  geom_smooth(method = "lm",
              se = FALSE,
              size = 1.5) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue"))
Facets reproduce a graph for each level of a given variable (or combination of variables). Facets are created using functions that start with facet_. Here, facets will be defined by the eight levels of the sector variable.
It appears that the differences between men and women depend on the job sector under consideration.
# reproduce plot for each level of job sector
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~sector)
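The layout of the panels can also be controlled. The sketch below is an illustration rather than part of the original example: it confirms the number of sector levels, arranges the panels in four columns with the ncol argument of facet_wrap, and uses facet_grid to facet on a combination of two variables. The names p, wide, and crossed are ours.

```r
library(ggplot2)
data(CPS85, package = "mosaicData")
plotdata <- subset(CPS85, wage < 40)

# number of panels that facet_wrap(~sector) will produce
nlevels(plotdata$sector)

p <- ggplot(data = plotdata,
            mapping = aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# one panel per sector, arranged in four columns
wide <- p + facet_wrap(~sector, ncol = 4)

# combination of variables: rows by union membership, columns by sex
crossed <- p + facet_grid(union ~ sex)
```

Use print(wide) or print(crossed) to draw either version.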
Graphs should be easy to interpret, and informative labels are a key element in achieving this goal. The labs function provides customized labels for the axes and legends. Additionally, a custom title, subtitle, and caption can be added.
# add informative labels
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~sector) +
  labs(title = "Relationship between wages and experience",
       subtitle = "Current Population Survey",
       caption = "source: http://mosaic-web.org/",
       x = "Years of Experience",
       y = "Hourly Wage",
       color = "Gender")
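Since the same labels are often wanted on several related graphs, one option is to store the result of labs in a variable and add it wherever it is needed. This is only a sketch of that idea; the names wage_labels and scatter are ours.

```r
library(ggplot2)
data(CPS85, package = "mosaicData")
plotdata <- subset(CPS85, wage < 40)

# define the labels once
wage_labels <- labs(title = "Relationship between wages and experience",
                    x = "Years of Experience",
                    y = "Hourly Wage",
                    color = "Gender")

# add them to a plot like any other component
scatter <- ggplot(data = plotdata,
                  mapping = aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .7) +
  wage_labels

# the title is now stored with the plot
scatter$labels$title
```

The color entry in labs sets the legend title, because the legend belongs to the color aesthetic.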
Finally, we can fine-tune the appearance of the graph using themes. Theme functions (which start with theme_) control background colors, fonts, gridlines, legend placement, and other non-data related features of the graph. Let’s use a cleaner theme.
Now we have something. It appears that men earn more than women in management, manufacturing, sales, and the “other” category. They are most similar in clerical, professional, and service positions. The data contain no women in the construction sector. For management positions, wages appear to be related to experience for men, but not for women (this may be the most interesting finding). This also appears to be true for sales.
Of course, these findings are tentative. They are based on a limited sample size and do not involve statistical testing to assess whether differences may be due to chance variation.
# use a minimalist theme
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .6) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~sector) +
  labs(title = "Relationship between wages and experience",
       subtitle = "Current Population Survey",
       caption = "source: http://mosaic-web.org/",
       x = "Years of Experience",
       y = "Hourly Wage",
       color = "Gender") +
  theme_minimal()
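Individual theme elements, such as the legend placement mentioned above, can be adjusted with the theme function. The following sketch shows one possible adjustment; the particular choices (legend at the bottom, bold title) and the name themed are ours.

```r
library(ggplot2)
data(CPS85, package = "mosaicData")
plotdata <- subset(CPS85, wage < 40)

themed <- ggplot(data = plotdata,
                 mapping = aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .6) +
  labs(title = "Relationship between wages and experience") +
  theme_minimal() +
  # theme() must come after theme_minimal(); a complete theme added
  # later would overwrite these individual settings
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold"))
```

Printing themed draws the scatterplot with the legend below the plotting area.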