\(\underline{\textbf{Chapter Description}}\)
A picture paints a thousand words, which is why R ggplot2 is such a powerful tool for graphical data analysis. In this chapter, you’ll progress from simply plotting data to applying a variety of statistical methods. These include a variety of linear models, descriptive and inferential statistics (mean, standard deviation and confidence intervals) and custom functions.
library(tidyverse)
# Used to load the "Vocab" dataset
library(carData)
# library(Hmisc)
mtcars <- mtcars %>%
mutate(fcyl = factor(cyl), fam = factor(am))
# Structure of data frame
str(mtcars)
'data.frame': 32 obs. of 13 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
$ fcyl: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
$ fam : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
data("Vocab")
Vocab <- Vocab %>%
mutate(year_group = as.factor(ifelse(year < 1995,
"[1974,1995]",
"[1995,2016]")))
To practice on the remaining layers (statistics, coordinates and facets), we’ll continue working on several datasets from the first course.
The mtcars dataset contains information for 32 cars from
Motor Trends magazine from 1974. This dataset is small, intuitive, and
contains a variety of continuous and categorical (both nominal and
ordinal) variables.
In the previous course you learned how to effectively use some basic geometries, such as point, bar and line. In the first chapter of this course you’ll explore statistics associated with specific geoms, for example, smoothing and lines.
Look at the structure of mtcars.
Using mtcars, draw a scatter plot of
mpg vs. wt.
# View the structure of mtcars
str(mtcars)
'data.frame': 32 obs. of 13 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
$ fcyl: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
$ fam : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
# Using mtcars, draw a scatter plot of mpg vs. wt
mpg_wt_plot <- ggplot(data = mtcars, aes(x=wt, y=mpg)) +
geom_point()
# Show Base Plot
mpg_wt_plot
Update the plot to add a smooth trend line.
Use the default method, which uses the LOESS model to fit the curve.
# Amend the plot to add a smooth layer
mpg_wt_plot +
geom_smooth(formula = y ~ x, method = "loess")
method to "lm", and turn off the model’s 95%
confidence interval (the ribbon) by setting se to
FALSE.# Amend the plot with a linear regression smoothing; turn off standard error ribbon
mpg_wt_plot +
geom_smooth(formula = y ~ x, method = "lm", se = FALSE)
geom_smooth() for
stat_smooth().# Amend the plot. Swap geom_smooth() for stat_smooth().
mpg_wt_plot +
stat_smooth(formula = y ~ x, method = "lm", se = FALSE)
You can use either stat_smooth() or
geom_smooth() to apply a linear model.
Remember to always think about how the examples and concepts we discuss throughout the data visualization courses can be applied to your own datasets!
We’ll continue with the previous exercise by considering the
situation of looking at sub-groups in our dataset. For this we’ll
encounter the invisible group aesthetic.
mtcars has been given an extra column,
fcyl, that is the cyl column converted to a
proper factor variable.
Using mtcars, plot mpg
vs. wt, colored by cyl.
Make sure to changecyl to a factor variable
Add a point layer.
Add a smooth stat using a linear model, and don’t show the
se ribbon.
Update the plot to add a second smooth stat.
Add a dummy group aesthetic to this layer, setting
the value to 1.
Use the same method and se values as
the first stat smooth layer.
# (1) Using mtcars, plot mpg vs. wt, colored by cyl
ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
# Add a point layer
geom_point() +
# Add a smooth lin reg stat, no ribbon
stat_smooth(formula = y~x, method = "lm", se = FALSE) +
# (2) Amend the plot to add another smooth layer with dummy grouping
stat_smooth(aes(group = 1), formula = y~x, method = "lm", se = FALSE)
Notice that the color aesthetic defined an invisible group aesthetic. Defining the group aesthetic for a specific geom means we can overwrite that. Here, we use a dummy variable to calculate the smoothing model for all values.
group = 1 is the same as making cyl a factor
variable.stat_smooth (Part 1)In the previous exercise we used se = FALSE in
stat_smooth() to remove the 95% Confidence Interval. Here
we’ll consider another argument, span, used in LOESS
smoothing, and we’ll take a look at a nice scenario of properly mapping
different models.
Explore the effect of the span argument on LOESS
curves. Add three smooth LOESS stats, each without the standard error
ribbon.
Color the 1st one "red"; set its span
to 0.9.
Color the 2nd one "green"; set its span
to 0.6.
Color the 3rd one "blue"; set its span
to 0.3.
Compare LOESS and linear regression smoothing on small regions of data.
Add a smooth LOESS stat, without the standard error ribbon.
Add a smooth linear regression stat, again without the standard error ribbon.
LOESS isn’t great on very short sections of data; compare the pieces of linear regression to LOESS over the whole thing.
color to a dummy
variable, "All".ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
# Add 3 smooth LOESS stats, varying span & color
stat_smooth(formula=y~x, method = "loess", color = "red", span = 0.9, se = FALSE) +
stat_smooth(formula=y~x, method = "loess", color = "green", span = 0.6, se = FALSE) +
stat_smooth(formula=y~x, method = "loess", color = "blue", span = 0.3, se = FALSE)
# Amend the plot to color by fcyl
ggplot(mtcars, aes(x = wt, y = mpg, color = fcyl)) +
geom_point() +
# Add a smooth LOESS stat, no ribbon
stat_smooth(formula = y~x, method = "loess", se = FALSE) +
# Add a smooth lin. reg. stat, no ribbon
stat_smooth(formula = y~x, method = "lm", se = FALSE)
# Amend the plot
ggplot(mtcars, aes(x = wt, y = mpg, color = fcyl)) +
geom_point() +
# Map color to dummy variable "All"
stat_smooth(aes(color="All"), se = FALSE) +
stat_smooth(method = "lm", se = FALSE)
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
The default span for LOESS is 0.9.
stat_smooth (Part 2)In this exercise we’ll take a look at the standard error ribbons,
which show the 95% confidence interval of smoothing models.
ggplot2 and the Vocab data frame are already
loaded for you.
Vocab has been given an extra column,
year_group, splitting the dates into before and after
1995.
Using Vocab, plot vocabulary
vs. education, colored by year_group.
Use geom_jitter() to add jittered points with
transparency 0.25.
Add a smooth linear regression stat (with the standard error ribbon).
Update the smooth stat.
Map the fill color to year_group.
Set the line size to 2.
# Using Vocab, plot vocabulary vs. education, colored by year group
ggplot(data = Vocab, aes(x = education, y = vocabulary, color = year_group)) +
# Add jittered points with transparency 0.25
geom_jitter(alpha = 0.25) +
# Add a smooth lin. reg. line (with ribbon)
stat_smooth(method = "lm", se = TRUE)
# Amend the plot
ggplot(data = Vocab, aes(x = education, y = vocabulary, color = year_group)) +
geom_jitter(alpha = 0.25) +
# Map the fill color to year_group, set the line size to 2
stat_smooth(aes(fill = year_group), size = 2, method = "lm", se = TRUE)
Here, we’ll continue with the Vocab dataset and use
stat_quantile() to apply a quantile regression.
Linear regression predicts the mean response from the explanatory
variables, quantile regression predicts a quantile response (e.g. the
median) from the explanatory variables. Specific quantiles can be
specified with the quantiles argument.
Specifying many quantiles and color your models according to year can make plots too busy. We’ll explore ways of dealing with this in the next chapter.
Update the plot to add a quantile regression stat, at
quantiles 0.05, 0.5, and
0.95.
Amend the plot to color according to
year_group.
# (2) Amend the plot to color by year_group
ggplot(Vocab, aes(x = education, y = vocabulary, color = year_group)) +
geom_jitter(alpha = 0.25) +
# (1) Add a quantile stat, at 0.05, 0.5, and 0.95
stat_quantile(quantiles = c(0.05, 0.5, 0.95))
Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
stat_sumIn the Vocab dataset, education and
vocabulary are integer variables. In the first course, you
saw that this is one of the four causes of overplotting. You’d get a
single point at each intersection between the two variables.
One solution, shown in the step 1, is jittering with transparency.
Another solution is to use stat_sum(), which calculates the
total number of overlapping observations and maps that onto the
size aesthetic.
stat_sum() allows a special variable,
..prop.., to show the proportion of values within
the dataset.
Run the code to see how jittering & transparency solves overplotting.
stat_sum().Modify the size aesthetic with the appropriate scale function.
scale_size() function to set the
range from 1 to 10.Inside stat_sum(), set size to
..prop.. so circle size represents the proportion of the
whole dataset.
Update the plot to group by education, so that
circle size represents the proportion of the group
# Run this, look at the plot, then update it
ggplot(Vocab, aes(x = education, y = vocabulary)) +
# Replace this with a sum stat
stat_sum(alpha = 0.25)
ggplot(Vocab, aes(x = education, y = vocabulary)) +
stat_sum() +
# Add a size scale, from 1 to 10
scale_size(range = c(1, 10))
# Amend the stat to use proportion sizes
ggplot(Vocab, aes(x = education, y = vocabulary)) +
stat_sum(aes(size = ..prop..)) # ..prop.. shows the propotion of values within the dataet.
# Amend the plot to group by education
ggplot(Vocab, aes(x = education, y = vocabulary, group = education)) +
stat_sum(aes(size = ..prop..))
If a few data points overlap, jittering is great.
When you have lots of overlaps (particularly where continuous
data has been rounded), using stat_sum() to count the
overlaps is more useful.
In the following exercises, we’ll aim to make the plot shown in the viewer.
As before, we’ll use mtcars, where fcyl and
fam are proper factor variables of the original
cyl and am variables.
Here, we’ll establish our positions and base layer of the plot.
Establishing these items as independent objects will allow us to recycle them easily in many layers, or plots.
position_jitter() adds jittering (e.g. for
points).
position_dodge() dodges geoms, (e.g. bar,
col, boxplot, violin, errorbar, pointrange).
position_jitterdodge() jitters
and dodges geoms, (e.g. points).
Using these three functions, define these position objects:
posn_j: will jitter with a
width of 0.2.
posn_d: will dodge with a
width of 0.1.
posn_jd: will jitter and
dodge with a jitter.width of 0.2 and
a dodge.width of 0.1.
Plot wt vs. fcyl, colored by
fam. Assign this base layer to
p_wt_vs_fcyl_by_fam.
geom_point().# Define position objects
# 1. Jitter with width 0.2
posn_j <- position_jitter(width = 0.2)
# 2. Dodge with width 0.1
posn_d <- position_dodge(width = 0.1)
# 3. Jitter-dodge with jitter.width 0.2 and dodge.width 0.1
posn_jd <- position_jitterdodge(jitter.width = 0.2, dodge.width = 0.1)
# Create the plot base: wt vs. fcyl, colored by fam
p_wt_vs_fcyl_by_fam <- ggplot(data = mtcars, aes(x = fcyl, y = wt, color = fam))
# Add a point layer
p_wt_vs_fcyl_by_fam +
geom_point()
Now that the position objects have been created, you can apply them
to the base plot to see their effects. You do this by adding a point
geom and setting the position argument to the position
object.
The variables from the last exercise, posn_j,
posn_d, posn_jd, and
p_wt_vs_fcyl_by_fam are available in your workspace.
Apply the jitter position, posn_j, to the base
plot.
Apply the dodge position, posn_d, to the base
plot.
Apply the jitter-dodge position, posn_jd, to the
base plot.
# Add jittering only
p_wt_vs_fcyl_by_fam_jit <- p_wt_vs_fcyl_by_fam +
geom_point(position = posn_j)
p_wt_vs_fcyl_by_fam_jit
# Add dodging only
p_wt_vs_fcyl_by_fam +
geom_point(position = posn_d)
# Add jittering and dodging
p_wt_vs_fcyl_by_fam +
geom_point(position = posn_jd)
Although you can set position by setting the position
argument to a string (for example position = "dodge"),
defining objects promotes consistency between layers.
The preparation is done; now let’s explore
stat_summary().
Summary statistics refers to a combination of location (mean or median) and spread (standard deviation or confidence interval).
These metrics are calculated in stat_summary() by
passing a function to the fun.data argument.
mean_sdl(), calculates multiples of the standard deviation
and mean_cl_normal() calculates the t-corrected 95% CI.
Arguments to the data function are passed to
stat_summary()’s fun.args argument as a
list.
The position object, posn_d, and the plot with jittered
points, p_wt_vs_fcyl_by_fam_jit, are available.
Add error bars representing the standard deviation.
Set the data function to mean_sdl (without
parentheses).
Draw 1 standard deviation each side of the mean, pass arguments
to the mean_sdl() function by assigning them to
fun.args in the form of a list.
Use posn_d to set the position.
The default geom for stat_summary() is
"pointrange" which is already great.
"errorbar" geom by
assigning it to the geom argument.Update the plot to add a summary stat of 95% confidence limits.
Set the data function to mean_cl_normal (without
parentheses).
Again, use the dodge position.
p_wt_vs_fcyl_by_fam_jit +
# Add a summary stat of std deviation limits
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), position = posn_d)
p_wt_vs_fcyl_by_fam_jit +
# Change the geom to be an errorbar
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), position = posn_d,
geom = "errorbar")
p_wt_vs_fcyl_by_fam_jit +
# Add a summary stat of normal confidence limits
stat_summary(fun.data = mean_cl_normal, position = posn_d)