\(\underline{\textbf{Chapter Description}}\)

A picture paints a thousand words, which is why R ggplot2 is such a powerful tool for graphical data analysis. In this chapter, you’ll progress from simply plotting data to applying a variety of statistical methods. These include a variety of linear models, descriptive and inferential statistics (mean, standard deviation and confidence intervals) and custom functions.

library(tidyverse)

# Used to load the "Vocab" dataset
library(carData)

# library(Hmisc)

Data Sets Utilized

mtcars <- mtcars %>%
  mutate(fcyl = factor(cyl), fam = factor(am))

# Structure of data frame
str(mtcars)

'data.frame':   32 obs. of  13 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
 $ fcyl: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
 $ fam : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...

data("Vocab")

Vocab <- Vocab %>% 
  mutate(year_group = as.factor(ifelse(year < 1995, 
                                       "[1974,1995]", 
                                       "[1995,2016]")))

(1) Stats with Geoms

Smoothing

To practice on the remaining layers (statistics, coordinates and facets), we’ll continue working on several datasets from the first course.

The mtcars dataset contains information for 32 cars from Motor Trends magazine from 1974. This dataset is small, intuitive, and contains a variety of continuous and categorical (both nominal and ordinal) variables.

In the previous course you learned how to effectively use some basic geometries, such as point, bar and line. In the first chapter of this course you’ll explore statistics associated with specific geoms, for example, smoothing and lines.

Exercise 1

Look at the structure of mtcars.
Using mtcars, draw a scatter plot of mpg vs. wt.

# View the structure of mtcars
str(mtcars)

'data.frame':   32 obs. of  13 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
 $ fcyl: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
 $ fam : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...

# Using mtcars, draw a scatter plot of mpg vs. wt
mpg_wt_plot <- ggplot(data = mtcars, aes(x=wt, y=mpg)) +
  geom_point()

# Show Base Plot
mpg_wt_plot

Exercise 2

Update the plot to add a smooth trend line.
Use the default method, which uses the LOESS model to fit the curve.

# Amend the plot to add a smooth layer
mpg_wt_plot + 
  geom_smooth(formula = y ~ x, method = "loess")

Exercise 3

Update the smooth layer. Apply a linear model by setting method to "lm", and turn off the model’s 95% confidence interval (the ribbon) by setting se to FALSE.

# Amend the plot with a linear regression smoothing; turn off standard error ribbon
mpg_wt_plot +
  geom_smooth(formula = y ~ x, method = "lm", se = FALSE)

Exercise 4

Draw the same plot again, swapping geom_smooth() for stat_smooth().

# Amend the plot. Swap geom_smooth() for stat_smooth().
mpg_wt_plot +
  stat_smooth(formula = y ~ x, method = "lm", se = FALSE)

Concluding Remarks

You can use either stat_smooth() or geom_smooth() to apply a linear model.
Remember to always think about how the examples and concepts we discuss throughout the data visualization courses can be applied to your own datasets!

Grouping Variables

We’ll continue with the previous exercise by considering the situation of looking at sub-groups in our dataset. For this we’ll encounter the invisible group aesthetic.

mtcars has been given an extra column, fcyl, that is the cyl column converted to a proper factor variable.

Instructions

Using mtcars, plot mpg vs. wt, colored by cyl.
- Make sure to changecyl to a factor variable
- Add a point layer.
- Add a smooth stat using a linear model, and don’t show the se ribbon.
Update the plot to add a second smooth stat.
- Add a dummy group aesthetic to this layer, setting the value to 1.
- Use the same method and se values as the first stat smooth layer.

# (1) Using mtcars, plot mpg vs. wt, colored by cyl
ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +

  # Add a point layer
  geom_point() +
  
  # Add a smooth lin reg stat, no ribbon
  stat_smooth(formula = y~x, method = "lm", se = FALSE) +

  # (2) Amend the plot to add another smooth layer with dummy grouping
  stat_smooth(aes(group = 1), formula = y~x, method = "lm", se = FALSE)

Concluding Remarks

Notice that the color aesthetic defined an invisible group aesthetic. Defining the group aesthetic for a specific geom means we can overwrite that. Here, we use a dummy variable to calculate the smoothing model for all values.

My Remarks

Play around with the group aesthetic. It looks like setting group = 1 is the same as making cyl a factor variable.

Modifying `stat_smooth` (Part 1)

In the previous exercise we used se = FALSE in stat_smooth() to remove the 95% Confidence Interval. Here we’ll consider another argument, span, used in LOESS smoothing, and we’ll take a look at a nice scenario of properly mapping different models.

Instructions

Explore the effect of the span argument on LOESS curves. Add three smooth LOESS stats, each without the standard error ribbon.
- Color the 1st one "red"; set its span to 0.9.
- Color the 2nd one "green"; set its span to 0.6.
- Color the 3rd one "blue"; set its span to 0.3.
Compare LOESS and linear regression smoothing on small regions of data.
- Add a smooth LOESS stat, without the standard error ribbon.
- Add a smooth linear regression stat, again without the standard error ribbon.
LOESS isn’t great on very short sections of data; compare the pieces of linear regression to LOESS over the whole thing.
- Amend the smooth LOESS stat to map color to a dummy variable, "All".

Exercise 1

ggplot(mtcars, aes(x = wt, y = mpg)) +
  
  geom_point() +
  
  # Add 3 smooth LOESS stats, varying span & color
  stat_smooth(formula=y~x, method = "loess", color = "red", span = 0.9, se = FALSE) +
  stat_smooth(formula=y~x, method = "loess", color = "green", span = 0.6, se = FALSE) +
  stat_smooth(formula=y~x, method = "loess", color = "blue", span = 0.3, se = FALSE)

Exercise 2

# Amend the plot to color by fcyl
ggplot(mtcars, aes(x = wt, y = mpg, color = fcyl)) +
  geom_point() +
  
  # Add a smooth LOESS stat, no ribbon
  stat_smooth(formula = y~x, method = "loess", se = FALSE) +
  
  # Add a smooth lin. reg. stat, no ribbon
  stat_smooth(formula = y~x, method = "lm", se = FALSE)

Exercise 3

# Amend the plot
ggplot(mtcars, aes(x = wt, y = mpg, color = fcyl)) +
  geom_point() +
  
  # Map color to dummy variable "All"
  stat_smooth(aes(color="All"), se = FALSE) +
  stat_smooth(method = "lm", se = FALSE)

`geom_smooth()` using method = 'loess' and formula 'y ~ x'

`geom_smooth()` using formula 'y ~ x'

Concluding Remarks

The default span for LOESS is 0.9.

A lower span will result in a better fit with more detail; but don’t overdo it or you’ll end up over-fitting!

Modifying `stat_smooth` (Part 2)

In this exercise we’ll take a look at the standard error ribbons, which show the 95% confidence interval of smoothing models. ggplot2 and the Vocab data frame are already loaded for you.

Vocab has been given an extra column, year_group, splitting the dates into before and after 1995.

Instructions

Using Vocab, plot vocabulary vs. education, colored by year_group.
- Use geom_jitter() to add jittered points with transparency 0.25.
- Add a smooth linear regression stat (with the standard error ribbon).

It’s easier to read the plot if the standard error ribbons match the lines, and the lines have more emphasis.

Update the smooth stat.
- Map the fill color to year_group.
- Set the line size to 2.

Exercise 1

# Using Vocab, plot vocabulary vs. education, colored by year group
ggplot(data = Vocab, aes(x = education, y = vocabulary, color = year_group)) +
  
  # Add jittered points with transparency 0.25
  geom_jitter(alpha = 0.25) +

  # Add a smooth lin. reg. line (with ribbon)
  stat_smooth(method = "lm", se = TRUE)

Exercise 2

# Amend the plot
ggplot(data = Vocab, aes(x = education, y = vocabulary, color = year_group)) +
  geom_jitter(alpha = 0.25) +

  # Map the fill color to year_group, set the line size to 2
  stat_smooth(aes(fill = year_group), size = 2, method = "lm", se = TRUE)

Concluding Remarks

Notice that since 1995, education has relatively smaller effect on increasing vocabulary.

(2) Stats: Sum and Quantile

Quantiles

Here, we’ll continue with the Vocab dataset and use stat_quantile() to apply a quantile regression.

Linear regression predicts the mean response from the explanatory variables, quantile regression predicts a quantile response (e.g. the median) from the explanatory variables. Specific quantiles can be specified with the quantiles argument.

Specifying many quantiles and color your models according to year can make plots too busy. We’ll explore ways of dealing with this in the next chapter.

Instructions

Update the plot to add a quantile regression stat, at quantiles 0.05, 0.5, and 0.95.
Amend the plot to color according to year_group.

Exercises 1-2

# (2) Amend the plot to color by year_group
ggplot(Vocab, aes(x = education, y = vocabulary, color = year_group)) +
  geom_jitter(alpha = 0.25) +
  
  # (1) Add a quantile stat, at 0.05, 0.5, and 0.95
  stat_quantile(quantiles = c(0.05, 0.5, 0.95))

Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique

Concluding Remarks

Quantile regression is a great tool for getting a more detailed overview of a large dataset.

Using `stat_sum`

In the Vocab dataset, education and vocabulary are integer variables. In the first course, you saw that this is one of the four causes of overplotting. You’d get a single point at each intersection between the two variables.

One solution, shown in the step 1, is jittering with transparency. Another solution is to use stat_sum(), which calculates the total number of overlapping observations and maps that onto the size aesthetic.

stat_sum() allows a special variable, ..prop.., to show the proportion of values within the dataset.

Instructions

Run the code to see how jittering & transparency solves overplotting.
- Replace the jittered points with a sum stat, using stat_sum().
Modify the size aesthetic with the appropriate scale function.
- Add a scale_size() function to set the range from 1 to 10.
Inside stat_sum(), set size to ..prop.. so circle size represents the proportion of the whole dataset.
Update the plot to group by education, so that circle size represents the proportion of the group

Exercise 1

# Run this, look at the plot, then update it
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  
  # Replace this with a sum stat
  stat_sum(alpha = 0.25)

Exercise 2

ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_sum() +
  
  # Add a size scale, from 1 to 10
  scale_size(range = c(1, 10))

Exercise 3

# Amend the stat to use proportion sizes
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_sum(aes(size = ..prop..)) # ..prop.. shows the propotion of values within the dataet.

Exercise 4

# Amend the plot to group by education
ggplot(Vocab, aes(x = education, y = vocabulary, group = education)) +
  stat_sum(aes(size = ..prop..))

Concluding Remarks

If a few data points overlap, jittering is great.
When you have lots of overlaps (particularly where continuous data has been rounded), using stat_sum() to count the overlaps is more useful.

(3) Stats Outside Geoms

In the following exercises, we’ll aim to make the plot shown in the viewer.

As before, we’ll use mtcars, where fcyl and fam are proper factor variables of the original cyl and am variables.

Preparations

Here, we’ll establish our positions and base layer of the plot.

Establishing these items as independent objects will allow us to recycle them easily in many layers, or plots.

position_jitter() adds jittering (e.g. for points).
position_dodge() dodges geoms, (e.g. bar, col, boxplot, violin, errorbar, pointrange).
position_jitterdodge() jitters and dodges geoms, (e.g. points).

Instructions

Using these three functions, define these position objects:
- posn_j: will jitter with a width of 0.2.
- posn_d: will dodge with a width of 0.1.
- posn_jd: will jitter and dodge with a jitter.width of 0.2 and a dodge.width of 0.1.
Plot wt vs. fcyl, colored by fam. Assign this base layer to p_wt_vs_fcyl_by_fam.
- Plot the data using geom_point().

Exercise 1

# Define position objects
# 1. Jitter with width 0.2
posn_j <- position_jitter(width = 0.2)

# 2. Dodge with width 0.1
posn_d <- position_dodge(width = 0.1)

# 3. Jitter-dodge with jitter.width 0.2 and dodge.width 0.1
posn_jd <- position_jitterdodge(jitter.width = 0.2, dodge.width = 0.1)

Exercise 2

# Create the plot base: wt vs. fcyl, colored by fam
p_wt_vs_fcyl_by_fam <- ggplot(data = mtcars, aes(x = fcyl, y = wt, color = fam))

# Add a point layer
p_wt_vs_fcyl_by_fam +
  geom_point()

Concluding Remarks

The default positioning of the points is highly susceptible to overplotting.

Using Position Objects

Now that the position objects have been created, you can apply them to the base plot to see their effects. You do this by adding a point geom and setting the position argument to the position object.

The variables from the last exercise, posn_j, posn_d, posn_jd, and p_wt_vs_fcyl_by_fam are available in your workspace.

Instructions

Apply the jitter position, posn_j, to the base plot.
Apply the dodge position, posn_d, to the base plot.
Apply the jitter-dodge position, posn_jd, to the base plot.

Exercise 1

# Add jittering only
p_wt_vs_fcyl_by_fam_jit <- p_wt_vs_fcyl_by_fam +
  geom_point(position = posn_j)
p_wt_vs_fcyl_by_fam_jit

Exercise 2

# Add dodging only
p_wt_vs_fcyl_by_fam +
  geom_point(position = posn_d)

Exercise 3

# Add jittering and dodging
p_wt_vs_fcyl_by_fam +
  geom_point(position = posn_jd)

Concluding Remarks

Although you can set position by setting the position argument to a string (for example position = "dodge"), defining objects promotes consistency between layers.

Plotting Variations

The preparation is done; now let’s explore stat_summary().

Summary statistics refers to a combination of location (mean or median) and spread (standard deviation or confidence interval).

These metrics are calculated in stat_summary() by passing a function to the fun.data argument. mean_sdl(), calculates multiples of the standard deviation and mean_cl_normal() calculates the t-corrected 95% CI.

Arguments to the data function are passed to stat_summary()’s fun.args argument as a list.

The position object, posn_d, and the plot with jittered points, p_wt_vs_fcyl_by_fam_jit, are available.

Instructions

Add error bars representing the standard deviation.
- Set the data function to mean_sdl (without parentheses).
- Draw 1 standard deviation each side of the mean, pass arguments to the mean_sdl() function by assigning them to fun.args in the form of a list.
- Use posn_d to set the position.
The default geom for stat_summary() is "pointrange" which is already great.
- Update the summary stat to use an "errorbar" geom by assigning it to the geom argument.
Update the plot to add a summary stat of 95% confidence limits.
- Set the data function to mean_cl_normal (without parentheses).
- Again, use the dodge position.

Exercise 1

p_wt_vs_fcyl_by_fam_jit +
  # Add a summary stat of std deviation limits
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), position = posn_d)

Exercise 2

p_wt_vs_fcyl_by_fam_jit +
  
  # Change the geom to be an errorbar
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), position = posn_d,
               geom = "errorbar")

Exercise 3

p_wt_vs_fcyl_by_fam_jit +
  # Add a summary stat of normal confidence limits
  stat_summary(fun.data = mean_cl_normal, position = posn_d)

Concluding Remarks

You can always assign your own function to the fun.data argument as long as the result is a data frame and the variable names match the aesthetics that you will need for the geom layer.

Intermediate Data Visualization w/ ggplot2

Chapter 1: Statistics

Eduardo Martinez

Data Sets Utilized

(1) Stats with Geoms

Smoothing

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Concluding Remarks

Grouping Variables

Instructions

Concluding Remarks

My Remarks

Modifying stat_smooth (Part 1)

Instructions

Exercise 1

Exercise 2

Exercise 3

Concluding Remarks

Modifying stat_smooth (Part 2)

Instructions

Exercise 1

Exercise 2

Concluding Remarks

(2) Stats: Sum and Quantile

Quantiles

Instructions

Exercises 1-2

Concluding Remarks

Using stat_sum

Instructions

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Concluding Remarks

(3) Stats Outside Geoms

Preparations

Instructions

Exercise 1

Exercise 2

Concluding Remarks

Using Position Objects

Instructions

Exercise 1

Exercise 2

Exercise 3

Concluding Remarks

Plotting Variations

Instructions

Exercise 1

Exercise 2

Exercise 3

Concluding Remarks

Modifying `stat_smooth` (Part 1)

Modifying `stat_smooth` (Part 2)

Using `stat_sum`