Vocab

Learning objectives

Introduction

ggplot2 is one of the most well-known and widely used R packages. It is arguably the most powerful and flexible package for creating plots in R.

However, ggplot2 has a very steep learning curve. ggpubr is a package that creates “wrappers” for much of ggplot2’s core functionality. A wrapper is a function that runs another function for you, often making it easier to use by setting sensible defaults on your behalf.
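To make the idea of a wrapper concrete, here is a minimal sketch in base R. The function name mean_no_na is made up for this illustration and is not part of ggpubr; it simply wraps mean() and picks a default for you.

```r
# A hypothetical wrapper around base R's mean().
# It runs mean() for you and sets na.rm = TRUE by default.
mean_no_na <- function(x, na.rm = TRUE) {
  mean(x, na.rm = na.rm)
}

values <- c(2, 4, NA, 6)

mean(values)        # NA, because one value is missing
mean_no_na(values)  # 4, because the wrapper drops the NA by default
```

ggpubr does the same kind of thing on a larger scale: its functions call ggplot2 for you and choose many plot settings in advance.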

ggpubr is really cool, but unfortunately its syntax is different from both base R plotting and ggplot2. I will provide you with ggpubr code whenever we need it. Feel free to experiment, but expect to tinker with the code or read the help files to get what you want.

This Software Checkpoint walks you through getting set up to use these packages and making a few basic plots.

Preliminaries

Download packages

Only do this once, then comment these lines out of the script.

# install.packages("ggplot2")
# install.packages("ggpubr")
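If you would rather not comment lines in and out, one possible alternative is to install a package only when it is missing. This is a sketch, not a required step; the CRAN mirror URL below is an assumption, and any mirror you normally use would do.

```r
# Install each package only if it is not already available.
# The repos URL is an assumed CRAN mirror; swap in your own if you prefer.
for (pkg in c("ggplot2", "ggpubr")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = "https://cloud.r-project.org")
  }
}
```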

Load the libraries

library(ggplot2)
library(ggpubr)

Load the msleep data

The mammals dataset is a classic dataset in the MASS package. msleep, which comes with ggplot2, is an updated version of the data that includes more numeric data (e.g. hours of sleep) and categorical data (e.g. the conservation status of a species).

data(msleep)

Make a basic ggpubr plot

Let’s make a boxplot in ggpubr. ggpubr has handy functions like ggboxplot, gghistogram, and ggscatter.

ggpubr syntax

ggpubr does NOT use formula notation like base R functions do. You have to explicitly define a y and an x variable. Additionally, the variable names MUST be in quotes.
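For comparison, this is roughly what the formula notation looks like in base R. It is only a sketch to show the contrast: the variable names are not quoted and y and x are joined by a tilde.

```r
library(ggplot2)  # loaded here only because it provides the msleep data

# Base R: formula notation (y ~ x), variable names are NOT in quotes
boxplot(sleep_total ~ vore, data = msleep)
```

In ggpubr the same two variables are passed as separate, quoted arguments.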

Let’s make a boxplot of the amount of sleep an organism gets (sleep_total), grouped by what it eats (vore).

Note: Ignore any warnings; these are due to NAs (missing values) in the data.

ggboxplot(y = "sleep_total",
          x = "vore",
          data = msleep)

Color-coding data

The x-axis shows “vore” and each group is labeled. A general principle of data visualization is that it’s good to vary color, size, and shape where you can, even if it’s redundant with the axis labels. This helps reinforce the groupings or patterns in the data.

Change the color of the lines of the boxes:

ggboxplot(y = "sleep_total",
          x = "vore",
          color = "vore",
          data = msleep)

Change the fill inside the boxes:

ggboxplot(y = "sleep_total",
          x = "vore",
          fill = "vore",
          data = msleep)

Scatter plots in ggpubr

Ignore any warnings; they are due to NAs in the data.

ggscatter(y = "sleep_rem",
          x = "sleep_total",
          data = msleep)
## Warning: Removed 22 rows containing missing values (geom_point).
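If you are curious where the number in the warning comes from, a quick check like the one below counts the rows that are missing a value in either plotted variable. This is an optional sketch; the name n_dropped is made up for the illustration.

```r
library(ggplot2)  # provides the msleep data

# Missing values in each of the two plotted variables
sum(is.na(msleep$sleep_total))
sum(is.na(msleep$sleep_rem))

# Rows where at least one of the two variables is missing;
# this should match the count reported in the warning
n_dropped <- sum(!complete.cases(msleep[, c("sleep_total", "sleep_rem")]))
n_dropped
```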

Coloring by a categorical variable

Note that almost everything goes in quotes:

y = "sleep_rem", x = "sleep_total", color = "vore",

Ignore any warnings.

ggscatter(y = "sleep_rem",
          x = "sleep_total",
          color = "vore",
          data = msleep)
## Warning: Removed 22 rows containing missing values (geom_point).

Coloring by a continuous numeric variable

A third dimension can be added by color-coding the scatterplot.

Note that almost everything goes in quotes:

y = "sleep_rem", x = "sleep_total", color = "sleep_cycle",

Ignore any warnings.

ggscatter(y = "sleep_rem",
          x = "sleep_total",
          color = "sleep_cycle",
          data = msleep)
## Warning: Removed 22 rows containing missing values (geom_point).

Adding a line of best fit

The line has the general form

y = m*x + b

Which stats folks write as

y = B0 + B1*x

Here B0 is the intercept (b) and B1 is the slope (m). The vertical distance from each data point to the line is called the residual.

Ignore any warnings.

ggscatter(y = "sleep_rem",
          x = "sleep_total",
          add = "reg.line",  # line of best fit
          data = msleep)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 22 rows containing non-finite values (stat_smooth).
## Warning: Removed 22 rows containing missing values (geom_point).
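To see the numbers behind a line of best fit, you can fit the same kind of straight-line model yourself with lm(). This is a sketch that assumes the plotted regression line is an ordinary least-squares fit of sleep_rem on sleep_total; the object name fit is arbitrary.

```r
library(ggplot2)  # provides the msleep data

# Fit y = B0 + B1*x; rows with NAs are dropped automatically
fit <- lm(sleep_rem ~ sleep_total, data = msleep)

coef(fit)             # B0 (intercept) and B1 (slope)
head(residuals(fit))  # observed y minus the y predicted by the line
```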

Adding a data ellipse

ggscatter(y = "sleep_rem",
          x = "sleep_total",
          ellipse = TRUE,   # data ellipse
          data = msleep)
## Warning: Removed 22 rows containing non-finite values (stat_ellipse).
## Warning: Removed 22 rows containing missing values (geom_point).

Add a correlation coefficient

Adding cor.coef = TRUE prints the correlation coefficient on the plot, as well as a p-value for the significance of the correlation coefficient (testing the null hypothesis that it is 0).

ggscatter(y = "sleep_rem",
          x = "sleep_total",
          cor.coef = TRUE,
          data = msleep)
## Warning: Removed 22 rows containing non-finite values (stat_cor).
## Warning: Removed 22 rows containing missing values (geom_point).

TASK

Make a scatter plot with the elements indicated below and upload the image (not the code) to the assignment found here: https://canvas.pitt.edu/courses/45284/assignments/460717

The figure should have these elements:

If this doesn’t work, check that everything is in quotes and that there is a comma at the end of each argument line (except the last one). If it still doesn’t work, email your code to your UTA and CC me, and/or come to office hours.

# put your code below
ggscatter(y = "sleep_cycle",
          x = "sleep_total",
          add = "reg.line",  # line of best fit
          ellipse = TRUE,   # data ellipse
          cor.coef = TRUE,
          data = msleep)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 51 rows containing non-finite values (stat_smooth).
## Warning: Removed 51 rows containing non-finite values (stat_ellipse).
## Warning: Removed 51 rows containing non-finite values (stat_cor).
## Warning: Removed 51 rows containing missing values (geom_point).
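One way to get an image file to upload is to store the plot in an object and save it with ggsave() from ggplot2. This is only a sketch: the file name sleep_scatter.png and the width and height are arbitrary choices, and exporting from the RStudio Plots pane works just as well.

```r
library(ggplot2)
library(ggpubr)

# Store the plot in an object instead of only printing it
p <- ggscatter(y = "sleep_cycle",
               x = "sleep_total",
               add = "reg.line",
               ellipse = TRUE,
               cor.coef = TRUE,
               data = msleep)

# Save to the working directory (file name and size are arbitrary)
ggsave("sleep_scatter.png", plot = p, width = 6, height = 4)
```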