ggplot2 is one of the most well-known and widely used R packages. It is arguably the most powerful and flexible package for creating plots in R.
ggplot2 has a very steep learning curve. ggpubr is a package that creates “wrappers” for much of ggplot2’s core functionality. A wrapper is a function that runs another function for you, often making its use easier while also making certain decisions for you by setting certain defaults.
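To make the idea of a wrapper concrete, here is a minimal sketch. The function name simple_boxplot and the defaults it picks are hypothetical, made up for illustration; this is not how ggpubr is actually implemented. It uses msleep, a dataset that ships with ggplot2.

```r
library(ggplot2)

# Hypothetical wrapper (not part of ggplot2 or ggpubr): it calls ggplot2
# for you and takes the column names as quoted strings.
simple_boxplot <- function(data, x, y) {
  ggplot(data, aes(x = .data[[x]], y = .data[[y]])) +
    geom_boxplot() +
    theme_classic()  # a default the wrapper chooses for you
}

# One short call instead of writing out several ggplot2 layers
p <- simple_boxplot(msleep, x = "vore", y = "sleep_total")
p
```

The trade-off is the usual one for wrappers: less typing and fewer decisions, at the cost of less control over the details.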
ggpubr is really cool, but unfortunately its syntax is different from both base R plotting and ggplot2. I will provide ggpubr code whenever we need it. Feel free to experiment, but expect to tinker with the code or read the help files when you do.
This Software Checkpoint walks you through getting set up to use these packages.
Install the packages only once, then comment the install.packages() lines out of the script.
# install.packages("ggplot2")
# install.packages("ggpubr")
library(ggplot2)
library(ggpubr)
The mammals dataset is a classic dataset in the MASS package. msleep, which ships with ggplot2, is an updated version of the data that includes more numeric data (e.g. hours of sleep) and categorical data (e.g. whether a species is endangered).
data(msleep)
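Before plotting, it can help to take a quick look at what is in the data. The following is a small sketch using only base R functions on msleep; nothing here is required for the plots that follow.

```r
library(ggplot2)  # msleep ships with ggplot2

data(msleep)

dim(msleep)    # number of rows (species) and columns (variables)
names(msleep)  # column names, e.g. "vore", "sleep_total", "sleep_rem"

# Which columns are numeric and which are categorical (character)?
sapply(msleep, class)
```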
Let’s make a boxplot in ggpubr. ggpubr has handy functions like ggboxplot, gghistogram, and ggscatter.
ggpubr does NOT use formula notation like base R functions. You have to explicitly define a y and an x variable. Additionally, the variables MUST be in quotes.
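To see the difference in syntax side by side, here is a short sketch that draws the same kind of plot both ways. The choice of the conservation column is just for illustration.

```r
library(ggplot2)  # provides the msleep data
library(ggpubr)

# Base R: formula notation (y ~ x), column names are NOT quoted
boxplot(sleep_total ~ conservation, data = msleep)

# ggpubr: x and y are named explicitly, and the column names ARE quoted
p <- ggboxplot(data = msleep,
               x = "conservation",
               y = "sleep_total")
p
```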
Let’s make a boxplot of the amount of sleep an organism gets (sleep_total) and what it eats (vore).
Note: Ignore any warnings about removed rows; these are due to NAs in the data.
ggboxplot(y = "sleep_total",
x = "vore",
data = msleep)
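If you are curious where the NAs are, you can count them per column. This is a minimal sketch using base R; the vector name cols is just a convenience introduced here.

```r
library(ggplot2)  # provides the msleep data

# Columns used in this checkpoint
cols <- c("vore", "sleep_total", "sleep_rem", "sleep_cycle")

# is.na() marks each missing value as TRUE; colSums() counts them per column
na_counts <- colSums(is.na(msleep[, cols]))
na_counts
```

When a plotting function reports that it removed some number of rows, that number should match the NAs in the columns being plotted.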
The x-axis shows “vore” and is already labeled. A general principle of data visualization is that it’s always good to vary color, size, and shape whenever possible, even if it’s redundant. This helps reinforce the groupings or patterns in the data.
Change the color of the lines of the boxes:
ggboxplot(y = "sleep_total",
x = "vore",
color = "vore",
data = msleep)
Change the fill inside the boxes:
ggboxplot(y = "sleep_total",
x = "vore",
fill = "vore",
data = msleep)
Ignore any warnings about removed rows; they come from NAs in the data.
ggscatter(y = "sleep_rem",
x = "sleep_total",
data = msleep)
## Warning: Removed 22 rows containing missing values (geom_point).
Note that almost everything goes in quotes:
y = "sleep_rem", x = "sleep_total", color = "vore"
Ignore any warnings.
ggscatter(y = "sleep_rem",
x = "sleep_total",
color = "vore",
data = msleep)
## Warning: Removed 22 rows containing missing values (geom_point).
A third dimension can be added by color-coding the scatterplot.
Note that almost everything goes in quotes:
y = "sleep_rem", x = "sleep_total", color = "sleep_cycle"
Ignore any warnings.
ggscatter(y = "sleep_rem",
x = "sleep_total",
color = "sleep_cycle",
data = msleep)
## Warning: Removed 22 rows containing missing values (geom_point).
A line of best fit has the general form
y = m*x + b
which stats folks write as
y = B0 + B1*x
where B0 is the intercept (b) and B1 is the slope (m).
The vertical distance from each data point to the line is called the residual.
Ignore any warnings.
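To see B0, B1, and the residuals as actual numbers, here is a minimal sketch using base R's lm(). It assumes the line of best fit is an ordinary least-squares fit of sleep_rem on sleep_total; the names fit and res are introduced here for illustration.

```r
library(ggplot2)  # provides the msleep data

# Fit sleep_rem = B0 + B1 * sleep_total; lm() drops rows with NAs by default
fit <- lm(sleep_rem ~ sleep_total, data = msleep)

coef(fit)  # B0 is labeled "(Intercept)", B1 is labeled "sleep_total"

# Residual = observed sleep_rem minus the value the line gives for that species
res <- residuals(fit)
head(res)
```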
ggscatter(y = "sleep_rem",
x = "sleep_total",
add = "reg.line", # line of best fit
data = msleep)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 22 rows containing non-finite values (stat_smooth).
## Warning: Removed 22 rows containing missing values (geom_point).
ggscatter(y = "sleep_rem",
x = "sleep_total",
ellipse = TRUE, # data ellipse
data = msleep)
## Warning: Removed 22 rows containing non-finite values (stat_ellipse).
## Warning: Removed 22 rows containing missing values (geom_point).
Adding cor.coef = TRUE displays the correlation coefficient, as well as a p-value for the significance of the correlation coefficient (testing the hypothesis that it is 0).
ggscatter(y = "sleep_rem",
x = "sleep_total",
cor.coef = TRUE,
data = msleep)
## Warning: Removed 22 rows containing non-finite values (stat_cor).
## Warning: Removed 22 rows containing missing values (geom_point).
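If you want the same two numbers outside of a plot, base R's cor.test() reports them. This is a sketch that assumes the default Pearson correlation; the values printed on the plot may be rounded, and the name ct is introduced here for illustration.

```r
library(ggplot2)  # provides the msleep data

# cor.test() uses only the rows where both variables are present
ct <- cor.test(msleep$sleep_total, msleep$sleep_rem)

ct$estimate  # correlation coefficient
ct$p.value   # p-value for the null hypothesis that the correlation is 0
```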
Make a scatter plot with the elements indicated below and upload the image (not the code) to the assignment found here: https://canvas.pitt.edu/courses/45284/assignments/460717
The figure should have these elements:
If this doesn’t work, check that everything is in quotes and that there is a comma at the end of each line. If it still doesn’t work, email your code to your UTA and CC me, and/or come to office hours.
# put your code below
ggscatter(y = "sleep_cycle",
x = "sleep_total",
add = "reg.line", # line of best fit
ellipse = TRUE, # data ellipse
cor.coef = TRUE,
data = msleep)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 51 rows containing non-finite values (stat_smooth).
## Warning: Removed 51 rows containing non-finite values (stat_ellipse).
## Warning: Removed 51 rows containing non-finite values (stat_cor).
## Warning: Removed 51 rows containing missing values (geom_point).