If you haven’t already, please read the Introduction to this
set of lab activities, which includes instructions on where to find the
.csv files referenced in each lab.
ggplot2
This section assumes that you’ve read this section from the ggplot2 book (Wickham et al. 2024) and explored some of the R Graph Gallery.
For Lab 2, install the viridis package now for some
additional color options in ggplot. You also need the
tidyverse package.
Go ahead and run
library(tidyverse)
library(viridis)
to load the tidyverse and viridis packages.
Before we jump back into plotting, we’re going to get a little more
practice tidying data! Remember,
The dataset we’ll start with today is the built-in CO2 data, as used
and manipulated in a book on machine learning (Rhys 2020). Since this is part of base
R, you just need to load it using data() as below, and then
we’re going to create a tibble out of it.
data("CO2")
co2tib <- as_tibble(CO2) #I just used lower case to name it to save keyboard strokes
co2tib
## # A tibble: 84 × 5
## Plant Type Treatment conc uptake
## <ord> <fct> <fct> <dbl> <dbl>
## 1 Qn1 Quebec nonchilled 95 16
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
## 7 Qn1 Quebec nonchilled 1000 39.7
## 8 Qn2 Quebec nonchilled 95 13.6
## 9 Qn2 Quebec nonchilled 175 27.3
## 10 Qn2 Quebec nonchilled 250 37.1
## # ℹ 74 more rows
Any time you’re not familiar with a built-in dataset, you can run
help() to get more information about it.
Just for practice, look for the R Documentation on the CO2 dataset
using help("CO2"). Do the specifications of the co2tib you
just created match the variables listed under
Format?
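One way to make this comparison, sketched here, is to print the column names and classes of the tibble and read them next to the help page. The object name col_classes is just a label chosen for this sketch.

```r
library(dplyr)  # attached automatically if you ran library(tidyverse)

data("CO2")
co2tib <- as_tibble(CO2)

# Compact overview: one line per column, with its type and first few values
glimpse(co2tib)

# First class of each column, to compare against the Format section of help("CO2")
col_classes <- sapply(co2tib, function(col) class(col)[1])
col_classes
```

The number of rows and columns reported by glimpse() can be checked against the help page in the same way.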
In Lab 1 you used select() and filter()
with a tibble. Now let’s do a little more with some of the other
functions from dplyr. First, let’s say we only want to
focus on the uptake rates in an analysis, and in addition only for
plants whose uptake rate was > 16.
Create a new tibble by selecting columns 1,2,3, and 5, and filtering
based on uptake > 16. If you need help coding this, 1) check the end
of Lab 1 and/or 2) use help(filter) or
help(select). HINT, the resulting tibble should be 66 rows
x 4 columns.
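If you get stuck, here is one possible way to write it (a sketch, not the only correct answer). It nests select() inside filter() and then checks the dimensions against the hint.

```r
library(dplyr)  # attached automatically if you ran library(tidyverse)

data("CO2")
co2tib <- as_tibble(CO2)

# Keep columns 1, 2, 3 and 5 (this drops conc), then keep rows with uptake above 16
co2tib2 <- filter(select(co2tib, 1:3, 5), uptake > 16)

dim(co2tib2)  # the hint says this should be 66 rows by 4 columns
```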
Next, we would like to group by individual plants and summarize the
data to get the mean and standard deviation of uptake within each group.
To be clear, we can do these calculations in base R. However, if we
want to keep the output for later use (or save all of the calculations
as their own dataset), we can achieve this efficiently in
dplyr using the group_by() and
summarize() functions, respectively. Run this code:
co2tib2.grp <- group_by(co2tib2, Plant) #first the group
co2tib2.summ <- summarize(co2tib2.grp, #then the summary stats
meanUp = mean(uptake), sdUp = sd(uptake))
(NOTE, my code uses the name ‘co2tib2’ for the tibble created in #2 above, but you can name it anything you want.)
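Because the tibble was filtered before grouping, the plants no longer all have the same number of rows. As a small sketch (the name co2tib2.counts is made up here), adding n() to the summary shows how many observations each mean and standard deviation is based on. Keep in mind that sd() returns NA for a group with a single row, and a plant with no rows left after filtering does not appear in the summary at all.

```r
library(dplyr)  # attached automatically if you ran library(tidyverse)

data("CO2")
co2tib2 <- as_tibble(CO2) %>%
  select(1:3, 5) %>%
  filter(uptake > 16)

# Same grouping as above, with the group size added as an extra summary column
co2tib2.counts <- co2tib2 %>%
  group_by(Plant) %>%
  summarize(n = n(), meanUp = mean(uptake), sdUp = sd(uptake))

co2tib2.counts
```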
As one last function in one last step, let’s now use
mutate() to create a new variable from existing ones. Using
co2tib2.summ from #3, we’ll create the new variable (column)
“CV” for the coefficient
of variation.
co2tib2.mutated <- mutate(co2tib2.summ, CV = (sdUp / meanUp) * 100)
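To see what the CV formula is doing, here is a tiny worked example on made-up numbers (the tibble toy is not part of the lab data). The coefficient of variation is the standard deviation expressed as a percentage of the mean.

```r
library(dplyr)  # attached automatically if you ran library(tidyverse)

# Toy data (invented for this sketch): one group with three measurements
toy <- tibble(meanUp = mean(c(10, 20, 30)), sdUp = sd(c(10, 20, 30)))

# Same mutate() call as above, applied to the toy summary
toy.cv <- mutate(toy, CV = (sdUp / meanUp) * 100)
toy.cv$CV  # sd is 10 and mean is 20, so CV is 50 (percent)
```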
If you’re thinking to yourself, ‘hey, what happened to the pipes?’ -
good! Everything we just did in #2-4 above can be coded where the output
of each line is ‘piped’ into the next one and you get a single object (a
tibble) as output. This is more efficient and eliminates creating all
the intermediate datasets. For good measure we’re adding one more
function to the end, arrange() which in this case arranges
the data in order from low to high, by the variable CV (but in general
you don’t need this function). Run the code below and include the output
of co2tib3 too.
co2tib3 <- co2tib %>% #using the input data co2tib,
select(c(1:3, 5)) %>%
filter(uptake > 16) %>%
group_by(Plant) %>%
summarize(meanUp = mean(uptake), sdUp = sd(uptake)) %>%
mutate(CV = (sdUp / meanUp) * 100) %>%
arrange(CV) #the last 'new' function
If you run summary(co2tib2.mutated) and
summary(co2tib3) (do it!), the output should be the same,
since the data is the same. The only difference is that in
co2tib3 the rows are sorted by the CV column
(take a look at them to see).
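If you would rather check this in code than by eye, the sketch below rebuilds both versions from scratch, sorts the step-by-step result by CV, and compares the two. The conversion to data.frame and the name same are choices made for this sketch.

```r
library(dplyr)  # attached automatically if you ran library(tidyverse)

data("CO2")
co2tib <- as_tibble(CO2)

# Step-by-step version (with intermediate objects)
co2tib2 <- filter(select(co2tib, c(1:3, 5)), uptake > 16)
co2tib2.summ <- summarize(group_by(co2tib2, Plant),
                          meanUp = mean(uptake), sdUp = sd(uptake))
co2tib2.mutated <- mutate(co2tib2.summ, CV = (sdUp / meanUp) * 100)

# Piped version
co2tib3 <- co2tib %>%
  select(c(1:3, 5)) %>%
  filter(uptake > 16) %>%
  group_by(Plant) %>%
  summarize(meanUp = mean(uptake), sdUp = sd(uptake)) %>%
  mutate(CV = (sdUp / meanUp) * 100) %>%
  arrange(CV)

# Sort the step-by-step result the same way, then compare the two tables
same <- all.equal(as.data.frame(arrange(co2tib2.mutated, CV)),
                  as.data.frame(co2tib3))
same
```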
On to ggplot2! We’re going to use the built-in dataset
iris for this exercise. You can find scatterplot examples
using both the iris data and the mtcars data
on the R Graph
Gallery.
Run help("iris") to read the Description of the data. If
you haven’t ever looked up what a flower’s sepal and petal are, google
that now (lots of pictures of iris sepal/petals out there).
Since we’re trying to work in the tidyverse now let’s convert the
data.frame iris to a tibble (NOTE that you CAN use ggplot()
with a data.frame - it’s not required to be a tibble).
iris.tb <- as_tibble(iris)
iris.tb
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
We’re going to start with some simple plots, like in Lab 1, but this
time using geom_point() which, of course, plots points.
iris.tb %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
Of course since we know there are three species of iris in the
dataset, we probably shouldn’t ignore that in our initial plotting. With
a simple addition to the aes() line we can ask R to plot
each species with a different color:
iris.tb %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
geom_point()
There are a LOT of options for specifying color in
ggplot2, either manually or using a pre-existing palette.
I’m going to present one flexible palette here; I like this Data
Novia blog post for much more on color, if/when you’re interested.
We’ll do more in future labs too, but for now (quoting from the
blog):
The viridis R package provides color palettes to make beautiful plots that are: printer-friendly, perceptually uniform and easy to read by those with colorblindness.
You should have run this already (from above), but just in case:
library(viridis)
In this next code chunk, notice that we keep the
col = Species in aes(), but add
scale_color_viridis() to the end.
iris.tb %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
geom_point() +
scale_color_viridis(discrete = TRUE, option = "viridis")
Run help("scale_color_viridis"). In the code above, what
does discrete = TRUE do? How many ‘options’ are there? Note
that for the code above, I could have used option = "D" but
that seems kind of boring given the different color map names.
If you want to recreate the plot with different color maps change the
option = "___" to something else.
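If you want to compare a few color maps without retyping the whole plot, one approach (a sketch; the names color.maps and iris.plots are made up here) is to build the same plot once per option name and keep the results in a named list.

```r
library(dplyr)    # for as_tibble(); attached if you ran library(tidyverse)
library(ggplot2)
library(viridis)

iris.tb <- as_tibble(iris)

# Two of the color map names listed in help("scale_color_viridis")
color.maps <- c("magma", "plasma")

# Build the same scatterplot once per color map and title it with the map name
iris.plots <- lapply(color.maps, function(map.name) {
  ggplot(iris.tb, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
    geom_point() +
    scale_color_viridis(discrete = TRUE, option = map.name) +
    ggtitle(map.name)
})
names(iris.plots) <- color.maps

iris.plots$magma  # print one of them
```

Adding more names to color.maps extends the list without any other change.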
Another way to look at the relationship of sepal length and width for
each iris species is to split the data by species and plot each one
separately - we’re going to use the function facet_wrap()
to do this. I’m keeping the distinct colors for each one here…
iris.tb %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
geom_point() +
facet_wrap(~Species) +
scale_color_viridis(discrete = TRUE, option = "viridis")
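By default facet_wrap() gives every panel the same axis ranges and picks the panel layout for you. The sketch below shows two optional tweaks (these are not required for the lab): stacking the panels in one column and letting each panel fit its own x-axis. Since the panel titles already name the species, it also hides the legend. The object name iris.facets is made up for this sketch.

```r
library(dplyr)    # for as_tibble() and %>%; attached if you ran library(tidyverse)
library(ggplot2)
library(viridis)

iris.tb <- as_tibble(iris)

iris.facets <- iris.tb %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  facet_wrap(~Species, ncol = 1, scales = "free_x") +  # one column; x-axis fitted per panel
  scale_color_viridis(discrete = TRUE, option = "viridis") +
  theme(legend.position = "none")  # panel titles already identify the species

iris.facets
```

Free scales make each panel easier to read on its own but harder to compare across species, so the shared default is often the better choice.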
A great feature of ggplot() is the ability to create a
new object in R, and then add on to it with more plot layers. This is
especially nice when you’re exploring different options. In the code
below I’m naming the new object ‘iris.base’. I’m using
option = "viridis" but you can decide on your own color map
to use (including from other packages if you’re ambitious) if you want
to - just know that you’ll need to use the same color map for additional
plotting in this exercise.
The one addition here (you might have noticed this in other ggplot
examples from the online readings) is adding the line at the end for
theme_bw(). There are a number of preset themes (the
default is theme_grey() which the plots above have used)
and I’m just introducing another one here. Run
help("theme_bw") for more choices.
iris.base <- iris.tb %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
geom_point() +
scale_color_viridis(discrete = TRUE, option = "viridis") +
theme_bw()
Note that there’s nothing plotted (yet). Run
class(iris.base) to see what R identifies this object as. In
your ‘Environment’ window (default is upper right), under
Data you should see ‘iris.base’ as a ‘List of 11’ (the exact count can
differ between ggplot2 versions).
Click on the iris.base object - the first object in the list should be
labeled ‘data’. What type is that data? If you’re not sure you can ask R
to report it using class(iris.base$data).
Now just run the code iris.base to actually plot it!
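As a sketch of what is stored inside the plot object, the code below rebuilds iris.base and compares its data element with the tibble that was piped in. It also prints the aesthetic mapping. The name same.data is made up here, and the conversion to data.frame is just a convenient way to compare.

```r
library(dplyr)    # for as_tibble() and %>%; attached if you ran library(tidyverse)
library(ggplot2)
library(viridis)

iris.tb <- as_tibble(iris)

iris.base <- iris.tb %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  scale_color_viridis(discrete = TRUE, option = "viridis") +
  theme_bw()

# The data stored inside the plot object should match the tibble that was piped in
same.data <- all.equal(as.data.frame(iris.base$data), as.data.frame(iris.tb))
same.data

# Which columns are mapped to which aesthetics
iris.base$mapping
```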
At this point in the book (Sec
2.6 Plot geoms), the authors go through a number of ‘smoothing line’ options.
To me this is more appropriately part of data analysis (what is a good
model for your data?) so we’re just going to consider the simple linear
regression line using lm() for this Exercise.
Note that since we created iris.base previously, we’re
adding on to it with geom_smooth() rather than creating a
plot from scratch. The reason I had you look at iris.base
in #6 is to see for yourself that this object has all the data that went
into creating the plot! So you don’t need to call the data,
iris.tb, in the code below.
iris.base + geom_smooth(formula = y ~ x, method = "lm", se = TRUE)
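Because color is mapped to Species, geom_smooth() draws one regression line per species. To see the numbers behind those lines, the sketch below fits the same kind of model directly with lm(), once per species, using base R only. The names iris.fits and iris.coefs are made up for this sketch.

```r
# Split the data by species and fit Sepal.Width ~ Sepal.Length within each group
iris.fits <- lapply(split(iris, iris$Species), function(one.species) {
  lm(Sepal.Width ~ Sepal.Length, data = one.species)
})

# Intercept (row 1) and slope (row 2) for each species, one column per species
iris.coefs <- sapply(iris.fits, coef)
iris.coefs
```

The slope row gives the steepness of each line in the plot. Running summary() on any element of iris.fits gives the full regression output for that species.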
OK that’s pretty slick, and the color of the lines carried over from
the aes() in the original plot. If you didn’t want the
confidence intervals around the regression line, you could set
se = FALSE instead. But let’s make one more tweak for the
final version: coloring the fill (and lines and points as before) by
‘Species’ (turn in the plot produced by this code).
iris.base +
geom_smooth(aes(color = Species, fill = Species), formula = y ~ x,
method = "lm", se = TRUE) +
scale_fill_viridis(discrete = TRUE)