If you haven’t already, please read the Introduction to this set of lab activities, which includes instructions on where to find the .csv files referenced in each lab.

Data and ggplot2

This section assumes that you’ve read this section from the ggplot2 book (Wickham et al. 2024) and explored some of the R Graph Gallery.

For Lab 2, install the viridis package now for some additional color options in ggplot. You also need the tidyverse package.

Go ahead and run

library(tidyverse)
library(viridis)

to load the tidyverse and viridis packages. Before we jump back into plotting, we’re going to get a little more practice tidying data! Remember,

illustration from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

Exercise 1

The dataset we’ll start with today is the built-in CO2 data, as used and manipulated in a book on machine learning (Rhys 2020). Since it ships with base R, you just need to call it using data() as below, and then we’re going to create a tibble out of it.

data("CO2")
co2tib <- as_tibble(CO2)    #I just used lower case to name it to save keyboard strokes
co2tib
## # A tibble: 84 × 5
##    Plant Type   Treatment   conc uptake
##    <ord> <fct>  <fct>      <dbl>  <dbl>
##  1 Qn1   Quebec nonchilled    95   16  
##  2 Qn1   Quebec nonchilled   175   30.4
##  3 Qn1   Quebec nonchilled   250   34.8
##  4 Qn1   Quebec nonchilled   350   37.2
##  5 Qn1   Quebec nonchilled   500   35.3
##  6 Qn1   Quebec nonchilled   675   39.2
##  7 Qn1   Quebec nonchilled  1000   39.7
##  8 Qn2   Quebec nonchilled    95   13.6
##  9 Qn2   Quebec nonchilled   175   27.3
## 10 Qn2   Quebec nonchilled   250   37.1
## # ℹ 74 more rows

Any time you’re not familiar with a built-in dataset, you can run help() to get more information about it.

1.1

Just for practice, look for the R Documentation on the CO2 dataset using help("CO2"). Do the specifications of the co2tib you just created match the variables listed under Format?

In Lab 1 you used select() and filter() with a tibble. Now let’s do a little more with some of the other functions from dplyr. First, let’s say we only want to focus on the uptake rates in an analysis, and in addition only for plants whose uptake rate was > 16.

1.2

Create a new tibble by selecting columns 1, 2, 3, and 5, and filtering based on uptake > 16. If you need help coding this, 1) check the end of Lab 1 and/or 2) use help(filter) or help(select). HINT: the resulting tibble should be 66 rows x 4 columns.
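If you get stuck, here is one possible way to write it, offered as a sketch rather than the only answer. The name co2tib2 is simply the one the later code chunks expect; nesting select() inside filter() is a choice made here to avoid pipes for the moment.

```r
library(tidyverse)

data("CO2")
co2tib <- as_tibble(CO2)

# Keep columns 1, 2, 3, and 5 (dropping conc), then keep rows with uptake above 16
co2tib2 <- filter(select(co2tib, c(1:3, 5)), uptake > 16)

# Compare the dimensions against the hint above
dim(co2tib2)
```

Note that uptake > 16 is a strict comparison, so a row with uptake exactly equal to 16 is dropped.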

1.3

Next, we would like to group by individual plants and summarize the data to get the mean and standard deviation of uptake within each group. To be clear, we can do these calculations in base R. However, if we want to have the output for later use (or save all of the calculations as their own dataset), we can achieve this efficiently in dplyr using the group_by() and summarize() functions, respectively. Run this code:

co2tib2.grp <- group_by(co2tib2, Plant)     #first the group

co2tib2.summ <- summarize(co2tib2.grp,      #then the summary stats
                    meanUp = mean(uptake), sdUp = sd(uptake))

(NOTE: my code uses the name ‘co2tib2’ for the tibble created in 1.2 above, but you can name it anything you want.)
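For comparison, here is a rough sketch of what the same per-plant calculations could look like in base R. The names co2sub, meanUp.base, and sdUp.base are made up for this illustration, and tapply() is just one of several base R options.

```r
data("CO2")

# Base R equivalent of the select + filter step from 1.2
co2sub <- CO2[CO2$uptake > 16, c(1:3, 5)]

# Drop plants that no longer have any rows after filtering,
# otherwise tapply() would return NA for those empty levels
co2sub$Plant <- droplevels(co2sub$Plant)

# One named vector per statistic, with one entry per plant
meanUp.base <- tapply(co2sub$uptake, co2sub$Plant, mean)
sdUp.base   <- tapply(co2sub$uptake, co2sub$Plant, sd)

meanUp.base
sdUp.base
```

This works, but the results come back as separate named vectors that you would have to stitch together yourself, whereas group_by() plus summarize() hands back a single tibble.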

1.4

As one last step, let’s use mutate() to create a new variable from existing ones. Using co2tib2.summ from 1.3, we’ll create the new variable (column) “CV” for the coefficient of variation, which is the standard deviation expressed as a percentage of the mean.

co2tib2.mutated <- mutate(co2tib2.summ,  CV = (sdUp / meanUp) * 100)

1.5

If you’re thinking to yourself, ‘hey, what happened to the pipes?’ - good! Everything we just did in 1.2 through 1.4 above can be coded so that the output of each line is ‘piped’ into the next one and you get a single object (a tibble) as output. This is more efficient and avoids creating all the intermediate datasets. For good measure we’re adding one more function to the end, arrange(), which in this case sorts the rows from low to high by the variable CV (but in general you don’t need this function). Run the code below and include the output of co2tib3 too.

co2tib3 <- co2tib %>%         #using the input data co2tib,
  select(c(1:3, 5)) %>%
  filter(uptake > 16) %>%
  group_by(Plant) %>%
  summarize(meanUp = mean(uptake), sdUp = sd(uptake)) %>%
  mutate(CV = (sdUp / meanUp) * 100) %>%
  arrange(CV)                 #the last 'new' function

1.6

If you run summary(co2tib2.mutated) and summary(co2tib3) (do it!), the output should be the same, since the data is the same. The only difference is that in co2tib3 the rows are sorted by the CV column (take a look at them to see).
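If you would rather check this with code than by eye, here is a minimal sketch. It rebuilds both versions from scratch so it runs on its own; the object name same is just a placeholder, and converting to data.frame before calling all.equal() is a choice made here to stick with the base R comparison method.

```r
library(tidyverse)

data("CO2")
co2tib <- as_tibble(CO2)

# Step-by-step version (1.2 to 1.4), with intermediate objects
co2tib2         <- filter(select(co2tib, c(1:3, 5)), uptake > 16)
co2tib2.summ    <- summarize(group_by(co2tib2, Plant),
                             meanUp = mean(uptake), sdUp = sd(uptake))
co2tib2.mutated <- mutate(co2tib2.summ, CV = (sdUp / meanUp) * 100)

# Piped version (1.5)
co2tib3 <- co2tib %>%
  select(c(1:3, 5)) %>%
  filter(uptake > 16) %>%
  group_by(Plant) %>%
  summarize(meanUp = mean(uptake), sdUp = sd(uptake)) %>%
  mutate(CV = (sdUp / meanUp) * 100) %>%
  arrange(CV)

# Sort the step-by-step result the same way, then compare the two
same <- all.equal(as.data.frame(arrange(co2tib2.mutated, CV)),
                  as.data.frame(co2tib3))
same
```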

Exercise 2

On to ggplot2! We’re going to use the built-in dataset iris for this exercise. You can find scatterplot examples using both the iris data and the mtcars data on the R Graph Gallery.

Run help("iris") to read the Description of the data. If you haven’t ever looked up what a flower’s sepal and petal are, google that now (lots of pictures of iris sepal/petals out there).

Plotting Points

2.1

Since we’re trying to work in the tidyverse now, let’s convert the data.frame iris to a tibble (NOTE that you CAN use ggplot() with a data.frame - it’s not required to be a tibble).

iris.tb <- as_tibble(iris)
iris.tb
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

2.2

We’re going to start with some simple plots, like in Lab 1, but this time using geom_point() which, of course, plots points.

iris.tb %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

2.3

Of course since we know there are three species of iris in the dataset, we probably shouldn’t ignore that in our initial plotting. With a simple addition to the aes() line we can ask R to plot each species with a different color:

iris.tb %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()

Color in ggplot2

There are a LOT of options for specifying color in ggplot2, either manually or using a pre-existing palette. I’m going to present one flexible palette here; I like this Datanovia blog post for much more on color, if/when you’re interested. We’ll do more in future labs too, but for now (quoting from the blog):

The viridis R package provides color palettes to make beautiful plots that are: printer-friendly, perceptually uniform and easy to read by those with colorblindness.

You should have run this already (from above), but just in case:

library(viridis)

In this next code chunk, notice that we keep the col = Species in aes(), but add scale_color_viridis() to the end.

iris.tb %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  scale_color_viridis(discrete = TRUE, option = "viridis")

2.4

Run help("scale_color_viridis"). In the code above, what does discrete = TRUE do? How many ‘options’ are there? Note that for the code above, I could have used option = "D" but that seems kind of boring given the different color map names.

If you want to recreate the plot with a different color map, change option = "___" to something else.
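As a quick sketch of that swap, the chunk below uses option = "plasma", one of the color map names listed in the viridis help page. The object name p.plasma is only a placeholder for this example.

```r
library(tidyverse)
library(viridis)

# Same scatterplot as before; only the color map name changes
p.plasma <- as_tibble(iris) %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  scale_color_viridis(discrete = TRUE, option = "plasma")

p.plasma
```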

Another way to look at the relationship of sepal length and width for each iris species is to split the data by species and plot each one separately - we’re going to use the function facet_wrap() to do this. I’m keeping the distinct colors for each one here…

iris.tb %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  facet_wrap(~Species) +
  scale_color_viridis(discrete = TRUE, option = "viridis")

Adding to a saved plot object

A great feature of ggplot() is the ability to create a new object in R, and then add on to it with more plot layers. This is especially nice when you’re exploring different options. In the code below I’m naming the new object ‘iris.base’. I’m using option = "viridis", but you can choose your own color map (including from other packages if you’re ambitious) - just know that you’ll need to use the same color map for additional plotting in this exercise.

The one addition here (you might have noticed this in other ggplot examples from the online readings) is adding the line at the end for theme_bw(). There are a number of preset themes (the default is theme_grey() which the plots above have used) and I’m just introducing another one here. Run help("theme_bw") for more choices.

iris.base <- iris.tb %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  scale_color_viridis(discrete = TRUE, option = "viridis") +
  theme_bw()

2.5

Note that there’s nothing plotted (yet). Run class(iris.base) to see what R identifies this object as. In your ‘Environment’ window (default is upper right), under Data you should see ‘iris.base’ as a ‘List of 11’ (the exact description may vary with your ggplot2 version). Click on the iris.base object - the first object in the list should be labeled ‘data’. What type is that data? If you’re not sure, you can ask R to report it using class(iris.base$data).

Now just run the code iris.base to actually plot it!
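Because iris.base is a saved object, trying out one of the other preset themes mentioned earlier only takes one more layer. The sketch below rebuilds iris.base so it runs on its own and then adds theme_minimal(); the name iris.minimal is just a placeholder.

```r
library(tidyverse)
library(viridis)

# Rebuild the saved plot object from this section
iris.base <- as_tibble(iris) %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  scale_color_viridis(discrete = TRUE, option = "viridis") +
  theme_bw()

# A complete theme added later replaces the earlier theme_bw()
iris.minimal <- iris.base + theme_minimal()
iris.minimal
```

The original iris.base is unchanged by this, so you can keep adding different layers to it and compare the results.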

At this point, Sec 2.6 (Plot geoms) of the book goes through a number of ‘smoothing line’ options. To me this is more appropriately part of data analysis (what is a good model for your data?), so we’re just going to consider the simple linear regression line using lm() for this Exercise.

Note that since we created iris.base previously, we’re adding on to it with geom_smooth() rather than creating a plot from scratch. The reason I had you look at iris.base in 2.5 is to see for yourself that this object has all the data that went into creating the plot! So you don’t need to call the data, iris.tb, in the code below.

iris.base + geom_smooth(formula = y ~ x, method = "lm", se = TRUE)
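Since color is mapped to Species in iris.base, geom_smooth() fits a separate line for each species. As a rough sketch of what those fits correspond to, the chunk below runs lm() directly on each species' rows using base R; the names by.species, fits, and lm.coefs are made up for this illustration.

```r
# Split the built-in iris data into one data.frame per species
by.species <- split(iris, iris$Species)

# Fit Sepal.Width as a linear function of Sepal.Length within each species
fits <- lapply(by.species, function(d) lm(Sepal.Width ~ Sepal.Length, data = d))

# One row per species: the intercept and slope of that species' line
lm.coefs <- t(sapply(fits, coef))
lm.coefs
```

The slopes in this table should match the tilt of the three lines in the plot.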

2.6

OK, that’s pretty slick, and the color of the lines carried over from the aes() in the original plot. If you didn’t want the confidence intervals around the regression lines, you could set se = FALSE instead. But let’s make one more tweak for the final version - coloring the fill (and lines and points as before) by ‘Species’ (turn in the plot produced by this code).

iris.base +
  geom_smooth(aes(color = Species, fill = Species), formula = y ~ x, 
              method = "lm", se = TRUE) +
  scale_fill_viridis(discrete = TRUE)
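For reference, here is a minimal sketch of the se = FALSE variant mentioned above, which draws the regression lines without the shaded confidence bands. It rebuilds iris.base so the chunk runs on its own; the name iris.nose is just a placeholder.

```r
library(tidyverse)
library(viridis)

# Rebuild the saved plot object from this section
iris.base <- as_tibble(iris) %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  scale_color_viridis(discrete = TRUE, option = "viridis") +
  theme_bw()

# Regression lines only; no fill scale is needed because no bands are drawn
iris.nose <- iris.base +
  geom_smooth(formula = y ~ x, method = "lm", se = FALSE)
iris.nose
```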

Citations

Rhys, H. 2020. Machine learning with R, the tidyverse, and mlr. Simon & Schuster.
Wickham, H., D. Navarro, and T. L. Pedersen. 2024. ggplot2: Elegant graphics for data analysis. 3rd edition. Springer-Verlag.