18 September, 2019

Before we start

Let’s get set up

1. Set up RStudio for today

If you use RStudio projects, choose where you’ll work, set up a project for today’s session and start a new R script. Otherwise, choose where you’ll work, set the working directory to that location, and start a new script.

Don’t put setwd() into your scripts! Anyone know why?

2. Load the dplyr, readr and ggplot2 packages

library(dplyr)
library(readr)
library(ggplot2)

Hopefully these are already installed. If not, install them and then add the library lines above to your script.

3. Get the Central American storms data

Download the STORMS.CSV data set from this link and place it into your working directory (or a subdirectory called data).

Warmup exercise

Comparing read.csv (base R) and readr::read_csv

Use base R read.csv to read in STORMS.CSV, call the resulting object storms, and then print the storms object to the console. Examine it using str and any other useful functions you know about.

Then repeat the exercise using the read_csv function from the readr package.

What kind of objects do read.csv and read_csv make? How are they similar? How are they different? Pay attention to the type of each column.
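
As a starting point for the comparison, here is a minimal sketch that uses a tiny temporary file instead of STORMS.CSV, so it runs without the download. The file contents and the object names (tmp, df_base, df_readr) are made up for illustration.

```r
library(readr)

# write a tiny, made-up CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,wind", "Allison,30", "Erin,70"), tmp)

df_base  <- read.csv(tmp)   # base R
df_readr <- read_csv(tmp)   # readr

class(df_base)    # a plain data frame
class(df_readr)   # a tibble, which is also a data frame

# compare the column types chosen by each function
sapply(df_base, class)
sapply(df_readr, class)
```

Note that in versions of R before 4.0, read.csv converts text columns to factors by default, whereas read_csv leaves them as character vectors.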

Making and using tbl objects

A tbl object (pronounced “tibble”) is essentially a special kind of data frame. Tidyverse functions tend to produce these. They work the same as a data frame, but with a few small differences… e.g. compact printing:

storms <- read_csv("STORMS.CSV")
storms
## # A tibble: 2,747 x 11
##    name    year month   day  hour   lat  long pressure  wind type   seasday
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <chr>    <dbl>
##  1 Allis…  1995     6     3     0  17.4 -84.3     1005    30 Tropi…       3
##  2 Allis…  1995     6     3     6  18.3 -84.9     1004    30 Tropi…       3
##  3 Allis…  1995     6     3    12  19.3 -85.7     1003    35 Tropi…       3
##  4 Allis…  1995     6     3    18  20.6 -85.8     1001    40 Tropi…       3
##  5 Allis…  1995     6     4     0  22   -86        997    50 Tropi…       4
##  6 Allis…  1995     6     4     6  23.3 -86.3      995    60 Tropi…       4
##  7 Allis…  1995     6     4    12  24.7 -86.2      987    65 Hurri…       4
##  8 Allis…  1995     6     4    18  26.2 -86.2      988    65 Hurri…       4
##  9 Allis…  1995     6     5     0  27.6 -86.1      988    65 Hurri…       5
## 10 Allis…  1995     6     5     6  28.5 -85.6      990    60 Tropi…       5
## # … with 2,737 more rows

Convert storms and iris to tibbles

Next in your script, add these lines to convert storms and iris to tibbles…

storms <- as_tibble(storms)

iris <- as_tibble(iris)

You don’t have to do this (dplyr is fine with normal data frames) but it will ensure your output matches the presentation.

Looking at your data

In addition to printing a tbl or data.frame, we can use the glimpse function to obtain different summary information about variables:

glimpse(storms)
## Observations: 2,747
## Variables: 11
## $ name     <chr> "Allison", "Allison", "Allison", "Allison", "Allison", …
## $ year     <dbl> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1…
## $ month    <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
## $ day      <dbl> 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7…
## $ hour     <dbl> 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18,…
## $ lat      <dbl> 17.4, 18.3, 19.3, 20.6, 22.0, 23.3, 24.7, 26.2, 27.6, 2…
## $ long     <dbl> -84.3, -84.9, -85.7, -85.8, -86.0, -86.3, -86.2, -86.2,…
## $ pressure <dbl> 1005, 1004, 1003, 1001, 997, 995, 987, 988, 988, 990, 9…
## $ wind     <dbl> 30, 30, 35, 40, 50, 60, 65, 65, 65, 60, 60, 45, 30, 35,…
## $ type     <chr> "Tropical Depression", "Tropical Depression", "Tropical…
## $ seasday  <dbl> 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7…

This is similar to str: the glimpse function tells us what variables are in storms as well as the type of each variable. (The glimpse function actually belongs to the tibble package, but we don’t need to worry about that…)

Overview of dplyr and tbls

Getting to grips with the basics

Why use dplyr?

dplyr implements a grammar of data manipulation that enables you to manipulate data and summarise the information in a data set (e.g. group means).

Advantages of using dplyr

  • Provides a consistent framework for data manipulation
  • Designed to work well with the ggplot2 plotting system
  • Fast compared to many base R functions
  • Allows you to work with data stored in many ways (e.g. in a database)

Five key verbs

dplyr has five main “verbs” (i.e. functions):

  • select: Extract a subset of variables
  • filter: Extract a subset of rows
  • arrange: Reorder rows
  • mutate: Construct new variables
  • summarise: Calculate information about groups

We’ll also explore a few more useful functions such as slice, rename, transmute, and group_by. There are many others…

It is helpful to classify the verbs according to what they work on:

  • observations (rows): filter & slice & arrange
  • variables (columns): select & rename & mutate
  • summaries: summarise (or summarize)

(This classification only works if your data are tidy, i.e. there is one observation per row and one column per variable. Make sure you read about this idea in the course book.)

Using select

Extracting a subset of variables

Basic usage

We use select to extract a subset of variables for further analysis. Using select looks like this:

select(data, Variable1, Variable2, ...)

Arguments

  • data: a data.frame or tbl object
  • VariableX: names of variables in data

Exercise

Selecting two variables

Use the select function with the storms data set to make a new data set containing only name and year. Assign this new data set a name, and then check that it contains the right variables using the glimpse function.

storms_simple <- select(storms, name, year)
glimpse(storms_simple)
## Observations: 2,747
## Variables: 2
## $ name <chr> "Allison", "Allison", "Allison", "Allison", "Allison", "All…
## $ year <dbl> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995,…

Selecting & dropping variable ranges

The select function makes selecting/removing groups of variables easy:

  • Use : to select a sequence of variables
  • Use - to drop a sequence of variables

The sequence can be specified using numbers (for position) or names.

Usage:

# a range of variables to keep
select(data, Variable1:Variable5)
# a range of variables to drop
select(data, -(Variable1:Variable5))

Example:

iris_fewer <- select(iris, Petal.Length:Species)
iris_fewer
## # A tibble: 150 x 3
##    Petal.Length Petal.Width Species
##           <dbl>       <dbl> <fct>  
##  1          1.4         0.2 setosa 
##  2          1.4         0.2 setosa 
##  3          1.3         0.2 setosa 
##  4          1.5         0.2 setosa 
##  5          1.4         0.2 setosa 
##  6          1.7         0.4 setosa 
##  7          1.4         0.3 setosa 
##  8          1.5         0.2 setosa 
##  9          1.4         0.2 setosa 
## 10          1.5         0.1 setosa 
## # … with 140 more rows
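
The same range can also be given by position rather than by name. A short sketch, assuming the standard column order of the built-in iris data (the object names are only for illustration):

```r
library(dplyr)

# columns 3 to 5 of iris are Petal.Length, Petal.Width and Species
by_position <- select(iris, 3:5)
by_name     <- select(iris, Petal.Length:Species)
identical(names(by_position), names(by_name))

# dropping the first two columns leaves the same three variables
by_dropping <- select(iris, -(1:2))
names(by_dropping)
```

Selecting by name is usually the safer habit, because positions change silently if columns are added or reordered.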

Exercise

Selecting a range of variables

Use the select function with the storms data set to select just the name, year and month variables.

storms_fewer <- select(storms, name:month)
glimpse(storms_fewer)
## Observations: 2,747
## Variables: 3
## $ name  <chr> "Allison", "Allison", "Allison", "Allison", "Allison", "Al…
## $ year  <dbl> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995…
## $ month <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
# alternatively
x <- select(storms, -(day:seasday))

Helper functions

There are several helper functions that work with select to simplify common variable selection tasks:

  • starts_with("xyz"): every name that starts with "xyz"
  • ends_with("xyz"): every name that ends with "xyz"
  • contains("xyz"): every name that contains "xyz"
  • matches("xyz"): every name that matches the regular expression "xyz"
  • one_of(names): every name that appears in names (character vector).

Usage:

select(data, help_func("xyz"))

Example:

iris_petal <- select(iris, starts_with("Petal"))
glimpse(iris_petal)
## Observations: 150
## Variables: 2
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0…
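
The other helpers follow the same pattern. A brief sketch using iris again (the object names width_vars, petal_vars and regex_vars are hypothetical):

```r
library(dplyr)

width_vars <- select(iris, ends_with("Width"))
names(width_vars)   # Sepal.Width, Petal.Width

petal_vars <- select(iris, contains("etal"))
names(petal_vars)   # Petal.Length, Petal.Width

# matches() takes a regular expression:
# names that start with S and end with Width
regex_vars <- select(iris, matches("^S.*Width$"))
names(regex_vars)   # Sepal.Width
```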

Exercise

Using select with helper functions

Use the select function with the storms data set to create a new data set containing just the lat and long variables. Do this using the starts_with helper function inside select.

storms_fewer <- select(storms, starts_with("l"))
glimpse(storms_fewer)
## Observations: 2,747
## Variables: 2
## $ lat  <dbl> 17.4, 18.3, 19.3, 20.6, 22.0, 23.3, 24.7, 26.2, 27.6, 28.5,…
## $ long <dbl> -84.3, -84.9, -85.7, -85.8, -86.0, -86.3, -86.2, -86.2, -86…

Using select and rename

Renaming variables

Renaming while selecting

We can use select to rename variables as we select them using the newName = varName construct.

Usage:

select(data, newName1 = Var1, newName2 = Var2, ...)

Example:

iris_select <- select(iris, PetalLength = Petal.Length)
glimpse(iris_select)
## Observations: 150
## Variables: 1
## $ PetalLength <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.…

Renaming without selecting

Use rename to rename variables while keeping all variables using the newName = varName construct.

Usage:

rename(data, newName1 = Var1, newName2 = Var2, ...)

Example:

iris_renamed <- rename(iris, 
                       PetalLength = Petal.Length, 
                       PetalWidth  = Petal.Width)
glimpse(iris_renamed)
## Observations: 150
## Variables: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3…
## $ PetalLength  <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1…
## $ PetalWidth   <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, set…

Exercise

Renaming

Extract just the lat and long columns from the storms data set and rename these as latitude and longitude respectively.

storms_renamed <- select(storms, latitude = lat, longitude = long)
glimpse(storms_renamed)
## Observations: 2,747
## Variables: 2
## $ latitude  <dbl> 17.4, 18.3, 19.3, 20.6, 22.0, 23.3, 24.7, 26.2, 27.6, …
## $ longitude <dbl> -84.3, -84.9, -85.7, -85.8, -86.0, -86.3, -86.2, -86.2…

Using mutate

Making new variables

Basic usage

We use mutate to add or change variables for further analysis. This is how we use mutate:

mutate(data, NewVar = <expression>, ...)

Arguments

  • data: a data.frame or tbl object
  • NewVar: name of a new variable to create
  • <expression>: an R expression that references variables in data

Comments

  • The <expression> which appears on the right hand side of the = can be any valid R expression that uses variables in data.

  • You may use more than one NewVar = <expression> at a time if you need to construct several new variables.
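
If NewVar is the name of a variable that already exists, mutate changes that variable instead of adding a new one. A minimal sketch, treating the iris measurements as centimetres and converting one of them to millimetres (the name iris_mm is only illustrative):

```r
library(dplyr)

# overwrite Sepal.Length, converting it from cm to mm
iris_mm <- mutate(iris, Sepal.Length = Sepal.Length * 10)

ncol(iris_mm)               # still 5 columns, none were added
head(iris_mm$Sepal.Length)  # ten times the original values
```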

Exercise

Making a new variable

Use the mutate function with the iris data set to make a new variable which is the petal area, \((Area = Length \times Width)\).

iris_area <- mutate(iris, Petal.Area = Petal.Length * Petal.Width)
glimpse(iris_area)
## Observations: 150
## Variables: 6
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, set…
## $ Petal.Area   <dbl> 0.28, 0.28, 0.26, 0.30, 0.28, 0.68, 0.42, 0.30, 0.2…

Multiple calculations

We can add more than one variable at a time using mutate. Each new variable can also use one or more variables created in a previous step.

Usage:

mutate(data, NewVar1 = <expression1>, 
             NewVar2 = <expression2 using NewVar1>)

Example:

iris_new_vars <- 
  mutate(iris, Sepal.Eccentricity = Sepal.Length / Sepal.Width,
               Petal.Eccentricity = Petal.Length / Petal.Width,
               Eccentricity.Diff  = Sepal.Eccentricity - Petal.Eccentricity)
glimpse(iris_new_vars)
## Observations: 150
## Variables: 8
## $ Sepal.Length       <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, …
## $ Sepal.Width        <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, …
## $ Petal.Length       <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, …
## $ Petal.Width        <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, …
## $ Species            <fct> setosa, setosa, setosa, setosa, setosa, setos…
## $ Sepal.Eccentricity <dbl> 1.457143, 1.633333, 1.468750, 1.483871, 1.388…
## $ Petal.Eccentricity <dbl> 7.000000, 7.000000, 6.500000, 7.500000, 7.000…
## $ Eccentricity.Diff  <dbl> -5.542857, -5.366667, -5.031250, -6.016129, -…

Exercise

Making several new variables

Use the mutate function with the iris data set to make two new area variables, one for petal and one for sepal. Create a third variable which is the ratio of the petal and sepal areas.

Do all of this in one call to mutate, i.e. use mutate only once to do all of this.

iris_ratio <- mutate(iris, Sepal.Area = Sepal.Length * Sepal.Width,
                           Petal.Area = Petal.Length * Petal.Width,
                           PS.Area.Ratio = Petal.Area / Sepal.Area)
glimpse(iris_ratio)
## Observations: 150
## Variables: 8
## $ Sepal.Length  <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, …
## $ Sepal.Width   <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, …
## $ Petal.Length  <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, …
## $ Petal.Width   <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, …
## $ Species       <fct> setosa, setosa, setosa, setosa, setosa, setosa, se…
## $ Sepal.Area    <dbl> 17.85, 14.70, 15.04, 14.26, 18.00, 21.06, 15.64, 1…
## $ Petal.Area    <dbl> 0.28, 0.28, 0.26, 0.30, 0.28, 0.68, 0.42, 0.30, 0.…
## $ PS.Area.Ratio <dbl> 0.015686275, 0.019047619, 0.017287234, 0.021037868…

Using filter

Selecting subsets of observations

Basic usage

We use filter to select a subset of rows for further analysis, based on the result(s) of one or more logical comparisons. Using filter looks like this:

filter(data, <expression>)

Arguments

  • data: a data.frame or tbl object
  • <expression>: an R expression that implements a logical comparison using variables in data

Comments

  • The <expression> can be any valid R expression that uses variables in data and returns a logical vector of TRUE / FALSE values.
  • The <expression> typically uses a combination of relational (e.g. < and ==) and logical (e.g. & and |) operators
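
A short sketch of how relational and logical operators combine inside filter, using the built-in iris data. The thresholds and object names are arbitrary choices for illustration.

```r
library(dplyr)

# & : both conditions must hold
big_setosa <- filter(iris, Species == "setosa" & Sepal.Length > 5.5)

# separating conditions with a comma is equivalent to using &
big_setosa_2 <- filter(iris, Species == "setosa", Sepal.Length > 5.5)

# | : at least one condition must hold
extremes <- filter(iris, Sepal.Length < 4.5 | Sepal.Length > 7.5)

nrow(big_setosa)
nrow(extremes)
```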

Exercise

Subsetting observations on one variable

Use the filter function with the storms data set to create a new data set containing just the observations associated with storms classified as Hurricanes.

Hint: use glimpse to remind yourself of the variable names in storms. You need to work out which one contains information about the storm category.

filter(storms, type == "Hurricane")
## # A tibble: 896 x 11
##    name    year month   day  hour   lat  long pressure  wind type   seasday
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <chr>    <dbl>
##  1 Allis…  1995     6     4    12  24.7 -86.2      987    65 Hurri…       4
##  2 Allis…  1995     6     4    18  26.2 -86.2      988    65 Hurri…       4
##  3 Allis…  1995     6     5     0  27.6 -86.1      988    65 Hurri…       5
##  4 Erin    1995     8     1     0  23.6 -74.9      992    70 Hurri…      62
##  5 Erin    1995     8     1     6  24.3 -75.7      988    75 Hurri…      62
##  6 Erin    1995     8     1    12  25.5 -76.3      985    75 Hurri…      62
##  7 Erin    1995     8     1    18  26.3 -77.7      980    75 Hurri…      62
##  8 Erin    1995     8     2     0  26.9 -79        982    75 Hurri…      63
##  9 Erin    1995     8     2     6  27.7 -80.4      985    75 Hurri…      63
## 10 Erin    1995     8     3     0  28.8 -84.7      985    65 Hurri…      64
## # … with 886 more rows

Exercise

Subsetting observations on more than one variable

Repeat the last exercise, but now extract the observations associated with Hurricanes that took place in 1997 or later.

filter(storms, 
       type == "Hurricane", year >= 1997) 
## # A tibble: 501 x 11
##    name   year month   day  hour   lat  long pressure  wind type    seasday
##    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <chr>     <dbl>
##  1 Bill   1997     7    12    12  37.9 -61.1      987    65 Hurric…      42
##  2 Bill   1997     7    12    18  39.6 -58.4      987    65 Hurric…      42
##  3 Danny  1997     7    18     6  29.2 -89.9      992    65 Hurric…      48
##  4 Danny  1997     7    18    12  29.5 -89.4      990    70 Hurric…      48
##  5 Danny  1997     7    18    18  29.7 -89        988    70 Hurric…      48
##  6 Danny  1997     7    19     0  29.8 -88.4      984    70 Hurric…      49
##  7 Danny  1997     7    19     6  30.1 -88.1      987    65 Hurric…      49
##  8 Danny  1997     7    19    12  30.3 -88        984    70 Hurric…      49
##  9 Danny  1997     7    19    18  30.4 -87.9      986    65 Hurric…      49
## 10 Erika  1997     9     4    18  15.2 -53.7      999    65 Hurric…      96
## # … with 491 more rows
# or use: filter(storms, type == "Hurricane" & year >= 1997)

Using arrange

Reordering observations

Basic usage

We use arrange to reorder the rows of our data set. This can help us see associations among our variables. Using arrange looks like this:

arrange(data, Variable1, Variable2, ...)

Arguments

  • data: a data.frame or tbl object
  • VariableX: names of variables in data

Comments

  • The order of sorting corresponds to the order that variables appear in arrange, meaning data is sorted according to Variable1, then Variable2, then Variable3, etc
  • The sort order is from smallest to largest (ascending). If you want to reverse the sort order to go from largest to smallest (descending) use desc(VariableX).
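
To illustrate both comments at once, here is a small sketch with iris that sorts on two variables and reverses the order of the second one (iris_sorted is a made-up name):

```r
library(dplyr)

# sort by Species (ascending), then by Sepal.Length from largest to smallest
iris_sorted <- arrange(iris, Species, desc(Sepal.Length))
head(iris_sorted, 3)
```

Because Species is a factor, "ascending" here means the order of its levels, so the setosa rows come first.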

Exercise

Reordering observations

Use the arrange function to reorder the observations in the storms data set, according to the pressure variable. Store the resulting data set and then use the View function to examine it. What can you say about the association between atmospheric pressure and storm category?

storm.sort <- arrange(storms, pressure)
View(storm.sort)

Using summarise

Calculating summaries of variables

Basic usage

We use summarise to calculate summary statistics for further analysis. This is how to use summarise:

summarise(data, SummaryVar = <expression>, ...)

Arguments

  • data: a data.frame or tbl object
  • SummaryVar: name of your summary variable
  • <expression>: an R expression that references variables in data and returns a single value

Comments

  • The <expression> which appears on the right hand side of the = can be any valid R expression that uses variables in data. However, <expression> should return a single value.
  • Although summarise looks a little like mutate, it is designed to construct a completely new dataset containing summaries of one or more variables.
  • You may use more than one SummaryVar = <expression> at a time if you need to construct several summaries.
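
As a sketch of several summaries in one call, the following computes a count, a mean and a standard deviation for petal length in iris. The summary names n_obs, mean_pl and sd_pl are arbitrary.

```r
library(dplyr)

petal_summary <- summarise(iris,
                           n_obs   = n(),                # number of rows
                           mean_pl = mean(Petal.Length),
                           sd_pl   = sd(Petal.Length))
petal_summary   # one row, three summary columns
```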

Exercise

Calculating the mean of two variables

Use the summarise function with the iris dataset to calculate the mean sepal length and the mean sepal width.

Hint: You need to work out which R function calculates a mean. The clue is in the name.

summarise(iris, 
          mean_sl = mean(Sepal.Length), 
          mean_sw = mean(Sepal.Width))
## # A tibble: 1 x 2
##   mean_sl mean_sw
##     <dbl>   <dbl>
## 1    5.84    3.06

Exercise

Calculating a more complex summary of a variable

Use the summarise function with the iris dataset to calculate the mean area of sepals.

summarise(iris, mean_sl = mean(Sepal.Length * Sepal.Width))
## # A tibble: 1 x 1
##   mean_sl
##     <dbl>
## 1    17.8
summarise(iris, mean_sl = mean(Sepal.Length) * mean(Sepal.Width))
## # A tibble: 1 x 1
##   mean_sl
##     <dbl>
## 1    17.9

Which one is right?
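
One way to reason about this: the first version averages the area of each individual flower, which matches the definition Area = Length × Width used earlier, whereas the second multiplies two averages. A small sketch (the names x, y, n and gap are just for illustration) suggests the two differ by the covariance between length and width:

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width
n <- length(x)

mean_of_products <- mean(x * y)        # area per flower, then average
product_of_means <- mean(x) * mean(y)  # average length times average width

# the gap between the two equals the covariance of x and y,
# rescaled from the n - 1 divisor used by cov() to a divisor of n
gap <- mean_of_products - product_of_means
all.equal(gap, cov(x, y) * (n - 1) / n)
```

The two calculations only agree when the variables are uncorrelated, so the mean of the per-flower areas is the one that answers the exercise.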

Using group_by

Making summaries for groups of observations

Basic usage

We use group_by to add grouping information to a data frame or tibble. This is how we use group_by:

group_by(data, GroupVar1, GroupVar2, ...)

Arguments

  • data: a data.frame or tbl object
  • GroupVar: name of grouping variable(s)

Comments

  • The group_by function does not do anything other than add grouping information to a tbl. It is only useful when used with summarise or mutate.
  • Using group_by with summarise enables us to calculate numerical summaries on a per group basis.
  • Using group_by with mutate enables us to add variables on a per group basis.

Exercise

Using group_by to calculate group-specific means

Use the group_by function and the summarise functions with the storms dataset to calculate the mean wind speed associated with each storm type.

Hint: This is a two step exercise: 1) Use group_by to add some information to storms, remembering to assign the result a name; 2) Then use summarise on this new dataset.

# 1. make a grouped tibble
storms_grouped <- group_by(storms, type)
# 2. use summarise on the grouped data
summarise(storms_grouped, mean_wind = mean(wind))
## # A tibble: 4 x 2
##   type                mean_wind
##   <chr>                   <dbl>
## 1 Extratropical            40.1
## 2 Hurricane                84.7
## 3 Tropical Depression      27.4
## 4 Tropical Storm           47.3

Exercise

Using group_by to group by more than one variable

Use the group_by and summarise functions with the storms dataset to calculate the mean and maximum wind speed associated with each combination of month and year. Assign the result a name and then use View to examine it. Which month in which year saw the largest maximum wind speed?

Hint: You can guess the names of the two functions that calculate the mean and max from a numeric vector.

# 1. make a grouped tibble
storms_grouped <- group_by(storms, year, month)
# 2. use summarise on the grouped data
summarise(storms_grouped, mean_speed = mean(wind), max_speed = max(wind))
## # A tibble: 30 x 4
## # Groups:   year [6]
##     year month mean_speed max_speed
##    <dbl> <dbl>      <dbl>     <dbl>
##  1  1995     6       44.4        65
##  2  1995     7       41.4        60
##  3  1995     8       54.2       120
##  4  1995     9       64.4       120
##  5  1995    10       50.5       130
##  6  1995    11       55          70
##  7  1996     6       35.2        45
##  8  1996     7       56.8       100
##  9  1996     8       57.2       125
## 10  1996     9       60.9       120
## # … with 20 more rows
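
The "Groups: year [6]" line in this output indicates that the result is still grouped by year: summarise removes only the last grouping variable. The sketch below shows the same behaviour with the built-in mtcars data, chosen only because it needs no download, and then sorts the summary to find the largest value. All object names are illustrative, and recent dplyr versions may print a message about the remaining grouping.

```r
library(dplyr)

# 1. group by two variables
cars_grouped <- group_by(mtcars, cyl, gear)
# 2. summarise: one row per cyl-gear combination
hp_summary <- summarise(cars_grouped, max_hp = max(hp))

group_vars(hp_summary)   # "cyl": the gear grouping has been dropped

# 3. remove the remaining grouping and put the largest value first
top <- head(arrange(ungroup(hp_summary), desc(max_hp)), 1)
top
```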

Exercise

Using group_by and mutate

The group_by function works with any dplyr verb that operates on variables (columns). Use the group_by and mutate functions with the iris dataset to calculate a “mean centred” version of sepal length. A centred variable is just one that has had its overall mean subtracted from every value.

Do you understand the different behaviour of summarise and mutate when used alongside group_by?

# 1. group iris by species identity
iris_grouped <- group_by(iris, Species)
# 2. use mutate on the grouped data
mutate(iris_grouped, sl_centred = Sepal.Length - mean(Sepal.Length))
## # A tibble: 150 x 6
## # Groups:   Species [3]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species sl_centred
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>        <dbl>
##  1          5.1         3.5          1.4         0.2 setosa      0.0940
##  2          4.9         3            1.4         0.2 setosa     -0.106 
##  3          4.7         3.2          1.3         0.2 setosa     -0.306 
##  4          4.6         3.1          1.5         0.2 setosa     -0.406 
##  5          5           3.6          1.4         0.2 setosa     -0.006 
##  6          5.4         3.9          1.7         0.4 setosa      0.394 
##  7          4.6         3.4          1.4         0.3 setosa     -0.406 
##  8          5           3.4          1.5         0.2 setosa     -0.006 
##  9          4.4         2.9          1.4         0.2 setosa     -0.606 
## 10          4.9         3.1          1.5         0.1 setosa     -0.106 
## # … with 140 more rows
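
To compare the two behaviours side by side, here is a sketch that applies both summarise and mutate to the same grouped data (the names iris_means and iris_centred are made up):

```r
library(dplyr)

iris_grouped <- group_by(iris, Species)

# summarise collapses each group to a single row
iris_means <- summarise(iris_grouped, mean_sl = mean(Sepal.Length))
nrow(iris_means)     # 3, one row per species

# mutate keeps every row and uses the mean of that row's own group
iris_centred <- mutate(iris_grouped,
                       sl_centred = Sepal.Length - mean(Sepal.Length))
nrow(iris_centred)   # 150

# within each species the centred values average to (almost exactly) zero
centred_check <- summarise(iris_centred, check = mean(sl_centred))
centred_check
```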

Using %>%

Piping or chaining

Motivating example

We often need to perform a sequence of calculations. We do this by applying a series of functions in sequence. Here are two ways to do this:

Method 1: Store intermediate results…

x <- 10
x <- sqrt(x)
x <- exp(x)
round(x, 2)
## [1] 23.62

Method 2: Use function nesting…

round(exp(sqrt(10)), 2)
## [1] 23.62

These do the same thing. Method 1 is easy to read, but is very verbose. Method 2 is concise, but not at all easy to read.

A third way, using %>%…

The dplyr package makes available a special operator called “the pipe” (it originally comes from the magrittr package). The pipe operator looks like this: %>%.

This allows us to avoid storing intermediate results (method 1), while reading a sequence of function calls from left to right. For example:

10 %>% sqrt(.) %>% exp(.) %>% round(., 2)
## [1] 23.62

Or equivalently, and even simpler…

10 %>% sqrt() %>% exp() %>% round(2)
## [1] 23.62
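
The rule behind both forms is that x %>% f(y) is evaluated as f(x, y): the value on the left becomes the first argument of the function on the right, unless a dot says otherwise. A small sketch with arbitrary numbers:

```r
library(dplyr)   # makes %>% available

# the piped value becomes the first argument
piped  <- 3.14159 %>% round(2)
direct <- round(3.14159, 2)
identical(piped, direct)

# a dot sends the piped value to a different argument instead
dotted <- 2 %>% round(3.14159, digits = .)
identical(dotted, direct)
```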

Why is this handy?

We can use %>% with any function we like. Look at these examples of a group_by and summarise that use the “traditional” methods of dealing with a sequence of calculations:

# method 1 -- store intermediate results
iris_grouped <- group_by(iris, Species)
summarise(iris_grouped, meanSL = mean(Sepal.Length))
# method 2 -- one step with nested functions
summarise(group_by(iris, Species), meanSL = mean(Sepal.Length))

The “piped” equivalent is more natural to read…

iris %>% 
  group_by(Species) %>% 
  summarise(meanSL = mean(Sepal.Length))

Exercise

Using group_by to calculate group specific summaries

We just used group_by and summarise with the storms data set to calculate the mean wind speed associated with each type of storm. Repeat this exercise but now use the pipe. Print the results directly to the Console.

storms %>% 
  group_by(type) %>%
  summarise(mean_speed = mean(wind))
## # A tibble: 4 x 2
##   type                mean_speed
##   <chr>                    <dbl>
## 1 Extratropical             40.1
## 2 Hurricane                 84.7
## 3 Tropical Depression       27.4
## 4 Tropical Storm            47.3

Exercise

Using group_by to calculate group specific summaries

Earlier we used group_by and summarise with the storms data set to calculate the mean wind speed associated with each type of storm. Repeat this exercise but now use the pipe. Store the results and then use glimpse to summarise them.

storms_summary <- 
  storms %>% 
  group_by(type) %>%
  summarise(mean_speed = mean(wind))

glimpse(storms_summary)
## Observations: 4
## Variables: 2
## $ type       <chr> "Extratropical", "Hurricane", "Tropical Depression", …
## $ mean_speed <dbl> 40.06068, 84.65960, 27.35867, 47.32181

Overview of ggplot2

The grammar of graphics

Why use ggplot2?

Roughly speaking, there are three commonly used plotting frameworks in R.

  • base graphics
  • lattice package
  • ggplot2 package

Advantages of using ggplot2

  • Consistent and intuitive framework for plotting
  • Flexible enough to make every plot you will need
  • Works well with dplyr

Disadvantages of using ggplot2

  • You have to learn “the grammar” to use it well
  • Vast package, can be intimidating
  • More than one way to do things

Key concepts

You need to wrap your head around a few ideas to start using ggplot2 effectively:

  • layers: We build ggplot2 objects by adding one or more layers together in a stepwise way. We only plot the object when we are ready. Each layer is made up of things like data, aesthetics, geometric objects, etc.
  • aesthetics: The word aesthetics refers to the information in a plot. For example, which variables are associated with the x and y axes? We specify this using the aes function.
  • geometric objects: Geometric objects (“geoms”) determine how the information is displayed. For example, will it be a scatter plot or a bar plot? We can specify geoms by adding a layer via functions beginning with geom_.

We build up a plot by combining different functions using the + operator. This has nothing to do with numeric addition!

Illustrative example

Set up the basic object: define a default data frame (my_df) and aesthetic mappings (aes(...)):

ggplot_object <- ggplot(my_df, aes(x = var1, y = var2))

Add a layer using the point ‘geom’…

ggplot_object <- ggplot_object + geom_point()

Show the plot: just ‘print’ the object to the console

ggplot_object

Real example: scatter plots

Scatter plots are used to show the relationship between 2 continuous variables. Using the iris dataset, let’s examine the relationship between petal length and petal width.

STEP 1:

We use the aes function inside the ggplot function to specify which variables we plan to display. We also have to specify where the data are:

plt <- ggplot(iris, aes(x = Petal.Width, y = Petal.Length))

All we did here was make a ggplot object.

We can try to print the plot to the screen:

plt

This produces an empty plot because we haven’t added a layer using a geom_ function yet.

STEP 2:

We want to make a scatter plot so we need to use the geom_point function:

plt <- plt + geom_point()

Notice that all we do is “add” the required layer. Now we have something to plot:

plt
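
One way to see what “adding” a layer does is to count the layers stored in the plot object before and after. This sketch relies on the layers component of a ggplot object, which is an internal detail and may differ between ggplot2 versions; the names plt_empty and plt_points are only for illustration.

```r
library(ggplot2)

plt_empty  <- ggplot(iris, aes(x = Petal.Width, y = Petal.Length))
plt_points <- plt_empty + geom_point()

length(plt_empty$layers)    # 0: nothing to draw yet
length(plt_points$layers)   # 1: the point layer
```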

STEP 3:

Maybe we should improve the axis labels? To do this, we need to “add” label information using the labs function:

plt <- plt + labs(x = "Petal Width (cm)", y = "Petal Length (cm)")
plt

This just adds some new information about labelling to the pre-existing ggplot object. Now it prints with improved axis labels:

Example: Scatter plots

Doing it all in one go… We don’t have to build a plot object up in separate steps and then explicitly “print” it to the Console. If we just want to make the plot in one go we can do it like this:

ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) + 
  geom_point() + 
  labs(x = "Petal Width (cm)", y = "Petal Length (cm)")

Exercise

Customising your plot

Repeat the example we just stepped through, but now try to customise the point colours and their size. If that’s too easy, see if you can make the points semi-transparent. An example of suitable output is given below.

Hint: The geom_point function is responsible for altering these features. It has arguments that control properties like point colour and size.

Answer

ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) + 
  geom_point(colour = "blue", size = 3, alpha = 0.5) + 
  labs(x = "Petal Width (cm)", y = "Petal Length (cm)")

Adding more information

Q: The last graph was quite nice, but what information was missing?

ggplot(iris, aes(x = Petal.Width, y = Petal.Length, colour = Species)) + 
  geom_point(size = 3, alpha = 0.5) + 
  labs(x = "Petal Width", y = "Petal Length")

Exercise

Aesthetic mappings vs. arguments to geom_

Notice that we can set “colour” in two places: the aesthetic mapping (aes) or via an argument to a geom (geom_). What happens if we set the colour in both places at once?

Experiment with the iris petal length vs. petal width scatter plot example to work this out. Which one—the aesthetic mapping or geom argument—has precedence?

Answer

ggplot(iris, aes(x = Petal.Width, y = Petal.Length, colour = Species)) + 
  geom_point(colour = "blue", size = 3, alpha = 0.5) + 
  labs(x = "Petal Width", y = "Petal Length")

Putting it all together (dplyr and ggplot2)

We want to make the following scatter plot. It shows mean wind speed against mean pressure, where the means are calculated for each combination of storm name and type. The storm type of each point is delineated by its colour.

Exercise

Using dplyr and ggplot2 together (part 1)

The first step is to work out how to use dplyr to calculate the mean wind speed and mean pressure for each combination of storm name and type. Do this with the pipe (%>%) operator, and give the resulting data the name storms_summary.

Answer

storms_summary <-
  storms %>% 
  group_by(name, type) %>%
  summarise(wind = mean(wind), pressure = mean(pressure))
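
One detail that is easy to miss: after grouping by two variables, summarise removes only the last level of grouping, so the result is still grouped by the first variable. The sketch below uses the built-in ChickWeight data rather than storms so that it runs on its own. The .groups argument is an assumption about your installation, as it is only available in dplyr 1.0.0 and later, and the object names are illustrative.

```r
library(dplyr)

# Two grouping variables: the summary is still grouped by the first one
by_diet_time <- ChickWeight %>%
  group_by(Diet, Time) %>%
  summarise(mean_weight = mean(weight))

group_vars(by_diet_time)   # "Diet"

# .groups = "drop" (dplyr >= 1.0.0) returns an ungrouped result instead
ungrouped <- ChickWeight %>%
  group_by(Diet, Time) %>%
  summarise(mean_weight = mean(weight), .groups = "drop")

group_vars(ungrouped)      # character(0)
```

With older versions of dplyr, calling ungroup() on the summary achieves the same thing.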

Exercise

Using dplyr and ggplot2 together (part 2)

The next step uses the storms_summary data to plot the mean wind speed and mean pressure for each name-type case. Remember to colour the points by type.

Answer

ggplot(storms_summary, 
       aes(x = pressure, y = wind, col = type)) + 
  geom_point(alpha = 0.7) + 
  labs(x = "Mean pressure (mbar)", y = "Mean wind speed (mph)")

Exercise

Using dplyr and ggplot2 together (part 3)

Finally, see if you can combine the solutions to part 1 and 2 into a single “piped” operation. That is, instead of storing the intermediate data in storms_summary, use the pipe (%>%) to send the data straight to ggplot.

Answer

storms %>% 
  group_by(name, type) %>%
  summarise(wind = mean(wind), pressure = mean(pressure)) %>%
  ggplot(aes(x = pressure, y = wind, col = type)) + 
    geom_point(alpha = 0.7) + 
    labs(x = "Mean pressure (mbar)", y = "Mean wind speed (mph)")

Histograms

Visualising a single variable

What are histograms?

Histograms summarise the relative frequency of different values of a variable. Look at the first 56 values of the pressure variable in storms:

storms $ pressure[1:56]
##  [1] 1005 1004 1003 1001  997  995  987  988  988  990  990  993  993  994
## [15]  995  995  992  990  988  984  982  984  989  993  995  996  997 1000
## [29]  997  990  992  992  993 1019 1019 1018 1017 1016 1013 1011 1009 1007
## [43] 1004 1001  997  997  997  997  996  995  993  991  990  989 1012 1012

To get a sense of how frequent different values are we can “bin” the data. Here are the frequencies of the pressure values, using 8 bins:

table(cut(storms $ pressure, breaks = 8))
## 
##        (905,919]        (919,934]        (934,948]        (948,962] 
##                5               23               80              164 
##        (962,976]        (976,990]      (990,1e+03] (1e+03,1.02e+03] 
##              307              558              943              667

(You don’t need to remember this R code)
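
For the curious, here is a minimal sketch of what binning does, using a small made-up vector and explicit break points. When a single number such as 8 is given for breaks, cut chooses equal-width intervals itself.

```r
# A small made-up vector, used only for illustration
x <- c(1, 2, 3, 10)

# cut() assigns each value to an interval; intervals are closed on the right
bins <- cut(x, breaks = c(0, 5, 10))

# table() then counts how many values fall into each interval
counts <- table(bins)
counts
## bins
##  (0,5] (5,10] 
##      3      1
```

A histogram is a picture of exactly this kind of table, with one bar per interval.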

What are histograms?

We use histograms to understand the distribution of a variable. They summarise the number of observations occurring in a contiguous series of bins. We can use geom_histogram to construct a histogram. Here is an example:

ggplot(storms, aes(x = pressure)) + 
  geom_histogram(colour = "darkgrey", fill = "grey", binwidth=10) + 
  labs(x = "Pressure", y = "Count")  

Exercise

Plotting histograms

Working with the iris dataset, construct a histogram of the ratio of petal length to petal width. See if you can make your histogram look like the one below. Hint: you can carry out the calculation with Petal.Length and Petal.Width inside aes (you don’t have to use mutate from dplyr).

Answer

ggplot(iris, aes(x = Petal.Length / Petal.Width)) + 
  geom_histogram(binwidth=0.5) + 
  labs(x = "Petal Eccentricity", y = "Count")  

Alternative to histograms

Visualising ‘small’ data

Dot plots

We use dot plots to explore the distribution of variables when we have relatively few observations (e.g. < 100). Here is an example:

setosa <- filter(iris, Species == "setosa")
ggplot(setosa, aes(x = Sepal.Length)) + 
  geom_dotplot(binwidth = 0.1)  

N.B. The y-scale is meaningless in this plot!

Exercise

Tweaking a dot plot

Make the dot plot of the sepal length variable for the Setosa species but now remove the y axis labels and the grid lines. You don’t know how to do this yet! You’ll need a hint:

  1. Look at the examples in the help file for geom_dotplot to work out what to do with scale_y_continuous (read the comments)

  2. Experiment with the options presented by RStudio after you type theme_. You need to find the right theme.

A code outline is given below. The <????> are placeholders that show the bits to complete.

setosa <- filter(iris, Species == "setosa")
ggplot(setosa, aes(x = Sepal.Length)) + 
  geom_dotplot(binwidth = 0.1) +
  scale_y_continuous( <????> ) +
  theme_<????>()

Answer

setosa <- filter(iris, Species == "setosa")
ggplot(setosa, aes(x = Sepal.Length)) + 
  geom_dotplot(binwidth = 0.1) +
  scale_y_continuous(NULL, breaks = NULL) + # <- remove the y-axis
  theme_classic()                           # <- remove the grid lines

Boxplots

Relationships between categorical and continuous data

What are box and whiskers plots?

Box and whisker plots summarise the distributions of a variable at different levels of a categorical variable. Here is an example:

Each box-and-whisker shows the group median (line) and the interquartile range (“boxes”). The vertical lines (“whiskers”) highlight the range of the rest of the data in each group. Potential outliers are plotted individually.
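
The quantities behind a single box can be worked out by hand. The sketch below assumes the default behaviour of geom_boxplot, where whiskers reach to the most extreme points lying within 1.5 times the interquartile range of the box. It uses the petal length to width ratio for one species, and the variable names are illustrative only.

```r
# Petal length to width ratio for one species (built-in iris data)
setosa <- iris[iris$Species == "setosa", ]
ratio <- setosa$Petal.Length / setosa$Petal.Width

q <- quantile(ratio, probs = c(0.25, 0.5, 0.75))  # box edges and median line
iqr <- q[3] - q[1]                                # height of the box

# Points more than 1.5 x IQR away from the box are drawn individually
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
outliers <- ratio[ratio < lower_fence | ratio > upper_fence]

# Whiskers stop at the most extreme points that are not outliers
whiskers <- range(ratio[ratio >= lower_fence & ratio <= upper_fence])

q
whiskers
outliers
```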

Making box and whiskers plots

You can guess which geom_ function we use to make a boxplot…

ggplot(iris, aes(x = Species, y = Petal.Length/Petal.Width)) + 
  geom_boxplot() + 
  labs(x = "Species", y = "Eccentricity") + 
  theme_minimal(base_size = 14)

Exercise

Box and whiskers plots

Working with the storms dataset, construct a box and whiskers plot to summarise wind speed for each type of storm. Customise the fill colour of the boxes, get rid of the grey in the plot background, and increase the size of the text on the graph.

Answer

ggplot(storms, aes(x = type, y = wind)) + 
  geom_boxplot(fill= "lightgrey") + 
  labs(x = "Type of storm", y = "Wind Speed (mph)") + 
  theme_classic(base_size = 14)

Saving plots (version 1)

We can save a plot using the ggsave function with + when we’re building our plot…

# version 1. 
ggplot(setosa, aes(x = Sepal.Length)) + 
  geom_dotplot(binwidth = 0.1) + 
  ggsave("Sepal_dotplot.pdf") # <- use ggsave as part of a ggplot construct

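Avoid this method. It relies on a side effect of ggsave, and recent versions of ggplot2 reject it with an error when ggsave is added with +.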

Saving plots (version 2)

Or we can save a plot using the ggsave function on its own after we make the figure…

# version 2. 
ggplot(setosa, aes(x = Sepal.Length)) + 
  geom_dotplot(binwidth = 0.1)
# use ggsave on its own *after* making the figure
ggsave("Sepal_dotplot.pdf")

This method is probably better.

Exercise

Saving plots

Use ggsave to save the box and whiskers plot that you just made. Can you work out where R has saved your plot to (i.e. which folder on your computer)? Can you change the dimensions of the saved plot so that these are 4 inches x 4 inches?

Answer

# make the plot 
ggplot(storms, aes(x = type, y = wind)) + 
  geom_boxplot() + 
  labs(x = "Type of storm", y = "Wind Speed")
# save it
ggsave("Windspeed_boxplot.pdf", height = 4, width = 4)
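
To be explicit about where the file ends up, a sketch along these lines may help. A relative file name is resolved against the working directory, so getwd() answers the "where" question. The example plot, the object names and the use of tempdir() are assumptions made for illustration; tempdir() is used only so the sketch does not write into your project folder.

```r
library(ggplot2)

# Store the plot in an object so it can be passed to ggsave() explicitly
p <- ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot()

# Relative file names are resolved against this directory
getwd()

# Alternatively, build a full path so there is no doubt about the location
out_file <- file.path(tempdir(), "example_boxplot.pdf")
ggsave(out_file, plot = p, width = 4, height = 4, units = "in")

file.exists(out_file)
```

Passing the plot object via the plot argument avoids relying on whichever plot happened to be displayed last.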

Barplots

Summary statistics for groups

What are bar plots?

We typically use a barplot to summarise differences among groups. We can use geom_bar to make barplots but it will simply count the number of observations by default:

ggplot(storms, aes(x = factor(year))) + # <- what is the 'factor' function doing?
  geom_bar() + 
  labs(x = "Year", y = "Number of Storms")  

This is not really very useful. Usually we use a bar plot to compare summary statistics (e.g. the mean).
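
As for the factor question in the code comment above, here is a small sketch with made-up year values. Converting a numeric variable to a factor turns it into a categorical variable, so ggplot2 uses a discrete axis with one labelled position per year.

```r
# A few made-up year values, for illustration only
year <- c(1995, 1995, 1996, 1998)

is.numeric(year)        # TRUE: R treats year as a continuous variable
year_f <- factor(year)  # convert it to a categorical variable

levels(year_f)          # one level per distinct year
table(year_f)           # the counts geom_bar would draw as bar heights
```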

Using bar plots to compare means [step 1]

First we need to calculate the summary statistic. We can do this with the group_by and summarise functions from the dplyr package. For example:

# step 1
pl_stats <- 
  iris %>%
  group_by(Species) %>% 
  summarise(mean_pl = mean(Petal.Length))

Using bar plots to compare means [step 2]

Second, make the plot. If we want to use a bar plot to compare a summary statistic (e.g. the mean) across groups we should use the geom_col function:

# step 2 
ggplot(pl_stats, aes(x = Species, y = mean_pl)) + 
  geom_col() + 
  labs(y = "Mean Petal Length (cm)")

Exercise

Making a barplot of means

Working with the storms dataset, construct a bar plot that summarises the mean wind speed (wind) associated with storms in each year (year). If that was too easy, see if you can change the colour of the bars to grey.

Answer

# step 1 - use dplyr to calculate the means
wind.means <- 
  storms %>% group_by(year) %>% 
  summarise(mean= mean(wind))
# step 2 - make the plot
ggplot(wind.means, aes(x = factor(year), y = mean)) + 
  geom_col(fill="darkgrey") + 
  labs(x = "Year", y = "Wind speed (mph)") 

Adding multiple layers (1)

We can build more complex figures by adding more than one layer with the geom_ functions. For example, we should always add an error bar of some kind to summaries of means.


The standard error is one option here:

\[ \text{Standard Error} = \frac{\text{Standard Deviation}}{\sqrt{\text{Sample Size}}} \]
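
As a worked example of the formula, here is the calculation in base R for the petal lengths of a single species. The variable names are chosen just for this sketch.

```r
# Petal lengths for one species from the built-in iris data
x <- iris$Petal.Length[iris$Species == "setosa"]

n  <- length(x)     # sample size
s  <- sd(x)         # sample standard deviation
se <- s / sqrt(n)   # standard error of the mean

c(mean = mean(x), sd = s, n = n, se = se)
```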

We need to repeat the dplyr step, but now include a calculation of the standard errors along with the means:

# step 1
pl_stats <- 
  iris %>%
  group_by(Species) %>% 
  summarise(mean_pl = mean(Petal.Length),
            se = sd(Petal.Length) / sqrt(n())) # <- New calculation

Using multiple geoms (2)

Once we have the two bits of information, we include these by adding two layers via two different geom_ functions: geom_col and geom_errorbar. We also need to define a couple of new aesthetics…

# step 2 
ggplot(pl_stats, 
       aes(x = Species, y = mean_pl, 
           ymin = mean_pl - se, ymax = mean_pl + se)) + 
  geom_col(fill = "grey", width = 0.7) + 
  geom_errorbar(width = 0.25) + 
  labs(y = "Mean Petal Length (cm)")

Exercise

Adding error bars to your plot

Go back to the bar plot you just made using the storms data set and add error bars showing the standard errors of wind speed.

Answer

# step 1 - use dplyr to calculate the means and standard errors
wind.means <- 
  storms %>% group_by(year) %>% 
  summarise(mean = mean(wind), 
            se = sd(wind)/sqrt(n()))

# step 2 - make the plot
ggplot(wind.means, aes(x = factor(year), y = mean, 
                       ymin = mean - se, ymax = mean + se)) + 
  geom_col(fill="darkgrey") + 
  geom_errorbar(width = 0.25) + 
  labs(x = "Year", y = "Wind speed (mph)")

Exercise

Putting it all together

Have a look at the built-in R data set called ChickWeight. Make sure you understand what variables it contains, then try to make the two plots below. Think about a) what these two graphs tell you about the effectiveness of the four diets and b) what other information it might be useful to include.

Answer

## Calculate the mean and standard errors for each diet at each time point
pltdata<- group_by(ChickWeight, Time, Diet) %>% 
  summarise(mn = mean(weight), se = sd(weight)/sqrt(n()))
## Plot the means over time - remembering to colour by the diet
plta <- ggplot(pltdata, aes(x=Time, y = mn, colour = Diet)) + 
  geom_point() + 
  geom_line() + ## Unsurprisingly the function for adding lines to our plot is geom_line
  theme_classic() + 
  labs(y = "Mean weight (g)", x = "Time (days)")
plta

Answer

## Filter the summary data to only include the final weights
pltdata2 <- ungroup(pltdata) %>% 
  filter(Time==max(Time))
## Make a bar plot of the means and standard errors
pltb <- ggplot(pltdata2, aes(x=Diet, y = mn, ymin = mn-se, ymax = mn+se)) + 
  geom_col(fill = 'cornflowerblue', colour = "black") + 
  geom_errorbar(width = 0.3) + 
  labs(y = "Final weight (g)") + 
  theme_classic()
pltb

A final trick…

We can then use the plot_grid function to make a panel plot containing both graphs. This function is from the cowplot package so you’ll need to have that loaded (remember if you haven’t used it before then you’ll need to install it first using install.packages).

library(cowplot)
plot_grid(plta, pltb, labels = c("a)", "b)"))