Navigate to this site:
http://rpubs.com/dzchilds/ct_tidy_intro
It contains a copy of this presentation.
18 September, 2019
1. Set up RStudio for today
If you use RStudio projects, choose where you’ll work, set up a project for today’s session and start a new R script. Otherwise, choose where you’ll work, set the working directory to that location, and start a new script.
Don’t put setwd() into your scripts! Anyone know why?
2. Load the dplyr, readr and ggplot2 packages
library(dplyr)
library(readr)
library(ggplot2)
Hopefully these are already installed. If not, install them and then add the library lines above to your script.
3. Get the Central American storms data
Download the STORMS.csv data set from this link and place it into your working directory (or a subdirectory called data).
read.csv (base R) and readr::read_csv
Use base R read.csv to read in STORMS.CSV, call the resulting object storms, and then print the storms object to the console. Examine it using str and any other useful functions you know about.
Then repeat the exercise using the read_csv function from the readr package.
What kind of objects do read.csv and read_csv make? How are they similar? How are they different? Pay attention to the type of each column.
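A minimal sketch of this comparison, using a tiny made-up CSV written to a temporary file rather than the real storms data. The file contents and the names tmp, base_df and tidy_df are assumptions made for illustration only.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file (not the real storms data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,wind", "Allison,30", "Erin,70"), tmp)

base_df <- read.csv(tmp)    # base R
tidy_df <- read_csv(tmp)    # readr

class(base_df)              # a plain data.frame
class(tidy_df)              # a tibble, which is also a data.frame

# Compare the type of each column
sapply(base_df, class)
sapply(tidy_df, class)
```

Things to look for: base R tends to store whole numbers as integer where readr uses double, and versions of R before 4.0 convert text columns to factors by default, so the result you see may depend on your R version.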
tbl objects
A tbl object (pronounced “tibble”) is essentially a special kind of data frame. Tidyverse functions tend to produce these. They work the same as a data frame, but with a few small differences… e.g. compact printing:
storms <- read_csv("STORMS.CSV")
storms
## # A tibble: 2,747 x 11
##    name    year month   day  hour   lat  long pressure  wind type   seasday
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <chr>    <dbl>
##  1 Allis…  1995     6     3     0  17.4 -84.3     1005    30 Tropi…       3
##  2 Allis…  1995     6     3     6  18.3 -84.9     1004    30 Tropi…       3
##  3 Allis…  1995     6     3    12  19.3 -85.7     1003    35 Tropi…       3
##  4 Allis…  1995     6     3    18  20.6 -85.8     1001    40 Tropi…       3
##  5 Allis…  1995     6     4     0  22   -86        997    50 Tropi…       4
##  6 Allis…  1995     6     4     6  23.3 -86.3      995    60 Tropi…       4
##  7 Allis…  1995     6     4    12  24.7 -86.2      987    65 Hurri…       4
##  8 Allis…  1995     6     4    18  26.2 -86.2      988    65 Hurri…       4
##  9 Allis…  1995     6     5     0  27.6 -86.1      988    65 Hurri…       5
## 10 Allis…  1995     6     5     6  28.5 -85.6      990    60 Tropi…       5
## # … with 2,737 more rows
Converting storms and iris to tibbles
Next in your script, add these lines to convert storms and iris to tibbles…
storms <- as_tibble(storms)
iris <- as_tibble(iris)
You don’t have to do this (dplyr is fine with normal data frames) but it will ensure your output matches the presentation.
In addition to printing a tbl or data.frame, we can use the glimpse function to obtain different summary information about variables:
glimpse(storms)
## Observations: 2,747
## Variables: 11
## $ name     <chr> "Allison", "Allison", "Allison", "Allison", "Allison", …
## $ year     <dbl> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1…
## $ month    <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
## $ day      <dbl> 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7…
## $ hour     <dbl> 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18,…
## $ lat      <dbl> 17.4, 18.3, 19.3, 20.6, 22.0, 23.3, 24.7, 26.2, 27.6, 2…
## $ long     <dbl> -84.3, -84.9, -85.7, -85.8, -86.0, -86.3, -86.2, -86.2,…
## $ pressure <dbl> 1005, 1004, 1003, 1001, 997, 995, 987, 988, 988, 990, 9…
## $ wind     <dbl> 30, 30, 35, 40, 50, 60, 65, 65, 65, 60, 60, 45, 30, 35,…
## $ type     <chr> "Tropical Depression", "Tropical Depression", "Tropical…
## $ seasday  <dbl> 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7…
This is similar to str: the glimpse function tells us what variables are in storms as well as the type of each variable. (The glimpse function actually belongs to the tibble package, but we don’t need to worry about that…)
tbls
dplyr implements a grammar of data manipulation to enable you to manipulate data and summarise the information in a data set (e.g. group means).
Advantages of using dplyr
ggplot2 plotting system
dplyr has five main “verbs” (i.e. functions):
select: Extract a subset of variables
filter: Extract a subset of rows
arrange: Reorder rows
mutate: Construct new variables
summarise: Calculate information about groups
We’ll also explore a few more useful functions such as slice, rename, transmute, and group_by. There are many others…
It is helpful to classify the verbs according to what they work on:
observations (rows): filter & slice & arrange
variables (columns): select & rename & mutate
groups: summarise (or summarize)
(This classification only works if your data are tidy, i.e. there is one observation per row and one column per variable. Make sure you read about this idea in the course book.)
select
We use select to extract a subset of variables for further analysis. Using select looks like this:
select(data, Variable1, Variable2, ...)
Arguments
data: a data.frame or tbl object
VariableX: names of variables in data
Use the select function with the storms data set to make a new data set containing only name and year. Assign this new data set a name, and then check that it contains the right variables using the glimpse function.
storms_simple <- select(storms, name, year)
glimpse(storms_simple)
## Observations: 2,747
## Variables: 2
## $ name <chr> "Allison", "Allison", "Allison", "Allison", "Allison", "All…
## $ year <dbl> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995,…
The select function makes selecting/removing groups of variables easy:
: to select a sequence of variables
- to drop a sequence of variables
The sequence can be specified using numbers (for position) or names.
Usage:
# a range of variables to keep
select(data, Variable1:Variable5)
# a range of variables to drop
select(data, -(Variable1:Variable5))
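As a small sketch of specifying a sequence by position rather than by name, the following uses the built-in iris data. The object names by_name, by_position and by_dropping are made up for this illustration.

```r
library(dplyr)

# iris has five columns; Petal.Length, Petal.Width and Species are columns 3 to 5
by_name     <- select(iris, Petal.Length:Species)
by_position <- select(iris, 3:5)

# Dropping the first two columns by position gives the same result
by_dropping <- select(iris, -(1:2))

identical(names(by_name), names(by_position))
identical(names(by_name), names(by_dropping))
```

Selecting by name is usually the safer habit, because positions change silently if columns are added or reordered.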
iris_fewer <- select(iris, Petal.Length:Species)
iris_fewer
## # A tibble: 150 x 3
##    Petal.Length Petal.Width Species
##           <dbl>       <dbl> <fct>
##  1          1.4         0.2 setosa
##  2          1.4         0.2 setosa
##  3          1.3         0.2 setosa
##  4          1.5         0.2 setosa
##  5          1.4         0.2 setosa
##  6          1.7         0.4 setosa
##  7          1.4         0.3 setosa
##  8          1.5         0.2 setosa
##  9          1.4         0.2 setosa
## 10          1.5         0.1 setosa
## # … with 140 more rows
Use the select function with the storms data set to select just the name, year and month variables.
storms_fewer <- select(storms, name:month)
glimpse(storms_fewer)
## Observations: 2,747
## Variables: 3
## $ name  <chr> "Allison", "Allison", "Allison", "Allison", "Allison", "Al…
## $ year  <dbl> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995…
## $ month <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
# alternatively
x <- select(storms, -(day:seasday))
There are several helper functions that work with select to simplify common variable selection tasks:
starts_with("xyz"): every name that starts with "xyz"
ends_with("xyz"): every name that ends with "xyz"
contains("xyz"): every name that contains "xyz"
matches("xyz"): every name that matches the regular expression "xyz"
one_of(names): every name that appears in names (character vector)
Usage:
select(data, help_func("xyz"))
Example:
iris_petal <- select(iris, starts_with("Petal"))
glimpse(iris_petal)
## Observations: 150
## Variables: 2
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0…
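The other helpers work the same way. This is a brief sketch on the built-in iris data; the object names and the regular expression are chosen for illustration only.

```r
library(dplyr)

# Columns whose names end with "Width"
width_cols <- names(select(iris, ends_with("Width")))
width_cols

# Columns whose names contain "Length"
length_cols <- names(select(iris, contains("Length")))
length_cols

# matches() takes a regular expression:
# here, names that start with "Sepal" or "Petal" followed by a literal dot
measure_cols <- names(select(iris, matches("^(Sepal|Petal)\\.")))
measure_cols
```

Note that these helpers ignore case by default, so starts_with("petal") would also pick up the Petal columns.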
Use the select function with the storms data set to create a new data set containing just the lat and long variables. Do this using the starts_with helper function inside select.
storms_fewer <- select(storms, starts_with("l"))
glimpse(storms_fewer)
## Observations: 2,747
## Variables: 2
## $ lat  <dbl> 17.4, 18.3, 19.3, 20.6, 22.0, 23.3, 24.7, 26.2, 27.6, 28.5,…
## $ long <dbl> -84.3, -84.9, -85.7, -85.8, -86.0, -86.3, -86.2, -86.2, -86…
select and rename
We can use select to rename variables as we select them using the newName = varName construct.
Usage:
select(data, newName1 = Var1, newName2 = Var2, ...)
Example:
iris_select <- select(iris, PetalLength = Petal.Length)
glimpse(iris_select)
## Observations: 150
## Variables: 1
## $ PetalLength <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.…
Use rename to rename variables while keeping all variables using the newName = varName construct.
Usage:
rename(data, newName1 = Var1, newName2 = Var2, ...)
Example:
iris_renamed <- rename(iris,
PetalLength = Petal.Length,
PetalWidth = Petal.Width)
glimpse(iris_renamed)
## Observations: 150
## Variables: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3…
## $ PetalLength  <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1…
## $ PetalWidth   <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, set…
Extract just the lat and long columns from the storms data set and rename these as latitude and longitude respectively.
storms_renamed <- select(storms, latitude = lat, longitude = long)
glimpse(storms_renamed)
## Observations: 2,747
## Variables: 2
## $ latitude  <dbl> 17.4, 18.3, 19.3, 20.6, 22.0, 23.3, 24.7, 26.2, 27.6, …
## $ longitude <dbl> -84.3, -84.9, -85.7, -85.8, -86.0, -86.3, -86.2, -86.2…
mutate
We use mutate to add or change variables for further analysis. This is how we use mutate:
mutate(data, NewVar = <expression>, ...)
Arguments
data: a data.frame or tbl object
NewVar: name of a new variable to create
<expression>: an R expression that references variables in data
Comments
The <expression> which appears on the right hand side of the = can be any valid R expression that uses variables in data.
We can use more than one NewVar = <expression> at a time if you need to construct several new variables.
Use the mutate function with the iris data set to make a new variable which is the petal area, \((Area = Length \times Width)\).
iris_area <- mutate(iris, Petal.Area = Petal.Length * Petal.Width)
glimpse(iris_area)
## Observations: 150
## Variables: 6
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, set…
## $ Petal.Area   <dbl> 0.28, 0.28, 0.26, 0.30, 0.28, 0.68, 0.42, 0.30, 0.2…
We can add more than one variable at a time using mutate. Each new variable can also use one or more variables created in a previous step.
Usage:
mutate(data, NewVar1 = <expression1>,
NewVar2 = <expression2 using NewVar1>)
Example:
iris_new_vars <-
mutate(iris, Sepal.Eccentricity = Sepal.Length / Sepal.Width,
Petal.Eccentricity = Petal.Length / Petal.Width,
Eccentricity.Diff = Sepal.Eccentricity - Petal.Eccentricity)
glimpse(iris_new_vars)
## Observations: 150
## Variables: 8
## $ Sepal.Length       <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, …
## $ Sepal.Width        <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, …
## $ Petal.Length       <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, …
## $ Petal.Width        <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, …
## $ Species            <fct> setosa, setosa, setosa, setosa, setosa, setos…
## $ Sepal.Eccentricity <dbl> 1.457143, 1.633333, 1.468750, 1.483871, 1.388…
## $ Petal.Eccentricity <dbl> 7.000000, 7.000000, 6.500000, 7.500000, 7.000…
## $ Eccentricity.Diff  <dbl> -5.542857, -5.366667, -5.031250, -6.016129, -…
Use the mutate function with the iris data set to make two new area variables, one for petal and one for sepal. Create a third variable which is the ratio of the petal and sepal areas.
Do all of this in one call to mutate, i.e. use mutate only once to do all of this.
iris_ratio <- mutate(iris, Sepal.Area = Sepal.Length * Sepal.Width,
Petal.Area = Petal.Length * Petal.Width,
PS.Area.Ratio = Petal.Area / Sepal.Area)
glimpse(iris_ratio)
## Observations: 150
## Variables: 8
## $ Sepal.Length  <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, …
## $ Sepal.Width   <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, …
## $ Petal.Length  <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, …
## $ Petal.Width   <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, …
## $ Species       <fct> setosa, setosa, setosa, setosa, setosa, setosa, se…
## $ Sepal.Area    <dbl> 17.85, 14.70, 15.04, 14.26, 18.00, 21.06, 15.64, 1…
## $ Petal.Area    <dbl> 0.28, 0.28, 0.26, 0.30, 0.28, 0.68, 0.42, 0.30, 0.…
## $ PS.Area.Ratio <dbl> 0.015686275, 0.019047619, 0.017287234, 0.021037868…
filter
We use filter to select a subset of rows for further analysis, based on the result(s) of one or more logical comparisons. Using filter looks like this:
filter(data, <expression>)
Arguments
data: a data.frame or tbl object
<expression>: an R expression that implements a logical comparison using variables in data
Comments
<expression> can be any valid R expression that uses variables in data and returns a logical vector of TRUE / FALSE values.
<expression> typically uses a combination of relational (e.g. < and ==) and logical (e.g. & and |) operators.
Use the filter function with the storms data set to create a new data set containing just the observations associated with storms classified as Hurricanes.
Hint: use glimpse to remind yourself of the variable names in storms. You need to work out which one contains information about the storm category.
filter(storms, type == "Hurricane")
## # A tibble: 896 x 11
##    name    year month   day  hour   lat  long pressure  wind type   seasday
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <chr>    <dbl>
##  1 Allis…  1995     6     4    12  24.7 -86.2      987    65 Hurri…       4
##  2 Allis…  1995     6     4    18  26.2 -86.2      988    65 Hurri…       4
##  3 Allis…  1995     6     5     0  27.6 -86.1      988    65 Hurri…       5
##  4 Erin    1995     8     1     0  23.6 -74.9      992    70 Hurri…      62
##  5 Erin    1995     8     1     6  24.3 -75.7      988    75 Hurri…      62
##  6 Erin    1995     8     1    12  25.5 -76.3      985    75 Hurri…      62
##  7 Erin    1995     8     1    18  26.3 -77.7      980    75 Hurri…      62
##  8 Erin    1995     8     2     0  26.9 -79        982    75 Hurri…      63
##  9 Erin    1995     8     2     6  27.7 -80.4      985    75 Hurri…      63
## 10 Erin    1995     8     3     0  28.8 -84.7      985    65 Hurri…      64
## # … with 886 more rows
Repeat the last exercise, but now extract the observations associated with Hurricanes that took place in 1997 or later.
filter(storms,
type == "Hurricane", year >= 1997)
## # A tibble: 501 x 11
##    name   year month   day  hour   lat  long pressure  wind type    seasday
##    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <chr>     <dbl>
##  1 Bill   1997     7    12    12  37.9 -61.1      987    65 Hurric…      42
##  2 Bill   1997     7    12    18  39.6 -58.4      987    65 Hurric…      42
##  3 Danny  1997     7    18     6  29.2 -89.9      992    65 Hurric…      48
##  4 Danny  1997     7    18    12  29.5 -89.4      990    70 Hurric…      48
##  5 Danny  1997     7    18    18  29.7 -89        988    70 Hurric…      48
##  6 Danny  1997     7    19     0  29.8 -88.4      984    70 Hurric…      49
##  7 Danny  1997     7    19     6  30.1 -88.1      987    65 Hurric…      49
##  8 Danny  1997     7    19    12  30.3 -88        984    70 Hurric…      49
##  9 Danny  1997     7    19    18  30.4 -87.9      986    65 Hurric…      49
## 10 Erika  1997     9     4    18  15.2 -53.7      999    65 Hurric…      96
## # … with 491 more rows
# or use:
filter(storms, type == "Hurricane" & year >= 1997)
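A short sketch of how conditions combine inside filter, using the built-in iris data so it runs without the storms file. The thresholds and object names are arbitrary choices for illustration.

```r
library(dplyr)

# Conditions separated by commas are combined with "and" ...
with_comma <- filter(iris, Species == "setosa", Sepal.Length > 5.5)
# ... which is the same as joining them with &
with_and   <- filter(iris, Species == "setosa" & Sepal.Length > 5.5)
identical(with_comma, with_and)

# "or" has to be written explicitly with |
either <- filter(iris, Species == "setosa" | Species == "virginica")

# %in% is a compact alternative when testing one variable against several values
either_in <- filter(iris, Species %in% c("setosa", "virginica"))
nrow(either)
```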
arrange
We use arrange to reorder the rows of our data set. This can help us see associations among our variables. Using arrange looks like this:
arrange(data, Variable1, Variable2, ...)
Arguments
data: a data.frame or tbl object
VariableX: names of variables in data
Comments
The order of sorting corresponds to the order of variables in arrange, meaning data is sorted according to Variable1, then Variable2, then Variable3, etc.
To sort a variable in descending order, use desc(VariableX).
Use the arrange function to reorder the observations in the storms data set, according to the pressure variable. Store the resulting data set and then use the View function to examine it. What can you say about the association between atmospheric pressure and storm category?
storm.sort <- arrange(storms, pressure)
View(storm.sort)
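To illustrate sorting on more than one variable and using desc, here is a sketch with the built-in iris data. The name iris_sorted is made up for this example.

```r
library(dplyr)

# Sort by Species first, then by Petal.Length within each species, largest first
iris_sorted <- arrange(iris, Species, desc(Petal.Length))

# The first rows are the setosa flowers with the longest petals
head(iris_sorted, 3)
```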
summarise
We use summarise to calculate summary statistics for further analysis. This is how to use summarise:
summarise(data, SummaryVar = <expression>, ...)
Arguments
data: a data.frame or tbl object
SummaryVar: name of your summary variable
<expression>: an R expression that references variables in data and returns a single value
Comments
The <expression> which appears on the right hand side of the = can be any valid R expression that uses variables in data. However, <expression> should return a single value.
Although summarise looks a little like mutate, it is designed to construct a completely new dataset containing summaries of one or more variables.
We can use more than one SummaryVar = <expression> at a time if you need to construct several summaries.
Use the summarise function with the iris dataset to calculate the mean sepal length and the mean sepal width.
Hint: You need to work out which R function calculates a mean. The clue is in the name.
summarise(iris,
mean_sl = mean(Sepal.Length),
mean_sw = mean(Sepal.Width))
## # A tibble: 1 x 2
##   mean_sl mean_sw
##     <dbl>   <dbl>
## 1    5.84    3.06
Use the summarise function with the iris dataset to calculate the mean area of sepals.
summarise(iris, mean_sl = mean(Sepal.Length * Sepal.Width))
## # A tibble: 1 x 1
##   mean_sl
##     <dbl>
## 1    17.8
summarise(iris, mean_sl = mean(Sepal.Length) * mean(Sepal.Width))
## # A tibble: 1 x 1
##   mean_sl
##     <dbl>
## 1    17.9
Which one is right?
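A tiny worked example may help here. The two vectors below are made-up measurements, not taken from iris; they only show that the mean of the products and the product of the means are different quantities.

```r
# Hypothetical lengths and widths for three rectangles
len <- c(1, 2, 3)
wid <- c(1, 2, 3)

# Mean of the individual areas: (1 + 4 + 9) / 3
mean_of_areas <- mean(len * wid)

# Product of the two means: 2 * 2
product_of_means <- mean(len) * mean(wid)

mean_of_areas
product_of_means
```

The two agree only when length and width do not vary together. If "mean area" is taken to mean the average of each flower's own area, the first calculation (area computed per row, then averaged) is the one that matches.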
group_by
We use group_by to add grouping information to a data frame or tibble. This is how we use group_by:
group_by(data, GroupVar1, GroupVar2, ...)
Arguments
data: a data.frame or tbl object
GroupVar: name of grouping variable(s)
Comments
group_by does not do anything other than add grouping information to a tbl. It is only useful when used with summarise or mutate.
Using group_by with summarise enables us to calculate numerical summaries on a per group basis.
Using group_by with mutate enables us to add variables on a per group basis.
Using group_by to calculate group-specific means
Use the group_by function and the summarise functions with the storms dataset to calculate the mean wind speed associated with each storm type.
Hint: This is a two step exercise: 1) Use group_by to add some information to storms, remembering to assign the result a name; 2) Then use summarise on this new dataset.
# 1. make a grouped tibble
storms_grouped <- group_by(storms, type)
# 2. use summarise on the grouped data
summarise(storms_grouped, mean_wind = mean(wind))
## # A tibble: 4 x 2
##   type                mean_wind
##   <chr>                   <dbl>
## 1 Extratropical            40.1
## 2 Hurricane                84.7
## 3 Tropical Depression      27.4
## 4 Tropical Storm           47.3
Using group_by to group by more than one variable
Use the group_by and summarise functions with the storms dataset to calculate the mean and maximum wind speed associated with each combination of month and year. Assign the result a name and then use View to examine it. Which month in which year saw the largest maximum wind speed?
Hint: You can guess the names of the two functions that calculate the mean and max from a numeric vector.
# 1. make a grouped tibble
storms_grouped <- group_by(storms, year, month)
# 2. use summarise on the grouped data
summarise(storms_grouped, mean_speed = mean(wind), max_speed = max(wind))
## # A tibble: 30 x 4
## # Groups:   year [6]
##     year month mean_speed max_speed
##    <dbl> <dbl>      <dbl>     <dbl>
##  1  1995     6       44.4        65
##  2  1995     7       41.4        60
##  3  1995     8       54.2       120
##  4  1995     9       64.4       120
##  5  1995    10       50.5       130
##  6  1995    11       55          70
##  7  1996     6       35.2        45
##  8  1996     7       56.8       100
##  9  1996     8       57.2       125
## 10  1996     9       60.9       120
## # … with 20 more rows
group_by and mutate
The group_by function works with any dplyr verb that operates on variables (columns). Use the group_by and mutate functions with the iris dataset to calculate a “mean centred” version of sepal length. A centred variable is just one that has had its mean subtracted from every value. Because we group by species here, each value has the mean of its own species subtracted rather than the overall mean.
Do you understand the different behaviour of summarise and mutate when used alongside group_by?
# 1. group iris by species identity
iris_grouped <- group_by(iris, Species)
# 2. use mutate on the grouped data
mutate(iris_grouped, sl_centred = Sepal.Length - mean(Sepal.Length))
## # A tibble: 150 x 6
## # Groups:   Species [3]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species sl_centred
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>        <dbl>
##  1          5.1         3.5          1.4         0.2 setosa      0.0940
##  2          4.9         3            1.4         0.2 setosa     -0.106
##  3          4.7         3.2          1.3         0.2 setosa     -0.306
##  4          4.6         3.1          1.5         0.2 setosa     -0.406
##  5          5           3.6          1.4         0.2 setosa     -0.006
##  6          5.4         3.9          1.7         0.4 setosa      0.394
##  7          4.6         3.4          1.4         0.3 setosa     -0.406
##  8          5           3.4          1.5         0.2 setosa     -0.006
##  9          4.4         2.9          1.4         0.2 setosa     -0.606
## 10          4.9         3.1          1.5         0.1 setosa     -0.106
## # … with 140 more rows
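One way to see the difference between summarise and mutate on grouped data is to compare the number of rows each returns. This is a sketch on the built-in iris data; the object names group_means, centred and check are made up for illustration.

```r
library(dplyr)

iris_grouped <- group_by(iris, Species)

# summarise collapses each group to a single row
group_means <- summarise(iris_grouped, mean_sl = mean(Sepal.Length))
nrow(group_means)

# mutate keeps every row; mean() is evaluated separately within each group
centred <- mutate(iris_grouped, sl_centred = Sepal.Length - mean(Sepal.Length))
nrow(centred)

# Sanity check: after centring within groups, each group mean is numerically zero
check <- summarise(centred, centred_mean = mean(sl_centred))
check
```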
%>%
We often need to perform a sequence of calculations. We do this by applying a series of functions in sequence. Here are two ways to do this:
Method 1: Store intermediate results…
x <- 10
x <- sqrt(x)
x <- exp(x)
round(x, 2)
## [1] 23.62
Method 2: Use function nesting…
round(exp(sqrt(10)), 2)
## [1] 23.62
These do the same thing. Method 1 is easy to read, but is very verbose. Method 2 is concise, but not at all easy to read.
%>%…
The dplyr package includes a special operator called “the pipe”. The pipe operator looks like this: %>%.
This allows us to avoid storing intermediate results (method 1), while reading a sequence of function calls from left to right. For example:
10 %>% sqrt(.) %>% exp(.) %>% round(., 2)
## [1] 23.62
Or equivalently, and even simpler…
10 %>% sqrt() %>% exp() %>% round(2)
## [1] 23.62
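To make the rule behind these examples explicit: x %>% f(y) is evaluated as f(x, y), and the dot stands for the piped value when it needs to go somewhere other than the first argument. This is a small sketch; the numbers and object names are arbitrary.

```r
library(dplyr)

# The piped value becomes the first argument: round(3.14159, 2)
piped  <- 3.14159 %>% round(2)
direct <- round(3.14159, 2)
identical(piped, direct)

# Use the . placeholder to send the piped value to a different argument:
# here 2 is used as 'digits', giving round(3.14159, digits = 2)
as_digits <- 2 %>% round(3.14159, digits = .)
identical(as_digits, direct)
```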
We can use %>% with any function we like. Look at these examples of a group_by and summarise that use the “traditional” methods of dealing with a sequence of calculations:
# method 1 -- store intermediate results
iris_grouped <- group_by(iris, Species)
summarise(iris_grouped, meanSL = mean(Sepal.Length))
# method 2 -- one step with nested functions
summarise(group_by(iris, Species), meanSL = mean(Sepal.Length))
The “piped” equivalent is more natural to read…
iris %>% group_by(Species) %>% summarise(meanSL = mean(Sepal.Length))
Using group_by to calculate group specific summaries
We just used group_by and summarise with the storms data set to calculate the mean wind speed associated with each type of storm. Repeat this exercise but now use the pipe. Print the results directly to the Console.
storms %>% group_by(type) %>% summarise(mean_speed = mean(wind))
## # A tibble: 4 x 2
##   type                mean_speed
##   <chr>                    <dbl>
## 1 Extratropical             40.1
## 2 Hurricane                 84.7
## 3 Tropical Depression       27.4
## 4 Tropical Storm            47.3
Using group_by to calculate group specific summaries
Last week we used group_by and summarise with the storms data set to calculate the mean wind speed associated with each type of storm. Repeat this exercise but now use the pipe. Store the results and then use glimpse to summarise them.
storms_summary <- storms %>% group_by(type) %>% summarise(mean_speed = mean(wind))
glimpse(storms_summary)
## Observations: 4
## Variables: 2
## $ type       <chr> "Extratropical", "Hurricane", "Tropical Depression", …
## $ mean_speed <dbl> 40.06068, 84.65960, 27.35867, 47.32181
Roughly speaking, there are three commonly used plotting frameworks in R.
Advantages of using ggplot2
Disadvantages of using ggplot2
You need to wrap your head around a few ideas to start using ggplot2 effectively:
aes function.
geom_.
We build up a plot by combining different functions using the + operator. This has nothing to do with numeric addition!
Set up the basic object–define a default data frame (my_df) and aesthetic mappings (aes(...)):
ggplot_object <- ggplot(my_df, aes(x = var1, y = var2))
Add a layer using the point ‘geom’…
ggplot_object <- ggplot_object + geom_point()
Show the plot–just ‘print’ the object to the console
ggplot_object
Scatter plots are used to show the relationship between 2 continuous variables. Using the iris dataset, let’s examine the relationship between petal length and petal width.
STEP 1:
We use the aes function inside the ggplot function to specify which variables we plan to display. We also have to specify where the data are:
plt <- ggplot(iris, aes(x = Petal.Width, y = Petal.Length))
All we did here was make a ggplot object.
We can try to print the plot to the screen:
plt
This produces an empty plot because we haven’t added a layer using a geom_ function yet.
STEP 2:
We want to make a scatter plot so we need to use the geom_point function:
plt <- plt + geom_point()
Notice that all we do is “add” the required layer. Now we have something to plot:
plt
STEP 3:
Maybe we should improve the axis labels? To do this, we need to “add” label information using the labs function:
plt <- plt + labs(x = "Petal Width (cm)", y = "Petal Length (cm)")
plt
This just adds some new information about labelling to the pre-existing ggplot object. Now it prints with improved axis labels.
Doing it all in one go…
We don’t have to build a plot object up in separate steps and then explicitly “print” it to the Console. If we just want to make the plot in one go we can do it like this:
ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) + geom_point() + labs(x = "Petal Width (cm)", y = "Petal Length (cm)")
Repeat the example we just stepped through, but now try to customise the point colours and their size. If that’s too easy, see if you can make the points semi-transparent. An example of suitable output is given below.
Hint: The geom_point function is responsible for altering these features. It has arguments that control properties like point colour and size.
ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) + geom_point(colour = "blue", size = 3, alpha = 0.5) + labs(x = "Petal Width (cm)", y = "Petal Length (cm)")
Q: The last graph was quite nice, but what information was missing?
ggplot(iris, aes(x = Petal.Width, y = Petal.Length, colour = Species)) + geom_point(size = 3, alpha = 0.5) + labs(x = "Petal Width", y = "Petal Length")
geom_
Notice that we can set “colour” in two places: the aesthetic mapping (aes) or via an argument to a geom (geom_). What happens if we set the colour in both places at once?
Experiment with the iris petal length vs. petal width scatter plot example to work this out. Which one—the aesthetic mapping or geom argument—has precedence?
ggplot(iris, aes(x = Petal.Width, y = Petal.Length, colour = Species)) + geom_point(colour = "blue", size = 3, alpha = 0.5) + labs(x = "Petal Width", y = "Petal Length")
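If you would rather check the answer in code than by eye, one option is to inspect the data that the layer will actually draw. This is a sketch using layer_data from ggplot2; the object names p and drawn are made up for illustration.

```r
library(ggplot2)

# Colour is set in both places: mapped to Species in aes, fixed to "blue" in the geom
p <- ggplot(iris, aes(x = Petal.Width, y = Petal.Length, colour = Species)) +
  geom_point(colour = "blue")

# layer_data() returns the computed data for a layer, one row per point
drawn <- layer_data(p, 1)

# A single value here means the fixed geom argument has overridden the mapping
unique(drawn$colour)
```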
We want to make the following scatter plot. It shows mean wind speed against mean pressure, where the means are calculated for each combination of storm name and type. The storm type of each point is delineated by its colour.
The first step is to work out how to use dplyr to calculate the mean wind speed and mean pressure for each combination of storm name and type. Do this with the pipe (%>%) operator, and give the resulting data the name storms_summary.
storms_summary <- storms %>% group_by(name, type) %>% summarise(wind = mean(wind), pressure = mean(pressure))
The next step uses the storms_summary data to plot the mean wind speed and mean pressure for each name-type case. Remember to colour the points by type.
ggplot(storms_summary,
aes(x = pressure, y = wind, col = type)) +
geom_point(alpha = 0.7) +
labs(x = "Mean pressure (mbar)", y = "Mean wind speed (mph)")
Finally, see if you can combine the solutions to part 1 and 2 into a single “piped” operation. That is, instead of storing the intermediate data in storms_summary, use the pipe (%>%) to send the data straight to ggplot.
storms %>%
group_by(name, type) %>%
summarise(wind = mean(wind), pressure = mean(pressure)) %>%
ggplot(aes(x = pressure, y = wind, col = type)) +
geom_point(alpha = 0.7) +
labs(x = "Mean pressure (mbar)", y = "Mean wind speed (mph)")
Histograms summarise the relative frequency of different values of a variable. Look at the first 56 values of the pressure variable in storms:
storms $ pressure[1:56]
##  [1] 1005 1004 1003 1001  997  995  987  988  988  990  990  993  993  994
## [15]  995  995  992  990  988  984  982  984  989  993  995  996  997 1000
## [29]  997  990  992  992  993 1019 1019 1018 1017 1016 1013 1011 1009 1007
## [43] 1004 1001  997  997  997  997  996  995  993  991  990  989 1012 1012
To get a sense of how frequent different values are we can “bin” the data. Here are the frequencies of pressure variable values, using 8 bins:
table(cut(storms $ pressure, breaks = 8))
##
##        (905,919]        (919,934]        (934,948]        (948,962]
##                5               23               80              164
##        (962,976]        (976,990]      (990,1e+03] (1e+03,1.02e+03]
##              307              558              943              667
(You don’t need to remember this R code)
We use histograms to understand the distribution of a variable. They summarise the number of observations occurring in a contiguous series of bins. We can use geom_histogram to construct a histogram. Here is an example:
ggplot(storms, aes(x = pressure)) + geom_histogram(colour = "darkgrey", fill = "grey", binwidth=10) + labs(x = "Pressure", y = "Count")
Working with the iris dataset, construct a histogram of the ratio of petal length to petal width. See if you can make your histogram look like the one below. Hint: you can carry out the calculation with Petal.Length and Petal.Width inside aes (you don’t have to use mutate from dplyr).
ggplot(iris, aes(x = Petal.Length / Petal.Width)) + geom_histogram(binwidth=0.5) + labs(x = "Petal Eccentricity", y = "Count")
We use dot plots to explore the distribution of variables when we have relatively few observations (e.g. < 100). Here is an example:
setosa <- filter(iris, Species == "setosa")
ggplot(setosa, aes(x = Sepal.Length)) +
  geom_dotplot(binwidth = 0.1)
N.B. The y-scale is meaningless in this plot!
Make the dot plot of the sepal length variable for the Setosa species but now remove the y axis labels and the grid lines. You don’t know how to do this yet! You’ll need a hint:
Look at the examples in the help file for geom_dotplot to work out what to do with scale_y_continuous (read the comments)
Experiment with the options presented by RStudio after you type theme_. You need to find the right theme.
A code outline is given below. The <????> are placeholders that show the bits to complete.
setosa <- filter(iris, Species == "setosa")
ggplot(setosa, aes(x = Sepal.Length)) +
  geom_dotplot(binwidth = 0.1) +
  scale_y_continuous( <????> ) +
  theme_<????>()

setosa <- filter(iris, Species == "setosa")
ggplot(setosa, aes(x = Sepal.Length)) +
  geom_dotplot(binwidth = 0.1) +
  scale_y_continuous(NULL, breaks = NULL) + # <- remove the y-axis
  theme_classic() # <- remove the grid lines
Box and whisker plots summarise the distributions of a variable at different levels of a categorical variable. Here is an example:
Each box-and-whisker shows the group median (line) and the interquartile range (“boxes”). The vertical lines (“whiskers”) highlight the range of the rest of the data in each group. Potential outliers are plotted individually.
You can guess which geom_ function we use to make a boxplot…
ggplot(iris, aes(x = Species, y = Petal.Length/Petal.Width)) + geom_boxplot() + labs(x = "Species", y = "Eccentricity") + theme_minimal(base_size = 14)
Working with the storms dataset, construct a box and whiskers plot to summarise wind speed for each type of storm. Customise the fill colour of the boxes, get rid of the grey in the plot background, and increase the size of the text on the graph.
ggplot(storms, aes(x = type, y = wind)) + geom_boxplot(fill= "lightgrey") + labs(x = "Type of storm", y = "Wind Speed (mph)") + theme_classic(base_size = 14)
We can save a plot using the ggsave function with + when we’re building our plot…
# version 1.
ggplot(setosa, aes(x = Sepal.Length)) +
geom_dotplot(binwidth = 0.1) +
ggsave("Sepal_dotplot.pdf") # <- use ggsave as part of a ggplot construct
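Maybe don’t use this method: it relies on a side effect of ggsave, and more recent versions of ggplot2 may stop with an error when ggsave is added to a plot with +.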
Or we can save a plot using the ggsave function on its own after we make the figure…
# version 2.
ggplot(setosa, aes(x = Sepal.Length)) +
geom_dotplot(binwidth = 0.1)
# use ggsave on its own *after* making the figure
ggsave("Sepal_dotplot.pdf")
This method is probably better.
Use ggsave to save the box and whiskers plot that you just made. Can you work out where R has saved your plot to (i.e. which folder on your computer)? Can you change the dimensions of the saved plot so that these are 4 inches x 4 inches?
# make the plot
ggplot(storms, aes(x = type, y = wind)) +
geom_boxplot() +
labs(x = "Type of storm", y = "Wind Speed")
# save it
ggsave("Windspeed_boxplot.pdf", height = 4, width = 4)
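On the question of where the file ends up: ggsave resolves a relative file name against the current working directory unless told otherwise. Here is a minimal sketch, using iris and a temporary folder as stand-ins (the names p and out_dir are hypothetical), that stores the plot in an object and passes it to ggsave explicitly:

```r
library(ggplot2)

# Relative file names are resolved against the working directory
getwd()

# Store the plot in an object so we can say exactly which plot to save
p <- ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot()

# A temporary folder is used here only so the example leaves no files behind
out_dir <- tempdir()

# width and height are in inches by default
ggsave("iris_boxplot.pdf", plot = p, path = out_dir, width = 4, height = 4)

# Confirm the file was written
file.exists(file.path(out_dir, "iris_boxplot.pdf"))
```

Passing plot = p makes it explicit which figure is saved instead of relying on ggsave's default.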
We typically use a barplot to summarise differences among groups. We can use geom_bar to make barplots but it will simply count the number of observations by default:
ggplot(storms, aes(x = factor(year))) + # <- what is the 'factor' function doing?
  geom_bar() +
  labs(x = "Year", y = "Number of Storms")
This is not really very useful. Usually we use a bar plot to compare summary statistics (e.g. the mean).
First we need to calculate the summary statistic. We can do this with the group_by and summarise functions from the dplyr package. For example:
# step 1
pl_stats <-
  iris %>%
  group_by(Species) %>%
  summarise(mean_pl = mean(Petal.Length))
Second, make the plot. If we want to use a bar plot to compare a summary statistic (e.g. the mean) across groups we should use the geom_col function:
# step 2
ggplot(pl_stats, aes(x = Species, y = mean_pl)) +
  geom_col() +
  labs(y = "Mean Petal Length (cm)")
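For reference, geom_col is equivalent to geom_bar with stat = "identity", a form you may meet in older code. The short sketch below re-creates pl_stats so that it runs on its own; the names p_col and p_bar are only for illustration.

```r
library(dplyr)
library(ggplot2)

pl_stats <- iris %>%
  group_by(Species) %>%
  summarise(mean_pl = mean(Petal.Length))

# geom_col uses the y values directly as bar heights
p_col <- ggplot(pl_stats, aes(x = Species, y = mean_pl)) +
  geom_col()

# geom_bar counts rows by default; stat = "identity" switches that off
p_bar <- ggplot(pl_stats, aes(x = Species, y = mean_pl)) +
  geom_bar(stat = "identity")

# Both layers end up with the same bar heights
all.equal(layer_data(p_col)$y, layer_data(p_bar)$y)
```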
Working with the storms dataset, construct a bar plot that summarises the mean wind speed (wind) associated with storms in each year (year). If that was too easy, see if you can change the colour of the bars to grey.
# step 1 - use dplyr to calculate the means
wind.means <-
  storms %>% group_by(year) %>%
  summarise(mean = mean(wind))
# step 2 - make the plot
ggplot(wind.means, aes(x = factor(year), y = mean)) +
  geom_col(fill = "darkgrey") +
  labs(x = "Year", y = "Wind speed (mph)")
We can build more complex figures by adding more than one layer with the geom_ functions. For example, we should always add an error bar of some kind to summaries of means.
The standard error is one option here:
\[ \text{Standard Error} = \frac{\text{Standard Deviation}}{\sqrt{\text{Sample Size}}} \]
We need to repeat the dplyr step, but now include a calculation of the standard errors along with the means:
# step 1
pl_stats <-
iris %>%
group_by(Species) %>%
summarise(mean_pl = mean(Petal.Length),
se = sd(Petal.Length) / sqrt(n())) # <- New calculation
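As a quick sanity check of the formula, the standard error for a single group can be worked out by hand with base R and compared with the dplyr result. The sketch below uses setosa as an arbitrary example; the object names are hypothetical.

```r
library(dplyr)

# Petal lengths for one species (an arbitrary choice for illustration)
pl <- iris$Petal.Length[iris$Species == "setosa"]

# Standard error = standard deviation / square root of the sample size
se_by_hand <- sd(pl) / sqrt(length(pl))

# The same quantity from the grouped dplyr calculation
se_dplyr <- iris %>%
  group_by(Species) %>%
  summarise(se = sd(Petal.Length) / sqrt(n())) %>%
  filter(Species == "setosa") %>%
  pull(se)

# The two values should agree
all.equal(se_by_hand, se_dplyr)
```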
Once we have the two bits of information, we include these by adding two layers via two different geom_ functions: geom_col and geom_errorbar. We also need to define a couple of new aesthetics…
# step 2
ggplot(pl_stats,
aes(x = Species, y = mean_pl,
ymin = mean_pl - se, ymax = mean_pl + se)) +
geom_col(fill = "grey", width = 0.7) +
geom_errorbar(width = 0.25) +
labs(y = "Mean Petal Length (cm)")
Go back to the bar plot you just made using the storms data set and add error bars showing the standard errors of wind speed.
# step 1 - use dplyr to calculate the means and standard errors
wind.means <-
storms %>% group_by(year) %>%
summarise(mean = mean(wind),
se = sd(wind)/sqrt(n()))
# step 2 - make the plot
ggplot(wind.means, aes(x = factor(year), y = mean,
ymin = mean - se, ymax = mean + se)) +
geom_col(fill="darkgrey") +
geom_errorbar(width = 0.25) +
labs(x = "Year", y = "Wind speed (mph)")
Have a look at the built-in R data set called ChickWeight. Make sure you understand what variables it contains, then try to make the two plots below. Think about a) what these two graphs tell you about the effectiveness of the four diets and b) what other information it might be useful to include.
## Calculate the mean and standard errors for each diet at each time point
pltdata <- group_by(ChickWeight, Time, Diet) %>%
  summarise(mn = mean(weight), se = sd(weight)/sqrt(n()))
## Plot the means over time - remembering to colour by the diet
plta <- ggplot(pltdata, aes(x = Time, y = mn, colour = Diet)) +
  geom_point() +
  geom_line() + ## Unsurprisingly the function for adding lines to our plot is geom_line
  theme_classic() +
  labs(y = "Mean weight (g)", x = "Time (days)")
plta
## Filter the summary data to only include the final weights
pltdata2 <- ungroup(pltdata) %>%
  filter(Time == max(Time))
## Make a bar plot of the means and standard errors
pltb <- ggplot(pltdata2, aes(x = Diet, y = mn, ymin = mn - se, ymax = mn + se)) +
  geom_col(fill = 'cornflowerblue', colour = "black") +
  geom_errorbar(width = 0.3) +
  labs(y = "Final weight (g)") +
  theme_classic()
pltb
We can then use the plot_grid function to make a panel plot containing both graphs. This function is from the cowplot package so you’ll need to have that loaded (remember if you haven’t used it before then you’ll need to install it first using install.packages).
library(cowplot)
plot_grid(plta, pltb, labels = c("a)", "b)"))
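plot_grid also accepts layout arguments. As a hedged sketch with two stand-in plots (p1 and p2 are hypothetical, built from iris rather than ChickWeight so the example runs on its own), ncol = 1 stacks the panels vertically, and the combined figure can be passed to ggsave like any other plot:

```r
library(ggplot2)
library(cowplot)

# Two simple stand-in plots
p1 <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()
p2 <- ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot()

# ncol = 1 stacks the panels vertically instead of side by side
panel <- plot_grid(p1, p2, labels = c("a)", "b)"), ncol = 1)

# Save the combined figure to a temporary file
out_file <- file.path(tempdir(), "panel_plot.pdf")
ggsave(out_file, plot = panel, width = 4, height = 8)
```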