Note: These exercises and examples have been partially adapted from materials put together by Sean Anderson (http//seananderson.ca).
In R, there’s always more than one way to get something done. Pipe functions, and the tidyverse, were implemented by R users that wanted R to be a more “natural” programming language. For people who have used R for a long time, these functions might seem a little complicated - but for people just starting in R, these functions are GREAT.
The format of today’s workshop activities is a series of short “challenges”. Hopefully these force you to make some mistakes.
For reference, we’re going to work with a large dataset of mammal life-history, geography, and ecology traits from the PanTHERIA database:
Jones, K.E., et al. PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. Ecology 90:2648. http://esapubs.org/archive/ecol/E090/184/
We are also going to be using functions from the packages dplyr and magritrr. However, both of these packages are included in the “tidyverse”.
if (!require(tidyverse)) { install.packages("tidyverse"); library(tidyverse) } # If "tidyverse" isn't installed, install it; and load it.
## Loading required package: tidyverse
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
Now we will download the data we will be using, and getting it ready for the exercises.
pantheria <- "http://esapubs.org/archive/ecol/E090/184/PanTHERIA_1-0_WR05_Aug2008.txt" # Save a variable with the URL of the data
download.file(pantheria, destfile = "mammals.txt") # Use download.file() to download the data to a new text file.
mammals <- read.table("mammals.txt", sep = "\t", header = TRUE,
stringsAsFactors = FALSE) # Import that text file and name the data "mammals". The data is tab separated ("\t"), has a header, and we don't want string (words) to be read as factors.
Below we are going to prepare the data for what we are going to do today. Don’t worry too much about what this code is doing, but you will need to run it.
names(mammals) <- sub("X[0-9._]+", "", names(mammals))
names(mammals) <- sub("MSW05_", "", names(mammals))
mammals <- dplyr::select(mammals, Order, Binomial, AdultBodyMass_g,
AdultHeadBodyLen_mm, HomeRange_km2, LitterSize)
names(mammals) <- gsub("([A-Z])", "_\\L\\1", names(mammals), perl = TRUE)
names(mammals) <- gsub("^_", "", names(mammals), perl = TRUE)
mammals[mammals == -999] <- NA
names(mammals)[names(mammals) == "binomial"] <- "species"
mammals <- dplyr::tbl_df(mammals) # for prettier printing
Take a look at the data
head(mammals)
mammals
str(mammals)
glimpse(mammals) # This is new!
Previously we’ve used the $ operator and indexing brackets to specify which columns we’d like to see or use in a dataset. Now we will be using the select() function from dplyr. The below methods should yield identical results.
select(mammals, adult_body_mass_g) # Select displays specific rows
mammals$adult_body_mass_g # This indexing example does essentially the same thing
mammals[,3] # As does this one
And now, pipe functions. What are these? They are a simple way to make R code easier to read. Pipe functions are simply the operator %>% to “pipe” together commands into chains.
mammals %>% # And this pipe function is also equivalent
select(adult_body_mass_g) # Notice that I didn't need to declare the dataset in the select() function - the pipe function usually does that for us.
The pipe function can be a read as “and then”. “I have the mammals dataset”, “and then” “select adult body mass”.
You can use select() to show more than one column:
select(mammals, adult_body_mass_g, litter_size)
Challenge 1a: Can you do this using indexing?
Challenge 1b: With pipe functions?
Challenge 2: Repeat the result below using the pipe function and select()
mammals[,3:6]
Challenge 3: Repeat the result below using the pipe function and select()
mammals[,-3]
Challenge 4: Use the pipe function, select() and contains() to keep only columns that contain the word “adult”. Feel free to ?contains.
Question: Can the same result be easily obtained using indexing brackets?
Challenge 5: Use the pipe function, select() and starts_with() to keep only columns that start with the word “adult”. Feel free to ?starts_with().
Challenge 6: Use the pipe function, select() and ends_with() to keep only columns that are measured in grams. Feel free to ?ends_with().
Challenge 7: Select all columns containing “adult”, but exclude those measured in grams.
Challenge 8: Can you select all columns containing “adult”, but keep only those measured in grams? Hint, use two pipes…
What if you want to select specific rows from the data based on cell values?
filter(mammals, adult_body_mass_g > 15000) # See data for mammals larger than 15 kg
mammals[which(mammals$adult_body_mass_g > 15000),] # Does the same thing using which() and indexing
mammals %>%
filter(adult_body_mass_g > 15000)
Challenge 9: Can you show only data on the litter size of animals larger than 15 kg?
Challenge 10: Show only data for the order Cetacea. Use filter(), and then do it again with indexing with which().
You can use arrange() to order rows in ascending or descending order:
arrange(mammals, adult_body_mass_g) # Show all data, sorted by ascending adult body mass
## # A tibble: 5,416 × 6
## order species adult_body_mass_g
## <chr> <chr> <dbl>
## 1 Chiroptera Craseonycteris thonglongyai 1.96
## 2 Chiroptera Kerivoula minuta 2.03
## 3 Soricomorpha Suncus etruscus 2.26
## 4 Soricomorpha Sorex minutissimus 2.46
## 5 Soricomorpha Suncus madagascariensis 2.47
## 6 Soricomorpha Crocidura lusitania 2.48
## 7 Soricomorpha Crocidura planiceps 2.50
## 8 Chiroptera Pipistrellus nanulus 2.51
## 9 Soricomorpha Sorex nanus 2.57
## 10 Soricomorpha Sorex arizonae 2.70
## # ... with 5,406 more rows, and 3 more variables:
## # adult_head_body_len_mm <dbl>, home_range_km2 <dbl>, litter_size <dbl>
mammals %>%
arrange(desc(adult_body_mass_g)) # reverse the order, largest animals first
## # A tibble: 5,416 × 6
## order species adult_body_mass_g adult_head_body_len_mm
## <chr> <chr> <dbl> <dbl>
## 1 Cetacea Balaenoptera musculus 154321304 30480.00
## 2 Cetacea Balaena mysticetus 79691179 12187.12
## 3 Cetacea Balaenoptera physalus 47506008 20641.06
## 4 Cetacea Caperea marginata 32000000 NA
## 5 Cetacea Megaptera novaeangliae 30000000 12856.21
## 6 Cetacea Eschrichtius robustus 27324024 11831.89
## 7 Cetacea Eubalaena australis 23000000 14150.00
## 8 Cetacea Eubalaena glacialis 23000000 13907.08
## 9 Cetacea Balaenoptera borealis 22106252 18382.16
## 10 Cetacea Balaenoptera edeni 20000000 13846.17
## # ... with 5,406 more rows, and 2 more variables: home_range_km2 <dbl>,
## # litter_size <dbl>
You can use multiple conditions - they are read in order:
arrange(mammals, order, adult_body_mass_g) # First sort by order, then by body mass
## # A tibble: 5,416 × 6
## order species adult_body_mass_g
## <chr> <chr> <dbl>
## 1 Afrosoricida Microgale pusilla 3.40
## 2 Afrosoricida Microgale parvula 3.53
## 3 Afrosoricida Geogale aurita 6.69
## 4 Afrosoricida Microgale fotsifotsy 7.70
## 5 Afrosoricida Microgale longicaudata 8.08
## 6 Afrosoricida Microgale brevicaudata 8.99
## 7 Afrosoricida Microgale principula 10.20
## 8 Afrosoricida Microgale drouhardi 10.50
## 9 Afrosoricida Microgale cowani 12.27
## 10 Afrosoricida Microgale taiva 12.40
## # ... with 5,406 more rows, and 3 more variables:
## # adult_head_body_len_mm <dbl>, home_range_km2 <dbl>, litter_size <dbl>
There are many ways to add a new column to a dataset:
x <- runif(10,1,10) # Random vector x
y <- x*2 # Dependent vector y
data <- data.frame(x,y) # Combine into dataframe
data$z <- data$y/1000 # Add a new column, z.
data
## x y z
## 1 7.507229 15.014458 0.015014458
## 2 2.458193 4.916386 0.004916386
## 3 6.268312 12.536624 0.012536624
## 4 5.902656 11.805313 0.011805313
## 5 3.770172 7.540344 0.007540344
## 6 2.864069 5.728137 0.005728137
## 7 3.776291 7.552582 0.007552582
## 8 7.370617 14.741234 0.014741234
## 9 1.489315 2.978629 0.002978629
## 10 5.206470 10.412939 0.010412939
mutate() is an interesting tool that simplifies the process:
mutate(data, z2 = z^2) # Notice that this does not SAVE or PRESERVE your new column. The results are written, not saved to the dataframe. If you'd like to save it, use the <- to save to an object.
## x y z z2
## 1 7.507229 15.014458 0.015014458 2.254339e-04
## 2 2.458193 4.916386 0.004916386 2.417086e-05
## 3 6.268312 12.536624 0.012536624 1.571669e-04
## 4 5.902656 11.805313 0.011805313 1.393654e-04
## 5 3.770172 7.540344 0.007540344 5.685679e-05
## 6 2.864069 5.728137 0.005728137 3.281156e-05
## 7 3.776291 7.552582 0.007552582 5.704149e-05
## 8 7.370617 14.741234 0.014741234 2.173040e-04
## 9 1.489315 2.978629 0.002978629 8.872232e-06
## 10 5.206470 10.412939 0.010412939 1.084293e-04
# The same can be done with a pipe function
data %>%
mutate(z2 = z^2)
## x y z z2
## 1 7.507229 15.014458 0.015014458 2.254339e-04
## 2 2.458193 4.916386 0.004916386 2.417086e-05
## 3 6.268312 12.536624 0.012536624 1.571669e-04
## 4 5.902656 11.805313 0.011805313 1.393654e-04
## 5 3.770172 7.540344 0.007540344 5.685679e-05
## 6 2.864069 5.728137 0.005728137 3.281156e-05
## 7 3.776291 7.552582 0.007552582 5.704149e-05
## 8 7.370617 14.741234 0.014741234 2.173040e-04
## 9 1.489315 2.978629 0.002978629 8.872232e-06
## 10 5.206470 10.412939 0.010412939 1.084293e-04
IMPORTANT QUESTION: Is the new column “z2” “saved” to the dataframe?
Challenge 11: Use the pipe function and mutate() to add columns to the mammals dataset that describe both g body mass/mm of length, and kg body mass/mm of length.
Sometimes you just want quick summaries, such as the mean of a variable, or the mean of a variable in a subset of the data. summarise() is a great tool.
summarise(mammals, mean_mass = mean(adult_body_mass_g, na.rm = TRUE)) # It's important to tell mean() to remove NAs...
## # A tibble: 1 × 1
## mean_mass
## <dbl>
## 1 177810.2
Challenge 12: Do the same with indexing. And then with the pipe function.
But when we use group_by(), things get really fun (?group_by):
mammals %>%
group_by(order) %>%
summarise(mean_mass = mean(adult_body_mass_g, na.rm = TRUE), sd_mass = sd(adult_body_mass_g, na.rm = TRUE))
## # A tibble: 29 × 3
## order mean_mass sd_mass
## <chr> <dbl> <dbl>
## 1 Afrosoricida 9.475564e+01 1.818012e+02
## 2 Artiodactyla 1.213294e+05 1.984613e+05
## 3 Carnivora 4.738645e+04 1.534111e+05
## 4 Cetacea 7.373065e+06 2.114403e+07
## 5 Chiroptera 5.772033e+01 1.348601e+02
## 6 Cingulata 4.699230e+03 8.761466e+03
## 7 Dasyuromorphia 7.960107e+02 3.965933e+03
## 8 Dermoptera 1.181100e+03 9.743931e+01
## 9 Didelphimorphia 2.012880e+02 3.933471e+02
## 10 Diprotodontia 5.436913e+03 7.545352e+03
## # ... with 19 more rows
Challenge 13: Can you recreate the above output without the pipe function?
Challenge 14: Can you use what we’ve covered to find which order have a median litter size greater than 4? Use pipe functions.
Challenge 15: Once found, can you then find the median body mass of those orders? Hint: You may want to use two pipe chains, using -> to save an object after the first.
These “verbs” from the package dplyr may change how you think about attacking your data. Note that you can put plots directly into pipe chains:
require(scales)
mammals %>%
group_by(order) %>%
summarise(mean_mass = mean(adult_body_mass_g, na.rm = TRUE), sd_mass = sd(adult_body_mass_g, na.rm = TRUE)) %>%
arrange(desc(mean_mass)) %>%
ggplot(aes(x = reorder(order, mean_mass), y = mean_mass), data = .) + # Create a ggplot - reorder "order" according to body mass
geom_bar(stat = 'identity') + # We want a bar graph
geom_errorbar(aes(ymax = mean_mass + sd_mass, ymin = mean_mass - sd_mass)) + # Add errorbars for standard deviation
coord_flip() + # Flip the coordiantes, so Order is on the vertical axis
scale_y_continuous(limits = c(0,3e+07), oob = rescale_none) + # Set the limits of the "mass" axis. oob = rescale_none ensures that the errorbars plot, even if they extend below 0 g. This is from the "scales" package, loaded above.
xlab("Order") +
ylab("Mean mass (g)")
That’s it for this session. Check out this session for more on using the functions in the Tidyverse.