Note: These exercises and examples have been partially adapted from materials put together by Sean Anderson (http//seananderson.ca).


Data manipulation

In R, there’s always more than one way to get something done. Pipe functions, and the tidyverse, were implemented by R users that wanted R to be a more “natural” programming language. For people who have used R for a long time, these functions might seem a little complicated - but for people just starting in R, these functions are GREAT.

The format of today’s workshop activities is a series of short “challenges”. Hopefully these force you to make some mistakes.


For reference, we’re going to work with a large dataset of mammal life-history, geography, and ecology traits from the PanTHERIA database:

Jones, K.E., et al. PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. Ecology 90:2648. http://esapubs.org/archive/ecol/E090/184/


We are also going to be using functions from the packages dplyr and magritrr. However, both of these packages are included in the “tidyverse”.

if (!require(tidyverse)) { install.packages("tidyverse"); library(tidyverse) } # If "tidyverse" isn't installed, install it; and load it.
## Loading required package: tidyverse
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats


Now we will download the data we will be using, and getting it ready for the exercises.

pantheria <- "http://esapubs.org/archive/ecol/E090/184/PanTHERIA_1-0_WR05_Aug2008.txt" # Save a variable with the URL of the data
download.file(pantheria, destfile = "mammals.txt") # Use download.file() to download the data to a new text file.
mammals <- read.table("mammals.txt", sep = "\t", header = TRUE, 
                      stringsAsFactors = FALSE) # Import that text file and name the data "mammals". The data is tab separated ("\t"), has a header, and we don't want string (words) to be read as factors.


Below we are going to prepare the data for what we are going to do today. Don’t worry too much about what this code is doing, but you will need to run it.

names(mammals) <- sub("X[0-9._]+", "", names(mammals))
names(mammals) <- sub("MSW05_", "", names(mammals))
mammals <- dplyr::select(mammals, Order, Binomial, AdultBodyMass_g, 
                         AdultHeadBodyLen_mm, HomeRange_km2, LitterSize)
names(mammals) <- gsub("([A-Z])", "_\\L\\1", names(mammals), perl = TRUE)
names(mammals) <- gsub("^_", "", names(mammals), perl = TRUE)
mammals[mammals == -999] <- NA
names(mammals)[names(mammals) == "binomial"] <- "species"
mammals <- dplyr::tbl_df(mammals) # for prettier printing


Take a look at the data

head(mammals)
mammals

str(mammals)
glimpse(mammals) # This is new!


1. Selecting columns

Previously we’ve used the $ operator and indexing brackets to specify which columns we’d like to see or use in a dataset. Now we will be using the select() function from dplyr. The below methods should yield identical results.

select(mammals, adult_body_mass_g) # Select displays specific rows
mammals$adult_body_mass_g # This indexing example does essentially the same thing
mammals[,3] # As does this one


And now, pipe functions. What are these? They are a simple way to make R code easier to read. Pipe functions are simply the operator %>% to “pipe” together commands into chains.

mammals %>% # And this pipe function is also equivalent
  select(adult_body_mass_g) # Notice that I didn't need to declare the dataset in the select() function - the pipe function usually does that for us.


The pipe function can be a read as “and then”. “I have the mammals dataset”, “and then” “select adult body mass”.


You can use select() to show more than one column:

select(mammals, adult_body_mass_g, litter_size)


Challenge 1a: Can you do this using indexing?


Challenge 1b: With pipe functions?


Challenge 2: Repeat the result below using the pipe function and select()

mammals[,3:6]


Challenge 3: Repeat the result below using the pipe function and select()

mammals[,-3]


Challenge 4: Use the pipe function, select() and contains() to keep only columns that contain the word “adult”. Feel free to ?contains.


Question: Can the same result be easily obtained using indexing brackets?


Challenge 5: Use the pipe function, select() and starts_with() to keep only columns that start with the word “adult”. Feel free to ?starts_with().


Challenge 6: Use the pipe function, select() and ends_with() to keep only columns that are measured in grams. Feel free to ?ends_with().


Challenge 7: Select all columns containing “adult”, but exclude those measured in grams.


Challenge 8: Can you select all columns containing “adult”, but keep only those measured in grams? Hint, use two pipes…


2. Filtering rows

What if you want to select specific rows from the data based on cell values?

filter(mammals, adult_body_mass_g > 15000) # See data for mammals larger than 15 kg
mammals[which(mammals$adult_body_mass_g > 15000),] # Does the same thing using which() and indexing

mammals %>%
  filter(adult_body_mass_g > 15000)


Challenge 9: Can you show only data on the litter size of animals larger than 15 kg?


Challenge 10: Show only data for the order Cetacea. Use filter(), and then do it again with indexing with which().


3. Arrange (sort)

You can use arrange() to order rows in ascending or descending order:

arrange(mammals, adult_body_mass_g) # Show all data, sorted by ascending adult body mass
## # A tibble: 5,416 × 6
##           order                     species adult_body_mass_g
##           <chr>                       <chr>             <dbl>
## 1    Chiroptera Craseonycteris thonglongyai              1.96
## 2    Chiroptera            Kerivoula minuta              2.03
## 3  Soricomorpha             Suncus etruscus              2.26
## 4  Soricomorpha          Sorex minutissimus              2.46
## 5  Soricomorpha     Suncus madagascariensis              2.47
## 6  Soricomorpha         Crocidura lusitania              2.48
## 7  Soricomorpha         Crocidura planiceps              2.50
## 8    Chiroptera        Pipistrellus nanulus              2.51
## 9  Soricomorpha                 Sorex nanus              2.57
## 10 Soricomorpha              Sorex arizonae              2.70
## # ... with 5,406 more rows, and 3 more variables:
## #   adult_head_body_len_mm <dbl>, home_range_km2 <dbl>, litter_size <dbl>
mammals %>%
  arrange(desc(adult_body_mass_g)) # reverse the order, largest animals first
## # A tibble: 5,416 × 6
##      order                species adult_body_mass_g adult_head_body_len_mm
##      <chr>                  <chr>             <dbl>                  <dbl>
## 1  Cetacea  Balaenoptera musculus         154321304               30480.00
## 2  Cetacea     Balaena mysticetus          79691179               12187.12
## 3  Cetacea  Balaenoptera physalus          47506008               20641.06
## 4  Cetacea      Caperea marginata          32000000                     NA
## 5  Cetacea Megaptera novaeangliae          30000000               12856.21
## 6  Cetacea  Eschrichtius robustus          27324024               11831.89
## 7  Cetacea    Eubalaena australis          23000000               14150.00
## 8  Cetacea    Eubalaena glacialis          23000000               13907.08
## 9  Cetacea  Balaenoptera borealis          22106252               18382.16
## 10 Cetacea     Balaenoptera edeni          20000000               13846.17
## # ... with 5,406 more rows, and 2 more variables: home_range_km2 <dbl>,
## #   litter_size <dbl>


You can use multiple conditions - they are read in order:

arrange(mammals, order, adult_body_mass_g) # First sort by order, then by body mass
## # A tibble: 5,416 × 6
##           order                species adult_body_mass_g
##           <chr>                  <chr>             <dbl>
## 1  Afrosoricida      Microgale pusilla              3.40
## 2  Afrosoricida      Microgale parvula              3.53
## 3  Afrosoricida         Geogale aurita              6.69
## 4  Afrosoricida   Microgale fotsifotsy              7.70
## 5  Afrosoricida Microgale longicaudata              8.08
## 6  Afrosoricida Microgale brevicaudata              8.99
## 7  Afrosoricida   Microgale principula             10.20
## 8  Afrosoricida    Microgale drouhardi             10.50
## 9  Afrosoricida       Microgale cowani             12.27
## 10 Afrosoricida        Microgale taiva             12.40
## # ... with 5,406 more rows, and 3 more variables:
## #   adult_head_body_len_mm <dbl>, home_range_km2 <dbl>, litter_size <dbl>


4. Adding or removing columns - Mutate

There are many ways to add a new column to a dataset:

x <- runif(10,1,10) # Random vector x
y <- x*2 # Dependent vector y
data <- data.frame(x,y) # Combine into dataframe

data$z <- data$y/1000 # Add a new column, z.

data
##           x         y           z
## 1  7.507229 15.014458 0.015014458
## 2  2.458193  4.916386 0.004916386
## 3  6.268312 12.536624 0.012536624
## 4  5.902656 11.805313 0.011805313
## 5  3.770172  7.540344 0.007540344
## 6  2.864069  5.728137 0.005728137
## 7  3.776291  7.552582 0.007552582
## 8  7.370617 14.741234 0.014741234
## 9  1.489315  2.978629 0.002978629
## 10 5.206470 10.412939 0.010412939


mutate() is an interesting tool that simplifies the process:

mutate(data, z2 = z^2) # Notice that this does not SAVE or PRESERVE your new column. The results are written, not saved to the dataframe. If you'd like to save it, use the <- to save to an object.
##           x         y           z           z2
## 1  7.507229 15.014458 0.015014458 2.254339e-04
## 2  2.458193  4.916386 0.004916386 2.417086e-05
## 3  6.268312 12.536624 0.012536624 1.571669e-04
## 4  5.902656 11.805313 0.011805313 1.393654e-04
## 5  3.770172  7.540344 0.007540344 5.685679e-05
## 6  2.864069  5.728137 0.005728137 3.281156e-05
## 7  3.776291  7.552582 0.007552582 5.704149e-05
## 8  7.370617 14.741234 0.014741234 2.173040e-04
## 9  1.489315  2.978629 0.002978629 8.872232e-06
## 10 5.206470 10.412939 0.010412939 1.084293e-04
# The same can be done with a pipe function
data %>%
  mutate(z2 = z^2)
##           x         y           z           z2
## 1  7.507229 15.014458 0.015014458 2.254339e-04
## 2  2.458193  4.916386 0.004916386 2.417086e-05
## 3  6.268312 12.536624 0.012536624 1.571669e-04
## 4  5.902656 11.805313 0.011805313 1.393654e-04
## 5  3.770172  7.540344 0.007540344 5.685679e-05
## 6  2.864069  5.728137 0.005728137 3.281156e-05
## 7  3.776291  7.552582 0.007552582 5.704149e-05
## 8  7.370617 14.741234 0.014741234 2.173040e-04
## 9  1.489315  2.978629 0.002978629 8.872232e-06
## 10 5.206470 10.412939 0.010412939 1.084293e-04


IMPORTANT QUESTION: Is the new column “z2” “saved” to the dataframe?


Challenge 11: Use the pipe function and mutate() to add columns to the mammals dataset that describe both g body mass/mm of length, and kg body mass/mm of length.


5. Summarising data

Sometimes you just want quick summaries, such as the mean of a variable, or the mean of a variable in a subset of the data. summarise() is a great tool.

summarise(mammals, mean_mass = mean(adult_body_mass_g, na.rm = TRUE)) # It's important to tell mean() to remove NAs...
## # A tibble: 1 × 1
##   mean_mass
##       <dbl>
## 1  177810.2


Challenge 12: Do the same with indexing. And then with the pipe function.


But when we use group_by(), things get really fun (?group_by):

mammals %>%
  group_by(order) %>%
  summarise(mean_mass = mean(adult_body_mass_g, na.rm = TRUE), sd_mass = sd(adult_body_mass_g, na.rm = TRUE))
## # A tibble: 29 × 3
##              order    mean_mass      sd_mass
##              <chr>        <dbl>        <dbl>
## 1     Afrosoricida 9.475564e+01 1.818012e+02
## 2     Artiodactyla 1.213294e+05 1.984613e+05
## 3        Carnivora 4.738645e+04 1.534111e+05
## 4          Cetacea 7.373065e+06 2.114403e+07
## 5       Chiroptera 5.772033e+01 1.348601e+02
## 6        Cingulata 4.699230e+03 8.761466e+03
## 7   Dasyuromorphia 7.960107e+02 3.965933e+03
## 8       Dermoptera 1.181100e+03 9.743931e+01
## 9  Didelphimorphia 2.012880e+02 3.933471e+02
## 10   Diprotodontia 5.436913e+03 7.545352e+03
## # ... with 19 more rows


Challenge 13: Can you recreate the above output without the pipe function?


6. Synthesis


Challenge 14: Can you use what we’ve covered to find which order have a median litter size greater than 4? Use pipe functions.


Challenge 15: Once found, can you then find the median body mass of those orders? Hint: You may want to use two pipe chains, using -> to save an object after the first.


These “verbs” from the package dplyr may change how you think about attacking your data. Note that you can put plots directly into pipe chains:

require(scales)
mammals %>%
  group_by(order) %>%
  summarise(mean_mass = mean(adult_body_mass_g, na.rm = TRUE), sd_mass = sd(adult_body_mass_g, na.rm = TRUE)) %>%
  arrange(desc(mean_mass)) %>%
  ggplot(aes(x = reorder(order, mean_mass), y = mean_mass), data = .) + # Create a ggplot - reorder "order" according to body mass
    geom_bar(stat = 'identity') + # We want a bar graph
    geom_errorbar(aes(ymax = mean_mass + sd_mass, ymin = mean_mass - sd_mass)) + # Add errorbars for standard deviation
    coord_flip() + # Flip the coordiantes, so Order is on the vertical axis
    scale_y_continuous(limits = c(0,3e+07), oob = rescale_none) + # Set the limits of the "mass" axis. oob = rescale_none ensures that the errorbars plot, even if they extend below 0 g. This is from the "scales" package, loaded above.
    xlab("Order") +
    ylab("Mean mass (g)")


That’s it for this session. Check out this session for more on using the functions in the Tidyverse.