Efficient Data manipulation

Using pipes to streamine your code

A tutorial guide from Our Coding Club, you can see more of it in this link: https://ourcodingclub.github.io/tutorials/data-manip-efficient/

The datasets of the tutorial on this link: https://github.com/ourcodingclub/CC-data-manip-2.git

Tutorial aims:

Chain together multiple lines of codes with pipes %>%
Use dplyr to its full potential
Automate advanced tasks like plotting without writing a loop

# packages needed

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.1

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

# Set working directory
setwd("D:\\R-tutoriales ECOLOGY\\Efficient data manipulation") # I'm on windows so I need to include an extra backslash to give the rigth path, otherwise the tool doesn't work. On Mac and Linux you use the slash in the path.

# load data
trees <- read.csv("trees.csv", header = TRUE)
head(trees)

##                      Site           LatinName CommonName          Height
## 1 Craigmillar Castle Park Acer pseudoplatanus   Sycamore 10 to 15 meters
## 2 Craigmillar Castle Park Acer pseudoplatanus   Sycamore 10 to 15 meters
## 3 Craigmillar Castle Park Acer pseudoplatanus   Sycamore 10 to 15 meters
## 4 Craigmillar Castle Park Acer pseudoplatanus   Sycamore 10 to 15 meters
## 5 Craigmillar Castle Park Acer pseudoplatanus   Sycamore 10 to 15 meters
## 6 Craigmillar Castle Park Acer pseudoplatanus   Sycamore 10 to 15 meters
##          Spread    AgeGroup DiameterAt Easting Northing
## 1 6 to 9 meters Middle Aged    30 - 40  328092   671054
## 2 6 to 9 meters Middle Aged    30 - 40  328099   671061
## 3 6 to 9 meters Middle Aged    30 - 40  328106   671067
## 4 6 to 9 meters Middle Aged    30 - 40  328109   671071
## 5 6 to 9 meters Middle Aged    30 - 40  328127   671089
## 6 6 to 9 meters Middle Aged    30 - 40  328131   671093

Let’s say we want to know how many trees of each species are found in the dataset. This is a task made for the functions group_by() and summarise(). So we could do this:

trees_group <- group_by(trees, CommonName) # Create an internal grouping structure, so that the next function acts on groups (here, species) separately.

trees_summary <- summarise(trees_group, count = length(CommonName)) # here we use length to count the number of rows (trees) for each group (species). We could have used any row name.

# Alternatively, dplyr has a tally function that does the counts for you!
trees_summ <- tally(trees_group)

This works well, but notice how we had to create an extra data frame, trees_group, before achieving our desired output of trees_summary. For a larger, complex analysis, this would rapidly clutter your environment with lots of objects you don’t really need!

This is where the pipe comes in to save the day. It takes the data frame created on its left side, and passes it to the function on its right side. This saves you the need for creating intermediary objects, and also avoids repeating the object name in every function: the tidyverse functions “know” that the object that is passed through the pipe is the data = argument of that function.

# Count the number of trees for each species, with a pipe!

trees.summary <- trees %>%                   # the data frame object that will be passed in the pipe
                 group_by(CommonName) %>%    # see how we don't need to name the object, just the grouping variable?
                 tally()                     # and we don't need anything at all here, it has been passed through the pipe!

See how we go from trees to trees.summary while running one single chunk of code?

Important notes: Pipes only work on data frame objects, and functions outside the tidyverse often require that you specify the data source with a full stop dot .. But as we will see later, you can still do advanced things while keeping these limitations in mind!

We’re not lazy, but we love shortcuts! In RStudio, you can use Ctrl + Shift + M (or Cmd + Shift + M on a Mac) to create the %>% operator.

Let’s use some more of our favourite dplyr functions in pipe chains. Can you guess what this does?

trees.subset <- trees %>%
                filter(CommonName %in% c('Common Ash', 'Rowan', 'Scots Pine')) %>% 
                group_by(CommonName, AgeGroup) %>% 
                tally()
head(trees.subset)

## # A tibble: 6 × 3
## # Groups:   CommonName [3]
##   CommonName AgeGroup        n
##   <chr>      <chr>       <int>
## 1 Common Ash Juvenile        4
## 2 Common Ash Mature          2
## 3 Common Ash Middle Aged    36
## 4 Common Ash Semi-mature    17
## 5 Rowan      Juvenile        7
## 6 Scots Pine Mature          2

Here we are first subsetting the data frame to only three species, and counting the number of trees for each species, but also breaking them down by age group. The intuitive names of dplyr’s actions make the code very readable for your colleagues, too.

Neat, uh? Now let’s play around with other functions that dplyr has to offer.

2. More functions of dplyr

An extension of the core dplyr functions is summarise_all(): you may have guessed, it will run a summary function of your choice over ALL the columns. Not meaningful here, but could be if all values were numeric, for instance.

2.a summarise_all() quickly generate a summary dataframe

summ.all <- summarise_all(trees, mean)

## Warning: There were 7 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `Site = (function (x, ...) ...`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 6 remaining warnings.

print(summ.all)

##   Site LatinName CommonName Height Spread AgeGroup DiameterAt  Easting Northing
## 1   NA        NA         NA     NA     NA       NA         NA 328453.4 670745.6

As only two of the columns had numeric values over which a mean could be calculated, the other columns have missing values.

Now let’s move on to a truly exciting function that not so many people know about.

2.b case_when() - a favourite for re-classifying values or factors

But first, it seems poor form to introduce this function without also introducing the simpler function upon which it builds, ifelse(). You give ifelse() a conditional statement which it will evaluate, and the values it should return when this statement is true or false. Let’s do a very simple example to begin with:

vector <- c(4, 13, 15, 6)      # create a vector to evaluate

ifelse(vector < 10, "A", "B")  # give the conditions: if inferior to 10, return A, if not, return B

## [1] "A" "B" "B" "A"

# Congrats, you're a dancing queen! (Or king!)

The super useful case_when() is a generalisation of ifelse() that lets you assign more than two outcomes. All logical operators are available, and you assign the new value with a tilde ~. For instance:

vector2 <- c("What am I?", "A", "B", "C", "D")

case_when(vector2 == "What am I?" ~ "I am the walrus",
          vector2 %in% c("A", "B") ~ "goo",
          vector2 == "C" ~ "ga",
          vector2 == "D" ~ "joob")

## [1] "I am the walrus" "goo"             "goo"             "ga"             
## [5] "joob"

3 changing factor levels or create categorical variables

The use of mutate() together with case_when() is a great way to change the names of factor levels, or create a new variable based on existing ones. We see from the LatinName columns that there are many tree species belonging to some genera, like birches (Betula), or willows (Salix), for example. We may want to create a Genus column using mutate() that will hold that information.

We will do this using a character string search with the grepl function, which looks for patterns in the data, and specify what to return for each genus. Before we do that, we may want the full list of species occuring in the data!

unique(trees$LatinName)  # Shows all the species names

##  [1] "Acer pseudoplatanus"    "Fraxinus excelsior"     "Sorbus aucuparia"      
##  [4] "Betula pendula"         "Populus spp."           "Acer platanoides"      
##  [7] "Betula spp."            "Laburnum spp."          "Aesculus hippocastanum"
## [10] "Fagus sylvatica"        "Prunus spp."            "Pinus sylvestris"      
## [13] "Sambucus nigra"         "Crataegus monogyna"     "Ilex aquifolium"       
## [16] "Pinus nigra 'Maritima'" "Aesculus spp."          "Quercus spp."          
## [19] "Sorbus intermedia"      "Larix spp."             "Prunus avium"          
## [22] "Acer spp."              "Salix caprea"           "Alnus glutinosa"       
## [25] "Populus balsamifera"    "Populus tremula"

# Create a new column with the tree genera

trees.genus <- trees %>%
               mutate(Genus = case_when(               # creates the genus column and specifies conditions
                  grepl("Acer", LatinName) ~ "Acer",
                  grepl("Fraxinus", LatinName) ~ "Fraxinus",
                  grepl("Sorbus", LatinName) ~ "Sorbus",
                  grepl("Betula", LatinName) ~ "Betula",
                  grepl("Populus", LatinName) ~ "Populus",
                  grepl("Laburnum", LatinName) ~ "Laburnum",
                  grepl("Aesculus", LatinName) ~ "Aesculus",
                  grepl("Fagus", LatinName) ~ "Fagus",
                  grepl("Prunus", LatinName) ~ "Prunus",
                  grepl("Pinus", LatinName) ~ "Pinus",
                  grepl("Sambucus", LatinName) ~ "Sambucus",
                  grepl("Crataegus", LatinName) ~ "Crataegus",
                  grepl("Ilex", LatinName) ~ "Ilex",
                  grepl("Quercus", LatinName) ~ "Quercus",
                  grepl("Larix", LatinName) ~ "Larix",
                  grepl("Salix", LatinName) ~ "Salix",
                  grepl("Alnus", LatinName) ~ "Alnus")
               )

We have searched through the LatinNamecolumn for each genus name, and specified a value to put in the new Genus column for each case. It’s a lot of typing, but still quicker than specifying the genus individually for related trees (e.g. Acer pseudoplatanus, Acer platanoides, Acer spp.). BONUS FUNCTION! In our specific case, we could have achieved the same result much quicker. The genus is always the first word of the LatinName column, and always separated from the next word by a space. We could use the separate() function from the tidyr package to split the column into several new columns filled with the words making up the species names, and keep only the first one.

library(tidyr)

## Warning: package 'tidyr' was built under R version 4.3.1

trees.genus.2 <- trees %>%
                  tidyr::separate(LatinName, c("Genus", "Species"), sep = " ", remove = FALSE) %>%  
                  dplyr::select(-Species)

## Warning: Expected 2 pieces. Additional pieces discarded in 2 rows [151, 170].

# we're creating two new columns in a vector (genus name and species name), "sep" refers to the separator, here space between the words, and remove = FALSE means that we want to keep the original column LatinName in the data frame

Mind blowing! Of course, sometimes you have to be typing more, so here is another example of how we can reclassify a factor. The Height factor has 5 levels representing brackets of tree heights, but let’s say three categories would be enough for our purposes. We create a new height category variable Height.cat:

trees.genus <- trees.genus %>%   # overwriting our data frame
               mutate(Height.cat =   # creating our new column
                         case_when(Height %in% c("Up to 5 meters", "5 to 10 meters") ~ "Short",
                                   Height %in% c("10 to 15 meters", "15 to 20 meters") ~ "Medium",
                                   Height == "20 to 25 meters" ~ "Tall")
                      )

## Reordering a factor's levels

levels(trees.genus$Height.cat)  # shows the different factor levels in their default order

## NULL

trees.genus$Height.cat <- factor(trees.genus$Height.cat,
                                 levels = c('Short', 'Medium', 'Tall'),   # whichever order you choose will be reflected in plots etc
                                 labels = c('SHORT', 'MEDIUM', 'TALL')    # Make sure you match the new names to the original levels!
                                 )   

levels(trees.genus$Height.cat)  # a new order and new names for the levels

## [1] "SHORT"  "MEDIUM" "TALL"

4. Advanced piping

Earlier in the tutorial, we used pipes to gradually transform our dataframes by adding new columns or transforming the variables they contain. But sometimes you may want to use the really neat grouping functionalities of dplyr with non native dplyr functions, for instance to run series of models or produce plots. It can be tricky, but it’s sometimes easier to write than a loop. (You can learn to write loops here.)

First, we’ll subset our dataset to just a few tree genera to keep things light. Pick your favourite five, or use those we have defined here! Then we’ll map them to see how they are distributed.

# Subset data frame to fewer genera

trees.five <- trees.genus %>%
               filter(Genus %in% c("Acer", "Fraxinus", "Salix", "Aesculus", "Pinus"))

# Map all the trees

(map.all <- ggplot(trees.five) +
            geom_point(aes(x = Easting, y = Northing, size = Height.cat, colour = Genus), alpha = 0.5) +
            theme_bw() +
            theme(panel.grid = element_blank(),
                  axis.text = element_text(size = 12),
                  legend.text = element_text(size = 12))
)

## Warning: Using size for a discrete variable is not advised.

Don’t worry too much about all the arguments in the ggplot code, they are there to make the graph prettier. The interesting bits are the x and y axis, and the other two parameters we put in the aes() call: we’re telling the plot to colour the dots according to genus, and to make them bigger or smaller according to our tree height factor. We’ll explain everything else in our data visualisation tutorial.

Now, let’s say we want to save a separate map for each genus (so 5 maps in total). You could filter the data frame five times for each individual genus, and copy and paste the plotting code five times too, but imagine we kept all 17 genera! This is where pipes and dplyr come to the rescue again. (If you’re savvy with ggplot2, you’ll know that facetting is often a better option, but sometimes you do want to save things as separate files.) The do() function allows us to use pretty much any R function within a pipe chain, provided that we supply the data as data = . where the function requires it.

# Plotting a map for each genus

tree.plots <-  
   trees.five  %>%      # the data frame
   group_by(Genus) %>%  # grouping by genus
   do(plots =           # the plotting call within the do function
         ggplot(data = .) +
         geom_point(aes(x = Easting, y = Northing, size = Height.cat), alpha = 0.5) +
         labs(title = paste("Map of", .$Genus, "at Craigmillar Castle", sep = " ")) +
         theme_bw() +
         theme(panel.grid = element_blank(),
               axis.text = element_text(size = 14),
               legend.text = element_text(size = 12),
               plot.title = element_text(hjust = 0.5),
               legend.position = "bottom")
   )

# You can view the graphs before saving them
tree.plots$plots

# Saving the plots to file

tree.plots %>%              # the saving call within the do function
   do(.,
      ggsave(.$plots, filename = paste(getwd(), "\\", "map-", .$Genus, ".png", sep = ""), device = "png", height = 12, width = 16, units = "cm"))

You should get five different plots looking something like the one above.

Phew! This could even be chained in one long call without creating the tree.plots object, but take a moment to explore this object: the plots are saved as lists within the plots column that we created. The do() function allows to use a lot of external functions within dplyr pipe chains. However, it is sometimes tricky to use and is becoming deprecated. This page shows an alternative solution using the purr package to save the files.

Sticking things together with paste()

Did you notice how we used the paste() function to define the filename= argument of the last piece of code? (We did the same to define the titles that appear on the graphs.) It’s a useful function that lets you combine text strings as well as outputs from functions or object names in the environment. Let’s take apart that last piece of code here:

paste(getwd(), '/', 'map-', .$Genus, '.png', sep = '')

getwd(): You are familiar with this call: try it in the console now! It writes the path to your working directory, i.e. the root folder where we want to save the plots. ’/’: we want to add a slash after the directory folder and before writing the name of the plot ‘map-‘: a custom text bit that will be shared by all the plots. We’re drawing maps after all!’.$Genus’: accesses the Genus name of the tree.plots object, so each plot will bear a different name according to the tree genus. ‘.png’: the file extension; we could also have chosen a pdf, jpg, etc. ‘sep = ‘’’: we want all the previous bits to be pasted together with nothing separating them So, in the end, the whole string could read something like: ‘C:/Coding_Club/map-Acer.png’.

We hope you’ve learned new hacks that will simplify your code and make it more efficient! Let’s see if you can use what we learned today to accomplish a last data task.

5. Challenge!

The Craigmillar Castle team would like a summary of the different species found within its grounds, but broken down in four quadrants (NE, NW, SE, SW). You can start from the trees.genus object created earlier.

Can you calculate the species richness (e.g. the number of different species) in each quadrant?
They would also like to know how abundant the genus Acer is (as a % of the total number of trees) in each quadrant.
Finally, they would like, for each quadrant separately, a bar plot showing counts of Acer trees in the different age classes, ordered so they read from Young (lumping together juvenile and semi-mature trees), Middle Aged, and Mature.

# Find the center coordinates that will divide the data (adding half of the range in longitude and latitude to the smallest value)

lon <- (max(trees.genus$Easting) - min(trees.genus$Easting))/2 + min(trees.genus$Easting)
lat <- (max(trees.genus$Northing) - min(trees.genus$Northing))/2 + min(trees.genus$Northing)

# Create the column

trees.genus <- trees.genus %>%
   mutate(Quadrant = case_when(
                     Easting < lon & Northing > lat ~ 'NW',
                     Easting < lon & Northing < lat ~ 'SW',
                     Easting > lon & Northing > lat ~ 'NE',
                     Easting > lon & Northing < lat ~ 'SE')
   )

# We can check that it worked
ggplot(trees.genus) +
   geom_point(aes(x = Easting, y = Northing, colour = Quadrant)) +
   theme_bw()

It did work, but there is a NA value (check the legend)! Probably this point has the exact integer value of middle Easting, and should be attributed to one side or the other (your choice).

trees.genus <- trees.genus %>%
   mutate(Quadrant = case_when(
      Easting <= lon & Northing > lat ~ 'NW',  # using inferior OR EQUAL ensures that no point is forgotten
      Easting <= lon & Northing < lat ~ 'SW',
      Easting > lon & Northing > lat ~ 'NE',
      Easting > lon & Northing < lat ~ 'SE')
   )

To answer the first question, a simple pipeline combining group_by() and summarise() is what we need.

sp.richness <- trees.genus %>%
group_by(Quadrant) %>%
summarise(richness = length(unique(LatinName)))

There we are! We have 7, 15, 8 and 21 species for the NE, NW, SE, and SW corners respectively!

There are different ways to calculate the proportion of Acer trees, here is one (maybe base R would have been less convoluted in this case!):

acer.percent <- trees.genus %>%
   group_by(Quadrant, Genus) %>%
   tally() %>%                      # get the count of trees in each quadrant x genus
   group_by(Quadrant) %>%           # regroup only by quadrant
   mutate(total = sum(n)) %>%       # sum the total of trees in a new column
   filter(Genus == 'Acer') %>%      # keep only acer
   mutate(percent = n/total)        # calculate the proportion

# We can make a plot representing the %

ggplot(acer.percent) +
   geom_col(aes(x = Quadrant, y = percent)) +
   labs(x = 'Quadrant', y = 'Proportion of Acer') +
   theme_bw()

And finally, we can use our manipulation skills to subset the data frame to Acer only and change the age factor, and then use our pipes to create the four plots.

# Create an Acer-only data frame

acer <- trees.genus %>%
   filter(Genus == 'Acer')


# Rename and reorder age factor

acer$AgeGroup <- factor(acer$AgeGroup,
                        levels = c('Juvenile', 'Semi-mature', 'Middle Aged', 'Mature'),
                        labels = c('Young', 'Young', 'Middle Aged', 'Mature'))


# Plot the graphs for each quadrant

acer.plots <- acer %>%
   group_by(Quadrant) %>%
   do(plots =           # the plotting call within the do function
         ggplot(data = .) +
         geom_bar(aes(x = AgeGroup)) +
         labs(title = paste('Age distribution of Acer in ', .$Quadrant, ' corner', sep = ''),
              x = 'Age group', y = 'Number of trees') +
         theme_bw() +
         theme(panel.grid = element_blank(),
               axis.title = element_text(size = 14),
               axis.text = element_text(size = 14),
               plot.title = element_text(hjust = 0.5))
   )

# View the plots (use the arrows on the Plots viewer)
acer.plots$plots

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

Efficient data manipulation

Sandra from Our Coding Club - Felipe Wolff

2023-07-20