From Chapter 3 of Beckerman, Childs and Petchey: Getting Started with R
Data management, manipulation and exploration using commands from dplyr, part of the tidyverse collection of packages
In this exercise we will use the compensation data set, which has 40 observations of the root stock mass and mass of fruit harvested, for apple trees in both grazed and ungrazed conditions.
Find this in the Teams/Files/RStuff/data folder and save it into your own RStuff/data folder.
We will see how a few commands from the dplyr package enable us to select portions of the data, or to manipulate it in some way. This stage of data analysis is often the most time-consuming. The dplyr commands make it straightforward to do in R.
Start a new script and save it into your RStuff/scripts folder
Begin with these commented lines, and don’t forget to add more comments to your own script as you go through it, so that you later understand what each part is doing.
# dplyr exercises
# your name
# the date
Set the working directory to be your RStuff/data folder
See if you can do this using the Files window.
Clear R’s brain
rm(list=ls())
Load tidyverse into your session
library(tidyverse)
If this throws an error, then you need to type install.packages('tidyverse') into your console window, not in your script. This will install the tidyverse packages. You should then run the library(tidyverse) line again.
Import the compensation data set into a data frame called ‘compensation’
compensation<-read_csv('compensation.csv')
Inspect the data
glimpse(compensation)
Note that you can also inspect the data by clicking on the arrow against its name in the Environment pane.
Use summary() to find the mean values of the Fruit and Root columns.
summary(compensation)
summary() gives useful summary statistics of each column of a data set.
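If you would like to see what summary() reports without opening the real file, here is a minimal sketch on a small made-up data frame. The object name toy and all of its values are invented for illustration; they are not the compensation data. Loading dplyr on its own is enough here, because library(tidyverse) loads it for you anyway.

```r
library(dplyr)

# Made-up stand-in for the compensation data (NOT the real values)
toy <- tibble(
  Root    = c(6.2, 7.5, 8.1, 5.9, 9.4, 10.2),
  Fruit   = c(15.0, 45.5, 62.3, 18.7, 85.1, 92.4),
  Grazing = c("Ungrazed", "Ungrazed", "Ungrazed", "Grazed", "Grazed", "Grazed")
)

# Minimum, quartiles, median, mean and maximum for each numeric column
summary(toy)

# The mean of a single column, for comparison with the summary() output
toy_mean_fruit <- mean(toy$Fruit)
toy_mean_fruit
```

The Mean entry under Fruit in the summary() output should agree with the value from mean(), apart from rounding in the printed table.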
Subsetting the data
dplyr provides several commands that let you extract subsets of a larger data set.
Select a subset of columns
select() allows you to choose or exclude whichever columns you want
Use select() to pick out the Fruit column.
Save the output as an object called ‘Fruits’.
Fruits<-select(compensation,Fruit)
As is always the case with commands from dplyr, the first argument of select() is the name of the data frame from which you want to select some columns. This is followed in this case by the names of the columns you wish to select.
Use select() to pick out all the columns except the Root column.
select() will leave out any column prefaced with a minus sign (-).
Save the output as an object called ‘notRoot’.
notRoot<-select(compensation,-Root)
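The two uses of select() above can be tried on a small made-up data frame, which makes the result easy to check by eye. This is only a sketch: toy, toy_fruit and toy_not_root are invented names, and the values are not the real compensation data.

```r
library(dplyr)

# Made-up stand-in for the compensation data (NOT the real values)
toy <- tibble(
  Root    = c(6.2, 7.5, 8.1, 5.9, 9.4, 10.2),
  Fruit   = c(15.0, 45.5, 62.3, 18.7, 85.1, 92.4),
  Grazing = c("Ungrazed", "Ungrazed", "Ungrazed", "Grazed", "Grazed", "Grazed")
)

# Keep only the Fruit column
toy_fruit <- select(toy, Fruit)

# Drop the Root column and keep everything else
toy_not_root <- select(toy, -Root)

names(toy_fruit)      # "Fruit"
names(toy_not_root)   # "Fruit" "Grazing"
```

Note that select() hands back a data frame even when you ask for a single column, so toy_fruit still has rows and a column name.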
Choose particular rows
We use slice() to pick out particular rows (but see also filter())
Use slice() to grab the second row of compensation
row_2<-slice(compensation,2)
Use slice() to grab the second to the 10th rows
row_2_to_10<-slice(compensation,2:10)
Use slice() to grab rows 2, 3 and 10.
row_2310<-slice(compensation,2,3,10)
What kind of object does slice() return?
Are the row numbers still 2, 3, and 10?
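One way to investigate these two questions is to run slice() on a small made-up data frame and inspect what comes back. This is a sketch only: toy and toy_rows are invented names, the values are not the real compensation data, and rows 2, 3 and 6 are used because the toy data has only six rows.

```r
library(dplyr)

# Made-up stand-in for the compensation data (NOT the real values)
toy <- tibble(
  Root    = c(6.2, 7.5, 8.1, 5.9, 9.4, 10.2),
  Fruit   = c(15.0, 45.5, 62.3, 18.7, 85.1, 92.4),
  Grazing = c("Ungrazed", "Ungrazed", "Ungrazed", "Grazed", "Grazed", "Grazed")
)

# Pick out rows 2, 3 and 6 by position
toy_rows <- slice(toy, c(2, 3, 6))

# What kind of object is it?
class(toy_rows)

# Print it and look at the row numbers shown down the left-hand side
toy_rows
```

Compare the row numbers in the printed output with the positions you asked for.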
Choose rows that satisfy some condition
We use filter() to do this.
Use filter() to pick out those rows for which Fruit is greater than 50.
big_fruit<-filter(compensation,Fruit>50)
Use filter() and the logical OR symbol ‘|’ to pick rows where Fruit is greater than 80 or less than 20
extreme_fruit<-filter(compensation,Fruit>80 | Fruit<20)
Use filter() and the logical AND symbol ‘&’ to pick rows where Fruit is less than 80 AND greater than 20
medium_fruit<-filter(compensation,Fruit<80 & Fruit>20)
Note that we have saved each of these selections to a named object, so we can now use them.
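To see how the OR and AND conditions split a data set, here is a minimal sketch on a made-up data frame with six rows. The names toy, toy_extreme and toy_medium are invented, and the values are not the real compensation data.

```r
library(dplyr)

# Made-up stand-in for the compensation data (NOT the real values)
toy <- tibble(
  Root    = c(6.2, 7.5, 8.1, 5.9, 9.4, 10.2),
  Fruit   = c(15.0, 45.5, 62.3, 18.7, 85.1, 92.4),
  Grazing = c("Ungrazed", "Ungrazed", "Ungrazed", "Grazed", "Grazed", "Grazed")
)

# OR: a row is kept if either condition is true
toy_extreme <- filter(toy, Fruit > 80 | Fruit < 20)

# AND: a row is kept only if both conditions are true
toy_medium <- filter(toy, Fruit < 80 & Fruit > 20)

nrow(toy_extreme)   # 4
nrow(toy_medium)    # 2
```

Because no Fruit value in the toy data is exactly 20 or 80, every row lands in one of the two subsets, so the two row counts add up to the six rows of toy.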
Sorting
Use arrange() to sort the data by the Fruit column in ascending order.
comp_fruit_ascending<-arrange(compensation,Fruit)
Use arrange() to sort the data by the Fruit column in descending order
comp_fruit_descending<-arrange(compensation,-Fruit)
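The minus sign works for descending order because Fruit is numeric. dplyr also provides desc(), which does the same job and works on text columns too, where a minus sign would throw an error. Here is a minimal sketch on a made-up data frame; the names toy, toy_desc and toy_by_grazing are invented and the values are not the real compensation data.

```r
library(dplyr)

# Made-up stand-in for the compensation data (NOT the real values)
toy <- tibble(
  Root    = c(6.2, 7.5, 8.1, 5.9, 9.4, 10.2),
  Fruit   = c(15.0, 45.5, 62.3, 18.7, 85.1, 92.4),
  Grazing = c("Ungrazed", "Ungrazed", "Ungrazed", "Grazed", "Grazed", "Grazed")
)

# Largest Fruit first, using desc() instead of a minus sign
toy_desc <- arrange(toy, desc(Fruit))

# desc() also works on a text column: reverse alphabetical order
toy_by_grazing <- arrange(toy, desc(Grazing))

toy_desc$Fruit   # 92.4 85.1 62.3 45.5 18.7 15.0
```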
TopTip 1: you can use more than one dplyr command in one line of code.
Use select() and filter() in one line of code to pick out the Root column, but only for those rows where the fruit production is > 80. Put the result into an object called largeRoot
largeRoot<-select(filter(compensation,Fruit>80),Root)
TopTip 2: Use the pipe symbol %>%
We can do the same thing using the really handy pipe symbol %>% like this:
largeFruit<-compensation %>%
filter(Fruit>80) %>%
select(Root)
You can think of the pipe symbol as meaning ‘and then’. It feeds the result of each stage into the next stage. Check that this gives the same result as the select(), filter() combination in one line of code. Which is easier to read?
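One way to make this check is with identical(), which returns TRUE when two objects are exactly the same. The sketch below does this on a made-up data frame; the names toy, nested and piped are invented and the values are not the real compensation data.

```r
library(dplyr)

# Made-up stand-in for the compensation data (NOT the real values)
toy <- tibble(
  Root    = c(6.2, 7.5, 8.1, 5.9, 9.4, 10.2),
  Fruit   = c(15.0, 45.5, 62.3, 18.7, 85.1, 92.4),
  Grazing = c("Ungrazed", "Ungrazed", "Ungrazed", "Grazed", "Grazed", "Grazed")
)

# Nested version: read from the inside out (filter first, then select)
nested <- select(filter(toy, Fruit > 80), Root)

# Piped version: read from top to bottom (filter, and then select)
piped <- toy %>%
  filter(Fruit > 80) %>%
  select(Root)

# Are the two results exactly the same?
same_result <- identical(nested, piped)
same_result
```

You can make the same comparison on your own largeRoot and largeFruit objects.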
Grouping and summarising
summary() gave us global means for Root and Fruit. But what if we want to know whether these means differ depending on the grazing conditions?
A combination of group_by() and summarise() can be used to do this:
Use group_by() and summarise() to find the means of Root and Fruit in both Grazed and Ungrazed
means<-compensation %>%
group_by(Grazing) %>%
summarise(mean(Root),mean(Fruit))
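As written, the new columns are named after the expressions that made them, such as mean(Root), which is awkward to type later. You can give them names of your own inside summarise(). Here is a minimal sketch on a made-up data frame; the names toy, toy_means, meanRoot and meanFruit are invented and the values are not the real compensation data.

```r
library(dplyr)

# Made-up stand-in for the compensation data (NOT the real values)
toy <- tibble(
  Root    = c(6.2, 7.5, 8.1, 5.9, 9.4, 10.2),
  Fruit   = c(15.0, 45.5, 62.3, 18.7, 85.1, 92.4),
  Grazing = c("Ungrazed", "Ungrazed", "Ungrazed", "Grazed", "Grazed", "Grazed")
)

# One row per grazing treatment, with named summary columns
toy_means <- toy %>%
  group_by(Grazing) %>%
  summarise(meanRoot = mean(Root), meanFruit = mean(Fruit))

names(toy_means)   # "Grazing" "meanRoot" "meanFruit"
toy_means
```

The result has one row for each level of Grazing, so here it has two rows.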
