dplyr exercises

Adapted and extended from Chapter 3 of Getting Started with R by Beckerman, Childs and Petchey

Data Management, manipulation and exploration using functions from `dplyr`

In this exercise we will use the compensation data set, which has 40 observations of the root stock mass and mass of fruit harvested, for apple trees in both grazed and ungrazed conditions.

As before, we emphasise a workflow where you use Projects and make heavy use of the tidyverse. You don’t have to do either of these things in your R life, but that life is much easier in the long run, and mostly in the short run too, if you do.

We will see how a few functions from the dplyr package within tidyverse enable us to select portions of the data, or to manipulate it in some way. This stage of data analysis is often the most time-consuming. The dplyr functions make this phase of your analysis as straightforward as it could possibly be in R.

Handy References

Get the data

Find the compensation dataset in the Teams/Files/RStuff/data folder or elsewhere and save it into your own RStuff/data folder.

Open up your RStuff Project

This is how you always begin. In RStudio, do File/Open Project then navigate to your RStuff folder and click on the file RStuff.Rproj.

You should find that you end up with nothing in your Environment Pane (that’s a good thing, it means that R’s brain is clear) and that the Files tab in the bottom right window shows you the contents of your RStuff folder.

New script

Start a new notebook script and save it into your RStuff/scripts folder with whatever name you like. You could call it war_and_peace (but not war and peace, because spaces in file names cause trouble in R), but something like dplyr_exercise would be more suitable.

Normally each R Markdown document is composed of three main components: 1) a YAML header, 2) formatted text and 3) one or more code chunks.

The YAML header contains some metadata about the script. It is the few lines of text at the top between two lines of three dashes: ---. Eventually you may want to alter this in all sorts of ways, and you can do this any time, but for now you might just want to alter the title and add lines for author: and date:. You should never delete this section entirely. Beneath it is where you write your script. Begin by deleting all the exemplar text that is presented to you on first creating a new notebook. All of it, so that all you have beneath the YAML header is white space.

We will now write our script code-chunk by code-chunk, interspersing these with helpful commentary that will at least serve to remind us what each chunk does, but could also be extensive blocks of text if we wanted that, all of it formatted according to the simple rules of Markdown, which you can find in the Help menu.

At any point you can ‘knit’ or render your script by pressing the Preview/Knit button at the top of the script pane. The text you have written will then be formatted according to the rules of Markdown, which you can call up at any time from the Help/Markdown Quick Reference menu.

This very document that you are reading now is the rendered version of a .Rmd script such as you are about to write, so you can see what kinds of formatting are possible.

The very first code chunk in your script

When you render your script, R will include all the messages and warnings that are normally printed out in the console window when you run lines of code. Normally, we do not want to see these in our fancy, rendered document. We can suppress them for all code chunks, and affect the behaviour of all chunks in other ways, by including this ‘set-up’ chunk first in our script:

```{r, echo=FALSE}
knitr::opts_chunk$set(message=FALSE,warning=FALSE,echo=TRUE)
```

Do you see that within the curly brackets, after the r, we have written echo=FALSE? This is an example of a ‘chunk option’ that affects the way this chunk is rendered and other aspects of its output behaviour. There are dozens of these options. You can read about them here. Many are useful from time to time, but normally you don’t have to worry about them. Here we have written echo=FALSE because we don’t want this code chunk to be visible in the rendered document. The default behaviour is echo=TRUE.

To have to write this and other chunk options at the top of every code chunk would be tedious. This set-up chunk allows you to set options globally for your whole document. You can override these globally set options for any individual chunk just by including whatever options you want at the top of that chunk.

Apart from suppressing warnings and messages as we have here, a common decision is as to whether to show your code in the rendered document or not. That decision usually depends on who the document is to be read by. Managers? The Public? Maybe leave the code out. Research collaborators? Maybe leave it in. Either way, you achieve what you want by including the option echo=TRUE (code left in) or echo=FALSE (code left out) in the set-up chunk.

Having included this chunk, try altering the various TRUE/FALSE settings and see what difference that makes to the rendered version of your document.

Set the working directory to be your RStuff/data folder

Or maybe not… Ha! You don’t need to do this because you are working within a Project. The working directory is automatically set to be the top level of the Project. Check that this is so by typing getwd() into the console.

Load the `tidyverse` and `here` packages into your session

After the set-up chunk, we often start a script by loading whatever packages we want to use, all in one code chunk. We will always use the tidyverse and here packages, so start by writing this header and code chunk:

### Load packages
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
library(here)
library(kableExtra)
```

We have not met the kableExtra package yet, but will use it lower down in this worksheet. It is for creating nice tables.

To implement these lines, select them then press Ctrl-Enter, or Cmd-Enter if you are on a Mac, or press the little green arrow at the top right of the chunk. To implement them line by line, just place the cursor anywhere in a line and press Ctrl/Cmd-Enter.

If any of these lines throw an error, it will most likely be because you have not yet installed one or more of these packages. The error message will tell you which. If this happens you need to type install.packages('tidyverse') or install.packages('here') or install.packages('kableExtra') into your console window. You do it there rather than in your script because you only need to do this successfully once, not every time you run your script.

This will install tidyverse and/or here or kableExtra onto your machine. You should then run the library lines again.
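If you would rather find out in one go which packages are missing, something like the following sketch could be run in the console. The object names wanted, installed and missing_pkgs are just illustrative; requireNamespace() is a base R function that reports whether a package is installed without attaching it.

```{r}
# The packages this worksheet uses
wanted <- c("tidyverse", "here", "kableExtra")

# TRUE for each package that is already installed, FALSE otherwise
installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)

# The names of any packages that still need installing
missing_pkgs <- wanted[!installed]
missing_pkgs

# If any are listed, install them once from the console, for example:
# install.packages(missing_pkgs)
```

The install line is left commented out so that nothing is installed by accident when the chunk is run.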

Import the compensation data set into a data frame called ‘compensation’

```{r}
filepath<-here("data","compensation.csv")
compensation<-read_csv(filepath)
```

This is where we make use of the here package and exploit our decision to have designated the RStuff directory as a Project. The here() function treats the top level of the Project as ‘here’, so it does not matter where that Project sits on your computer. To find your data file you just need to give here() the hierarchical sequence of directories within your Project that leads to the file, separated by commas. In our case we just had to go into the data folder, and then there was the file we wanted, compensation.csv. Have you any idea how much more opaque your code becomes if you want both to find this file AND to organise your RStuff folder sensibly without the use of here()? If not, take note: here() takes away a lot of the need for a ‘techie’ understanding of how to move between folders. It also helps you to write code for finding files that will still work when you reorganise your computer or share your Project with someone else.
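To see what here() is doing, you could try a sketch like this one in your own Project. The paths printed will depend on where your Project lives, so no output is shown here; file.exists() is a base R function that simply checks whether the file is where we think it is.

```{r}
library(here)

# With no arguments, here() returns the top level of the current Project
here()

# Each argument is one step further down the folder hierarchy
filepath <- here("data", "compensation.csv")
filepath

# TRUE if the file really is there, FALSE if not
file.exists(filepath)
```

If the last line gives FALSE, check the spelling of the folder and file names before trying read_csv().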

Inspect the data

Having entered data into R we should inspect it, usually in several ways, to check that all is as we expect, to see if there is anything we need to take note of such as missing values, maybe even to get a sense of the statistics of the data.

One basic and useful way to begin this process is to see how many columns and rows there are, what each column is called and what kind of data they each contain:

```{r}
glimpse(compensation)
```

Note that you can also inspect the data by clicking on the arrow against its name in the Environment pane.

How many rows have you got?
How many columns?
What are the columns called?
What kinds of data do they contain?

<dbl> is R-speak for numerical data, not necessarily integers.
<chr> is R-speak for text.

We might then use summary() to find the mean values of the Fruit and Root columns.

```{r}
summary(compensation)
```

summary() gives useful summary statistics of each column of numerical data within a data set. It is no use for columns of categorical data, such as the Grazing column here.

What though if we had wanted the Root and Fruit summaries for each grazing condition? That is where the pair of functions group_by() and summarise() from the dplyr package come in. We get to those lower down, but here is a taster of how they might be used, together with what is called the ‘pipe’ operator %>%.

```{r}
compensation %>%
  group_by(Grazing) %>%
  summarise(mean_Root=mean(Root),mean_Fruit=mean(Fruit)) %>%
  kbl(digits=2) %>% # this and the next line give a nice tabular output.
  kable_styling(full_width=F)
```

Subsetting the data

dplyr provides several functions that let you extract subsets of a larger data set.

Select a subset of columns

select() allows you to choose or exclude whichever columns you want.

Use `select()` to pick out the Fruit column.

Save the output as an object called ‘Fruits’.

```{r}
Fruits<-select(compensation,Fruit)
glimpse(Fruits)
```

As is always the case with functions from dplyr and throughout the tidyverse, the first argument of select() is the name of the data frame from which you want to select some columns. This is followed, in this case, by the name(s) of the column(s) you wish to select.

Use `select()` to pick out all the columns except the Root column.

select() will leave out any column prefaced with a minus sign (-).

Save the output as an object called ‘notRoot’.

```{r}
notRoot<-select(compensation,-Root)
glimpse(notRoot)
```

select() is most useful when you have a data set with many columns and you only want a few of them. This commonly happens when you download data from a publicly curated source such as NBN, where you may receive dozens of columns but only need a few of them for your particular study.
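You can name several columns in one call to select(), whether keeping or dropping them. Here is a minimal sketch using a small made-up data frame called toy, which has the same column names as compensation but invented values:

```{r}
library(dplyr)

# A made-up stand-in for compensation (values are invented)
toy <- tibble(
  Root    = c(6.2, 7.1, 8.4, 9.0),
  Fruit   = c(59.8, 60.9, 81.2, 85.0),
  Grazing = c("Ungrazed", "Ungrazed", "Grazed", "Grazed")
)

# Keep two named columns, in the order given
fruit_grazing <- select(toy, Grazing, Fruit)
names(fruit_grazing)

# Drop two columns at once
only_fruit <- select(toy, -Root, -Grazing)
names(only_fruit)
```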

Choose particular rows

We use slice() to pick out particular rows (but see also filter(), which I find more useful than slice() in practice).

Use `slice()` to grab the second row of compensation

```{r}
row_2<-slice(compensation,2)
row_2
```

Use `slice()` to grab the second to the 10th rows

```{r}
row_2_to_10<-slice(compensation,2:10)
row_2_to_10
```

Use `slice()` to grab rows 2, 3 and 10.

```{r}
row_2310<-slice(compensation,2,3,10)
row_2310
```

What kind of object does slice() return?
Are the row numbers still 2, 3, and 10?
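One way to explore these two questions is with a small made-up data frame that carries its own id column, so that you can see where each row came from. This is only a sketch; toy and picked are illustrative names.

```{r}
library(dplyr)

# Ten made-up rows, with id recording each row's original position
toy <- tibble(id = 1:10, Fruit = seq(10, 100, by = 10))

picked <- slice(toy, c(2, 3, 10))

class(picked)   # the classes of the object that slice() returns
nrow(picked)    # how many rows we now have
picked          # compare the printed row numbers with the id column
```

Passing the row numbers as a single vector, c(2, 3, 10), does the same job as listing them separately.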

Choose rows that satisfy some condition

slice() is useful, but often we want to pick out rows according to one or more conditions being satisfied by the values in one or more columns. We use filter() to do this. For example, we might only want those rows corresponding to particular species, or sites, as determined by the values in a Species or Sites column.

Use `filter()` to pick out those rows for which Fruit is greater than 50.

```{r}
big_fruit<-filter(compensation,Fruit>50)
glimpse(big_fruit)
```

Use `filter()` and the logical `OR` symbol ‘|’ to pick rows where Fruit is greater than 80 or less than 20

```{r}
extreme_fruit<-filter(compensation,Fruit>80 | Fruit<20)
glimpse(extreme_fruit)
```

Use `filter()` and the logical `AND` symbol ‘&’ to pick rows where Fruit is less than 80 AND greater than 20

```{r}
medium_fruit<-filter(compensation,Fruit<80 & Fruit>20)
glimpse(medium_fruit)
```

Use `filter()` to pick out only the grazed fruits

```{r}
grazed_fruit<-filter(compensation,Grazing=="Grazed")
glimpse(grazed_fruit)
```

Note that we have saved each of these selections to a named object, so now could use them, if we wanted to.

In these filter() examples, we have seen the use of comparison and logical operators:

`>`: greater than
`<`: less than
`>=`: greater than or equal to
`<=`: less than or equal to
`==`: equal to
`&`: AND
`|`: OR
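The difference between > and >= only shows up when a value sits exactly on the boundary. Here is a minimal sketch with made-up values, two of which are exactly 20 and 80:

```{r}
library(dplyr)

# Made-up values, including two that sit exactly on the boundaries
toy <- tibble(
  Fruit   = c(14.7, 20, 59.8, 80, 95.1),
  Grazing = c("Grazed", "Ungrazed", "Ungrazed", "Grazed", "Grazed")
)

# >= and <= include the boundary values 20 and 80; > and < would not
mid_fruit <- filter(toy, Fruit >= 20 & Fruit <= 80)
mid_fruit

# == tests for equality. A single = is treated as naming an argument,
# and filter() stops with an error.
grazed_or_small <- filter(toy, Grazing == "Grazed" | Fruit < 20)
grazed_or_small
```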

Transforming data using `mutate()`

First, use head() to look at the first 6 rows of the compensation data frame.
How many columns are there?

```{r}
head(compensation)
```

Use mutate() to create an additional column called logFruit which is the natural log of the Fruit column.
Doing a log-transform of data is often a useful trick in data preparation.

```{r}
compensation<-mutate(compensation,logFruit=log(Fruit))
```

Now use head() to look again at the first 6 rows of your data. How many columns are there now?

```{r}
head(compensation)
```

Have you changed the original data file? The answer is no. We have changed the data frame compensation, but the data file itself is untouched. That is a major advantage of doing your data analysis in R rather than in Excel, for example, where you would often be working in and altering your original data file, with all the possibilities of data loss that that entails.
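If you want to keep the original data frame unchanged as well, you can save the output of mutate() under a new name. This sketch uses made-up values and also shows that mutate() can create more than one column in a single call; toy and toy_logged are illustrative names.

```{r}
library(dplyr)

# Made-up values chosen so that the base-10 logs are easy to check
toy <- tibble(Fruit = c(10, 100, 1000))

# Two new columns at once, saved to a new object
toy_logged <- mutate(toy,
                     logFruit   = log(Fruit),    # natural log
                     log10Fruit = log10(Fruit))  # log to base 10

names(toy)         # the original still has just the one column
names(toy_logged)  # the new object has three
```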

Sorting

Use `arrange()` to sort the data by the Fruit column in ascending order.

```{r}
comp_fruit_ascending<-arrange(compensation,Fruit)
head(comp_fruit_ascending)
```

Use `arrange()` to sort the data by the Fruit column in descending order

```{r}
comp_fruit_descending<-arrange(compensation,-Fruit)
head(comp_fruit_descending)
```
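The minus sign works here because Fruit is numeric. dplyr also provides desc(), which does the same job and works on text columns too. Here is a minimal sketch with made-up values:

```{r}
library(dplyr)

# Made-up values for illustration
toy <- tibble(
  Fruit   = c(59.8, 14.7, 95.1),
  Grazing = c("Ungrazed", "Grazed", "Grazed")
)

# Same result as arrange(toy, -Fruit)
by_fruit <- arrange(toy, desc(Fruit))
by_fruit

# Sort by Grazing in reverse alphabetical order, then by Fruit ascending
by_grazing <- arrange(toy, desc(Grazing), Fruit)
by_grazing
```

Giving arrange() more than one column sorts by the first, then uses the later ones to order rows that tie on it.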

Top Tips

Top Tip 1: Get the hang of and use the pipe symbol `%>%`

We mentioned the pipe operator %>% above. Let’s see how it helps us string together a sequence of dplyr operations, as we might often want to do in preparing a dataset.

```{r}
largeRoot<-compensation %>% # create an object called largeRoot. Start from compensation and then...
  filter(Fruit>80) %>% # keep only those rows where Fruit is > 80 and then....
  filter(Grazing=="Grazed") %>% # keep only rows where grazing occurred and then....
  select(Root) # keep only the Root column
```

So we should end up with an object called largeRoot which contains a single column of Root values, for those trees where the Fruit value was greater than 80 and where the grazing condition was “Grazed”.

You can think of the pipe symbol as meaning ‘and then’. It feeds the result of each line of code into the next line. You use it with tidyverse functions. The ones we use here all act on data frames and produce data frames as output, so a data frame is always fed to the next line in a sequence such as we see above. Normally, the first argument of any tidyverse function is also a data frame, but since with the pipe operator each line is being fed a data frame, that argument need not be explicitly included. It is assumed to be whatever data frame was the result of the previous line.

This use of the pipe operator is very common in data analysis and is very, very useful. I strongly advise you to get the hang of it and to use it in your own R code.
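To see what the pipe saves you from, compare the same sequence of operations written with nested function calls and with pipes. This is a sketch using made-up values in a data frame called toy; nested and piped are illustrative names.

```{r}
library(dplyr)

# Made-up stand-in for compensation
toy <- tibble(
  Root    = c(6.2, 7.1, 8.4, 9.0, 5.5, 10.3),
  Fruit   = c(59.8, 60.9, 81.2, 85.0, 14.7, 95.1),
  Grazing = c("Ungrazed", "Ungrazed", "Ungrazed", "Grazed", "Grazed", "Grazed")
)

# Without the pipe: the calls are nested and must be read from the inside out
nested <- select(filter(filter(toy, Fruit > 80), Grazing == "Grazed"), Root)

# With the pipe: the same steps, read from top to bottom
piped <- toy %>%
  filter(Fruit > 80) %>%
  filter(Grazing == "Grazed") %>%
  select(Root)

# Both routes give the same data frame
identical(nested, piped)
```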

Top Tip 2:

Get the hang of and use the pipe symbol %>% :)

Grouping and summarising

This is where we can do in R what pivot tables do for you in Excel.

summary() gave us global means for Root and Fruit. But what if we want to know the mean of each under each grazing condition?

A combination of group_by() and summarise() can be used to do this:

Use `group_by()` and `summarise()` to find the means of Root and Fruit in both Grazed and Ungrazed

```{r}
compensation %>%
  group_by(Grazing) %>%
  summarise(Mean_Root=mean(Root),Mean_Fruit=mean(Fruit))

```

Read this as: Take the compensation data frame and then group it by the value of Grazing and then summarise each group by finding the mean of the Root variable and the mean of the Fruit variable.

This produces an output table. For you, the researcher working through your data, this table is fine, but for your rendered document you might want a prettier version of the table, with some control over how it looks and especially over the precision with which numerical data are displayed. We can use the kbl() and kable_styling() functions from the kableExtra package for that, for example like this:

```{r}
compensation %>%
  group_by(Grazing) %>%
  summarise(Mean_Root=mean(Root),Mean_Fruit=mean(Fruit)) %>%
  kbl(digits=2) %>%
  kable_styling(full_width=F)
```

Exercises

A package is a collection of functions, data sets and help documentation that adds to the capabilities of base R. But R itself comes with many built-in data sets that can be very useful for practice. One of them is the famous iris dataset collected by Anderson in 1935. It contains 150 records of 3 species of iris, for each of which the petal length and width and the sepal length and width are recorded.

Investigation of the Iris dataset

(a)

Although this data set can be obtained from within R, let us practise getting it from our data directory, just as we would have to for our own data.

Start a new notebook and save it into your scripts folder under a suitable name. iris? As before, delete all text beneath the YAML header and include author and date lines in it. Replace the title with something suitable.

(b)

Using Ctrl-Alt-I, or Cmd-Option-I on a Mac, include a new code chunk to load up the tidyverse and here packages.

```{r}
library(tidyverse)
library(here)
```

Add a suitable header above the chunk, prefaced with ###, to remind you what the chunk does. Run the chunk by pressing the green arrow at its top right. You can run all other chunks in this way as you include them.

(c)

Include a new chunk to load the iris.csv dataset into an object called iris. Use the glimpse() function to inspect it.

```{r}
filepath<-here("data","iris.csv")
iris<-read_csv(filepath)
glimpse(iris)
```

What type of object is iris? How many rows and columns are there? What type of data does each column contain? Do you see the iris object in the Environment pane?

(d)

Without creating any intermediate objects, and using the pipe operator %>%, create a summary table that gives the mean and standard error of the Petal.Length column, for each species.

```{r}
iris %>%
  group_by(Species) %>%
  summarise(mean.petal.length=mean(Petal.Length),se.petal.length=sd(Petal.Length)/sqrt(n()))
```

Summary tables of this sort are very useful.

The standard error of the mean of a column of numerical data is its standard deviation divided by the square root of the number of data points. Here we use n() to tell us how many data points there are for each species.
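To make that calculation explicit, here is a sketch that keeps the count and the standard deviation as columns of their own before combining them. It uses the copy of iris that is built into R, referred to as datasets::iris, on the assumption that your iris.csv has the same column names; petal_summary is an illustrative name.

```{r}
library(dplyr)

petal_summary <- datasets::iris %>%
  group_by(Species) %>%
  summarise(
    n                 = n(),                       # data points per species
    mean.petal.length = mean(Petal.Length),
    sd.petal.length   = sd(Petal.Length),
    se.petal.length   = sd.petal.length / sqrt(n)  # standard error of the mean
  )
petal_summary
```

Within one call to summarise(), a column can be built from columns defined earlier in the same call, which is what allows se.petal.length to use n and sd.petal.length.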

(e)

Repeat the last part, but use the kableExtra package to improve the table:

```{r}
library(kableExtra)
iris %>%
  group_by(Species) %>%
  summarise(mean.petal.length=mean(Petal.Length),se.petal.length=sd(Petal.Length)/sqrt(n())) %>%
  kbl(digits=2) %>%
  kable_styling(full_width=F)
```

You may first need to install the kableExtra package using install.packages("kableExtra") in your console window.

(f)

Create new columns which give the ratio of petal length to petal width, and of sepal length to sepal width (and so are a measure of shape). Save just these two columns and the Species column to a new object called iris_shape.

```{r}
iris_shape<-iris %>%
  mutate(Petal.Shape=Petal.Length/Petal.Width) %>%
  mutate(Sepal.Shape=Sepal.Length/Sepal.Width) %>%
  select(Species,Petal.Shape,Sepal.Shape)
glimpse(iris_shape)
```

(g)

Use the ggplot2 package from within tidyverse to plot the shape data in various ways:

Box plot

#### Very basic box plot
```{r}
iris_shape %>%
  ggplot(aes(x=Species,y=Petal.Shape,fill=Species)) +
  geom_boxplot()
```

#### Slightly prettier box plot
```{r}
iris_shape %>%
  ggplot(aes(x=Species,y=Petal.Shape,fill=Species)) +
  geom_boxplot() +
  labs(x="Species",
       y="Petal Shape") +
  theme_classic() +
  theme(legend.position="none")
```

Try leaving out either or both of the theme lines. What difference does that make?

Scatter plot

```{r}
iris_shape %>%
  ggplot(aes(x=Petal.Shape,y=Sepal.Shape,colour=Species)) +
  geom_point() +
  labs(x="Petal Shape",
       y="Sepal Shape") +
  theme_classic()
```

Faceted scatter plot

```{r}
iris_shape %>%
  ggplot(aes(x=Petal.Shape,y=Sepal.Shape,colour=Species)) +
  geom_point() +
  labs(x="Petal Shape",
       y="Sepal Shape") +
  facet_wrap(~Species,scales="free") +
  theme_classic()
```

Faceting, using the facet_wrap() line, is very useful for producing a series of related plots in the same style.

Conclusion

We have seen how a few functions from the dplyr package provide most of what you need when it comes to preparing your data for plotting and analysis.

We looked at the following functions:

select() to pick out or exclude certain columns.
slice() to choose certain rows by position.
filter() to choose or exclude certain rows according to one or more criteria.
mutate() to alter existing columns or create new ones, according to some calculation or criterion.
arrange() to sort data in ascending or descending order.
group_by() and summarise(), which when used together give us summary statistics for each group in the data set.

We should also mention pivot_longer(), from the tidyr package within tidyverse, which can be used to tidy data.
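As a taster, here is a minimal sketch of pivot_longer() at work on a tiny made-up data frame. The tree column and the names measure and value are invented for this illustration.

```{r}
library(dplyr)
library(tidyr)

# A made-up 'wide' table: one row per tree, one column per measurement
wide <- tibble(
  tree  = c("A", "B"),
  Root  = c(6.2, 9.0),
  Fruit = c(59.8, 85.0)
)

# Gather the two measurement columns into name/value pairs
long <- wide %>%
  pivot_longer(cols = c(Root, Fruit),
               names_to = "measure",
               values_to = "value")
long
```

Each tree now has one row per measurement, which is often the layout that group_by() and ggplot() find easiest to work with.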

We also saw how to use the very useful ‘pipe’ symbol %>% that allows us to run lines of code together and carry out a sequence of operations without having to create intermediate objects. It also makes the code easier to read.

Top Tip 3: Get the hang of and use the %>% pipe symbol. No, really.

Michael Hunt

23-01-2021