Adapted and extended from Chapter 3 of Getting Started with R by Beckerman, Childs and Petchey

Data Management, manipulation and exploration using functions from dplyr

In this exercise we will use the compensation data set, which has 40 observations of the root stock mass and mass of fruit harvested, for apple trees in both grazed and ungrazed conditions.

As before, we emphasise a workflow where you use Projects and make heavy use of the tidyverse. You don’t have to do either of these things in your R life, but that life is much easier in the long run, and mostly in the short run too, if you do.

We will see how a few functions from the dplyr package within tidyverse enable us to select portions of the data, or to manipulate it in some way. This stage of data analysis is often the most time consuming. The dplyr functions make this phase of your analysis as straightforward to do in R as it could possibly be.

Handy References

Get the data

Find the compensation dataset in the Teams/Files/RStuff/data folder or elsewhere and save it into your own RStuff/data folder.

Open up your Project folder

This is how you always begin. In RStudio, do File/Open Project then navigate to your Project folder and click on the file Name_of_my_project.RProj, or whatever the name of your project folder actually is.

You should find that you end up with nothing in your Environment Pane (that’s a good thing, it means that R’s brain is clear) and that the Files tab in the bottom right window shows you the contents of your Project folder.

New script

Start a new notebook and save it into your Project/scripts folder with whatever name you like. You could call it war_and_peace (but not war and peace because R does not like spaces in titles), but something like dplyr_exercise would be more suitable.

Normally each R markdown document is composed of 3 main components, 1) a YAML header, 2) formatted text and 3) one or more code chunks.

The YAML header contains some metadata about the script. It is the few lines of text at the top between two lines of three dashes: ---. Eventually you may want to alter this in all sorts of ways, and you can do this any time, but for now you might just want to alter the title to something that tells you what this script does (in a month or two you may well have forgotten!) and add lines for author: and date:. You should never delete this section entirely. Beneath it is where you write your script. Begin by deleting all the exemplar text that is presented to you on first creating a new notebook. All of it, so that all you have beneath the YAML header is white space.

We will now write our script code-chunk by code-chunk, interspersing these with helpful commentary that will at least serve to remind us what each chunk does, but could also be extensive blocks of text if we wanted that, all of it formatted according to the simple rules of Markdown, which you can find in the Help menu.

At any point you can ‘knit’ or render your script by pressing the Preview/Knit button at the top of the script pane. The text you have written will be formatted according to the rules of Markdown which, remember, you can see at a glance by calling up Help/Markdown Quick Reference.

This very document that you are reading now is the rendered version of a .rmd script such as you are about to write, so you can see what kinds of formatting are possible.

Note though that when you work on a script for the analysis of your own data, you rarely need to knit it. You could just go through it and run it chunk by chunk.

The very first code chunk in your script

If you do render your script, R will include all the messages and warnings that are normally printed out in the console window when you run lines of code. Normally, we do not want to see these in our fancy, rendered document. We can suppress them for all code chunks, and affect the behaviour of all chunks in other ways, by including this ‘set-up’ chunk first in our script:

```{r, echo=FALSE}
knitr::opts_chunk$set(message=FALSE,warning=FALSE,echo=TRUE)
```

Do you see that within the curly brackets, after the r, we have written echo=FALSE? This is an example of a ‘chunk option’ that affects the way this chunk is rendered and other aspects of its output behaviour. There are dozens of these options. You can read about them here. Many are useful from time to time, but normally you don’t have to worry about them. Here we have written echo=FALSE because we don’t want this code chunk to be visible in the rendered document. The default behaviour is echo=TRUE.

To have to write this and other chunk options at the top of every code chunk would be tedious. This set-up chunk allows you to set options globally for your whole document. You can override these globally set options for any individual chunk just by including whatever options you want at the top of that chunk.
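As a minimal sketch of overriding a global option for one chunk, the chunk below sets echo=FALSE in its own header. When the document is knitted, we would expect its code to be hidden while its result is still shown, even though the set-up chunk has echo=TRUE. The object toy_values and its numbers are made up purely for illustration.

```{r, echo=FALSE}
# Illustrative only: echo=FALSE in this chunk's header overrides the global
# echo=TRUE set in the set-up chunk, for this chunk alone.
toy_values <- c(6.2, 7.1, 8.4)  # hypothetical values, not from the compensation data
mean(toy_values)
```

Any other chunk that does not specify echo itself would continue to follow the global setting.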

Apart from suppressing warnings and messages as we have here, a common decision is as to whether to show your code in the rendered document or not. That decision usually depends on who the document is to be read by. Managers? The Public? Maybe leave the code out. Research collaborators? Maybe leave it in. Either way, you achieve what you want by including the option echo=TRUE (code left in) or echo=FALSE (code left out) in the set-up chunk.

Having included this chunk, try altering the various TRUE/FALSE settings and see what difference that makes to the rendered version of your document.

Set the working directory to be your Rstuff/data folder

Or maybe not… Ha! You don’t need to do this because you are working within a Project. The working directory is automatically set to be the top level of the Project. Check that this is so by typing getwd() into the console. Does ‘working directory’ mean nothing to you? Don’t worry. Our use of the here package means that this concept, which you might read about elsewhere, is something you do not need to worry about.

Load the tidyverse and here packages into your session

After the set-up chunk, we often start a script by loading whatever packages we want to use, all in one code chunk. We will always use the tidyverse and here packages, so start by writing this header and code chunk:

### Load packages
```{r}
library(tidyverse)
library(here)
```

To implement these lines, select them then press Ctrl-Enter, or Cmd-Enter if you are on a Mac, or press the little green arrow at the top right of the chunk. To implement them line by line, just place the cursor anywhere in a line and press Ctrl/Cmd-Enter.

If any of these lines throw an error, it will most likely be because you have not yet installed one or more of these packages. The error message will tell you which. If this happens you need to type install.packages('tidyverse') or install.packages('here') into your console window. You do it there rather than in your script because you only need to do this successfully once, not every time you run your script.

This will install tidyverse and/or here onto your machine. You should then run the library lines again.

Import the compensation data set into a data frame called ‘compensation’

```{r}
filepath<-here("data","compensation.csv")
compensation<-read_csv(filepath)
```

This is where we make use of the here package and exploit our decision to designate the RStuff directory as a Project. The here() function thinks that ‘here’ is the top level of the Project, so it no longer matters where that Project sits on your computer. To find your data file you just need to give here() the sequence of directories within your Project that leads to the file, followed by the file name, each in quotes and separated by commas. In our case we just had to go into the data folder, and there was the file we wanted, compensation.csv. Without here(), the R code you would need to find this file while keeping your RStuff folder sensibly organised is considerably more opaque. here() takes away much of the need for a ‘techie’ understanding of how to move between folders. It also helps you to write robust code for finding files that will still work when you reorganise your computer or share your Project with someone else who wants to run it on their machine.
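If you are curious about what here() actually produces, the following sketch prints the path it builds and checks whether a file exists at that location. It assumes the same data folder and file name as above; the path you see will depend on where your own Project lives.

```{r}
library(here)

# here() joins its arguments onto the top level of the Project,
# using the right path separator for your operating system.
filepath <- here("data", "compensation.csv")

filepath               # the full path that here() has built
file.exists(filepath)  # is there really a file at that location?
```

If file.exists() reports FALSE, the usual culprits are a misspelt file name or a file saved into the wrong folder.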

Inspect the data

Having entered data into R we should inspect it, usually in several ways, to check that all is as we expect, to see if there is anything we need to take note of such as missing values, maybe even to get a sense of the statistics of the data.

One basic and useful way to begin this process is to see how many columns and rows there are, what each column is called and what kind of data they each contain:

```{r}
glimpse(compensation)
```

Note that you can also inspect the data by clicking on the arrow against its name in the Environment pane.

<dbl> is R-speak for numerical data, not necessarily integers.
<chr> is R-speak for text.

We could then use summary() to find the mean values of the Fruit and Root columns.

```{r}
summary(compensation)
```

summary() gives what can be useful overall summary statistics of each column of numerical data within a data set. It is no use however for columns of categorical data, such as the Grazing column here.

What though if we had wanted the Root and Fruit summaries for each grazing condition? That is where the pair of functions group_by() and summarise() from the dplyr package come in. We get to those lower down, but here is a taster of how they might be used, together with what is called the ‘pipe’ operator %>%.

```{r}
compensation %>%
  group_by(Grazing) %>%
  summarise(mean_Root=mean(Root),mean_Fruit=mean(Fruit))
```

Subsetting the data

dplyr provides several functions that let you extract subsets of a larger data set. All of them use a data frame as their first argument and produce a data frame as their output. Thus the output of one can always be used as the input of another, which allows us to chain together a sequence of these functions to perform a series of tasks.

Select a subset of columns

select() allows you to choose or exclude whichever columns you want.

Use select() to pick out the Fruit column.

In this code chunk we will start with the compensation data frame, do stuff to it and save the result to an object called Fruits.

```{r}
Fruits<-compensation %>%  # Create an object called `Fruits` by starting with the object `compensation` and then
  select(Fruit) # selecting from that the column called `Fruit`.
glimpse(Fruits) # Now let's have a look at this new object we have created.
```

As is always the case with functions from dplyr and throughout the tidyverse, the first argument of select() is the name of the data frame from which you want to select some columns. This is followed, in this case, by the name(s) of the column(s) you wish to select.

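In the way we have written the code, using the pipe operator, the data frame is fed to select() by the pipe, so we do not need to write it explicitly as the first argument.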

Use select() to pick out all the columns except the Root column.

select() will leave out any column prefaced with a minus sign (-).

Save the output as an object called ‘notRoot’.

```{r}
notRoot<-select(compensation,-Root)
glimpse(notRoot)
```

select() is most useful when you have a data set with many columns and you only want a few of them. This commonly happens when you download data from a publicly curated source such as NBN, where you may often receive a dataset with dozens of columns but only need a few of them for your particular study.
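Here is a small sketch of both uses of select() on a made-up data frame. The object toy_survey and its columns are invented for illustration and have nothing to do with the compensation data.

```{r}
library(dplyr)

# A made-up data frame with more columns than we need
toy_survey <- tibble(
  site    = c("A", "B", "C"),
  species = c("oak", "ash", "elm"),
  height  = c(12.1, 9.8, 15.3),
  notes   = c("", "windy", "")
)

kept    <- select(toy_survey, site, height)  # keep just two named columns
dropped <- select(toy_survey, -notes)        # keep everything except notes

names(kept)
names(dropped)
```

Naming the columns you want is usually the safer choice when a downloaded data set may gain or lose columns between versions.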

Choose particular rows

We use slice() to pick out particular rows (but see also filter(), which I find more useful than slice() in practice).

Use slice() to grab the second row of compensation
```{r}
row_2<-slice(compensation,2)
row_2
```

Another way to write this same code, but in better style, is to use the pipe operator, like this:

```{r}
row_2<-compensation %>%
  slice(2)
row_2
```

Use slice() to grab the second to the 10th rows

```{r}
row_2_to_10<-slice(compensation,2:10)
row_2_to_10
```

Use slice() to grab rows 2, 3 and 10.

```{r}
row_2310<-slice(compensation,2,3,10)
row_2310
```

What kind of object does slice() return?
Are the row numbers still 2, 3, and 10?
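One way to explore these questions is to try slice() on a small made-up data frame where each row carries its own label. This is only a sketch: toy_rows and its id column are invented so that you can see which of the original rows you have got back.

```{r}
library(dplyr)

toy_rows <- tibble(id = 1:10, value = (1:10) * 1.5)  # made-up data, id records the original row

picked <- toy_rows %>%
  slice(c(2, 3, 10))   # rows 2, 3 and 10, this time given as a single vector

class(picked)  # what kind of object is it?
nrow(picked)   # how many rows did we get?
picked$id      # which of the original rows are these?
picked         # and how are the rows numbered when it is printed?
```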

Choose rows that satisfy some condition

slice() is useful, but often we want to pick out rows according to one or more conditions being satisfied by the values in one or more columns. We use filter() to do this. For example, we might only want those rows corresponding to particular species, or sites, as determined by the values in a Species or Sites column.

Use filter() to pick out those rows for which Fruit is greater than 50.

```{r}
big_fruit<-filter(compensation,Fruit>50)
glimpse(big_fruit)
```

Use filter() and the logical OR symbol ‘|’ to pick rows where Fruit is greater than 80 or less than 20

```{r}
extreme_fruit<-filter(compensation,Fruit>80 | Fruit<20)
glimpse(extreme_fruit)
```

Use filter() and the logical AND symbol ‘&’ to pick rows where Fruit is less than 80 AND greater than 20

```{r}
medium_fruit<-filter(compensation,Fruit<80 & Fruit>20)
glimpse(medium_fruit)
```

Use filter() to pick out only the grazed fruits

```{r}
grazed_fruit<-filter(compensation,Grazing=="Grazed")
glimpse(grazed_fruit)
```

Note that we have saved each of these selections to a named object, so we could now use them if we wanted to.

In these filter() examples, we have seen the use of comparison and logical operators:

  • >: greater than
  • <: less than
  • >=: greater than or equal to
  • <=: less than or equal to
  • ==: equal to
  • &: AND
  • |: OR
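The operators we have not yet used, >= and <=, work in the same way. As a sketch, here they are on a made-up data frame, together with conditions that combine two different columns. The object toy_fruit and its values are invented for illustration.

```{r}
library(dplyr)

toy_fruit <- tibble(
  Fruit   = c(10, 20, 50, 80, 95),
  Grazing = c("Grazed", "Ungrazed", "Grazed", "Ungrazed", "Grazed")
)

# >= and <= include the boundary value itself, unlike > and <
at_least_80 <- filter(toy_fruit, Fruit >= 80)
at_most_20  <- filter(toy_fruit, Fruit <= 20)

# Conditions on different columns can be combined with & or |
big_and_grazed    <- filter(toy_fruit, Fruit > 40 & Grazing == "Grazed")
small_or_ungrazed <- filter(toy_fruit, Fruit < 15 | Grazing == "Ungrazed")
```

Note the double equals sign in Grazing == "Grazed". A single = there would not be a comparison, and filter() will complain about it.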

Transforming data using mutate()

First, use head() to look at the first 6 rows of your data compensation.
How many columns are there?

```{r}
head(compensation)
```

Use mutate() to create an additional column called logFruit which is the natural log of the Fruit column.
Doing a log-transform of data is often a useful trick in data preparation.

```{r}
compensation<-mutate(compensation,logFruit=log(Fruit))
```

Now use head() to look again at the first 6 rows of your data. How many columns are there now?

```{r}
head(compensation)
```

Have you changed the original data file? The answer is no. We have changed the data frame compensation but the data file itself is untouched. That is a major advantage of doing your data analysis in R rather than in Excel, for example, where you would often be working in and altering your original data file, with all the potential for data loss that that entails.
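In the same spirit, mutate() does not even alter the data frame you give it unless you assign the result back to the same name. A minimal sketch, using a made-up data frame called toy_log:

```{r}
library(dplyr)

toy_log <- tibble(Fruit = c(10, 100, 1000))  # made-up values

# Without assignment, mutate() just prints a new data frame ...
mutate(toy_log, logFruit = log(Fruit))
ncol(toy_log)      # ... and toy_log itself still has its original columns

# Assigning to a new name keeps both versions available
toy_logged <- mutate(toy_log, logFruit = log(Fruit))
ncol(toy_logged)
```

Saving the result under a new name, rather than overwriting the original, can be handy while you are still experimenting.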

Sorting

Use arrange() to sort the data by the Fruit column in ascending order.

```{r}
comp_fruit_ascending<-arrange(compensation,Fruit)
head(comp_fruit_ascending)
```

Use arrange() to sort the data by the Fruit column in descending order

```{r}
comp_fruit_descending<-arrange(compensation,-Fruit)
head(comp_fruit_descending)
```
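An alternative to the minus sign is the dplyr helper desc(). As a sketch on a made-up data frame (toy_sort is invented for illustration), it can be used on its own or to sort within groups:

```{r}
library(dplyr)

toy_sort <- tibble(
  Grazing = c("Ungrazed", "Grazed", "Ungrazed", "Grazed"),
  Fruit   = c(40, 75, 90, 20)
)

# desc() reverses the sort order and, unlike a minus sign, also works on text columns
by_fruit_desc <- arrange(toy_sort, desc(Fruit))

# Sort by Grazing first, then by descending Fruit within each grazing condition
by_group_then_fruit <- arrange(toy_sort, Grazing, desc(Fruit))

by_fruit_desc
by_group_then_fruit
```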

Top Tips

Top Tip 1: Get the hang of and use the pipe symbol %>%

We mentioned the pipe operator %>% above. Let’s see how it works in helping us string together a sequence of dplyr operations, as we might often want to do in preparing a dataset.

```{r}
largeRoot<-compensation %>% # create an object called largeRoot. Start from compensation and then...
  filter(Fruit>80) %>% # keep only those rows where Fruit is > 80 and then....
  filter(Grazing=="Grazed") %>% # keep only rows where grazing occurred and then....
  select(Root) # keep only the Root column
```

So we should end up with an object called largeRoot which contains a single column of Root values, for those trees where the Fruit value was greater than 80 and where the grazing condition was “Grazed”.

You can think of the pipe symbol as meaning ‘and then’. It feeds the result of each line of code into the next line. You use it with tidyverse functions. All of these act on and produce as output data frames, and so one of these is always fed to the next line in a sequence such as we see above. Normally, the first argument of any tidyverse function is also a data frame, but since with the pipe operator each line is being fed a data frame, that argument need not be explicitly included. It is assumed to be whatever data frame was the result of the previous line.
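To see what the pipe saves us from, here is a sketch that does the same two steps both ways on a made-up data frame. The object toy_pipe and its values are invented for illustration.

```{r}
library(dplyr)

toy_pipe <- tibble(
  Root    = c(5.1, 6.3, 7.8, 9.2),
  Fruit   = c(30, 85, 60, 95),
  Grazing = c("Grazed", "Grazed", "Ungrazed", "Grazed")
)

# Without the pipe: functions nested inside each other, read from the inside out
nested <- select(filter(toy_pipe, Fruit > 80), Root)

# With the pipe: the same steps, read from top to bottom
piped <- toy_pipe %>%
  filter(Fruit > 80) %>%
  select(Root)

identical(nested, piped)  # do the two versions give the same result?
```

With only two steps the nested version is still readable, but it quickly becomes hard to follow as more steps are added.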

This use of the pipe operator is very common in data analysis and is very, very useful. I strongly advise you to get the hang of it and to use it in your own R code.

Top Tip 2:

Get the hang of and use the pipe symbol %>% :)

Grouping and summarising

This is where we can do in R what pivot tables do for you in Excel.

summary() gave us global means for Root and Fruit. But what if we want to know the means for each grazing condition?

A combination of group_by() and summarise() can be used to do this:

Use group_by() and summarise() to find the means of Root and Fruit in both Grazed and Ungrazed

```{r}
compensation %>%
  group_by(Grazing) %>%
  summarise(Mean_Root=mean(Root),Mean_Fruit=mean(Fruit))

```
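If you want to convince yourself of what group_by() and summarise() are doing, a sketch like the following may help. It uses a made-up data frame, toy_groups, small enough that the group means can be checked by eye, and it also uses n() to count the rows in each group.

```{r}
library(dplyr)

toy_groups <- tibble(
  Grazing = c("Grazed", "Grazed", "Ungrazed", "Ungrazed"),
  Fruit   = c(60, 80, 40, 50)
)

# One row per group, one column per summary
group_means <- toy_groups %>%
  group_by(Grazing) %>%
  summarise(n_trees = n(), mean_Fruit = mean(Fruit))
group_means

# The mean for just one group, done the long way as a check
toy_groups %>%
  filter(Grazing == "Grazed") %>%
  summarise(mean_Fruit = mean(Fruit))
```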

Exercises


Packages are a collection of functions, data sets and help documentation that add to the capabilities of base R. But R itself comes with many built in data sets that can be very useful for practice. One of them is the famous iris dataset collected by Anderson in 1935. It contains 150 records of 3 species of iris, for each of which the petal length and width and the sepal length and width are recorded.

Investigation of the Iris dataset

(a)

Although this data set can be obtained from within R, let us practise getting it from our data directory, just as we would have to for our own data.

Start a new notebook and save it into your scripts folder under a suitable name. iris? As before, delete all text beneath the YAML header and include author and date lines in it. Replace the title with something suitable, i.e. something that will remind you what the script is for when you come back to it later.

(b)

Using Ctrl-alt-I or Cmd-alt-I on a Mac, include a new code chunk to load up the tidyverse and here packages.

```{r}
library(tidyverse)
library(here)
```

Add a suitable header above the chunk, prefaced with ### to remind you what the chunk does. Run the chunk by pressing the green arrow at its top right. You can run all other chunks as you include them in this way.

(c)

Include a new chunk to load the iris.csv dataset into an object called iris. Use the glimpse() function to inspect it.

```{r}
filepath<-here("data","iris.csv")
iris<-read_csv(filepath)
glimpse(iris)
```

What type of object is iris? How many rows and columns are there? What type of data does each column contain? Do you see the iris object in the Environment pane?

(d)

Without creating any intermediate objects, and using the pipe operator %>%, create a summary table that gives the mean and standard error of the Petal.Length column, for each species.

```{r}
iris %>%
  group_by(Species) %>%
  summarise(mean.petal.length=mean(Petal.Length),se.petal.length=sd(Petal.Length)/sqrt(n()))
```

Summary tables of this sort are very useful.

The standard error of the mean of a column of numerical data is its standard deviation divided by the square root of the number of data points. Here we use n() to tell us how many data points there are for each species.
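The arithmetic can be spelled out step by step in base R. This is a sketch on four made-up measurements, not on the iris data:

```{r}
toy_lengths <- c(4, 5, 6, 5)   # four made-up measurements

n_points <- length(toy_lengths)        # number of data points
std_dev  <- sd(toy_lengths)            # sample standard deviation
std_err  <- std_dev / sqrt(n_points)   # standard error of the mean

std_err
```

Inside summarise(), n() plays the role that length() plays here, but counts the rows in each group separately.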

(e)

Repeat the last part, but use the kableExtra package to improve the table:

```{r}
library(kableExtra)
iris %>%
  group_by(Species) %>%
  summarise(mean.petal.length=mean(Petal.Length),se.petal.length=sd(Petal.Length)/sqrt(n())) %>%
  kbl(digits=2) %>%
  kable_styling(full_width=FALSE)
```

You may first need to install the kableExtra package using install.packages("kableExtra") in your console window.

(f)

Create new columns which give the ratio of petal length to petal width, and of sepal length to sepal width (and so are a measure of shape). Save just these two columns and the Species column to a new object called iris_shape

```{r}
iris_shape<-iris %>%
  mutate(Petal.Shape=Petal.Length/Petal.Width) %>%
  mutate(Sepal.Shape=Sepal.Length/Sepal.Width) %>%
  select(Species,Petal.Shape,Sepal.Shape)
glimpse(iris_shape)
```
(g)

Use the ggplot2 package from within tidyverse to plot the shape data in various ways:

Box plot

#### Very basic box plot
```{r}
iris_shape %>%
  ggplot(aes(x=Species,y=Petal.Shape,fill=Species)) +
  geom_boxplot()
```
#### Slightly prettier box plot
```{r}
iris_shape %>%
  ggplot(aes(x=Species,y=Petal.Shape,fill=Species)) +
  geom_boxplot() +
  labs(x="Species",
       y="Petal Shape") +
  theme_classic() +
  theme(legend.position="none")
```

Try leaving out either or both of the theme lines. What difference does that make?

Scatter plot

```{r}
iris_shape %>%
  ggplot(aes(x=Petal.Shape,y=Sepal.Shape,colour=Species)) +
  geom_point() +
  labs(x="Petal Shape",
       y="Sepal Shape") +
  theme_classic()
```

Faceted scatter plot

```{r}
iris_shape %>%
  ggplot(aes(x=Petal.Shape,y=Sepal.Shape,colour=Species)) +
  geom_point() +
  labs(x="Petal Shape",
       y="Sepal Shape") +
  facet_wrap(~Species,scales="free") +
  theme_classic()
```

Faceting, using the facet_wrap() line, is very useful for producing a series of related plots in the same style.


Conclusion

We have seen how a few functions from the dplyr package provide most of what you need when it comes to preparing your data for plotting and analysis.

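We looked at the following functions:

  • select(): choose or exclude columns
  • slice(): pick out rows by position
  • filter(): pick out rows that satisfy a condition
  • mutate(): create new columns
  • arrange(): sort rows
  • group_by() and summarise(): calculate summaries for each group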

We should also mention pivot_longer() which can be used to tidy data.
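We have not used pivot_longer() in this exercise, so here is only a minimal sketch of the idea, on a made-up ‘wide’ data frame in which each year has its own column. The object toy_wide and the column names year and count are invented for illustration.

```{r}
library(tidyr)
library(dplyr)

# Made-up 'wide' data: one column per year
toy_wide <- tibble(
  site  = c("A", "B"),
  y2021 = c(10, 12),
  y2022 = c(11, 15)
)

# Gather the year columns into a pair of columns:
# one holding the old column names, one holding the values
toy_long <- toy_wide %>%
  pivot_longer(cols = c(y2021, y2022),
               names_to = "year",
               values_to = "count")
toy_long
```

The long version has one row per site per year, which is the shape that group_by(), summarise() and ggplot() generally expect.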

We also saw how to use the very useful ‘pipe’ symbol %>% that allows us to run lines of code together and carry out a sequence of operations without having to create intermediate objects. It also makes the code easier to read.

Top Tip 3: Get the hang of and use the %>% pipe symbol. No, really.