Adapted and extended from Chapter 3 of Getting Started with R by Beckerman, Childs and Petchey
dplyr
In this exercise we will use the compensation data set,
which has 40 observations of the root stock mass and mass of fruit
harvested, for apple trees in both grazed and ungrazed conditions.
As before, we emphasise a workflow where you use Projects and make
heavy use of the tidyverse. You don’t have to do
either of these things in your R life, but that life is much easier in
the long run, and mostly in the short run too, if you do.
We will see how a few functions from the dplyr package
within tidyverse enable us to select portions of the data,
or to manipulate it in some way. This stage of data analysis is often
the most time consuming. The dplyr functions make this
phase of your analysis as straightforward to do in R as it could
possibly be.
Find the compensation dataset in the
Teams/Files/RStuff/data folder or elsewhere and save it
into your own RStuff/data folder.
This is how you always begin. In RStudio, do File/Open Project then
navigate to your Project folder and click on the file
Name_of_my_project.RProj, or whatever the name of your
project folder actually is.
You should find that you end up with nothing in your Environment Pane (that’s a good thing, it means that R’s brain is clear) and that the Files tab in the bottom right window shows you the contents of your Project folder.
Start a new notebook and save it into your Project/scripts folder
with whatever name you like. You could call it
war_and_peace (but not war and peace because R
does not like spaces in titles), but something like
dplyr_exercise would be more suitable.
Normally each R Markdown document is composed of three main components: 1) a YAML header, 2) formatted text and 3) one or more code chunks.
The YAML header contains some metadata about the script. It is the
few lines of text at the top between two lines of three dashes:
---. Eventually you may want to alter this in all sorts of
ways, and you can do this any time, but for now you might just want to
alter the title (to something that tells you what this script does. In a
month or two you may well have forgotten!) and add lines for
author: and date:. You should never delete
this section entirely. Beneath it is where you write your script. Begin
by deleting all the exemplar text that is presented to you on first
creating a new notebook. All of it, so that all you have beneath the
YAML header is white space.
We will now write our script code-chunk by code-chunk, interspersing these with helpful commentary that will at least serve to remind us what each chunk does, but could also be extensive blocks of text if we wanted that, all of it formatted according to the simple rules of Markdown, which you can find in the Help menu.
At any point you can ‘knit’ or render your script by pressing the Preview/Knit button at the top of the script pane. The text you have written will then be formatted according to the rules of Markdown, which you can see at a glance by choosing Help/Markdown Quick Reference.
This very document that you are reading now is the rendered version
of a .rmd script such as you are about to write, so you can
see what kinds of formatting are possible.
Note though that when you work on a script for the analysis of your
own data, you rarely need to knit it. You could just go
through it and run it chunk by chunk.
If you do render your script, R will include all the messages and warnings that are normally printed out in the console window when you run lines of code. Normally, we do not want to see these in our fancy, rendered document. We can suppress them for all code chunks, and affect the behaviour of all chunks in other ways, by including this ‘set-up’ chunk first in our script:
```{r, echo=FALSE}
knitr::opts_chunk$set(message=FALSE,warning=FALSE,echo=TRUE)
```
Do you see that within the curly brackets, after the r,
we have written echo=FALSE? This is an example of a ‘chunk
option’ that affects the way this chunk is rendered and other aspects of
its output behaviour. There are dozens of these options. You can read
about them here.
Many are useful from time to time, but normally you don’t have to worry
about them. Here we have written echo=FALSE because we
don’t want this code chunk to be visible in the rendered document. The
default behaviour is echo=TRUE.
To have to write this and other chunk options at the top of every code chunk would be tedious. This set-up chunk allows you to set options globally for your whole document. You can override these globally set options for any individual chunk just by including whatever options you want at the top of that chunk.
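As a minimal sketch of such an override, the chunk below assumes that the set-up chunk has set echo=TRUE globally. It hides its own code while still showing its output. The vector x is made up purely for illustration.

```{r, echo=FALSE}
# echo=FALSE here overrides the global echo=TRUE for this chunk only
x <- c(2, 4, 6)   # hypothetical values, just for illustration
mean(x)           # the result still appears in the rendered document
```

Any chunk without its own options simply falls back on whatever the set-up chunk specified.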
Apart from suppressing warnings and messages as we have here, a
common decision is as to whether to show your code in the rendered
document or not. That decision usually depends on who the document is to
be read by. Managers? The Public? Maybe leave the code out. Research
collaborators? Maybe leave it in. Either way, you achieve what you want
by including the option echo=TRUE (code left in) or
echo=FALSE (code left out) in the set-up chunk.
Having included this chunk, try altering the various
TRUE/FALSE settings and see what difference that makes to
the rendered version of your document.
Or maybe not… Ha! You don’t need to do this because you are working
within a Project. The working directory is automatically set to be the
top level of the Project. Check that this is so by typing
getwd() into the console. Working directory?
Means nothing to you? Don’t worry. Our use of the here
package means that this concept, something you might read about, is
something you do not need to worry about.
Load the tidyverse and here packages into your session
After the set-up chunk, we often start a script by loading whatever
packages we want to use, all in one code chunk. We will always use the
tidyverse and here packages, so start by
writing this header and code chunk:
### Load packages
```{r}
library(tidyverse)
library(here)
```
To run these lines, select them then press Ctrl-Enter, or Cmd-Enter if you are on a Mac, or press the little green arrow at the top right of the chunk. To run them line by line, just place the cursor anywhere in a line and press Ctrl/Cmd-Enter.
If any of these lines throw an error, it will most likely be because
you have not yet installed one or more of these packages. The error
message will tell you which. If this happens you need to type
install.packages('tidyverse') or
install.packages('here') into your console window. You do
it there rather than in your script because you only need to do this
successfully once, not every time you run your script.
This will install tidyverse and/or here
onto your machine. You should then run the library lines
again.
```{r}
filepath<-here("data","compensation.csv")
compensation<-read_csv(filepath)
```
This is where we make use of the here() function from the here package
and exploit our decision to have designated the RStuff directory as a
Project. here() thinks that ‘here’ is the top level of the
Project. It now does not matter where that Project sits on your
computer. To find your data file you just need to give the
here() function the hierarchical sequence of directories
within your Project to the file, separated by commas. In our case we
just had to go into the data folder, and then there was the
file we wanted, compensation.csv. Without here(), the R code
needed both to find this file and to keep your RStuff folder
sensibly organised would be much more opaque.
Take note: here() takes away a
lot of the need for a ‘techie’ understanding of how to move between
folders. It also helps you to write robust code for finding files that
will still work when you reorganise your computer or share your Project
with someone else who wants to run it on their machine.
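As a rough sketch of the difference, the chunk below builds the same path twice: once with here(), and once by hand from the working directory. It assumes only that the here package is installed. The path can be built whether or not the data folder and file actually exist.

```{r}
library(here)

# here() joins the folder and file names onto the top level of the Project
path_with_here <- here("data", "compensation.csv")

# One hand-built alternative, which only gives the right answer if the
# working directory happens to be the top level of the Project
path_by_hand <- file.path(getwd(), "data", "compensation.csv")

basename(path_with_here)   # the file name at the end of the path
```

The hand-built version breaks as soon as the working directory changes, which is the fragility that here() is meant to remove.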
Having entered data into R we should inspect it, usually in several ways, to check that all is as we expect, to see if there is anything we need to take note of such as missing values, maybe even to get a sense of the statistics of the data.
One basic and useful way to begin this process is to see how many columns and rows there are, what each column is called and what kind of data they each contain:
```{r}
glimpse(compensation)
```
Note that you can also inspect the data by clicking on the arrow against its name in the Environment pane.
<dbl> is R-speak for numerical data, not
necessarily integers.
<chr> is R-speak for text.
We could then use summary() to find the mean values of
the Fruit and Root columns.
```{r}
summary(compensation)
```
summary() gives what can be useful overall summary
statistics of each column of numerical data within a data set. It is no
use however for columns of categorical data, such as the Grazing column
here.
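For a categorical column, a simple count of the rows in each category is often more informative. Here is a minimal sketch using count() from dplyr. The data frame toy is made up for illustration and is not the real compensation data.

```{r}
library(tidyverse)

# A made-up stand-in for the compensation data, for illustration only
toy <- tibble(
  Root    = c(6.2, 7.1, 8.4, 9.0),
  Fruit   = c(59.8, 60.9, 75.5, 80.3),
  Grazing = c("Ungrazed", "Ungrazed", "Grazed", "Grazed")
)

# count() gives the number of rows in each category of Grazing
grazing_counts <- count(toy, Grazing)
grazing_counts
```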
What though if we had wanted the Root and Fruit summaries for each
grazing condition? That is where the pair of functions
group_by() and summarise() from the
dplyr package come in. We get to those lower down, but here
is a taster of how they might be used, together with what is called the
‘pipe’ operator %>%.
```{r}
compensation %>%
group_by(Grazing) %>%
summarise(mean_Root=mean(Root),mean_Fruit=mean(Fruit))
```
dplyr provides several functions that let you extract
subsets of a larger data set. All of them use a data frame as their
first argument and produce a data frame as their output. Thus the output
of one can always be used as the input of another, which allows us to
chain together a sequence of these functions to perform a series of
tasks.
select() allows you to choose or exclude whichever
columns you want.
Use select() to pick out the Fruit column.
In this code chunk we will start with the compensation
data frame, do stuff to it and save the result to an object called
Fruits.
```{r}
Fruits<-compensation %>% # Create an object called `Fruits` by starting with the object `compensation` and then
select(Fruit) # selecting from that the column called `Fruit`.
glimpse(Fruits) # Now let's have a look at this new object we have created.
```
As is always the case with functions from dplyr and
throughout the tidyverse, the first argument of
select() is the name of the data frame from which you want
to select some columns. This is followed, in this case, by the name(s)
of the column(s) you wish to select.
In the way we have written the code, using the pipe operator, that
first argument is supplied by the pipe and so is not written explicitly.
Use select() to pick out all the columns except the
Root column.
select() will leave out any column prefaced with a
minus sign (-).
Save the output as an object called ‘notRoot’.
```{r}
notRoot<-select(compensation,-Root)
glimpse(notRoot)
```
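The same pattern extends to several columns at once. The sketch below uses a small made-up data frame (toy, not the real compensation data) to show two columns being kept and two being dropped.

```{r}
library(tidyverse)

# Made-up data with the same column names as compensation
toy <- tibble(
  Root    = c(6.2, 7.1),
  Fruit   = c(59.8, 60.9),
  Grazing = c("Ungrazed", "Grazed")
)

# Keep two named columns, in the order given
root_and_grazing <- select(toy, Root, Grazing)

# Drop two columns at once by prefacing each with a minus sign
only_fruit <- select(toy, -Root, -Grazing)

names(root_and_grazing)
names(only_fruit)
```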
select() is most useful when you have a data set with
many columns and you only want a few of them. This commonly happens when
you download data from a publicly curated database such as NBN, where you may often receive a dataset
with dozens of columns but only need a few of them for your particular
study.
We use slice() to pick out particular rows (but see also
filter(), which is more useful than slice() in
practice, I find).
Use slice() to grab the second row of compensation
```{r}
row_2<-slice(compensation,2)
row_2
```
Another way to write this same code, but in better style, is to use the pipe operator, like this:
```{r}
row_2<-compensation %>%
  slice(2)
row_2
```
Use slice() to grab the second to the 10th rows
```{r}
row_2_to_10<-slice(compensation,2:10)
row_2_to_10
```
Use slice() to grab rows 2, 3 and 10.
```{r}
row_2310<-slice(compensation,2,3,10)
row_2310
```
What kind of object does slice() return?
Are the row numbers still 2, 3, and 10?
slice() is useful, but often we want to pick out rows
according to one or more conditions being satisfied by the values in one
or more columns. We use filter() to do this. For example,
we might only want those rows corresponding to particular species, or
sites, as determined by the values in a Species or Sites column.
Use filter() to pick out those rows for which Fruit is
greater than 50.
```{r}
big_fruit<-filter(compensation,Fruit>50)
glimpse(big_fruit)
```
Use filter() and the logical OR symbol ‘|’
to pick rows where Fruit is greater than 80 or less than 20
```{r}
extreme_fruit<-filter(compensation,Fruit>80 | Fruit<20)
glimpse(extreme_fruit)
```
Use filter() and the logical AND symbol
‘&’ to pick rows where Fruit is less than 80 AND greater than 20
```{r}
medium_fruit<-filter(compensation,Fruit<80 & Fruit>20)
glimpse(medium_fruit)
```
Use filter() to pick out only the grazed fruits
```{r}
grazed_fruit<-filter(compensation,Grazing=="Grazed")
glimpse(grazed_fruit)
```
Note that we have saved each of these selections to a named object, so we could now use them if we wanted to.
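Conditions on different columns can also be combined in a single filter() call. A minimal sketch, using a small made-up data frame (toy, with hypothetical values) rather than the real compensation data:

```{r}
library(tidyverse)

# Made-up data with the same column names as compensation
toy <- tibble(
  Root    = c(6.2, 7.1, 8.4, 9.0),
  Fruit   = c(15.5, 60.9, 45.5, 85.3),
  Grazing = c("Ungrazed", "Ungrazed", "Grazed", "Grazed")
)

# Rows that are Grazed AND have Fruit greater than 50
grazed_big <- filter(toy, Grazing == "Grazed" & Fruit > 50)

# Rows with extreme Fruit values: greater than 80 OR less than 20
extreme <- filter(toy, Fruit > 80 | Fruit < 20)

grazed_big
extreme
```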
In these filter() examples, we have seen the use of
logical operators:
- >: greater than
- <: less than
- >=: greater than or equal to
- <=: less than or equal to
- ==: equal to
- &: AND
- |: OR

mutate()
First, use head() to look at the first 6 rows of the
compensation data.
How many columns are there?
```{r}
head(compensation)
```
Use mutate() to create an additional column called
logFruit which is the natural log of the Fruit
column.
Doing a log-transform of data is often a useful trick in data
preparation.
```{r}
compensation<-mutate(compensation,logFruit=log(Fruit))
```
Now use head() to look again at the first 6 rows of your
data. How many columns are there now?
```{r}
head(compensation)
```
Have you changed the original data file? The answer is no. We have
changed the data frame compensation but the data file
itself is untouched. That is a major advantage of doing your data
analysis in R rather than in Excel, for example, where you would often
be working in and altering your original data file, with all the
potential for data loss that that entails.
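The same logic applies one level down. mutate() does not change the data frame you give it either, unless you assign the result back to the same name, as we did above. The sketch below uses a made-up data frame (toy) and also shows that several columns can be created in one call.

```{r}
library(tidyverse)

# Made-up Fruit values, for illustration only
toy <- tibble(Fruit = c(10, 100, 1000))

# mutate() returns a new data frame; toy itself is unchanged
toy_logged <- mutate(toy, logFruit = log(Fruit), log10Fruit = log10(Fruit))

names(toy)         # still just "Fruit"
names(toy_logged)  # "Fruit", "logFruit", "log10Fruit"
```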
Use arrange() to sort the data by the Fruit column in
ascending order.
```{r}
comp_fruit_ascending<-arrange(compensation,Fruit)
head(comp_fruit_ascending)
```
Use arrange() to sort the data by the Fruit column in
descending order
```{r}
comp_fruit_descending<-arrange(compensation,-Fruit)
head(comp_fruit_descending)
```
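Two variations on arrange() are worth knowing. This sketch uses a made-up data frame (toy). The desc() helper from dplyr is an alternative to the minus sign, and sorting by more than one column orders the rows within each group.

```{r}
library(tidyverse)

# Made-up data, for illustration only
toy <- tibble(
  Fruit   = c(60.9, 15.5, 85.3, 45.5),
  Grazing = c("Ungrazed", "Ungrazed", "Grazed", "Grazed")
)

# desc() sorts in descending order, and also works on text columns
by_fruit_desc <- arrange(toy, desc(Fruit))

# Sort by Grazing first, then by Fruit within each grazing condition
by_grazing_then_fruit <- arrange(toy, Grazing, Fruit)

by_fruit_desc
by_grazing_then_fruit
```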
%>%
We mentioned the pipe operator %>% above. Let’s see
how it works in helping us string together a sequence of
dplyr operations, as we might often want to do in preparing
a dataset:
```{r}
largeRoot<-compensation %>% # create an object called largeRoot. Start from compensation and then...
filter(Fruit>80) %>% # keep only those rows where Fruit is > 80 and then....
filter(Grazing=="Grazed") %>% # keep only rows where grazing occurred and then....
select(Root) # keep only the Root column
```
So we should end up with an object called largeRoot which contains a single column of Root values, for those trees where the Fruit value was greater than 80 and where the grazing condition was “Grazed”.
You can think of the pipe symbol as meaning ‘and then’. It feeds the
result of each line of code into the next line. You use it with
tidyverse functions. All of these act on and produce as
output data frames, and so one of these is always fed to the next line
in a sequence such as we see above. Normally, the first argument of any
tidyverse function is also a data frame, but since with the
pipe operator each line is being fed a data frame, that argument need
not be explicitly included. It is assumed to be
whatever data frame was the result of the previous line.
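To make that concrete, the sketch below writes the same two steps with and without the pipe, using a made-up data frame (toy). The two versions should give the same result.

```{r}
library(tidyverse)

# Made-up data, for illustration only
toy <- tibble(
  Root  = c(6.2, 7.1, 8.4, 9.0),
  Fruit = c(15.5, 60.9, 45.5, 85.3)
)

# Without the pipe: the data frame is the explicit first argument,
# and the calls have to be read from the inside out
nested <- select(filter(toy, Fruit > 50), Root)

# With the pipe: the same steps, read from top to bottom
piped <- toy %>%
  filter(Fruit > 50) %>%
  select(Root)

identical(nested, piped)
```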
This use of the pipe operator is very common in data analysis and is very, very useful. I strongly advise you to get the hang of it and to use it in your own R code.
Get the hang of and use the pipe symbol %>% :)
This is where we can do in R what pivot tables do for you in Excel.
summary() gave us global means for Root and Fruit. But
what if we want to know the means of each for each of the grazing
conditions?
A combination of group_by() and summarise()
can be used to do this:
Use group_by() and summarise() to find the
means of Root and Fruit in both Grazed and Ungrazed conditions
```{r}
compensation %>%
group_by(Grazing) %>%
summarise(Mean_Root=mean(Root),Mean_Fruit=mean(Fruit))
```
A package is a collection of functions, data sets and help
documentation that adds to the capabilities of base R. But R itself comes
with many built in data sets that can be very useful for practice. One
of them is the famous iris dataset collected by Anderson in
1935. It contains 150 records of 3 species of iris, for each of which
the petal length and width and the sepal length and width are
recorded.
Although this data set can be obtained from within R, let us practise getting it from our data directory, just as we would have to for our own data.
Start a new notebook and save it into your scripts folder under a
suitable name. iris? As before, delete all text beneath the
YAML and include author and date lines in the
YAML. Replace the title with something suitable, i.e.
something that will remind you what the script is for when you come back
to it later.
Using Ctrl-alt-I or Cmd-alt-I on a Mac, include a new code chunk to
load up the tidyverse and here packages.
```{r}
library(tidyverse)
library(here)
```
Add a suitable header above the chunk, prefaced with ### to remind you what the chunk does. Run the chunk by pressing the green arrow at its top right. You can run all other chunks as you include them in this way.
Include a new chunk to load the iris.csv dataset into an
object called iris. Use the glimpse() function
to inspect it.
```{r}
filepath<-here("data","iris.csv")
iris<-read_csv(filepath)
glimpse(iris)
```
What type of object is iris? How many rows and columns
are there? What type of data does each column contain? Do you see the
iris object in the Environment pane?
Without creating any intermediate objects, and using the pipe
operator %>%, create a summary table that gives the mean
and standard error of the Petal.Length column, for each
species.
```{r}
iris %>%
group_by(Species) %>%
summarise(mean.petal.length=mean(Petal.Length),se.petal.length=sd(Petal.Length)/sqrt(n()))
```
Summary tables of this sort are very useful.
The standard error of the mean of a column of numerical data is its
standard deviation divided by the square root of the number of data
points. Here we use n() to tell us how many data points
there are for each species.
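As a quick check on that formula, the sketch below works out the standard error by hand for one species and compares it with the summarise() version. It assumes R’s built-in copy of the data (datasets::iris) has the same column names as the iris.csv file used above.

```{r}
library(tidyverse)

# Petal lengths for one species, taken from R's built-in copy of iris
setosa_petals <- datasets::iris %>%
  filter(Species == "setosa") %>%
  pull(Petal.Length)

# Standard error by hand: standard deviation / sqrt(number of data points)
se_by_hand <- sd(setosa_petals) / sqrt(length(setosa_petals))

# The same quantity via group_by(), summarise() and n()
se_summarised <- datasets::iris %>%
  group_by(Species) %>%
  summarise(se.petal.length = sd(Petal.Length) / sqrt(n())) %>%
  filter(Species == "setosa") %>%
  pull(se.petal.length)

all.equal(se_by_hand, se_summarised)
```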
Repeat the last part, but use the kableExtra package to
improve the table:
```{r}
library(kableExtra)
iris %>%
group_by(Species) %>%
summarise(mean.petal.length=mean(Petal.Length),se.petal.length=sd(Petal.Length)/sqrt(n())) %>%
kbl(digits=2) %>%
kable_styling(full_width=FALSE)
```
You may first need to install the kableExtra package
using install.packages("kableExtra") in your console
window.
Create new columns which give the ratio of petal length to petal
width, and of sepal length to sepal width (and so are a measure of
shape). Save just these two columns and the Species column to a new
object called iris_shape.
```{r}
iris_shape<-iris %>%
mutate(Petal.Shape=Petal.Length/Petal.Width) %>%
mutate(Sepal.Shape=Sepal.Length/Sepal.Width) %>%
select(Species,Petal.Shape,Sepal.Shape)
glimpse(iris_shape)
```
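The two mutate() lines could also be written as a single call, since mutate() accepts several new columns separated by commas. A sketch of that alternative follows, again assuming the built-in datasets::iris stands in for the iris.csv file. The object name iris_shape_one_call is just for illustration.

```{r}
library(tidyverse)

# Both new columns created in one mutate() call
iris_shape_one_call <- datasets::iris %>%
  mutate(Petal.Shape = Petal.Length / Petal.Width,
         Sepal.Shape = Sepal.Length / Sepal.Width) %>%
  select(Species, Petal.Shape, Sepal.Shape)

glimpse(iris_shape_one_call)
```

Which form to use is largely a matter of taste. One mutate() per line can be easier to comment, while a single call is more compact.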
Use the ggplot2 package from within
tidyverse to plot the shape data in various ways:
Box plot
#### Very basic box plot
```{r}
iris_shape %>%
ggplot(aes(x=Species,y=Petal.Shape,fill=Species)) +
geom_boxplot()
```
#### Slightly prettier box plot
```{r}
iris_shape %>%
ggplot(aes(x=Species,y=Petal.Shape,fill=Species)) +
geom_boxplot() +
labs(x="Species",
y="Petal Shape") +
theme_classic() +
theme(legend.position="none")
```
Try leaving out either or both of the theme lines. What
difference does that make?
Scatter plot
```{r}
iris_shape %>%
ggplot(aes(x=Petal.Shape,y=Sepal.Shape,colour=Species)) +
geom_point() +
labs(x="Petal Shape",
y="Sepal Shape") +
theme_classic()
```
Faceted scatter plot
```{r}
iris_shape %>%
ggplot(aes(x=Petal.Shape,y=Sepal.Shape,colour=Species)) +
geom_point() +
labs(x="Petal Shape",
y="Sepal Shape") +
facet_wrap(~Species,scales="free") +
theme_classic()
```
Faceting, using the facet_wrap() line, is very useful for
producing a series of related plots in the same style.
We have seen how a few functions from the dplyr package
provide most of what you need when it comes to preparing your data for
plotting and analysis.
We looked at the following functions:
- select() to pick out or exclude certain columns.
- slice() to choose certain rows.
- filter() to choose or exclude certain rows according to one or more criteria.
- mutate() to alter existing columns or create new ones, according to some calculation or criterion.
- arrange() to sort data in ascending or descending order.
- group_by() and summarise() which, when used together, give us summary statistics of each group in the data set.

We should also mention pivot_longer(), from the tidyr package within tidyverse, which can be used
to tidy data.
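As a brief sketch of what that tidying looks like, the chunk below reshapes a made-up ‘wide’ table, with one column per year, into a ‘long’ one with one row per site and year. The data frame and its column names are hypothetical.

```{r}
library(tidyverse)

# A made-up 'wide' table: one column per year (hypothetical values)
wide <- tibble(
  site   = c("A", "B"),
  `2020` = c(12, 7),
  `2021` = c(15, 9)
)

# Gather the year columns into a pair of columns: year and count
long <- wide %>%
  pivot_longer(cols = c(`2020`, `2021`),
               names_to = "year",
               values_to = "count")

long
```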
We also saw how to use the very useful ‘pipe’ symbol
%>% that allows us to run lines of code together and
carry out a sequence of operations without having to create intermediate
objects. It also makes the code easier to read.
Top Tip 3: Get the hang of and use the %>%
pipe symbol. No, really.