Adapted and extended from Chapter 2 of Getting Started with R by Beckerman, Childs and Petchey.
The bird dataset in the exercise is from Mark Gardener’s Statistics for Ecologists using R and Excel
In the following you are intended to write a script of your own, using the snippets of code in this instruction set. As you come to each one, just copy and paste it into your own script. If you feel that writing them out yourself would make you better able to remember them, do that instead, but remember that you dont need to remember them. Once you’re done with this exercise, you will have this worksheet and your own script which you can copy and adapt when you need to do something similar again.
The very useful data manipulation (dplyr) and plotting
(ggplot2) packages within tidyverse work much
more effectively when the data is tidy.
Tidy means - each variable appears only once per row, and each observation appears only once per column.
Tidy data typically looks ‘long and thin’ whereas untidy data, which is often how we like it to be for a human to read, or how it would likely be if a human had recorded it, is typically ‘short and fat’ ie fewer rows and more columns than the equivalent data set once it has been tidied.
The scenario in this exercise is that we have data in a spreadsheet that is as we might have wanted it as a data collection sheet. This means that we have entered the same type of information into several columns.
Somewhere in your computer universe you should by now have an
organised folder where you do your R work for this course. You may have
named it something else, but here we will refer to it as the
RStuff folder. Inside that folder there could be several
sub-folders including at least these two - one called
scripts and the other called data. You can
guess what is supposed to go in each of these.
If you did not do so when setting it up, go now and make your
RStuff folder into a Project. You will notice
that an RStuff.Rproj file appears in the top level of the
folder. In future, you can open the project by navigating to and then
clicking on this from the File/Open Project menu. Using and working with
Projects turns out to have several advantages that become more apparent
the more you use them. In particular, it makes much easier the business
of having an organised RStuff folder while also being able to read in or
write out data or other stuff from particular sub-folders
without having to have a Tech Guru’s knowledge of obscure
pathway syntax. It also keeps your R environment tidy.
Eventually, in your R life, you may well end up using the widely known Github web host, which comes into its own when you want to collaborate with other people and share your Project contents with them. If all the files of your collaboration - data, scripts, manuscripts etc are bundled into a Project, then RStudio makes it very easy to ‘push’ the latest version of your work up to GitHub and to ‘pull’ down whatever your colleagues have done since you last made any changes, in such a way that none of you overwrite what anyone else has done.
But that’s for another day…..
Go to Teams, then from the Files/Class Materials/Rstuff/data folder
download the data file called nasty_format.csv and put it
into your Rtuff/data folder. Or, get this file some other way.
Go to File/New File/R Notebook (the second option in the menu).
Create a new notebook and save it into your RStuff/scripts folder
with the namedata_tidy. RStudio will automatically give it
a .Rmd extension, so you don’t have to do that
yourself.
Add and complete author: and date: lines to
the yaml and delete all the exemplar text. Now you are
ready to go.
Your script window should now be empty, with a tab
data_tidy.Rmd, coloured black because you have no unsaved
changes. If you did, the tab would be coloured red.
Check that the Rstuff folder is the working directory. You can do
this either by typing getwd() into your console window, or,
in the Files tab of the bottom right hand window, click on the gear
wheel then Go to Working Directory, which should appear in
the window beneath.
You should find that the working directory is your RStuff folder, which is handy. Notice too that your Environment window (top-right) is empty, which is nice and neat, and that if you click on the Files tab in the bottom right window you see the internal structure of your RStuff folder. That is, it is showing you your working directory. These are all nice features of using Projects. It leaves you with less to do or worry about.
In particular, we don’t really need to worry about the whole idea of
the working directory, as long as we use the here package,
as we will see, shortly.
The data we will look at is an instance of where observations of the same thing, here bacterial densities in different bottles held at different temperatures, have been recorded on different dates and the observation from each date is placed in a separate column. This means that each row contains many observations - the bacterial densities on each of the dates where these were recorded. We would like each row to contain only a single observation.
So, now, to work, or, as American academics like to say, let’s get started….
Follow through the rest and copy each of the blocks of code into code chunks in your script in turn. As you enter each block, remember to add a header and whatever commentary you think is suficient to make it clear what the code-chunk does., Then select, the code lines within the chunk line by line sequentially and implement them using Ctrl-Enter or Cmd-Enter, or run the whole chunk at once by pressing the green arrow at the top right of the chunk..
You should see each line or block being implemented in the Console pane and any objects created or removed will either appear or disappear in the Environment pane.
This set-up chunk is a bit opaque but is useful at the top of every notebook you write:
```{r,echo = FALSE}
knitr::opts_chunk$set(message = FALSE,warning = FALSE)
```
The arguments to the knitr::opts_chunk$set function are
‘chunk options’. In any other chunk these affect aspects of the way that
chunk will behave. In the set-up chunk they act globally. This saves you
from having to include common chunk options in every chunk where you
might want them. Here, we just have two, to stop messages and warnings
from cluttering up the knitted version of our script. You can add other
chunk options. as you learn about them.
Packages in R are like apps on your phone. They add extra capabilities to R. In fact they are what makes R so powerful.
With phone apps, you install them once, then afterwards you just press on them to make them work.
It works in a similar way with packages in R. So, if you have never used a package before on the machine you are using, you need to install it.
Let us do this with the package tidyverse. We will use
this very important package almost all the time.
If you have already installed tidyverse,
here and lubridate onto your machine, you can
skip this section.
In the Console pane (not in your script) type this:
#install.packages("tidyverse") #uncomment this line if you need to run it.
You should now see this package being installed in the Console
window. tidyverse is a big package, so a lot of red ink
rolls by. Just wait until the > prompt reappears.
Now that we have done this, we do not need to do it again, just like you only install an app on your phone once. That is why we wrote the instruction in the Console pane and not in a script.
Each time you want to use an app on your phone, you have to press on
it. So it is with packages in R, they don’t just work once they are
installed, you have to tell R that you want to use them. We do this with
the library() command, which you write in your script.
Enter a code chunk with these lines in your script and implement it.
library(tidyverse)
library(here)
library(lubridate)
library(kableExtra)
You will see a list of the packages that tidyverse
brings together under one roof. We will make frequent use of most of
these,particularly readr, dplyr and
ggplot2. lubridate is another package, also
part of tidyverse but hwich has to be separately installed.
It
If all has gone well, you should just see the prompt again in the Console window.
nasty_format
using read_csv()Stop! Before you do this, look at the data file in Excel/Numbers/GoogleSheets. What do you notice about one of the columns?
read_csv() is a command from the dplyr
package within tidyverse. You can guess what it is for.
filepath <- here("data","nasty_format.csv")
nasty_format <- read_csv(filepath)
Notice the use here of the package here(). Since we are
working with a Project, the working directory is by default the top
level of the project. That is where here() thinks of as
‘here’, so to speak. From that start point it can then be told to look
down through whatever heirarchy of subfolders it needs to in order to
find the data file you want to read into R.
The object nasty_format is a data frame (actually, it is
a tibble, which the tidyverse people say is just like a data frame, but
without some of the annoying properties. I don’t think this new name has
caught on yet) - R’s word for a container that you can best think of as
a spreadsheet. It has columns of data, where the data can be almost
anything, including text, dates, whole numbers, decimal numbers, and so
on.
glimpse().Having read your data into R, it is a good idea to inspect it in
various ways. I habitually use the glimpse() command
immediately.
glimpse(nasty_format)
## Rows: 36
## Columns: 11
## $ species <chr> "P.caudatum", "P.caudatum", "P.caudatum", "P.caudatum", "P.c…
## $ bottle <chr> "7-P.c", "8-P.c", "9-P.c", "22-P.c", "23-P.c", "24-P.c", "37…
## $ temp <dbl> 22, 22, 22, 20, 20, 20, 15, 15, 15, 22, 22, 22, 20, 20, 20, …
## $ `1.2.13` <dbl> 100.0, 62.5, 75.0, 75.0, 50.0, 87.5, 75.0, 50.0, 75.0, 37.5,…
## $ `2.2.13` <dbl> 58.8, 71.3, 72.5, 73.8, NA, NA, NA, NA, NA, 52.5, 63.7, 70.0…
## $ `3.2.13` <dbl> 67.5, 67.5, 62.3, 76.3, 81.3, 62.5, 90.0, 78.8, 78.3, 23.8, …
## $ `4.2.13` <dbl> 6.80, 7.90, 7.90, 31.30, 32.50, 28.80, 72.50, 92.50, 77.50, …
## $ `6.2.13` <dbl> 0.93, 0.90, 0.88, 3.12, 3.75, 3.12, 102.50, 68.75, 116.20, 0…
## $ `8.2.13` <dbl> 0.39, 0.36, 0.25, 1.01, 1.06, 1.00, 67.50, 72.50, 60.00, 0.9…
## $ `10.2.13` <dbl> 0.19, 0.16, 0.23, 0.56, 0.49, 0.41, 37.50, 52.50, 60.00, 0.3…
## $ `12.2.13` <dbl> 0.46, 0.34, 0.31, 0.50, 0.38, 0.46, 41.25, 45.00, 36.25, 0.9…
What does glimpse() tell you about a data set? How many
rows are there? How many columns? What are the columns called? What kind
of data do they contain?
Other useful ‘quick look’ functions are:
head(nasty_format)
This shows you the first few rows. Does everything look OK?
#view(nasty_format)
You can achieve the same thing as view() by clicking on
the name of the variable in the Environment pane. That is what I do. I
find it annoying to have view() implemented each time I run
a script, which is why I have commented it out!
summary(nasty_format)
## species bottle temp 1.2.13
## Length:36 Length:36 Min. :15 Min. : 25.00
## Class :character Class :character 1st Qu.:15 1st Qu.: 50.00
## Mode :character Mode :character Median :20 Median : 75.00
## Mean :19 Mean : 79.17
## 3rd Qu.:22 3rd Qu.:100.00
## Max. :22 Max. :150.00
##
## 2.2.13 3.2.13 4.2.13 6.2.13
## Min. :52.50 Min. : 23.80 Min. : 1.13 Min. : 0.760
## 1st Qu.:63.77 1st Qu.: 66.88 1st Qu.: 23.57 1st Qu.: 2.603
## Median :71.90 Median : 77.90 Median : 72.50 Median : 43.125
## Mean :71.51 Mean : 74.30 Mean : 57.59 Mean : 40.828
## 3rd Qu.:78.12 3rd Qu.: 85.00 3rd Qu.: 83.45 3rd Qu.: 70.000
## Max. :97.50 Max. :108.80 Max. :111.30 Max. :116.200
## NA's :16
## 8.2.13 10.2.13 12.2.13
## Min. : 0.250 Min. : 0.1600 Min. : 0.30
## 1st Qu.: 1.048 1st Qu.: 0.5475 1st Qu.: 0.50
## Median :26.685 Median :16.2500 Median : 15.19
## Mean :34.403 Mean :28.0172 Mean : 30.33
## 3rd Qu.:69.062 3rd Qu.:60.0000 3rd Qu.: 50.94
## Max. :91.250 Max. :91.3000 Max. :102.50
##
summary() gives you descriptive statistics of each
column. Notice all the NAs in the 2.2.13 column.
Almost always, with real data, there is some cleaning up to do before you can analyse it. Here, we have a column with a lot of missing data.
We notice in looking at the data that one of the dates, 2.2.13, seems to be problematic. There are many NAs in the column for that date. We have a number of choices as to what do about this, but suppose we decide to ignore data from that date, perhaps because we realise from our log book that our procedures were faulty on that day.
We can remove or select columns that meet some criterion using the
select() command from dplyr:
nasty_format <- select(nasty_format,-"2.2.13")
The minus sign here has the effect of selecting all columns except the problematic one for the data 2.2.13.
Our dataset is not tidy since columns 4 to 10 contain values of the
same variable abundance. The names for these columns are
different dates. Thus they are actually different values of the same
variable date.
To tidy the data, we need to ‘gather’ these columns into a pair of
columns which we will name date and abundance,
one of which contains the dates, the other of which contains the
abundance values measured on those dates.
This should give us a tidy data set since in any one row there will
now be just one example of each variable: species,
bottle, temp, date,
abundance.
As we will see, this means that our final, tidy dataset will contain more rows and fewer columns than the original, untidy data set.
We can either do this using:
I don’t recommend it, but you can tidy data in Excel. With simple
data sets this is straightforward - but it gets trickier, more error
prone and more tedious with larger data sets. . Your goal is to put the
data into ‘long and thin’ form in which each variable appears only once
in each row.
You should have five columns with names: “species”, “bottle”, “temp”,
“date” and “abundance” No cells should be empty.This means you may have
to fill down and copies some values into other cells.
Once you have done this, save the data as a .csv file called ‘nice_format.csv’
Now load this data into R, into an object called tidy_data, like this:
#filepath <- here("data","nice_format.csv")
#tidy_data <- read_csv(filepath)
or using…
pivot_longer() function from the
package tidyr, which is part of
tidyverse.The pivot_longer() function supersedes the
gather() function which you may come across but can
otherwise forget about. In making datasets tidy it makes them longer by
increasing the number of rows and decreasing the number of columns.
In this code chunk we will tidy the data and read it into an object
(a tibble, in fact) called tidy_data.
tidy_data <- nasty_format %>%
pivot_longer(4:10,names_to = 'date',values_to = 'abundance')
glimpse(tidy_data)
## Rows: 252
## Columns: 5
## $ species <chr> "P.caudatum", "P.caudatum", "P.caudatum", "P.caudatum", "P.c…
## $ bottle <chr> "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c…
## $ temp <dbl> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, …
## $ date <chr> "1.2.13", "3.2.13", "4.2.13", "6.2.13", "8.2.13", "10.2.13",…
## $ abundance <dbl> 100.00, 67.50, 6.80, 0.93, 0.39, 0.19, 0.46, 62.50, 67.50, 7…
pivot_longer()is the dataset to
reshape, nasty_format. Here it is not included explicitly
since it is fed-through from the previous line by the pipe operator
%>%.names_to argument gives the name of the variable
that will be created from the data stored in the column names,
i.e. date.values_to argument gives the name of the variable
that will be created from the data stored in the cell value,
i.e. abundance.The reshaped dataset tidy_data should now be tidy.
Is it?
Dates and times commonly need to be dealt with in some way. In
particular, when first read into R, R often doesn’t even recognise them
as dates, but just thinks of them as any old text. Look at the output of
the glimpse(tidy_data) command you have just implemented,
and check that this is indeed the case with our data.
So we would like them to be recognised as dates and/or times and to be given a standardised format. This will enqble us to use them in ways that we could not do with text, for example to do arithmetic on them and find the time gap between two dates, and so on.
To do this we look at the dates in the Date column in
tidy_data and see how they have been formatted. The format
appears to be day.month.year.
We will use mutate() and the command dmy()
from the lubridate package, which is part of
tidyverse and which we have already loaded up. Why
dmy(), do you think? There are many other similar commands
for use with dates formatted in other ways.
mutate() is one of several very useful functionss within
dplyr that is used to alter data within columns, or to
create new ones. mutate - geddit? Here we use it together with
dmy() to alter the format of the data in the date
column.
glimpse(tidy_data) # before
## Rows: 252
## Columns: 5
## $ species <chr> "P.caudatum", "P.caudatum", "P.caudatum", "P.caudatum", "P.c…
## $ bottle <chr> "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c…
## $ temp <dbl> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, …
## $ date <chr> "1.2.13", "3.2.13", "4.2.13", "6.2.13", "8.2.13", "10.2.13",…
## $ abundance <dbl> 100.00, 67.50, 6.80, 0.93, 0.39, 0.19, 0.46, 62.50, 67.50, 7…
tidy_data <- mutate(tidy_data,date = dmy(date)) # amend the date column
glimpse(tidy_data) # after
## Rows: 252
## Columns: 5
## $ species <chr> "P.caudatum", "P.caudatum", "P.caudatum", "P.caudatum", "P.c…
## $ bottle <chr> "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c", "7-P.c…
## $ temp <dbl> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, …
## $ date <date> 2013-02-01, 2013-02-03, 2013-02-04, 2013-02-06, 2013-02-08,…
## $ abundance <dbl> 100.00, 67.50, 6.80, 0.93, 0.39, 0.19, 0.46, 62.50, 67.50, 7…
What do you notice about the change in the data type of the date
column after the operation of dmy()?
This next chunk of code is very useful. It shows how we can use a
pair of commands from dplyr : group_by() and
summarise() to, (guess what?) group and summarise our data
set. In this way they do what pivot tables do in Excel.
It also illustrates the use of the pipe operator %>%
which takes the result of each line and feeds it into the next. You can
read it as saying, roughly speaking, “and then”. If you get the hang of
it it makes for neater code and avoids you having to create all sorts of
intermediate objects on your way to some goal.
Let us find the mean and standard error of the mean of the abundances
for each species of bacteria and save them to a new data frame called
bacteria_summary:
bacteria_summary <- tidy_data %>%
group_by(species) %>%
summarise(mean = mean(abundance),se.species = sqrt(var(abundance)/n()))
Now we will present this table in a neater form, with sensible level of precison for the numerical data:
bacteria_summary %>%
kbl(digits = 2) %>%
kable_styling(full_width = F) # from the kableExtra package
| species | mean | se.species |
|---|---|---|
| Colpidium | 48.18 | 4.46 |
| P.caudatum | 38.94 | 4.56 |
| S. fonticola | 79.59 | 2.13 |
| Tetrahymena | 30.22 | 4.79 |
We won’t learn properly how to do this here, but this next part is
just included as a motivation for getting to grips with
ggplot2, the very powerful plotting package that comes with
tidyverse.
Just type in the following code. You can leave out the comments.
tidy_data %>%
ggplot(aes(x = temp,y = abundance,colour = species)) + # what we want to plot
geom_point() + # how we want to plot it
labs(title = 'Abundace vs Temperature', # title, axis labels
x = 'Temperature',
y = 'Abundance') +
facet_wrap(~species) + # one graph for each species
theme_classic() + # a different look
theme(legend.position = "none") # get rid of the legend.
What is this plot showing you? Could you have made it without having first tidied the data? No, you could not.
Try adapting the code above to show abundance vs date, for each different bottle.
Start a new R notebook and save it in your scripts folder under a
suitable, R-legal name. Make sure your script includes the
yaml at the top and delete all the exemplar material below
it. Amend the yaml as necessary to include author and date
lines. Change the title if you want.
---
title: "Using R to read in, inspect and tidy up data"
author: Michael Hunt
date: 07-02-2022
output:
html_document:
df_print: paged
---
Now include code chunks for each of the following sections of code. Include suitable headers before each code chunk as illustrated below, making sure they are formatted as headers in the rendered document, just like these..
# This is the set-up chunk
knitr::opts_chunk$set(warning = FALSE,message = FALSE)
# Load packages
library(tidyverse)
library(here)
library(lubridate)
# load data
filepath <- here("data","nasty_format.csv")
nasty_format <- read_csv(filepath)
glimpse(nasty_format)
# inspect the data
head(nasty_format)
summary(nasty_format)
# Remove the 2.2.13 column - the one with loads of NAs
nasty_format <- select(nasty_format,-"2.2.13")
# tidy the dataset
tidy_data <- nasty_format %>%
pivot_longer(4:10,names_to = 'date',values_to = 'abundance')
glimpse(tidy_data)
# Clean the dates
tidy_data <- mutate(tidy_data,date = dmy(date)) # dmy() is from the lubridate package
glimpse(tidy_data)
# Group the data by species, and find mean and standard error of the mean of abundance for each one
bacteria_summary <- tidy_data %>%
group_by(species) %>%
summarise(mean = mean(abundance),st.dev = sd(abundance))
bacteria_summary
bacteria_summary %>%
kbl(digits = 2) %>%
kable_styling(full_width = F) # from the kableExtra package
| species | mean | se.species |
|---|---|---|
| Colpidium | 48.18 | 4.46 |
| P.caudatum | 38.94 | 4.56 |
| S. fonticola | 79.59 | 2.13 |
| Tetrahymena | 30.22 | 4.79 |
tidy_data %>%
ggplot(aes(x = temp,y = abundance,colour = species)) + # what we want to plot
geom_point() + # how we want to plot it
labs(title = 'Abundace vs Temperature', # title, axis labels
x = 'Temperature',
y = 'Abundance') +
facet_wrap(~species) + # one graph for each species
theme_classic() + # a different look
theme(legend.position = "none") # get rid of the legend.
Could you have made this plot without having first tidied up the data? No, you could not.
This exercise uses a set of data on birds from Mark Gardener:
Gardener, M. (2017), Statistics
for Ecologists Using R and Excel
Create a new R notebook and save it in your scripts folder with the
name birds. If you do not already have it, put the
birds.csv data file into your RStuff/data
folder. You should find this in the RStuff folder on the
Teams page.
In your script:
tidyverse and here.untidy_birds.head(untidy_birds) to inspect the first few lines
of this dataset.pivot_longer() to tidy the dataset into a new
object called tidy_birds. This should have three columns:
Species, Habitat, Sightingsmax() and
min() functions for this.If you are familar with the language of ANOVA analysis, you will recognise that the original untidy dataset had one column of data for each of the different ‘treatments’ or manifestations of the single factor ‘habitat’. Those treatments were Garden, Hedgerow, Parkland, Pasture and Woodland. In the same way, “Male and”Female” are two manifestations of the single factor “Sex”. In tidying the data, we are gathering together the data from all the treatment columns and putting them in a single column with a title that describes what the data is, in this case “Sightings”, with alongside it a column with a title which describes the factor, in this case habitat, within which each row has a value according to which type of treatment ie habitat that sighting was made in.
Start a new R notebook and save it in your scripts folder under a
suitable, R-legal name. Make sure your script includes the
yaml at the top and delete all the exemplar material below
it. Amend the yaml as necessary to include author and date
lines. Change the title if you want.
---
title: "Tidying up the bird sightings data set"
author: Your name
date: The date
output:
html_document:
df_print: paged
---
Now include code chunks for each of the following sections of code. Include suitable headers before each code chunk as illustrated below, making sure they are formatted as headers in the rendered document, just like these..
# This is the set-up chunk
knitr::opts_chunk$set(warning = FALSE,message = FALSE)
library(tidyverse)
library(here)
filepath <- here("data","bird.csv")
untidy_birds <- read_csv(filepath)
glimpse(untidy_birds)
head(untidy_birds)
We see that `untidy_birds is a tibble (the new name for new, improved data-frames) with 6 columns and six rows. One of the columns contains text and the others contain numbers, in this case counts of bird sightings.
We see that the there is just one column that contains Species names,
but that sightings values are contained in all the other columns, whose
names are the various types of habitat. This dataset is definitely
untidy. There should be just one column for Habitat, not several columns
for each of the several varieties of habitat. We can use
pivot_longer() to achieve this.
In the code below, !Species means ‘NOT the Species
column’, ie every column but this one.
tidy_birds <- pivot_longer(untidy_birds,!Species,names_to = "Habitat",values_to = "Sightings")
now what do we have:
glimpse(tidy_birds)
head(tidy_birds)
gp_by_species <- tidy_birds %>%
group_by(Species) %>%
summarise(Max = max(Sightings),Min = min(Sightings))
gp_by_species
gp_by_species <- tidy_birds %>%
group_by(Habitat) %>%
summarise(Most_often_seen_species = Species[which.max(Sightings)],Sightings = max(Sightings))
gp_by_species
Note here the use of which.max[Column name] to find the
row number that contains the maximum value in that column, and then of
column_name[index] to pick out the value in column
<column name> at row <index>.