Using R to read in, inspect and tidy up data

Adapted from Chapter 2 of Beckerman, Childs and Petchey: Getting Started with R

The very useful data manipulation (dplyr()) and plotting (ggplot2()) packages within tidyverse() work much more effectively when the data is tidy. We will explain below what is meant by a ‘package’.

Tidy means - each variable appears only once per row, and each observation appears only once per column. Tidy data typically looks ‘long and thin’ whereas untidy data, which is often how we like it to be for a human to read, or how it would likely be if a human had recorded it, is typically ‘short and fat’ ie fewer rows and more columns than the equivalent data set once it has been tidied.

Before you begin

Somewhere in your computer universe you should have a folder called RStuff. Inside that folder there should be two others - one called scripts and the other called data. You can guess what is supposed to go in each of these.

Projects

If you did not do so when setting it up, go now and make your RStuff folder into a Project. You will notice that an RStuff.Rproj file appears in the top level of the folder. In future, you can open the project by clicking on this. Using and working with projects turns out to have several advantages that become more apparent the more you use them. In particular, it makes much easier the business of knowing where your data are, and keeps your R environment tidy.

Download the data into the data folder.

Go to Teams, then from the Files/Class Materials/Rstuff/data folder download the data file called nasty_format.csv and put it into your Rtuff/data folder.

Create a new script file

Go to File/New File/R Script (the top option in the menu).

Create a new script and save it into your RStuff/scripts folder with the name data_tidy. RStudio will automatically give it a .R extension, so you don’t have to do that yourself.

Make your data folder the working directory

Make the Rstuff/data folder the working directory. To do this go through the menus Session/Set Working Directory/Choose Directory then navigate to your data folder and click ‘Open’.

The data we will look at is an instance of where the same thing, here bacterial densities, have been recorded on different dates and the observation from each date is placed in a separate column. We would like each row to contain only a single observation.

Write the script

Title rows

Begin your script with useful identifying comment lines such as the following:

# Your name
# The date
# Simple explanatory title - what does this script do?

Now follow through the rest and copy each of the blocks of code into your script in turn. As you enter each block, remember to add a comment line, starting with a # to say briefly what the block does. Then select the block and implement it using Ctrl-Enter or Command-Enter.

You should see each line or block being implemented in the Console pane and any objects created or removed will either appear or disappear in the Environment pane.

Load any packages you need

Packages in R are like apps on your phone. They add extra capabilities to R. In fact they are what makes R so powerful.

With phone apps, you install them once, then afterwards you just press on them to make them work.

It works in a similar way with packages in R. So, if you have never used a package before on the machine you are using, you need to install it.

Let us do this with the package tidyverse(). We will use this very important package almost all the time.

Installing a package for the first time

If you have already installed tidyverse() onto your machine, you can skip this section.

In the Console pane (not in your script) type this:

#install.packages("tidyverse") #uncomment this line if you need to run it.

You should now see this package being installed in the Console window. tidyverse() is a big package, so a lot of red ink rolls by. Just wait until the > prompt reappears.

Now that we have done this, we do not need to do it again, just like you only install an app on your phone once. Tht is why we wrote the instruction in the Console pane and not in a script.

Loading a package into the current R session

Each time you want to use an app on your phone, you have to press on it. So it is with packages in R, they don’t just work once they are installed, you have to tell R that you want to use them. We do this with the library() command, which you write in your script.

Enter this line in your script and implement it.

library(tidyverse)
library(here)

You will see a list of the packages that tidyverse() brings together under one roof. We will make frequent use of most of these,p articularly readr(), dplyr() and ggplot2().

If all has gone well, you should just see the prompt again in the Console window.

Read the data into a data frame called nasty.format (Use read_csv())

Stop! Before you do this, look at it in Excel. What do you notice about one of the columns?

read_csv() is a command from the dplyr() package within tidyverse. You can guess what it is for.

nasty.format<-read_csv(here("data","nasty_format.csv"))

Notice the use here of the package here(). Since we are working with a Project, the working directory is by default the top level of the project. That is where here() thinks of as ‘here’, so to speak. From that start point it can then be told to look down through whatever heirarchy of subfolders it needs to in order to find the data file you want to read into R.

The object nasty.format is a data frame (sometimes now called a tibble, but I don’t think that new name has caught on yet) - R’s word for a container that you can best think of as a spreadsheet. It has columns of data, where the data can be almost anything, including text, dates, whole numbers, decimal numbers, and so on.

Inspect the data using glimpse().

Having read your data into R, it is a good idea to inspect it in various ways. I habitually use the glimpse() command immediately.

glimpse(nasty.format)
## Rows: 37
## Columns: 11
## $ Species   <chr> "P.caudatum", "P.caudatum", "P.caudatum", "P.caudatum", "P.…
## $ Bottle    <chr> "7-P.c", "8-P.c", "9-P.c", "22-P.c", "23-P.c", "24-P.c", "3…
## $ Temp      <dbl> 22, 22, 22, 20, 20, 20, 15, 15, 15, 22, 22, 22, 20, 20, 20,…
## $ `1.2.13`  <dbl> 100.0, 62.5, 75.0, 75.0, 50.0, 87.5, 75.0, 50.0, 75.0, 37.5…
## $ `2.2.13`  <dbl> 58.8, 71.3, 72.5, 73.8, NA, NA, NA, NA, NA, 52.5, 63.7, 70.…
## $ `3.2.13`  <dbl> 67.5, 67.5, 62.3, 76.3, 81.3, 62.5, 90.0, 78.8, 78.3, 23.8,…
## $ `4.2.13`  <dbl> 6.80, 7.90, 7.90, 31.30, 32.50, 28.80, 72.50, 92.50, 77.50,…
## $ `6.2.13`  <dbl> 0.93, 0.90, 0.88, 3.12, 3.75, 3.12, 102.50, 68.75, 116.20, …
## $ `8.2.13`  <dbl> 0.39, 0.36, 0.25, 1.01, 1.06, 1.00, 67.50, 72.50, 60.00, 0.…
## $ `10.2.13` <dbl> 0.19, 0.16, 0.23, 0.56, 0.49, 0.41, 37.50, 52.50, 60.00, 0.…
## $ `12.2.13` <dbl> 0.46, 0.34, 0.31, 0.50, 0.38, 0.46, 41.25, 45.00, 36.25, 0.…

What does glimpse() tell you about a data set? How many rows are there? How many columns? What are the columns called? What kind of data do they contain?

Some other ways to Inspect your data

Other useful ‘quick look’ commands are:

head(nasty.format)
names(nasty.format)
##  [1] "Species" "Bottle"  "Temp"    "1.2.13"  "2.2.13"  "3.2.13"  "4.2.13" 
##  [8] "6.2.13"  "8.2.13"  "10.2.13" "12.2.13"
summary(nasty.format)
##    Species             Bottle               Temp        1.2.13      
##  Length:37          Length:37          Min.   :15   Min.   : 25.00  
##  Class :character   Class :character   1st Qu.:15   1st Qu.: 50.00  
##  Mode  :character   Mode  :character   Median :20   Median : 75.00  
##                                        Mean   :19   Mean   : 79.17  
##                                        3rd Qu.:22   3rd Qu.:100.00  
##                                        Max.   :22   Max.   :150.00  
##                                        NA's   :1    NA's   :1       
##      2.2.13          3.2.13           4.2.13           6.2.13       
##  Min.   :52.50   Min.   : 23.80   Min.   :  1.13   Min.   :  0.760  
##  1st Qu.:63.77   1st Qu.: 66.88   1st Qu.: 23.57   1st Qu.:  2.603  
##  Median :71.90   Median : 77.90   Median : 72.50   Median : 43.125  
##  Mean   :71.51   Mean   : 74.30   Mean   : 57.59   Mean   : 40.828  
##  3rd Qu.:78.12   3rd Qu.: 85.00   3rd Qu.: 83.45   3rd Qu.: 70.000  
##  Max.   :97.50   Max.   :108.80   Max.   :111.30   Max.   :116.200  
##  NA's   :17      NA's   :1        NA's   :1        NA's   :1        
##      8.2.13          10.2.13           12.2.13      
##  Min.   : 0.250   Min.   : 0.1600   Min.   :  0.30  
##  1st Qu.: 1.048   1st Qu.: 0.5475   1st Qu.:  0.50  
##  Median :26.685   Median :16.2500   Median : 15.19  
##  Mean   :34.403   Mean   :28.0172   Mean   : 30.33  
##  3rd Qu.:69.062   3rd Qu.:60.0000   3rd Qu.: 50.94  
##  Max.   :91.250   Max.   :91.3000   Max.   :102.50  
##  NA's   :1        NA's   :1         NA's   :1

Clean up the data

Almost always, with real data, there is some cleaning up to do before you can analyse it. Click on the text of ‘nasty.format’ in the Environment pane. What do you see?

What R done with the empty cells that we noticed when we first looked at the data?

We notice in particular that there is an empty row. Empty rows and columns sometimes appear in R data frames when we read data in from Excel. I am not sure why, but it is becuase of something we have done in Excel. To get rid of them we can either go to our Excel file, selct just the cells that contain our data paste them into a new spreadsheet and save that as a .csv file, or we can do it in R.

Here, we will do it in R.

Get rid of the empty row at the bottom - use filter(). Save the new object under the same name nasty.format

nasty.format<-filter(nasty.format,!is.na(Bottle))

filter() is a command from dplyr() that is used to remove certain rows from a data frame based on one or more criteria being satisfied. In this cae we looked at what was unique about the bottom row

The format is filter(data_frame name,condition 1, condition 2,....). In our case we have removed all rows where the value in the Bottle column is not (!= means ‘is not equal to’) an NA

is.na()

In the console window, try this:

is.na(NA)
## [1] TRUE
!is.na(NA)
## [1] FALSE
is.na(3)
## [1] FALSE
is.na("Hello Eden")
## [1] FALSE

The data is not tidy. Why not?

Remove one of the columns

We notice in looking at the data that one of the dates, 2.2.13, seems to be problematic. There are many NAs on that date. We have a number of choices here, but suppose we decide to ignore data from that date, perhaps becuase we realise from our log book that our procedures were faulty on that day.

We can remove columns that meet some criterion using the select() command from dplyr():

nasty.format<-select(nasty.format,-"2.2.13")

The minus sign here has the effect of selecting all columns except the problematic one for the data 2.2.13.

Tidy the data

We can either do this using:

Method One - tidy your data in Excel

I don’t recommend it, but you can tidy data in Excel. With simple data sets this is straightforward - but it gets trickier and more tedious with larger data sets. . Your goal is to put the data into ‘long and thin’ form in which each variable appears only once in each row.
You should have five columns with names: “Species”, “Bottle”, “Temp”, “Date” and “Abundance” No cells should be empty.This means you may have to fill down and copies some values into other cells.

Once you have done this, save the data as a .csv file called ‘nice_format.csv’

Now load this data into R, into an object called tidy.data, like this:

tidy.data<-read_csv("../data/nice_format.csv")
## Parsed with column specification:
## cols(
##   Species = col_character(),
##   Bottle = col_character(),
##   Temp = col_double(),
##   `1.2.13` = col_double(),
##   `2.2.13` = col_double(),
##   `3.2.13` = col_double(),
##   `4.2.13` = col_double(),
##   `6.2.13` = col_double(),
##   `8.2.13` = col_double(),
##   `10.2.13` = col_double(),
##   `12.2.13` = col_double()
## )

or using…

Method Two - Use the gather() function from dplyr()

As you get used to R you will discover that there is a command in the dplyr() package (which is bundled in as part of the tidyverse() package), that will tidy data for you. It is called gather().
Using gather(), store the tidy data in an object called tidy.data

tidy.data<-gather(nasty.format,Date,Abundance,4:10)
glimpse(tidy.data)
## Rows: 252
## Columns: 5
## $ Species   <chr> "P.caudatum", "P.caudatum", "P.caudatum", "P.caudatum", "P.…
## $ Bottle    <chr> "7-P.c", "8-P.c", "9-P.c", "22-P.c", "23-P.c", "24-P.c", "3…
## $ Temp      <dbl> 22, 22, 22, 20, 20, 20, 15, 15, 15, 22, 22, 22, 20, 20, 20,…
## $ Date      <chr> "1.2.13", "1.2.13", "1.2.13", "1.2.13", "1.2.13", "1.2.13",…
## $ Abundance <dbl> 100.0, 62.5, 75.0, 75.0, 50.0, 87.5, 75.0, 50.0, 75.0, 37.5…

The data should now be tidy.

Is it?

Clean the dates

Use mutate() and a command from the lubridate() package: dmy(). Why dmy(), do you think?

# if you have not already installed lubridate on your machine,  you need to do it now.
# Type this into the Console window:
install.packages("lubridate")

mutate() is one of several very useful commands within dplyr() that is used to alter data within columns, or create new ones. Here we use it together with dmy() to alter the format of the data in the Date column.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
tidy.data<-mutate(tidy.data,Date=dmy(Date))
glimpse(tidy.data)
## Rows: 252
## Columns: 5
## $ Species   <chr> "P.caudatum", "P.caudatum", "P.caudatum", "P.caudatum", "P.…
## $ Bottle    <chr> "7-P.c", "8-P.c", "9-P.c", "22-P.c", "23-P.c", "24-P.c", "3…
## $ Temp      <dbl> 22, 22, 22, 20, 20, 20, 15, 15, 15, 22, 22, 22, 20, 20, 20,…
## $ Date      <date> 2013-02-01, 2013-02-01, 2013-02-01, 2013-02-01, 2013-02-01…
## $ Abundance <dbl> 100.0, 62.5, 75.0, 75.0, 50.0, 87.5, 75.0, 50.0, 75.0, 37.5…

What do you notice about the data type of the dates?

Grouping the data

This next chunk of code is very useful. It shows how we can use a pair of commands from dplyr() : group_by() and summarise() to, (guess what?) group and summarise our data set.

It also illustrates the use of the pipe %>% which takes the result of each line and feeds it into the next. You can read it as saying, roughly speaking, “and then”. If you get the hang of it it makes for neater code and avoids you having to create all sorts of intermediate objects on your way to some goal.

Let us find the mean and standard deviation of the abundances forund for each species of bacteria and save them to a new data frame called bacteria_summary:

bacteria_summary<-tidy.data %>%
  group_by(Species) %>%
  summarise(mean=mean(Abundance),st.dev=sd(Abundance))
## `summarise()` ungrouping output (override with `.groups` argument)
bacteria_summary

Now let’s plot the data

We won’t learn properly how to do this here, but theis next part is just included as a montivation for getting to grips with ggplo2(), the very powerful plotting package that comes with tidyverse().

Just type in the following code:

library (ggplot2)
ggplot(data=tidy.data,aes(x=Date,y=Abundance))+
  geom_point()+
  facet_wrap(~Bottle)

What is this plot showing you?