1 Recap

Quick refresher on the most important parts from the last session:

1.1 R & RStudio

To access RStudio on LJMU computers, go to the Application Player and search “RStudio”. This course is designed for the RStudio 64 with the circular logo.
On your own computer, you can download RStudio from its downloads page. Remember, it is completely free to anyone.
When RStudio first opens, we won’t have a Script pane, so we need to open a new script: File > New File > R Script.
Often, RStudio will keep your previous session saved, which means that you will probably open the script we created last week.
Code can be stored in the Script pane and run in the Console pane. If we want to run code from the Script pane, we highlight it and press Ctrl+Enter (Cmd+Enter on Mac). This will grab the code from the Script and put it in the Console.
The Workspace pane displays all of our variables and the Viewer pane displays plots and help files.
The best ways to find Help in R:
If you need help with a function, type ? followed by the function, e.g. ?mean into the console and the help file will pop up in the Viewer pane.
You can search Google/Ecosia/Bing for your R-related question. Stack Exchange is a great resource for answers.
Twitter has a great #RStats community for other questions.
The Maths, Stats & IT Team also run frequent drop-in sessions and one-to-ones on the Library Calendar.

1.2 Vectors

R stores data as vectors. The quickest way to create a vector is with the c() function, and a vector can be stored in a variable using the assignment operator (<-), as in x <- c(1,2,3).
We can also nest calls to the c() function, as in y <- c(x,4,5).
Vectors can hold different data types, as long as everything within a single vector is the same type, e.g. x <- c("Hello","World") will create a vector of characters and y <- x == "Hello" will create a vector of logicals.
If different data types are passed into c(), R will convert them to a single type, e.g. x <- c("Hello", "World", 1,2,3) will be a character vector.
We can use [ and ] to pull out subsets of our data. We can ask for a single element x[3], several elements using a numeric vector x[2:3] or a logical vector x[c(T,T,F,T,F)], or tell R which ones we don’t want x[-3].
Subsetting and assignment can be combined to change single elements in a vector, e.g. x[3] <- "To".
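
The points above can be tried out directly in the console. This is only a small sketch; the variable names nums, mixed and words are made up for illustration and are not from last week’s script.

```r
# Create a numeric vector and extend it by nesting c()
nums <- c(1, 2, 3)
nums <- c(nums, 4, 5)

# Mixing types converts everything to a single type (here: character)
mixed <- c("Hello", "World", 1, 2, 3)
class(mixed)

# Subsetting by position, by logical vector, and by exclusion
nums[2:3]
nums[c(TRUE, TRUE, FALSE, TRUE, FALSE)]
nums[-3]

# Subsetting combined with assignment changes a single element
words <- c("Hello", "World", "Too")
words[3] <- "To"
words
```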

1.3 Statistics

We can use the data() function to load up one of the built-in R datasets, e.g. data(iris)
We can describe our data using functions such as mean(), sd() and table()
We can plot data in a basic way using plot()
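
As a reminder of how these fit together, here is a short sketch using the built-in iris data. The variable names (mean_sl, sd_sl, species_counts) are just illustrative.

```r
# Load a built-in dataset (iris ships with R in the datasets package)
data(iris)

# Describe a numeric column
mean_sl <- mean(iris$Sepal.Length)
sd_sl <- sd(iris$Sepal.Length)
mean_sl
sd_sl

# Count how many rows fall into each category
species_counts <- table(iris$Species)
species_counts
```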

1.4 Probabilities

Data is distributed according to a probability density function (pdf) and a cumulative distribution function (cdf).
We can use functions related to these distributions with the dxxx(), pxxx(), qxxx() and rxxx() family of functions (e.g. dnorm(), pnorm(), qnorm() and rnorm() for the normal distribution).
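
For the normal distribution, the four members of this family look like the sketch below. The choice of the standard normal (mean 0, sd 1) and of the particular values is just for illustration.

```r
# Normal distribution with mean 0 and sd 1 (the defaults)
dens <- dnorm(0)      # density (pdf) at 0
prob <- pnorm(1.96)   # cumulative probability (cdf) up to 1.96
quant <- qnorm(0.975) # quantile: the value with 97.5% of the distribution below it

# Random draws; set.seed() makes them reproducible
set.seed(1)
draws <- rnorm(5)
draws
```

Note that pnorm() and qnorm() undo each other: pnorm(qnorm(0.975)) gives back 0.975.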

2 Packages

R is designed as a modular approach to statistical analysis: it is open-source and allows the addition of packages. This means that the version of R you download is the basic version, and we can download packages which contain more advanced functions. Think of it like DLC, but for statistics (and, much like R, it is still free!). These packages are built by other R users and can be downloaded from a repository called CRAN (Comprehensive R Archive Network).

2.1 Installing packages

The first package we’re going to download and install is the tidyverse package (which actually imports/installs a bunch of other packages).

The command to install a new package is:

install.packages("tidyverse",lock=F)

This function will search CRAN for the package called "tidyverse" and install it into our own package repository (located in your Documents folder). Note that this needs to be in quotes as this is a string that R will search for. It also downloads and installs all the other packages that tidyverse uses. As it’s installing, you’ll see it run through all these packages (e.g. broom, ggplot2,tidyr), we’ll get to these later.

If you’re running this on your home computer and have full admin rights, then you can omit the lock=F argument. However, on university-run computers, we can hit annoying errors without it.

You can see the currently installed packages in the Viewer pane under the Packages tab. If you click on the name of a package in here, it will bring up a Help file (still in the Viewer pane) for that package, including a list of functions contained in that package (which link to their Help files). Some packages include links to vignettes/documentation regarding the package.

2.2 Documentation

As well as storing the actual package files, CRAN also stores documentation about the package. This includes the Help files that get installed along with the packages. CRAN homepages for packages are standardised and all look the same with (roughly) the same information.

Let’s have a look at the CRAN page for the tidyverse.

Most of what we see here is for fairly advanced R users, so we’re going to ignore a lot of it. At the top, you can see the name of the package and a quick blurb about what it’s for. There are also a bunch of links to other packages that the current package uses. Under the Downloads heading, you can see the Reference manual and the Vignettes.

The Reference manual for a package contains a list of all of the functions that are in that package, and information about their use. This is the same information that is contained in the ? Help files that are downloaded into R. This can be useful if you can’t remember the name of a function, but you know it’s in a certain package.

The Vignettes are more of a user guide. They’re usually written by the package author and contain information on how to use the package, kind of like an instruction manual. Packages don’t always have vignettes, but when they do, I highly recommend reading them.

2.3 Loading packages

So, we’ve installed a package, but we’re not actually using it. Once a package has been installed it is stored on our system and we only need to do this once. But, we also need to load the package into our current R session. (Note that this time, we don’t need quotation marks)

library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

We now have access to all of the functions contained within the core bundle of tidyverse packages (listed in the console under Attaching packages). This means that by installing and loading the tidyverse package, we’ve actually installed and loaded these other packages (e.g. ggplot2 and dplyr). Let’s get started!

3 The Pipe

One of the most important changes that comes with the tidyverse is the addition of the Pipe: %>%. It’s an operator from the magrittr package that is made available when we load dplyr (and so the tidyverse), and it allows us to completely change how we use functions. It can make our code much easier to read, particularly when writing complicated series of functions. For example:

#Written as operators:
x <- (((2+1)*4)/6)

#Written as functions (these come from the magrittr package, which we need to load separately):
library(magrittr)
x <- divide_by(multiply_by(add(2,1),4),6)

#Spread over multiple lines
x <- 2
x <- add(x,1)
x <- multiply_by(x,4)
x <- divide_by(x,6)

#Written with pipes:
x <- 2 %>%
  add(1) %>%
  multiply_by(4) %>%
  divide_by(6)

The functions in the second version perform the same actions as the operators in the first (e.g. multiply_by(a,b) does the same as a*b). If this is all we’re doing, using these functions would seem silly (just use the operators). But if we’re using more complicated functions (which we will learn about soon), the pipe version is much neater and easier to read.

In the single line version, we are working from the inside out. This gets easier to read in the version where we spread it across multiple lines, but it’s prone to mistakes. What if we type z instead of x in a function call or an assignment call? The piped version gets rid of this risk and duplicated code (x <- on every line). I’ve chosen to spread my command over multiple lines to make it easier to read and to indent each new line (RStudio actually does this automatically for you), this makes my code look neater and much easier to read.

What the pipe does is allow us to pipe our answer into the next function and chain them together. This changes our syntax a little bit:

f(a,b,c,d)
a %>% f(b,c,d)

The shortcut to quickly enter this in RStudio is Ctrl + Shift + M (Cmd + Shift + M on Mac).
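
To see the f(a,b,c,d) versus a %>% f(b,c,d) rewrite with a real function, here is a minimal sketch using round(). It loads magrittr directly so that it runs on its own; the values are made up.

```r
library(magrittr)  # provides %>%; installed alongside the tidyverse

values <- c(3.14159, 2.71828)

# Ordinary call: round(x, digits)
rounded_call <- round(values, 2)

# Piped call: the left-hand side becomes the first argument
rounded_pipe <- values %>% round(2)

# Both give exactly the same result
identical(rounded_call, rounded_pipe)
```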

4 Loading Data

I’ve uploaded a messy dataset online for us to use today. You can either download it and save it somewhere in your system, or you can load it straight from the URL. It’s in a csv format, so we use the following command:

dat <- read_csv("https://raw.githubusercontent.com/MyKo101/RJunk/master/data_anthro.csv")

## Parsed with column specification:
## cols(
##   ID = col_character(),
##   DOB = col_character(),
##   Gender = col_character(),
##   Ht = col_character(),
##   Wgt = col_character()
## )

This read_csv() function is contained in the readr package. There is a built in function in R that allows us to bring in csv files, but it’s messier. We’re in the tidyverse now!

This has given a bit of an output. read_csv() tries to guess what data types the columns/variables within this dataset are, and prints a message telling us what it chose. We can suppress this message with the argument col_types = cols(), or we can define what we want them to be by passing the data types as a single string with one letter per column (d=double, c=character, l=logical), e.g. col_types = "ccccc".
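
Here is a small sketch of setting the column types by hand. Rather than downloading anything, it reads a tiny made-up CSV supplied as text (read_csv() treats a string containing line breaks as the data itself); the names csv_text and small are hypothetical.

```r
library(readr)

# A tiny made-up CSV supplied as literal text (a string containing newlines)
csv_text <- "ID,Ht\n001,5 6\n002,5 9\n"

# One letter per column: "c" = character, "d" = double, "l" = logical
small <- read_csv(csv_text, col_types = "cc")
small
```

Because ID is read as a character here, the leading zeros in 001 and 002 are kept.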

Let’s have a look at this data. Previously, we needed to use the head() function to just look at the top few rows. But that was when we were using data.frame structure from the basic R package. read_csv() loads the data and stores it in a tibble format. This is essentially the same as a data.frame (and can be used in the same way), but provides a bit more information and consistency and only prints the first few rows:

dat

## # A tibble: 150 x 5
##    ID    DOB        Gender Ht    Wgt  
##    <chr> <chr>      <chr>  <chr> <chr>
##  1 001   22/03/1965 m      5 6   7 1  
##  2 002   17/04/1977 f      5 9   11 0 
##  3 003   24/07/1966 f      5 4   9 8  
##  4 004   22/02/1985 Female 5 7   11 1 
##  5 005   30/07/1985 M      6 0   10 10
##  6 006   09/07/1987 male   6 2   11 12
##  7 007   23/01/1995 female 5 10  10 2 
##  8 008   31/12/1992 M      6 3   12 1 
##  9 009   24/01/1952 Male   5 4   7 10 
## 10 010   11/06/1953 F      5 2   8 6  
## # … with 140 more rows

R now tells us that the tibble has 150 rows and 5 columns, and as well as giving us the names of the variables in the data, it also tells us their types (<dbl> or <chr>) and at the bottom, we’re told that there are 140 more rows. If we had a lot of columns in this data, R would only show us the first few columns and tell us at the bottom what the extra columns are (names and data types).

4.1 Messy Data!

This data is quite messy. There are some things we need to fix. First thing to do would be to list out everything we need to fix.

I would prefer ID to be a number (although this is not always true)
DOB is a character. This is supposed to be a date. Can R store dates? OF COURSE IT CAN!
The next two columns are Ht and Wgt, and are supposed to be Height and Weight. We’ll fix those names
They’re also inconsistent (cm and ft/in in Height), we’ll have to fix that too!
Gender is also inconsistent: some entries are words (male/female) and others are lower/uppercase single letters. We need to standardise this.

5 Changing our data

We could edit this data in Excel and go through one cell at a time, but if we have a big dataset, this is going to be difficult. Let’s get R to do it for us.

5.1 The mutate() function

The dplyr package provides us with the mutate() function, which is incredibly useful for editing data as we go, and is designed to be used with the pipe. Let’s use it, along with the as.numeric() function to convert the ID variable into a number.

dat %>% mutate(ID = as.numeric(ID))

## # A tibble: 150 x 5
##       ID DOB        Gender Ht    Wgt  
##    <dbl> <chr>      <chr>  <chr> <chr>
##  1     1 22/03/1965 m      5 6   7 1  
##  2     2 17/04/1977 f      5 9   11 0 
##  3     3 24/07/1966 f      5 4   9 8  
##  4     4 22/02/1985 Female 5 7   11 1 
##  5     5 30/07/1985 M      6 0   10 10
##  6     6 09/07/1987 male   6 2   11 12
##  7     7 23/01/1995 female 5 10  10 2 
##  8     8 31/12/1992 M      6 3   12 1 
##  9     9 24/01/1952 Male   5 4   7 10 
## 10    10 11/06/1953 F      5 2   8 6  
## # … with 140 more rows

Exactly as we needed, ID is now a number (double). The mutate() function takes in a tibble as its first argument (remember, this is what the pipe is doing) and changes it depending on whatever other information is passed to it. We can edit more than one variable at a time this way and create new variables. It’s also worth mentioning that R performs these mutations in order and so any changes we make at the start of the mutate() command can be used later.

dat %>%
  mutate(x = runif(150), #Remember this function?
         y = runif(150),
         x = x + y)

## # A tibble: 150 x 7
##    ID    DOB        Gender Ht    Wgt       x      y
##    <chr> <chr>      <chr>  <chr> <chr> <dbl>  <dbl>
##  1 001   22/03/1965 m      5 6   7 1   1.18  0.253 
##  2 002   17/04/1977 f      5 9   11 0  0.610 0.0839
##  3 003   24/07/1966 f      5 4   9 8   0.870 0.0759
##  4 004   22/02/1985 Female 5 7   11 1  1.54  0.756 
##  5 005   30/07/1985 M      6 0   10 10 1.39  0.616 
##  6 006   09/07/1987 male   6 2   11 12 1.30  0.868 
##  7 007   23/01/1995 female 5 10  10 2  1.22  0.943 
##  8 008   31/12/1992 M      6 3   12 1  0.323 0.127 
##  9 009   24/01/1952 Male   5 4   7 10  1.14  0.639 
## 10 010   11/06/1953 F      5 2   8 6   1.09  0.279 
## # … with 140 more rows

Note that this is the same as running mutate(dat,x=runif(150),y=runif(150),x=x+y).

Right now, we’ve not stored the output from the previous mutate() calls, so dat is still as it was when we loaded it, with the ID variable stored as a character.
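
Both points (mutations happen in order, and nothing is saved unless we assign it) can be checked on a tiny example. This is a sketch on a made-up tibble called toy, standing in for dat.

```r
library(dplyr)

# A small made-up tibble standing in for dat
toy <- tibble(ID = c("001", "002", "003"))

# Later expressions can use columns created earlier in the same call
toy_xy <- toy %>%
  mutate(x = c(1, 2, 3),
         y = x * 10,
         x = x + y)
toy_xy

# toy itself is unchanged because we assigned the result to a new name
names(toy)

# Assigning the result keeps the changes
toy_new <- toy %>% mutate(ID = as.numeric(ID))
class(toy_new$ID)
```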

5.2 Neat code

For now, let’s clear out our Script pane to keep ourselves organised, and eliminate the code that we don’t need. Copy this into your Script pane and run it all (Ctrl/Cmd + A will select all the code in the Script pane and Ctrl/Cmd + Enter will run it all)

rm(list=ls())

#Load up my libraries
library(tidyverse)

#Load up my data
dat <- read_csv("https://raw.githubusercontent.com/MyKo101/RJunk/master/data_anthro.csv")

#Edit the dat file
dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) #Turn ID into numeric

dat_clean

I’ve added a function as the first line rm(list=ls()) which will clear out our workspace and get rid of any variables that have been created. I’ve also added annotations with the # symbol. R won’t run anything in a line that is preceded by the # symbol (and it’ll even colour code it as well), this allows you to make notes (either to yourself or others) explaining what your code is actually doing. This isn’t always necessary if it’s obvious, but it can be useful to keep track of things.

Annotations are voluntary. We don’t have to put them in there, but they really help when you’re reading over code that you’ve not used in a while, or you’re sending it to someone else. It’s much easier to annotate and make your code as readable as possible than it is to stand there explaining your code to someone who has never seen it before.

5.3 The rename() function

The dplyr package also provides us with the rename() function, which allows us to rename some of the variables. This can make it easier to interpret what the variables are. Coming up with clear and concise names is very good practice for budding coders.

We’re going to add this to our dat editing call

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) #Rename some variables

dat_clean

## # A tibble: 150 x 5
##       ID DOB        Gender Height Weight
##    <dbl> <chr>      <chr>  <chr>  <chr> 
##  1     1 22/03/1965 m      5 6    7 1   
##  2     2 17/04/1977 f      5 9    11 0  
##  3     3 24/07/1966 f      5 4    9 8   
##  4     4 22/02/1985 Female 5 7    11 1  
##  5     5 30/07/1985 M      6 0    10 10 
##  6     6 09/07/1987 male   6 2    11 12 
##  7     7 23/01/1995 female 5 10   10 2  
##  8     8 31/12/1992 M      6 3    12 1  
##  9     9 24/01/1952 Male   5 4    7 10  
## 10    10 11/06/1953 F      5 2    8 6   
## # … with 140 more rows

The syntax here is just new-name = old-name. Now the variables Ht and Wgt are completely gone from dat_clean, and we have new variables Height and Weight in their place (the original dat is untouched).

That’s two out of our five problems solved.
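
The rename() syntax can be checked on a small made-up tibble (toy and toy_renamed are hypothetical names used only for this sketch):

```r
library(dplyr)

toy <- tibble(Ht = c("5 6", "5 9"), Wgt = c("7 1", "11 0"))

# Syntax is new_name = old_name
toy_renamed <- toy %>% rename(Height = Ht, Weight = Wgt)

names(toy)          # the original is untouched
names(toy_renamed)  # the renamed copy
```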

5.4 The recode() function

The Gender variable is still a bit of mess. Let’s take a look. This time, we’re just going to put this in the console.

table(dat$Gender) #Remember the table() function from last week?

## 
##          f          F     female     Female          m          M       male 
##         13         26         18         18         12         19         22 
##       Male Non-Binary 
##         18          4

We as humans can tell what each of these entries means. Some of them mean Male and some mean Female. But R can’t discern that. We’re gonna lump them together by converting all the relevant values into “Male” or “Female”. We can do this using the recode() function inside the mutate() function.

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = Gender %>% recode(m = "Male", male = "Male", M = "Male", #Tidy up Gender
                                    f = "Female", female = "Female", F = "Female")) 

dat_clean

## # A tibble: 150 x 5
##       ID DOB        Gender Height Weight
##    <dbl> <chr>      <chr>  <chr>  <chr> 
##  1     1 22/03/1965 Male   5 6    7 1   
##  2     2 17/04/1977 Female 5 9    11 0  
##  3     3 24/07/1966 Female 5 4    9 8   
##  4     4 22/02/1985 Female 5 7    11 1  
##  5     5 30/07/1985 Male   6 0    10 10 
##  6     6 09/07/1987 Male   6 2    11 12 
##  7     7 23/01/1995 Female 5 10   10 2  
##  8     8 31/12/1992 Male   6 3    12 1  
##  9     9 24/01/1952 Male   5 4    7 10  
## 10    10 11/06/1953 Female 5 2    8 6   
## # … with 140 more rows

The first argument of recode() is the vector that we are recoding, Gender, which here we have piped in. Since we’re only performing the one function, the pipe isn’t strictly needed (recode(Gender, ...) would do the same), but it’s still pretty neat as it is (thanks to the indentation-oriented formatting). Confusingly, this is written backwards to the rename() function, so we need old-value = new-value, and the new value has to be in quotes (because it is a string), but other than that, it’s pretty straightforward.

If we’re recoding a string that has spaces or that starts with a number, we need to put them in between backticks. You’ll be used to the single quote or apostrophe (’) and double quote ("), but in R, we sometimes need to use the backtick as a quotation mark (`). The backtick is usually located to the left of the 1 button on your keyboard (yes, that button is finally useful!)

x <- c("One","Two","3","Fo ur")
recode(x,`3` = "Three",`Fo ur` = "Four")

## [1] "One"   "Two"   "Three" "Four"

Technically, the recode() function can be used with numeric vectors rather than character vectors, but it is much easier not to (and there are other alternatives).
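
For completeness, here is a sketch of what recoding a numeric vector looks like, along with recode()’s .default argument, which supplies a value for anything we haven’t listed. The vectors here are made up.

```r
library(dplyr)

codes <- c(1, 2, 3, 2)

# Numeric values are matched by name, so they need backticks
code_labels <- recode(codes, `1` = "One", `2` = "Two", `3` = "Three")
code_labels

# .default supplies a replacement for anything not listed
gender_labels <- recode(c("m", "f", "x"), m = "Male", f = "Female", .default = "Other")
gender_labels
```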

5.5 The if_else() function

The recode() function is pretty useful for a lot of circumstances, but it still looks a little clunky in our code. Previously, we learned about creating branches of code with the if and else statements. We can do a similar thing using a function.

x <- c("One","Two","3","Four")
if_else(x == "3","Three",x)

## [1] "One"   "Two"   "Three" "Four"

For our vector, it evaluates the first argument (x == "3", which gives a logical vector) and then, element by element, uses the second argument where it’s TRUE and the third argument where it’s FALSE. A lot of the time, the third argument will just be the original vector (so we only change it if the logical statement is TRUE).
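
Two more small sketches of the same idea, on made-up vectors: one where both outcomes are fixed values, and one where we only change the matching elements.

```r
library(dplyr)

heights_m <- c(1.52, 1.80, 1.95)

# condition, value if TRUE, value if FALSE
size <- if_else(heights_m > 1.75, "tall", "not tall")
size

# Only change the elements that match; keep the rest as they are
labels <- c("m", "Female", "M")
fixed <- if_else(labels == "m", "Male", labels)
fixed
```

Notice that "M" is left alone in the second call because == is case-sensitive.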

We’re also going to introduce a new operator, the %in% operator. Remember operators have something on the left and something on the right. The %in% operator checks if the things in the vector on the left are in the vector on the right.

x <- c(1,2,3,4,5,6)
x %in% c(2,4,6)

## [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

For every element in x, we check whether it is in the vector on the right. If it is, it returns TRUE, otherwise, it returns FALSE. We can use this instead of the previous recode

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = if_else(Gender %in% c("m","M","male"),"Male",Gender), #Tidy up Gender
         Gender = if_else(Gender %in% c("f","F","female"),"Female",Gender)) 

dat_clean

## # A tibble: 150 x 5
##       ID DOB        Gender Height Weight
##    <dbl> <chr>      <chr>  <chr>  <chr> 
##  1     1 22/03/1965 Male   5 6    7 1   
##  2     2 17/04/1977 Female 5 9    11 0  
##  3     3 24/07/1966 Female 5 4    9 8   
##  4     4 22/02/1985 Female 5 7    11 1  
##  5     5 30/07/1985 Male   6 0    10 10 
##  6     6 09/07/1987 Male   6 2    11 12 
##  7     7 23/01/1995 Female 5 10   10 2  
##  8     8 31/12/1992 Male   6 3    12 1  
##  9     9 24/01/1952 Male   5 4    7 10  
## 10    10 11/06/1953 Female 5 2    8 6   
## # … with 140 more rows

There are plenty of situations where recode() is better than if_else(). Here, since we want to turn a lot of values into a few simpler ones, the if_else() function is better. If we’re just wanting to rename the values in our variable, then recode() works better. For example, if all the values were either “M”, “F” or “N” and we want them to be “Male”, “Female” or “Non-Binary”, then recode() would be the better option.

But remember that “Better” is often subjective. These commands do the same thing, but whichever you understand better should be the one you use!

5.6 The case_when() function

We’ve converted the recode() function into a pair of if_else() functions, but, we’re still having to perform the if_else() twice. We could neaten it up a little using pipes:

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = if_else(Gender %in% c("m","M","male"),"Male",Gender) %>% #Tidy up Gender
           if_else(. %in% c("f","F","female"),"Female",.)) 

dat_clean

## # A tibble: 150 x 5
##       ID DOB        Gender Height Weight
##    <dbl> <chr>      <chr>  <chr>  <chr> 
##  1     1 22/03/1965 Male   5 6    7 1   
##  2     2 17/04/1977 Female 5 9    11 0  
##  3     3 24/07/1966 Female 5 4    9 8   
##  4     4 22/02/1985 Female 5 7    11 1  
##  5     5 30/07/1985 Male   6 0    10 10 
##  6     6 09/07/1987 Male   6 2    11 12 
##  7     7 23/01/1995 Female 5 10   10 2  
##  8     8 31/12/1992 Male   6 3    12 1  
##  9     9 24/01/1952 Male   5 4    7 10  
## 10    10 11/06/1953 Female 5 2    8 6   
## # … with 140 more rows

Here, we’ve used a . to represent the data being piped in. Ordinarily, the pipe will put the data into the first argument of the function, but we can override this by using the . as a placeholder.

In this example, we’ve chained together a pair of if_else() functions. Not too bad when we’re just doing two, but if we needed to look at more cases, this could get complicated. We can use the case_when() function for this:

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = case_when(Gender %in% c("m","M","male") ~ "Male",    #Tidy up Gender
                            Gender %in% c("f","F","female") ~ "Female",
                            T ~ Gender))

dat_clean

## # A tibble: 150 x 5
##       ID DOB        Gender Height Weight
##    <dbl> <chr>      <chr>  <chr>  <chr> 
##  1     1 22/03/1965 Male   5 6    7 1   
##  2     2 17/04/1977 Female 5 9    11 0  
##  3     3 24/07/1966 Female 5 4    9 8   
##  4     4 22/02/1985 Female 5 7    11 1  
##  5     5 30/07/1985 Male   6 0    10 10 
##  6     6 09/07/1987 Male   6 2    11 12 
##  7     7 23/01/1995 Female 5 10   10 2  
##  8     8 31/12/1992 Male   6 3    12 1  
##  9     9 24/01/1952 Male   5 4    7 10  
## 10    10 11/06/1953 Female 5 2    8 6   
## # … with 140 more rows

Each argument in the case_when() function is a logical condition, just like in the if_else() function, followed by the ~ symbol, and then the result that we want to use in that case. R evaluates each of the conditions in order. This means that if two conditions are TRUE (for a given element of your data), R will return the result for the first one. This is also why we finish with T ~ Gender to give the case_when() function a default option: if none of the other conditions are TRUE, then this last one is definitely (and literally) TRUE.

x <- 1:10
case_when(x < 3 ~ "small", #This is true for 1 & 2
          x < 7 ~ "medium", #This is true for 1, 2, 3, 4, 5 and 6
          T ~ "large") #Everything else (7, 8, 9 & 10) gets this result

##  [1] "small"  "small"  "medium" "medium" "medium" "medium" "large"  "large" 
##  [9] "large"  "large"
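
To see why the order matters, here is a sketch of the same example with the first two conditions swapped. The name wrong_order is just for illustration.

```r
library(dplyr)

x <- 1:10

# Broad condition first: x < 7 catches 1 and 2 before x < 3 is ever reached
wrong_order <- case_when(x < 7 ~ "medium",
                         x < 3 ~ "small",
                         TRUE ~ "large")
wrong_order
```

Nothing is ever labelled "small" here, so it is usually safest to put the most specific conditions first.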

For our purposes, we can use any of the previous options: recode(), if_else() or case_when(). It’s up to personal choice and style. Which do you find the easiest to understand? Which makes the most sense and is clearest to you? Whichever one it is, keep that one. We’re going to look at one final way to do this soon.

5.7 Boolean Logic

The last few bits involved evaluating a conditional statement. Sometimes, we need to check a few different statements at once. We might need them both to be true, or just one of them, or neither of them. We can use operators to combine logical vectors (just like we use operators for numeric vectors, e.g. c(5,4) + c(3,2)).

x <- c(T,T,F,F)
y <- c(T,F,T,F)

x & y #AND operator, both the left and right must be TRUE

## [1]  TRUE FALSE FALSE FALSE

x | y #OR operator, either the left or the right need to be TRUE

## [1]  TRUE  TRUE  TRUE FALSE

xor(x,y) #Exclusive OR, only one can be TRUE

## [1] FALSE  TRUE  TRUE FALSE

Obviously the last one, xor() is a function and not an operator, but it can still come in handy for our if_else(), case_when() and even just the plain old if() statements.
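
In practice, the logical vectors usually come from comparisons rather than being typed in by hand. Here is a sketch on made-up values (gender and height_m are hypothetical); it also uses !, which flips TRUE and FALSE.

```r
# Made-up values for illustration
gender <- c("Male", "Female", "Male", "Non-Binary")
height_m <- c(1.80, 1.60, 1.65, 1.75)

# Both conditions must hold
tall_male <- gender == "Male" & height_m > 1.70
tall_male

# At least one condition must hold
female_or_tall <- gender == "Female" | height_m > 1.70
female_or_tall

# ! negates a logical vector
not_male <- !(gender == "Male")
not_male
```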

5.8 The factor() function

In data terms, we’d say that the Gender variable is categorical. Since we have categorical data, it makes sense to store it in our tibble as such. Let’s do that!

Underneath, R stores a factor as a number with an associated vector of levels, which is (usually) a character.

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = case_when(Gender %in% c("m","M","male") ~ "Male",
                            Gender %in% c("f","F","female") ~ "Female",
                            T ~ Gender), #Tidy up Gender
         Gender = factor(Gender))

dat_clean

## # A tibble: 150 x 5
##       ID DOB        Gender Height Weight
##    <dbl> <chr>      <fct>  <chr>  <chr> 
##  1     1 22/03/1965 Male   5 6    7 1   
##  2     2 17/04/1977 Female 5 9    11 0  
##  3     3 24/07/1966 Female 5 4    9 8   
##  4     4 22/02/1985 Female 5 7    11 1  
##  5     5 30/07/1985 Male   6 0    10 10 
##  6     6 09/07/1987 Male   6 2    11 12 
##  7     7 23/01/1995 Female 5 10   10 2  
##  8     8 31/12/1992 Male   6 3    12 1  
##  9     9 24/01/1952 Male   5 4    7 10  
## 10    10 11/06/1953 Female 5 2    8 6   
## # … with 140 more rows

The data type of Gender has switched to <fct>, which means it’s a factor. If we have a look at it, it’ll tell us the levels of that factor (i.e. what are all the possible values)

head(dat_clean$Gender)

## [1] Male   Female Female Female Male   Male  
## Levels: Female Male Non-Binary

head(as.numeric(dat_clean$Gender)) #This is what R sees

## [1] 2 1 1 1 2 2

levels(dat_clean$Gender) #This is what we see

## [1] "Female"     "Male"       "Non-Binary"

In this version, “Female” has been chosen to be the first level. This means if we run any analysis, “Female” will be used as the reference category. This is because factor() sorts the levels alphabetically by default, and R treats the first level as the reference. A lot of functions that can take a factor as an argument will automatically convert a character into a factor, and again, just use the first level (alphabetically) as the reference.

This might not always be what we want, so it’s better to set the reference category explicitly rather than rely on the alphabet. We’re going to make sure the “Female” category is our reference category using the relevel() function:

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = case_when(Gender %in% c("m","M","male") ~ "Male",
                            Gender %in% c("f","F","female") ~ "Female",
                            T ~ Gender) %>% #Tidy up Gender
           factor %>%
           relevel("Female"))

levels(dat_clean$Gender)

## [1] "Female"     "Male"       "Non-Binary"
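
Because “Female” already comes first alphabetically, the output above doesn’t change. To see relevel() actually move something, here is a sketch on a small made-up factor (g, g_male_ref and g_manual are hypothetical names):

```r
# factor() sorts the levels alphabetically by default
g <- factor(c("Male", "Female", "Non-Binary", "Male"))
levels(g)

# relevel() moves the chosen level to the front, making it the reference
g_male_ref <- relevel(g, ref = "Male")
levels(g_male_ref)

# The levels can also be given in full when the factor is created
g_manual <- factor(c("Male", "Female", "Non-Binary", "Male"),
                   levels = c("Non-Binary", "Female", "Male"))
levels(g_manual)
```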

The forcats package in the tidyverse provides a bunch of ways to re-order factors depending on the other data in your dataset. For example, you can use fct_infreq() to set the levels to be in order depending on how many of each category there are (i.e. biggest category becomes the reference category), or you can set them manually using fct_relevel().
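
A quick sketch of two forcats re-ordering functions, fct_infreq() and fct_relevel(), on a made-up factor f:

```r
library(forcats)

f <- factor(c("a", "b", "b", "c", "c", "c"))

# Order levels by how often they occur (most common first)
by_freq <- fct_infreq(f)
levels(by_freq)

# Move one or more levels to the front by hand
b_first <- fct_relevel(f, "b")
levels(b_first)
```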

Above, we used the recode(), if_else() and case_when() methods of re-organising the Gender variable, but forcats has its own way. We can restructure Gender and turn it into a factor in one go:

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = fct_collapse(Gender, #Tidy up Gender
                               Male = c("m","male","M"),
                               Female = c("f","female","F")))

levels(dat_clean$Gender)

## [1] "Female"     "Male"       "Non-Binary"

The only thing to be wary of with this method is that the levels we are collapsing should exist in the factor, e.g. if we had put Male = c("m","male","M","Mal"), R would give a warning about unknown levels because there is no Mal value in the Gender variable.

5.9 The separate() function

The Height & Weight variables are still in an unusual form. From inspection, we can see that they’re written in imperial measures: feet & inches and stone & lbs. For Height, the feet & inches are separated by a space, so we can use this to split the vector into two:

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = fct_collapse(Gender, #Tidy up Gender
                               Male = c("m","male","M"),
                               Female = c("f","female","F"))) %>%
  separate(Height,c("Feet","Inches"),sep=" ") #Split Height up

dat_clean

## # A tibble: 150 x 6
##       ID DOB        Gender Feet  Inches Weight
##    <dbl> <chr>      <fct>  <chr> <chr>  <chr> 
##  1     1 22/03/1965 Male   5     6      7 1   
##  2     2 17/04/1977 Female 5     9      11 0  
##  3     3 24/07/1966 Female 5     4      9 8   
##  4     4 22/02/1985 Female 5     7      11 1  
##  5     5 30/07/1985 Male   6     0      10 10 
##  6     6 09/07/1987 Male   6     2      11 12 
##  7     7 23/01/1995 Female 5     10     10 2  
##  8     8 31/12/1992 Male   6     3      12 1  
##  9     9 24/01/1952 Male   5     4      7 10  
## 10    10 11/06/1953 Female 5     2      8 6   
## # … with 140 more rows

This removes the Height variable and replaces it with two new ones, the Feet and Inches variables. We can now combine these back together by converting them into metres using the following conversion: 1 foot = 0.3048 m and 1 inch = 0.0254 m.
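
As a quick check of the conversion on a single made-up measurement (5 ft 6 in), before applying it to the whole column:

```r
feet <- 5
inches <- 6

# 1 foot = 0.3048 m and 1 inch = 0.0254 m
height_m <- 0.3048 * feet + 0.0254 * inches
height_m
```

That is 1.524 m from the feet plus 0.1524 m from the inches, or about 1.68 m in total.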

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = fct_collapse(Gender, #Tidy up Gender
                               Male = c("m","male","M"),
                               Female = c("f","female","F"))) %>%
  separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
  mutate(Height = 0.3048*Feet + 0.0254*Inches) #Convert into metres

## Error in 0.3048 * Feet: non-numeric argument to binary operator

We’ve hit our first error. Feet and Inches are still stored as characters, so we need to convert them into numbers before we can multiply and add them. We could do it this way:

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = fct_collapse(Gender, #Tidy up Gender
                               Male = c("m","male","M"),
                               Female = c("f","female","F"))) %>%
  separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
  mutate(Feet = as.numeric(Feet), #Convert to Numbers
         Inches = as.numeric(Inches))

That is easy enough, but what if we wanted to do it with a lot of variables at once?

5.10 The mutate_at() function

We can use the mutate_at() function to apply a function to a bunch of variables all at once. This is pretty advanced tidyverse stuff!

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = fct_collapse(Gender, #Tidy up Gender
                               Male = c("m","male","M"),
                               Female = c("f","female","F"))) %>%
  separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
  mutate_at(c("Feet","Inches"),as.numeric) #Convert to Numbers

dat_clean

This is a scoped version of the mutate() function, meaning that it mutates the tibble on a specific set of variables. We told the mutate_at() function that we want to apply the as.numeric() function to Feet and Inches. We passed the names of the variables as strings to mutate_at() and then passed the function without the brackets at the end.
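In dplyr 1.0.0 and later, the scoped functions such as mutate_at() are superseded by across(), which is used inside an ordinary mutate(). The sketch below assumes a reasonably recent dplyr and uses a made-up two-row tibble called toy rather than the course data; it shows that the two spellings give the same result.

```r
library(dplyr)

# Toy data: numbers stored as characters, as they are after separate()
toy <- tibble(Feet = c("5", "6"), Inches = c("6", "0"))

# Scoped version, as used in this lesson
toy_old <- toy %>%
  mutate_at(c("Feet", "Inches"), as.numeric)

# across() version: variables are given unquoted, inside mutate()
toy_new <- toy %>%
  mutate(across(c(Feet, Inches), as.numeric))

toy_new
```

Both versions still work, so there is no need to rewrite the code in this lesson; across() is just the form you are more likely to meet in newer material.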

5.11 The select() function

We can now go back to the formula for converting feet/inches to metres and re-run it with Feet and Inches as numbers:

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = fct_collapse(Gender, #Tidy up Gender
                               Male = c("m","male","M"),
                               Female = c("f","female","F"))) %>%
  separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
  mutate_at(c("Feet","Inches"),as.numeric) %>% #Convert to Numbers
  mutate(Height = 0.3048*Feet + 0.0254*Inches) #Convert into metres

dat_clean

## # A tibble: 150 x 7
##       ID DOB        Gender  Feet Inches Weight Height
##    <dbl> <chr>      <fct>  <dbl>  <dbl> <chr>   <dbl>
##  1     1 22/03/1965 Male       5      6 7 1      1.68
##  2     2 17/04/1977 Female     5      9 11 0     1.75
##  3     3 24/07/1966 Female     5      4 9 8      1.63
##  4     4 22/02/1985 Female     5      7 11 1     1.70
##  5     5 30/07/1985 Male       6      0 10 10    1.83
##  6     6 09/07/1987 Male       6      2 11 12    1.88
##  7     7 23/01/1995 Female     5     10 10 2     1.78
##  8     8 31/12/1992 Male       6      3 12 1     1.91
##  9     9 24/01/1952 Male       5      4 7 10     1.63
## 10    10 11/06/1953 Female     5      2 8 6      1.57
## # … with 140 more rows
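It is worth checking one row by hand. The first person is 5 feet 6 inches tall, so a quick calculation should reproduce the 1.68 in the Height column. The variable names below are just for this check.

```r
# Hand-check of the first row: 5 feet 6 inches
feet <- 5
inches <- 6

# Same formula as in the mutate() call
height_m <- 0.3048 * feet + 0.0254 * inches

# Cross-check: 1 foot is 12 inches, so convert everything to inches first
height_via_inches <- (12 * feet + inches) * 0.0254

height_m
round(height_m, 2)
```

Both routes give 1.6764 m, which the tibble prints as 1.68.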

Almost done. We’ve now got Height in metres, but do we really need the Feet and Inches variables anymore? We can pick which variables we want to keep in our dataset using the select() function:

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = fct_collapse(Gender, #Tidy up Gender
                               Male = c("m","male","M"),
                               Female = c("f","female","F"))) %>%
  separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
  mutate_at(c("Feet","Inches"),as.numeric) %>% #Convert to Numbers
  mutate(Height = 0.3048*Feet + 0.0254*Inches) %>% #Convert into metres
  select(ID,DOB,Gender,Height,Weight) #Pick just what we want to keep

dat_clean

## # A tibble: 150 x 5
##       ID DOB        Gender Height Weight
##    <dbl> <chr>      <fct>   <dbl> <chr> 
##  1     1 22/03/1965 Male     1.68 7 1   
##  2     2 17/04/1977 Female   1.75 11 0  
##  3     3 24/07/1966 Female   1.63 9 8   
##  4     4 22/02/1985 Female   1.70 11 1  
##  5     5 30/07/1985 Male     1.83 10 10 
##  6     6 09/07/1987 Male     1.88 11 12 
##  7     7 23/01/1995 Female   1.78 10 2  
##  8     8 31/12/1992 Male     1.91 12 1  
##  9     9 24/01/1952 Male     1.63 7 10  
## 10    10 11/06/1953 Female   1.57 8 6   
## # … with 140 more rows

Similar to working with vectors, we can also tell R which variables we don’t want. This might actually be easier here.

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = fct_collapse(Gender, #Tidy up Gender
                               Male = c("m","male","M"),
                               Female = c("f","female","F"))) %>%
  separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
  mutate_at(c("Feet","Inches"),as.numeric) %>% #Convert to Numbers
  mutate(Height = 0.3048*Feet + 0.0254*Inches) %>% #Convert into metres
  select(-Feet,-Inches) #Drop what we don't want

dat_clean

## # A tibble: 150 x 5
##       ID DOB        Gender Weight Height
##    <dbl> <chr>      <fct>  <chr>   <dbl>
##  1     1 22/03/1965 Male   7 1      1.68
##  2     2 17/04/1977 Female 11 0     1.75
##  3     3 24/07/1966 Female 9 8      1.63
##  4     4 22/02/1985 Female 11 1     1.70
##  5     5 30/07/1985 Male   10 10    1.83
##  6     6 09/07/1987 Male   11 12    1.88
##  7     7 23/01/1995 Female 10 2     1.78
##  8     8 31/12/1992 Male   12 1     1.91
##  9     9 24/01/1952 Male   7 10     1.63
## 10    10 11/06/1953 Female 8 6      1.57
## # … with 140 more rows
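Notice that the two outputs above have their columns in a different order. A small sketch on a made-up tibble (toy is not the course data) shows why: naming the columns to keep also sets their order, whereas dropping columns leaves the rest in their existing order, with the newly created Height at the end.

```r
library(dplyr)

# Toy tibble with the same column layout as dat_clean before select()
toy <- tibble(ID = 1:2,
              Feet = c(5, 6),
              Inches = c(6, 0),
              Weight = c("7 1", "10 10"),
              Height = c(1.68, 1.83))

# Keeping by name: columns come out in the order we list them
kept <- toy %>% select(ID, Height, Weight)

# Dropping by name: remaining columns stay in their original order
dropped <- toy %>% select(-Feet, -Inches)

names(kept)
names(dropped)
```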

We’ve converted Height from feet and inches into metres. We still need to do the same for Weight, converting it into kg. The conversion is 1 stone = 6.35029 kg and 1 lb = 0.453592 kg (the code below truncates these to four decimal places). Can you replicate what I did here for the Weight variable? Take five minutes and have a think. Some of the functions you need can be combined with the ones we already ran for Height. We also want to create a new variable, BMI, which is calculated as BMI = Weight/Height^2 (Weight in kg, Height in metres), so let’s add this too!

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = fct_collapse(Gender, #Tidy up Gender
                               Male = c("m","male","M"),
                               Female = c("f","female","F"))) %>%
  separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
  separate(Weight,c("Stone","Pounds"),sep=" ") %>% #Split Weight up
  mutate_at(c("Feet","Inches","Stone","Pounds"),as.numeric) %>% #Convert
  mutate(Height = 0.3048*Feet  + 0.0254*Inches,
         Weight = 6.3502*Stone + 0.4535*Pounds,
         BMI = Weight/Height^2) %>% #Convert into metres & kg, add BMI
  select(-Feet,-Inches,-Stone,-Pounds) #Drop what we don't want

dat_clean

## # A tibble: 150 x 6
##       ID DOB        Gender Height Weight   BMI
##    <dbl> <chr>      <fct>   <dbl>  <dbl> <dbl>
##  1     1 22/03/1965 Male     1.68   44.9  16.0
##  2     2 17/04/1977 Female   1.75   69.9  22.7
##  3     3 24/07/1966 Female   1.63   60.8  23.0
##  4     4 22/02/1985 Female   1.70   70.3  24.3
##  5     5 30/07/1985 Male     1.83   68.0  20.3
##  6     6 09/07/1987 Male     1.88   75.3  21.3
##  7     7 23/01/1995 Female   1.78   64.4  20.4
##  8     8 31/12/1992 Male     1.91   76.7  21.1
##  9     9 24/01/1952 Male     1.63   49.0  18.5
## 10    10 11/06/1953 Female   1.57   53.5  21.6
## # … with 140 more rows
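Again, a hand-check of the first row is reassuring. This person weighs 7 stone 1 lb and is 5 feet 6 inches tall. The sketch below also compares the truncated constants used in the code with the fuller ones quoted in the text; the variable names are only for this check.

```r
# Hand-check of the first row: 7 stone 1 lb, 5 feet 6 inches
stone <- 7
pounds <- 1

# Constants as truncated in the lesson code
weight_kg <- 6.3502 * stone + 0.4535 * pounds

# Fuller constants quoted in the text
weight_full <- 6.35029 * stone + 0.453592 * pounds

# Height in metres, then BMI = Weight / Height^2
height_m <- 0.3048 * 5 + 0.0254 * 6
bmi <- weight_kg / height_m^2

round(weight_kg, 1)
round(bmi, 1)
weight_full - weight_kg  # effect of truncating the constants
```

The two weights differ by less than a gram, so the truncation does not change anything at the precision shown in the tibble.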

5.12 Dates

Dates are hard! This is just a fact. Leap days, time zones, Daylight Saving Time and leap seconds all make dates difficult to deal with. We’re going to need a specialised package for dates: lubridate. This package is designed with the tidyverse in mind and is installed along with it, but in tidyverse versions before 2.0.0 it is not loaded by library(tidyverse), so we need to load it ourselves. If R tells you the package is missing, install it from the console with install.packages("lubridate").

As is convention (and keeping our code organised), we’re going to load the package into R using the library() function at the start of our code (rather than in the middle).

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

This little info box just tells us that there is a function called date() in both the base package and the lubridate package. Because we have loaded lubridate, its date() masks the one from base, so typing date() now calls the lubridate version. That is fine, since lubridate has a better version. We can still use the original date() function by writing base::date() (but we don’t need it).

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = fct_collapse(Gender, #Tidy up Gender
                               Male = c("m","male","M"),
                               Female = c("f","female","F"))) %>%
  separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
  separate(Weight,c("Stone","Pounds"),sep=" ") %>% #Split Weight up
  mutate_at(c("Feet","Inches","Stone","Pounds"),as.numeric) %>% #Convert
  mutate(Height = 0.3048*Feet  + 0.0254*Inches,
         Weight = 6.3502*Stone + 0.4535*Pounds,
         BMI = Weight/Height^2) %>% #Convert into metres & kg, add BMI
  select(-Feet,-Inches,-Stone,-Pounds) %>% #Drop what we don't want
  mutate(DOB = dmy(DOB)) # Convert Date

dat_clean

## # A tibble: 150 x 6
##       ID DOB        Gender Height Weight   BMI
##    <dbl> <date>     <fct>   <dbl>  <dbl> <dbl>
##  1     1 1965-03-22 Male     1.68   44.9  16.0
##  2     2 1977-04-17 Female   1.75   69.9  22.7
##  3     3 1966-07-24 Female   1.63   60.8  23.0
##  4     4 1985-02-22 Female   1.70   70.3  24.3
##  5     5 1985-07-30 Male     1.83   68.0  20.3
##  6     6 1987-07-09 Male     1.88   75.3  21.3
##  7     7 1995-01-23 Female   1.78   64.4  20.4
##  8     8 1992-12-31 Male     1.91   76.7  21.1
##  9     9 1952-01-24 Male     1.63   49.0  18.5
## 10    10 1953-06-11 Female   1.57   53.5  21.6
## # … with 140 more rows

The dmy() function takes in dates of the form “Day-Month-Year” and converts them into the Date type. Pretty intuitive. There are similar functions, such as mdy() and ymd(), for different formats; these are all found in the same ?dmy Help file. A Date behaves much like a number, since it is treated as a continuous variable rather than a character, which makes it much easier to work with. For example, we can now sort the dates and find the median DOB. The median() function comes from the stats package that ships with R, and saves us asking the quantile() function for the 50% point:

median(dat_clean$DOB)

## [1] NA

Oh, it’s returned NA, which means that there are NA values in our data. NA is usually used for missing values. Let’s see how many we have. The is.na() function returns TRUE if an element of a vector is NA and FALSE if it isn’t (i.e. it contains an actual value). Since TRUE counts as 1 and FALSE as 0, summing the result counts the NAs:

dat_clean$DOB %>%
  is.na %>%
  sum

## [1] 5

A lot of functions also have an option to ignore NA values, so let’s get the real median():

median(dat_clean$DOB, na.rm=TRUE) #na.rm=TRUE tells median to remove the NAs

## [1] "1978-03-11"
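To see how dmy(), NA and median() fit together, here is a small sketch on a made-up vector. The text does not say where the five NAs in DOB come from; a blank entry and an impossible date are used here purely as assumptions for illustration.

```r
library(lubridate)

# Made-up dates: three valid, one missing, one impossible (31 February)
raw <- c("22/03/1965", "17/04/1977", "24/07/1966", NA, "31/02/1990")

# dmy() warns about entries it cannot parse and returns NA for them
dob <- suppressWarnings(dmy(raw))
dob

# Count the NAs: TRUE counts as 1 when summed
sum(is.na(dob))

# A single NA is enough to make the median NA...
median(dob)

# ...unless we ask for the NAs to be removed first
median(dob, na.rm = TRUE)
```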

5.13 The filter() function

In our DOB variable, we have some NAs and they can really mess up our work. Let’s get rid of those entries using the filter() function. Remember that the ! (not) operator turns TRUE into FALSE and vice versa:

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = fct_collapse(Gender, #Tidy up Gender
                               Male = c("m","male","M"),
                               Female = c("f","female","F"))) %>%
  separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
  separate(Weight,c("Stone","Pounds"),sep=" ") %>% #Split Weight up
  mutate_at(c("Feet","Inches","Stone","Pounds"),as.numeric) %>% #Convert
  mutate(Height = 0.3048*Feet  + 0.0254*Inches,
         Weight = 6.3502*Stone + 0.4535*Pounds,
         BMI = Weight/Height^2) %>% #Convert into metres & kg, add BMI
  select(-Feet,-Inches,-Stone,-Pounds) %>% #Drop what we don't want
  mutate(DOB = dmy(DOB)) %>% # Convert Date
  filter(!is.na(DOB)) #Get rid of the NAs

dat_clean

## # A tibble: 145 x 6
##       ID DOB        Gender Height Weight   BMI
##    <dbl> <date>     <fct>   <dbl>  <dbl> <dbl>
##  1     1 1965-03-22 Male     1.68   44.9  16.0
##  2     2 1977-04-17 Female   1.75   69.9  22.7
##  3     3 1966-07-24 Female   1.63   60.8  23.0
##  4     4 1985-02-22 Female   1.70   70.3  24.3
##  5     5 1985-07-30 Male     1.83   68.0  20.3
##  6     6 1987-07-09 Male     1.88   75.3  21.3
##  7     7 1995-01-23 Female   1.78   64.4  20.4
##  8     8 1992-12-31 Male     1.91   76.7  21.1
##  9     9 1952-01-24 Male     1.63   49.0  18.5
## 10    10 1953-06-11 Female   1.57   53.5  21.6
## # … with 135 more rows

Now we’ve got rid of every row where DOB was NA, leaving 145 of the original 150 rows. This filter only checks DOB, so it is worth confirming that the other variables have no NAs either.
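The same idea on a tiny made-up tibble (toy is not the course data) makes it easier to see what filter() keeps. The tidyr package also offers drop_na(), which does the same job for the named columns; it is shown here only as an alternative, not as something the lesson relies on.

```r
library(dplyr)
library(tidyr)

# Toy data: two of the four rows have a missing DOB
toy <- tibble(ID = 1:4,
              DOB = as.Date(c("1965-03-22", NA, "1966-07-24", NA)))

# is.na() is TRUE for missing values; ! flips it, so we keep the rest
kept_filter <- toy %>% filter(!is.na(DOB))

# drop_na() removes rows with an NA in the columns we name
kept_drop <- toy %>% drop_na(DOB)

kept_filter
```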

5.14 The summarise() function

Now that we have a bunch of tidy data, what can we do with it? Well, firstly, we can pull out some important descriptive statistics. Previously, we have used functions like mean() and sd() on vectors, and we can do that here within our tibble. What happens if we do it in a mutate() call?

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = fct_collapse(Gender, #Tidy up Gender
                               Male = c("m","male","M"),
                               Female = c("f","female","F"))) %>%
  separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
  separate(Weight,c("Stone","Pounds"),sep=" ") %>% #Split Weight up
  mutate_at(c("Feet","Inches","Stone","Pounds"),as.numeric) %>% #Convert
  mutate(Height = 0.3048*Feet  + 0.0254*Inches,
         Weight = 6.3502*Stone + 0.4535*Pounds,
         BMI = Weight/Height^2) %>% #Convert into metres & kg, add BMI
  select(-Feet,-Inches,-Stone,-Pounds) %>% #Drop what we don't want
  mutate(DOB = dmy(DOB)) %>% # Convert Date
  filter(!is.na(DOB)) %>% #Get rid of the NAs
  mutate(mean.BMI = mean(BMI))

dat_clean

## # A tibble: 145 x 7
##       ID DOB        Gender Height Weight   BMI mean.BMI
##    <dbl> <date>     <fct>   <dbl>  <dbl> <dbl>    <dbl>
##  1     1 1965-03-22 Male     1.68   44.9  16.0     20.5
##  2     2 1977-04-17 Female   1.75   69.9  22.7     20.5
##  3     3 1966-07-24 Female   1.63   60.8  23.0     20.5
##  4     4 1985-02-22 Female   1.70   70.3  24.3     20.5
##  5     5 1985-07-30 Male     1.83   68.0  20.3     20.5
##  6     6 1987-07-09 Male     1.88   75.3  21.3     20.5
##  7     7 1995-01-23 Female   1.78   64.4  20.4     20.5
##  8     8 1992-12-31 Male     1.91   76.7  21.1     20.5
##  9     9 1952-01-24 Male     1.63   49.0  18.5     20.5
## 10    10 1953-06-11 Female   1.57   53.5  21.6     20.5
## # … with 135 more rows

We now have the average BMI in every row. Maybe that’s what we wanted. But it’s not what I wanted. Let’s put it in a summarise() call instead. Since we’ve neatened up our dat_clean dataset as much as possible, we’re going to save it and work from there:

dat_clean <- dat %>% 
  mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
  rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
  mutate(Gender = fct_collapse(Gender, #Tidy up Gender
                               Male = c("m","male","M"),
                               Female = c("f","female","F"))) %>%
  separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
  separate(Weight,c("Stone","Pounds"),sep=" ") %>% #Split Weight up
  mutate_at(c("Feet","Inches","Stone","Pounds"),as.numeric) %>% #Convert
  mutate(Height = 0.3048*Feet  + 0.0254*Inches,
         Weight = 6.3502*Stone + 0.4535*Pounds,
         BMI = Weight/Height^2) %>% #Convert into metres & kg, add BMI
  select(-Feet,-Inches,-Stone,-Pounds) %>% #Drop what we don't want
  mutate(DOB = dmy(DOB)) %>% # Convert Date
  filter(!is.na(DOB)) #Get rid of the NAs

dat_clean %>%
    summarise(mean.BMI = mean(BMI))

## # A tibble: 1 x 1
##   mean.BMI
##      <dbl>
## 1     20.5

Well, that’s a bit better, but what if we want more than just the mean of the BMI? Let’s over-complicate this summarise() function:

dat_clean %>%
  summarise_at(c("Height","Weight","BMI"),
               list(mean=mean,sd=sd,min=min,max=max))

## # A tibble: 1 x 12
##   Height_mean Weight_mean BMI_mean Height_sd Weight_sd BMI_sd Height_min
##         <dbl>       <dbl>    <dbl>     <dbl>     <dbl>  <dbl>      <dbl>
## 1        1.70        59.5     20.5    0.0943      14.6   4.24       1.47
## # … with 5 more variables: Weight_min <dbl>, BMI_min <dbl>, Height_max <dbl>,
## #   Weight_max <dbl>, BMI_max <dbl>

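Within the summarise_at() function, we’ve created a list() of functions and told R to apply each of those functions to every variable we gave it. The names in the list (mean, sd, min, max) become the suffixes of the new columns, e.g. Height_mean. This is very similar to the way we used mutate_at() earlier.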
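As with mutate_at(), newer versions of dplyr (1.0.0 and later) offer across() as the replacement for summarise_at(). The sketch below uses a made-up three-row tibble called toy and only two summary functions to keep the output small.

```r
library(dplyr)

# Toy data with made-up values
toy <- tibble(Height = c(1.68, 1.75, 1.63),
              Weight = c(44.9, 69.9, 60.8))

# Apply every function in the list to every selected variable
toy_summary <- toy %>%
  summarise(across(c(Height, Weight),
                   list(mean = mean, sd = sd)))

toy_summary
names(toy_summary)
```

The new columns are again named variable_function, although across() may order them differently from summarise_at().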

5.15 The group_by() function

It’s all very well having a summary of the entire dataset, but what if I want to stratify? I want the same statistics, but calculated separately for each value of Gender.

Easy. Peasy.

dat_clean %>% 
  group_by(Gender) #Group the data

## # A tibble: 145 x 6
## # Groups:   Gender [3]
##       ID DOB        Gender Height Weight   BMI
##    <dbl> <date>     <fct>   <dbl>  <dbl> <dbl>
##  1     1 1965-03-22 Male     1.68   44.9  16.0
##  2     2 1977-04-17 Female   1.75   69.9  22.7
##  3     3 1966-07-24 Female   1.63   60.8  23.0
##  4     4 1985-02-22 Female   1.70   70.3  24.3
##  5     5 1985-07-30 Male     1.83   68.0  20.3
##  6     6 1987-07-09 Male     1.88   75.3  21.3
##  7     7 1995-01-23 Female   1.78   64.4  20.4
##  8     8 1992-12-31 Male     1.91   76.7  21.1
##  9     9 1952-01-24 Male     1.63   49.0  18.5
## 10    10 1953-06-11 Female   1.57   53.5  21.6
## # … with 135 more rows

We can tell R that we want the rows in our tibble to be grouped together based on the value in the Gender variable. Now, the summarise() function will respect this grouping.

dat_clean %>% 
  group_by(Gender) %>% #Group the data
  summarise_at(c("Height","Weight","BMI"),
               list(mean=mean,sd=sd,min=min,max=max))

## # A tibble: 3 x 13
##   Gender Height_mean Weight_mean BMI_mean Height_sd Weight_sd BMI_sd Height_min
##   <fct>        <dbl>       <dbl>    <dbl>     <dbl>     <dbl>  <dbl>      <dbl>
## 1 Female        1.65        55.2     20.3    0.0764     12.8    4.16       1.47
## 2 Male          1.76        64.5     20.8    0.0785     15.2    4.40       1.55
## 3 Non-B…        1.62        54.1     20.6    0.0381      9.40   3.14       1.57
## # … with 5 more variables: Weight_min <dbl>, BMI_min <dbl>, Height_max <dbl>,
## #   Weight_max <dbl>, BMI_max <dbl>
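When summarising by group, it is usually worth reporting how many rows each group contains, since a mean or sd based on a handful of people is much less reliable. A minimal sketch on made-up data, using n() to count rows per group (the .groups argument assumes dplyr 1.0.0 or later):

```r
library(dplyr)

# Toy data with made-up values
toy <- tibble(Gender = factor(c("Male", "Female", "Female",
                                "Male", "Non-Binary")),
              BMI = c(16.0, 22.7, 23.0, 20.3, 21.6))

by_gender <- toy %>%
  group_by(Gender) %>%            # one group per level of Gender
  summarise(n = n(),              # number of rows in each group
            mean_BMI = mean(BMI),
            .groups = "drop")     # return an ungrouped tibble

by_gender
```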

6 Combining data

Quite often in our data analysis, we will need to combine data from multiple sources into a single dataset. For example, we might have taken scoring measurements separately and want to combine them back in with our original anthropometric data above.

6.1 The *_join() functions

We can do that with the family of *_join() functions. These functions take two tibbles (or data.frames), x and y, and join them together based on an identifier for each row. However, sometimes an ID may be in one dataset but not in the other. The *_join() family allows us to decide which entries to keep.

inner_join() will keep only the IDs that appear in both x and y.
full_join() will keep all IDs from both, and wherever an ID is missing from one dataset, it will fill that dataset’s variables with NA.
left_join() and right_join() will keep all of the IDs in the left (x) or the right (y) dataset respectively, regardless of whether they’re in the other or not.
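The differences between the four joins are easiest to see on two tiny made-up tibbles that only partly overlap. Here x has IDs 1 to 3 and y has IDs 2 to 4; the values are invented for illustration.

```r
library(dplyr)

# Two toy datasets whose IDs only partly overlap
x <- tibble(ID = c(1, 2, 3), Height = c(1.68, 1.75, 1.63))
y <- tibble(ID = c(2, 3, 4), Score_1 = c(42.8, 45.1, 59.3))

joined_inner <- inner_join(x, y, by = "ID")  # IDs in both: 2, 3
joined_full  <- full_join(x, y, by = "ID")   # IDs in either: 1, 2, 3, 4
joined_left  <- left_join(x, y, by = "ID")   # all IDs from x: 1, 2, 3
joined_right <- right_join(x, y, by = "ID")  # all IDs from y: 2, 3, 4

joined_full
```

In joined_full, ID 1 has an NA for Score_1 and ID 4 has an NA for Height, because those values do not exist in the other dataset.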

Let’s bring in a new dataset and try it out!

dat_score <- read_csv("https://raw.githubusercontent.com/MyKo101/RJunk/master/data_score.csv")

## Parsed with column specification:
## cols(
##   ID = col_double(),
##   Score_1 = col_double(),
##   Score_2 = col_double(),
##   Score_3 = col_double()
## )

dat_score

## # A tibble: 150 x 4
##       ID Score_1 Score_2 Score_3
##    <dbl>   <dbl>   <dbl>   <dbl>
##  1     1    68.2    81.0    88.8
##  2     2    42.8    56.2    72.1
##  3     3    45.1    56.6    63.7
##  4     4    59.3    64.9    63.5
##  5     5    52.3    49.8    54.4
##  6     6    63.5    68.2    79.9
##  7     7    32.5    40.3    52.3
##  8     8    32.2    42.2    56.5
##  9     9    59.2    62.6    75.7
## 10    10    38.9    41.5    50.8
## # … with 140 more rows

This dataset uses the same ID as our previous one, but don’t forget that we deleted some rows from our dataset (those where DOB was missing), so the two don’t contain exactly the same IDs. This means we’re going to want to use inner_join(). We need to pass the datasets as arguments, along with a vector telling R which variables are our identifiers:

dat_joined <- inner_join(dat_clean,dat_score,by="ID")
dat_joined

## # A tibble: 145 x 9
##       ID DOB        Gender Height Weight   BMI Score_1 Score_2 Score_3
##    <dbl> <date>     <fct>   <dbl>  <dbl> <dbl>   <dbl>   <dbl>   <dbl>
##  1     1 1965-03-22 Male     1.68   44.9  16.0    68.2    81.0    88.8
##  2     2 1977-04-17 Female   1.75   69.9  22.7    42.8    56.2    72.1
##  3     3 1966-07-24 Female   1.63   60.8  23.0    45.1    56.6    63.7
##  4     4 1985-02-22 Female   1.70   70.3  24.3    59.3    64.9    63.5
##  5     5 1985-07-30 Male     1.83   68.0  20.3    52.3    49.8    54.4
##  6     6 1987-07-09 Male     1.88   75.3  21.3    63.5    68.2    79.9
##  7     7 1995-01-23 Female   1.78   64.4  20.4    32.5    40.3    52.3
##  8     8 1992-12-31 Male     1.91   76.7  21.1    32.2    42.2    56.5
##  9     9 1952-01-24 Male     1.63   49.0  18.5    59.2    62.6    75.7
## 10    10 1953-06-11 Female   1.57   53.5  21.6    38.9    41.5    50.8
## # … with 135 more rows

6.2 The gather() & spread() functions

So we have the scores in what is known as a wide format: the table has a lot of columns and each score (1-3) has its own column. We can change this into a long format, where each ID x Score combination has its own row:

dat_joined %>%
  gather(Score_1,Score_2,Score_3,key="Score_Num",value="Score")

## # A tibble: 435 x 8
##       ID DOB        Gender Height Weight   BMI Score_Num Score
##    <dbl> <date>     <fct>   <dbl>  <dbl> <dbl> <chr>     <dbl>
##  1     1 1965-03-22 Male     1.68   44.9  16.0 Score_1    68.2
##  2     2 1977-04-17 Female   1.75   69.9  22.7 Score_1    42.8
##  3     3 1966-07-24 Female   1.63   60.8  23.0 Score_1    45.1
##  4     4 1985-02-22 Female   1.70   70.3  24.3 Score_1    59.3
##  5     5 1985-07-30 Male     1.83   68.0  20.3 Score_1    52.3
##  6     6 1987-07-09 Male     1.88   75.3  21.3 Score_1    63.5
##  7     7 1995-01-23 Female   1.78   64.4  20.4 Score_1    32.5
##  8     8 1992-12-31 Male     1.91   76.7  21.1 Score_1    32.2
##  9     9 1952-01-24 Male     1.63   49.0  18.5 Score_1    59.2
## 10    10 1953-06-11 Female   1.57   53.5  21.6 Score_1    38.9
## # … with 425 more rows

The effects of gather() can be undone with the spread() function:

dat_joined %>%
  gather(Score_1,Score_2,Score_3,key="Score_Num",value="Score") %>%
  spread(key="Score_Num",value="Score")

## # A tibble: 145 x 9
##       ID DOB        Gender Height Weight   BMI Score_1 Score_2 Score_3
##    <dbl> <date>     <fct>   <dbl>  <dbl> <dbl>   <dbl>   <dbl>   <dbl>
##  1     1 1965-03-22 Male     1.68   44.9  16.0    68.2    81.0    88.8
##  2     2 1977-04-17 Female   1.75   69.9  22.7    42.8    56.2    72.1
##  3     3 1966-07-24 Female   1.63   60.8  23.0    45.1    56.6    63.7
##  4     4 1985-02-22 Female   1.70   70.3  24.3    59.3    64.9    63.5
##  5     5 1985-07-30 Male     1.83   68.0  20.3    52.3    49.8    54.4
##  6     6 1987-07-09 Male     1.88   75.3  21.3    63.5    68.2    79.9
##  7     7 1995-01-23 Female   1.78   64.4  20.4    32.5    40.3    52.3
##  8     8 1992-12-31 Male     1.91   76.7  21.1    32.2    42.2    56.5
##  9     9 1952-01-24 Male     1.63   49.0  18.5    59.2    62.6    75.7
## 10    10 1953-06-11 Female   1.57   53.5  21.6    38.9    41.5    50.8
## # … with 135 more rows

These two functions work really well together depending on how we want our data to look. How we want our data to look will usually depend on what analysis we are doing.
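In tidyr 1.0.0 and later, gather() and spread() still work but have newer counterparts called pivot_longer() and pivot_wider(). The sketch below assumes a recent tidyr and uses a made-up two-row tibble (toy) instead of dat_joined.

```r
library(dplyr)
library(tidyr)

# Toy wide data: one row per ID, one column per score
toy <- tibble(ID = c(1, 2),
              Score_1 = c(68.2, 42.8),
              Score_2 = c(81.0, 56.2),
              Score_3 = c(88.8, 72.1))

# Wide to long: one row per ID x Score combination
toy_long <- toy %>%
  pivot_longer(cols = starts_with("Score_"),
               names_to = "Score_Num",
               values_to = "Score")

# Long back to wide: undoes the step above
toy_wide <- toy_long %>%
  pivot_wider(names_from = Score_Num, values_from = Score)

toy_long
```

One visible difference is row order: pivot_longer() keeps all of one ID’s scores together, whereas gather() lists every Score_1 first, then every Score_2, and so on.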

7 The Cheat Sheets

We’ve introduced a bunch of new functions throughout this lesson and mentioned quite a few different packages. The Reference manuals and vignettes from a package can be incredibly useful for a thorough description of how the package works. But, a more intuitive resource is the RStudio Cheatsheets. These are much more visual and are really useful as a lookup if you can’t quite remember what function you need to use (rather than trawling through the Reference manual). Not all Cheatsheets are on the RStudio page, but remember, Google is your friend and searching specifically for Cheatsheets is easy! For example: “R forcats Cheatsheet”

There is also a shortcut to some of these Cheatsheets within RStudio, in the menu bar at the top of the window: Help > Cheatsheets.

Let’s take a look at the dplyr Cheatsheet! We can find this in the Help shortcut in RStudio.

Tidyverse for Tidy Data

Statistics in R - Lesson Two

Michael A Barrowman