This example demonstrates the main functions of the dplyr package. We will be using publicly available data on the participants in the Great British Bake Off 2018.

First, we need to load dplyr into R:

library(dplyr)

1. Create your own dataset from scratch

Here we are creating five variables (vectors in R):

names <- c("Antony", "Briony", "Dan","Imelda", "Jon", "Karen","Kim-Joy","Luke","Manon", "Rahul", "Ruby" ,"Terry")

sex <- c("male", "female","male", "female", "male", "female", "female","male", "female", "male", "female", "male")

hometown <- c("London", "Bristol", "London", "County Tyrone", "Newport","Wakefield", "Leeds", "Sheffield", "London", "Rotherham","London", "West Midlands")

occupation <- c("Banker", "Full-time parent", "Full-time parent", "Countryside recreation officer", "Blood courier", "In-store sampling assistant", "Mental health specialist","Civil servant/house and techno DJ", "Software project manager", "Research scientist", "Project manager", "Retired air steward")

age <- c(30, 33, 36, 33, 47, 60, 27, 30, 26, 30, 30, 56)


And now we’ll create two dataframes using the above variables

gbbo1<-data.frame(names, age, sex, occupation)
gbbo2<-data.frame(names, hometown)
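Before moving on, it can help to check that a new dataframe has the shape you expect. The following is a minimal sketch using a three-row subset of the data so that it runs on its own; the object name gbbo_check is hypothetical and not part of the original example.

```r
library(dplyr)

# Re-create a three-row subset of gbbo1 so this snippet is self-contained
names <- c("Antony", "Briony", "Dan")
age   <- c(30, 33, 36)
sex   <- c("male", "female", "male")

gbbo_check <- data.frame(names, age, sex)  # hypothetical object name

dim(gbbo_check)       # number of rows, then number of columns
colnames(gbbo_check)  # the variable names
glimpse(gbbo_check)   # dplyr's compact overview: one line per variable
```

The same three calls work on gbbo1 and gbbo2 as created above.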

2. Frequently used “dplyr” functions

These functions are some of the most frequently used, but they are not the only ones. “dplyr” is part of the “tidyverse”, which is a large collection of packages. “dplyr” is used for data manipulation, a topic that could fill a whole course in its own right. If you’re interested in this topic and want to learn more, you can have a look at these slides and the accompanying practicals.


Now let’s see functions in action:

2.1. select()


Select the variables names, age and sex


select(gbbo1, names, age, sex)

You can store the selected variables into a new dataframe

gbbo3<-select(gbbo1, names, age, sex)
View(gbbo3) # This allows you to see the dataset in a different tab
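Besides listing the variables to keep, select() can also drop variables or pick them by position or by name pattern. This is a small sketch on a three-row subset of gbbo1 (re-created here so the snippet runs on its own); the result names are hypothetical.

```r
library(dplyr)

# Three-row subset of gbbo1, re-created so the snippet is self-contained
gbbo1 <- data.frame(
  names      = c("Antony", "Briony", "Dan"),
  age        = c(30, 33, 36),
  sex        = c("male", "female", "male"),
  occupation = c("Banker", "Full-time parent", "Full-time parent")
)

no_occupation <- select(gbbo1, -occupation)        # drop one variable, keep the rest
names_to_sex  <- select(gbbo1, names:sex)          # a range of adjacent variables
starts_with_s <- select(gbbo1, starts_with("s"))   # variables whose name starts with "s"

colnames(no_occupation)
colnames(names_to_sex)
colnames(starts_with_s)
```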

2.2. filter()


Now filter (i.e. keep only) the subsample of GBBO participants younger than 30. The first call prints the result; the second stores the filtered rows of gbbo1 in a new dataframe

filter(gbbo3, age<30)
##     names age    sex
## 1 Kim-Joy  27 female
## 2   Manon  26 female
gbbo4<-filter(gbbo1, age<30)
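filter() also accepts several conditions at once. The sketch below re-creates the names, age and sex columns so that it runs on its own; the result names young_women and youngest_or_oldest are hypothetical.

```r
library(dplyr)

# names, age and sex re-created from the data above
gbbo1 <- data.frame(
  names = c("Antony", "Briony", "Dan", "Imelda", "Jon", "Karen",
            "Kim-Joy", "Luke", "Manon", "Rahul", "Ruby", "Terry"),
  age   = c(30, 33, 36, 33, 47, 60, 27, 30, 26, 30, 30, 56),
  sex   = c("male", "female", "male", "female", "male", "female",
            "female", "male", "female", "male", "female", "male")
)

# Conditions separated by commas must all hold (logical AND)
young_women <- filter(gbbo1, age < 35, sex == "female")

# Use | when either condition is enough (logical OR)
youngest_or_oldest <- filter(gbbo1, age < 27 | age > 55)

young_women
youngest_or_oldest
```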

2.3. left_join()


Earlier, we created two dataframes: one containing most of the data (gbbo1) and another containing only names and hometown (gbbo2). This recreates a typical situation in research: your dataset does not contain all the variables you’re interested in, but you have (or find) another dataset with the additional variables, and you can link the two through an ID variable. Here, names plays the role of the ID variable.

left_join(gbbo1, gbbo2, by = "names")
##      names age    sex                        occupation      hometown
## 1   Antony  30   male                            Banker        London
## 2   Briony  33 female                  Full-time parent       Bristol
## 3      Dan  36   male                  Full-time parent        London
## 4   Imelda  33 female    Countryside recreation officer County Tyrone
## 5      Jon  47   male                     Blood courier       Newport
## 6    Karen  60 female       In-store sampling assistant     Wakefield
## 7  Kim-Joy  27 female          Mental health specialist         Leeds
## 8     Luke  30   male Civil servant/house and techno DJ     Sheffield
## 9    Manon  26 female          Software project manager        London
## 10   Rahul  30   male                Research scientist     Rotherham
## 11    Ruby  30 female                   Project manager        London
## 12   Terry  56   male               Retired air steward West Midlands

Note 1: In this example we have been saving all the new datasets that we create. This is done only to show you how the data change after running each function. If you are working with big datasets, this is often not a good idea, since your R environment (workspace) will fill up with datasets that are no longer being used.

gbbo_all<- left_join(gbbo1, gbbo2, by = "names")


Note 2: Whenever you are doing a join operation remember to always be dexterous and deft… and never confuse your right-hand side with your left (see footnote 1).
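In the example above both dataframes contain exactly the same names, so the direction of the join makes no difference. The sketch below uses two small, made-up dataframes (people and towns are hypothetical names, and their contents are an illustrative subset) whose names only partly overlap, to show why left and right matter.

```r
library(dplyr)

# Two illustrative dataframes whose "names" only partly overlap
people <- data.frame(names = c("Antony", "Briony", "Dan"),
                     age   = c(30, 33, 36))
towns  <- data.frame(names    = c("Briony", "Dan", "Ruby"),
                     hometown = c("Bristol", "London", "London"))

# Keeps every row of the left dataframe (people); Antony has no hometown, so it is NA
joined_left  <- left_join(people, towns, by = "names")

# Keeps every row of the right dataframe (towns); Ruby has no age, so it is NA
joined_right <- right_join(people, towns, by = "names")

# Keeps only the rows present in both dataframes
joined_inner <- inner_join(people, towns, by = "names")

joined_left
joined_right
joined_inner
```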


2.4. rename()


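Use the rename() function to change variable names; the pattern is new name = old name. Here we are changing the variable “hometown” to “city”. Note that the result is only printed, not stored, so gbbo_all itself still contains “hometown”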

rename(gbbo_all, "city"="hometown")
##      names age    sex                        occupation          city
## 1   Antony  30   male                            Banker        London
## 2   Briony  33 female                  Full-time parent       Bristol
## 3      Dan  36   male                  Full-time parent        London
## 4   Imelda  33 female    Countryside recreation officer County Tyrone
## 5      Jon  47   male                     Blood courier       Newport
## 6    Karen  60 female       In-store sampling assistant     Wakefield
## 7  Kim-Joy  27 female          Mental health specialist         Leeds
## 8     Luke  30   male Civil servant/house and techno DJ     Sheffield
## 9    Manon  26 female          Software project manager        London
## 10   Rahul  30   male                Research scientist     Rotherham
## 11    Ruby  30 female                   Project manager        London
## 12   Terry  56   male               Retired air steward West Midlands
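If you want to keep the renamed variable, you need to assign the result to an object. A minimal sketch, assuming a two-row stand-in for gbbo_all (gbbo_small and gbbo_city are hypothetical names); it also uses the unquoted form of the names, which rename() accepts as well.

```r
library(dplyr)

# Two-row stand-in for gbbo_all so the snippet runs on its own
gbbo_small <- data.frame(names    = c("Antony", "Briony"),
                         hometown = c("London", "Bristol"))

# new_name = old_name; quotes around the names are optional
gbbo_city <- rename(gbbo_small, city = hometown)

colnames(gbbo_small)  # the original dataframe is unchanged
colnames(gbbo_city)   # the stored copy has the new name
```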

2.5. summarise()


Here we will get some descriptive statistics using the summarise function from dplyr. The statistics used below (mean, standard deviation and median) only make sense for numeric variables.

summarise(gbbo_all, mean(age))
##   mean(age)
## 1      36.5


summarise is very handy since it also allows us to name the summarised variable. You just need to put a name and an equals sign before the requested statistic

summarise(gbbo_all, mean_age=mean(age))
##   mean_age
## 1     36.5


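You can request more than one statistic in a single call. Any statistic you do not name (here the standard deviation and the median) is labelled with the expression that produced it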

summarise(gbbo_all, mean_age=mean(age),
                    sd(age),
                    median(age))
##   mean_age  sd(age) median(age)
## 1     36.5 11.42963        31.5
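Naming every statistic gives tidier column names, and n() adds the number of rows to the same summary. A sketch using only the age variable from above (gbbo_ages and age_summary are hypothetical names):

```r
library(dplyr)

# The age variable from above, re-created so the snippet runs on its own
gbbo_ages <- data.frame(age = c(30, 33, 36, 33, 47, 60, 27, 30, 26, 30, 30, 56))

age_summary <- summarise(gbbo_ages,
                         n          = n(),          # number of rows
                         mean_age   = mean(age),
                         sd_age     = sd(age),
                         median_age = median(age),
                         min_age    = min(age),
                         max_age    = max(age))

age_summary
```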

2.6. group_by()


This function marks the rows of a dataset as belonging to groups defined by the values of a variable

group_by(gbbo_all, sex)
## # A tibble: 12 x 5
## # Groups:   sex [2]
##    names     age sex    occupation                        hometown     
##    <fct>   <dbl> <fct>  <fct>                             <fct>        
##  1 Antony     30 male   Banker                            London       
##  2 Briony     33 female Full-time parent                  Bristol      
##  3 Dan        36 male   Full-time parent                  London       
##  4 Imelda     33 female Countryside recreation officer    County Tyrone
##  5 Jon        47 male   Blood courier                     Newport      
##  6 Karen      60 female In-store sampling assistant       Wakefield    
##  7 Kim-Joy    27 female Mental health specialist          Leeds        
##  8 Luke       30 male   Civil servant/house and techno DJ Sheffield    
##  9 Manon      26 female Software project manager          London       
## 10 Rahul      30 male   Research scientist                Rotherham    
## 11 Ruby       30 female Project manager                   London       
## 12 Terry      56 male   Retired air steward               West Midlands


As you can see, group_by() does not seem to do anything: the data are unchanged, apart from the “Groups: sex [2]” line in the header. This is because it works in combination with other functions, for instance summarise().
Let’s save the grouped data as a new dataset

bysex<-group_by(gbbo_all, sex) 


And now, let’s use this new dataset to get an indicator of the average age by sex of the participants.

summarise(bysex, age_mean=mean(age))
## # A tibble: 2 x 2
##   sex    age_mean
##   <fct>     <dbl>
## 1 female     34.8
## 2 male       38.2


We have created a new aggregated variable age_mean that takes the mean of the variable age according to sex.
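A grouped summary can hold several statistics per group, just like an ungrouped one. The sketch below re-creates the age and sex variables so that it runs on its own; gbbo_small and sex_summary are hypothetical names, and group_vars() is used only to show which variable the data are grouped by.

```r
library(dplyr)

# age and sex re-created from the data above
gbbo_small <- data.frame(
  age = c(30, 33, 36, 33, 47, 60, 27, 30, 26, 30, 30, 56),
  sex = c("male", "female", "male", "female", "male", "female",
          "female", "male", "female", "male", "female", "male")
)

bysex <- group_by(gbbo_small, sex)
group_vars(bysex)  # confirms the grouping variable

# One row per group, several statistics per row
sex_summary <- summarise(bysex,
                         n        = n(),        # group size
                         age_mean = mean(age),
                         age_max  = max(age))

sex_summary
```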

3. Piping



When you search for examples using dplyr on the web, you are very likely to encounter this symbol %>% called “pipe”. We are not covering this in full detail, but below is an example of what it is and how to use it.


Pipes are meant to make code easier to write and read. A pipe passes the result on its left to the function on its right as the first argument, so the code reads as a sequence of steps in the order they are carried out. This is the last code we used, now rewritten with pipes


gbbo_all %>%                      # Take the data 
  group_by(sex) %>%               # Now group it by sex
  summarise(age_mean=mean(age))   # This is to create our aggregated variable, all in one go!
## # A tibble: 2 x 2
##   sex    age_mean
##   <fct>     <dbl>
## 1 female     34.8
## 2 male       38.2


If you are going to use a summary like the one above many times during your analysis, you can save it as a new object. This saves you from running the code over and over again. To do this, simply give it a name and use the assignment arrow “<-”.


summary_gbbo <- gbbo_all %>%
  group_by(sex) %>%   
  summarise(age_mean=mean(age))

summary_gbbo # this is to print the results on the console
## # A tibble: 2 x 2
##   sex    age_mean
##   <fct>     <dbl>
## 1 female     34.8
## 2 male       38.2
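To see what the pipe saves you, compare a three-step operation written as nested function calls with the same operation written with pipes. This is a sketch on the age and sex variables, re-created here so that it runs on its own; the extra filter() step and the object names are illustrative assumptions.

```r
library(dplyr)

# age and sex re-created from the data above
gbbo_small <- data.frame(
  age = c(30, 33, 36, 33, 47, 60, 27, 30, 26, 30, 30, 56),
  sex = c("male", "female", "male", "female", "male", "female",
          "female", "male", "female", "male", "female", "male")
)

# Nested calls: has to be read from the inside out
nested <- summarise(group_by(filter(gbbo_small, age >= 30), sex),
                    age_mean = mean(age))

# Pipes: reads top to bottom, one step per line
piped <- gbbo_small %>%
  filter(age >= 30) %>%            # keep participants aged 30 or over
  group_by(sex) %>%                # group them by sex
  summarise(age_mean = mean(age))  # mean age within each group

all.equal(nested, piped)  # checks that both versions give the same result
```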

4. Plotting


To plot we can use base R graphics, but much nicer plots can be produced with the package “ggplot2”. Let’s load it (you may need to install it first with install.packages("ggplot2") if it is not available)

library(ggplot2)

Now do a simple bar chart with the ages of GBBO participants:

ggplot(gbbo_all, aes(age)) + geom_bar()


As you can see the bar chart above isn’t terribly informative, so you can do an alternative plot with summary statistics. You can embed the code with the pipes from before and then run the plot command, as such:

ggplot(gbbo_all %>%                           
  group_by(sex) %>%
  summarise(age_mean=mean(age)), 
  aes(y=age_mean, x=sex)) + 
  geom_bar(stat = "identity") # we need to specify stat="identity" for the value itself to be plotted


Perhaps a much simpler option would have been to save the command with pipes as a new dataframe, as such:

mean_by_sex<-gbbo_all %>%                           
  group_by(sex) %>%
  summarise(age_mean=mean(age))


And then run the plot with the saved summary dataframe:


ggplot(mean_by_sex, aes(y=age_mean, x=sex)) + geom_bar(stat = "identity")
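The same chart can be written with geom_col(), which plots the values as they are and so does not need stat = "identity", and labs() can be used to add readable axis labels and a title. This is a sketch that re-creates the summary so it runs on its own; the label texts and the object name p are illustrative choices.

```r
library(dplyr)
library(ggplot2)

# Re-create the summary dataframe from the age and sex variables above
mean_by_sex <- data.frame(
  age = c(30, 33, 36, 33, 47, 60, 27, 30, 26, 30, 30, 56),
  sex = c("male", "female", "male", "female", "male", "female",
          "female", "male", "female", "male", "female", "male")
) %>%
  group_by(sex) %>%
  summarise(age_mean = mean(age))

# geom_col() draws bars whose heights are the values in the data
p <- ggplot(mean_by_sex, aes(x = sex, y = age_mean)) +
  geom_col() +
  labs(x = "Sex",
       y = "Mean age (years)",
       title = "Mean age of GBBO 2018 participants by sex")

p  # printing the object draws the plot
```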



Final comments

There are many online resources to learn R. Here are some you can use:

http://www.cookbook-r.com/
https://www.statmethods.net/
https://www.datacamp.com/
https://r4ds.had.co.nz/
https://rstudio.com/resources/webinars/
Click this if you like R and giraffes… and tea


Remember that you never finish learning R. Sometimes R users feel like all we do is get better at googling our questions.



  1. Does this ring a bell? It’s inspired by Dr. Seuss’ “Oh, the places you’ll go!”


 

Patricio Troncoso

patricio.troncoso@manchester.ac.uk