This example demonstrates the main functions of the dplyr package. We will use publicly available data on the participants of the Great British Bake Off 2018.
First, we need to load dplyr into R:
library(dplyr)
Here we create stand-alone variables (vectors in R):
names <- c("Antony", "Briony", "Dan", "Imelda", "Jon", "Karen", "Kim-Joy", "Luke", "Manon", "Rahul", "Ruby", "Terry")
sex <- c("male", "female", "male", "female", "male", "female", "female", "male", "female", "male", "female", "male")
hometown <- c("London", "Bristol", "London", "County Tyrone", "Newport", "Wakefield", "Leeds", "Sheffield", "London", "Rotherham", "London", "West Midlands")
occupation <- c("Banker", "Full-time parent", "Full-time parent", "Countryside recreation officer", "Blood courier", "In-store sampling assistant", "Mental health specialist", "Civil servant/house and techno DJ", "Software project manager", "Research scientist", "Project manager", "Retired air steward")
age <- c(30, 33, 36, 33, 47, 60, 27, 30, 26, 30, 30, 56)
And now we’ll create two dataframes using the above variables:
gbbo1 <- data.frame(names, age, sex, occupation)
gbbo2 <- data.frame(names, hometown)
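Before manipulating the data, it is worth checking that a dataframe looks the way you expect. Below is a minimal sketch using base R's str(), nrow() and ncol() on a shortened copy of the data. The object name gbbo_small is only used here for illustration.

```r
# A shortened, illustrative copy of the data (first three participants only)
names <- c("Antony", "Briony", "Dan")
age <- c(30, 33, 36)
sex <- c("male", "female", "male")
gbbo_small <- data.frame(names, age, sex)

str(gbbo_small)   # variable names, types and first values
nrow(gbbo_small)  # number of rows (participants)
ncol(gbbo_small)  # number of columns (variables)
```

Depending on your R version, text variables are stored as factors (before R 4.0) or as character strings (R 4.0 onwards). The outputs in this example show <fct>, which means they were produced with factors.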
The functions shown below are some of the most frequently used in “dplyr”, but they are not the only ones. “dplyr” is part of the “tidyverse”, which is a large collection of packages. “dplyr” is used for data manipulation, a topic that could fill a whole course in its own right. If you’re interested in this topic and want to learn more, you can have a look at these slides and the accompanying practicals.
Now let’s see these functions in action:
Select the variables names, age and sex
select(gbbo1, names, age, sex)
You can store the selected variables in a new dataframe
gbbo3 <- select(gbbo1, names, age, sex)
View(gbbo3) # This allows you to see the dataset in a different tab
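select() can also drop variables or pick a range of adjacent ones. The sketch below assumes a three-row subset of gbbo1 (called gbbo1_small here purely for illustration) and shows two common shortcuts: a minus sign to exclude a variable, and a colon to select everything between two variables.

```r
library(dplyr)

# Illustrative subset of gbbo1 (three participants)
gbbo1_small <- data.frame(
  names = c("Antony", "Briony", "Dan"),
  age = c(30, 33, 36),
  sex = c("male", "female", "male"),
  occupation = c("Banker", "Full-time parent", "Full-time parent")
)

# Drop one variable and keep the rest
no_occupation <- select(gbbo1_small, -occupation)

# Keep a range of adjacent variables, from names to sex
names_to_sex <- select(gbbo1_small, names:sex)

colnames(no_occupation)
colnames(names_to_sex)
```

Both calls return the same three variables as the select() call above, so you can use whichever is shorter to write for your data.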
Now filter the data to keep only the GBBO participants younger than 30, and store them in a new dataframe. Note that select() picks variables (columns), whereas filter() picks cases (rows).
filter(gbbo3, age<30)
## names age sex
## 1 Kim-Joy 27 female
## 2 Manon 26 female
gbbo4<-filter(gbbo1, age<30)
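filter() accepts more than one condition. As a minimal sketch, assuming a four-row subset of the data (gbbo_small, young_women and under27_or_male are illustrative names), conditions separated by commas must all be true, while the | operator means "or".

```r
library(dplyr)

# Illustrative subset of the data (four participants)
gbbo_small <- data.frame(
  names = c("Antony", "Briony", "Kim-Joy", "Manon"),
  age = c(30, 33, 27, 26),
  sex = c("male", "female", "female", "female")
)

# Conditions separated by commas must all be true (logical AND)
young_women <- filter(gbbo_small, age < 30, sex == "female")

# Use | for a logical OR
under27_or_male <- filter(gbbo_small, age < 27 | sex == "male")

young_women
under27_or_male
```

Here young_women contains Kim-Joy and Manon, and under27_or_male contains Antony and Manon.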
Earlier, we created two dataframes: one containing most of the data (gbbo1) and another containing only names and hometown (gbbo2). This recreates a typical situation in research, where you find that your dataset does not contain all the variables you’re interested in, but you have (or find) another dataset with additional variables that you can link to your existing dataset by using an ID variable. Here the ID variable is names, and left_join keeps every row of the first dataset and adds the matching columns from the second.
left_join(gbbo1, gbbo2, by = "names")
## names age sex occupation hometown
## 1 Antony 30 male Banker London
## 2 Briony 33 female Full-time parent Bristol
## 3 Dan 36 male Full-time parent London
## 4 Imelda 33 female Countryside recreation officer County Tyrone
## 5 Jon 47 male Blood courier Newport
## 6 Karen 60 female In-store sampling assistant Wakefield
## 7 Kim-Joy 27 female Mental health specialist Leeds
## 8 Luke 30 male Civil servant/house and techno DJ Sheffield
## 9 Manon 26 female Software project manager London
## 10 Rahul 30 male Research scientist Rotherham
## 11 Ruby 30 female Project manager London
## 12 Terry 56 male Retired air steward West Midlands
Note 1: In this example we have been saving all the new datasets that we create. This is done only to show you how the data change after running each function. If you are working with big datasets, this is often not a good idea, since your R environment will fill up with datasets that are no longer being used and that take up memory.
gbbo_all<- left_join(gbbo1, gbbo2, by = "names")
Note 2: Whenever you are doing a join operation remember to always be dexterous and deft… and never confuse your right-hand side with your left.[1]
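The difference between left and right only shows when the two datasets do not match perfectly. The sketch below uses two small made-up tables (people and towns, where "Guest" and "Nowhere" are invented values for illustration) to show which rows are kept by left_join and right_join.

```r
library(dplyr)

# Illustrative tables: Dan has no hometown recorded, and
# "Guest" is a made-up name that only appears in the second table
people <- data.frame(names = c("Antony", "Briony", "Dan"),
                     age = c(30, 33, 36))
towns <- data.frame(names = c("Antony", "Briony", "Guest"),
                    hometown = c("London", "Bristol", "Nowhere"))

left  <- left_join(people, towns, by = "names")   # keeps all rows of people
right <- right_join(people, towns, by = "names")  # keeps all rows of towns

left
right
```

In left, Dan is kept with a missing (NA) hometown and Guest is dropped. In right, Guest is kept with a missing age and Dan is dropped. In the GBBO data every name appears in both dataframes, so both joins would return the same twelve participants.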
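Use the rename function to change variable names. Here we change the variable “hometown” to “city”; the new name goes on the left of the equals sign and the old name on the right. Note that the result is only printed, not saved, so gbbo_all itself keeps the name “hometown”.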
rename(gbbo_all, "city"="hometown")
## names age sex occupation city
## 1 Antony 30 male Banker London
## 2 Briony 33 female Full-time parent Bristol
## 3 Dan 36 male Full-time parent London
## 4 Imelda 33 female Countryside recreation officer County Tyrone
## 5 Jon 47 male Blood courier Newport
## 6 Karen 60 female In-store sampling assistant Wakefield
## 7 Kim-Joy 27 female Mental health specialist Leeds
## 8 Luke 30 male Civil servant/house and techno DJ Sheffield
## 9 Manon 26 female Software project manager London
## 10 Rahul 30 male Research scientist Rotherham
## 11 Ruby 30 female Project manager London
## 12 Terry 56 male Retired air steward West Midlands
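If you want to keep the new variable name, the result of rename has to be assigned to an object. Here is a minimal sketch, assuming a two-row illustrative dataframe (gbbo_small) and a new object name (gbbo_city) chosen for this example:

```r
library(dplyr)

# Illustrative two-row dataframe
gbbo_small <- data.frame(names = c("Antony", "Briony"),
                         hometown = c("London", "Bristol"))

# new_name = old_name; the quotes around the names are optional
gbbo_city <- rename(gbbo_small, city = hometown)

colnames(gbbo_small)  # the original still has "hometown"
colnames(gbbo_city)   # the new dataframe has "city"
```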
Here we will get some descriptive statistics using the summarise function from dplyr. summarise collapses the data into a single row of results. The statistics used here, such as the mean, only make sense for numeric variables.
summarise(gbbo_all, mean(age))
## mean(age)
## 1 36.5
summarise is very handy since it also allows us to name the summarised variable. You just need to specify a name, followed by an equals sign, before the requested statistic
summarise(gbbo_all, mean_age=mean(age))
## mean_age
## 1 36.5
You can request more than one statistic in a single call
summarise(gbbo_all, mean_age=mean(age),
sd(age),
median(age))
## mean_age sd(age) median(age)
## 1 36.5 11.42963 31.5
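Unnamed statistics get the code itself as a column name, such as sd(age), which is awkward to refer to later. A small sketch of the same call with every statistic named is shown below. It uses only the age values from above, and the column names sd_age and median_age are our own choice.

```r
library(dplyr)

# The ages of the twelve participants, as defined earlier
ages <- data.frame(age = c(30, 33, 36, 33, 47, 60, 27, 30, 26, 30, 30, 56))

# Give every requested statistic its own column name
age_stats <- summarise(ages,
                       mean_age = mean(age),
                       sd_age = sd(age),
                       median_age = median(age))
age_stats
```

The values are the same as before (36.5, 11.42963 and 31.5), but the columns can now be accessed easily, for instance with age_stats$sd_age.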
The group_by function groups the data according to the values of a variable, so that later operations are carried out separately for each group
group_by(gbbo_all, sex)
## # A tibble: 12 x 5
## # Groups: sex [2]
## names age sex occupation hometown
## <fct> <dbl> <fct> <fct> <fct>
## 1 Antony 30 male Banker London
## 2 Briony 33 female Full-time parent Bristol
## 3 Dan 36 male Full-time parent London
## 4 Imelda 33 female Countryside recreation officer County Tyrone
## 5 Jon 47 male Blood courier Newport
## 6 Karen 60 female In-store sampling assistant Wakefield
## 7 Kim-Joy 27 female Mental health specialist Leeds
## 8 Luke 30 male Civil servant/house and techno DJ Sheffield
## 9 Manon 26 female Software project manager London
## 10 Rahul 30 male Research scientist Rotherham
## 11 Ruby 30 female Project manager London
## 12 Terry 56 male Retired air steward West Midlands
As you can see, group_by does not seem to do anything: the data look the same, apart from the line “Groups: sex [2]” in the output. This is because it works in combination with other functions, for instance summarise.
Let’s save the grouped data as a new dataset
bysex<-group_by(gbbo_all, sex)
And now, let’s use this new dataset to get an indicator of the average age by sex of the participants.
summarise(bysex, age_mean=mean(age))
## # A tibble: 2 x 2
## sex age_mean
## <fct> <dbl>
## 1 female 34.8
## 2 male 38.2
We have created a new aggregated variable age_mean that takes the mean of the variable age according to sex.
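A grouped summary can hold several statistics at once, including the number of cases in each group, which dplyr provides through n(). This is a minimal sketch using only the sex and age values from above. The names gbbo_small, by_sex, age_max and the column n are illustrative.

```r
library(dplyr)

# Sex and age of the twelve participants, as defined earlier
gbbo_small <- data.frame(
  sex = c("male", "female", "male", "female", "male", "female",
          "female", "male", "female", "male", "female", "male"),
  age = c(30, 33, 36, 33, 47, 60, 27, 30, 26, 30, 30, 56)
)

# One row per group, with the group size, mean age and maximum age
by_sex <- summarise(group_by(gbbo_small, sex),
                    n = n(),
                    age_mean = mean(age),
                    age_max = max(age))
by_sex
```

Reporting the group size next to the mean is a good habit, since a mean based on very few cases should be read with caution.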
When you search for examples using dplyr on the web, you are very likely to encounter the symbol %>%, called the “pipe”. We are not covering this in full detail, but below is an example of what it is and how to use it.
Pipes are meant to make code easier to write and read. The pipe takes the result on its left and passes it on as the first argument of the function on its right, so the code can be read as a sequence of instructions from top to bottom. This is the last code we used, now rewritten with pipes
gbbo_all %>% # Take the data
group_by(sex) %>% # Now group it by sex
summarise(age_mean=mean(age)) # This is to create our aggregated variable, all in one go!
## # A tibble: 2 x 2
## sex age_mean
## <fct> <dbl>
## 1 female 34.8
## 2 male 38.2
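To see why pipes help readability, compare a nested call with its piped equivalent. The sketch below adds a filter step to the previous example, using only the sex and age values from above. The object names nested and piped are illustrative.

```r
library(dplyr)

# Sex and age of the twelve participants, as defined earlier
gbbo_small <- data.frame(
  sex = c("male", "female", "male", "female", "male", "female",
          "female", "male", "female", "male", "female", "male"),
  age = c(30, 33, 36, 33, 47, 60, 27, 30, 26, 30, 30, 56)
)

# Nested version: has to be read from the inside out
nested <- summarise(group_by(filter(gbbo_small, age >= 30), sex),
                    age_mean = mean(age))

# Piped version: reads from top to bottom
piped <- gbbo_small %>%
  filter(age >= 30) %>%   # keep participants aged 30 or over
  group_by(sex) %>%       # group them by sex
  summarise(age_mean = mean(age))

# Both versions give the same means
all.equal(nested$age_mean, piped$age_mean)
```

Both versions do exactly the same thing, but in the piped version the order of the code matches the order of the steps.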
If you are going to use a summary like the one above many times during your analysis, you can save it as a new object. This saves you from running the same code over and over again. To do this, simply give it a name and use the assignment arrow “<-”.
summary_gbbo <- gbbo_all %>%
group_by(sex) %>%
summarise(age_mean=mean(age))
summary_gbbo # this is to print the results on the console
## # A tibble: 2 x 2
## sex age_mean
## <fct> <dbl>
## 1 female 34.8
## 2 male 38.2
To plot we can use the base package, but much nicer plots can be produced by using the package “ggplot2”. Let’s load it (if it is not available, you need to install it first with install.packages("ggplot2"))
library(ggplot2)
Now draw a simple bar chart of the ages of GBBO participants:
ggplot(gbbo_all, aes(age)) + geom_bar()
As you can see, the bar chart above isn’t terribly informative: it only counts how many participants there are at each age. An alternative is to plot summary statistics. You can embed the code with the pipes from before inside the plot command, like this:
ggplot(gbbo_all %>%
group_by(sex) %>%
summarise(age_mean=mean(age)),
aes(y=age_mean, x=sex)) +
geom_bar(stat = "identity") # we need to specify stat="identity" for the value itself to be plotted
A simpler and more readable option is to save the result of the piped command as a new dataframe first, like this:
mean_by_sex<-gbbo_all %>%
group_by(sex) %>%
summarise(age_mean=mean(age))
And then run the plot with the saved summary dataframe:
ggplot(mean_by_sex, aes(y=age_mean, x=sex)) + geom_bar(stat = "identity")
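ggplot2 also offers geom_col(), which plots the values as they are, so stat = "identity" is not needed. The sketch below is one possible variation rather than the only way to do it. It re-creates the summary by hand from the rounded means shown earlier (34.8 and 38.2), and the axis labels and title are our own wording.

```r
library(ggplot2)

# Summary dataframe typed in by hand from the rounded means shown earlier
mean_by_sex <- data.frame(sex = c("female", "male"),
                          age_mean = c(34.8, 38.2))

# geom_col() uses the y values directly, so no stat argument is needed
p <- ggplot(mean_by_sex, aes(x = sex, y = age_mean)) +
  geom_col() +
  labs(x = "Sex", y = "Mean age (years)",
       title = "Mean age of GBBO 2018 participants by sex")
p
```

Saving the plot as an object (p) and then printing it produces the same chart as running the ggplot command directly, and lets you add further layers later.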
There are many online resources to learn R. Here are some you can use:
http://www.cookbook-r.com/
https://www.statmethods.net/
https://www.datacamp.com/
https://r4ds.had.co.nz/
https://rstudio.com/resources/webinars/
Click this if you like R and giraffes… and tea
Remember that you never finish learning R. Sometimes R users feel like all we do is get better at googling our questions.
[1] Does this ring a bell? It’s inspired by Dr. Seuss’ “Oh, the places you’ll go!”
patricio.troncoso@manchester.ac.uk