Mutating Joins with Dplyr

Introduction

To answer data analysis questions there is often a need to retrieve data from multiple sources. In these cases datasets need to be combined. Base R provides this functionality through the merge function but with the release of the dplyr package combining datasets has been made easier and faster.

The dplyr package provides improvements to the merge function through:

Being less ambiguous - where merge is a single function, dplyr provides “verbs” for each join type,
Standard terminology - in line with widely used standards,
Faster - dplyr performs better than the merge function.

This vignette will introduce you to the mutating join functions within dplyr and show you how to apply them.

Data: Band Members

To understand the mutating join functions within dplyr we’ll use the band members datasets contained within the dplyr package. These datasets are documented in ?band_members. The dataset contains three datasets:

Band members

head(band_members)

## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles

Band Instruments

head(band_instruments)

## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar

Band Instruments 2

head(band_instruments2)

## # A tibble: 3 x 2
##   artist plays 
##   <chr>  <chr> 
## 1 John   guitar
## 2 Paul   bass  
## 3 Keith  guitar

Joins Types

There are four types of mutating joins:

Inner Join
Left Outer Join
Right Outer Join
Full Outer Join

The syntax for each mutating join utilising dplyr is:

inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)

left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)

right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)

full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)

The output of these functions is the same type as that of argument x.

Inner Joins

An inner join combines two datasets and returns observations with matching join values. It is important to note observations from either dataset that do not match an observation from the other dataset are dropped from the combined dataset.

The example below joins the band_members dataset to the band_instruments dataset and returns all matching values. Note that the observation with a name equal to “Mick” doesn’t appear in the combined dataset even though it exists in the band_members dataset; the observation with the name equal to “Keith” in the band_instruments dataset is also dropped as it has no corresponding observation in the band_members dataset.

inner_join(band_members, band_instruments, by = "name")

## # A tibble: 2 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass

The above example could also be written inner_join(band_members, band_instruments) as the key name variable is common to both datasets and the default value for by is NULL meaning that R will join on all variables with matching names.

Outer Joins

Where an inner join keeps observations that appear in both datasets an outer join keeps observations that appear in at least one of the datasets. Which observations are kept and which are excluded is controlled by the choice of outer join. NA is returned when a matching observation doesn’t appear in the returned dataset.

Left Join

A left join keeps all observations in the x, or left, dataset and drops non-matching observations in the y, or right, dataset.

The below example joins the same two datasets as the inner join example but with the left_join verb. The observation in the band_members dataset with a name of “Mick” appears in the combined dataset; the observation in the band_instruments dataset with the name “Keith” doesn’t.

left_join(band_members, band_instruments, by = "name")

## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass

Right Join

A right join works in a similar way to the left join; the difference is it keeps all observations in the y, or right, dataset and drops non-matching observations in the x dataset.

The example below continues with the same two datasets as the inner join example but with the right_join verb. The observation in the band_members dataset with a name of “Mick” doesn’t appear in the combined dataset; the observation in the band_instruments dataset with the name “Keith” does.

right_join(band_members, band_instruments, by = "name")

## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass  
## 3 Keith <NA>    guitar

Full Join

A full join keeps all obeservations in x and y regardless of whether a corresponding match in the other dataset is found.

In the example below the record with “Mick” in the band_members dataset and the record with “Keith” from the band_instruments dataset are both returned.

full_join(band_members, band_instruments, by = "name")

## # A tibble: 4 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar

Additional Information

Join on Variables with Different Names

When the variables in each dataset in the join have different names the command is written:

full_join(band_members, band_instruments2, by = c("name" = "artist"))

## # A tibble: 4 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar

Additional Arguments

copy - if the datasets are from different sources and copy = TRUE then y will be copied across to the datasource where x is located.

suffix - if a variable name occurs in both datasets, and is not used as part of the join, a suffix is added to ensure variable names are unique. By default “.x” and “.y” are added to the variable names but other suffixes can be specified.

Piping

As dplyr is part of the tidyverse mutating joins can be used with piping.

band_members %>% full_join(band_instruments)

## Joining, by = "name"

## # A tibble: 4 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar

References

Wickham, H & Grolemund, G 2017 R for Data Science O’Reilly Media Inc, Sebastopol

Wickham, H, Francois, R, Henry, L & Muller, K Join two tables together, viewed 17 March 2019, https://dplyr.tidyverse.org/reference/join.html