To answer data analysis questions there is often a need to retrieve data from multiple sources. In these cases datasets need to be combined. Base R provides this functionality through the merge function but with the release of the dplyr package combining datasets has been made easier and faster.
The dplyr package provides improvements to the merge function through:
This vignette will introduce you to the mutating join functions within dplyr and show you how to apply them.
To understand the mutating join functions within dplyr we’ll use the band members datasets contained within the dplyr package. These datasets are documented in ?band_members. The dataset contains three datasets:
head(band_members)
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
head(band_instruments)
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
head(band_instruments2)
## # A tibble: 3 x 2
## artist plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
There are four types of mutating joins:
The syntax for each mutating join utilising dplyr is:
inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
The output of these functions is the same type as that of argument x.
An inner join combines two datasets and returns observations with matching join values. It is important to note observations from either dataset that do not match an observation from the other dataset are dropped from the combined dataset.
The example below joins the band_members dataset to the band_instruments dataset and returns all matching values. Note that the observation with a name equal to “Mick” doesn’t appear in the combined dataset even though it exists in the band_members dataset; the observation with the name equal to “Keith” in the band_instruments dataset is also dropped as it has no corresponding observation in the band_members dataset.
inner_join(band_members, band_instruments, by = "name")
## # A tibble: 2 x 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
The above example could also be written inner_join(band_members, band_instruments) as the key name variable is common to both datasets and the default value for by is NULL meaning that R will join on all variables with matching names.
Where an inner join keeps observations that appear in both datasets an outer join keeps observations that appear in at least one of the datasets. Which observations are kept and which are excluded is controlled by the choice of outer join. NA is returned when a matching observation doesn’t appear in the returned dataset.
A left join keeps all observations in the x, or left, dataset and drops non-matching observations in the y, or right, dataset.
The below example joins the same two datasets as the inner join example but with the left_join verb. The observation in the band_members dataset with a name of “Mick” appears in the combined dataset; the observation in the band_instruments dataset with the name “Keith” doesn’t.
left_join(band_members, band_instruments, by = "name")
## # A tibble: 3 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
A right join works in a similar way to the left join; the difference is it keeps all observations in the y, or right, dataset and drops non-matching observations in the x dataset.
The example below continues with the same two datasets as the inner join example but with the right_join verb. The observation in the band_members dataset with a name of “Mick” doesn’t appear in the combined dataset; the observation in the band_instruments dataset with the name “Keith” does.
right_join(band_members, band_instruments, by = "name")
## # A tibble: 3 x 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
## 3 Keith <NA> guitar
A full join keeps all obeservations in x and y regardless of whether a corresponding match in the other dataset is found.
In the example below the record with “Mick” in the band_members dataset and the record with “Keith” from the band_instruments dataset are both returned.
full_join(band_members, band_instruments, by = "name")
## # A tibble: 4 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith <NA> guitar
When the variables in each dataset in the join have different names the command is written:
full_join(band_members, band_instruments2, by = c("name" = "artist"))
## # A tibble: 4 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith <NA> guitar
copy - if the datasets are from different sources and copy = TRUE then y will be copied across to the datasource where x is located.
suffix - if a variable name occurs in both datasets, and is not used as part of the join, a suffix is added to ensure variable names are unique. By default “.x” and “.y” are added to the variable names but other suffixes can be specified.
As dplyr is part of the tidyverse mutating joins can be used with piping.
band_members %>% full_join(band_instruments)
## Joining, by = "name"
## # A tibble: 4 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith <NA> guitar
Wickham, H & Grolemund, G 2017 R for Data Science O’Reilly Media Inc, Sebastopol
Wickham, H, Francois, R, Henry, L & Muller, K Join two tables together, viewed 17 March 2019, https://dplyr.tidyverse.org/reference/join.html