This cheatsheet demonstrates various types of joins using the
dplyr package in R. We will use two example dataframes
containing data about species in the Procyonidae family, which is most
famous for containing Raccoons (Procyon lotor; my favourite
animal!).
tidyverse
package## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df1 <- data.frame(
species = c("Procyon lotor", "Bassariscus astutus", "Nasua nasua"),
fur_color = c("Gray", "Brown", "Red"), social = c(TRUE, FALSE, TRUE),
avg_weight_kg = c(7.5, 1.5, 4.0)
)df2 <- data.frame(
species = c("Procyon lotor", "Bassariscus astutus", "Potos flavus"),
habitat = c("Forests", "Deserts", "Rainforests"),
diet = c("Omnivore", "Carnivore", "Frugivore")
)df1:
## species fur_color social avg_weight_kg
## 1 Procyon lotor Gray TRUE 7.5
## 2 Bassariscus astutus Brown FALSE 1.5
## 3 Nasua nasua Red TRUE 4.0
df2:
## species habitat diet
## 1 Procyon lotor Forests Omnivore
## 2 Bassariscus astutus Deserts Carnivore
## 3 Potos flavus Rainforests Frugivore
An inner join returns only the rows that have matching values in both dataframes.
## species fur_color social avg_weight_kg habitat diet
## 1 Procyon lotor Gray TRUE 7.5 Forests Omnivore
## 2 Bassariscus astutus Brown FALSE 1.5 Deserts Carnivore
Explanation: The inner join combines rows from both dataframes where the species names match. This reveals the common species between the two dataframes along with their combined attributes.
A semi join returns all rows from the left dataframe where there are matching values in the right dataframe.
## species fur_color social avg_weight_kg
## 1 Procyon lotor Gray TRUE 7.5
## 2 Bassariscus astutus Brown FALSE 1.5
Explanation: The semi join returns only the rows from df1 that have a matching species in df2. This shows which species in df1 are also present in df2. This can be thought of as applying a kind of filter.
A left join returns all rows from the left dataframe, along with matching rows from the right dataframe and columns from that dataframe.
## species fur_color social avg_weight_kg habitat diet
## 1 Procyon lotor Gray TRUE 7.5 Forests Omnivore
## 2 Bassariscus astutus Brown FALSE 1.5 Deserts Carnivore
## 3 Nasua nasua Red TRUE 4.0 <NA> <NA>
Explanation: The left join includes all species from the first dataframe and adds the corresponding attributes from the second dataframe where there is a match. This shows all species in df1, with additional information from df2 where available.
Note that Nasua nasua has NA for the
columns habitat and diet, since these came
from df2 and this species does not exist in df2.
An anti join returns all rows from the left dataframe where there are no matching values in the right dataframe.
## species fur_color social avg_weight_kg
## 1 Nasua nasua Red TRUE 4
Explanation: The anti join returns only the rows from df1 that do not have a matching species in df2. Like the semi join, it can be thought of as a kind of filtering operation.
A full join returns all rows when there is a match in either left or right dataframe.
## species fur_color social avg_weight_kg habitat diet
## 1 Procyon lotor Gray TRUE 7.5 Forests Omnivore
## 2 Bassariscus astutus Brown FALSE 1.5 Deserts Carnivore
## 3 Nasua nasua Red TRUE 4.0 <NA> <NA>
## 4 Potos flavus <NA> NA NA Rainforests Frugivore
Explanation: The full join combines all rows from both dataframes, matching species where possible. This reveals all species from both dataframes, with NA values where there is no match.