R Joins Cheatsheet by Derek Rodgers (rodgersd)

Introduction

This cheatsheet demonstrates various types of joins using the dplyr package in R. We will use two example dataframes containing data about species in the Procyonidae family, which is most famous for containing Raccoons (Procyon lotor; my favourite animal!).

Setup

Load the tidyverse package

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Create the first dataframe

df1 <- data.frame(
                    species = c("Procyon lotor", "Bassariscus astutus", "Nasua nasua"),
                    fur_color = c("Gray", "Brown", "Red"), social = c(TRUE, FALSE, TRUE),
                    avg_weight_kg = c(7.5, 1.5, 4.0)
                  )

Create the second dataframe

df2 <- data.frame(
                    species = c("Procyon lotor", "Bassariscus astutus", "Potos flavus"),
                    habitat = c("Forests", "Deserts", "Rainforests"),
                    diet = c("Omnivore", "Carnivore", "Frugivore")
                  )

Display the dataframes

df1:

df1
##               species fur_color social avg_weight_kg
## 1       Procyon lotor      Gray   TRUE           7.5
## 2 Bassariscus astutus     Brown  FALSE           1.5
## 3         Nasua nasua       Red   TRUE           4.0

df2:

df2
##               species     habitat      diet
## 1       Procyon lotor     Forests  Omnivore
## 2 Bassariscus astutus     Deserts Carnivore
## 3        Potos flavus Rainforests Frugivore

Inner Join

An inner join returns only the rows that have matching values in both dataframes.

inner_join_result <- inner_join(df1, df2, by = "species")
inner_join_result
##               species fur_color social avg_weight_kg habitat      diet
## 1       Procyon lotor      Gray   TRUE           7.5 Forests  Omnivore
## 2 Bassariscus astutus     Brown  FALSE           1.5 Deserts Carnivore

Explanation: The inner join combines rows from both dataframes where the species names match. This reveals the common species between the two dataframes along with their combined attributes.

Semi Join

A semi join returns all rows from the left dataframe where there are matching values in the right dataframe.

semi_join_result <- semi_join(df1, df2, by = "species")
semi_join_result
##               species fur_color social avg_weight_kg
## 1       Procyon lotor      Gray   TRUE           7.5
## 2 Bassariscus astutus     Brown  FALSE           1.5

Explanation: The semi join returns only the rows from df1 that have a matching species in df2. This shows which species in df1 are also present in df2. This can be thought of as applying a kind of filter.

Left Join

A left join returns all rows from the left dataframe, along with matching rows from the right dataframe and columns from that dataframe.

left_join_result <- left_join(df1, df2, by = "species")
left_join_result
##               species fur_color social avg_weight_kg habitat      diet
## 1       Procyon lotor      Gray   TRUE           7.5 Forests  Omnivore
## 2 Bassariscus astutus     Brown  FALSE           1.5 Deserts Carnivore
## 3         Nasua nasua       Red   TRUE           4.0    <NA>      <NA>

Explanation: The left join includes all species from the first dataframe and adds the corresponding attributes from the second dataframe where there is a match. This shows all species in df1, with additional information from df2 where available.

Note that Nasua nasua has NA for the columns habitat and diet, since these came from df2 and this species does not exist in df2.

Anti Join

An anti join returns all rows from the left dataframe where there are no matching values in the right dataframe.

anti_join_result <- anti_join(df1, df2, by = "species")
anti_join_result
##       species fur_color social avg_weight_kg
## 1 Nasua nasua       Red   TRUE             4

Explanation: The anti join returns only the rows from df1 that do not have a matching species in df2. Like the semi join, it can be thought of as a kind of filtering operation.

Full Join

A full join returns all rows when there is a match in either left or right dataframe.

full_join_result <- full_join(df1, df2, by = "species")
full_join_result
##               species fur_color social avg_weight_kg     habitat      diet
## 1       Procyon lotor      Gray   TRUE           7.5     Forests  Omnivore
## 2 Bassariscus astutus     Brown  FALSE           1.5     Deserts Carnivore
## 3         Nasua nasua       Red   TRUE           4.0        <NA>      <NA>
## 4        Potos flavus      <NA>     NA            NA Rainforests Frugivore

Explanation: The full join combines all rows from both dataframes, matching species where possible. This reveals all species from both dataframes, with NA values where there is no match.