1 Prelude

Joining datasets is a crucial skill when working with health-related data as it allows you to combine information from multiple sources, leading to more comprehensive and insightful analyses. In this lesson, you’ll learn how to use different joining techniques using R’s dplyr package. Let’s get started!

2 Learning Objectives

You understand how each of the different dplyr joins work
You’re able to choose the appropriate join for your data
You can join simple datasets together using functions from dplyr

3 Packages

Please load the packages needed for this lesson with the code below:

# Load packages 
if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, 
               countrycode)

4 What is a join and why do we need it?

To illustrate the utility of joins, let’s start with a toy example. Consider the following two datasets. The first, demographic, contains names and ages of three patients:

demographic <- 
  tribble(~name,     ~age,
          "Alice",    25,
          "Bob",      32,
          "Charlie",  45)
demographic

The second, test_info, contains tuberculosis test dates and results for those patients:

test_info <- 
  tribble(~name,     ~test_date,    ~result,
          "Alice",   "2023-06-05",  "Negative",
          "Bob",     "2023-08-10",  "Positive",
          "Charlie", "2023-07-15",  "Negative")
test_info

We’d like to analyze these data together, and so we need a way to combine them.

One option we might consider is the cbind() function from base R (cbind is short for column bind):

cbind(demographic, test_info)

name	age	name	test_date	result
Alice	25	Alice	2023-06-05	Negative
Bob	32	Bob	2023-08-10	Positive
Charlie	45	Charlie	2023-07-15	Negative

This successfully merges the datasets, but it doesn’t do so very intelligently. The function essentially “pastes” or “staples” the two tables together. So, as you can notice, the “name” column appears twice. This is not ideal and will be problematic for analysis.

Another problem occurs if the rows in the two datasets are not already aligned. In this case, the data will be combined incorrectly with cbind(). Consider the test_info_disordered dataset below, which now has Bob in the first row:

test_info_disordered <- 
  tribble(~name,     ~test_date,    ~result,
          "Bob",     "2023-08-10",  "Positive", # Bobin first row
          "Alice",   "2023-06-05",  "Negative",
          "Charlie", "2023-07-15",  "Negative")

What happens if we cbind() this with the original demographic dataset, where Bob was in the second row?

cbind(demographic, test_info_disordered)

name	age	name	test_date	result
Alice	25	Bob	2023-08-10	Positive
Bob	32	Alice	2023-06-05	Negative
Charlie	45	Charlie	2023-07-15	Negative

Alice’s demographic details are now mistakenly aligned with Bob’s test info!

A third issue arises when an entity appears more than once in one dataset. Perhaps Alice did multiple TB tests:

test_info_multiple <- 
  tribble(~name,     ~test_date,    ~result,
          "Alice",   "2023-06-05",  "Negative",
          "Alice",   "2023-06-06",  "Negative",
          "Bob",     "2023-08-10",  "Positive",
          "Charlie", "2023-07-15",  "Negative")

If we try to cbind() this with the demographic dataset, we’ll get an error, due to a mismatch in row counts:

cbind(demographic, test_info_multiple)

Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 3, 4

What we have here is called a one-to-many relationship—one Alice in the demographic data, but multiple Alice rows in the test data. Joining in such cases will be covered in detail in the second joining lesson.

Clearly, we need a smarter way to combine datasets than cbind(); we’ll need to venture into the world of joining.

Let’s start with the most common join, the left_join(), which solves the problems we previously encountered.

It works for the simple case, and it does not duplicate the name column:

left_join(demographic, test_info)

## Joining with `by = join_by(name)`

It works where the datasets are not ordered identically:

left_join(demographic, test_info_disordered)

## Joining with `by = join_by(name)`

And it works when there are multiple test rows per patient:

left_join(demographic, test_info_multiple)

## Joining with `by = join_by(name)`

Simple yet beautiful!

We’ll be using the pipe operator as well when joining. Remember that this:

demographic %>% left_join(test_info)

## Joining with `by = join_by(name)`

is equivalent to this:

left_join(demographic, test_info)

## Joining with `by = join_by(name)`

5 Joining syntax

Now that we understand why we need joins, let’s look at their basic syntax.

Joins take two dataframes as the first two arguments: x (the left dataframe) and y (the right dataframe). As with other R functions, you can provide these as named or unnamed arguments:

# both the same:
left_join(x = demographic, y = test_info) # named
left_join(demographic, test_info) # unnamed

Another critical argument is by, which indicates the column or key used to connect the tables. We don’t always need to supply this argument; it can be inferred from the datasets. For example, in our original examples, “name” is the only column common to demographic and test_info. So the join function assumes by = "name":

# these are equivalent
left_join(x = demographic, y = test_info)  
left_join(x = demographic, y = test_info, by = "name")

The column used to connect rows across the tables is known as a “key”. In the dplyr join functions, the key is specified in the by argument, as seen in left_join(x = demographic, y = test_info, by = "name")

What happens if the keys are named differently in the two datasets? Consider the test_info_different_name dataset below, where the “name” column has been changed to “test_recipient”:

test_info_different_name <- 
  tribble(~test_recipient,  ~test_date,    ~result,
          "Alice",          "2023-06-05",  "Negative",
          "Bob",            "2023-08-10",  "Positive",
          "Charlie",        "2023-07-15",  "Negative")
test_info_different_name

If we try to join test_info_different_name with our original demographic dataset, we will encounter an error:

left_join(x = demographic, y = test_info_different_name)

Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common
  variables.
ℹ Use `cross_join()` to perform a cross-join.

The error indicates that there are no common variables, so the join is not possible.

In situations like this, you have two choices: you can rename the column in the second dataframe to match the first, or more simply, specify which columns to join on using by = c().

Here’s how to do this:

left_join(x = demographic, y = test_info_different_name, 
          by = c("name" = "test_recipient"))

The syntax c("name" = "test_recipient") is a bit unusual. It essentially says, “Connect name from data frame x with test_recipient from data frame y because they represent the same data.”

Consider the two datasets below, one with patient details and the other with medical check-up dates for these patients.

patients <- tribble(
  ~patient_id, ~name,     ~age,
  1,          "John",     32,
  2,          "Joy",    28,
  3,          "Khan",      40
)

checkups <- tribble(
  ~patient_id, ~checkup_date,
  1,          "2023-01-20",
  2,          "2023-02-20",
  3,          "2023-05-15"
)

Join the patients dataset with the checkups dataset using left_join()

Two datasets are defined below, one with patient details and the other with vaccination records for those patients.

# Patient Details
patient_details <- tribble(
  ~id_number,  ~full_name,   ~address,
  "A001",      "Alice",      "123 Elm St",
  "B002",      "Bob",        "456 Maple Dr",
  "C003",      "Charlie",    "789 Oak Blvd"
)

# Vaccination Records
vaccination_records <- tribble(
  ~patient_code, ~vaccine_type,  ~vaccination_date,
  "A001",        "COVID-19",     "2022-05-10",
  "B002",        "Flu",          "2023-09-01",
  "C003",        "Hepatitis B",  "2021-12-15"
)

Join the patient_details and vaccination_records datasets. You will need to use the by argument because the patient identifier columns have different names.

6 Types of joins

The toy examples so far have involved datasets that could be matched perfectly - every row in one dataset had a corresponding row in the other dataset.

Real-world data is usually messier. Often, there will be entries in the first table that do not have corresponding entries in the second table, and vice versa.

To handle these cases of imperfect matching, there are different join types with specific behaviors: left_join(), right_join(), inner_join(), and full_join(). In the upcoming sections, we’ll look at examples of how each join type operates on datasets with imperfect matches.

6.1 `left_join()`

Let’s start with left_join(), which you’ve already been introduced to. To see how it handles unmatched rows, we will try to join our original demographic dataset with a modified version of the test_info dataset.

As a reminder, here is the demographic dataset, with Alice, Bob and Charlie:

demographic

For test information, we’ll remove Charlie and we’ll add a new patient, Xavier, and his test data:

test_info_xavier <- tribble(
  ~name,    ~test_date, ~result,
  "Alice",  "2023-06-05", "Negative",
  "Bob",    "2023-08-10", "Positive",
  "Xavier", "2023-05-02", "Negative")
test_info_xavier

If we perform a left_join() using demographic as the left dataset (x = demographic) and test_info_xavier as the right dataset (y = test_info_xavier), what should we expect? Recall that Charlie is only present in the left dataset, and Xavier is only present in the right. Well, here’s what happens:

left_join(x = demographic, y = test_info_xavier, by = "name")

As you can see, with the LEFT join, all records from the LEFT dataframe (demographic) are retained. So, even though Charlie doesn’t have a match in the test_info_xavier dataset, he’s still included in the output. (But of course, since his test information is not available in test_info_xavier those values were left as NA.)

Xavier, on the other hand, who was only present in the right dataset, gets dropped.

The graphic below shows how this join worked:

In a join function like left_join(x, y), the dataset provided to the x argument can be termed the “left” dataset, while the dataset assigned to the y argument can be called the “right” dataset.

Now what if we flip the datasets? Let’s see the outcome when test_info_xavier is the left dataset and demographic is the right one:

left_join(x = test_info_xavier, y = demographic, by = "name")

Once again, the left_join() retains all rows from the left dataset (now test_info_xavier). This means Xavier’s data is included this time. Charlie, on the other hand, is excluded.

The graphic below illustrates how this works:

Primary Dataset: In the context of joins, the primary dataset refers to the main or prioritized dataset in an operation. In a left join, the left dataset is considered the primary dataset because all of its rows are retained in the output, regardless of whether they have a matching row in the other dataset.

Try out the following. Below are two datasets - one with disease diagnoses (disease_dx) and another with patient demographics (patient_demographics).

disease_dx <- tribble(
  ~patient_id, ~disease,       ~date_of_diagnosis,
  1,            "Influenza",    "2023-01-15",
  3,            "COVID-19",     "2023-03-05",
  8,            "Influenza",    "2023-02-20",
)

patient_demographics <- tribble(
  ~patient_id, ~name,      ~age,  ~gender,
  1,            "Fred",      28,  "Female",
  2,            "Genevieve", 45,  "Female",
  3,            "Henry",     32,  "Male",
  5,            "Irene",     55,  "Female",
  8,            "Jules",     40,  "Male"
)

Use left_join() to merge these datasets, keeping only patients for whom we have demographic information. Think carefully about which dataset to put on the left.

Let’s try another example, this time with a more realistic set of data.

First, we have data on the TB incidence rate per 100,000 people for 47 African countries, from the [WHO]((https://www.who.int/data/gho/data/indicators/indicator-details/GHO/incidence-of-tuberculosis-(per-100-000-population-per-year):

tb_2019_africa <- read_csv(here("data/tb_incidence_2019.csv"))

## Rows: 47 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, conf_int_95
## dbl (1): cases
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

tb_2019_africa

We want to analyze how TB incidence in African countries varies with government health expenditure per capita. For this, we have data on health expenditure per capita in USD, also from the WHO:

health_exp_2019 <- read_csv(here("data/health_expend_per_cap_2019.csv"))

## Rows: 185 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (1): expend_usd
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

health_exp_2019

Which dataset should we use as the left dataframe for the join?

Since our goal is to analyze African countries, we should use tb_2019_africa as the left dataframe. This will ensure we keep all the African countries in the final joined dataset.

Let’s join them:

tb_health_exp_joined <- 
  tb_2019_africa %>%
  left_join(health_exp_2019, by = "country")
tb_health_exp_joined

Now in the joined dataset, we have just the 47 rows for African countries, which is exactly what we wanted!

All rows from the left dataframe tb_2019_africa were kept, while non-African countries from health_exp_2019 were discarded.

We can check if any rows in tb_2019_africa did not have a match in health_exp_2019 by filtering for NA values:

tb_health_exp_joined %>% 
  filter(is.na(!expend_usd))

This shows that 3 countries - Mauritius, South Sudan, and Comoros - did not have expenditure data in health_exp_2019. But because they were present in tb_2019_africa, and that was the left dataframe, they were still included in the joined data.

To be sure, we can quickly confirm that those countries are absent from the expenditure dataset with a filter statement:

health_exp_2019 %>% 
  filter(country %in% c("Mauritius", "South Sudan", "Comoros"))

Indeed, these countries aren’t present in health_exp_2019.

Copy the code below to define two datasets.

The first, tb_cases_children contains the number of TB cases in under 15s in 2012, by country:

tb_cases_children <- tidyr::who %>% 
  filter(year == 2012) %>% 
  transmute(country, tb_cases_smear_0_14 = new_sp_m014 + new_sp_f014)

tb_cases_children

And country_continents, from the {countrycode} package, lists all countries and their corresponding region and continent:

country_continents <- 
  countrycode::codelist %>% 
  select(country.name.en, continent, region)

country_continents

Your goal is to add the continent and region data to the TB cases dataset.

Which dataset should be the left dataframe, x? And which should be the right, y? Once you’ve decided, join the datasets appropriately using left_join().

6.2 `right_join()`

A right_join() can be thought of as a mirror image of a left_join(). The mechanics are the same, but now all rows from the RIGHT dataset are retained, while only those rows from the left dataset that find a match in the right are kept.

Let’s look at an example to understand this. We’ll use our original demographic and modified test_info_xavier datasets:

demographic

test_info_xavier

Now let’s try right_join(), with demographic as the right dataframe:

right_join(x = test_info_xavier, y = demographic)

## Joining with `by = join_by(name)`

Hopefully you’re getting the hang of this, and could predict that output! Since demographic was the right dataframe, and we are using right-join, all the rows from demographic are kept—Alice, Bob and Charlie. But only matching records in the left data frame test_info_xavier!

The graphic below illustrates this process:

An important point—the same final dataframe can be created with either left_join() or right_join(), it just depends on what order you provide the data frames to these functions:

# here, RIGHT_join prioritizes the RIGHT df, demographic
right_join(x = test_info_xavier, y = demographic)

## Joining with `by = join_by(name)`

# here, LEFT_join prioritizes the LEFT df, again demographic
left_join(x = demographic, y = test_info_xavier)

## Joining with `by = join_by(name)`

The one difference you might notice between left and right-join is that the final column orders are different. But columns can easily be rearranged, so worrying about column order is not really worth your time.

As we previously mentioned, data scientists typically favor left_join() over right_join(). It makes more sense to specify your primary dataset first, in the left position. Opting for a left_join() is a common best practice due to its clearer logic, making it less error-prone.

Great, now we understand how left_join() and right_join() work, let’s move on to inner_join() and full_join()!

Joining