Joining datasets is a crucial skill when working with health-related
data as it allows you to combine information from multiple sources,
leading to more comprehensive and insightful analyses. In this lesson,
you’ll learn how to use different joining techniques using R’s
dplyr package. Let’s get started!
You understand how each of the different dplyr joins
work
You’re able to choose the appropriate join for your data
You can join simple datasets together using functions from
dplyr
Please load the packages needed for this lesson with the code below:
To illustrate the utility of joins, let’s start with a toy example.
Consider the following two datasets. The first,
demographic, contains names and ages of three patients:
The second, test_info, contains tuberculosis test dates
and results for those patients:
test_info <-
tribble(~name, ~test_date, ~result,
"Alice", "2023-06-05", "Negative",
"Bob", "2023-08-10", "Positive",
"Charlie", "2023-07-15", "Negative")
test_infoWe’d like to analyze these data together, and so we need a way to combine them.
One option we might consider is the cbind() function
from base R (cbind is short for column bind):
| name | age | name | test_date | result |
|---|---|---|---|---|
| Alice | 25 | Alice | 2023-06-05 | Negative |
| Bob | 32 | Bob | 2023-08-10 | Positive |
| Charlie | 45 | Charlie | 2023-07-15 | Negative |
This successfully merges the datasets, but it doesn’t do so very intelligently. The function essentially “pastes” or “staples” the two tables together. So, as you can notice, the “name” column appears twice. This is not ideal and will be problematic for analysis.
Another problem occurs if the rows in the two datasets are not
already aligned. In this case, the data will be combined incorrectly
with cbind(). Consider the
test_info_disordered dataset below, which now has Bob in
the first row:
test_info_disordered <-
tribble(~name, ~test_date, ~result,
"Bob", "2023-08-10", "Positive", # Bobin first row
"Alice", "2023-06-05", "Negative",
"Charlie", "2023-07-15", "Negative")What happens if we cbind() this with the original
demographic dataset, where Bob was in the second
row?
| name | age | name | test_date | result |
|---|---|---|---|---|
| Alice | 25 | Bob | 2023-08-10 | Positive |
| Bob | 32 | Alice | 2023-06-05 | Negative |
| Charlie | 45 | Charlie | 2023-07-15 | Negative |
Alice’s demographic details are now mistakenly aligned with Bob’s test info!
A third issue arises when an entity appears more than once in one dataset. Perhaps Alice did multiple TB tests:
test_info_multiple <-
tribble(~name, ~test_date, ~result,
"Alice", "2023-06-05", "Negative",
"Alice", "2023-06-06", "Negative",
"Bob", "2023-08-10", "Positive",
"Charlie", "2023-07-15", "Negative")If we try to cbind() this with the
demographic dataset, we’ll get an error, due to a mismatch
in row counts:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 3, 4
What we have here is called a one-to-many
relationship—one Alice in the demographic data, but multiple Alice rows
in the test data. Joining in such cases will be covered in detail in the
second joining lesson.
Clearly, we need a smarter way to combine datasets than
cbind(); we’ll need to venture into the world of
joining.
Let’s start with the most common join, the left_join(),
which solves the problems we previously encountered.
It works for the simple case, and it does not duplicate the name column:
## Joining with `by = join_by(name)`
It works where the datasets are not ordered identically:
## Joining with `by = join_by(name)`
And it works when there are multiple test rows per patient:
## Joining with `by = join_by(name)`
Simple yet beautiful!
Now that we understand why we need joins, let’s look at their basic syntax.
Joins take two dataframes as the first two arguments: x
(the left dataframe) and y (the right
dataframe). As with other R functions, you can provide these as named or
unnamed arguments:
# both the same:
left_join(x = demographic, y = test_info) # named
left_join(demographic, test_info) # unnamedAnother critical argument is by, which indicates the
column or key used to connect the tables. We don’t
always need to supply this argument; it can be inferred from
the datasets. For example, in our original examples, “name” is the only
column common to demographic and test_info. So
the join function assumes by = "name":
# these are equivalent
left_join(x = demographic, y = test_info)
left_join(x = demographic, y = test_info, by = "name")The column used to connect rows across the tables is known as a
“key”. In the dplyr join functions, the key is specified in the
by argument, as seen in
left_join(x = demographic, y = test_info, by = "name")
What happens if the keys are named differently in the two datasets?
Consider the test_info_different_name dataset below, where
the “name” column has been changed to “test_recipient”:
test_info_different_name <-
tribble(~test_recipient, ~test_date, ~result,
"Alice", "2023-06-05", "Negative",
"Bob", "2023-08-10", "Positive",
"Charlie", "2023-07-15", "Negative")
test_info_different_nameIf we try to join test_info_different_name with our
original demographic dataset, we will encounter an
error:
Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common
variables.
ℹ Use `cross_join()` to perform a cross-join.
The error indicates that there are no common variables, so the join is not possible.
In situations like this, you have two choices: you can rename the
column in the second dataframe to match the first, or more simply,
specify which columns to join on using by = c().
Here’s how to do this:
The syntax c("name" = "test_recipient") is a bit
unusual. It essentially says, “Connect name from data frame
x with test_recipient from data frame y because they
represent the same data.”
Consider the two datasets below, one with patient details and the other with medical check-up dates for these patients.
patients <- tribble(
~patient_id, ~name, ~age,
1, "John", 32,
2, "Joy", 28,
3, "Khan", 40
)
checkups <- tribble(
~patient_id, ~checkup_date,
1, "2023-01-20",
2, "2023-02-20",
3, "2023-05-15"
)Join the patients dataset with the checkups
dataset using left_join()
Two datasets are defined below, one with patient details and the other with vaccination records for those patients.
# Patient Details
patient_details <- tribble(
~id_number, ~full_name, ~address,
"A001", "Alice", "123 Elm St",
"B002", "Bob", "456 Maple Dr",
"C003", "Charlie", "789 Oak Blvd"
)
# Vaccination Records
vaccination_records <- tribble(
~patient_code, ~vaccine_type, ~vaccination_date,
"A001", "COVID-19", "2022-05-10",
"B002", "Flu", "2023-09-01",
"C003", "Hepatitis B", "2021-12-15"
)Join the patient_details and
vaccination_records datasets. You will need to use the
by argument because the patient identifier columns have
different names.
The toy examples so far have involved datasets that could be matched perfectly - every row in one dataset had a corresponding row in the other dataset.
Real-world data is usually messier. Often, there will be entries in the first table that do not have corresponding entries in the second table, and vice versa.
To handle these cases of imperfect matching, there are different join
types with specific behaviors: left_join(),
right_join(), inner_join(), and
full_join(). In the upcoming sections, we’ll look at
examples of how each join type operates on datasets with imperfect
matches.
left_join()Let’s start with left_join(), which you’ve already been
introduced to. To see how it handles unmatched rows, we will try to join
our original demographic dataset with a modified version of
the test_info dataset.
As a reminder, here is the demographic dataset, with
Alice, Bob and Charlie:
For test information, we’ll remove Charlie and we’ll add
a new patient, Xavier, and his test data:
test_info_xavier <- tribble(
~name, ~test_date, ~result,
"Alice", "2023-06-05", "Negative",
"Bob", "2023-08-10", "Positive",
"Xavier", "2023-05-02", "Negative")
test_info_xavierIf we perform a left_join() using
demographic as the left dataset
(x = demographic) and test_info_xavier as the
right dataset (y = test_info_xavier), what should we
expect? Recall that Charlie is only present in the left dataset, and
Xavier is only present in the right. Well, here’s what happens:
As you can see, with the LEFT join, all records from the
LEFT dataframe (demographic) are retained. So,
even though Charlie doesn’t have a match in the
test_info_xavier dataset, he’s still included in the
output. (But of course, since his test information is not available in
test_info_xavier those values were left as
NA.)
Xavier, on the other hand, who was only present in the right dataset, gets dropped.
The graphic below shows how this join worked:
In a join function like left_join(x, y), the dataset
provided to the x argument can be termed the “left”
dataset, while the dataset assigned to the y argument can
be called the “right” dataset.
Now what if we flip the datasets? Let’s see the outcome when
test_info_xavier is the left dataset and
demographic is the right one:
Once again, the left_join() retains all rows from the
left dataset (now test_info_xavier). This means
Xavier’s data is included this time. Charlie, on the other hand, is
excluded.
The graphic below illustrates how this works:
Primary Dataset: In the context of joins, the primary dataset refers to the main or prioritized dataset in an operation. In a left join, the left dataset is considered the primary dataset because all of its rows are retained in the output, regardless of whether they have a matching row in the other dataset.
Try out the following. Below are two datasets - one with disease
diagnoses (disease_dx) and another with patient
demographics (patient_demographics).
disease_dx <- tribble(
~patient_id, ~disease, ~date_of_diagnosis,
1, "Influenza", "2023-01-15",
3, "COVID-19", "2023-03-05",
8, "Influenza", "2023-02-20",
)
patient_demographics <- tribble(
~patient_id, ~name, ~age, ~gender,
1, "Fred", 28, "Female",
2, "Genevieve", 45, "Female",
3, "Henry", 32, "Male",
5, "Irene", 55, "Female",
8, "Jules", 40, "Male"
)Use left_join() to merge these datasets, keeping only
patients for whom we have demographic information. Think carefully about
which dataset to put on the left.
Let’s try another example, this time with a more realistic set of data.
First, we have data on the TB incidence rate per 100,000 people for 47 African countries, from the [WHO]((https://www.who.int/data/gho/data/indicators/indicator-details/GHO/incidence-of-tuberculosis-(per-100-000-population-per-year):
## Rows: 47 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, conf_int_95
## dbl (1): cases
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We want to analyze how TB incidence in African countries varies with government health expenditure per capita. For this, we have data on health expenditure per capita in USD, also from the WHO:
## Rows: 185 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (1): expend_usd
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Which dataset should we use as the left dataframe for the join?
Since our goal is to analyze African countries, we should use
tb_2019_africa as the left dataframe. This will ensure we
keep all the African countries in the final joined dataset.
Let’s join them:
tb_health_exp_joined <-
tb_2019_africa %>%
left_join(health_exp_2019, by = "country")
tb_health_exp_joinedNow in the joined dataset, we have just the 47 rows for African countries, which is exactly what we wanted!
All rows from the left dataframe tb_2019_africa were
kept, while non-African countries from health_exp_2019 were
discarded.
We can check if any rows in tb_2019_africa did not have
a match in health_exp_2019 by filtering for NA
values:
This shows that 3 countries - Mauritius, South Sudan, and Comoros -
did not have expenditure data in health_exp_2019. But
because they were present in tb_2019_africa, and that was
the left dataframe, they were still included in the joined data.
To be sure, we can quickly confirm that those countries are absent from the expenditure dataset with a filter statement:
Indeed, these countries aren’t present in
health_exp_2019.
Copy the code below to define two datasets.
The first, tb_cases_children contains the number of TB
cases in under 15s in 2012, by country:
tb_cases_children <- tidyr::who %>%
filter(year == 2012) %>%
transmute(country, tb_cases_smear_0_14 = new_sp_m014 + new_sp_f014)
tb_cases_childrenAnd country_continents, from the {countrycode} package,
lists all countries and their corresponding region and continent:
country_continents <-
countrycode::codelist %>%
select(country.name.en, continent, region)
country_continentsYour goal is to add the continent and region data to the TB cases dataset.
Which dataset should be the left dataframe, x? And which
should be the right, y? Once you’ve decided, join the
datasets appropriately using left_join().
right_join()A right_join() can be thought of as a mirror image of a
left_join(). The mechanics are the same, but now all rows
from the RIGHT dataset are retained, while only those rows from
the left dataset that find a match in the right are kept.
Let’s look at an example to understand this. We’ll use our original
demographic and modified test_info_xavier
datasets:
Now let’s try right_join(), with
demographic as the right dataframe:
## Joining with `by = join_by(name)`
Hopefully you’re getting the hang of this, and could predict that
output! Since demographic was the right dataframe,
and we are using right-join, all the rows from
demographic are kept—Alice, Bob and Charlie. But only
matching records in the left data frame
test_info_xavier!
The graphic below illustrates this process:
An important point—the same final dataframe can be created with
either left_join() or right_join(), it just
depends on what order you provide the data frames to these
functions:
# here, RIGHT_join prioritizes the RIGHT df, demographic
right_join(x = test_info_xavier, y = demographic)## Joining with `by = join_by(name)`
# here, LEFT_join prioritizes the LEFT df, again demographic
left_join(x = demographic, y = test_info_xavier)## Joining with `by = join_by(name)`
The one difference you might notice between left and right-join is that the final column orders are different. But columns can easily be rearranged, so worrying about column order is not really worth your time.
As we previously mentioned, data scientists typically favor
left_join() over right_join(). It makes more
sense to specify your primary dataset first, in the left position.
Opting for a left_join() is a common best practice due to
its clearer logic, making it less error-prone.
Great, now we understand how left_join() and
right_join() work, let’s move on to
inner_join() and full_join()!