1 Introduction

When working with real-world data, information is rarely stored in a single table. You might have students in one dataset, their grades in another, and their school information in a third. Merging (also called joining) lets us combine these into one unified dataset for analysis.

Key concept: Datasets are merged using a key column — a shared variable that exists in both tables (e.g., student_id, country_code).

2 Types of Joins — Visual Overview

Types of Database Joins

Join Type	R Function (`dplyr`)	Base R	Keeps rows from …
Inner join	`inner_join()`	`merge(all = FALSE)`	Both tables (matches only)
Left join	`left_join()`	`merge(all.x = TRUE)`	All of left + matches
Right join	`right_join()`	`merge(all.y = TRUE)`	All of right + matches
Full join	`full_join()`	`merge(all = TRUE)`	All rows from both

3 Load Required Packages

# Install if you don't have them yet:
# install.packages(c("dplyr", "ggplot2", "knitr", "kableExtra"))

library(dplyr)        # For join functions
library(ggplot2)      # For visualisations
library(knitr)        # For nice tables
library(kableExtra)   # For styled tables

4 Create Three Example Datasets

We will simulate a small school database with students, grades, and schools.

# ── Dataset 1: Students ────────────────────────────────────────────────────────
students <- data.frame(
  student_id = c(1, 2, 3, 4, 5),
  name       = c("Alice", "Bob", "Carol", "David", "Eva"),
  age        = c(20, 22, 21, 23, 20),
  school_id  = c(101, 102, 101, 103, 102),
  stringsAsFactors = FALSE
)

# ── Dataset 2: Grades ──────────────────────────────────────────────────────────
grades <- data.frame(
  student_id = c(1, 2, 3, 6),          # NOTE: student 6 does NOT exist in students!
  subject    = c("Math", "English", "Science", "History"),
  score      = c(88, 74, 92, 65),
  stringsAsFactors = FALSE
)

# ── Dataset 3: Schools ─────────────────────────────────────────────────────────
schools <- data.frame(
  school_id   = c(101, 102, 103, 104),  # NOTE: school 104 has no students!
  school_name = c("Green Valley HS", "Blue Ridge HS", "Sunset Academy", "Eastwood Prep"),
  city        = c("Kigali", "Nairobi", "Lagos", "Accra"),
  stringsAsFactors = FALSE
)

Let’s preview each dataset:

kable(students, caption = "Dataset 1 — Students") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Dataset 1 — Students
student_id	name	age	school_id
1	Alice	20	101
2	Bob	22	102
3	Carol	21	101
4	David	23	103
5	Eva	20	102

kable(grades, caption = "Dataset 2 — Grades") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Dataset 2 — Grades
student_id	subject	score
1	Math	88
2	English	74
3	Science	92
6	History	65

kable(schools, caption = "Dataset 3 — Schools") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Dataset 3 — Schools
school_id	school_name	city
101	Green Valley HS	Kigali
102	Blue Ridge HS	Nairobi
103	Sunset Academy	Lagos
104	Eastwood Prep	Accra

Notice the intentional quirks in the data:

grades contains student_id = 6 which does not exist in students

schools contains school_id = 104 which has no matching students

students 4 & 5 have no grades recorded

These will produce different results depending on which join type you use.

5 Merging Two Datasets

5.1 Inner Join — Only Matching Rows

An inner join keeps only rows that exist in BOTH datasets.

inner_result <- inner_join(students, grades, by = "student_id")

kable(inner_result, caption = "Inner Join: students + grades") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Inner Join: students + grades
student_id	name	age	school_id	subject	score
1	Alice	20	101	Math	88
2	Bob	22	102	English	74
3	Carol	21	101	Science	92

What happened?

Alice (1), Bob (2), Carol (3) appear — they are in both tables.
David (4) and Eva (5) are dropped — they have no grades.
Student 6 in grades is dropped — they have no student record.

5.2 Left Join — Keep All of the Left Table

A left join keeps all rows from the left (first) dataset, filling with NA where there is no match in the right dataset.

left_result <- left_join(students, grades, by = "student_id")

kable(left_result, caption = "Left Join: students + grades") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which(is.na(left_result$score)), background = "#FFF3CD")  # highlight NAs

Left Join: students + grades
student_id	name	age	school_id	subject	score
1	Alice	20	101	Math	88
2	Bob	22	102	English	74
3	Carol	21	101	Science	92
4	David	23	103	NA	NA
5	Eva	20	102	NA	NA

What happened?

All 5 students are present.
David (4) and Eva (5) have NA for subject and score — they have no grade records.
Student 6 from grades is still dropped (not in the left table).

5.3 Right Join — Keep All of the Right Table

A right join keeps all rows from the right (second) dataset.

right_result <- right_join(students, grades, by = "student_id")

kable(right_result, caption = "Right Join: students + grades") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which(is.na(right_result$name)), background = "#F8D7DA")

Right Join: students + grades
student_id	name	age	school_id	subject	score
1	Alice	20	101	Math	88
2	Bob	22	102	English	74
3	Carol	21	101	Science	92
6	NA	NA	NA	History	65

What happened?

All 4 rows in grades are present.
Student 6 appears even though they have no student record — their name, age, and school_id are NA.
David (4) and Eva (5) are now dropped (they had no grade rows).

5.4 Full (Outer) Join — Keep Everything

A full join keeps all rows from both datasets, filling with NA where there is no match.

full_result <- full_join(students, grades, by = "student_id")

kable(full_result, caption = "Full Join: students + grades") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which(is.na(full_result$name) | is.na(full_result$score)),
           background = "#D1ECF1")

Full Join: students + grades
student_id	name	age	school_id	subject	score
1	Alice	20	101	Math	88
2	Bob	22	102	English	74
3	Carol	21	101	Science	92
4	David	23	103	NA	NA
5	Eva	20	102	NA	NA
6	NA	NA	NA	History	65

What happened?

All students AND all grades appear — no row is lost.
NA fills in wherever there is no corresponding match.

6 Merging Three Datasets

To combine all three datasets, you chain two joins together.

# Step 1: Merge students with grades
step1 <- left_join(students, grades, by = "student_id")

# Step 2: Merge that result with schools
final_data <- left_join(step1, schools, by = "school_id")

kable(final_data, caption = "Final Merged Dataset (students + grades + schools)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, background = "#2C3E50", color = "white")

Final Merged Dataset (students + grades + schools)
student_id	name	age	school_id	subject	score	school_name	city
1	Alice	20	101	Math	88	Green Valley HS	Kigali
2	Bob	22	102	English	74	Blue Ridge HS	Nairobi
3	Carol	21	101	Science	92	Green Valley HS	Kigali
4	David	23	103	NA	NA	Sunset Academy	Lagos
5	Eva	20	102	NA	NA	Blue Ridge HS	Nairobi

Reading the result:

Every student now has their grade AND their school information attached.
David and Eva have NA for grade columns because they had no grade records.
All students belong to a known school, so no NA in school columns.

6.1 Pipeline Style (Recommended)

You can write the same three-way merge more elegantly using the |> pipe:

final_pipeline <- students |>
  left_join(grades,   by = "student_id") |>
  left_join(schools,  by = "school_id")

identical(final_data, final_pipeline)  # TRUE — same result

## [1] TRUE

7 Visualising the Merged Data

plot_data <- final_data %>%
  filter(!is.na(score)) %>%
  group_by(school_name) %>%
  summarise(avg_score = mean(score), n = n(), .groups = "drop")

ggplot(plot_data, aes(x = reorder(school_name, avg_score), y = avg_score, fill = school_name)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = paste0(round(avg_score, 1), "\n(n=", n, ")")),
            hjust = -0.1, size = 3.8) +
  coord_flip(ylim = c(0, 110)) +
  scale_fill_manual(values = c("#3498DB", "#E74C3C", "#2ECC71")) +
  labs(title    = "Average Exam Score by School",
       subtitle = "Based on merged students + grades + schools dataset",
       x = NULL, y = "Average Score") +
  theme_minimal(base_size = 13) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(color = "grey40"))

Average Score by School (merged data)

ggplot(final_data %>% filter(!is.na(score)),
       aes(x = age, y = score, colour = school_name, label = name)) +
  geom_point(size = 4, alpha = 0.85) +
  geom_text(vjust = -0.8, size = 3.5) +
  scale_color_brewer(palette = "Set1", name = "School") +
  labs(title    = "Student Age vs Exam Score",
       subtitle = "Each point is one student (from merged dataset)",
       x = "Age", y = "Score") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "bottom")

Student Age vs Score

8 Common Errors and How to Fix Them

8.1 ERROR 1 — Wrong Key Column Name

# ✗ WRONG: The key column is called "student_id", not "id"
bad_join <- inner_join(students, grades, by = "id")

## Error in `inner_join()`:
## ! Join columns in `x` must be present in the data.
## ✖ Problem with `id`.

What went wrong?

The column "id" does not exist in either dataset. The correct column name is "student_id".

Fix:

# ✓ CORRECT: Use the actual column name
good_join <- inner_join(students, grades, by = "student_id")
nrow(good_join)  # 3 rows as expected

## [1] 3

8.2 ERROR 2 — Key Columns Have Different Names in Each Table

Sometimes the same variable has a different name in each table.

# Simulate a grades table where the key is named "stu_id" instead of "student_id"
grades_alt <- grades %>% rename(stu_id = student_id)

# ✗ WRONG: Trying to join without telling R about the name difference
bad_join2 <- inner_join(students, grades_alt, by = "student_id")

## Error in `inner_join()`:
## ! Join columns in `y` must be present in the data.
## ✖ Problem with `student_id`.

What went wrong?

grades_alt has no column called "student_id" — it was renamed to "stu_id".

Fix — use the c("left_col" = "right_col") syntax:

# ✓ CORRECT: Map the two different column names
good_join2 <- inner_join(students, grades_alt, by = c("student_id" = "stu_id"))
nrow(good_join2)  # 3 rows — correct

## [1] 3

8.3 ERROR 3 — Duplicate Rows After Merge (Cartesian Explosion)

If the key column has duplicates in both tables, you get a row for every combination — the classic many-to-many explosion.

# Simulate students table with a DUPLICATE student_id
students_dup <- data.frame(
  student_id = c(1, 1, 2),         # student 1 appears TWICE
  name       = c("Alice", "Alice_Copy", "Bob"),
  stringsAsFactors = FALSE
)

grades_dup <- data.frame(
  student_id = c(1, 1, 2),         # student 1 also appears TWICE
  score      = c(88, 91, 74),
  stringsAsFactors = FALSE
)

explosion <- inner_join(students_dup, grades_dup, by = "student_id",
                        relationship = "many-to-many")

cat("Rows in students_dup:", nrow(students_dup), "\n")

## Rows in students_dup: 3

cat("Rows in grades_dup:  ", nrow(grades_dup), "\n")

## Rows in grades_dup:   3

cat("Rows after join:     ", nrow(explosion), "\n")  # 2×2 + 1 = 5 rows!

## Rows after join:      5

kable(explosion, caption = "Cartesian Explosion — unexpected row multiplication") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Cartesian Explosion — unexpected row multiplication
student_id	name	score
1	Alice	88
1	Alice	91
1	Alice_Copy	88
1	Alice_Copy	91
2	Bob	74

What went wrong?

Student 1 appeared twice in both tables, so R created 2 × 2 = 4 rows for student 1. This is usually unintended.

Fix — deduplicate before joining:

# ✓ CORRECT: Remove duplicates from the key column first
students_clean <- students_dup %>% distinct(student_id, .keep_all = TRUE)
grades_clean   <- grades_dup   %>% distinct(student_id, .keep_all = TRUE)

clean_join <- inner_join(students_clean, grades_clean, by = "student_id")
cat("Rows after deduplication and join:", nrow(clean_join), "\n")  # 2 rows

## Rows after deduplication and join: 2

8.4 ERROR 4 — Merging by a Column That Contains NA

# Simulate a student with a MISSING school_id
students_na <- students
students_na$school_id[2] <- NA   # Bob has no school_id

result_na <- left_join(students_na, schools, by = "school_id")

kable(result_na, caption = "Join with NA in key — NA never matches") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which(is.na(result_na$school_name)), background = "#F8D7DA")

Join with NA in key — NA never matches
student_id	name	age	school_id	school_name	city
1	Alice	20	101	Green Valley HS	Kigali
2	Bob	22	NA	NA	NA
3	Carol	21	101	Green Valley HS	Kigali
4	David	23	103	Sunset Academy	Lagos
5	Eva	20	102	Blue Ridge HS	Nairobi

What happened?

NA never matches anything — Bob’s row gets NA for all school columns. This is usually correct behaviour, but be aware of it.

Fix — fill or drop NAs in the key before joining:

# Option A: Drop rows with NA key
students_dropped <- students_na %>% filter(!is.na(school_id))

# Option B: Fill NA with a placeholder value
students_filled <- students_na %>%
  mutate(school_id = ifelse(is.na(school_id), 999, school_id))

cat("Rows after dropping NA keys:", nrow(students_dropped), "\n")

## Rows after dropping NA keys: 4

9 Summary Comparison

Summary of Join Types in R
Join Type	Rows Kept From	NAs Introduced	Use When
inner_join	Both tables (matches only)	No	You only care about matched records
left_join	All of left table	Right-side columns when no match	Left table is the base, right is optional extra info
right_join	All of right table	Left-side columns when no match	Right table is the base, left is optional extra info
full_join	All rows from both tables	Both sides where no match	You want to keep all data, even unmatched

10 Quick Reference Cheat Sheet

11 Session Info

sessionInfo()

## R version 4.6.0 (2026-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_Rwanda.utf8  LC_CTYPE=English_Rwanda.utf8   
## [3] LC_MONETARY=English_Rwanda.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=English_Rwanda.utf8    
## 
## time zone: Africa/Kigali
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] kableExtra_1.4.0 knitr_1.51       dplyr_1.2.1      gridExtra_2.3   
## [5] ggplot2_4.0.3   
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.6.0     tidyselect_1.2.1  
##  [5] xml2_1.5.2         stringr_1.6.0      jquerylib_0.1.4    textshaping_1.0.5 
##  [9] systemfonts_1.3.2  scales_1.4.0       yaml_2.3.12        fastmap_1.2.0     
## [13] R6_2.6.1           labeling_0.4.3     generics_0.1.4     tibble_3.3.1      
## [17] svglite_2.2.2      bslib_0.11.0       pillar_1.11.1      RColorBrewer_1.1-3
## [21] rlang_1.2.0        cachem_1.1.0       stringi_1.8.7      xfun_0.57         
## [25] sass_0.4.10        S7_0.2.2           viridisLite_0.4.3  cli_3.6.6         
## [29] withr_3.0.2        magrittr_2.0.5     digest_0.6.39      rstudioapi_0.18.0 
## [33] lifecycle_1.0.5    vctrs_0.7.3        evaluate_1.0.5     glue_1.8.1        
## [37] farver_2.1.2       rmarkdown_2.31     tools_4.6.0        pkgconfig_2.0.3   
## [41] htmltools_0.5.9

Merging Datasets in R: A Complete Guide

Hirwa Fabrice

2026-05-24

1 Introduction

2 Types of Joins — Visual Overview

3 Load Required Packages

4 Create Three Example Datasets

5 Merging Two Datasets

5.1 Inner Join — Only Matching Rows

5.2 Left Join — Keep All of the Left Table

5.3 Right Join — Keep All of the Right Table

5.4 Full (Outer) Join — Keep Everything

6 Merging Three Datasets

6.1 Pipeline Style (Recommended)

7 Visualising the Merged Data

8 Common Errors and How to Fix Them

8.1 ERROR 1 — Wrong Key Column Name

8.2 ERROR 2 — Key Columns Have Different Names in Each Table

8.3 ERROR 3 — Duplicate Rows After Merge (Cartesian Explosion)

8.4 ERROR 4 — Merging by a Column That Contains NA

9 Summary Comparison

10 Quick Reference Cheat Sheet

11 Session Info