1 Introduction

When working with real-world data, information is rarely stored in a single table. You might have students in one dataset, their grades in another, and their school information in a third. Merging (also called joining) lets us combine these into one unified dataset for analysis.

Key concept: Datasets are merged using a key column — a shared variable that exists in both tables (e.g., student_id, country_code).


2 Types of Joins — Visual Overview

Types of Database Joins

Types of Database Joins

Join Type R Function (dplyr) Base R Keeps rows from …
Inner join inner_join() merge(all = FALSE) Both tables (matches only)
Left join left_join() merge(all.x = TRUE) All of left + matches
Right join right_join() merge(all.y = TRUE) All of right + matches
Full join full_join() merge(all = TRUE) All rows from both

3 Load Required Packages

# Install if you don't have them yet:
# install.packages(c("dplyr", "ggplot2", "knitr", "kableExtra"))

library(dplyr)        # For join functions
library(ggplot2)      # For visualisations
library(knitr)        # For nice tables
library(kableExtra)   # For styled tables

4 Create Three Example Datasets

We will simulate a small school database with students, grades, and schools.

# ── Dataset 1: Students ────────────────────────────────────────────────────────
students <- data.frame(
  student_id = c(1, 2, 3, 4, 5),
  name       = c("Alice", "Bob", "Carol", "David", "Eva"),
  age        = c(20, 22, 21, 23, 20),
  school_id  = c(101, 102, 101, 103, 102),
  stringsAsFactors = FALSE
)

# ── Dataset 2: Grades ──────────────────────────────────────────────────────────
grades <- data.frame(
  student_id = c(1, 2, 3, 6),          # NOTE: student 6 does NOT exist in students!
  subject    = c("Math", "English", "Science", "History"),
  score      = c(88, 74, 92, 65),
  stringsAsFactors = FALSE
)

# ── Dataset 3: Schools ─────────────────────────────────────────────────────────
schools <- data.frame(
  school_id   = c(101, 102, 103, 104),  # NOTE: school 104 has no students!
  school_name = c("Green Valley HS", "Blue Ridge HS", "Sunset Academy", "Eastwood Prep"),
  city        = c("Kigali", "Nairobi", "Lagos", "Accra"),
  stringsAsFactors = FALSE
)

Let’s preview each dataset:

kable(students, caption = "Dataset 1 — Students") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Dataset 1 — Students
student_id name age school_id
1 Alice 20 101
2 Bob 22 102
3 Carol 21 101
4 David 23 103
5 Eva 20 102
kable(grades, caption = "Dataset 2 — Grades") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Dataset 2 — Grades
student_id subject score
1 Math 88
2 English 74
3 Science 92
6 History 65
kable(schools, caption = "Dataset 3 — Schools") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Dataset 3 — Schools
school_id school_name city
101 Green Valley HS Kigali
102 Blue Ridge HS Nairobi
103 Sunset Academy Lagos
104 Eastwood Prep Accra

Notice the intentional quirks in the data:

  • grades contains student_id = 6 which does not exist in students
  • schools contains school_id = 104 which has no matching students
  • students 4 & 5 have no grades recorded

These will produce different results depending on which join type you use.


5 Merging Two Datasets

5.1 Inner Join — Only Matching Rows

An inner join keeps only rows that exist in BOTH datasets.

inner_result <- inner_join(students, grades, by = "student_id")

kable(inner_result, caption = "Inner Join: students + grades") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Inner Join: students + grades
student_id name age school_id subject score
1 Alice 20 101 Math 88
2 Bob 22 102 English 74
3 Carol 21 101 Science 92

What happened?

  • Alice (1), Bob (2), Carol (3) appear — they are in both tables.
  • David (4) and Eva (5) are dropped — they have no grades.
  • Student 6 in grades is dropped — they have no student record.

5.2 Left Join — Keep All of the Left Table

A left join keeps all rows from the left (first) dataset, filling with NA where there is no match in the right dataset.

left_result <- left_join(students, grades, by = "student_id")

kable(left_result, caption = "Left Join: students + grades") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which(is.na(left_result$score)), background = "#FFF3CD")  # highlight NAs
Left Join: students + grades
student_id name age school_id subject score
1 Alice 20 101 Math 88
2 Bob 22 102 English 74
3 Carol 21 101 Science 92
4 David 23 103 NA NA
5 Eva 20 102 NA NA

What happened?

  • All 5 students are present.
  • David (4) and Eva (5) have NA for subject and score — they have no grade records.
  • Student 6 from grades is still dropped (not in the left table).

5.3 Right Join — Keep All of the Right Table

A right join keeps all rows from the right (second) dataset.

right_result <- right_join(students, grades, by = "student_id")

kable(right_result, caption = "Right Join: students + grades") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which(is.na(right_result$name)), background = "#F8D7DA")
Right Join: students + grades
student_id name age school_id subject score
1 Alice 20 101 Math 88
2 Bob 22 102 English 74
3 Carol 21 101 Science 92
6 NA NA NA History 65

What happened?

  • All 4 rows in grades are present.
  • Student 6 appears even though they have no student record — their name, age, and school_id are NA.
  • David (4) and Eva (5) are now dropped (they had no grade rows).

5.4 Full (Outer) Join — Keep Everything

A full join keeps all rows from both datasets, filling with NA where there is no match.

full_result <- full_join(students, grades, by = "student_id")

kable(full_result, caption = "Full Join: students + grades") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which(is.na(full_result$name) | is.na(full_result$score)),
           background = "#D1ECF1")
Full Join: students + grades
student_id name age school_id subject score
1 Alice 20 101 Math 88
2 Bob 22 102 English 74
3 Carol 21 101 Science 92
4 David 23 103 NA NA
5 Eva 20 102 NA NA
6 NA NA NA History 65

What happened?

  • All students AND all grades appear — no row is lost.
  • NA fills in wherever there is no corresponding match.

6 Merging Three Datasets

To combine all three datasets, you chain two joins together.

# Step 1: Merge students with grades
step1 <- left_join(students, grades, by = "student_id")

# Step 2: Merge that result with schools
final_data <- left_join(step1, schools, by = "school_id")

kable(final_data, caption = "Final Merged Dataset (students + grades + schools)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, background = "#2C3E50", color = "white")
Final Merged Dataset (students + grades + schools)
student_id name age school_id subject score school_name city
1 Alice 20 101 Math 88 Green Valley HS Kigali
2 Bob 22 102 English 74 Blue Ridge HS Nairobi
3 Carol 21 101 Science 92 Green Valley HS Kigali
4 David 23 103 NA NA Sunset Academy Lagos
5 Eva 20 102 NA NA Blue Ridge HS Nairobi

Reading the result:

  • Every student now has their grade AND their school information attached.
  • David and Eva have NA for grade columns because they had no grade records.
  • All students belong to a known school, so no NA in school columns.

7 Visualising the Merged Data

plot_data <- final_data %>%
  filter(!is.na(score)) %>%
  group_by(school_name) %>%
  summarise(avg_score = mean(score), n = n(), .groups = "drop")

ggplot(plot_data, aes(x = reorder(school_name, avg_score), y = avg_score, fill = school_name)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = paste0(round(avg_score, 1), "\n(n=", n, ")")),
            hjust = -0.1, size = 3.8) +
  coord_flip(ylim = c(0, 110)) +
  scale_fill_manual(values = c("#3498DB", "#E74C3C", "#2ECC71")) +
  labs(title    = "Average Exam Score by School",
       subtitle = "Based on merged students + grades + schools dataset",
       x = NULL, y = "Average Score") +
  theme_minimal(base_size = 13) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(color = "grey40"))
Average Score by School (merged data)

Average Score by School (merged data)

ggplot(final_data %>% filter(!is.na(score)),
       aes(x = age, y = score, colour = school_name, label = name)) +
  geom_point(size = 4, alpha = 0.85) +
  geom_text(vjust = -0.8, size = 3.5) +
  scale_color_brewer(palette = "Set1", name = "School") +
  labs(title    = "Student Age vs Exam Score",
       subtitle = "Each point is one student (from merged dataset)",
       x = "Age", y = "Score") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "bottom")
Student Age vs Score

Student Age vs Score


8 Common Errors and How to Fix Them

8.1 ERROR 1 — Wrong Key Column Name

# ✗ WRONG: The key column is called "student_id", not "id"
bad_join <- inner_join(students, grades, by = "id")
## Error in `inner_join()`:
## ! Join columns in `x` must be present in the data.
## ✖ Problem with `id`.

What went wrong?

The column "id" does not exist in either dataset. The correct column name is "student_id".

Fix:

# ✓ CORRECT: Use the actual column name
good_join <- inner_join(students, grades, by = "student_id")
nrow(good_join)  # 3 rows as expected
## [1] 3

8.2 ERROR 2 — Key Columns Have Different Names in Each Table

Sometimes the same variable has a different name in each table.

# Simulate a grades table where the key is named "stu_id" instead of "student_id"
grades_alt <- grades %>% rename(stu_id = student_id)

# ✗ WRONG: Trying to join without telling R about the name difference
bad_join2 <- inner_join(students, grades_alt, by = "student_id")
## Error in `inner_join()`:
## ! Join columns in `y` must be present in the data.
## ✖ Problem with `student_id`.

What went wrong?

grades_alt has no column called "student_id" — it was renamed to "stu_id".

Fix — use the c("left_col" = "right_col") syntax:

# ✓ CORRECT: Map the two different column names
good_join2 <- inner_join(students, grades_alt, by = c("student_id" = "stu_id"))
nrow(good_join2)  # 3 rows — correct
## [1] 3

8.3 ERROR 3 — Duplicate Rows After Merge (Cartesian Explosion)

If the key column has duplicates in both tables, you get a row for every combination — the classic many-to-many explosion.

# Simulate students table with a DUPLICATE student_id
students_dup <- data.frame(
  student_id = c(1, 1, 2),         # student 1 appears TWICE
  name       = c("Alice", "Alice_Copy", "Bob"),
  stringsAsFactors = FALSE
)

grades_dup <- data.frame(
  student_id = c(1, 1, 2),         # student 1 also appears TWICE
  score      = c(88, 91, 74),
  stringsAsFactors = FALSE
)

explosion <- inner_join(students_dup, grades_dup, by = "student_id",
                        relationship = "many-to-many")

cat("Rows in students_dup:", nrow(students_dup), "\n")
## Rows in students_dup: 3
cat("Rows in grades_dup:  ", nrow(grades_dup), "\n")
## Rows in grades_dup:   3
cat("Rows after join:     ", nrow(explosion), "\n")  # 2×2 + 1 = 5 rows!
## Rows after join:      5
kable(explosion, caption = "Cartesian Explosion — unexpected row multiplication") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Cartesian Explosion — unexpected row multiplication
student_id name score
1 Alice 88
1 Alice 91
1 Alice_Copy 88
1 Alice_Copy 91
2 Bob 74

What went wrong?

Student 1 appeared twice in both tables, so R created 2 × 2 = 4 rows for student 1. This is usually unintended.

Fix — deduplicate before joining:

# ✓ CORRECT: Remove duplicates from the key column first
students_clean <- students_dup %>% distinct(student_id, .keep_all = TRUE)
grades_clean   <- grades_dup   %>% distinct(student_id, .keep_all = TRUE)

clean_join <- inner_join(students_clean, grades_clean, by = "student_id")
cat("Rows after deduplication and join:", nrow(clean_join), "\n")  # 2 rows
## Rows after deduplication and join: 2

8.4 ERROR 4 — Merging by a Column That Contains NA

# Simulate a student with a MISSING school_id
students_na <- students
students_na$school_id[2] <- NA   # Bob has no school_id

result_na <- left_join(students_na, schools, by = "school_id")

kable(result_na, caption = "Join with NA in key — NA never matches") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which(is.na(result_na$school_name)), background = "#F8D7DA")
Join with NA in key — NA never matches
student_id name age school_id school_name city
1 Alice 20 101 Green Valley HS Kigali
2 Bob 22 NA NA NA
3 Carol 21 101 Green Valley HS Kigali
4 David 23 103 Sunset Academy Lagos
5 Eva 20 102 Blue Ridge HS Nairobi

What happened?

NA never matches anything — Bob’s row gets NA for all school columns. This is usually correct behaviour, but be aware of it.

Fix — fill or drop NAs in the key before joining:

# Option A: Drop rows with NA key
students_dropped <- students_na %>% filter(!is.na(school_id))

# Option B: Fill NA with a placeholder value
students_filled <- students_na %>%
  mutate(school_id = ifelse(is.na(school_id), 999, school_id))

cat("Rows after dropping NA keys:", nrow(students_dropped), "\n")
## Rows after dropping NA keys: 4

9 Summary Comparison

Summary of Join Types in R
Join Type Rows Kept From NAs Introduced Use When
inner_join Both tables (matches only) No You only care about matched records
left_join All of left table Right-side columns when no match Left table is the base, right is optional extra info
right_join All of right table Left-side columns when no match Right table is the base, left is optional extra info
full_join All rows from both tables Both sides where no match You want to keep all data, even unmatched

10 Quick Reference Cheat Sheet


11 Session Info

sessionInfo()
## R version 4.6.0 (2026-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_Rwanda.utf8  LC_CTYPE=English_Rwanda.utf8   
## [3] LC_MONETARY=English_Rwanda.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=English_Rwanda.utf8    
## 
## time zone: Africa/Kigali
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] kableExtra_1.4.0 knitr_1.51       dplyr_1.2.1      gridExtra_2.3   
## [5] ggplot2_4.0.3   
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.6.0     tidyselect_1.2.1  
##  [5] xml2_1.5.2         stringr_1.6.0      jquerylib_0.1.4    textshaping_1.0.5 
##  [9] systemfonts_1.3.2  scales_1.4.0       yaml_2.3.12        fastmap_1.2.0     
## [13] R6_2.6.1           labeling_0.4.3     generics_0.1.4     tibble_3.3.1      
## [17] svglite_2.2.2      bslib_0.11.0       pillar_1.11.1      RColorBrewer_1.1-3
## [21] rlang_1.2.0        cachem_1.1.0       stringi_1.8.7      xfun_0.57         
## [25] sass_0.4.10        S7_0.2.2           viridisLite_0.4.3  cli_3.6.6         
## [29] withr_3.0.2        magrittr_2.0.5     digest_0.6.39      rstudioapi_0.18.0 
## [33] lifecycle_1.0.5    vctrs_0.7.3        evaluate_1.0.5     glue_1.8.1        
## [37] farver_2.1.2       rmarkdown_2.31     tools_4.6.0        pkgconfig_2.0.3   
## [41] htmltools_0.5.9