When working with real-world data, information is rarely stored in a single table. You might have students in one dataset, their grades in another, and their school information in a third. Merging (also called joining) lets us combine these into one unified dataset for analysis.
Key concept: Datasets are merged using a key column — a shared variable that exists in both tables (e.g.,
student_id,country_code).
Types of Database Joins
| Join Type | R Function (dplyr) |
Base R | Keeps rows from … |
|---|---|---|---|
| Inner join | inner_join() |
merge(all = FALSE) |
Both tables (matches only) |
| Left join | left_join() |
merge(all.x = TRUE) |
All of left + matches |
| Right join | right_join() |
merge(all.y = TRUE) |
All of right + matches |
| Full join | full_join() |
merge(all = TRUE) |
All rows from both |
# Install if you don't have them yet:
# install.packages(c("dplyr", "ggplot2", "knitr", "kableExtra"))
library(dplyr) # For join functions
library(ggplot2) # For visualisations
library(knitr) # For nice tables
library(kableExtra) # For styled tablesWe will simulate a small school database with students, grades, and schools.
# ── Dataset 1: Students ────────────────────────────────────────────────────────
students <- data.frame(
student_id = c(1, 2, 3, 4, 5),
name = c("Alice", "Bob", "Carol", "David", "Eva"),
age = c(20, 22, 21, 23, 20),
school_id = c(101, 102, 101, 103, 102),
stringsAsFactors = FALSE
)
# ── Dataset 2: Grades ──────────────────────────────────────────────────────────
grades <- data.frame(
student_id = c(1, 2, 3, 6), # NOTE: student 6 does NOT exist in students!
subject = c("Math", "English", "Science", "History"),
score = c(88, 74, 92, 65),
stringsAsFactors = FALSE
)
# ── Dataset 3: Schools ─────────────────────────────────────────────────────────
schools <- data.frame(
school_id = c(101, 102, 103, 104), # NOTE: school 104 has no students!
school_name = c("Green Valley HS", "Blue Ridge HS", "Sunset Academy", "Eastwood Prep"),
city = c("Kigali", "Nairobi", "Lagos", "Accra"),
stringsAsFactors = FALSE
)Let’s preview each dataset:
kable(students, caption = "Dataset 1 — Students") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| student_id | name | age | school_id |
|---|---|---|---|
| 1 | Alice | 20 | 101 |
| 2 | Bob | 22 | 102 |
| 3 | Carol | 21 | 101 |
| 4 | David | 23 | 103 |
| 5 | Eva | 20 | 102 |
kable(grades, caption = "Dataset 2 — Grades") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| student_id | subject | score |
|---|---|---|
| 1 | Math | 88 |
| 2 | English | 74 |
| 3 | Science | 92 |
| 6 | History | 65 |
kable(schools, caption = "Dataset 3 — Schools") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| school_id | school_name | city |
|---|---|---|
| 101 | Green Valley HS | Kigali |
| 102 | Blue Ridge HS | Nairobi |
| 103 | Sunset Academy | Lagos |
| 104 | Eastwood Prep | Accra |
Notice the intentional quirks in the data:
gradescontainsstudent_id = 6which does not exist instudentsschoolscontainsschool_id = 104which has no matching studentsstudents 4 & 5have no grades recordedThese will produce different results depending on which join type you use.
An inner join keeps only rows that exist in BOTH datasets.
inner_result <- inner_join(students, grades, by = "student_id")
kable(inner_result, caption = "Inner Join: students + grades") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| student_id | name | age | school_id | subject | score |
|---|---|---|---|---|---|
| 1 | Alice | 20 | 101 | Math | 88 |
| 2 | Bob | 22 | 102 | English | 74 |
| 3 | Carol | 21 | 101 | Science | 92 |
What happened?
A left join keeps all rows from the left (first)
dataset, filling with NA where there is no match
in the right dataset.
left_result <- left_join(students, grades, by = "student_id")
kable(left_result, caption = "Left Join: students + grades") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which(is.na(left_result$score)), background = "#FFF3CD") # highlight NAs| student_id | name | age | school_id | subject | score |
|---|---|---|---|---|---|
| 1 | Alice | 20 | 101 | Math | 88 |
| 2 | Bob | 22 | 102 | English | 74 |
| 3 | Carol | 21 | 101 | Science | 92 |
| 4 | David | 23 | 103 | NA | NA |
| 5 | Eva | 20 | 102 | NA | NA |
What happened?
NA for subject
and score — they have no grade records.A right join keeps all rows from the right (second) dataset.
right_result <- right_join(students, grades, by = "student_id")
kable(right_result, caption = "Right Join: students + grades") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which(is.na(right_result$name)), background = "#F8D7DA")| student_id | name | age | school_id | subject | score |
|---|---|---|---|---|---|
| 1 | Alice | 20 | 101 | Math | 88 |
| 2 | Bob | 22 | 102 | English | 74 |
| 3 | Carol | 21 | 101 | Science | 92 |
| 6 | NA | NA | NA | History | 65 |
What happened?
grades are present.name, age, and
school_id are NA.A full join keeps all rows from both datasets,
filling with NA where there is no match.
full_result <- full_join(students, grades, by = "student_id")
kable(full_result, caption = "Full Join: students + grades") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which(is.na(full_result$name) | is.na(full_result$score)),
background = "#D1ECF1")| student_id | name | age | school_id | subject | score |
|---|---|---|---|---|---|
| 1 | Alice | 20 | 101 | Math | 88 |
| 2 | Bob | 22 | 102 | English | 74 |
| 3 | Carol | 21 | 101 | Science | 92 |
| 4 | David | 23 | 103 | NA | NA |
| 5 | Eva | 20 | 102 | NA | NA |
| 6 | NA | NA | NA | History | 65 |
What happened?
NA fills in wherever there is no corresponding
match.To combine all three datasets, you chain two joins together.
# Step 1: Merge students with grades
step1 <- left_join(students, grades, by = "student_id")
# Step 2: Merge that result with schools
final_data <- left_join(step1, schools, by = "school_id")
kable(final_data, caption = "Final Merged Dataset (students + grades + schools)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE) %>%
column_spec(1, bold = TRUE) %>%
row_spec(0, background = "#2C3E50", color = "white")| student_id | name | age | school_id | subject | score | school_name | city |
|---|---|---|---|---|---|---|---|
| 1 | Alice | 20 | 101 | Math | 88 | Green Valley HS | Kigali |
| 2 | Bob | 22 | 102 | English | 74 | Blue Ridge HS | Nairobi |
| 3 | Carol | 21 | 101 | Science | 92 | Green Valley HS | Kigali |
| 4 | David | 23 | 103 | NA | NA | Sunset Academy | Lagos |
| 5 | Eva | 20 | 102 | NA | NA | Blue Ridge HS | Nairobi |
Reading the result:
NA for grade columns because they
had no grade records.NA in
school columns.plot_data <- final_data %>%
filter(!is.na(score)) %>%
group_by(school_name) %>%
summarise(avg_score = mean(score), n = n(), .groups = "drop")
ggplot(plot_data, aes(x = reorder(school_name, avg_score), y = avg_score, fill = school_name)) +
geom_col(width = 0.6, show.legend = FALSE) +
geom_text(aes(label = paste0(round(avg_score, 1), "\n(n=", n, ")")),
hjust = -0.1, size = 3.8) +
coord_flip(ylim = c(0, 110)) +
scale_fill_manual(values = c("#3498DB", "#E74C3C", "#2ECC71")) +
labs(title = "Average Exam Score by School",
subtitle = "Based on merged students + grades + schools dataset",
x = NULL, y = "Average Score") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"),
plot.subtitle = element_text(color = "grey40"))Average Score by School (merged data)
ggplot(final_data %>% filter(!is.na(score)),
aes(x = age, y = score, colour = school_name, label = name)) +
geom_point(size = 4, alpha = 0.85) +
geom_text(vjust = -0.8, size = 3.5) +
scale_color_brewer(palette = "Set1", name = "School") +
labs(title = "Student Age vs Exam Score",
subtitle = "Each point is one student (from merged dataset)",
x = "Age", y = "Score") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"),
legend.position = "bottom")Student Age vs Score
# ✗ WRONG: The key column is called "student_id", not "id"
bad_join <- inner_join(students, grades, by = "id")## Error in `inner_join()`:
## ! Join columns in `x` must be present in the data.
## ✖ Problem with `id`.
What went wrong?
The column "id" does not exist in either dataset. The
correct column name is "student_id".
Fix:
# ✓ CORRECT: Use the actual column name
good_join <- inner_join(students, grades, by = "student_id")
nrow(good_join) # 3 rows as expected## [1] 3
Sometimes the same variable has a different name in each table.
# Simulate a grades table where the key is named "stu_id" instead of "student_id"
grades_alt <- grades %>% rename(stu_id = student_id)
# ✗ WRONG: Trying to join without telling R about the name difference
bad_join2 <- inner_join(students, grades_alt, by = "student_id")## Error in `inner_join()`:
## ! Join columns in `y` must be present in the data.
## ✖ Problem with `student_id`.
What went wrong?
grades_alt has no column called
"student_id" — it was renamed to "stu_id".
Fix — use the c("left_col" = "right_col")
syntax:
# ✓ CORRECT: Map the two different column names
good_join2 <- inner_join(students, grades_alt, by = c("student_id" = "stu_id"))
nrow(good_join2) # 3 rows — correct## [1] 3
If the key column has duplicates in both tables, you get a row for every combination — the classic many-to-many explosion.
# Simulate students table with a DUPLICATE student_id
students_dup <- data.frame(
student_id = c(1, 1, 2), # student 1 appears TWICE
name = c("Alice", "Alice_Copy", "Bob"),
stringsAsFactors = FALSE
)
grades_dup <- data.frame(
student_id = c(1, 1, 2), # student 1 also appears TWICE
score = c(88, 91, 74),
stringsAsFactors = FALSE
)
explosion <- inner_join(students_dup, grades_dup, by = "student_id",
relationship = "many-to-many")
cat("Rows in students_dup:", nrow(students_dup), "\n")## Rows in students_dup: 3
## Rows in grades_dup: 3
## Rows after join: 5
kable(explosion, caption = "Cartesian Explosion — unexpected row multiplication") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| student_id | name | score |
|---|---|---|
| 1 | Alice | 88 |
| 1 | Alice | 91 |
| 1 | Alice_Copy | 88 |
| 1 | Alice_Copy | 91 |
| 2 | Bob | 74 |
What went wrong?
Student 1 appeared twice in both tables, so R created
2 × 2 = 4 rows for student 1. This is usually
unintended.
Fix — deduplicate before joining:
# ✓ CORRECT: Remove duplicates from the key column first
students_clean <- students_dup %>% distinct(student_id, .keep_all = TRUE)
grades_clean <- grades_dup %>% distinct(student_id, .keep_all = TRUE)
clean_join <- inner_join(students_clean, grades_clean, by = "student_id")
cat("Rows after deduplication and join:", nrow(clean_join), "\n") # 2 rows## Rows after deduplication and join: 2
# Simulate a student with a MISSING school_id
students_na <- students
students_na$school_id[2] <- NA # Bob has no school_id
result_na <- left_join(students_na, schools, by = "school_id")
kable(result_na, caption = "Join with NA in key — NA never matches") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which(is.na(result_na$school_name)), background = "#F8D7DA")| student_id | name | age | school_id | school_name | city |
|---|---|---|---|---|---|
| 1 | Alice | 20 | 101 | Green Valley HS | Kigali |
| 2 | Bob | 22 | NA | NA | NA |
| 3 | Carol | 21 | 101 | Green Valley HS | Kigali |
| 4 | David | 23 | 103 | Sunset Academy | Lagos |
| 5 | Eva | 20 | 102 | Blue Ridge HS | Nairobi |
What happened?
NA never matches anything — Bob’s row gets
NA for all school columns. This is usually correct
behaviour, but be aware of it.
Fix — fill or drop NAs in the key before joining:
# Option A: Drop rows with NA key
students_dropped <- students_na %>% filter(!is.na(school_id))
# Option B: Fill NA with a placeholder value
students_filled <- students_na %>%
mutate(school_id = ifelse(is.na(school_id), 999, school_id))
cat("Rows after dropping NA keys:", nrow(students_dropped), "\n")## Rows after dropping NA keys: 4
| Join Type | Rows Kept From | NAs Introduced | Use When |
|---|---|---|---|
| inner_join | Both tables (matches only) | No | You only care about matched records |
| left_join | All of left table | Right-side columns when no match | Left table is the base, right is optional extra info |
| right_join | All of right table | Left-side columns when no match | Right table is the base, left is optional extra info |
| full_join | All rows from both tables | Both sides where no match | You want to keep all data, even unmatched |
## R version 4.6.0 (2026-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_Rwanda.utf8 LC_CTYPE=English_Rwanda.utf8
## [3] LC_MONETARY=English_Rwanda.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Rwanda.utf8
##
## time zone: Africa/Kigali
## tzcode source: internal
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] kableExtra_1.4.0 knitr_1.51 dplyr_1.2.1 gridExtra_2.3
## [5] ggplot2_4.0.3
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.6.0 tidyselect_1.2.1
## [5] xml2_1.5.2 stringr_1.6.0 jquerylib_0.1.4 textshaping_1.0.5
## [9] systemfonts_1.3.2 scales_1.4.0 yaml_2.3.12 fastmap_1.2.0
## [13] R6_2.6.1 labeling_0.4.3 generics_0.1.4 tibble_3.3.1
## [17] svglite_2.2.2 bslib_0.11.0 pillar_1.11.1 RColorBrewer_1.1-3
## [21] rlang_1.2.0 cachem_1.1.0 stringi_1.8.7 xfun_0.57
## [25] sass_0.4.10 S7_0.2.2 viridisLite_0.4.3 cli_3.6.6
## [29] withr_3.0.2 magrittr_2.0.5 digest_0.6.39 rstudioapi_0.18.0
## [33] lifecycle_1.0.5 vctrs_0.7.3 evaluate_1.0.5 glue_1.8.1
## [37] farver_2.1.2 rmarkdown_2.31 tools_4.6.0 pkgconfig_2.0.3
## [41] htmltools_0.5.9