The Himalayan Database is one of the most comprehensive mountaineering archives in the world. It documents every recorded expedition to peaks in the Nepal Himalaya, continuing the work of journalist Elizabeth Hawley, who spent decades cataloguing climbing history in the region.
For this project, I am using the TidyTuesday (Week 3, 2025) version of this dataset, which includes:
exped_tidy: 882 expeditions from 2020 to 2024, with details on season, nationality, success, deaths, oxygen use, and days to summit.
peaks_tidy: information on all recorded Himalayan peaks, including height, region, and whether the peak is open for climbing.
Key questions I want to explore:
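# Load data.table, which provides fread() and the [i, j, by] syntax used below
library(data.table)
# Load both datasets directly from GitHub
exped_tidy <- fread(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-01-21/exped_tidy.csv"
)
peaks_tidy <- fread(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-01-21/peaks_tidy.csv"
)
# Check the dimensions (rows, columns) of each table
dim(exped_tidy)
## [1] 882 69
dim(peaks_tidy)
## [1] 480 29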
head(exped_tidy[, .(EXPID, PEAKID, YEAR, SEASON_FACTOR, HOST_FACTOR,
                    NATION, SUCCESS1, TOTMEMBERS, MDEATHS, O2USED)])
##        EXPID PEAKID  YEAR SEASON_FACTOR HOST_FACTOR NATION SUCCESS1 TOTMEMBERS
##       <char> <char> <int>        <char>      <char> <char>   <lgcl>      <int>
## 1: EVER20101   EVER  2020        Spring       China  China     TRUE          0
## 2: EVER20102   EVER  2020        Spring       China  China     TRUE         12
## 3: EVER20103   EVER  2020        Spring       China  China     TRUE         20
## 4: AMAD20301   AMAD  2020        Autumn       Nepal  Nepal     TRUE         14
## 5: AMAD20302   AMAD  2020        Autumn       Nepal    USA     TRUE          6
## 6: AMAD20303   AMAD  2020        Autumn       Nepal     UK     TRUE          2
##    MDEATHS O2USED
##      <int> <lgcl>
## 1:       0   TRUE
## 2:       0   TRUE
## 3:       0   TRUE
## 4:       0  FALSE
## 5:       0  FALSE
## 6:       0  FALSE
head(peaks_tidy[, .(PEAKID, PKNAME, HEIGHTM, REGION_FACTOR, OPEN, PYEAR)])
##    PEAKID        PKNAME HEIGHTM           REGION_FACTOR   OPEN PYEAR
##    <char>        <char>   <int>                  <char> <lgcl> <int>
## 1:   AMAD    Ama Dablam    6814 Khumbu-Rolwaling-Makalu   TRUE  1961
## 2:   AMPG Amphu Gyabjen    5630 Khumbu-Rolwaling-Makalu   TRUE  1953
## 3:   ANN1   Annapurna I    8091  Annapurna-Damodar-Peri   TRUE  1950
## 4:   ANN2  Annapurna II    7937  Annapurna-Damodar-Peri   TRUE  1960
## 5:   ANN3 Annapurna III    7555  Annapurna-Damodar-Peri   TRUE  1961
## 6:   ANN4  Annapurna IV    7525  Annapurna-Damodar-Peri   TRUE  1955
The expeditions dataset has 882 rows and 69 columns. The peaks dataset has 480 rows and 29 columns.
To answer questions that involve both expedition details and peak
characteristics (like height and region), I merge the two tables on
PEAKID.
# Merge expeditions with peak info on PEAKID
merged <- merge(
  exped_tidy,
  peaks_tidy[, .(PEAKID, PKNAME, HEIGHTM, REGION_FACTOR, OPEN, HIMAL_FACTOR)],
  by = "PEAKID",
  all.x = TRUE # keep all expeditions even if peak info is missing
)
cat("Merged dataset:", nrow(merged), "rows and", ncol(merged), "columns\n")
## Merged dataset: 882 rows and 74 columns
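Because `all.x = TRUE` keeps expeditions whose `PEAKID` has no match, it can be worth confirming that the join key is unique in the peaks table and counting any unmatched rows. The following is a minimal sketch on toy data; `exped_toy`, `peaks_toy` and all of their values are made up for illustration and are not taken from the real tables.

```r
library(data.table)

# Toy stand-ins for exped_tidy and peaks_tidy (hypothetical values)
exped_toy <- data.table(
  EXPID  = c("E1", "E2", "E3"),
  PEAKID = c("AAAA", "BBBB", "ZZZZ")   # "ZZZZ" has no entry in peaks_toy
)
peaks_toy <- data.table(
  PEAKID  = c("AAAA", "BBBB"),
  HEIGHTM = c(6800L, 8100L)
)

# The join key should be unique in the lookup table; otherwise a left join
# silently duplicates expedition rows
stopifnot(!anyDuplicated(peaks_toy$PEAKID))

merged_toy <- merge(exped_toy, peaks_toy, by = "PEAKID", all.x = TRUE)

# A left join on a unique key keeps exactly one row per expedition
stopifnot(nrow(merged_toy) == nrow(exped_toy))

# Expeditions without peak information show up as NA in the joined columns
n_unmatched <- merged_toy[is.na(HEIGHTM), .N]
print(n_unmatched)
```

Running the same two checks on `merged` would show whether any of the 882 expeditions lack peak information before the later summaries filter on `HEIGHTM` or `HIMAL_FACTOR`.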
# --- Filtering ---
# Keep only expeditions with known season and at least one member
exped_clean <- exped_tidy[!is.na(SEASON_FACTOR) & TOTMEMBERS > 0]
# --- Success rate by season ---
season_summary <- exped_clean[, .(
  total = .N,
  successful = sum(SUCCESS1 == TRUE, na.rm = TRUE),
  avg_members = round(mean(TOTMEMBERS, na.rm = TRUE), 1),
  total_deaths = sum(MDEATHS, na.rm = TRUE)
), by = SEASON_FACTOR]
season_summary[, success_rate := round(successful / total * 100, 1)]
season_summary <- season_summary[order(-total)]
print(season_summary)
##    SEASON_FACTOR total successful avg_members total_deaths success_rate
##           <char> <int>      <int>       <num>        <int>        <num>
## 1:        Spring   452        342         8.4           28         75.7
## 2:        Autumn   390        259         8.5            8         66.4
## 3:        Winter    21          7         4.5            0         33.3
## 4:        Summer     5          3         2.2            0         60.0
# --- Top nationalities ---
nation_counts <- exped_clean[!is.na(NATION), .N, by = NATION]
nation_counts <- nation_counts[order(-N)][1:10]
print(nation_counts)
##      NATION     N
##      <char> <int>
##  1:   Nepal   117
##  2:     USA   117
##  3:      UK    91
##  4:   India    51
##  5:  France    51
##  6: Germany    32
##  7:   China    30
##  8: Austria    29
##  9:   Spain    26
## 10:  Russia    26
# --- Oxygen use vs success ---
o2_summary <- exped_clean[!is.na(O2USED), .(
  total = .N,
  successful = sum(SUCCESS1 == TRUE, na.rm = TRUE)
), by = O2USED]
o2_summary[, success_rate := round(successful / total * 100, 1)]
o2_summary[, O2Label := ifelse(O2USED, "Used Oxygen", "No Oxygen")]
print(o2_summary)
##    O2USED total successful success_rate     O2Label
##    <lgcl> <int>      <int>        <num>      <char>
## 1:   TRUE   396        352         88.9 Used Oxygen
## 2:  FALSE   472        259         54.9   No Oxygen
# --- Deaths by mountain range (from merged dataset) ---
danger_summary <- merged[!is.na(HIMAL_FACTOR), .(
  total_deaths = sum(MDEATHS, na.rm = TRUE),
  total_expeditions = .N,
  death_rate = round(sum(MDEATHS, na.rm = TRUE) / .N * 100, 2) # deaths per 100 expeditions
), by = HIMAL_FACTOR]
danger_summary <- danger_summary[order(-total_deaths)][1:10]
print(danger_summary)
##                  HIMAL_FACTOR total_deaths total_expeditions death_rate
##                        <char>        <int>             <int>      <num>
##  1:                    Khumbu           28               497       5.63
##  2:   Kangchenjunga/Simhalila            2                29       6.90
##  3:                Dhaulagiri            2                48       4.17
##  4:           Manaslu/Mansiri            2                99       2.02
##  5:                 Annapurna            1                34       2.94
##  6:                    Makalu            1                29       3.45
##  7:                  Langtang            0                 7       0.00
##  8:                   Damodar            0                11       0.00
##  9:                 Rolwaling            0                27       0.00
## 10: Nalakankar/Chandi/Changla            0                 6       0.00
# --- Average height by region ---
region_heights <- peaks_tidy[!is.na(REGION_FACTOR) & !is.na(HEIGHTM),
                             .(avg_height = mean(HEIGHTM),
                               n_peaks = .N),
                             by = REGION_FACTOR]
region_heights <- region_heights[order(-avg_height)]
print(region_heights)
##              REGION_FACTOR avg_height n_peaks
##                     <char>      <num>   <int>
## 1:     Kangchenjunga-Janak   6869.771      70
## 2:  Annapurna-Damodar-Peri   6721.757      74
## 3:        Dhaulagiri-Mukut   6704.025      40
## 4:          Manaslu-Ganesh   6694.829      41
## 5: Khumbu-Rolwaling-Makalu   6682.864     132
## 6:          Langtang-Jugal   6495.889      36
## 7:      Kanjiroba-Far West   6374.437      87
I apply a consistent theme and ColorBrewer palette across all plots.
# Plotting packages: ggplot2 for the charts, RColorBrewer for brewer.pal(),
# scales for label_percent()
library(ggplot2)
library(RColorBrewer)
library(scales)
# Custom theme applied to all plots
my_theme <- theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 15),
    plot.subtitle = element_text(color = "grey40", size = 11),
    plot.caption = element_text(color = "grey60", size = 9, hjust = 1),
    axis.title = element_text(face = "bold"),
    legend.position = "bottom",
    panel.grid.minor = element_blank()
  )
season_colors <- brewer.pal(4, "Set2") # ColorBrewer palette
blue_palette <- brewer.pal(9, "Blues")
red_palette <- brewer.pal(9, "Reds")
ggplot(season_summary, aes(x = reorder(SEASON_FACTOR, -total),
                           y = total,
                           fill = SEASON_FACTOR)) +
geom_col(width = 0.6) +
geom_text(aes(label = total), vjust = -0.5, fontface = "bold", size = 4.5) +
scale_fill_manual(values = season_colors) +
scale_y_continuous(expand = expansion(mult = c(0, 0.12))) +
labs(
title = "Himalayan Expedition Counts by Season (2020-2024)",
subtitle = "Spring leads: most climbers aim for the pre-monsoon weather window",
x = "Season",
y = "Number of Expeditions",
fill = "Season",
caption = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
) +
my_theme +
theme(legend.position = "none")
Insight: Spring is the most popular season, with 452 of the 868 expeditions in the filtered data (about 52%), and Autumn follows at 390 (about 45%). This makes sense: spring offers the most stable weather window before the monsoon arrives. Winter and summer expeditions are rare, with 21 and 5 respectively.
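To put the bar heights in proportion, each season's share of all expeditions can be computed directly from the counts. This is a small sketch that copies the `total` column from the `season_summary` output; the variable names `season_totals` and `share_pct` are introduced here for illustration only.

```r
# Expedition counts per season, copied from the season_summary table
season_totals <- c(Spring = 452, Autumn = 390, Winter = 21, Summer = 5)

# prop.table() divides each count by the overall total (868)
share_pct <- round(100 * prop.table(season_totals), 1)
print(share_pct)
```

Spring and Autumn together account for roughly 97% of expeditions, which is why the Winter and Summer bars are barely visible at this scale.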
ggplot(season_summary, aes(x = reorder(SEASON_FACTOR, -success_rate),
y = success_rate,
fill = success_rate)) +
geom_col(width = 0.6) +
geom_text(aes(label = paste0(success_rate, "%")), vjust = -0.5,
fontface = "bold", size = 4.5) +
scale_fill_gradient(low = blue_palette[3], high = blue_palette[8]) +
scale_y_continuous(limits = c(0, 100),
labels = label_percent(scale = 1),
expand = expansion(mult = c(0, 0.1))) +
labs(
title = "Summit Success Rate by Season",
subtitle = "Autumn and Spring have the highest chance of reaching the top",
x = "Season",
y = "Success Rate (%)",
fill = "Success Rate",
caption = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
) +
my_theme
Insight: Spring has the highest success rate (75.7%), followed by Autumn (66.4%). Winter expeditions succeed only a third of the time (7 of 21), which fits how harsh Himalayan winters are. The Summer figure (60%) rests on just 5 expeditions, so it should be read with caution.
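Since Winter and Summer have very few expeditions, their success rates are much less certain than those for Spring and Autumn. One way to show this, sketched below, is a Wilson score interval for each proportion, which base R's `prop.test()` returns when the continuity correction is turned off. The counts are copied from the `season_summary` output; the helper `wilson_ci()` is a name introduced here, not part of the original analysis.

```r
# Counts copied from the season_summary table
season_counts <- data.frame(
  season     = c("Spring", "Autumn", "Winter", "Summer"),
  total      = c(452, 390, 21, 5),
  successful = c(342, 259, 7, 3)
)

# 95% Wilson score interval for a single proportion.
# suppressWarnings() hides the small-sample chi-squared warning for Summer (n = 5).
wilson_ci <- function(x, n) {
  ci <- suppressWarnings(prop.test(x, n, correct = FALSE))$conf.int
  c(lower = ci[1], upper = ci[2])
}

# mapply() returns a 2 x 4 matrix; transpose so each row is one season
intervals <- t(mapply(wilson_ci, season_counts$successful, season_counts$total))
season_counts$lower_pct <- round(100 * intervals[, "lower"], 1)
season_counts$upper_pct <- round(100 * intervals[, "upper"], 1)
print(season_counts)
```

The intervals for Winter and Summer come out far wider than those for Spring and Autumn, so differences involving the two rare seasons should be treated as indicative only.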
ggplot(nation_counts, aes(x = reorder(NATION, N), y = N, fill = N)) +
geom_col(width = 0.7) +
geom_text(aes(label = N), hjust = -0.2, fontface = "bold", size = 4) +
scale_fill_gradient(low = brewer.pal(9, "Greens")[3],
high = brewer.pal(9, "Greens")[8]) +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
coord_flip() +
labs(
title = "Top 10 Nationalities Leading Himalayan Expeditions (2020-2024)",
subtitle = "Nepal and the USA tie for first place, followed by the UK",
x = "Country",
y = "Number of Expeditions",
fill = "Count",
caption = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
) +
my_theme +
theme(legend.position = "none")
Insight: Nepal and the USA share first place with 117 expeditions each, followed by the UK (91), then India and France (51 each). Nepal's position makes sense given that it hosts most of these expeditions; China, the other host country seen in the data, ranks seventh with 30. European nations fill most of the remaining places, reflecting the global appeal of high-altitude mountaineering.
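The top-10 table contains several ties (Nepal and the USA, India and France, Spain and Russia), so position in the sorted output alone can be misleading. A short sketch of competition-style ranking with base R's `rank()` follows; the counts are copied from the `nation_counts` output, and the names `nation_n` and `nation_rank` are introduced here for illustration.

```r
# Expedition counts copied from the nation_counts table
nation_n <- c(Nepal = 117, USA = 117, UK = 91, India = 51, France = 51,
              Germany = 32, China = 30, Austria = 29, Spain = 26, Russia = 26)

# ties.method = "min" gives tied nations the same (best) rank and
# skips the following rank, e.g. 1, 1, 3, ...
nation_rank <- rank(-nation_n, ties.method = "min")
print(nation_rank)
```

Note also that `[1:10]` cuts the sorted table at exactly ten rows, so a nation tied with the tenth entry could be dropped arbitrarily; checking the count in row 11 of the unsliced table would rule that out.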
# Use merged dataset; filter to valid summit days and heights
plot4_data <- merged[!is.na(HEIGHTM) & !is.na(SMTDAYS) & SMTDAYS > 0 & SMTDAYS < 200]
ggplot(plot4_data, aes(x = HEIGHTM, y = SMTDAYS, color = SEASON_FACTOR)) +
geom_point(alpha = 0.4, size = 2) +
geom_smooth(method = "lm", se = TRUE, color = "grey20", linewidth = 1) +
scale_color_manual(values = season_colors) +
labs(
title = "Does Peak Height Predict Days to Summit?",
subtitle = "Each point is one expedition; the line shows the overall linear trend",
x = "Peak Height (metres)",
y = "Days to Reach Summit",
color = "Season",
caption = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
) +
my_theme
Insight: There is a slight positive trend: higher peaks generally require more days to summit. However, there's a lot of variation, which suggests that other factors (weather, acclimatisation strategy, route difficulty) also play a major role.
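The strength of a trend like this can be quantified by fitting the same linear model that `geom_smooth(method = "lm")` draws and reading off the slope and R-squared. The sketch below uses simulated data so that it runs on its own; the slope, noise level and sample size are invented and say nothing about the real expeditions. Replacing `sim` with `plot4_data` would apply it to the actual data.

```r
set.seed(42)

# Simulated stand-in for plot4_data; all parameters here are invented
n <- 200
sim <- data.frame(HEIGHTM = runif(n, min = 5500, max = 8850))
sim$SMTDAYS <- 5 + 0.004 * (sim$HEIGHTM - 5500) + rnorm(n, sd = 8)

# Same model as the trend line in the scatter plot
fit <- lm(SMTDAYS ~ HEIGHTM, data = sim)

slope_per_1000m <- unname(coef(fit)["HEIGHTM"]) * 1000  # extra days per 1,000 m
r_squared       <- summary(fit)$r.squared               # share of variance explained
print(c(slope_per_1000m = slope_per_1000m, r_squared = r_squared))
```

A low R-squared alongside a positive slope would be the numerical counterpart of "a slight trend with a lot of variation".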
# Focus on regions with enough peaks
region_data <- peaks_tidy[!is.na(REGION_FACTOR) & !is.na(HEIGHTM)]
# Keep up to 8 regions by number of peaks (head() avoids an NA entry when fewer than 8 exist)
top_regions <- head(region_data[, .N, by = REGION_FACTOR][order(-N)], 8)$REGION_FACTOR
region_data <- region_data[REGION_FACTOR %in% top_regions]
ggplot(region_data, aes(x = reorder(REGION_FACTOR, HEIGHTM, median),
y = HEIGHTM,
fill = REGION_FACTOR)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.4, outlier.size = 1.5) +
scale_fill_brewer(palette = "Set3") +
coord_flip() +
labs(
title = "Peak Height Distribution by Himalayan Region",
subtitle = "Regions ordered by median peak height",
x = "Mountain Region",
y = "Peak Height (metres)",
fill = "Region",
caption = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
) +
my_theme +
theme(legend.position = "none")
Insight: There's significant variation in peak heights within each region. By mean height, Kangchenjunga-Janak ranks first (about 6,870 m), while Khumbu-Rolwaling-Makalu, home to Everest, ranks only fifth (about 6,683 m), which suggests its 132 peaks include many lower ones alongside the 8,000-metre giants. Wide distributions like this indicate a mix of lower, more accessible peaks and expert-level ones.
ggplot(o2_summary, aes(x = O2Label, y = success_rate, fill = O2Label)) +
geom_col(width = 0.5) +
geom_text(aes(label = paste0(success_rate, "%\n(n=", total, ")")),
vjust = -0.4, fontface = "bold", size = 5) +
scale_fill_manual(values = c("Used Oxygen" = brewer.pal(9, "Blues")[7],
"No Oxygen" = brewer.pal(9, "Oranges")[6])) +
scale_y_continuous(limits = c(0, 100),
labels = label_percent(scale = 1),
expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Summit Success Rate: Oxygen vs No Oxygen",
subtitle = "Expeditions using supplemental oxygen reach the summit more often",
x = NULL,
y = "Success Rate (%)",
caption = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
) +
my_theme +
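theme(legend.position = "none")
Insight: Expeditions using supplemental oxygen succeed far more often: 88.9% (352 of 396) against 54.9% (259 of 472) without it. This is expected, since oxygen helps climbers cope with altitude sickness and maintain physical performance in the death zone above 8,000 metres. The comparison does not control for peak height or expedition size, so the gap should not be read as the effect of oxygen alone.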
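The size of the gap between the two groups can be checked with a two-sample test of proportions. The sketch below uses base R's `prop.test()` on the counts copied from the `o2_summary` output; the variable names are introduced here for illustration. It measures association only and does not adjust for which peaks the two groups attempted.

```r
# Counts copied from the o2_summary table
successes <- c(used_oxygen = 352, no_oxygen = 259)
totals    <- c(used_oxygen = 396, no_oxygen = 472)

# Two-sample test for equality of proportions
o2_test <- prop.test(x = successes, n = totals)

# Difference in success rates, in percentage points
rates    <- successes / totals
diff_pct <- 100 * unname(rates["used_oxygen"] - rates["no_oxygen"])

print(round(diff_pct, 1))   # about 34 percentage points
print(o2_test$conf.int)     # 95% interval for the difference in proportions
print(o2_test$p.value)
```

A natural follow-up, not shown here, would be to repeat the comparison within height bands of `HEIGHTM` from the merged table, to see how much of the gap remains among peaks of similar altitude.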
ggplot(danger_summary, aes(x = reorder(HIMAL_FACTOR, total_deaths),
y = total_deaths,
fill = total_deaths)) +
geom_col(width = 0.7) +
geom_text(aes(label = total_deaths), hjust = -0.2, fontface = "bold", size = 4) +
scale_fill_gradient(low = red_palette[3], high = red_palette[8]) +
scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
coord_flip() +
labs(
title = "Total Member Deaths by Mountain Range (2020-2024)",
subtitle = "Absolute deaths; does not account for number of expeditions per range",
x = "Mountain Range",
y = "Total Deaths",
fill = "Deaths",
caption = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
) +
my_theme +
theme(legend.position = "none")
Insight: The Khumbu range (Everest's home) records the most deaths in absolute terms (28), which reflects how many expeditions attempt Everest and the surrounding 8,000-metre giants: 497 in this period. That doesn't make it the most dangerous per expedition. In the summary table, Kangchenjunga/Simhalila has a higher rate (6.90 deaths per 100 expeditions against Khumbu's 5.63), although that figure rests on only 29 expeditions.
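To compare ranges per expedition rather than in absolute terms, the same summary can be re-sorted by `death_rate`, optionally dropping ranges with too few expeditions for the rate to be meaningful. The sketch below rebuilds the relevant rows from the `danger_summary` output; the threshold `min_expeditions` is an arbitrary illustrative value, not something used in the original analysis.

```r
library(data.table)

# Rows copied from the danger_summary table (ranges with at least one death)
danger <- data.table(
  HIMAL_FACTOR      = c("Khumbu", "Kangchenjunga/Simhalila", "Dhaulagiri",
                        "Manaslu/Mansiri", "Annapurna", "Makalu"),
  total_deaths      = c(28L, 2L, 2L, 2L, 1L, 1L),
  total_expeditions = c(497L, 29L, 48L, 99L, 34L, 29L)
)

# Deaths per 100 expeditions, the same quantity as death_rate in the analysis
danger[, death_rate := round(total_deaths / total_expeditions * 100, 2)]

# Hypothetical cut-off: ignore ranges with fewer expeditions than this
min_expeditions <- 25L
ranked <- danger[total_expeditions >= min_expeditions][order(-death_rate)]
print(ranked)
```

Sorted this way, Kangchenjunga/Simhalila moves ahead of Khumbu, although with only two deaths behind its rate the ordering could easily change with one more season of data.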
# Filter peaks with known open status and height
open_data <- peaks_tidy[!is.na(OPEN) & !is.na(HEIGHTM)]
open_data[, OpenLabel := ifelse(OPEN, "Open to Climbing", "Closed")]
ggplot(open_data, aes(x = OpenLabel, y = HEIGHTM, fill = OpenLabel)) +
geom_violin(alpha = 0.6, trim = FALSE) +
geom_boxplot(width = 0.15, fill = "white", outlier.size = 1.5, alpha = 0.9) +
scale_fill_manual(values = c("Open to Climbing" = brewer.pal(9, "Blues")[6],
"Closed" = brewer.pal(9, "Greys")[5])) +
labs(
title = "Height Distribution: Open vs Closed Himalayan Peaks",
subtitle = "Multiple geom layers: violin shows full distribution, boxplot shows median and IQR",
x = NULL,
y = "Peak Height (metres)",
fill = "Climbing Status",
caption = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
) +
my_theme +
theme(legend.position = "none")
Insight: Open peaks tend to be taller on average than closed ones. This might seem counterintuitive, but it may reflect the commercial appeal of high-profile peaks: taller mountains attract more climbers and more permit revenue, which gives an incentive to keep them open. Closed peaks may include sacred or ecologically sensitive mountains regardless of height.
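The visual difference between the two violins can be backed up with a difference in medians and a rank-based test, which does not assume normally distributed heights. The sketch below runs on simulated heights so that it is self-contained; the group sizes, means and spreads are invented. With the real data, the two vectors would be `open_data[OPEN == TRUE, HEIGHTM]` and `open_data[OPEN == FALSE, HEIGHTM]`.

```r
set.seed(1)

# Simulated heights in metres; group sizes and distributions are invented
open_heights   <- rnorm(60, mean = 6800, sd = 500)
closed_heights <- rnorm(40, mean = 6400, sd = 400)

# Difference in medians, matching the centre line of each boxplot
median_gap <- median(open_heights) - median(closed_heights)

# Wilcoxon rank-sum test comparing the two height distributions
height_test <- wilcox.test(open_heights, closed_heights)

print(median_gap)
print(height_test$p.value)
```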
This analysis explored 882 Himalayan expeditions from 2020 to 2024, merged with a database of all recorded peaks.
Key takeaways:
Overall, mountaineering in the Himalayas remains both a highly organised commercial endeavour and an extreme sport where success depends on many interacting factors: season, altitude, oxygen, nationality, and pure determination.
Data source: The Himalayan Database (Elizabeth Hawley Archive),
via TidyTuesday 2025 Week 3.
Analysis by: [Your Name], 2026-05-19