This report uses the Himalayan Mountaineering Expeditions dataset from the TidyTuesday project (2025, Week 3).
The data comes from the Himalayan Database, an archive of climbing expeditions in the Nepal Himalaya originally compiled by journalist Elizabeth Hawley.
Two tables are used:
- exped_tidy: expedition-level data (routes, team size, outcomes, dates)
- peaks_tidy: peak-level data (height, region, first ascent info)
Questions explored: how has climbing activity changed over time, which peaks are most dangerous, and what factors relate to expedition outcomes.
library(data.table)
library(ggplot2)
library(RColorBrewer)
exped <- fread('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-01-21/exped_tidy.csv')
peaks <- fread('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-01-21/peaks_tidy.csv')
Quick look at both tables:
str(exped)
## Classes 'data.table' and 'data.frame': 882 obs. of 69 variables:
## $ EXPID : chr "EVER20101" "EVER20102" "EVER20103" "AMAD20301" ...
## $ PEAKID : chr "EVER" "EVER" "EVER" "AMAD" ...
## $ YEAR : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ SEASON : int 1 1 1 3 3 3 3 3 3 3 ...
## $ SEASON_FACTOR : chr "Spring" "Spring" "Spring" "Autumn" ...
## $ HOST : int 2 2 2 1 1 1 1 1 1 1 ...
## $ HOST_FACTOR : chr "China" "China" "China" "Nepal" ...
## $ ROUTE1 : chr "N Col-NE Ridge" "N Col-NE Ridge" "N Col-NE Ridge" "SW Ridge" ...
## $ ROUTE2 : chr NA NA NA NA ...
## $ ROUTE3 : logi NA NA NA NA NA NA ...
## $ ROUTE4 : logi NA NA NA NA NA NA ...
## $ NATION : chr "China" "China" "China" "Nepal" ...
## $ LEADERS : chr "Tibetan Rope-Fixing" "Ci Luo (Tselo)" "Tsering Samdrup" "Chhang Dawa Sherpa" ...
## $ SPONSOR : chr "Tibetan Rope-Fixing Everest North 2020" "Chinese Mount Everest Survey Team" "Holy Mountain Adventure Everest Expedition 2020" "Seven Summit Treks Ama Dablam Expedition 2020" ...
## $ SUCCESS1 : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ SUCCESS2 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ SUCCESS3 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ SUCCESS4 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ ASCENT1 : chr NA NA NA NA ...
## $ ASCENT2 : chr NA NA NA NA ...
## $ ASCENT3 : logi NA NA NA NA NA NA ...
## $ ASCENT4 : logi NA NA NA NA NA NA ...
## $ CLAIMED : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ DISPUTED : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ COUNTRIES : chr NA NA NA "Canada, Czech Republic, France, Poland, Russia, Switzerland, Ukraine, USA" ...
## $ APPROACH : chr "Lhasa->Tingri->Everest BC" NA "Lhasa->Tingri->Everest BC" NA ...
## $ BCDATE : IDate, format: NA NA ...
## $ SMTDATE : IDate, format: "2020-05-26" "2020-05-27" ...
## $ SMTTIME : int 1515 945 545 1300 1300 1243 NA 930 NA 615 ...
## $ SMTDAYS : int 0 0 35 1 9 16 11 3 0 10 ...
## $ TOTDAYS : int 0 0 38 0 11 17 13 4 0 0 ...
## $ TERMDATE : IDate, format: NA NA ...
## $ TERMREASON : int 1 1 1 1 1 1 4 1 12 1 ...
## $ TERMREASON_FACTOR: chr "Success (main peak)" "Success (main peak)" "Success (main peak)" "Success (main peak)" ...
## $ TERMNOTE : chr NA NA NA NA ...
## $ HIGHPOINT : int 8849 8849 8849 6814 6814 6814 6650 6814 0 6814 ...
## $ TRAVERSE : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ SKI : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ PARAPENTE : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ CAMPS : int 3 3 3 2 2 2 2 2 0 2 ...
## $ ROPE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ TOTMEMBERS : int 0 12 20 14 6 2 4 1 1 6 ...
## $ SMTMEMBERS : int 0 8 14 9 6 2 0 1 0 1 ...
## $ MDEATHS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ TOTHIRED : int 6 0 22 19 8 1 2 1 0 6 ...
## $ SMTHIRED : int 6 0 21 14 8 1 0 1 0 3 ...
## $ HDEATHS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NOHIRED : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ O2USED : logi TRUE TRUE TRUE FALSE FALSE FALSE ...
## $ O2NONE : logi FALSE FALSE FALSE TRUE TRUE TRUE ...
## $ O2CLIMB : logi TRUE TRUE TRUE FALSE FALSE FALSE ...
## $ O2DESCENT : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ O2SLEEP : logi TRUE TRUE TRUE FALSE FALSE FALSE ...
## $ O2MEDICAL : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ O2TAKEN : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ O2UNKWN : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ OTHERSMTS : chr NA NA NA NA ...
## $ CAMPSITES : chr "BC,ABC,C1,C2,C3,Smt(26/05)" "BC,ABC,C1,C2,C3,Smt(27/05)" "BC(23/04,5200m),IC(26/04,5800m),ABC(05/01,6500m),C1(25/05,7028m),C2(26/05,7790m),C3(27/05,8300m),Smt(28/05)" "BC(09/11,4450m),C1(5600m),C2(5900m),Smt(10,12-13,15/11)" ...
## $ ROUTEMEMO : int NA 221011 203869 NA 29755 107752 29661 17154 NA 69150 ...
## $ ACCIDENTS : chr NA NA NA NA ...
## $ ACHIEVMENT : chr NA NA NA NA ...
## $ AGENCY : chr "Holy Mountain Adventure" NA "Holy Mountain Adventure" "Seven Summit Treks" ...
## $ COMRTE : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ STDRTE : logi TRUE TRUE TRUE FALSE FALSE FALSE ...
## $ PRIMRTE : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ PRIMMEM : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ PRIMREF : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ PRIMID : chr NA NA NA NA ...
## $ CHKSUM : int 2465291 2465292 2465293 2463299 2463299 2463320 2463316 2463318 4135 2463318 ...
## - attr(*, ".internal.selfref")=<pointer: 0x105f28080>
str(peaks)
## Classes 'data.table' and 'data.frame': 480 obs. of 29 variables:
## $ PEAKID : chr "AMAD" "AMPG" "ANN1" "ANN2" ...
## $ PKNAME : chr "Ama Dablam" "Amphu Gyabjen" "Annapurna I" "Annapurna II" ...
## $ PKNAME2 : chr "Amai Dablang" "Amphu Gyabien" NA NA ...
## $ LOCATION : chr "Khumbu Himal" "Khumbu Himal (N of Ama Dablam)" "Annapurna Himal" "Annapurna Himal" ...
## $ HEIGHTM : int 6814 5630 8091 7937 7555 7525 8026 8051 7219 7132 ...
## $ HEIGHTF : int 22356 18471 26545 26040 24787 24688 26332 26414 23684 23399 ...
## $ HIMAL : int 12 12 1 1 1 1 1 1 1 2 ...
## $ HIMAL_FACTOR : chr "Khumbu" "Khumbu" "Annapurna" "Annapurna" ...
## $ REGION : int 2 2 5 5 5 5 5 5 5 7 ...
## $ REGION_FACTOR : chr "Khumbu-Rolwaling-Makalu" "Khumbu-Rolwaling-Makalu" "Annapurna-Damodar-Peri" "Annapurna-Damodar-Peri" ...
## $ OPEN : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ UNLISTED : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ TREKKING : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ TREKYEAR : int NA NA NA NA NA NA NA NA NA NA ...
## $ RESTRICT : chr NA "Opened in 2002" NA NA ...
## $ PHOST : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PHOST_FACTOR : chr "Nepal only" "Nepal only" "Nepal only" "Nepal only" ...
## $ PSTATUS : int 2 2 2 2 2 2 2 2 2 2 ...
## $ PSTATUS_FACTOR: chr "Climbed" "Climbed" "Climbed" "Climbed" ...
## $ PEAKMEMO : int 8 20 23 31 35 40 44 46 48 54 ...
## $ PYEAR : int 1961 1953 1950 1960 1961 1955 1974 1980 1964 1960 ...
## $ PSEASON : int 1 1 1 1 1 1 1 3 3 1 ...
## $ PEXPID : chr "AMAD61101" "AMPG53101" "ANN150101" "ANN260101" ...
## $ PSMTDATE : chr "Mar 13" "Apr 11" "Jun 03" "May 17" ...
## $ PCOUNTRY : chr "New Zealand, USA, UK" "UK" "France" "UK, Nepal" ...
## $ PSUMMITERS : chr "Mike Gill, Wally Romanes, Barry Bishop, Michael Ward" "John Hunt, Tom Bourdillon" "Maurice Herzog, Louis Lachenal" "Richard Grant, Chris Bonington, Ang Nyima Sherpa" ...
## $ PSMTNOTE : chr NA NA NA NA ...
## $ REFERMEMO : int NA NA 25 33 NA NA NA NA NA NA ...
## $ PHOTOMEMO : int 13 NA 26 34 37 42 45 NA 50 57 ...
## - attr(*, ".internal.selfref")=<pointer: 0x105f28080>
The expeditions table has 882 rows and 69 columns. The peaks table has 480 rows and 29 columns.
Both tables share PEAKID, so merging on that connects
expedition outcomes with peak info like height and region.
dt <- merge(exped, peaks, by = 'PEAKID', suffixes = c('_exp', '_peak'))
dim(dt)
## [1] 882 97
After merging: 882 expedition records with peak-level information attached.
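Because merge() defaults to an inner join, an expedition whose PEAKID is missing from the peaks table would be dropped without warning. The matching row counts (882 before and after) suggest nothing was lost here. A minimal sketch of how that could be checked explicitly, using small made-up tables (toy_exped and toy_peaks are hypothetical stand-ins, not the real data):

```r
library(data.table)

# Toy stand-ins for exped and peaks (made-up values, not the real data)
toy_exped <- data.table(EXPID  = c("A1", "A2", "B1", "C1"),
                        PEAKID = c("AAAA", "AAAA", "BBBB", "CCCC"))
toy_peaks <- data.table(PEAKID  = c("AAAA", "BBBB"),
                        HEIGHTM = c(8000L, 7000L))

# Anti-join: expeditions whose PEAKID has no match in the peaks table
unmatched <- toy_exped[!toy_peaks, on = "PEAKID"]

# Inner join (the merge() default) silently drops those rows
toy_dt <- merge(toy_exped, toy_peaks, by = "PEAKID")

print(unmatched)
# The rows lost in the merge should equal the unmatched rows
print(nrow(toy_exped) - nrow(toy_dt) == nrow(unmatched))
```

If the anti-join returns zero rows, the inner join and a left join (all.x = TRUE) give the same result.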
Filter for expeditions to peaks above 8,000 meters (the “eight-thousanders”):
eight_thousanders <- dt[HEIGHTM >= 8000]
nrow(eight_thousanders)
## [1] 483
## expedition count per peak
eight_thousanders[, .N, by = PKNAME][order(-N)]
## PKNAME N
## <char> <int>
## 1: Everest 189
## 2: Manaslu 96
## 3: Lhotse 79
## 4: Dhaulagiri I 35
## 5: Makalu 26
## 6: Annapurna I 22
## 7: Kangchenjunga 20
## 8: Cho Oyu 16
483 expeditions targeted peaks above 8,000m. Everest dominates the list.
Filter for expeditions where at least one person died:
deadly <- dt[MDEATHS > 0 | HDEATHS > 0]
deadly[, total_deaths := MDEATHS + HDEATHS]
nrow(deadly)
## [1] 38
38 expeditions had at least one fatality.
Summary stats by season:
dt[, .(
expeditions = .N,
avg_team_size = round(mean(TOTMEMBERS, na.rm = TRUE), 1),
avg_hired = round(mean(TOTHIRED, na.rm = TRUE), 1),
total_deaths = sum(MDEATHS + HDEATHS, na.rm = TRUE)
), by = SEASON_FACTOR][order(-expeditions)]
## SEASON_FACTOR expeditions avg_team_size avg_hired total_deaths
## <char> <int> <num> <num> <int>
## 1: Spring 462 8.2 9.4 42
## 2: Autumn 394 8.5 5.9 11
## 3: Winter 21 4.5 2.3 0
## 4: Summer 5 2.2 0.8 0
Spring is by far the busiest season, which makes sense since the best weather windows on the big peaks happen in April/May.
Summary stats by region:
dt[, .(
expeditions = .N,
total_deaths = sum(MDEATHS + HDEATHS, na.rm = TRUE),
avg_height = round(mean(HEIGHTM, na.rm = TRUE), 0)
), by = REGION_FACTOR][order(-expeditions)]
## REGION_FACTOR expeditions total_deaths avg_height
## <char> <int> <int> <num>
## 1: Khumbu-Rolwaling-Makalu 553 44 7876
## 2: Annapurna-Damodar-Peri 109 1 7234
## 3: Manaslu-Ganesh 99 4 8106
## 4: Dhaulagiri-Mukut 50 2 7848
## 5: Kangchenjunga-Janak 37 2 7848
## 6: Langtang-Jugal 21 0 6442
## 7: Kanjiroba-Far West 13 0 6455
ggplot(peaks, aes(x = HEIGHTM)) +
geom_histogram(binwidth = 200, fill = brewer.pal(9, 'Blues')[6], color = 'white') +
labs(title = 'Distribution of Himalayan Peak Heights',
x = 'Height (meters)', y = 'Number of Peaks') +
theme_minimal()
Right-skewed distribution. Most peaks are in the 6,000-7,500m range, with only a handful above 8,000m.
yearly <- dt[, .N, by = .(YEAR, SEASON_FACTOR)]
ggplot(yearly[SEASON_FACTOR %in% c('Spring', 'Autumn')],
aes(x = YEAR, y = N, color = SEASON_FACTOR)) +
geom_line(linewidth = 0.8) +
geom_point(size = 1.5) +
scale_color_brewer(palette = 'Set1') +
labs(title = 'Himalayan Expeditions Over Time',
x = 'Year', y = 'Number of Expeditions', color = 'Season') +
theme_minimal()
Spring consistently has more expeditions than Autumn across 2020-2024. Both seasons saw a dip in 2024.
deaths_by_peak <- dt[, .(total_deaths = sum(MDEATHS + HDEATHS, na.rm = TRUE)),
by = PKNAME][order(-total_deaths)][1:10]
ggplot(deaths_by_peak, aes(x = reorder(PKNAME, total_deaths), y = total_deaths)) +
geom_col(fill = brewer.pal(9, 'Reds')[6]) +
coord_flip() +
labs(title = 'Top 10 Deadliest Himalayan Peaks',
subtitle = 'Total deaths (members + hired) across all expeditions, 2020-2024',
x = '', y = 'Total Deaths') +
theme_minimal()
Everest leads in absolute deaths, partly because it has by far the most expeditions (189 in this dataset). Manaslu and Omoga Ri Chang also rank high. Annapurna I is arguably more dangerous per attempt, since it saw only 22 expeditions.
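Raw totals favor busy peaks, so a per-attempt comparison needs deaths divided by expedition count. A rough sketch of that calculation on made-up records (the toy table and the deaths_per_exped column are illustrative assumptions, not results from the real data):

```r
library(data.table)

# Toy per-expedition records (made-up values, not the real data)
toy <- data.table(
  PKNAME  = c(rep("Peak A", 5), rep("Peak B", 2)),
  MDEATHS = c(1L, 0L, 0L, 0L, 0L, 1L, 0L),
  HDEATHS = c(0L, 1L, 0L, 0L, 0L, 0L, 0L)
)

# Total deaths and expedition count per peak, then deaths per expedition
rates <- toy[, .(
  expeditions  = .N,
  total_deaths = sum(MDEATHS + HDEATHS, na.rm = TRUE)
), by = PKNAME][, deaths_per_exped := total_deaths / expeditions][order(-deaths_per_exped)]

print(rates)
```

In this toy example Peak A has more deaths in total, but Peak B has the higher rate. With only a handful of expeditions per peak, such rates are noisy, so a minimum expedition count would be a sensible filter before ranking.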
This plot uses two geom layers: geom_point and
geom_smooth.
ggplot(dt, aes(x = HEIGHTM, y = TOTMEMBERS)) +
geom_point(alpha = 0.3, color = brewer.pal(9, 'Set1')[2], size = 1.5) +
geom_smooth(method = 'lm', se = TRUE, color = brewer.pal(9, 'Set1')[1]) +
labs(title = 'Team Size vs Peak Height',
x = 'Peak Height (meters)', y = 'Total Team Members') +
theme_minimal()
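Positive relationship: expeditions to taller peaks tend to have more members. Taller peaks need more logistics (porters, camps, fixed ropes), which plausibly goes with bigger teams. Note that TOTMEMBERS counts members only; hired staff are recorded separately in TOTHIRED.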
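The strength of the trend can also be put into numbers instead of being read off the plot. A small sketch on made-up values (the toy data frame is hypothetical), using a rank correlation alongside the slope of the same kind of straight-line fit that geom_smooth(method = 'lm') draws:

```r
# Toy data (made-up values, not the real expeditions)
toy <- data.frame(
  HEIGHTM    = c(6100, 6500, 6800, 7100, 7500, 8000, 8200, 8800),
  TOTMEMBERS = c(2, 4, 3, 6, 5, 9, 8, 15)
)

# Rank-based correlation: less sensitive to a few very large teams than Pearson
rho <- cor(toy$HEIGHTM, toy$TOTMEMBERS, method = "spearman")

# Linear fit of team size on height; slope rescaled to members per 1,000 m
fit <- lm(TOTMEMBERS ~ HEIGHTM, data = toy)
slope_per_1000m <- unname(coef(fit)["HEIGHTM"]) * 1000

print(round(rho, 2))
print(round(slope_per_1000m, 2))
```

Spearman is chosen here on the assumption that team sizes are skewed by a few very large commercial expeditions; Pearson would be the direct counterpart of the fitted line.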
term_counts <- dt[!is.na(TERMREASON_FACTOR) & TERMREASON_FACTOR != '',
.N, by = TERMREASON_FACTOR][order(-N)]
ggplot(term_counts, aes(x = reorder(TERMREASON_FACTOR, N), y = N)) +
geom_col(fill = brewer.pal(9, 'Purples')[6]) +
coord_flip() +
labs(title = 'Why Do Expeditions End?',
x = '', y = 'Number of Expeditions') +
theme_minimal()
Most expeditions end with a successful summit. Among failures, bad weather and conditions are the top reasons.
ggplot(dt[!is.na(TOTDAYS) & TOTDAYS > 0 & TOTDAYS < 200 &
SEASON_FACTOR %in% c('Spring', 'Autumn', 'Winter', 'Summer')],
aes(x = SEASON_FACTOR, y = TOTDAYS, fill = SEASON_FACTOR)) +
geom_boxplot(outlier.alpha = 0.3) +
scale_fill_brewer(palette = 'Pastel1') +
labs(title = 'Expedition Duration by Season',
x = 'Season', y = 'Total Days') +
theme_minimal() +
theme(legend.position = 'none')
Spring has the widest spread and highest median. That’s the season when the big 8,000m peaks are attempted, and those require weeks of acclimatization.
This plot uses two geom layers: geom_point and
geom_text.
peak_danger <- dt[, .(
total_deaths = sum(MDEATHS + HDEATHS, na.rm = TRUE),
expeditions = .N,
height = first(HEIGHTM),
peak_name = first(PKNAME)
), by = PEAKID][total_deaths > 0][order(-total_deaths)]
top_peaks <- peak_danger[1:15]
ggplot(top_peaks, aes(x = height, y = total_deaths)) +
geom_point(aes(size = expeditions), color = brewer.pal(8, 'Dark2')[1], alpha = 0.7) +
geom_text(aes(label = peak_name), hjust = -0.1, vjust = 0.5,
size = 3, check_overlap = TRUE) +
scale_size_continuous(range = c(2, 10), name = 'Total Expeditions') +
labs(title = 'Peak Height vs Total Deaths (Top 15)',
subtitle = 'Bubble size = number of expeditions',
x = 'Peak Height (meters)', y = 'Total Deaths') +
theme_minimal() +
theme(legend.position = 'bottom') +
xlim(NA, max(top_peaks$height) + 400)
Everest has the most deaths and most expeditions. Annapurna I stands out as particularly deadly relative to how few expeditions attempt it.
# YEAR >= 1960 keeps every row here, since the data shown spans 2020-2024
deaths_time <- dt[YEAR >= 1960, .(
member_deaths = sum(MDEATHS, na.rm = TRUE),
hired_deaths = sum(HDEATHS, na.rm = TRUE)
), by = YEAR]
deaths_long <- melt(deaths_time, id.vars = 'YEAR',
variable.name = 'type', value.name = 'deaths')
deaths_long[, type := ifelse(type == 'member_deaths', 'Expedition Members', 'Hired Personnel')]
ggplot(deaths_long, aes(x = YEAR, y = deaths, fill = type)) +
geom_area(alpha = 0.7) +
scale_fill_brewer(palette = 'Set2') +
labs(title = 'Climbing Deaths Over Time',
subtitle = 'Member and hired personnel deaths (2020-2024)',
x = 'Year', y = 'Deaths', fill = 'Category') +
theme_minimal()
Deaths peaked around 2022-2023. Hired personnel (Sherpas, local guides) make up a significant share of fatalities, which is an ongoing ethical concern.
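The share of deaths among hired personnel can be computed directly from yearly totals shaped like deaths_time. A minimal sketch with made-up numbers (toy_deaths and hired_share are illustrative names, not values from the real data):

```r
library(data.table)

# Toy yearly totals (made-up values, not the real data)
toy_deaths <- data.table(
  YEAR          = c(2020L, 2021L, 2022L),
  member_deaths = c(1L, 3L, 6L),
  hired_deaths  = c(0L, 1L, 4L)
)

# Hired share of all deaths per year (NaN in a year with no deaths at all)
toy_deaths[, hired_share := hired_deaths / (member_deaths + hired_deaths)]

# Overall share across all years
overall_share <- toy_deaths[, sum(hired_deaths) / sum(member_deaths + hired_deaths)]

print(toy_deaths)
print(round(overall_share, 2))
```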
Main takeaways from this analysis: