1. Introduction

The Himalayan Database is one of the most comprehensive mountaineering archives in the world. It documents every recorded expedition to peaks in the Nepal Himalaya, continuing the work of journalist Elizabeth Hawley, who spent decades cataloguing climbing history in the region.

For this project, I am using the TidyTuesday (Week 3, 2025) version of this dataset, which includes:

  • exped_tidy: 882 expeditions from 2020 to 2024, with details on season, nationality, success, deaths, oxygen use, and days to summit.
  • peaks_tidy: information on all recorded Himalayan peaks, including height, region, and whether the peak is open for climbing.

Key questions I want to explore:

  1. Which seasons are most popular for climbing, and which are most successful?
  2. Which nationalities lead the most expeditions?
  3. Does peak height affect how long it takes to reach the summit?
  4. Does using supplemental oxygen increase success rates?
  5. Which mountain ranges are the most dangerous?
  6. How do peak heights differ across regions and open vs. closed peaks?

2. Setup: Loading Packages and Data

library(data.table)
library(ggplot2)
library(RColorBrewer)
library(scales)
# Load both datasets directly from GitHub
exped_tidy <- fread(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-01-21/exped_tidy.csv"
)

peaks_tidy <- fread(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-01-21/peaks_tidy.csv"
)
# Quick overview of each dataset
dim(exped_tidy)
## [1] 882  69
dim(peaks_tidy)
## [1] 480  29
head(exped_tidy[, .(EXPID, PEAKID, YEAR, SEASON_FACTOR, HOST_FACTOR,
                    NATION, SUCCESS1, TOTMEMBERS, MDEATHS, O2USED)])
##        EXPID PEAKID  YEAR SEASON_FACTOR HOST_FACTOR NATION SUCCESS1 TOTMEMBERS
##       <char> <char> <int>        <char>      <char> <char>   <lgcl>      <int>
## 1: EVER20101   EVER  2020        Spring       China  China     TRUE          0
## 2: EVER20102   EVER  2020        Spring       China  China     TRUE         12
## 3: EVER20103   EVER  2020        Spring       China  China     TRUE         20
## 4: AMAD20301   AMAD  2020        Autumn       Nepal  Nepal     TRUE         14
## 5: AMAD20302   AMAD  2020        Autumn       Nepal    USA     TRUE          6
## 6: AMAD20303   AMAD  2020        Autumn       Nepal     UK     TRUE          2
##    MDEATHS O2USED
##      <int> <lgcl>
## 1:       0   TRUE
## 2:       0   TRUE
## 3:       0   TRUE
## 4:       0  FALSE
## 5:       0  FALSE
## 6:       0  FALSE
head(peaks_tidy[, .(PEAKID, PKNAME, HEIGHTM, REGION_FACTOR, OPEN, PYEAR)])
##    PEAKID        PKNAME HEIGHTM           REGION_FACTOR   OPEN PYEAR
##    <char>        <char>   <int>                  <char> <lgcl> <int>
## 1:   AMAD    Ama Dablam    6814 Khumbu-Rolwaling-Makalu   TRUE  1961
## 2:   AMPG Amphu Gyabjen    5630 Khumbu-Rolwaling-Makalu   TRUE  1953
## 3:   ANN1   Annapurna I    8091  Annapurna-Damodar-Peri   TRUE  1950
## 4:   ANN2  Annapurna II    7937  Annapurna-Damodar-Peri   TRUE  1960
## 5:   ANN3 Annapurna III    7555  Annapurna-Damodar-Peri   TRUE  1961
## 6:   ANN4  Annapurna IV    7525  Annapurna-Damodar-Peri   TRUE  1955

The expeditions dataset has 882 rows and 69 columns. The peaks dataset has 480 rows and 29 columns.


3. Data Transformation

3.1 Merging the Two Datasets

To answer questions that involve both expedition details and peak characteristics (like height and region), I merge the two tables on PEAKID.

# Merge expeditions with peak info on PEAKID
merged <- merge(
  exped_tidy,
  peaks_tidy[, .(PEAKID, PKNAME, HEIGHTM, REGION_FACTOR, OPEN, HIMAL_FACTOR)],
  by   = "PEAKID",
  all.x = TRUE  # keep all expeditions even if peak info is missing
)

cat("Merged dataset:", nrow(merged), "rows and", ncol(merged), "columns\n")
## Merged dataset: 882 rows and 74 columns
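
The row count staying at 882 is what a left join should produce when PEAKID is unique in the peaks table. A minimal sketch of that sanity check, using base R `merge()` on two hypothetical toy tables (`exped_toy` and `peaks_toy` are made-up stand-ins, not rows from the real data):

```r
# Toy stand-ins for exped_tidy and peaks_tidy (hypothetical rows, not real data)
exped_toy <- data.frame(
  EXPID  = c("E1", "E2", "E3"),
  PEAKID = c("AAAA", "BBBB", "ZZZZ"),   # "ZZZZ" has no match in peaks_toy
  stringsAsFactors = FALSE
)
peaks_toy <- data.frame(
  PEAKID  = c("AAAA", "BBBB"),
  HEIGHTM = c(6000L, 7000L),
  stringsAsFactors = FALSE
)

# The key must be unique on the right-hand side, or the join duplicates rows
key_is_unique <- anyDuplicated(peaks_toy$PEAKID) == 0

# Left join: all.x = TRUE keeps every expedition row
merged_toy <- merge(exped_toy, peaks_toy, by = "PEAKID", all.x = TRUE)

# Row count is preserved, and unmatched peaks show up as NA heights
same_rows   <- nrow(merged_toy) == nrow(exped_toy)
n_unmatched <- sum(is.na(merged_toy$HEIGHTM))
print(merged_toy)
cat("Rows preserved:", same_rows, "| unmatched expeditions:", n_unmatched, "\n")
```

On the real tables, the same three checks (unique key, equal row count, number of NA heights) would show whether any expedition lacks peak information.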

3.2 Filtering and Aggregating with data.table

# --- Filtering ---
# Keep only expeditions with known season and at least one member
exped_clean <- exped_tidy[!is.na(SEASON_FACTOR) & TOTMEMBERS > 0]

# --- Success rate by season ---
season_summary <- exped_clean[, .(
  total       = .N,
  successful  = sum(SUCCESS1 == TRUE, na.rm = TRUE),
  avg_members = round(mean(TOTMEMBERS, na.rm = TRUE), 1),
  total_deaths = sum(MDEATHS, na.rm = TRUE)
), by = SEASON_FACTOR]

season_summary[, success_rate := round(successful / total * 100, 1)]
season_summary <- season_summary[order(-total)]
print(season_summary)
##    SEASON_FACTOR total successful avg_members total_deaths success_rate
##           <char> <int>      <int>       <num>        <int>        <num>
## 1:        Spring   452        342         8.4           28         75.7
## 2:        Autumn   390        259         8.5            8         66.4
## 3:        Winter    21          7         4.5            0         33.3
## 4:        Summer     5          3         2.2            0         60.0
# --- Top nationalities ---
nation_counts <- exped_clean[!is.na(NATION), .N, by = NATION]
nation_counts <- nation_counts[order(-N)][1:10]
print(nation_counts)
##      NATION     N
##      <char> <int>
##  1:   Nepal   117
##  2:     USA   117
##  3:      UK    91
##  4:   India    51
##  5:  France    51
##  6: Germany    32
##  7:   China    30
##  8: Austria    29
##  9:   Spain    26
## 10:  Russia    26
# --- Oxygen use vs success ---
o2_summary <- exped_clean[!is.na(O2USED), .(
  total      = .N,
  successful = sum(SUCCESS1 == TRUE, na.rm = TRUE)
), by = O2USED]
o2_summary[, success_rate := round(successful / total * 100, 1)]
o2_summary[, O2Label := ifelse(O2USED, "Used Oxygen", "No Oxygen")]
print(o2_summary)
##    O2USED total successful success_rate     O2Label
##    <lgcl> <int>      <int>        <num>      <char>
## 1:   TRUE   396        352         88.9 Used Oxygen
## 2:  FALSE   472        259         54.9   No Oxygen
# --- Deaths by mountain range (from merged dataset) ---
danger_summary <- merged[!is.na(HIMAL_FACTOR), .(
  total_deaths    = sum(MDEATHS, na.rm = TRUE),
  total_expeditions = .N,
  death_rate      = round(sum(MDEATHS, na.rm = TRUE) / .N * 100, 2)
), by = HIMAL_FACTOR]
danger_summary <- danger_summary[order(-total_deaths)][1:10]
print(danger_summary)
##                  HIMAL_FACTOR total_deaths total_expeditions death_rate
##                        <char>        <int>             <int>      <num>
##  1:                    Khumbu           28               497       5.63
##  2:   Kangchenjunga/Simhalila            2                29       6.90
##  3:                Dhaulagiri            2                48       4.17
##  4:           Manaslu/Mansiri            2                99       2.02
##  5:                 Annapurna            1                34       2.94
##  6:                    Makalu            1                29       3.45
##  7:                  Langtang            0                 7       0.00
##  8:                   Damodar            0                11       0.00
##  9:                 Rolwaling            0                27       0.00
## 10: Nalakankar/Chandi/Changla            0                 6       0.00
# --- Average height by region ---
region_heights <- peaks_tidy[!is.na(REGION_FACTOR) & !is.na(HEIGHTM),
                              .(avg_height = mean(HEIGHTM),
                                n_peaks    = .N),
                              by = REGION_FACTOR]
region_heights <- region_heights[order(-avg_height)]
print(region_heights)
##              REGION_FACTOR avg_height n_peaks
##                     <char>      <num>   <int>
## 1:     Kangchenjunga-Janak   6869.771      70
## 2:  Annapurna-Damodar-Peri   6721.757      74
## 3:        Dhaulagiri-Mukut   6704.025      40
## 4:          Manaslu-Ganesh   6694.829      41
## 5: Khumbu-Rolwaling-Makalu   6682.864     132
## 6:          Langtang-Jugal   6495.889      36
## 7:      Kanjiroba-Far West   6374.437      87

4. Visualizations

I apply a consistent theme and ColorBrewer palette across all plots.

# Custom theme applied to all plots
my_theme <- theme_minimal(base_size = 13) +
  theme(
    plot.title    = element_text(face = "bold", size = 15),
    plot.subtitle = element_text(color = "grey40", size = 11),
    plot.caption  = element_text(color = "grey60", size = 9, hjust = 1),
    axis.title    = element_text(face = "bold"),
    legend.position  = "bottom",
    panel.grid.minor = element_blank()
  )

season_colors <- brewer.pal(4, "Set2")   # ColorBrewer palette
blue_palette  <- brewer.pal(9, "Blues")
red_palette   <- brewer.pal(9, "Reds")

Plot 1: How Many Expeditions Per Season? (Bar Chart)

ggplot(season_summary, aes(x = reorder(SEASON_FACTOR, -total),
                            y = total,
                            fill = SEASON_FACTOR)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = total), vjust = -0.5, fontface = "bold", size = 4.5) +
  scale_fill_manual(values = season_colors) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.12))) +
  labs(
    title    = "Himalayan Expedition Counts by Season (2020–2024)",
    subtitle = "Spring leads: most climbers aim for the pre-monsoon weather window",
    x        = "Season",
    y        = "Number of Expeditions",
    fill     = "Season",
    caption  = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
  ) +
  my_theme +
  theme(legend.position = "none")

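Insight: Spring is the most popular season, with 452 of the 868 cleaned expeditions (52%), and Autumn follows closely with 390 (45%). Spring offers the most stable weather window before the monsoon arrives. Winter and Summer expeditions are rare (21 and 5 respectively); no member deaths were recorded in either season, though with so few expeditions that says little about their risk.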
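
The seasonal shares follow directly from the counts in the season_summary output. A short sketch of that arithmetic, with the counts copied by hand into a vector (`season_counts` is a name introduced here for illustration):

```r
# Expedition counts copied from the season_summary output above
season_counts <- c(Spring = 452, Autumn = 390, Winter = 21, Summer = 5)

# Share of all cleaned expeditions, in percent
season_share <- round(100 * season_counts / sum(season_counts), 1)
print(season_share)
```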


Plot 2: Success Rate by Season (Bar Chart with color)

ggplot(season_summary, aes(x = reorder(SEASON_FACTOR, -success_rate),
                            y = success_rate,
                            fill = success_rate)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = paste0(success_rate, "%")), vjust = -0.5,
            fontface = "bold", size = 4.5) +
  scale_fill_gradient(low = blue_palette[3], high = blue_palette[8]) +
  scale_y_continuous(limits = c(0, 100),
                     labels = label_percent(scale = 1),
                     expand = expansion(mult = c(0, 0.1))) +
  labs(
    title    = "Summit Success Rate by Season",
    subtitle = "Spring and Autumn have the highest chance of reaching the top",
    x        = "Season",
    y        = "Success Rate (%)",
    fill     = "Success Rate",
    caption  = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
  ) +
  my_theme

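Insight: Spring has the highest success rate (75.7%), followed by Autumn (66.4%). Winter expeditions are rare and succeed only a third of the time (7 of 21), a testament to how brutal Himalayan winters are. Summer's 60% rests on just five expeditions, so it should not be read as a reliable estimate.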
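
Because the winter and summer rates rest on very few expeditions, it may help to put an interval around each rate. The sketch below uses `binom.test()` from base R to compute exact 95% intervals from the counts in the season_summary output; the choice of an exact binomial interval is an assumption made here, not part of the original analysis.

```r
# Successes and totals copied from the season_summary output above
seasons    <- c("Spring", "Autumn", "Winter", "Summer")
successful <- c(342, 259, 7, 3)
total      <- c(452, 390, 21, 5)

# Exact (Clopper-Pearson) 95% interval for each season's success rate
ci <- t(mapply(function(x, n) binom.test(x, n)$conf.int, successful, total))

season_ci <- data.frame(
  season   = seasons,
  rate_pct = round(100 * successful / total, 1),
  lower    = round(100 * ci[, 1], 1),
  upper    = round(100 * ci[, 2], 1),
  stringsAsFactors = FALSE
)
print(season_ci)
```

The summer interval comes out far wider than the spring one, which is the quantitative version of "five expeditions is too few to say much".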


Plot 3: Top 10 Nationalities Leading Expeditions (Bar Chart)

ggplot(nation_counts, aes(x = reorder(NATION, N), y = N, fill = N)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = N), hjust = -0.2, fontface = "bold", size = 4) +
  scale_fill_gradient(low = brewer.pal(9, "Greens")[3],
                      high = brewer.pal(9, "Greens")[8]) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  coord_flip() +
  labs(
    title    = "Top 10 Nationalities Leading Himalayan Expeditions (2020–2024)",
    subtitle = "Nepal and the USA tie for the lead, followed by the UK",
    x        = "Country",
    y        = "Number of Expeditions",
    fill     = "Count",
    caption  = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
  ) +
  my_theme +
  theme(legend.position = "none")

Insight: Nepal and the USA tie for first place with 117 expeditions each, followed by the UK (91), then India and France (51 each). China, which also hosts expeditions in this dataset, ranks seventh with 30. Five Western European nations appear in the top ten (the UK, France, Germany, Austria, and Spain), reflecting the global appeal of high-altitude mountaineering.


Plot 4: Peak Height vs Days to Summit (Scatterplot + Smooth)

# Use merged dataset; filter to valid summit days and heights
plot4_data <- merged[!is.na(HEIGHTM) & !is.na(SMTDAYS) & SMTDAYS > 0 & SMTDAYS < 200]

ggplot(plot4_data, aes(x = HEIGHTM, y = SMTDAYS, color = SEASON_FACTOR)) +
  geom_point(alpha = 0.4, size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "grey20", linewidth = 1) +
  scale_color_manual(values = season_colors) +
  labs(
    title    = "Does Peak Height Predict Days to Summit?",
    subtitle = "Each point is one expedition; the line shows the overall linear trend",
    x        = "Peak Height (metres)",
    y        = "Days to Reach Summit",
    color    = "Season",
    caption  = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
  ) +
  my_theme

Insight: There is a slight positive trend: higher peaks generally require more days to summit. However, there is a lot of variation, which suggests that other factors (weather, acclimatisation strategy, route difficulty) also play a major role.
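
"Slight" and "a lot of variation" can be made concrete with the slope and R-squared of the same linear model that `geom_smooth(method = "lm")` draws. The sketch below fits it on synthetic data: the `toy` table and the numbers used to generate it are invented purely for illustration and say nothing about the real expeditions. Only the column names HEIGHTM and SMTDAYS come from the dataset.

```r
# Synthetic stand-in for plot4_data (made-up values, for illustration only)
set.seed(42)
toy <- data.frame(HEIGHTM = round(runif(200, 5500, 8850)))
toy$SMTDAYS <- pmax(1, round(5 + 0.004 * (toy$HEIGHTM - 5500) + rnorm(200, sd = 6)))

# Same model that geom_smooth(method = "lm") fits
fit <- lm(SMTDAYS ~ HEIGHTM, data = toy)

slope_per_1000m <- unname(coef(fit)["HEIGHTM"]) * 1000  # extra days per 1,000 m
r_squared       <- summary(fit)$r.squared                # share of variance explained
cat("Slope:", round(slope_per_1000m, 2), "days per 1,000 m\n")
cat("R-squared:", round(r_squared, 3), "\n")
```

Running the same two lines on plot4_data instead of `toy` would report the actual slope and how much of the spread in summit days height explains.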


Plot 5: Peak Height Distribution by Region (Boxplot)

# Focus on regions with enough peaks
region_data <- peaks_tidy[!is.na(REGION_FACTOR) & !is.na(HEIGHTM)]

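# Keep up to eight regions with the most peaks (this dataset has only seven,
# so head() avoids the NA that indexing rows 1:8 would introduce)
top_regions <- region_data[, .N, by = REGION_FACTOR][order(-N)][, head(REGION_FACTOR, 8)]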
region_data <- region_data[REGION_FACTOR %in% top_regions]

ggplot(region_data, aes(x = reorder(REGION_FACTOR, HEIGHTM, median),
                         y = HEIGHTM,
                         fill = REGION_FACTOR)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.4, outlier.size = 1.5) +
  scale_fill_brewer(palette = "Set3") +
  coord_flip() +
  labs(
    title    = "Peak Height Distribution by Himalayan Region",
    subtitle = "Regions ordered by median peak height",
    x        = "Mountain Region",
    y        = "Peak Height (metres)",
    fill     = "Region",
    caption  = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
  ) +
  my_theme +
  theme(legend.position = "none")

Insight: Peak heights vary widely within each region. Kangchenjunga-Janak has the highest mean height (6,870 m), while Khumbu-Rolwaling-Makalu, home to Everest, ranks only fifth by mean (6,683 m): its 132 peaks include many lower summits alongside the 8,000-metre giants. Wide distributions indicate that a region mixes comparatively modest peaks with far higher ones.


Plot 6: Does Oxygen Use Improve Summit Success? (Bar Chart)

ggplot(o2_summary, aes(x = O2Label, y = success_rate, fill = O2Label)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = paste0(success_rate, "%\n(n=", total, ")")),
            vjust = -0.4, fontface = "bold", size = 5) +
  scale_fill_manual(values = c("Used Oxygen" = brewer.pal(9, "Blues")[7],
                                "No Oxygen"   = brewer.pal(9, "Oranges")[6])) +
  scale_y_continuous(limits = c(0, 100),
                     labels = label_percent(scale = 1),
                     expand = expansion(mult = c(0, 0.15))) +
  labs(
    title    = "Summit Success Rate: Oxygen vs No Oxygen",
    subtitle = "Expeditions using supplemental oxygen reach the summit more often",
    x        = NULL,
    y        = "Success Rate (%)",
    caption  = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
  ) +
  my_theme +
  theme(legend.position = "none")

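Insight: Expeditions using supplemental oxygen succeeded 88.9% of the time (352 of 396), compared with 54.9% (259 of 472) for those without. Oxygen helps climbers cope with the effects of altitude and maintain physical performance in the death zone above 8,000 metres. The comparison is not like-for-like, however: oxygen is typically used on the highest peaks, where expeditions differ in many other ways, so the gap cannot be attributed to oxygen alone.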
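
One way to check that the gap between the two groups is larger than chance would produce is a two-sample test of proportions. The sketch below applies base R's `prop.test()` to the counts from the o2_summary output. This test is an addition for illustration rather than part of the original analysis, and it does not adjust for peak height or any other difference between the groups.

```r
# Counts copied from the o2_summary output above
successful <- c(used = 352, not_used = 259)
total      <- c(used = 396, not_used = 472)

# Two-sample test of equal proportions (chi-squared with continuity correction)
o2_test <- prop.test(x = successful, n = total)

rate_diff <- unname(o2_test$estimate[1] - o2_test$estimate[2])
cat("Difference in success rate:", round(100 * rate_diff, 1), "percentage points\n")
cat("95% CI for the difference:", round(100 * o2_test$conf.int, 1), "\n")
cat("p-value:", format.pval(o2_test$p.value), "\n")
```

A small p-value here supports "the rates differ", not "oxygen causes the difference"; separating the two would need a comparison within peaks of similar height.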


Plot 7: Deaths by Mountain Range (Bar Chart)

ggplot(danger_summary, aes(x = reorder(HIMAL_FACTOR, total_deaths),
                            y = total_deaths,
                            fill = total_deaths)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = total_deaths), hjust = -0.2, fontface = "bold", size = 4) +
  scale_fill_gradient(low = red_palette[3], high = red_palette[8]) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
  coord_flip() +
  labs(
    title    = "Total Member Deaths by Mountain Range (2020–2024)",
    subtitle = "Absolute deaths; does not account for number of expeditions per range",
    x        = "Mountain Range",
    y        = "Total Deaths",
    fill     = "Deaths",
    caption  = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
  ) +
  my_theme +
  theme(legend.position = "none")

Insight: The Khumbu range (Everest's home) records the most deaths in absolute terms, 28 of the 36 member deaths, which reflects how many expeditions attempt Everest and the surrounding 8,000-metre giants (497 expeditions). Per expedition the picture changes: Kangchenjunga/Simhalila has the highest rate at 6.90 deaths per 100 expeditions, ahead of Khumbu at 5.63, although with only 2 deaths in 29 expeditions that rate is very uncertain.
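
Ranking by rate instead of by count only needs a re-sort of the danger_summary values. A minimal base R sketch, with the six ranges that recorded at least one death copied by hand from the output (the `danger` data frame and the `per_100` column name are introduced here for illustration):

```r
# Values copied from the danger_summary output above (ranges with at least one death)
danger <- data.frame(
  range       = c("Khumbu", "Kangchenjunga/Simhalila", "Dhaulagiri",
                  "Manaslu/Mansiri", "Annapurna", "Makalu"),
  deaths      = c(28, 2, 2, 2, 1, 1),
  expeditions = c(497, 29, 48, 99, 34, 29),
  stringsAsFactors = FALSE
)

# Deaths per 100 expeditions, then rank by rate instead of absolute count
danger$per_100 <- round(100 * danger$deaths / danger$expeditions, 2)
danger_by_rate <- danger[order(-danger$per_100), ]
print(danger_by_rate, row.names = FALSE)
```

With counts this small, a single additional death would reorder most of the list, so the ranking by rate is best read as indicative.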


Plot 8: Open vs Closed Peaks: Height Distribution (Violin + Boxplot)

# Filter peaks with known open status and height
open_data <- peaks_tidy[!is.na(OPEN) & !is.na(HEIGHTM)]
open_data[, OpenLabel := ifelse(OPEN, "Open to Climbing", "Closed")]

ggplot(open_data, aes(x = OpenLabel, y = HEIGHTM, fill = OpenLabel)) +
  geom_violin(alpha = 0.6, trim = FALSE) +
  geom_boxplot(width = 0.15, fill = "white", outlier.size = 1.5, alpha = 0.9) +
  scale_fill_manual(values = c("Open to Climbing" = brewer.pal(9, "Blues")[6],
                                "Closed"           = brewer.pal(9, "Greys")[5])) +
  labs(
    title    = "Height Distribution: Open vs Closed Himalayan Peaks",
    subtitle = "Multiple geom layers: violin shows full distribution, boxplot shows median and IQR",
    x        = NULL,
    y        = "Peak Height (metres)",
    fill     = "Climbing Status",
    caption  = "Source: The Himalayan Database via TidyTuesday 2025 Week 3"
  ) +
  my_theme +
  theme(legend.position = "none")

Insight: Open peaks tend to be taller on average than closed ones. This might seem counterintuitive, but it may reflect the commercial appeal of high-profile peaks: taller mountains attract more climbers and more permit revenue, so there is an incentive to keep them open. Closed peaks may include sacred or ecologically sensitive mountains regardless of height.


5. Summary and Conclusions

This analysis explored 882 Himalayan expeditions from 2020 to 2024, merged with a database of all recorded peaks.

Key takeaways:

  • Spring is peak season for Himalayan climbing, with the most expeditions (452) and the highest success rate (75.7%).
  • Nepal and the USA lead in expedition counts with 117 each, followed by the UK with 91.
  • Higher peaks generally take longer to summit, though individual variation is large.
  • Expeditions using oxygen succeed far more often (88.9% vs. 54.9%), although oxygen is mostly used on the highest peaks, so the comparison is not like-for-like.
  • The Khumbu range records the most deaths, largely because it hosts the world's most-attempted extreme peaks, including Mount Everest; per expedition, Kangchenjunga/Simhalila has a higher death rate.
  • Open peaks are generally taller than closed ones, possibly due to the commercial and historical significance of high-altitude objectives.

Overall, mountaineering in the Himalayas remains both a highly organised commercial endeavour and an extreme sport where success depends on many interacting factors: season, altitude, oxygen, and pure determination.


Data source: The Himalayan Database (Elizabeth Hawley Archive), via TidyTuesday 2025 Week 3.
Analysis by: [Your Name], 2026-05-19