Group Project by: Joel, Ryan, Daniel, Sasindhi, Ruoyu

# Base Packages and loading datasets
library(readr)
library(tidyverse)
library(readxl)
library(stringr)
library(lubridate)
library(ggthemes)
library(ggtext)
library(shiny)

exped_tidy <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-01-21/exped_tidy.csv')
peaks_tidy <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-01-21/peaks_tidy.csv')

1 Introduction

The Himalayan Expeditions dataset offers a comprehensive view of the successes and failures of thousands of mountaineering attempts across the Himalayas. Compiled from The Himalayan Database, it includes two data frames: one documenting details of expeditions from 2020 to 2024, and the other capturing the geographical characteristics of Himalayan peaks. Together, these data provide a valuable foundation for understanding the dynamics of high-altitude mountaineering.

This project aims to investigate the key factors that contribute to the success of Himalayan climbing expeditions. Specifically, we will begin by exploring the attrition and success rates across different peaks, followed by a deep dive into how the number of hired personnel influences expedition outcomes. Lastly, we will analyse the primary causes of expedition failure at varying elevations. By visualising these trends, we hope to uncover how physical, logistical, and environmental factors intersect to shape expedition outcomes in one of the most demanding terrains on Earth.

2 Data Description and Cleaning

The repository contains two datasets, as mentioned above: exped_tidy and peaks_tidy.

2.1 Expedition Data Set: exped_tidy

The first dataset, exped_tidy, is a curated subset of the original Himalayan Expedition dataset, focusing on the years 2020 to 2024. It contains 882 entries across 69 variables and provides detailed information on the expeditions in this period, including:

  1. Date, duration, season of the expedition
  2. Route taken
  3. Reason for termination (Success, abandonment due to bad weather etc.)
  4. Team composition and agency
  5. Logistical usage (Hired guides, oxygen, rope etc.)

glimpse(exped_tidy)
## Rows: 882
## Columns: 69
## $ EXPID             <chr> "EVER20101", "EVER20102", "EVER20103", "AMAD20301", …
## $ PEAKID            <chr> "EVER", "EVER", "EVER", "AMAD", "AMAD", "AMAD", "AMA…
## $ YEAR              <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020…
## $ SEASON            <dbl> 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 4, 4…
## $ SEASON_FACTOR     <chr> "Spring", "Spring", "Spring", "Autumn", "Autumn", "A…
## $ HOST              <dbl> 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1…
## $ HOST_FACTOR       <chr> "China", "China", "China", "Nepal", "Nepal", "Nepal"…
## $ ROUTE1            <chr> "N Col-NE Ridge", "N Col-NE Ridge", "N Col-NE Ridge"…
## $ ROUTE2            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ ROUTE3            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ ROUTE4            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ NATION            <chr> "China", "China", "China", "Nepal", "USA", "UK", "UK…
## $ LEADERS           <chr> "Tibetan Rope-Fixing", "Ci Luo (Tselo)", "Tsering Sa…
## $ SPONSOR           <chr> "Tibetan Rope-Fixing Everest North 2020", "Chinese M…
## $ SUCCESS1          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FAL…
## $ SUCCESS2          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ SUCCESS3          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ SUCCESS4          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ ASCENT1           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ ASCENT2           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ ASCENT3           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ ASCENT4           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ CLAIMED           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ DISPUTED          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ COUNTRIES         <chr> NA, NA, NA, "Canada, Czech Republic, France, Poland,…
## $ APPROACH          <chr> "Lhasa->Tingri->Everest BC", NA, "Lhasa->Tingri->Eve…
## $ BCDATE            <date> NA, NA, 2020-04-23, 2020-11-09, 2020-11-01, 2020-11…
## $ SMTDATE           <date> 2020-05-26, 2020-05-27, 2020-05-28, 2020-11-10, 202…
## $ SMTTIME           <chr> "1515", "0945", "0545", "1300", "1300", "1243", NA, …
## $ SMTDAYS           <dbl> 0, 0, 35, 1, 9, 16, 11, 3, 0, 10, 13, 0, 16, 2, 14, …
## $ TOTDAYS           <dbl> 0, 0, 38, 0, 11, 17, 13, 4, 0, 0, 0, 0, 0, 0, 0, 16,…
## $ TERMDATE          <date> NA, NA, 2020-05-31, NA, 2020-11-12, 2020-12-02, 202…
## $ TERMREASON        <dbl> 1, 1, 1, 1, 1, 1, 4, 1, 12, 1, 1, 1, 1, 1, 4, 1, 1, …
## $ TERMREASON_FACTOR <chr> "Success (main peak)", "Success (main peak)", "Succe…
## $ TERMNOTE          <chr> NA, NA, NA, NA, NA, NA, "Abandoned at 6650m due to h…
## $ HIGHPOINT         <dbl> 8849, 8849, 8849, 6814, 6814, 6814, 6650, 6814, 0, 6…
## $ TRAVERSE          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ SKI               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ PARAPENTE         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ CAMPS             <dbl> 3, 3, 3, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 1, 3, 2, 2, 0…
## $ ROPE              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TOTMEMBERS        <dbl> 0, 12, 20, 14, 6, 2, 4, 1, 1, 6, 7, 7, 20, 3, 5, 1, …
## $ SMTMEMBERS        <dbl> 0, 8, 14, 9, 6, 2, 0, 1, 0, 1, 7, 7, 19, 2, 0, 1, 1,…
## $ MDEATHS           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TOTHIRED          <dbl> 6, 0, 22, 19, 8, 1, 2, 1, 0, 6, 14, 0, 24, 3, 0, 4, …
## $ SMTHIRED          <dbl> 6, 0, 21, 14, 8, 1, 0, 1, 0, 3, 12, 0, 24, 3, 0, 4, …
## $ HDEATHS           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ NOHIRED           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ O2USED            <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ O2NONE            <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
## $ O2CLIMB           <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ O2DESCENT         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ O2SLEEP           <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ O2MEDICAL         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ O2TAKEN           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ O2UNKWN           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ OTHERSMTS         <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Climbed Lobuche", N…
## $ CAMPSITES         <chr> "BC,ABC,C1,C2,C3,Smt(26/05)", "BC,ABC,C1,C2,C3,Smt(2…
## $ ROUTEMEMO         <dbl> NA, 221011, 203869, NA, 29755, 107752, 29661, 17154,…
## $ ACCIDENTS         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ ACHIEVMENT        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "1st Arabic …
## $ AGENCY            <chr> "Holy Mountain Adventure", NA, "Holy Mountain Advent…
## $ COMRTE            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
## $ STDRTE            <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ PRIMRTE           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ PRIMMEM           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ PRIMREF           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ PRIMID            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ CHKSUM            <dbl> 2465291, 2465292, 2465293, 2463299, 2463299, 2463320…

However, many of the columns were not relevant for visualisation tasks due to the structure of the available data. For example, columns like ROUTE3, SUCCESS3 and ASCENT3 were entirely FALSE or NA, rendering them unusable for analysis. Similarly, columns such as LEADERS and SPONSOR contained an overwhelming number of unique values, making them unsuitable for identifying clear trends. As such, careful feature selection was performed after our target questions were formulated, providing a cleaner dataset to work with. Further refining of the dataset was then performed while tackling each part of the analysis.

exped_tidy <- exped_tidy %>%
  select(- c(ROUTE3, SUCCESS3, ASCENT3, ROUTE4, SUCCESS4, ASCENT4, SEASON, HOST, HOST_FACTOR, LEADERS, SPONSOR, CLAIMED, BCDATE, TERMNOTE, PARAPENTE, OTHERSMTS, ROUTEMEMO, ACCIDENTS, ACHIEVMENT, STDRTE:PRIMID, CHKSUM)) %>% 
  distinct()

Additionally, due to the large number of unique peaks in the dataset, we will focus on the six most-climbed peaks for some of the visualisations below. This selection helps to prevent visual clutter that could obscure key trends and reduce the visual appeal of the plots.

Moreover, since many peaks have only a small number of recorded attempts, including them may introduce bias, as the limited sample size may not accurately reflect the overall performance of expeditions on those peaks. As such, we will choose to exclude them from our analysis.

top_peaks <- exped_tidy %>%
  group_by(PEAKID) %>%
  summarise(num_climbs = n_distinct(EXPID)) %>%
  slice_max(num_climbs, n = 6) %>%
  left_join(peaks_tidy, by = 'PEAKID') %>%
  select(PEAKID, PKNAME, HEIGHTM, `Number of Attempts` = num_climbs)

knitr::kable(top_peaks, caption = "Top 6 most-climbed peaks")
Top 6 most-climbed peaks

PEAKID  PKNAME         HEIGHTM  Number of Attempts
EVER    Everest           8849                 189
AMAD    Ama Dablam        6814                 147
MANA    Manaslu           8163                  96
LHOT    Lhotse            8516                  79
HIML    Himlung Himal     7126                  51
DHA1    Dhaulagiri I      8167                  35

We also explored other descriptive statistics within the dataset while formulating our topics for analysis and visualisation. We first observed that roughly 29% of expeditions (258 of 882) were unsuccessful. This raises the question of whether success rates are relatively consistent across different peaks or if they vary due to differing peak conditions. We will investigate this in visualisation 1.

exped_tidy %>% 
  count(SUCCESS1 == TRUE)
## # A tibble: 2 × 2
##   `SUCCESS1 == TRUE`     n
##   <lgl>              <int>
## 1 FALSE                258
## 2 TRUE                 624

A key variable that may contribute to expedition success is the number of hired personnel, such as guides and porters, who support the climb. The relationship between this variable and expedition outcomes will be examined in visualisation 2.

exped_tidy %>% 
  count(TOTHIRED > 0) %>% 
  mutate(PERCENTAGE = round((n * 100 / sum(n)), 2))
## # A tibble: 2 × 3
##   `TOTHIRED > 0`     n PERCENTAGE
##   <lgl>          <int>      <dbl>
## 1 FALSE            165       18.7
## 2 TRUE             717       81.3

Lastly, we noted a wide range of reasons behind expedition terminations. These will be explored further in visualisation 3.

exped_tidy %>% 
  filter(!(TERMREASON %in% 1:3)) %>% 
  count(TERMREASON_FACTOR) %>% 
  arrange(desc(n))
## # A tibble: 11 × 2
##    TERMREASON_FACTOR                                                           n
##    <chr>                                                                   <int>
##  1 Bad weather (storms, high winds)                                           78
##  2 Bad conditions (deep snow, avalanching, falling ice, or rock)              54
##  3 Illness, AMS, exhaustion, or frostbite                                     25
##  4 Unknown                                                                    23
##  5 Did not attempt climb                                                      12
##  6 Other                                                                      12
##  7 Route technically too difficult, lack of experience, strength, or moti…     9
##  8 Accident (death or serious injury)                                          5
##  9 Lack (or loss) of supplies, support or equipment                            4
## 10 Lack of time                                                                4
## 11 Did not reach base camp                                                     2

2.2 Peaks Data Set: peaks_tidy

glimpse(peaks_tidy) 
## Rows: 480
## Columns: 29
## $ PEAKID         <chr> "AMAD", "AMPG", "ANN1", "ANN2", "ANN3", "ANN4", "ANNE",…
## $ PKNAME         <chr> "Ama Dablam", "Amphu Gyabjen", "Annapurna I", "Annapurn…
## $ PKNAME2        <chr> "Amai Dablang", "Amphu Gyabien", NA, NA, NA, NA, NA, NA…
## $ LOCATION       <chr> "Khumbu Himal", "Khumbu Himal (N of Ama Dablam)", "Anna…
## $ HEIGHTM        <dbl> 6814, 5630, 8091, 7937, 7555, 7525, 8026, 8051, 7219, 7…
## $ HEIGHTF        <dbl> 22356, 18471, 26545, 26040, 24787, 24688, 26332, 26414,…
## $ HIMAL          <dbl> 12, 12, 1, 1, 1, 1, 1, 1, 1, 2, 2, 12, 15, 13, 3, 2, 12…
## $ HIMAL_FACTOR   <chr> "Khumbu", "Khumbu", "Annapurna", "Annapurna", "Annapurn…
## $ REGION         <dbl> 2, 2, 5, 5, 5, 5, 5, 5, 5, 7, 7, 2, 4, 3, 5, 7, 2, 7, 4…
## $ REGION_FACTOR  <chr> "Khumbu-Rolwaling-Makalu", "Khumbu-Rolwaling-Makalu", "…
## $ OPEN           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
## $ UNLISTED       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ TREKKING       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ TREKYEAR       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ RESTRICT       <chr> NA, "Opened in 2002", NA, NA, NA, NA, "Requires permit …
## $ PHOST          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1…
## $ PHOST_FACTOR   <chr> "Nepal only", "Nepal only", "Nepal only", "Nepal only",…
## $ PSTATUS        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2…
## $ PSTATUS_FACTOR <chr> "Climbed", "Climbed", "Climbed", "Climbed", "Climbed", …
## $ PEAKMEMO       <dbl> 8, 20, 23, 31, 35, 40, 44, 46, 48, 54, 62, 64, 70, 72, …
## $ PYEAR          <dbl> 1961, 1953, 1950, 1960, 1961, 1955, 1974, 1980, 1964, 1…
## $ PSEASON        <dbl> 1, 1, 1, 1, 1, 1, 1, 3, 3, 1, 0, 1, 1, 1, 1, 3, 1, 3, 1…
## $ PEXPID         <chr> "AMAD61101", "AMPG53101", "ANN150101", "ANN260101", "AN…
## $ PSMTDATE       <chr> "Mar 13", "Apr 11", "Jun 03", "May 17", "May 06", "May …
## $ PCOUNTRY       <chr> "New Zealand, USA, UK", "UK", "France", "UK, Nepal", "I…
## $ PSUMMITERS     <chr> "Mike Gill, Wally Romanes, Barry Bishop, Michael Ward",…
## $ PSMTNOTE       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "1st ascent claimed…
## $ REFERMEMO      <dbl> NA, NA, 25, 33, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PHOTOMEMO      <dbl> 13, NA, 26, 34, 37, 42, 45, NA, 50, 57, NA, 67, NA, NA,…

The second data set peaks_tidy contains 480 entries across 29 variables. It provides detailed information about each peak in the Himalayas. This information includes:

  1. Name and height of the peak
  2. Location
  3. Current status (Open/Closed for expeditions etc.)
  4. Details of the first recorded expedition

In our analysis, this dataset served as supplementary context to support our primary focus on exped_tidy. We also performed feature selection on peaks_tidy, excluding the columns from PYEAR to PHOTOMEMO (for example PYEAR and PSUMMITERS), which describe the first recorded ascents but were not relevant to the objectives of our study.

peaks_tidy <- peaks_tidy %>% 
  select(- c(PYEAR:PHOTOMEMO)) %>% 
  distinct()

3 Visualisation 1: Expedition Team Survival Rate by Peak and Elevation

3.1 Methodology

Upon exploring the dataset, we observed that climbers have attempted a wide range of Himalayan peaks with varying success rates. The differing characteristics of each peak, including altitude, weather conditions and terrain, meant that some peaks are inherently more challenging than others. For example, we might expect lower success rates on taller peaks like Everest, where extreme conditions and the demanding nature of the climb pose greater risks to the climbers. To better understand how these physical and environmental factors influence expedition outcomes, it would be valuable to visualise success rates across the different peaks.

To address this, we drew inspiration from the Kaplan-Meier curve, which is widely used in clinical studies to show survival rates over time (Goel et al., 2010). Instead of tracking time, however, we adapted this approach to represent the survival rate of expedition teams based on the altitude they reached before ending their climb. This adaptation is particularly useful, as it allows us to visualise the relative difficulty of different sections of the route. For example, a sharp drop in the curve at a certain altitude may indicate a hazardous terrain feature that many teams were unable to overcome, or it might represent a region where teams commonly experience severe fatigue and are forced to turn back. These insights are crucial, as they highlight how specific physical features of the peaks can significantly impact expedition success rates.

Given the large number of peaks in the dataset, we chose to focus on the six most-climbed Himalayan peaks to maintain visual clarity. To begin, we identified these top six peaks and calculated, for each expedition, the highest point reached (HIGHPOINT) as a fraction of the total peak height (HEIGHTM). This fraction is stored as a variable called PROGRESS.

top_peaks <- exped_tidy %>%
  group_by(PEAKID) %>%
  summarise(num_climbs = n_distinct(EXPID)) %>%
  slice_max(num_climbs, n = 6) %>%
  pull(PEAKID)

top_expeds <- exped_tidy %>%
  filter(PEAKID %in% top_peaks, HIGHPOINT != 0) %>%
  left_join(peaks_tidy, by = "PEAKID") %>%
  mutate(PROGRESS = HIGHPOINT / HEIGHTM) %>%
  mutate(REMAINING = 1 - PROGRESS) %>%
  select(EXPID, PEAKID, PKNAME, PROGRESS, REMAINING, HIGHPOINT, HEIGHTM, TERMREASON, TERMREASON_FACTOR) %>%
  arrange(PEAKID, PROGRESS)
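
As a quick worked instance of the PROGRESS calculation, consider the Ama Dablam expedition in the output above that was abandoned at 6650 m on a 6814 m peak. The variable names below are illustrative only and are not part of the analysis code.

```r
# Worked example of PROGRESS and REMAINING for a single expedition.
# HIGHPOINT comes from the glimpse output, HEIGHTM from the top-peaks table.
highpoint <- 6650   # metres reached before the team turned back
heightm   <- 6814   # summit height of Ama Dablam in metres

progress  <- highpoint / heightm   # fraction of the peak height attained
remaining <- 1 - progress          # fraction of the peak height left to climb

round(progress, 3)   # 0.976
round(remaining, 3)  # 0.024
```

This team therefore counts as "surviving" at every threshold up to roughly 97% of the peak height and drops out of the curve after that.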

We then divided the height of each peak into 100 equal segments and, for each segment, calculated the proportion of teams that reached at least that altitude. This value is stored as percent_remaining, indicating how many teams “survived” to that point in the climb.

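# For each progress threshold (0%, 1%, ..., 100% of the peak height), count the
# expeditions on each peak whose PROGRESS is at least that threshold.
# The hard-coded divisors are the per-peak expedition totals used to turn the
# counts into proportions; they must be updated by hand if the filters above change.
df1 <- data.frame(progress = seq(0, 1, by = 0.01)) %>%
  mutate(AMAD = map_int(progress, ~ sum(top_expeds$PROGRESS >= .x & top_expeds$PEAKID == 'AMAD'))/140,
         DHA1 = map_int(progress, ~ sum(top_expeds$PROGRESS >= .x & top_expeds$PEAKID == 'DHA1'))/31,
         EVER = map_int(progress, ~ sum(top_expeds$PROGRESS >= .x & top_expeds$PEAKID == 'EVER'))/185,
         HIML = map_int(progress, ~ sum(top_expeds$PROGRESS >= .x & top_expeds$PEAKID == 'HIML'))/45,
         LHOT = map_int(progress, ~ sum(top_expeds$PROGRESS >= .x & top_expeds$PEAKID == 'LHOT'))/77,
         MANA = map_int(progress, ~ sum(top_expeds$PROGRESS >= .x & top_expeds$PEAKID == 'MANA'))/88) %>%
  # one row per (progress, peak) pair for plotting
  pivot_longer(cols = -progress, names_to = "peak", values_to = "percent_remaining")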

# A glimpse at the structure of the processed data frame
df1 %>% filter(peak == 'EVER') %>%
  arrange(desc(progress)) %>%
  relocate(peak) %>%
  head(8)
## # A tibble: 8 × 3
##   peak  progress percent_remaining
##   <chr>    <dbl>             <dbl>
## 1 EVER      1                0.908
## 2 EVER      0.99             0.914
## 3 EVER      0.98             0.919
## 4 EVER      0.97             0.919
## 5 EVER      0.96             0.924
## 6 EVER      0.95             0.924
## 7 EVER      0.94             0.924
## 8 EVER      0.93             0.924
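
The same survival proportions can also be computed without hard-coded divisors. The following is a minimal sketch on toy data, not the code used for the report: the peak codes AAAA and BBBB, the EXPID values and the PROGRESS values are all hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for top_expeds; peak codes and values are hypothetical.
toy_expeds <- tibble(
  EXPID    = paste0("X", 1:6),
  PEAKID   = c("AAAA", "AAAA", "AAAA", "AAAA", "BBBB", "BBBB"),
  PROGRESS = c(1.00, 1.00, 0.90, 0.75, 1.00, 0.80)
)

# The 101-point grid is built by division so that grid points such as 0.9
# compare equal to the literal 0.9, which is not guaranteed for
# seq(0, 1, by = 0.01).
survival <- expand_grid(toy_expeds, progress = (0:100) / 100) %>%
  group_by(peak = PEAKID, progress) %>%
  # share of that peak's expeditions whose PROGRESS is at least this threshold
  summarise(percent_remaining = mean(PROGRESS >= progress), .groups = "drop")

# Final success rate per peak (proportion reaching 100% of the height)
survival %>% filter(progress == 1)
```

Because mean() of a logical vector is the proportion of TRUE values, the per-peak total is taken from the data itself, so the result stays correct if the upstream filters change.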

Finally, we created a plot showing the proportion of teams that reached each altitude interval relative to their progress towards the peak. A grey reference line is then added showing the average performance across all six peaks, allowing for comparison. We used facet_wrap() to generate separate plots for each peak, ensuring that each visualisation is clear and focused.

While we zoomed in on key segments of the plots to highlight areas of variation, we also acknowledged that trimming parts of the x-axis might distort interpretation. To address this, we added a label at the end of each line to indicate the final success rate, which corresponds to the proportion of teams that successfully reached the top of each peak.

reference_line <- df1 %>%
  group_by(progress) %>%
  summarise(percent_remaining = mean(percent_remaining))

peak_names <- c('AMAD' = 'Ama Dablam (6814m)',
                'DHA1' = 'Dhaulagiri I (8167m)',
                'EVER' = 'Everest (8849m)',
                'HIML' = 'Himlung Himal (7126m)',
                'LHOT' = 'Lhotse (8516m)',
                'MANA' = 'Manaslu (8163m)')

ggplot() +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_line(data = df1, 
            aes(x = progress, y = percent_remaining, color = peak), 
            linewidth = 1) +
  geom_line(data = reference_line, 
            aes(x = progress, y = percent_remaining), 
            color = "grey", linewidth = 1) +
  geom_point(data = df1 %>% filter(progress == 1),
             aes(x = progress, y = percent_remaining, color = peak), shape = 16, size = 2) +
  geom_text(data = df1 %>% filter(progress == 1),
             aes(x = progress, y = percent_remaining, color = peak, label = round(percent_remaining, 2)),
             nudge_y = -0.03, size = 3, fontface = "bold") +
  facet_wrap(~ peak, labeller = as_labeller(peak_names)) +  
  scale_x_continuous(labels = scales::percent_format(accuracy = 1), 
                     breaks = c(0.5, 0.75, 1.0),
                     limits = c(0.6, 1.01)) +  
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), 
                     breaks = c(0.5, 0.75, 1.0), 
                     limits = c(0.49, 1)) +
  coord_cartesian(clip = "off") +
  theme_minimal() +
  theme(legend.position = 'none', 
        axis.text = element_text(color = 'grey40', size = 9),
        axis.title = element_text(color = 'grey10', size = 9),
        strip.text.x = element_text(face = "bold"), 
        plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 9)) +
  labs(y = 'Teams Reaching or Surpassing This Height (%)',
       x = "Height Attained (% of Peak)", 
       title = 'Expedition Team Survival Rate by Elevation for the Top 6 Most-Climbed Peaks',
       subtitle = "Grey line indicates average performance of teams across all peaks. The number indicated at the end of
 each line shows the proportion of teams successfully summiting at the peak.")

3.2 Discussion

In general, we see from the visualisations that lower peaks below the ‘Death Zone’, including Himlung Himal and Ama Dablam, have higher than average success rates. For peaks above 8000m, however, it is surprising that difficulty, measured by final success rates and the magnitude of dips in the survival curves, is not necessarily linked to peak elevation. For instance, Everest, despite being the highest mountain, shows fewer dropouts along the ascent with a cumulative 91% success rate, markedly higher than lower peaks such as Dhaulagiri I, which has a success rate of just 58%.

This result may be explained by the commercialisation of climbing on popular peaks. With a high number of climbers, peaks such as Everest have benefited from better infrastructure, including more accurate weather forecasts, experienced guides, and well-established fixed rope systems (Ma, 2020). These elements could improve team performance and increase the success rate of expeditions. In contrast, less popular peaks like Dhaulagiri I may lack such support, thus contributing to their lower success rates despite being lower in elevation.

We also observe a sharp dip in the survival curve of Manaslu near the summit. This is because many expeditions stopped at the foresummit rather than reaching the true summit, which is not visible from the main route and requires crossing a dangerous ridge. The existence of a taller summit was only verified in 2021 when drone footage confirmed the location of the true summit. Before this, many people believed the foresummit was the highest point of the peak and turned back after reaching it (Horrell, 2021). As a result, only 59% of expeditions are recorded as reaching the actual summit.

4 Visualisation 2: Hired Personnel Count for Successful and Unsuccessful Expeditions

4.1 Methodology

We observed that about 81% of expeditions involved hiring support personnel. As such, we intended to investigate the correlation between the number of hired members and expedition success rates. We hypothesise that a greater number of hired members would lead to higher odds of success, as these experienced individuals carry out critical duties such as “breaking trail and fixing rope ahead of the other climbers, as well as transporting supplies and guiding” (Sherman et al., 2013). Understanding this relationship could help improve expedition planning by balancing safety considerations with financial costs.

For this visualisation, a grouped box plot was generated, plotting TOTHIRED against PEAKID to compare the distribution of hired personnel for the six most frequently attempted peaks. Each expedition was further sub-categorised into success or failure according to SUCCESS1. By filtering for HIGHPOINT values greater than 0, we ensured that only summit attempts were included, excluding expeditions aborted before the climb began.

A box plot is particularly effective for this analysis because it visualises the distribution, median, and variability in the number of hired personnel across expedition outcomes. By breaking it down by peak and success status, we can visually assess whether successful expeditions generally employ more support staff, and whether this trend holds consistently across different peaks. This is important, as reliance on hired help may vary depending on the technical difficulty or altitude of the peak — some may demand greater logistical support, while others may not benefit as clearly from it.

# Get top 6 most common peaks
top_peaks2 <- exped_tidy %>%
  filter(HIGHPOINT >0) %>% 
  count(PEAKID, sort = TRUE) %>%
  slice_max(n, n = 6)

count_peaks2 <- exped_tidy %>% 
  filter(PEAKID %in% top_peaks2$PEAKID, 
         TOTMEMBERS != 0,
         HIGHPOINT >0) %>% 
  count(PEAKID,SUCCESS1) %>% 
  left_join(peaks_tidy %>% select(PEAKID, PKNAME), by = "PEAKID") %>% 
  pivot_wider(names_from = SUCCESS1, values_from = n) %>%
  mutate(total = `FALSE` + `TRUE`) %>% 
  arrange(total) %>% 
  mutate(label = paste0(PKNAME, " \n(", `TRUE`, ":", `FALSE`, " attempts)")) %>% 
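  # name each label by its PEAKID so that scale_y_discrete() matches labels
  # to peaks by name rather than by position
  pull(label, name = PEAKID)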


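# Filter and sort expeditions in top 6 peaks
# (same TOTMEMBERS and HIGHPOINT filters as the label counts, so the boxes
# describe the same expeditions that the axis labels count)
sorted_data <- exped_tidy %>%
  filter(PEAKID %in% top_peaks2$PEAKID,
         TOTMEMBERS != 0,
         HIGHPOINT > 0) %>% 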
  mutate(PEAKID = factor(PEAKID, levels = rev(top_peaks2$PEAKID))) %>% 
  arrange(desc(PEAKID))


# boxplot of TOTHIRED for each peak, split by expedition outcome
ggplot(sorted_data, aes(x = TOTHIRED, y = PEAKID, fill = SUCCESS1)) +
  geom_boxplot(width = 0.6, alpha = 0.4, outlier.shape = NA, linewidth = 0.7) +
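  scale_fill_manual(name = "Expedition Status",
                    labels = c("Success", "Failure"),
                    values = c("TRUE" = "green", "FALSE" = "red"),
                    breaks = c("TRUE", "FALSE")) +
  # zoom with coord_cartesian() so that expeditions with more than 50 hired
  # staff still contribute to the box statistics instead of being dropped
  scale_x_continuous(breaks = seq(0, 50, 10)) +
  coord_cartesian(xlim = c(0, 50)) +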
  scale_y_discrete(labels = count_peaks2) +
  labs(
    x = "Number of Hired Personnel",
    y = "Peaks",
    title = "Hired Member Trends by Expedition Outcome (Top 6 Peaks)",
    subtitle = "The split of success:failure attempts at each peak is reflected in the y-axis labels."
  ) +
  theme_minimal() +
  theme(axis.text = element_text(color = 'grey40', size = 9),
        axis.title = element_text(color = 'grey10', size = 9),
        strip.text.x = element_text(face = "bold"), 
        plot.title = element_text(hjust = 0.5, size = 13, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 9),
        legend.position = c(0.85, 0.12),
        legend.background = element_rect(fill = "white"))
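
When an axis is trimmed on a box plot, the choice between scale limits and coord_cartesian() affects the statistics that are drawn, not just the view. The sketch below illustrates this on hypothetical toy data, using a vertical box plot for simplicity; layer_data() is used to read back the median that ggplot2 computed.

```r
library(ggplot2)

# Hypothetical toy data: five expeditions on one peak, two above the 50-person cut-off
toy <- data.frame(peak = "AAAA", hired = c(10, 20, 30, 60, 70))

base_plot <- ggplot(toy, aes(x = peak, y = hired)) + geom_boxplot()

# Scale limits discard out-of-range rows before the box plot statistics are computed
p_scale <- base_plot + scale_y_continuous(limits = c(0, 50))

# coord_cartesian() only zooms the view; all rows still feed the statistics
p_coord <- base_plot + coord_cartesian(ylim = c(0, 50))

# Median drawn in each version (ggplot2 warns about the removed rows in the first)
median_scale <- suppressWarnings(layer_data(p_scale)$middle)  # median of 10, 20, 30
median_coord <- layer_data(p_coord)$middle                    # median of all five values
```

In this toy case the two medians differ (20 versus 30), so the zooming approach is the safer one whenever the quartiles are meant to describe every expedition.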

4.2 Discussion

A mostly consistent trend observed across the six peaks is that a greater number of hired personnel appears to correlate with higher odds of success. This is evident from the higher median number of hired members in the success subcategory compared to the failure subcategory for most peaks.

A plausible explanation for this observation is that hired personnel often bear the brunt of carrying supplies and equipment for the expedition team. Hiring greater numbers of Sherpas (climbing guides), who are physiologically adapted to the oxygen-poor, high-altitude environment, helps reduce the physical strain on less-acclimatised foreign climbers, thereby increasing the likelihood of success (Unsung Heroes: The Sherpas of Everest, 2025).

However, Lhotse interestingly shows a higher median TOTHIRED in the failure subcategory compared to successful expeditions. Additionally, in the case of Himlung Himal, the median for failed expeditions is equal to the lower quartile. These irregularities could be attributed to small sample bias in the failure cases, which may have distorted the summary statistics represented in the box plot.
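
To judge whether an irregular median comes from a small group, the group sizes can be tabulated next to the medians. The following is a minimal sketch on hypothetical toy data (the peak codes and TOTHIRED values are made up); the column names n_expeditions, median_hired and lower_quartile are illustrative.

```r
library(dplyr)

# Hypothetical toy data standing in for the filtered expedition table
toy <- tibble(
  PEAKID   = c("AAAA", "AAAA", "AAAA", "AAAA", "AAAA", "BBBB", "BBBB", "BBBB"),
  SUCCESS1 = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
  TOTHIRED = c(8, 12, 10, 4, 6, 3, 5, 9)
)

# Group size, median and lower quartile of hired personnel per peak and outcome
hired_summary <- toy %>%
  group_by(PEAKID, SUCCESS1) %>%
  summarise(
    n_expeditions  = n(),
    median_hired   = median(TOTHIRED),
    lower_quartile = quantile(TOTHIRED, 0.25, names = FALSE),
    .groups = "drop"
  )

hired_summary
```

In the toy table, the failure group for BBBB contains a single expedition, so its median is simply that one value; a group like this is where a box plot is least informative.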

5 Visualisation 3: The Number of Terminations Against the Highest Point Reached

5.1 Methodology

Lastly, we aimed to explore the reasons why expedition teams failed to reach the summit and whether these reasons varied depending on the maximum altitude reached during the attempt. Identifying such patterns could offer valuable insights for future teams to better prepare and plan, increasing their chances of reaching the summit and returning safely.

A bar chart was chosen for this visualisation because it clearly displays discrete categories (TERMREASON_FACTOR) and the number of terminations in each. A slider with an autoplay option lets users adjust the HIGHPOINT threshold, so that only expeditions that ended below the chosen altitude are counted, making it easy to see how the frequency of each termination reason changes with altitude. Altogether, this approach offers a straightforward ranking of termination reasons, highlighting the most prevalent issues up to each altitude threshold. Note that the code chunk for the bar plot will not knit in the .Rmd file because it runs on Shiny; the link to the Shiny app is given below the code chunk.

# first 3 reasons are just about success and irrelevant to our plot,
# shortening the termination reasons for easier readability
exped_tidy_onlyfail <- exped_tidy %>%
  select(PEAKID, SEASON_FACTOR, TERMREASON, TERMREASON_FACTOR, HIGHPOINT) %>%
  filter(!TERMREASON %in% c(1,2,3)) %>%
  mutate(TERMREASON_FACTOR = case_when(
    TERMREASON == 4 ~ "Bad Weather",
    TERMREASON == 5 ~ "Bad Conditions",
    TERMREASON == 10 ~ "Too Difficult",
    TRUE ~ TERMREASON_FACTOR
  )) %>%
  glimpse()
## Rows: 228
## Columns: 5
## $ PEAKID            <chr> "AMAD", "AMAD", "HIML", "MANA", "MANA", "ANN1", "BAR…
## $ SEASON_FACTOR     <chr> "Autumn", "Autumn", "Autumn", "Winter", "Winter", "S…
## $ TERMREASON        <dbl> 4, 12, 4, 5, 5, 5, 5, 7, 12, 7, 5, 5, 7, 7, 7, 7, 7,…
## $ TERMREASON_FACTOR <chr> "Bad Weather", "Did not attempt climb", "Bad Weather…
## $ HIGHPOINT         <dbl> 6650, 0, 6400, 7050, 6550, 7400, 6900, 5750, 0, 6100…

We removed the rows that resulted in a successful summit as we are focusing on the reasons for terminations. To improve the readability of the chart, we also only selected the Top 7 reasons for terminations. Termination factors that were long or ambiguous were also relabeled for clarity, reducing visual clutter in the plot and making it easier to understand.

ui <- fluidPage(
  titlePanel("Top termination reasons below highpoint threshold"),
  
  sidebarLayout(
    sidebarPanel(
      sliderInput("highpointInput", 
                  "Max Highpoint (meters):", 
                  min = 0, max = 9000, value = 5000, step = 500,
                  animate = animationOptions(interval = 1000, loop = TRUE))
    ),
    
    mainPanel(
      plotOutput("failPlot")
    )
  )
)

server <- function(input, output) {
  
  output$failPlot <- renderPlot({
    
    # Filter by highpoint
    filtered_data <- exped_tidy_onlyfail %>%
      filter(HIGHPOINT < input$highpointInput)
    
    # Count term reasons
    term_counts <- filtered_data %>%
      count(TERMREASON_FACTOR, sort = TRUE) %>%
      slice_max(n, n = 7)
    
    # Plot the counts as a horizontal bar chart
    term_counts %>%
      ggplot() +
      geom_col(
        aes(
          x = reorder(str_wrap(TERMREASON_FACTOR, 15), n),
          y = n,
          fill = TERMREASON_FACTOR
        ),
        position = "dodge2"
      ) +
      geom_text(aes(x = reorder(str_wrap(TERMREASON_FACTOR, 15), n),
                    y = n,
                    label = n),
                hjust = -.5, size = 5) +
      scale_y_continuous(
        limits = c(0, 80)
      ) +
      coord_flip() +
      theme_minimal() +
      labs(
        title = paste("Termination Reasons (Highpoint <", input$highpointInput, "m)"),
        x = NULL,
        y = "Count"
      ) +
      theme(legend.position = "none", 
            axis.text.y = element_text(size = 14),
            axis.title.y = element_text(size = 14))
  })
}

shinyApp(ui = ui, server = server)

The Shiny app is available here: https://joellovesdata.shinyapps.io/project/
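Because the Shiny chunk does not knit, a static version of the same chart can be useful inside the report. The following is a minimal sketch, not the code behind the app: `count_fail_reasons()` and `plot_fail_reasons()` are hypothetical helper names, and `toy_fail` is a made-up stand-in for exped_tidy_onlyfail so that the example runs on its own.

```r
library(dplyr)
library(ggplot2)
library(stringr)

# Hypothetical helper: the same filter-and-count step as the Shiny server,
# with the threshold passed as an ordinary argument instead of input$highpointInput.
count_fail_reasons <- function(data, max_highpoint, n_reasons = 7) {
  data %>%
    filter(HIGHPOINT < max_highpoint) %>%
    count(TERMREASON_FACTOR, sort = TRUE) %>%
    slice_max(n, n = n_reasons, with_ties = FALSE)
}

# Hypothetical helper: draws one static bar chart for a single threshold.
plot_fail_reasons <- function(data, max_highpoint) {
  count_fail_reasons(data, max_highpoint) %>%
    ggplot(aes(x = reorder(str_wrap(TERMREASON_FACTOR, 15), n), y = n)) +
    geom_col() +
    geom_text(aes(label = n), hjust = -0.5) +
    coord_flip() +
    theme_minimal() +
    labs(
      title = paste("Termination Reasons (Highpoint <", max_highpoint, "m)"),
      x = NULL,
      y = "Count"
    )
}

# Toy rows standing in for exped_tidy_onlyfail (values are made up).
toy_fail <- tibble(
  TERMREASON_FACTOR = c("Bad Weather", "Bad Weather",
                        "Bad Conditions", "Did not attempt climb"),
  HIGHPOINT         = c(6650, 6400, 7050, 0)
)

toy_counts  <- count_fail_reasons(toy_fail, max_highpoint = 7000)
static_plot <- plot_fail_reasons(toy_fail, max_highpoint = 7000)
```

In the report itself, a call such as `plot_fail_reasons(exped_tidy_onlyfail, 5000)` would produce one frame of the animation as an ordinary ggplot object that knits normally.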

5.2 Discussion

Between 1000 and 5000 meters, the number of terminations remained relatively stable. This may be due to the comparatively mild environmental conditions at these altitudes, where oxygen levels are still relatively sufficient and teams are not yet significantly affected by altitude sickness.

From 5000 meters and above, however, we observed a sharp increase in the number of terminations. The most common reasons were bad weather and poor conditions, which is consistent with environmental challenges becoming more extreme at higher altitudes. Climbers in these zones face severe hypoxia, cold, and dehydration, pushing them to their physical and psychological limits (Huey & Eguskitza, 2001).

Interestingly, a notable number of terminations also occurred between 0 and 1000 meters. These were primarily attributed to bad weather or the team making no attempt at all. It is likely that teams encountering early signs of unfavourable weather chose to abort the expedition at the outset, as it is safer to turn back at lower elevations than to face increased risks higher up the mountain.

Lastly, for terminations occurring above 6000 meters, medical-related reasons became markedly more frequent. This is consistent with oxygen levels decreasing at higher altitudes, which lowers oxygen saturation in the body (Mathew & Sharma, 2023). At altitudes above 5500 m, it is also not possible to acclimatise, which contributes to worsening physical condition (Prince et al., 2023). The harsher conditions at these elevations lead to a greater likelihood of injury or illness. Furthermore, being farther from base camp at such altitudes means that teams have limited access to supplies and support, making medical emergencies more dangerous and difficult to manage.
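The slider filters cumulatively (HIGHPOINT below the chosen threshold), so the band-by-band observations above are read off as differences between slider positions. A non-cumulative tabulation makes the same comparison directly. The sketch below is one way to do it under stated assumptions: the band edges in `band_breaks` are our own choice, and `toy_fail` contains made-up rows standing in for exped_tidy_onlyfail.

```r
library(dplyr)

# Toy rows standing in for exped_tidy_onlyfail (values and labels are made up).
toy_fail <- tibble(
  TERMREASON_FACTOR = c("Bad Weather", "Did not attempt climb", "Bad Weather",
                        "Bad Conditions", "Illness"),
  HIGHPOINT         = c(6650, 0, 6400, 7050, 5750)
)

# Assumed band edges in metres; right = FALSE makes each band [lower, upper),
# so a high point of 0 falls in the first band.
band_breaks <- c(0, 1000, 5000, 6000, 9000)

band_counts <- toy_fail %>%
  mutate(
    ALT_BAND = cut(HIGHPOINT, breaks = band_breaks, right = FALSE, dig.lab = 4)
  ) %>%
  count(ALT_BAND, TERMREASON_FACTOR)

band_counts
```

Replacing `toy_fail` with exped_tidy_onlyfail would give the number of terminations per reason within each band, which can then be compared against the trends described in this section.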

6 Citations

Goel, M. K., Khanna, P., & Kishore, J. (2010). Understanding survival analysis: Kaplan-Meier estimate. International Journal of Ayurveda Research, 1(4), 274–278. https://doi.org/10.4103/0974-7788.76794

Horrell, M. (2024, December 12). Amazing drone photos of the summit of Manaslu help to set the record straight. https://www.markhorrell.com/blog/2021/amazing-drone-photos-of-the-summit-of-manaslu-help-to-set-the-record-straight/

Huey, R. B., & Eguskitza, X. (2001). Limits to human performance: Elevated risks on high mountains. The Journal of Experimental Biology, 204(Pt 18), 3115–3119. https://doi.org/10.1242/jeb.204.18.3115

Ma, M. (2020, August 26). Mount Everest summit success rates double, death rate stays the same over last 30 years. UW News. https://www.washington.edu/news/2020/08/26/mount-everest-summit-success-rates-double-death-rate-stays-the-same-over-last-30-years/

Mathew, T. M., & Sharma, S. (2023, April 10). High Altitude Oxygenation. Nih.gov; StatPearls Publishing. https://www.ncbi.nlm.nih.gov/books/NBK539701/

Prince, T. S., Thurman, J., & Huebner, K. (2023, July 10). Acute Mountain Sickness. Nih.gov; StatPearls Publishing. https://www.ncbi.nlm.nih.gov/books/NBK430716/

Sherman, E. L., & Chatman, J. A. (2013). National diversity under pressure: Group composition and expedition success in Himalayan mountaineering. University of California, Berkeley, Haas School of Business. https://www.haas.berkeley.edu/wp-content/uploads/Sherman-Chatman-2013.pdf

Unsung heroes: The Sherpas of Everest. (2025, April 16). Populous. https://populous.com/article/unsung-heroes-the-sherpas-of-everest