For this assignment I have used the Himalayan Database, which is available through the TidyTuesday project. It has data on mountain climbing expeditions in the Himalayas from 1905 to 2019. I thought it would be interesting because it has a lot of different variables to explore and it tells a story about how mountaineering has changed over time.
The dataset has three tables. One has expedition-level info, one has info about each peak, and one has individual climber records. I merged them together to do the analysis.
I looked at a few things: how many expeditions happened each year, whether season affects success, how altitude affects success and deaths, and which countries send the most climbers.
# loading the packages I need
library(data.table)
library(ggplot2)
library(RColorBrewer)
library(scales)
# reading the data directly from github using fread
exped <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-09-22/expeditions.csv")
peaks <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-09-22/peaks.csv")
members <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-09-22/members.csv")
# checking how many rows each table has
cat("expeditions:", nrow(exped), "rows\n")
## expeditions: 10364 rows
cat("peaks:", nrow(peaks), "rows\n")
## peaks: 468 rows
cat("members:", nrow(members), "rows\n")
## members: 76519 rows
I merged the three tables together. First, computed summit rate from the members table since that’s where individual success is recorded. Then I joined it with the expedition and peaks tables.
# first computed how many members summited per expedition
# decided to exclude hired staff (guides, porters) and only count actual climbing members
summit_agg <- members[
hired == FALSE,
.(total_members = .N,
summitters = sum(success, na.rm = TRUE)),
by = expedition_id
]
summit_agg[, summit_rate := summitters / pmax(total_members, 1)]
# now merging expeditions with peak info
dt <- merge(
exped,
peaks[, .(peak_id, height_metres, climbing_status, first_ascent_year, first_ascent_country)],
by = "peak_id",
all.x = TRUE
)
# then merged in the summit rate we computed
dt <- merge(
dt,
summit_agg[, .(expedition_id, summit_rate, total_members, summitters)],
by = "expedition_id",
all.x = TRUE
)
cat("merged table has", nrow(dt), "rows\n")
## merged table has 10364 rows
# keeping only expeditions from 1950 onwards
# also removing rows where height or summit rate is missing
dt <- dt[!is.na(year) & year >= 1950 & !is.na(height_metres) & !is.na(summit_rate)]
cat("rows after filtering:", nrow(dt), "\n")
## rows after filtering: 10276
# did anyone die on this expedition
dt[, any_death := (member_deaths + hired_staff_deaths) > 0]
# grouping peaks by height into four categories
dt[, height_cat := fcase(
height_metres >= 8000, "8000m+",
height_metres >= 7000, "7000-7999m",
height_metres >= 6000, "6000-6999m",
default = "Below 6000m"
)]
# making it an ordered factor so plots show in the right order
dt[, height_cat := factor(height_cat,
levels = c("Below 6000m", "6000-6999m", "7000-7999m", "8000m+"))]
# same for season
dt[, season := factor(season,
levels = c("Spring", "Summer", "Autumn", "Winter", "Unknown"))]
# which decade each expedition happened
dt[, decade := floor(year / 10) * 10]
# yearly summary for the trend plot
yearly <- dt[,
.(n_expeditions = .N,
avg_summit = mean(summit_rate, na.rm = TRUE),
pct_o2 = mean(oxygen_used, na.rm = TRUE)),
by = year
][order(year)]
# death rate by decade and altitude band
decade_deaths <- dt[
!is.na(height_cat) & !is.na(decade),
.(death_rate = mean(any_death, na.rm = TRUE),
n = .N),
by = .(decade, height_cat)
][n >= 5]
# summary by nationality from the members table
nation_summary <- members[
hired == FALSE & !is.na(citizenship),
.(n_members = .N,
summit_rate = mean(success, na.rm = TRUE)),
by = citizenship
][n_members >= 100][order(-n_members)]
# termination reasons by season for the stacked bar chart
top_reasons <- dt[, .N, by = termination_reason][order(-N)][1:6, termination_reason]
term_plot <- dt[
termination_reason %in% top_reasons & season != "Unknown",
.N,
by = .(season, termination_reason)
]
# made a simple theme to apply to all plots
my_theme <- theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 13),
plot.subtitle = element_text(size = 10, color = "gray40"),
axis.title = element_text(size = 10),
legend.position = "bottom",
panel.grid.minor = element_blank()
)
theme_set(my_theme)
# line chart with points colored by average summit rate that year
ggplot(yearly, aes(x = year, y = n_expeditions)) +
geom_line(color = "steelblue", linewidth = 0.9) +
geom_point(aes(color = avg_summit), size = 2) +
scale_color_distiller(palette = "RdYlGn", direction = 1,
name = "Avg summit rate",
labels = percent_format(accuracy = 1)) +
labs(title = "Number of Himalayan expeditions per year",
subtitle = "Point color shows average summit success rate that year",
x = "Year",
y = "Number of expeditions")
Expeditions have gone up a lot since the 1950s. There’s a big drop around 2015 which I think is related to the Everest avalanche in 2015 and the earthquake(most probably). The point color shows that early expeditions had pretty decent success rates, probably because they were big planned national expeditions going to specific peaks.
# boxplot comparing summit rates across seasons
ggplot(dt[season != "Unknown"],
aes(x = season, y = summit_rate, fill = season)) +
geom_boxplot(outlier.alpha = 0.3) +
scale_fill_brewer(palette = "Set2", name = "Season") +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
labs(title = "Summit success rate by season",
x = "Season",
y = "Summit rate")
Spring is clearly the best season for climbing. Winter expeditions have very spread out results which makes sense because conditions are unpredictable. Summer is almost avoided entirely, probably because of monsoon season.
# violin plot with a boxplot inside to show distribution and summary stats
ggplot(dt[!is.na(height_cat)],
aes(x = height_cat, y = summit_rate, fill = height_cat)) +
geom_violin(alpha = 0.6, trim = TRUE) +
geom_boxplot(width = 0.12, fill = "white", outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues", name = "Height") +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
labs(title = "Summit rate by peak altitude",
subtitle = "Violin shows the full distribution, box inside shows median and IQR",
x = "Altitude category",
y = "Summit rate")
The higher the peak, the lower and more varied the success rates. For 8000m+ peaks a lot of expeditions end with zero summits.
# bar chart of top 15 nationalities
top15 <- head(nation_summary, 15)
ggplot(top15,
aes(x = reorder(citizenship, n_members),
y = n_members,
fill = summit_rate)) +
geom_col() +
coord_flip() +
scale_fill_distiller(palette = "YlGn", direction = 1,
name = "Summit rate",
labels = percent_format(accuracy = 1)) +
scale_y_continuous(labels = comma) +
labs(title = "Top 15 nationalities by total climbers",
subtitle = "Only nationalities with at least 100 climbers shown",
x = "Nationality",
y = "Total climbers")
The USA and Japan send the most climbers by far. The color shows summit rate – some smaller nations actually have quite high success rates which might mean only very experienced climbers from those countries attempt these peaks.
# line chart showing how death rates changed over time by altitude band
ggplot(decade_deaths,
aes(x = decade, y = death_rate,
color = height_cat, group = height_cat)) +
geom_line(linewidth = 1.1) +
geom_point(aes(size = n), alpha = 0.7) +
scale_color_brewer(palette = "Spectral", name = "Altitude") +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
scale_size_continuous(name = "Expeditions", range = c(2, 7)) +
labs(title = "Expedition death rate by decade and altitude",
subtitle = "Share of expeditions with at least one death; point size = number of expeditions",
x = "Decade",
y = "Death rate")
Death rates have gone down a lot on high peaks since the 1950s. This is probably because of better equipment, weather forecasting, and rescue options. But 8000m+ peaks are still the most dangerous by far.
# scatter plot with trend lines, split by altitude band
o2_plot <- dt[!is.na(height_cat) & !is.na(oxygen_used) & !is.na(summit_rate)]
o2_plot[, o2_label := ifelse(oxygen_used, "Oxygen used", "No oxygen")]
ggplot(o2_plot,
aes(x = total_members, y = summit_rate, color = o2_label)) +
geom_point(alpha = 0.15, size = 0.8) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
facet_wrap(~height_cat, scales = "free_x") +
scale_color_brewer(palette = "Set1", name = "Oxygen") +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
scale_x_log10() +
labs(title = "Oxygen use and summit success by altitude band",
x = "Expedition size (log scale)",
y = "Summit rate")
On 8000m+ peaks, using oxygen makes a clear difference. On lower peaks it doesn’t really matter which makes sense since altitude sickness is mainly a problem above around 7500m.
# stacked bar chart showing termination reasons by season
ggplot(term_plot,
aes(x = season, y = N, fill = termination_reason)) +
geom_col(position = "fill") +
scale_fill_brewer(palette = "Set3", name = "Reason") +
scale_y_continuous(labels = percent_format()) +
labs(title = "How expeditions end, by season",
subtitle = "Top 6 most common termination reasons",
x = "Season",
y = "Proportion")
Most spring and autumn expeditions end in success. In winter, a much bigger share ends due to bad weather or conditions. This confirms what we saw in the boxplot earlier – winter is genuinely harder.
# scatter with labels on the tallest peaks
peak_firsts <- dt[
!is.na(height_metres) & !is.na(year),
.(first_year = min(year),
height = first(height_metres),
name = first(peak_name)),
by = peak_id
]
label_peaks <- peak_firsts[order(-height)][1:12]
ggplot(peak_firsts, aes(x = first_year, y = height)) +
geom_point(aes(color = height), alpha = 0.6, size = 2) +
geom_text(data = label_peaks, aes(label = name),
size = 2.8, hjust = -0.1, vjust = 0) +
scale_color_distiller(palette = "YlOrRd", direction = 1,
name = "Height (m)", labels = comma) +
scale_y_continuous(labels = comma) +
labs(title = "Peak height vs year of first expedition",
subtitle = "Labels show the 12 tallest peaks",
x = "Year of first expedition",
y = "Height (metres)")
The tallest peaks like Everest and K2 had expeditions very early. Smaller peaks show up later in the data, either because they were climbed later or because records weren’t kept as well for smaller expeditions.
Looking at the Himalayan database, a few things stand out. Expedition numbers grew a lot over the decades but deaths have actually gone down, probably because of better gear and technology. Altitude is the biggest factor in whether an expedition succeeds. Spring is by far the best season. Using oxygen helps a lot on the highest peaks but doesn’t matter much below 7000m.
It was interesting to see that even on the most dangerous peaks, success rates have improved over time. But 8000m+ peaks are still responsible for almost all the deaths in the dataset.