Apply9

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

members <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv')

## Rows: 76519 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): expedition_id, member_id, peak_id, peak_name, season, sex, citizen...
## dbl  (5): year, age, highpoint_metres, death_height_metres, injury_height_me...
## lgl  (6): hired, success, solo, oxygen_used, died, injured
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Explore data

Goal: Is there any relationship between peaks and seasons in terms of deaths? (I’m assuming this is your goal)

Preparing data: Please understand why Julia prepared the employment data the way she did.

She wanted to investigate the relationship between demographics and occupation in employment. To this end, she transformed the data so that demographic variables became columns and occupations became row names. The values inside the matrix should be scaled employment ratios.

Plan a similar transformation for your climbing data. You want to investigate the relationship between peaks and seasons in terms of deaths. To this end, you need to transform the data so that seasons become columns and peaks become row names. The values inside the matrix should be each season's share of a peak's total deaths, scaled.
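The target shape described above can be sketched on a small example before touching the real data. The toy counts and the names toy, counts, shares and shares_scaled are made up for illustration; only base R is used here, whereas the analysis itself uses tidyverse verbs.

```r
# Toy data (hypothetical counts, not taken from the climbing dataset)
toy <- data.frame(
  peak   = rep(c("A", "B", "C"), each = 3),
  season = rep(c("Autumn", "Spring", "Winter"), times = 3),
  died   = c(6, 3, 1, 2, 8, 0, 1, 1, 2)
)

# Peaks as row names, seasons as columns
counts <- tapply(toy$died, list(toy$peak, toy$season), sum)

# Each season's share of a peak's total deaths (every row sums to 1)
shares <- counts / rowSums(counts)

# Standardize each column to mean 0 and standard deviation 1
shares_scaled <- scale(shares)

round(shares_scaled, 2)
```

Dividing by the row totals first puts peaks with many and few deaths on the same footing; scale() then makes the season columns comparable to each other.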

# Calculate total deaths by season and peaks
members_tidy <- members %>% 
    group_by(peak_name, season) %>%
    summarise(died = sum(died)) %>%
    ungroup()

## `summarise()` has grouped output by 'peak_name'. You can override using the
## `.groups` argument.
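The message above points at the .groups argument of summarise(). A minimal sketch on made-up data (the toy tibble is hypothetical) of dropping the grouping inside summarise(), which makes the separate ungroup() call unnecessary and silences the message:

```r
library(dplyr)

# Hypothetical member records: one row per climber
toy <- tibble(
  peak_name = c("A", "A", "B", "B"),
  season    = c("Spring", "Autumn", "Spring", "Spring"),
  died      = c(TRUE, FALSE, TRUE, TRUE)
)

# sum() on a logical column counts the TRUE values;
# .groups = "drop" returns an ungrouped tibble
toy_tidy <- toy %>%
  group_by(peak_name, season) %>%
  summarise(died = sum(died), .groups = "drop")

toy_tidy
```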

# Get the yearly total deaths. There is no TOTAL in the season variable
members_tidy %>%
    group_by(peak_name) %>%
    summarise(died_yearly = sum(died)) %>%
    ungroup()

## # A tibble: 391 × 2
##    peak_name          died_yearly
##    <chr>                    <int>
##  1 Aichyn                       0
##  2 Ama Dablam                  32
##  3 Amotsang                     0
##  4 Amphu Gyabjen                0
##  5 Amphu I                      0
##  6 Amphu Middle                 0
##  7 Anidesh Chuli                0
##  8 Annapurna I                 72
##  9 Annapurna I East             1
## 10 Annapurna I Middle           3
## # ℹ 381 more rows

members_demo <- members_tidy %>%
  filter(season %in% c("Winter", "Spring", "Autumn")) %>%
  pivot_wider(names_from = season, values_from = died, values_fill = 0) %>%
  # janitor::clean_names() %>%  # No need. The column names are already clean.
  left_join(members_tidy %>%
                group_by(peak_name) %>%
                summarise(died_yearly = sum(died)) %>%
                ungroup()) %>%
    
    # Remove outliers: peaks with no deaths or with 100 or more deaths
    filter(died_yearly > 0, died_yearly < 100) %>%
    
    # Calculate the seasonal percent of death per peak
    mutate(across(c(Autumn, Spring, Winter), ~ . / died_yearly),
           died_yearly = log(died_yearly),
           across(where(is.numeric), ~ as.numeric(scale(.))))

## Joining with `by = join_by(peak_name)`

members_demo

## # A tibble: 85 × 5
##    peak_name           Autumn  Spring  Winter died_yearly
##    <chr>                <dbl>   <dbl>   <dbl>       <dbl>
##  1 Ama Dablam          0.167  -0.128  -0.0987      1.72  
##  2 Annapurna I        -0.262   0.219   0.149       2.35  
##  3 Annapurna I East    0.907  -0.821  -0.302      -0.994 
##  4 Annapurna I Middle -0.847  -0.821   4.03       -0.134 
##  5 Annapurna II       -0.408   0.566  -0.302       0.409 
##  6 Annapurna III      -0.262   0.412  -0.302       0.726 
##  7 Annapurna IV        0.381  -0.266  -0.302       0.266 
##  8 Annapurna South    -0.737   0.912  -0.302       0.634 
##  9 Api Main           -1.72    1.26    1.32        0.0913
## 10 Baruntse            0.0974  0.0325 -0.302       1.01  
## # ℹ 75 more rows

Implementing k-means clustering

# members_clust <- kmeans(select(members_demo, - mountain), centers = 3)
members_clust <- kmeans(select(members_demo, - peak_name), centers = 3)
summary(members_clust)

##              Length Class  Mode   
## cluster      85     -none- numeric
## centers      12     -none- numeric
## totss         1     -none- numeric
## withinss      3     -none- numeric
## tot.withinss  1     -none- numeric
## betweenss     1     -none- numeric
## size          3     -none- numeric
## iter          1     -none- numeric
## ifault        1     -none- numeric

library(broom)
tidy(members_clust)

## # A tibble: 3 × 7
##   Autumn Spring  Winter died_yearly  size withinss cluster
##    <dbl>  <dbl>   <dbl>       <dbl> <int>    <dbl> <fct>  
## 1 -0.847 -0.821  4.03        -0.372     4     8.26 1      
## 2  0.769 -0.677 -0.298       -0.243    49    36.9  2      
## 3 -1.07   1.14  -0.0480       0.418    32    77.2  3

augment(members_clust, members_demo) %>%
  ggplot(aes(died_yearly, peak_name, color = .cluster)) +
  geom_point()
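Because kmeans() picks its starting centers at random, cluster numbering (and sometimes the assignments) can change from one run to the next. A minimal sketch, on made-up data rather than members_demo, of fixing the seed and trying several random starts; the seed value and nstart = 25 are arbitrary choices for illustration.

```r
# Hypothetical two-group data set
set.seed(123)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 4), ncol = 2)
)

# Same seed before each call, so both fits start from the same random centers
set.seed(2023)
fit_a <- kmeans(toy, centers = 2, nstart = 25)

set.seed(2023)
fit_b <- kmeans(toy, centers = 2, nstart = 25)

# Compare the cluster assignments of the two runs
identical(fit_a$cluster, fit_b$cluster)
```

With nstart greater than 1, kmeans() keeps the best of several random starts, which tends to give a lower total within-cluster sum of squares than a single start.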

Choosing k

kclusts <-
  tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~ kmeans(select(members_demo, - peak_name), .x)),
    tidied = map(kclust, tidy),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, members_demo)
  )

kclusts %>%
  unnest(glanced) %>%
  ggplot(aes(k, tot.withinss)) +
  geom_line(alpha = 0.8) +
  geom_point(size = 2)

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

members_clust <- kmeans(select(members_demo, - peak_name), centers = 4)

p <- augment(members_clust, members_demo) %>%
  ggplot(aes(died_yearly, Spring, color = .cluster, name = peak_name)) +
  geom_point(alpha = 0.8)

ggplotly(p)

The objective of this analysis is to uncover potential patterns or correlations between the occurrence of deaths in mountain climbing expeditions and specific seasons. By focusing on peaks and seasonal data, the study seeks to determine if certain times of the year are more hazardous for climbers.

The dataset is a comprehensive collection of climbing expeditions’ records, detailing demographics, logistics, and outcomes, with 21 variables including peak names, seasons, and fatalities. It quantifies expedition specifics like the climbers’ nationality, use of oxygen, and whether they were hired, succeeded, suffered injuries, or died.

The central variables are peak name (categorical), season (categorical), and the number of deaths (numerical), which is obtained by summing the logical died flag within each peak and season. These variables drive the analysis of how climbing fatalities are distributed over seasons across the various peaks.

Data transformation restructured the member-level records into a format suited to modeling: deaths were aggregated by peak and season, converted to each season's share of a peak's total deaths, and then standardized. This allowed for clearer comparisons between peaks with very different death counts and focused the analysis on the interaction between seasonality and mortality in climbing.

The analysis uses k-means clustering, a method that partitions the data into k distinct clusters by minimizing the within-cluster sum of squares. This technique is effective for identifying natural groupings in multidimensional data, which in this context helps to reveal patterns in seasonal fatalities across mountain peaks.
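The quantity that k-means minimizes can be checked by hand. The sketch below uses made-up data (toy is hypothetical) to recompute the within-cluster sum of squares from the fitted centers and compare it with the tot.withinss, totss and betweenss components that kmeans() returns.

```r
# Hypothetical data with two well-separated groups
set.seed(42)
toy <- rbind(
  matrix(rnorm(30, mean = 0), ncol = 2),
  matrix(rnorm(30, mean = 5), ncol = 2)
)

fit <- kmeans(toy, centers = 2, nstart = 10)

# Squared distances from each point to its own cluster center, summed per cluster
wss_by_hand <- sapply(seq_len(nrow(fit$centers)), function(j) {
  pts <- toy[fit$cluster == j, , drop = FALSE]
  sum(sweep(pts, 2, fit$centers[j, ])^2)
})

# The hand-computed total should agree with kmeans()
all.equal(sum(wss_by_hand), fit$tot.withinss)

# Total variation splits into within-cluster and between-cluster parts
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)
```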

To determine the optimal k for clustering, the analysis employs the “elbow method,” plotting the total within-cluster sum of squares (WSS, reported by kmeans as tot.withinss) for k = 1 to 9. The “elbow” point, where the rate of decrease sharply changes, suggests the most appropriate number of clusters to use.
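As an illustration of the elbow method, the sketch below runs kmeans() for several values of k on made-up data with three obvious groups (toy, ks and elbow are hypothetical names) and collects the total within-cluster sum of squares. It uses base R plotting rather than the broom and ggplot2 pipeline used in the analysis.

```r
# Hypothetical data with three well-separated groups
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 8, sd = 0.5), ncol = 2)
)

# Fit k-means for each k and keep the total within-cluster sum of squares
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

elbow <- data.frame(k = ks, tot.withinss = wss)
elbow

# The bend in this curve marks the candidate number of clusters
plot(elbow$k, elbow$tot.withinss, type = "b",
     xlab = "k", ylab = "Total within-cluster sum of squares")
```

On data like this the curve falls steeply up to k = 3 and flattens afterwards; on real data the bend is often less sharp and the choice of k involves some judgment.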

The clustering suggests discernible seasonal patterns in fatalities across peaks. In the three-cluster solution, for example, one cluster of 49 peaks has an above-average share of deaths in autumn, a second cluster of 32 peaks has an above-average share in spring, and a small cluster of 4 peaks stands out for a very high share in winter. Such groupings could inform safety strategies and precautionary measures for future climbing expeditions.

Anton Jellvik

2023-11-03
