Association rules computed on Himalayan dataset

Objective and Methodology

The aim of this study is to use association rules to identify patterns and dependencies related to Himalayan expeditions. The data comes from the Himalayan Database [https://www.himalayandatabase.com/] and includes expeditions from 1990 to 2024. The starting point of this analysis, 1990, marks the beginning of the commercial era of Himalayan climbing. The analysis focuses on expedition-level data rather than individual climbers, as this approach provides a better understanding of both the risks and the overall safety of the expeditions. In this study, an expedition is considered successful only when all its members return safely. The Apriori algorithm was used to perform the analysis, with each row in the dataset representing a unique expedition.

Dataset preparation

For the analysis, 12 variables were selected alongside the expedition identifier (expid), resulting in a dataset of 7,110 unique expeditions. Most variables were converted to categorical data to make the results easier to interpret. The variables used are:

  1. expid: Expedition ID.
  2. peakid: Mountain peak ID.
  3. year: Year of the expedition (grouped into decades: 1990-99, 2000-09, 2010-19, and 2020+).
  4. season: Season of the expedition (spring, summer, autumn, winter).
  5. host: Host country (Nepal or China).
  6. success: Whether the mountain peak was successfully reached (in any of the four attempts).
  7. smtdays: Number of days it took to reach the peak, or the highest point reached if the summit was not achieved (grouped into 1-5 days, 6-15 days, 16-25 days, 26-35 days, 36-40 days, 41-50 days, 51-60 days, and over 2 months).
  8. highpoint: The highest point reached during the expedition (grouped by altitude: 4,000m, 5,000m, 6,000m, 7,000m, or 8,000m peaks).
  9. camps: Number of camps set up during the expedition (grouped into 0, 1-2, 3-4, or 5+ camps).
  10. totmembers: Total number of expedition members (grouped into solo, 2-8 members, 9-15 members, or 16+ members).
  11. tothired: Total number of hired personnel (e.g., guides or porters) (grouped into solo expeditions, 1 person hired, 2 people hired, 3-6 people hired, 7-10 people hired, 11-20 people hired, and more than 20 people hired).
  12. o2used: Whether oxygen was used during the climb (with oxygen or without oxygen).
  13. death: A variable created from mdeaths (member deaths) and hdeaths (hired personnel deaths). This variable indicates whether anyone died during the expedition and their role (grouped into: no one died, sherpa died, member died, or both member and sherpa died). After creating this variable, mdeaths and hdeaths were removed from the dataset.
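
The grouping described in the list above can be sketched in base R. This is only an illustration on made-up rows: the data frame `toy`, the break points, and the labels are assumptions and not the code used to build the dataset. Only the column names year, mdeaths, and hdeaths come from the variable list.

```r
# Toy rows standing in for the expedition table (values are invented).
toy <- data.frame(
  year    = c(1994, 2003, 2015, 2022),
  mdeaths = c(0, 1, 0, 2),
  hdeaths = c(0, 0, 1, 1)
)

# Group years into decades; right = FALSE gives intervals like [1990, 2000).
toy$year_group <- cut(
  toy$year,
  breaks = c(1990, 2000, 2010, 2020, Inf),
  labels = c("1990-99", "2000-09", "2010-19", "2020+"),
  right  = FALSE
)

# Combine member and hired-personnel deaths into one categorical variable.
toy$death <- ifelse(toy$mdeaths > 0 & toy$hdeaths > 0, "both member and sherpa died",
             ifelse(toy$mdeaths > 0, "member died",
             ifelse(toy$hdeaths > 0, "sherpa died", "no one died")))

# The raw death counts are dropped once the combined variable exists.
toy$mdeaths <- NULL
toy$hdeaths <- NULL
print(toy)
```

Using `cut()` with explicit labels keeps the category names readable when they later appear as items in the rules.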

The final dataset consisted of the following 12 variables: “peakid”, “year”, “season”, “host”, “smtdays”, “highpoint”, “camps”, “totmembers”, “tothired”, “o2used”, “success”, and “death”. Moreover, only peaks with more than 20 expeditions recorded after 1990 were included in the analysis.
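
The peak filter mentioned above could look roughly like the following. It is a minimal sketch on invented data: `toy` and the peak code "XXXX" are hypothetical, and the actual filtering code is not shown in this report.

```r
# Toy expedition table (invented); peakid is the only real column name here.
toy <- data.frame(peakid = c(rep("EVER", 25), rep("AMAD", 21), rep("XXXX", 3)))

# Count expeditions per peak and keep peaks with more than 20 of them.
peak_counts <- table(toy$peakid)
keep_peaks  <- names(peak_counts)[peak_counts > 20]
filtered    <- toy[toy$peakid %in% keep_peaks, , drop = FALSE]

print(table(filtered$peakid))
```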

Frequency plots were created for these variables to provide a visual representation of their distributions.

# Show three plots per row
par(mfrow = c(1, 3))

variables <- c("year", "season", "host", "smtdays", "highpoint", "camps",
               "totmembers", "tothired", "o2used", "success", "death", "peakid")

# Draw one bar plot of category counts per variable
for (var in variables) {
  if (var %in% names(Himalayan)) {
    counts <- table(Himalayan[[var]])
    barplot(counts,
            main = paste("Barplot for", var),
            col = rainbow(length(counts)),
            cex.main = 1,
            las = 3,        # draw axis labels vertically
            cex.lab = 1)
  }
}

# Restore the default single-plot layout
par(mfrow = c(1, 1))

Apriori algorithm and key measures

“The Apriori algorithm was proposed by Agrawal and Srikant in 1994. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of a website frequentation or IP addresses). Given a threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database. Apriori uses a “bottom up” approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.” [https://en.wikipedia.org/wiki/Apriori_algorithm]

We can distinguish three common metrics used to evaluate the quality of Association Rules: support, confidence, and lift.

Support

Support measures how often the joint itemset appears in the database. Simply put, it is a frequency-based measure of the itemset’s occurrence in the dataset.

Confidence

Confidence is expressed as a percentage and indicates how often the rule’s consequent (Y) appears among all the groups that contain the rule’s antecedent (X). It serves as an indicator of the rule’s reliability (IBM, 2021a). A higher confidence value suggests a stronger rule.

Here, X is the antecedent itemset and Y is the consequent itemset.

Lift

Lift is a ratio that compares the confidence of a rule to its expected confidence. It measures the likelihood of co-occurrence between X and Y. The lift value can range from 0 to infinity and is interpreted as follows:

Value greater than 1: X and Y are positively dependent.

Value equal to 1: X and Y are independent, meaning no meaningful rule can be derived.

Value less than 1: X and Y are negatively dependent. The presence of X reduces the likelihood of Y occurring (IBM, 2021b).
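
The three measures can be computed by hand on a small example. The sketch below uses five invented transactions and a helper function `support()` written for this illustration (it is not part of any package). It takes support(X and Y) as the share of transactions containing both itemsets, confidence as support(X and Y) / support(X), and lift as confidence / support(Y).

```r
# Five invented transactions; each is a set of items.
transactions <- list(
  c("with oxygen", "eight-thousanders", "spring"),
  c("with oxygen", "eight-thousanders"),
  c("without oxygen", "six-thousanders", "autumn"),
  c("with oxygen", "six-thousanders"),
  c("without oxygen", "eight-thousanders", "spring")
)

# Support: share of transactions that contain every item in the itemset.
support <- function(itemset, trans) {
  mean(vapply(trans, function(t) all(itemset %in% t), logical(1)))
}

X <- "with oxygen"          # antecedent
Y <- "eight-thousanders"    # consequent

supp_xy    <- support(c(X, Y), transactions)       # share containing X and Y
confidence <- supp_xy / support(X, transactions)   # share of X that also has Y
lift       <- confidence / support(Y, transactions)

print(c(support = supp_xy, confidence = confidence, lift = lift))
```

In this toy data the lift comes out slightly above 1, so oxygen use and reaching eight-thousanders co-occur a little more often than independence would predict.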

Next, the data need to be transformed into a format that the read.transactions function can read.

## [1] 7111
##     items             
## [1] {camps,           
##      death,           
##      highpoint,       
##      host,            
##      o2used,          
##      peakid,          
##      season,          
##      smtdays,         
##      success,         
##      tothired,        
##      totmembers,      
##      year}            
## [2] {1,               
##      1-2 camps,       
##      15-25 days,      
##      2-8 members,     
##      AMAD,            
##      Before 2000,     
##      Nepal,           
##      no one died,     
##      six-thousanders, 
##      solo expedition, 
##      spring,          
##      summit reached,  
##      without oxygen}  
## [3] {2,               
##      2-8 members,     
##      2 people hired,  
##      3-4 camps,       
##      5-15 days,       
##      AMAD,            
##      autumn,          
##      Before 2000,     
##      Nepal,           
##      no one died,     
##      six-thousanders, 
##      summit reached,  
##      without oxygen}  
## [4] {1-2 camps,       
##      3,               
##      5-15 days,       
##      9-15 members,    
##      AMAD,            
##      autumn,          
##      Before 2000,     
##      Nepal,           
##      no one died,     
##      six-thousanders, 
##      solo expedition, 
##      summit reached,  
##      without oxygen}  
## [5] {1-2 camps,       
##      2-8 members,     
##      4,               
##      5-15 days,       
##      AMAD,            
##      autumn,          
##      Before 2000,     
##      Nepal,           
##      no one died,     
##      six-thousanders, 
##      solo expedition, 
##      summit reached,  
##      without oxygen}
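
The listing above reports 7111 transactions for 7,110 expeditions, its first transaction consists of the column names, and the following ones contain 1, 2, 3, and 4 as items. One possible explanation is that the header row and the row numbers were written to the file that read.transactions parsed. The export code is not shown, so this is an assumption. Under that assumption, a sketch of an export that avoids both could look like this (`toy` and the temporary file are hypothetical):

```r
# Toy stand-in for the prepared data frame (invented values).
toy <- data.frame(
  peakid = c("AMAD", "EVER"),
  season = c("spring", "autumn"),
  host   = c("Nepal", "China")
)

# Write one expedition per line, without the header row or row numbers,
# so that neither of them ends up as an item.
path <- tempfile(fileext = ".csv")
write.table(toy, path, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
print(readLines(path))

# Read the file back as transactions if the arules package is installed.
if (requireNamespace("arules", quietly = TRUE)) {
  expedition_toy <- arules::read.transactions(path, format = "basket", sep = ",")
  print(length(expedition_toy))
}
```

With this layout, every line of the file corresponds to exactly one expedition and every item is a category label.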

With the data transformed, we can check how frequently each item occurs.
Since expeditions to high altitudes require extensive preparation and are far more complex than typical mountain hikes, I expect many of the items to occur with similar frequencies. This would not be surprising given the nature of the phenomenon being studied.

Below are charts presenting item frequencies for the 20 most frequent items, in relative and absolute terms:

itemFrequencyPlot(expedition, topN=20, type="relative", col="lightblue",main="ItemFrequency")

itemFrequencyPlot(expedition, type = "absolute", topN = 20, col = "lightgreen", main = "Item Frequency - Absolute")
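
The difference between the two plot types can be illustrated without any package. In this sketch on invented transactions, the absolute frequency is the number of transactions containing an item, and the relative frequency is that count divided by the number of transactions.

```r
# Invented transactions, one character vector of items per expedition.
transactions <- list(
  c("Nepal", "spring", "no one died"),
  c("Nepal", "autumn", "no one died"),
  c("China", "spring", "no one died"),
  c("Nepal", "spring", "member died")
)

# Absolute frequency: number of transactions containing each item.
absolute <- sort(table(unlist(lapply(transactions, unique))), decreasing = TRUE)

# Relative frequency: absolute frequency divided by the number of transactions.
relative <- absolute / length(transactions)

print(absolute)
print(round(relative, 2))
```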

Association Rules - Apriori Algorithm

Now let’s move on to the Apriori algorithm. The support level was set to 0.1, the confidence to 0.8, and the minimum rule length to 2. This yields 1902 rules, which in my opinion is too many to interpret.

rules1<-apriori(expedition, parameter=list(supp=0.1, conf=0.8, minlen=2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 711 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[7191 item(s), 7111 transaction(s)] done [0.04s].
## sorting and recoding items ... [30 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 done [0.02s].
## writing ... [1902 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

I tried a few different parameter settings and eventually settled on a support level of 0.25 and a confidence level of 0.75. With these parameters, Apriori returns 141 rules.

rules2<-apriori(expedition, parameter=list(supp=0.25, conf=0.75, minlen=2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.75    0.1    1 none FALSE            TRUE       5    0.25      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1777 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[7191 item(s), 7111 transaction(s)] done [0.05s].
## sorting and recoding items ... [19 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [141 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(rules2)
## set of 141 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3  4 
## 35 73 33 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   2.986   3.000   4.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift       
##  Min.   :0.2500   Min.   :0.7500   Min.   :0.2579   Min.   :0.9762  
##  1st Qu.:0.2669   1st Qu.:0.8506   1st Qu.:0.2980   1st Qu.:1.0076  
##  Median :0.2964   Median :0.9275   Median :0.3440   Median :1.1076  
##  Mean   :0.3218   Mean   :0.8963   Mean   :0.3612   Mean   :1.2648  
##  3rd Qu.:0.3496   3rd Qu.:0.9524   3rd Qu.:0.3915   3rd Qu.:1.4878  
##  Max.   :0.6853   Max.   :0.9911   Max.   :0.7254   Max.   :1.9709  
##      count     
##  Min.   :1778  
##  1st Qu.:1898  
##  Median :2108  
##  Mean   :2288  
##  3rd Qu.:2486  
##  Max.   :4873  
## 
## mining info:
##        data ntransactions support confidence
##  expedition          7111    0.25       0.75
##                                                                                call
##  apriori(data = expedition, parameter = list(supp = 0.25, conf = 0.75, minlen = 2))
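
The “Absolute minimum support count” lines in the two runs above can be related to the support thresholds with simple arithmetic. The sketch below assumes that the printed figure is support × n truncated to an integer. That matches the output shown here, but it is an inference from this output rather than a documented guarantee. The helper `min_counts()` is written for this illustration only.

```r
n_transactions <- 7111  # number of transactions reported in the output

# printed:   support * n truncated to an integer (assumed from the output)
# effective: smallest count whose relative support reaches the threshold
min_counts <- function(supp, n) {
  c(printed = floor(supp * n), effective = ceiling(supp * n))
}

for (supp in c(0.1, 0.25)) {
  cat("support =", supp, ":", "\n")
  print(min_counts(supp, n_transactions))
}
```

For the second run, the effective value agrees with the smallest count in the summary of rules2 (1778), since a rule needs at least 0.25 × 7111 = 1777.75 supporting transactions.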

As 141 rules might still be tough to interpret, let’s take a look at the lift.

hist(quality(rules2)$lift,
     breaks = 30,
     col='pink',
     main = "Lift distribution", 
     xlab = "Lift", 
     ylab = "Number of rules"
)

Since a lift around 1 implies nearly independent itemsets, which are not of interest here, I’ll remove them by keeping only rules with a lift of at least 1.2. After that, 66 rules remain.

rules_apriori_1 <- subset(rules2, lift >= 1.2)
hist(quality(rules_apriori_1)$lift,
     breaks = 30,
     col='pink',
     main = "Lift distribution", 
     xlab = "Lift", 
     ylab = "Number of rules"
)

length(rules_apriori_1) 
## [1] 66
summary(rules_apriori_1)
## set of 66 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3  4 
## 13 32 21 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   3.121   4.000   4.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.2500   Min.   :0.7512   Min.   :0.2582   Min.   :1.265  
##  1st Qu.:0.2669   1st Qu.:0.8359   1st Qu.:0.3001   1st Qu.:1.410  
##  Median :0.2825   Median :0.8636   Median :0.3293   Median :1.493  
##  Mean   :0.2983   Mean   :0.8682   Mean   :0.3460   Mean   :1.547  
##  3rd Qu.:0.3177   3rd Qu.:0.8894   3rd Qu.:0.3789   3rd Qu.:1.605  
##  Max.   :0.4279   Max.   :0.9911   Max.   :0.5483   Max.   :1.971  
##      count     
##  Min.   :1778  
##  1st Qu.:1898  
##  Median :2009  
##  Mean   :2121  
##  3rd Qu.:2260  
##  Max.   :3043  
## 
## mining info:
##        data ntransactions support confidence
##  expedition          7111    0.25       0.75
##                                                                                call
##  apriori(data = expedition, parameter = list(supp = 0.25, conf = 0.75, minlen = 2))

Visualization of the rules on the plots:

plot(rules_apriori_1, 
     method = "graph", 
     measure = "support", 
     colors = c("#9933cc", "#ffccff")
)

In both plots, the main focus areas are: eight-thousanders, 3-4 camps, summit reached, and with oxygen.

plot(rules_apriori_1, method="paracoord", control=list(reorder=TRUE))

Let’s try to find out more by taking a look at the rules.

inspect(head(sort(rules_apriori_1, by="confidence", decreasing=TRUE),10))
##      lhs                  rhs                   support confidence  coverage     lift count
## [1]  {3-4 camps,                                                                           
##       summit reached,                                                                      
##       with oxygen}     => {eight-thousanders} 0.2669104  0.9911227 0.2693011 1.969230  1898
## [2]  {summit reached,                                                                      
##       with oxygen}     => {eight-thousanders} 0.3030516  0.9777677 0.3099423 1.942695  2155
## [3]  {no one died,                                                                         
##       summit reached,                                                                      
##       with oxygen}     => {eight-thousanders} 0.2815356  0.9765854 0.2882858 1.940346  2002
## [4]  {3-4 camps,                                                                           
##       spring,                                                                              
##       summit reached}  => {eight-thousanders} 0.2593166  0.9715490 0.2669104 1.930339  1844
## [5]  {no one died,                                                                         
##       six-thousanders} => {without oxygen}    0.2505977  0.9705882 0.2581915 1.568960  1782
## [6]  {six-thousanders} => {without oxygen}    0.2579103  0.9672996 0.2666292 1.563643  1834
## [7]  {six-thousanders} => {Nepal}             0.2566446  0.9625527 0.2666292 1.327009  1825
## [8]  {3-4 camps,                                                                           
##       with oxygen}     => {eight-thousanders} 0.2999578  0.9417219 0.3185206 1.871077  2133
## [9]  {3-4 camps,                                                                           
##       no one died,                                                                         
##       with oxygen}     => {eight-thousanders} 0.2790044  0.9411765 0.2964421 1.869993  1984
## [10] {spring,                                                                              
##       with oxygen}     => {eight-thousanders} 0.2631135  0.9331671 0.2819575 1.854080  1871

About 27% of all expeditions had 3-4 camps, reached the summit, used oxygen, and reached eight-thousanders (support). Among the expeditions with 3-4 camps, a summit, and oxygen, about 99% reached eight-thousanders (confidence). Similarly, about 26% of all expeditions reached six-thousanders without using oxygen, and about 26% reached six-thousanders in Nepal; the corresponding confidences are about 97% and 96%. Based on these results, expeditions that used oxygen, reached the summit, and set up 3-4 camps almost always reached eight-thousanders.

Oxygen use and summit reached appear in most of the highest-confidence rules, which suggests that they are key factors associated with an expedition reaching eight-thousanders.

The relationship between eight-thousanders and oxygen use is quite obvious, since it is hard to breathe at such altitude, so I’ll exclude oxygen to see whether other patterns emerge.
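
The columns of the inspect output are linked by simple arithmetic, which is a useful check when reading a rule. The sketch below copies the figures of rule [1] from the table above and uses the standard definitions: support is the joint count divided by the number of transactions, coverage is the support of the left-hand side, and confidence is support divided by coverage. The variable names are chosen for this illustration.

```r
n <- 7111                 # transactions in the mined dataset

# Rule [1]: {3-4 camps, summit reached, with oxygen} => {eight-thousanders}
count    <- 1898          # transactions containing lhs and rhs together
support  <- 0.2669104
coverage <- 0.2693011     # support of the lhs alone

# Support is the joint count divided by the number of transactions.
support_from_count <- count / n

# Confidence is the joint support divided by the lhs support.
confidence_from_support <- support / coverage

# Coverage times n gives the number of transactions matching the lhs.
lhs_count <- round(coverage * n)

print(c(support = support_from_count,
        confidence = confidence_from_support,
        lhs_count = lhs_count))
```

Read this way, the rule says that 1898 of the roughly 1915 expeditions matching the left-hand side also reached eight-thousanders.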

exp_without_oxygen <- expedition[, !itemLabels(expedition) %in% "with oxygen"]

rules_apriori_8k <-apriori(exp_without_oxygen, parameter=list(supp=0.05, conf=0.5),
                                    appearance=list(default="lhs", rhs="eight-thousanders"), control=list(verbose=F))
inspect(head(sort(rules_apriori_8k, by="confidence", decreasing=TRUE)))
##     lhs                 rhs                    support confidence   coverage     lift count
## [1] {CHOY,                                                                                 
##      summit reached} => {eight-thousanders} 0.10997047          1 0.10997047 1.986868   782
## [2] {EVER,                                                                                 
##      summit reached} => {eight-thousanders} 0.18014344          1 0.18014344 1.986868  1281
## [3] {3-4 camps,                                                                            
##      35-40 days,                                                                           
##      summit reached} => {eight-thousanders} 0.05667276          1 0.05667276 1.986868   403
## [4] {25-35 days,                                                                           
##      EVER,                                                                                 
##      summit reached} => {eight-thousanders} 0.05850091          1 0.05850091 1.986868   416
## [5] {15-25 days,                                                                           
##      CHOY,                                                                                 
##      summit reached} => {eight-thousanders} 0.07453241          1 0.07453241 1.986868   530
## [6] {China,                                                                                
##      CHOY,                                                                                 
##      summit reached} => {eight-thousanders} 0.10645479          1 0.10645479 1.986868   757

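Key takeaways: eight-thousanders were primarily summited on Everest (EVER) or Cho Oyu (CHOY), and reaching the summit usually took more than 15 days, typically around a month. The confidence of 1 in the Everest and Cho Oyu rules is expected: both peaks are above 8,000 m, so an expedition that summits them necessarily has a high point among the eight-thousanders.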

Now let’s see how we can characterise expeditions that reached eight-thousanders.

rules_apriori_8k<-apriori(exp_without_oxygen, parameter=list(supp=0.05, conf=0.5,minlen=2),
                                   appearance=list(default="rhs", lhs="eight-thousanders"), control=list(verbose=F))
inspect(head(sort(rules_apriori_8k, by="confidence", decreasing=TRUE)))
##     lhs                    rhs              support   confidence coverage 
## [1] {eight-thousanders} => {no one died}    0.4698355 0.9335010  0.5033047
## [2] {eight-thousanders} => {summit reached} 0.4279286 0.8502375  0.5033047
## [3] {eight-thousanders} => {3-4 camps}      0.4118971 0.8183850  0.5033047
## [4] {eight-thousanders} => {spring}         0.3515680 0.6985191  0.5033047
## [5] {eight-thousanders} => {Nepal}          0.2979890 0.5920648  0.5033047
## [6] {eight-thousanders} => {2-8 members}    0.2811138 0.5585359  0.5033047
##     lift      count
## [1] 0.9857626 3341 
## [2] 1.4021426 3043 
## [3] 1.4925714 2929 
## [4] 1.3586350 2500 
## [5] 0.8162414 2119 
## [6] 0.9074135 1999

Most expeditions that reached eight-thousanders took place in spring (about 70%) and used 3 to 4 camps (about 82%); both are over-represented compared with expeditions overall (lift of about 1.36 and 1.49). About 59% took place on the Nepalese side and about 56% involved small teams of 2 to 8 members, but with a lift below 1 both are less common here than among expeditions overall.

I believe it would also be interesting to find out why expeditions didn’t reach the peak.

rules.summitnotreached<-apriori(data=expedition, parameter=list(supp=0.01, conf=0.08,maxlen=3), appearance=list(default="lhs", rhs="summit not reached"), control=list(verbose=F)) 
rules.summitnotreached.byconf<-sort(rules.summitnotreached, by="confidence", decreasing=TRUE)
inspect(head(rules.summitnotreached.byconf,10))
##      lhs                    rhs                     support confidence   coverage     lift count
## [1]  {five-thousanders}  => {summit not reached} 0.03529743          1 0.03529743 2.541458   251
## [2]  {PUMO,                                                                                     
##       six-thousanders}   => {summit not reached} 0.01448460          1 0.01448460 2.541458   103
## [3]  {five-thousanders,                                                                         
##       no camp}           => {summit not reached} 0.01490648          1 0.01490648 2.541458   106
## [4]  {1-5 days,                                                                                 
##       five-thousanders}  => {summit not reached} 0.01673464          1 0.01673464 2.541458   119
## [5]  {AMAD,                                                                                     
##       five-thousanders}  => {summit not reached} 0.01251582          1 0.01251582 2.541458    89
## [6]  {five-thousanders,                                                                         
##       solo expedition}   => {summit not reached} 0.01518774          1 0.01518774 2.541458   108
## [7]  {5-15 days,                                                                                
##       five-thousanders}  => {summit not reached} 0.01476586          1 0.01476586 2.541458   105
## [8]  {1-2 camps,                                                                                
##       five-thousanders}  => {summit not reached} 0.01996906          1 0.01996906 2.541458   142
## [9]  {2010-2019,                                                                                
##       five-thousanders}  => {summit not reached} 0.01617213          1 0.01617213 2.541458   115
## [10] {2000-2009,                                                                                
##       five-thousanders}  => {summit not reached} 0.01279707          1 0.01279707 2.541458    91

Every expedition whose high point stayed among the five-thousanders failed to reach the summit (confidence 1). Since highpoint records the highest point reached rather than the height of the peak, these rules appear to be close to tautological: Ama Dablam (AMAD) is above 6,000 m and Pumori (PUMO) is above 7,000 m, so a high point of five-thousanders on AMAD, or six-thousanders on PUMO, means the summit was not reached. The remaining antecedents (1-5 days, no camp, solo expedition, 1-2 camps, 2000-2009, 2010-2019) describe which expeditions turned back at that altitude; they do not change the confidence or the lift, which is 2.54 for every rule.

If the summit was not reached, then:

rules.summitnotreached<-apriori(data=expedition, parameter=list(supp=0.01, conf=0.08,minlen=2), appearance=list(default="rhs", lhs="summit not reached"), control=list(verbose=F)) 
rules.summitnotreached.byconf<-sort(rules.summitnotreached, by="confidence", decreasing=TRUE)
inspect(head(rules.summitnotreached.byconf,10))
##      lhs                     rhs               support   confidence coverage 
## [1]  {summit not reached} => {no one died}     0.3718183 0.9449607  0.3934749
## [2]  {summit not reached} => {without oxygen}  0.3221769 0.8187991  0.3934749
## [3]  {summit not reached} => {Nepal}           0.2933483 0.7455325  0.3934749
## [4]  {summit not reached} => {2-8 members}     0.2503164 0.6361687  0.3934749
## [5]  {summit not reached} => {1-2 camps}       0.1995500 0.5071480  0.3934749
## [6]  {summit not reached} => {spring}          0.1912530 0.4860615  0.3934749
## [7]  {summit not reached} => {autumn}          0.1892842 0.4810579  0.3934749
## [8]  {summit not reached} => {2000-2009}       0.1551118 0.3942102  0.3934749
## [9]  {summit not reached} => {six-thousanders} 0.1449866 0.3684775  0.3934749
## [10] {summit not reached} => {solo expedition} 0.1445648 0.3674053  0.3934749
##      lift      count
## [1]  0.9978639 2644 
## [2]  1.3235919 2291 
## [3]  1.0278173 2086 
## [4]  1.0335379 1780 
## [5]  1.4248634 1419 
## [6]  0.9454002 1360 
## [7]  1.0388104 1346 
## [8]  1.0069067 1103 
## [9]  1.3819849 1031 
## [10] 1.2494591 1028

{summit not reached} => {no one died}: When the summit was not reached, no one died in about 94% of expeditions. The lift is close to 1 (0.998), so this reflects the overall rate rather than an effect of not summiting.

{summit not reached} => {without oxygen}: About 82% of expeditions that did not reach the summit climbed without oxygen, which is more often than expeditions overall (lift 1.32).

{summit not reached} => {Nepal}: About 75% of expeditions that did not reach the summit were hosted in Nepal, only slightly above the overall share (lift 1.03).

In short, expeditions that did not reach the summit mostly had no deaths, used no oxygen, and climbed from the Nepal side in smaller teams. Of these, only the absence of oxygen, together with 1-2 camps and six-thousanders, stands out clearly from the baseline (lift above 1.3).

Analysing deaths and other features:

Under what conditions did no one die?

rules.nodeath<-apriori(data=expedition, parameter=list(supp=0.01, conf=0.08,maxlen=3), appearance=list(default="lhs", rhs="no one died"), control=list(verbose=F)) 
rules.nodeath.byconf<-sort(rules.nodeath, by="confidence", decreasing=TRUE)
inspect(head(rules.nodeath.byconf,10))
##      lhs                              rhs           support    confidence
## [1]  {2000-2009, five-thousanders} => {no one died} 0.01279707 1.0000000 
## [2]  {AMAD, no camp}               => {no one died} 0.01532836 1.0000000 
## [3]  {AMAD, solo}                  => {no one died} 0.02151596 1.0000000 
## [4]  {1 person hired, AMAD}        => {no one died} 0.03923499 0.9928826 
## [5]  {CHOY, solo}                  => {no one died} 0.03290676 0.9915254 
## [6]  {2010-2019, AMAD}             => {no one died} 0.06229785 0.9910515 
## [7]  {HIML, seven-thousanders}     => {no one died} 0.01546899 0.9909910 
## [8]  {1-2 camps, AMAD}             => {no one died} 0.10547040 0.9907530 
## [9]  {3-6 people hired, AMAD}      => {no one died} 0.02939108 0.9905213 
## [10] {HIML, summit reached}        => {no one died} 0.01462523 0.9904762 
##      coverage   lift     count
## [1]  0.01279707 1.055985  91  
## [2]  0.01532836 1.055985 109  
## [3]  0.02151596 1.055985 153  
## [4]  0.03951624 1.048469 279  
## [5]  0.03318802 1.047036 234  
## [6]  0.06286036 1.046535 443  
## [7]  0.01560962 1.046471 110  
## [8]  0.10645479 1.046220 750  
## [9]  0.02967234 1.045975 209  
## [10] 0.01476586 1.045928 104

The association rules suggest that certain conditions are associated with no deaths occurring during the climb. For instance, expeditions in 2000-2009 that reached five-thousanders, and expeditions to Ama Dablam (AMAD) with no camp, solo climbs, or minimal hired help, have confidence values at or close to 1. Expeditions to the HIML peak, particularly when the summit is reached, show the same pattern. The lift of all these rules is only about 1.05, however, because no one died in roughly 95% of all expeditions, so these conditions are only marginally safer than average.

Cases when a member died and cases when a sherpa died:

#Member died
rules.mdeath<-apriori(data=expedition, parameter=list(supp=0.005, conf=0.08,maxlen=3), appearance=list(default="lhs", rhs="member died"), control=list(verbose=F)) 
rules.mdeath.byconf<-sort(rules.mdeath, by="confidence", decreasing=TRUE)
inspect(rules.mdeath.byconf)
##     lhs                                 rhs           support     confidence
## [1] {Before 2000, with oxygen}       => {member died} 0.005343833 0.09921671
## [2] {9-15 members, Before 2000}      => {member died} 0.005062579 0.09254499
## [3] {Before 2000, eight-thousanders} => {member died} 0.008718886 0.09253731
## [4] {Before 2000, spring}            => {member died} 0.006187597 0.08239700
##     coverage   lift     count
## [1] 0.05386022 2.672462 38   
## [2] 0.05470398 2.492755 36   
## [3] 0.09422022 2.492549 62   
## [4] 0.07509492 2.219413 44
#Sherpa died
rules.sdeath<-apriori(data=expedition, parameter=list(supp=0.001, conf=0.08,maxlen=3), appearance=list(default="lhs", rhs="sherpa died"), control=list(verbose=F)) 
rules.sdeath.byconf<-sort(rules.sdeath, by="confidence", decreasing=TRUE)
inspect(rules.sdeath.byconf)
##     lhs                             rhs               support confidence    coverage      lift count
## [1] {more than 20 people hired,                                                                     
##      Nepal}                      => {sherpa died} 0.001406272 0.13698630 0.010265785 10.588148    10
## [2] {EVER,                                                                                          
##      more than 20 people hired}  => {sherpa died} 0.001546899 0.13580247 0.011390803 10.496645    11
## [3] {more than 20 people hired,                                                                     
##      spring}                     => {sherpa died} 0.001546899 0.13580247 0.011390803 10.496645    11
## [4] {more than 20 people hired}  => {sherpa died} 0.001546899 0.12087912 0.012797075  9.343168    11
## [5] {3-4 camps,                                                                                     
##      more than 20 people hired}  => {sherpa died} 0.001125018 0.11267606 0.009984531  8.709124     8
## [6] {more than 20 people hired,                                                                     
##      summit reached}             => {sherpa died} 0.001125018 0.10256410 0.010968921  7.927536     8
## [7] {eight-thousanders,                                                                             
##      more than 20 people hired}  => {sherpa died} 0.001125018 0.09756098 0.011531430  7.540827     8
## [8] {more than 20 people hired,                                                                     
##      with oxygen}                => {sherpa died} 0.001125018 0.09411765 0.011953312  7.274680     8

Expeditions before 2000, particularly those with oxygen, with larger teams of 9-15 members, or to eight-thousanders, are associated with a higher likelihood of a member dying: a member died in about 8-10% of such expeditions, roughly 2.2 to 2.7 times the overall rate. These patterns suggest that earlier expeditions, especially in challenging conditions, had a greater risk of fatalities.

Sherpa deaths follow a different pattern. Every rule has more than 20 people hired on its left-hand side (LHS), either alone or combined with Nepal, Everest (EVER), spring, 3-4 camps, summit reached, eight-thousanders, or with oxygen, while the right-hand side (RHS) indicates that a Sherpa died during the expedition. The lift is high (about 7 to 11), but each rule rests on only 8 to 11 expeditions, so these results should be read with caution.
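
Because confidences of 10-14% look small in isolation, it helps to recover the overall rate of each outcome. Since lift = confidence / support(rhs), the baseline follows as confidence / lift. The sketch below applies this to the top rule of each table above; the data frame `rules` is built by hand from the printed figures and is not an arules object.

```r
# Confidence and lift copied from the top rule of each table above.
rules <- data.frame(
  rhs        = c("member died", "sherpa died"),
  confidence = c(0.09921671, 0.13698630),
  lift       = c(2.672462, 10.588148)
)

# lift = confidence / support(rhs), so support(rhs) = confidence / lift.
rules$baseline <- rules$confidence / rules$lift

# Convert the baseline share into a number of expeditions.
n <- 7111
rules$baseline_count <- round(rules$baseline * n)
print(rules)
```

The baselines come out at roughly 3.7% of expeditions for member deaths and 1.3% for sherpa deaths, which is the reference against which the rule confidences should be judged.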

Other interesting rules:

#Summit reached
rules.summitreached<-apriori(data=expedition, parameter=list(supp=0.01, conf=0.08,maxlen=3), appearance=list(default="lhs", rhs="summit reached"), control=list(verbose=F)) 
rules.summitreached.byconf<-sort(rules.summitreached, by="confidence", decreasing=TRUE)
inspect(head(rules.summitreached.byconf,10))
##      lhs                             rhs                 support confidence   coverage     lift count
## [1]  {PUMO,                                                                                          
##       seven-thousanders}          => {summit reached} 0.01251582  0.9673913 0.01293770 1.595343    89
## [2]  {HIML,                                                                                          
##       seven-thousanders}          => {summit reached} 0.01476586  0.9459459 0.01560962 1.559977   105
## [3]  {eight-thousanders,                                                                             
##       more than 20 people hired}  => {summit reached} 0.01082829  0.9390244 0.01153143 1.548563    77
## [4]  {CHOY,                                                                                          
##       eight-thousanders}          => {summit reached} 0.10997047  0.9298454 0.11826747 1.533426   782
## [5]  {11-20 people hired,                                                                            
##       eight-thousanders}          => {summit reached} 0.03192237  0.9265306 0.03445366 1.527959   227
## [6]  {more than 20 people hired,                                                                     
##       with oxygen}                => {summit reached} 0.01096892  0.9176471 0.01195331 1.513309    78
## [7]  {16 and more members,                                                                           
##       3-4 camps}                  => {summit reached} 0.02756293  0.9116279 0.03023485 1.503383   196
## [8]  {16 and more members,                                                                           
##       eight-thousanders}          => {summit reached} 0.03023485  0.9110169 0.03318802 1.502375   215
## [9]  {9-15 members,                                                                                  
##       eight-thousanders}          => {summit reached} 0.08761074  0.9081633 0.09647026 1.497669   623
## [10] {11-20 people hired,                                                                            
##       with oxygen}                => {summit reached} 0.03164112  0.9036145 0.03501617 1.490168   225
# Solo expedition: rules with "solo expedition" fixed on the RHS
rules.solo <- apriori(data = expedition, parameter = list(supp = 0.01, conf = 0.08, maxlen = 3), appearance = list(default = "lhs", rhs = "solo expedition"), control = list(verbose = FALSE))
rules.solo.byconf <- sort(rules.solo, by = "confidence", decreasing = TRUE)
inspect(head(rules.solo.byconf, 10))
##      lhs                     rhs                  support confidence   coverage     lift count
## [1]  {2000-2009,                                                                              
##       no camp}            => {solo expedition} 0.01237519  0.5986395 0.02067220 2.035832    88
## [2]  {LHOT,                                                                                   
##       without oxygen}     => {solo expedition} 0.01251582  0.5933333 0.02109408 2.017787    89
## [3]  {35-40 days,                                                                             
##       without oxygen}     => {solo expedition} 0.01026579  0.5703125 0.01800028 1.939499    73
## [4]  {eight-thousanders,                                                                      
##       without oxygen}     => {solo expedition} 0.08957952  0.5622242 0.15933061 1.911992   637
## [5]  {2-8 members,                                                                            
##       PUMO}               => {solo expedition} 0.01181268  0.5562914 0.02123471 1.891816    84
## [6]  {CHOY,                                                                                   
##       solo}               => {solo expedition} 0.01842216  0.5550847 0.03318802 1.887713   131
## [7]  {ANN1,                                                                                   
##       without oxygen}     => {solo expedition} 0.01096892  0.5454545 0.02010969 1.854963    78
## [8]  {China,                                                                                  
##       without oxygen}     => {solo expedition} 0.07734496  0.5445545 0.14203347 1.851902   550
## [9]  {25-35 days,                                                                             
##       without oxygen}     => {solo expedition} 0.03585994  0.5437100 0.06595416 1.849030   255
## [10] {no camp,                                                                                
##       without oxygen}     => {solo expedition} 0.03656307  0.5295316 0.06904795 1.800813   260
# Everest: rules with the peak ID "EVER" fixed on the RHS
rules.everest <- apriori(data = expedition, parameter = list(supp = 0.01, conf = 0.08, maxlen = 3), appearance = list(default = "lhs", rhs = "EVER"), control = list(verbose = FALSE))
rules.everest.byconf <- sort(rules.everest, by = "confidence", decreasing = TRUE)
inspect(head(rules.everest.byconf, 10))
##      lhs                             rhs       support confidence   coverage     lift count
## [1]  {40-50 days,                                                                          
##       China}                      => {EVER} 0.02334411  0.9940120 0.02348474 3.765807   166
## [2]  {more than 20 people hired,                                                           
##       spring}                     => {EVER} 0.01110955  0.9753086 0.01139080 3.694949    79
## [3]  {35-40 days,                                                                          
##       China}                      => {EVER} 0.02545352  0.9731183 0.02615666 3.686651   181
## [4]  {50-60 days,                                                                          
##       with oxygen}                => {EVER} 0.01350021  0.9230769 0.01462523 3.497070    96
## [5]  {50-60 days,                                                                          
##       eight-thousanders}          => {EVER} 0.01378147  0.9074074 0.01518774 3.437706    98
## [6]  {3-4 camps,                                                                           
##       50-60 days}                 => {EVER} 0.01195331  0.9042553 0.01321896 3.425764    85
## [7]  {11-20 people hired,                                                                  
##       spring}                     => {EVER} 0.02868795  0.9026549 0.03178175 3.419701   204
## [8]  {eight-thousanders,                                                                   
##       more than 20 people hired}  => {EVER} 0.01040641  0.9024390 0.01153143 3.418883    74
## [9]  {more than 20 people hired,                                                           
##       with oxygen}                => {EVER} 0.01068767  0.8941176 0.01195331 3.387358    76
## [10] {more than 20 people hired}  => {EVER} 0.01139080  0.8901099 0.01279707 3.372174    81

For summit success, the rules suggest that expeditions with certain combinations, like “PUMO with seven-thousanders,” “CHOY with eight-thousanders,” or “16 and more members with 3-4 camps,” are more likely to reach the summit: all ten rules have a confidence above 0.90 and a lift of about 1.5-1.6. Additionally, using oxygen or having a larger number of people hired, such as more than 20, is associated with a higher likelihood of reaching the summit, particularly on high-altitude mountains like the eight-thousanders. These are associations, not evidence that oxygen or hired staff cause success.

For solo expeditions, the LHS of the rules highlights conditions like time period (2000-2009), oxygen usage, and specific peaks (e.g., LHOT, PUMO, CHOY), with the RHS indicating that the expedition was solo. These rules show that solo expeditions are more likely when no camp is set, when oxygen isn’t used, or during specific time frames and on specific peaks. Confidence ranges from 0.53 to 0.60 with a lift of 1.8-2.0, so expeditions matching these conditions are roughly twice as likely to be solo as an average expedition. Solo expeditions are especially common on eight-thousanders climbed without oxygen. This suggests that solo expeditions tend to occur under more extreme or isolated conditions.

For Everest, the LHS of the rules describes various expedition conditions such as duration (35-40 days, 40-50 days, 50-60 days), location (China), number of people hired, and oxygen usage. The RHS indicates that the expedition's peak was Mount Everest (“EVER”); it says nothing about whether the summit was reached. These rules suggest that longer expeditions, especially those lasting 35-60 days on the Chinese side, and expeditions with more than 10 people hired are very likely to be Everest expeditions (confidence 0.89-0.99, lift about 3.4-3.8). In other words, long durations, oxygen use, and large hired teams are characteristic of expeditions to Everest.

Example of interpreting the plot for a selected rule:

plot(rules_apriori_1, method="graph", measure="support", shading="lift", engine="html")

Rule 20: {eight-thousanders,with oxygen} => {3-4 camps}

The rule indicates that expeditions on eight-thousanders with oxygen tend to set up 3 or 4 camps.

  • Support = 0.3: about 30% of all expeditions were on eight-thousanders, used oxygen, and set up 3-4 camps.
  • Confidence = 0.872: among expeditions on eight-thousanders with oxygen, 87.2% set up 3 or 4 camps.
  • Lift = 1.59: the combination of eight-thousanders with oxygen and 3-4 camps occurs 1.59 times more often than it would if the two sides of the rule were independent.
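How the three measures relate to each other can be shown on a small made-up example. The ten rows below are hypothetical and are not taken from the Himalayan dataset; they only illustrate how support, confidence, and lift are computed from counts.

```r
# Hypothetical toy data: 10 expeditions, TRUE where the condition holds
lhs <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)  # LHS items present
rhs <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)   # RHS item present

# Support: share of all rows where LHS and RHS occur together
support <- mean(lhs & rhs)

# Confidence: share of LHS rows that also contain the RHS
confidence <- sum(lhs & rhs) / sum(lhs)

# Lift: confidence relative to the overall frequency of the RHS
lift <- confidence / mean(rhs)

print(c(support = support, confidence = confidence, lift = lift))
```

Here the rule holds in 3 of 10 rows (support 0.3), in 3 of the 4 rows matching the LHS (confidence 0.75), and the RHS is 1.5 times more frequent among LHS rows than overall (lift 1.5). A lift of exactly 1 would mean the two sides are independent.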

Other methods

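Apriori is not the only algorithm for generating association rules; ECLAT (Equivalence Class Clustering and bottom-up Lattice Traversal) can also be used. ECLAT stores, for each item, the list of transactions that contain it, and obtains the support of an itemset by intersecting these lists while traversing the itemset lattice bottom-up. One of its key advantages is its speed, as it avoids repeatedly scanning the data to compute individual support values. Below, I compare both algorithms using the same parameters (minimum support 0.25). Note that eclat() returns frequent itemsets rather than rules, so the 142 rules and 136 itemsets reported in the output are not directly comparable.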
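The intersection idea behind ECLAT can be illustrated in a few lines of base R. This is a simplified sketch on five hypothetical transactions, not the implementation used by the arules package; the helper name itemset_support is my own.

```r
# Hypothetical toy transactions (not taken from the expedition data)
transactions <- list(
  c("spring", "with oxygen", "EVER"),
  c("spring", "with oxygen"),
  c("autumn", "without oxygen"),
  c("spring", "EVER"),
  c("spring", "with oxygen", "EVER")
)

# Vertical layout: for each item, the IDs of the transactions containing it
items <- sort(unique(unlist(transactions)))
tid_lists <- lapply(items, function(item) {
  which(vapply(transactions, function(tr) item %in% tr, logical(1)))
})
names(tid_lists) <- items

# Support of an itemset = size of the intersection of its tid lists / n
itemset_support <- function(itemset) {
  tids <- Reduce(intersect, tid_lists[itemset])
  length(tids) / length(transactions)
}

print(itemset_support(c("spring", "with oxygen")))
print(itemset_support(c("spring", "with oxygen", "EVER")))
```

The data is scanned once to build the tid lists; every further support value comes from intersecting lists that are already in memory. In this toy example the two supports are 0.6 (transactions 1, 2, 5) and 0.4 (transactions 1, 5).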

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.75    0.1    1 none FALSE            TRUE       5    0.25      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1777 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[7191 item(s), 7111 transaction(s)] done [0.04s].
## sorting and recoding items ... [19 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [142 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.25      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 1777 
## 
## create itemset ... 
## set transactions ...[7191 item(s), 7111 transaction(s)] done [0.04s].
## sorting and recoding items ... [19 item(s)] done [0.00s].
## creating bit matrix ... [19 row(s), 7111 column(s)] done [0.00s].
## writing  ... [136 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
## [1] "Apriori time: 0.0700000000000003 seconds"
## [1] "Eclat time: 0.0600000000000023 seconds"
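The code that produced these timings is not shown in the report. A comparison of this kind could be set up roughly as follows; this is a sketch under assumptions, using the Groceries data bundled with arules as a stand-in for the expedition data and thresholds chosen for that stand-in, not the report's own values.

```r
library(arules)

# Assumption: Groceries (shipped with arules) stands in for the expedition data
data("Groceries")

# Time Apriori: returns association rules directly
time_apriori <- system.time(
  rules_ap <- apriori(Groceries,
                      parameter = list(supp = 0.01, conf = 0.25),
                      control = list(verbose = FALSE))
)

# Time Eclat: returns frequent itemsets, not rules
time_eclat <- system.time(
  sets_ec <- eclat(Groceries,
                   parameter = list(supp = 0.01),
                   control = list(verbose = FALSE))
)

# Rules can be derived from the Eclat itemsets in a separate step
rules_ec <- ruleInduction(sets_ec, Groceries, confidence = 0.25)

print(paste("Apriori time:", time_apriori[["elapsed"]], "seconds"))
print(paste("Eclat time:", time_eclat[["elapsed"]], "seconds"))
```

Single runs of a few hundredths of a second are dominated by measurement noise, so differences of this size say little about which algorithm is faster; repeating each call several times and averaging would give a more reliable comparison.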

Why did I choose Apriori then? My dataset contains numerous categorical variables (e.g., season, host, o2used, death). The data, which is based on attributes describing expedition properties (e.g., peakid, year, success), does not follow the typical structure of transactions, like shopping lists. I found Apriori well suited to this type of data because:

  • It allows for an easy transformation of the dataset into a transactional format (e.g., through one-hot encoding of categorical variables).
  • It is more intuitive for analyzing categorical data, as association rules can be directly interpreted in the context of the attributes.

Eclat, on the other hand, is typically applied to traditional transactional datasets (e.g., “shopping baskets”), where it operates on the intersections of transaction lists. My data does not have a classic “basket” structure. Additionally, certain elements like deaths, 4,000m mountains, or two-month-long expeditions are rare occurrences in the dataset. Apriori generates candidates iteratively based on item frequencies, whereas Eclat creates per-item transaction lists and intersects them, which can result in higher memory usage.

Eclat was marginally faster in the test above (0.06 s versus 0.07 s), a difference that is negligible for a dataset of this size. I therefore chose Apriori due to its better alignment with the structure and characteristics of my data.
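The transformation of categorical columns into a transactional format mentioned above can be sketched with arules. The three-row data frame is hypothetical and only borrows column names from the variable list; this is one possible route and not necessarily the import step used for this report.

```r
library(arules)

# Hypothetical toy data frame with categorical (factor) columns
toy <- data.frame(
  season = c("spring", "autumn", "spring"),
  host   = c("Nepal", "Nepal", "China"),
  o2used = c("with oxygen", "without oxygen", "with oxygen"),
  stringsAsFactors = TRUE
)

# Each factor level becomes one binary item (one-hot encoding),
# and each row becomes one transaction
trans <- as(toy, "transactions")

print(itemLabels(trans))
inspect(trans)
```

With this route the items are labelled in the form variable=level (for example season=spring), whereas the output in this report shows bare labels such as spring, so the original import step presumably differed.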

Future suggestions

In the future, this report could be expanded to include specific individuals who participated in the expeditions, providing insights into the roles and experiences of team members over time. Additionally, extending the date range of the analysis could offer a comparison of how the association rules evolve over the years, potentially highlighting changes in expedition strategies, success rates, or safety measures. Given the richness of this dataset, there are numerous opportunities for further analysis, such as exploring trends in the use of oxygen, the impact of different types of hires (e.g., sherpas or guides), or the frequency of accidents and fatalities over time.

Conclusion

In this analysis, the Apriori algorithm was applied to a dataset of Himalayan expeditions. The analysis generated several association rules, with key factors such as the use of oxygen, the number of hired people, and the type of mountain featuring prominently in the rules. By adjusting parameters such as support and confidence, and by fixing the right-hand side (RHS) to outcomes of interest, more specific and insightful patterns could be uncovered, shedding light on different expedition characteristics, success rates, and safety factors. This approach can be further refined to explore trends and behaviors over time, providing valuable insights into the evolution of Himalayan expeditions.