The aim of this study is to use association rules to identify patterns and dependencies related to Himalayan expeditions. The data comes from the Himalayan Database [https://www.himalayandatabase.com/] and includes expeditions from 1990 to 2024. The starting point of this analysis, 1990, marks the beginning of the commercial era of Himalayan climbing. The analysis focuses on expedition-level data rather than individual climbers, as this approach provides a better understanding of both the risks and the overall safety of the expeditions. In this study, an expedition is considered successful only when all its members return safely. The Apriori algorithm was used to perform the analysis, with each row in the dataset representing a unique expedition.
For the analysis, 12 variables were selected, resulting in a dataset of 7,110 unique expeditions. Most variables were converted to categorical data to make the results easier to interpret.
The final dataset consisted of the following 12 variables: “peakid”, “year”, “season”, “host”, “smtdays”, “highpoint”, “camps”, “totmembers”, “tothired”, “o2used”, “success”, and “death”. Only peaks with more than 20 expeditions recorded after 1990 were included in the analysis.
Frequency plots were created for these variables to provide a visual representation of their distributions.
# Draw one bar plot per variable, three plots per row
par(mfrow = c(1, 3))
variables <- c("year", "season", "host", "smtdays", "highpoint", "camps",
               "totmembers", "tothired", "o2used", "success", "death", "peakid")
for (var in variables) {
  if (var %in% names(Himalayan)) {  # skip names missing from the data frame
    counts <- table(Himalayan[[var]])
    barplot(counts,
            main = paste("Barplot for", var),
            col = rainbow(length(counts)),
            cex.main = 1,
            las = 3,
            cex.lab = 1)
  }
}
par(mfrow = c(1, 1))  # restore the default layout
“The Apriori algorithm was proposed by Agrawal and Srikant in 1994. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of a website frequentation or IP addresses). Given a threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database. Apriori uses a “bottom up” approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.” [https://en.wikipedia.org/wiki/Apriori_algorithm]
We can distinguish three common metrics used to evaluate the quality of Association Rules: support, confidence, and lift.
Support
Support measures how often the joint itemset appears in the database. Simply put, it is a frequency-based measure of the itemset’s occurrence in the dataset.
Confidence
Confidence is expressed as a percentage and indicates how often the rule’s consequent (Y) appears among all the groups that contain the rule’s antecedent (X). It serves as an indicator of the rule’s reliability (IBM, 2021a). A higher confidence value suggests a stronger rule.
X: antecedent itemset Y: consequent itemset
Lift
Lift is a ratio that compares the confidence of a rule to its expected confidence. It measures the likelihood of co-occurrence between X and Y. The lift value can range from 0 to infinity and is interpreted as follows:
Value greater than 1: X and Y are positively dependent.
Value equal to 1: X and Y are independent, meaning no meaningful rule can be derived.
Value less than 1: X and Y are negatively dependent. The presence of X reduces the likelihood of Y occurring (IBM, 2021b).
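To make the three definitions concrete, the sketch below recomputes them for one rule reported later in this report, {summit reached, with oxygen} => {eight-thousanders}. The joint count (2155) and the number of transactions (7111) are taken from the printed output; the counts for X and Y alone are not printed and are back-computed here from the reported coverage and lift, so they should be read as assumptions.

```r
# Counts taken or back-computed from the inspect() output in this report
n    <- 7111  # number of transactions
n_xy <- 2155  # transactions containing both X and Y (count column)
n_x  <- 2204  # X = {summit reached, with oxygen}; assumed, from coverage * n
n_y  <- 3579  # Y = {eight-thousanders}; assumed, from confidence / lift * n

support    <- n_xy / n                # share of all transactions with X and Y
confidence <- n_xy / n_x              # share of X transactions that also contain Y
lift       <- confidence / (n_y / n)  # confidence relative to the baseline share of Y

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

The three values agree with the support, confidence, and lift columns printed for that rule.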
Now, data needs to be transformed into a format that can be used by the read.transactions function.
## [1] 7111
## items
## [1] {camps,
## death,
## highpoint,
## host,
## o2used,
## peakid,
## season,
## smtdays,
## success,
## tothired,
## totmembers,
## year}
## [2] {1,
## 1-2 camps,
## 15-25 days,
## 2-8 members,
## AMAD,
## Before 2000,
## Nepal,
## no one died,
## six-thousanders,
## solo expedition,
## spring,
## summit reached,
## without oxygen}
## [3] {2,
## 2-8 members,
## 2 people hired,
## 3-4 camps,
## 5-15 days,
## AMAD,
## autumn,
## Before 2000,
## Nepal,
## no one died,
## six-thousanders,
## summit reached,
## without oxygen}
## [4] {1-2 camps,
## 3,
## 5-15 days,
## 9-15 members,
## AMAD,
## autumn,
## Before 2000,
## Nepal,
## no one died,
## six-thousanders,
## solo expedition,
## summit reached,
## without oxygen}
## [5] {1-2 camps,
## 2-8 members,
## 4,
## 5-15 days,
## AMAD,
## autumn,
## Before 2000,
## Nepal,
## no one died,
## six-thousanders,
## solo expedition,
## summit reached,
## without oxygen}
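In the output above, transaction [1] consists of the column names, and the items 1, 2, 3, and 4 look like row numbers. This suggests that the file was written with row names and read without skipping the header, which would also explain 7111 transactions for 7,110 expeditions. A minimal sketch of a conversion that avoids both effects is shown below; the data frame `toy` and the temporary file are hypothetical stand-ins, since the original preparation code is not shown.

```r
library(arules)

# Hypothetical stand-in for the prepared, already categorised data frame
toy <- data.frame(
  season  = c("spring", "autumn", "spring"),
  host    = c("Nepal", "China", "Nepal"),
  success = c("summit reached", "summit not reached", "summit reached")
)

# row.names = FALSE keeps the row index out of the file
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# skip = 1 drops the header line so it is not read as a transaction
trans <- read.transactions(csv_path, format = "basket", sep = ",", skip = 1)

print(length(trans))      # one transaction per expedition
print(itemLabels(trans))  # category labels only
```

With this setup the number of transactions equals the number of rows in the data frame, and neither column names nor row numbers appear among the items.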
Now that the data has been transformed, we can check how frequent the individual items are.
Since expeditions to higher altitudes require extensive preparation and are far more complex than typical mountain hikes, I expect many of the items to exhibit similar frequencies. This is not surprising given the nature of the phenomenon being studied.
Below are charts presenting the item frequency for the 20 most frequent items, in relative and absolute terms:
itemFrequencyPlot(expedition, topN=20, type="relative", col="lightblue",main="ItemFrequency")
itemFrequencyPlot(expedition, type = "absolute", topN = 20, col = "lightgreen", main = "Item Frequency - Absolute")
Now let’s move on to the Apriori algorithm. The support level was set to 0.1, the confidence to 0.8, and the minimum rule length to 2. This yields 1,902 rules, which in my opinion is too many to interpret.
rules1<-apriori(expedition, parameter=list(supp=0.1, conf=0.8, minlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 711
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[7191 item(s), 7111 transaction(s)] done [0.04s].
## sorting and recoding items ... [30 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 done [0.02s].
## writing ... [1902 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
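The line “Absolute minimum support count” in the log translates the relative support threshold into a number of expeditions. The sketch below reproduces that arithmetic for the two thresholds used in this report. It assumes the printed value is the truncated product of support and transaction count; the smallest count that actually satisfies the threshold is the rounded-up value, which is consistent with the minimum count of 1778 reported in the summary of rules2.

```r
# Number of transactions reported by arules in the log above
n_transactions <- 7111

# Relative support thresholds used for rules1 and rules2
supp <- c(rules1 = 0.10, rules2 = 0.25)

# Truncated value, matching "Absolute minimum support count" in the log
printed_count <- floor(supp * n_transactions)

# Smallest whole number of expeditions with support >= supp
required_count <- ceiling(supp * n_transactions)

print(rbind(printed_count, required_count))
```

In other words, a support of 0.25 means that a rule must hold in at least 1778 expeditions before it is reported.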
I decided to try a few different parameter settings, eventually ending up with a support level of 0.25 and a confidence level of 0.75. With these parameters, Apriori returns 141 rules.
rules2<-apriori(expedition, parameter=list(supp=0.25, conf=0.75, minlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.75 0.1 1 none FALSE TRUE 5 0.25 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1777
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[7191 item(s), 7111 transaction(s)] done [0.05s].
## sorting and recoding items ... [19 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [141 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules2)
## set of 141 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 35 73 33
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 2.986 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.2500 Min. :0.7500 Min. :0.2579 Min. :0.9762
## 1st Qu.:0.2669 1st Qu.:0.8506 1st Qu.:0.2980 1st Qu.:1.0076
## Median :0.2964 Median :0.9275 Median :0.3440 Median :1.1076
## Mean :0.3218 Mean :0.8963 Mean :0.3612 Mean :1.2648
## 3rd Qu.:0.3496 3rd Qu.:0.9524 3rd Qu.:0.3915 3rd Qu.:1.4878
## Max. :0.6853 Max. :0.9911 Max. :0.7254 Max. :1.9709
## count
## Min. :1778
## 1st Qu.:1898
## Median :2108
## Mean :2288
## 3rd Qu.:2486
## Max. :4873
##
## mining info:
## data ntransactions support confidence
## expedition 7111 0.25 0.75
## call
## apriori(data = expedition, parameter = list(supp = 0.25, conf = 0.75, minlen = 2))
As 141 rules might be tough to interpret, let’s take a look at the lift.
# Histogram of lift across all 141 rules
hist(quality(rules2)$lift,
     breaks = 30,
     col = "pink",
     main = "Lift distribution",
     xlab = "Lift",
     ylab = "Number of rules"
)
Since a lift around 1 implies independent itemsets, which are not in our field of interest, I’ll remove such rules by keeping only those with a lift of at least 1.2. After that, 66 rules remain.
# Keep only rules with a clearly positive dependence
rules_apriori_1 <- subset(rules2, lift >= 1.2)
hist(quality(rules_apriori_1)$lift,
     breaks = 30,
     col = "pink",
     main = "Lift distribution",
     xlab = "Lift",
     ylab = "Number of rules"
)
length(rules_apriori_1)
## [1] 66
summary(rules_apriori_1)
## set of 66 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 13 32 21
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.121 4.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.2500 Min. :0.7512 Min. :0.2582 Min. :1.265
## 1st Qu.:0.2669 1st Qu.:0.8359 1st Qu.:0.3001 1st Qu.:1.410
## Median :0.2825 Median :0.8636 Median :0.3293 Median :1.493
## Mean :0.2983 Mean :0.8682 Mean :0.3460 Mean :1.547
## 3rd Qu.:0.3177 3rd Qu.:0.8894 3rd Qu.:0.3789 3rd Qu.:1.605
## Max. :0.4279 Max. :0.9911 Max. :0.5483 Max. :1.971
## count
## Min. :1778
## 1st Qu.:1898
## Median :2009
## Mean :2121
## 3rd Qu.:2260
## Max. :3043
##
## mining info:
## data ntransactions support confidence
## expedition 7111 0.25 0.75
## call
## apriori(data = expedition, parameter = list(supp = 0.25, conf = 0.75, minlen = 2))
plot(rules_apriori_1,
method = "graph",
measure = "support",
colors = c("#9933cc", "#ffccff")
)
In both plots, the main focus areas are eight-thousanders, 3-4 camps, summit reached, and with oxygen.
plot(rules_apriori_1, method="paracoord", control=list(reorder=TRUE))
Let’s try to find out more by taking a look at the rules.
inspect(head(sort(rules_apriori_1, by="confidence", decreasing=TRUE),10))
## lhs rhs support confidence coverage lift count
## [1] {3-4 camps,
## summit reached,
## with oxygen} => {eight-thousanders} 0.2669104 0.9911227 0.2693011 1.969230 1898
## [2] {summit reached,
## with oxygen} => {eight-thousanders} 0.3030516 0.9777677 0.3099423 1.942695 2155
## [3] {no one died,
## summit reached,
## with oxygen} => {eight-thousanders} 0.2815356 0.9765854 0.2882858 1.940346 2002
## [4] {3-4 camps,
## spring,
## summit reached} => {eight-thousanders} 0.2593166 0.9715490 0.2669104 1.930339 1844
## [5] {no one died,
## six-thousanders} => {without oxygen} 0.2505977 0.9705882 0.2581915 1.568960 1782
## [6] {six-thousanders} => {without oxygen} 0.2579103 0.9672996 0.2666292 1.563643 1834
## [7] {six-thousanders} => {Nepal} 0.2566446 0.9625527 0.2666292 1.327009 1825
## [8] {3-4 camps,
## with oxygen} => {eight-thousanders} 0.2999578 0.9417219 0.3185206 1.871077 2133
## [9] {3-4 camps,
## no one died,
## with oxygen} => {eight-thousanders} 0.2790044 0.9411765 0.2964421 1.869993 1984
## [10] {spring,
## with oxygen} => {eight-thousanders} 0.2631135 0.9331671 0.2819575 1.854080 1871
About 27% of all expeditions (support 0.267) had 3-4 camps, reached the summit, used oxygen, and fall in the eight-thousander band; among expeditions with the first three characteristics, 99.1% are in that band (confidence 0.991). Similarly, about 26% of all expeditions combine six-thousanders with climbing without oxygen, and 96.7% of six-thousander expeditions went without oxygen. About 96.3% of six-thousander expeditions were hosted in Nepal. Based on these results, expeditions that used oxygen, reached the summit, and used 3-4 camps were almost always in the eight-thousander band.
Rules involving oxygen use and a reached summit dominate the top of the list: these two items appear to be the strongest indicators that an expedition belongs to the eight-thousander band.
The relationship between eight-thousanders and oxygen use is quite obvious, since it is hard to breathe at such altitudes, so I’ll exclude the “with oxygen” item to see whether other patterns emerge.
exp_without_oxygen <- expedition[, !itemLabels(expedition) %in% "with oxygen"]
rules_apriori_8k <-apriori(exp_without_oxygen, parameter=list(supp=0.05, conf=0.5),
appearance=list(default="lhs", rhs="eight-thousanders"), control=list(verbose=F))
inspect(head(sort(rules_apriori_8k, by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage lift count
## [1] {CHOY,
## summit reached} => {eight-thousanders} 0.10997047 1 0.10997047 1.986868 782
## [2] {EVER,
## summit reached} => {eight-thousanders} 0.18014344 1 0.18014344 1.986868 1281
## [3] {3-4 camps,
## 35-40 days,
## summit reached} => {eight-thousanders} 0.05667276 1 0.05667276 1.986868 403
## [4] {25-35 days,
## EVER,
## summit reached} => {eight-thousanders} 0.05850091 1 0.05850091 1.986868 416
## [5] {15-25 days,
## CHOY,
## summit reached} => {eight-thousanders} 0.07453241 1 0.07453241 1.986868 530
## [6] {China,
## CHOY,
## summit reached} => {eight-thousanders} 0.10645479 1 0.10645479 1.986868 757
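Key takeaways: the highest-confidence rules involve summits of Everest (EVER) and Cho Oyu (CHOY). A confidence of 1 is expected here, because both peaks are higher than 8,000 m, so any expedition that reached their summit is in the eight-thousander band by definition. The durations in these rules range from 15 to 40 days, so these expeditions typically lasted around a month.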
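The rule search above fixes the right-hand side through the `appearance` argument of `apriori()`. The sketch below shows the same mechanism on a handful of hypothetical toy transactions whose labels mimic the items in this report; the toy data and thresholds are assumptions chosen only so that the example runs on its own.

```r
library(arules)

# Hypothetical toy transactions; labels mimic the items used in this report
toy <- as(list(
  c("EVER", "summit reached", "eight-thousanders"),
  c("CHOY", "summit reached", "eight-thousanders"),
  c("AMAD", "summit reached", "six-thousanders"),
  c("EVER", "summit not reached", "seven-thousanders"),
  c("AMAD", "summit not reached", "five-thousanders")
), "transactions")

# default = "lhs" lets every other item appear only on the left-hand side,
# while rhs restricts the right-hand side to the single item of interest
rules_toy <- apriori(
  toy,
  parameter  = list(supp = 0.1, conf = 0.5, minlen = 2),
  appearance = list(default = "lhs", rhs = "eight-thousanders"),
  control    = list(verbose = FALSE)
)

inspect(sort(rules_toy, by = "confidence"))
```

Swapping the roles (`default = "rhs"`, `lhs = "eight-thousanders"`) instead lists what tends to accompany the chosen item.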
Now let’s see how we can characterise expeditions in the eight-thousander band.
rules_apriori_8k<-apriori(exp_without_oxygen, parameter=list(supp=0.05, conf=0.5,minlen=2),
appearance=list(default="rhs", lhs="eight-thousanders"), control=list(verbose=F))
inspect(head(sort(rules_apriori_8k, by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage
## [1] {eight-thousanders} => {no one died} 0.4698355 0.9335010 0.5033047
## [2] {eight-thousanders} => {summit reached} 0.4279286 0.8502375 0.5033047
## [3] {eight-thousanders} => {3-4 camps} 0.4118971 0.8183850 0.5033047
## [4] {eight-thousanders} => {spring} 0.3515680 0.6985191 0.5033047
## [5] {eight-thousanders} => {Nepal} 0.2979890 0.5920648 0.5033047
## [6] {eight-thousanders} => {2-8 members} 0.2811138 0.5585359 0.5033047
## lift count
## [1] 0.9857626 3341
## [2] 1.4021426 3043
## [3] 1.4925714 2929
## [4] 1.3586350 2500
## [5] 0.8162414 2119
## [6] 0.9074135 1999
The majority of expeditions in the eight-thousander band took place in spring (70%, lift 1.36), used 3 to 4 camps (82%, lift 1.49), and reached the summit (85%, lift 1.40). About 59% were hosted in Nepal and about 56% had small teams of 2-8 members, but both lifts are below 1, so these traits are less common among eight-thousander expeditions than in the dataset as a whole.
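Because lift is confidence divided by the overall share of the right-hand side, that baseline share can be recovered from the two columns printed above. The sketch below does this for the six rules; the numbers are copied from the output, and the data frame is only an illustration.

```r
# Confidence and lift copied from the inspect() output above
rules <- data.frame(
  rhs        = c("no one died", "summit reached", "3-4 camps",
                 "spring", "Nepal", "2-8 members"),
  confidence = c(0.9335010, 0.8502375, 0.8183850,
                 0.6985191, 0.5920648, 0.5585359),
  lift       = c(0.9857626, 1.4021426, 1.4925714,
                 1.3586350, 0.8162414, 0.9074135)
)

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
rules$baseline <- rules$confidence / rules$lift

# TRUE when the item is more common among eight-thousanders than overall
rules$more_common_than_overall <- rules$lift > 1

print(rules, digits = 3)
```

For Nepal, for example, the baseline is about 0.73, so a confidence of 0.59 means the Nepalese side is under-represented among eight-thousander expeditions even though it still accounts for the majority.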
I believe it would also be interesting to find out why expeditions didn’t reach the peak.
rules.summitnotreached<-apriori(data=expedition, parameter=list(supp=0.01, conf=0.08,maxlen=3), appearance=list(default="lhs", rhs="summit not reached"), control=list(verbose=F))
rules.summitnotreached.byconf<-sort(rules.summitnotreached, by="confidence", decreasing=TRUE)
inspect(head(rules.summitnotreached.byconf,10))
## lhs rhs support confidence coverage lift count
## [1] {five-thousanders} => {summit not reached} 0.03529743 1 0.03529743 2.541458 251
## [2] {PUMO,
## six-thousanders} => {summit not reached} 0.01448460 1 0.01448460 2.541458 103
## [3] {five-thousanders,
## no camp} => {summit not reached} 0.01490648 1 0.01490648 2.541458 106
## [4] {1-5 days,
## five-thousanders} => {summit not reached} 0.01673464 1 0.01673464 2.541458 119
## [5] {AMAD,
## five-thousanders} => {summit not reached} 0.01251582 1 0.01251582 2.541458 89
## [6] {five-thousanders,
## solo expedition} => {summit not reached} 0.01518774 1 0.01518774 2.541458 108
## [7] {5-15 days,
## five-thousanders} => {summit not reached} 0.01476586 1 0.01476586 2.541458 105
## [8] {1-2 camps,
## five-thousanders} => {summit not reached} 0.01996906 1 0.01996906 2.541458 142
## [9] {2010-2019,
## five-thousanders} => {summit not reached} 0.01617213 1 0.01617213 2.541458 115
## [10] {2000-2009,
## five-thousanders} => {summit not reached} 0.01279707 1 0.01279707 2.541458 91
All ten rules have a confidence of 1 and the same lift (2.54), and nine of them contain “five-thousanders”. The altitude band appears to be derived from the highest point reached (highpoint), so these rules are close to a definition: an expedition that stopped in the 5,000 m band on a peak such as Ama Dablam (AMAD), or in the 6,000 m band on Pumori (PUMO), which rises above 7,000 m, cannot have reached the summit. The additional items (short duration of 1-5 days, no camp, 1-2 camps, solo expedition, the 2000-2009 and 2010-2019 periods) describe what such early-ending expeditions look like rather than independent causes of failure.
When the summit was not reached:
rules.summitnotreached<-apriori(data=expedition, parameter=list(supp=0.01, conf=0.08,minlen=2), appearance=list(default="rhs", lhs="summit not reached"), control=list(verbose=F))
rules.summitnotreached.byconf<-sort(rules.summitnotreached, by="confidence", decreasing=TRUE)
inspect(head(rules.summitnotreached.byconf,10))
## lhs rhs support confidence coverage
## [1] {summit not reached} => {no one died} 0.3718183 0.9449607 0.3934749
## [2] {summit not reached} => {without oxygen} 0.3221769 0.8187991 0.3934749
## [3] {summit not reached} => {Nepal} 0.2933483 0.7455325 0.3934749
## [4] {summit not reached} => {2-8 members} 0.2503164 0.6361687 0.3934749
## [5] {summit not reached} => {1-2 camps} 0.1995500 0.5071480 0.3934749
## [6] {summit not reached} => {spring} 0.1912530 0.4860615 0.3934749
## [7] {summit not reached} => {autumn} 0.1892842 0.4810579 0.3934749
## [8] {summit not reached} => {2000-2009} 0.1551118 0.3942102 0.3934749
## [9] {summit not reached} => {six-thousanders} 0.1449866 0.3684775 0.3934749
## [10] {summit not reached} => {solo expedition} 0.1445648 0.3674053 0.3934749
## lift count
## [1] 0.9978639 2644
## [2] 1.3235919 2291
## [3] 1.0278173 2086
## [4] 1.0335379 1780
## [5] 1.4248634 1419
## [6] 0.9454002 1360
## [7] 1.0388104 1346
## [8] 1.0069067 1103
## [9] 1.3819849 1031
## [10] 1.2494591 1028
{summit not reached} => {no one died}: In 94.5% of expeditions that did not reach the summit, no one died. The lift of about 1.00 shows that this simply mirrors the overall share of expeditions without deaths.
{summit not reached} => {without oxygen}: Expeditions that did not reach the summit were more likely than average to have been conducted without oxygen (confidence 0.82, lift 1.32).
{summit not reached} => {Nepal}: About 75% of expeditions that did not reach the summit were hosted in Nepal, only slightly above the overall share (lift 1.03).
Overall, expeditions that did not reach the summit stand out mainly by climbing without oxygen (lift 1.32), using only 1-2 camps (lift 1.42), and stopping in the six-thousander band (lift 1.38). The absence of deaths, the Nepal side, and small teams are about as common as in the dataset as a whole.
Under which conditions did no one die?
rules.nodeath<-apriori(data=expedition, parameter=list(supp=0.01, conf=0.08,maxlen=3), appearance=list(default="lhs", rhs="no one died"), control=list(verbose=F))
rules.nodeath.byconf<-sort(rules.nodeath, by="confidence", decreasing=TRUE)
inspect(head(rules.nodeath.byconf,10))
## lhs rhs support confidence
## [1] {2000-2009, five-thousanders} => {no one died} 0.01279707 1.0000000
## [2] {AMAD, no camp} => {no one died} 0.01532836 1.0000000
## [3] {AMAD, solo} => {no one died} 0.02151596 1.0000000
## [4] {1 person hired, AMAD} => {no one died} 0.03923499 0.9928826
## [5] {CHOY, solo} => {no one died} 0.03290676 0.9915254
## [6] {2010-2019, AMAD} => {no one died} 0.06229785 0.9910515
## [7] {HIML, seven-thousanders} => {no one died} 0.01546899 0.9909910
## [8] {1-2 camps, AMAD} => {no one died} 0.10547040 0.9907530
## [9] {3-6 people hired, AMAD} => {no one died} 0.02939108 0.9905213
## [10] {HIML, summit reached} => {no one died} 0.01462523 0.9904762
## coverage lift count
## [1] 0.01279707 1.055985 91
## [2] 0.01532836 1.055985 109
## [3] 0.02151596 1.055985 153
## [4] 0.03951624 1.048469 279
## [5] 0.03318802 1.047036 234
## [6] 0.06286036 1.046535 443
## [7] 0.01560962 1.046471 110
## [8] 0.10645479 1.046220 750
## [9] 0.02967234 1.045975 209
## [10] 0.01476586 1.045928 104
The rules with the highest confidence for “no one died” mostly involve Ama Dablam (AMAD): with no camp, solo climbs, 1-2 camps, or a small number of hired staff, the confidence is at or close to 1. The same holds for expeditions between 2000 and 2009 in the five-thousander band and for expeditions to HIML, particularly when the summit was reached. However, the lift of every rule is only about 1.05, because roughly 95% of all expeditions (confidence divided by lift) recorded no deaths. These rules therefore identify conditions under which deaths were almost never recorded, rather than conditions that differ strongly from the baseline.
Cases when a member died and cases when a Sherpa died:
#Member died
rules.mdeath<-apriori(data=expedition, parameter=list(supp=0.005, conf=0.08,maxlen=3), appearance=list(default="lhs", rhs="member died"), control=list(verbose=F))
rules.mdeath.byconf<-sort(rules.mdeath, by="confidence", decreasing=TRUE)
inspect(rules.mdeath.byconf)
## lhs rhs support confidence
## [1] {Before 2000, with oxygen} => {member died} 0.005343833 0.09921671
## [2] {9-15 members, Before 2000} => {member died} 0.005062579 0.09254499
## [3] {Before 2000, eight-thousanders} => {member died} 0.008718886 0.09253731
## [4] {Before 2000, spring} => {member died} 0.006187597 0.08239700
## coverage lift count
## [1] 0.05386022 2.672462 38
## [2] 0.05470398 2.492755 36
## [3] 0.09422022 2.492549 62
## [4] 0.07509492 2.219413 44
#Sherpa died
rules.sdeath<-apriori(data=expedition, parameter=list(supp=0.001, conf=0.08,maxlen=3), appearance=list(default="lhs", rhs="sherpa died"), control=list(verbose=F))
rules.sdeath.byconf<-sort(rules.sdeath, by="confidence", decreasing=TRUE)
inspect(rules.sdeath.byconf)
## lhs rhs support confidence coverage lift count
## [1] {more than 20 people hired,
## Nepal} => {sherpa died} 0.001406272 0.13698630 0.010265785 10.588148 10
## [2] {EVER,
## more than 20 people hired} => {sherpa died} 0.001546899 0.13580247 0.011390803 10.496645 11
## [3] {more than 20 people hired,
## spring} => {sherpa died} 0.001546899 0.13580247 0.011390803 10.496645 11
## [4] {more than 20 people hired} => {sherpa died} 0.001546899 0.12087912 0.012797075 9.343168 11
## [5] {3-4 camps,
## more than 20 people hired} => {sherpa died} 0.001125018 0.11267606 0.009984531 8.709124 8
## [6] {more than 20 people hired,
## summit reached} => {sherpa died} 0.001125018 0.10256410 0.010968921 7.927536 8
## [7] {eight-thousanders,
## more than 20 people hired} => {sherpa died} 0.001125018 0.09756098 0.011531430 7.540827 8
## [8] {more than 20 people hired,
## with oxygen} => {sherpa died} 0.001125018 0.09411765 0.011953312 7.274680 8
Expeditions before 2000, particularly those with oxygen, larger teams of 9-15 members, or in the eight-thousander band, are associated with a higher likelihood of a member dying (lift 2.2 to 2.7), although the confidence stays below 10%. These patterns suggest that earlier expeditions, especially in challenging conditions, had a greater risk of fatalities.
Sherpa deaths follow a different pattern. Every rule contains “more than 20 people hired” on the left-hand side (LHS), in most cases combined with Nepal, Everest, spring, or oxygen use, with “sherpa died” on the right-hand side (RHS). The lifts are high (roughly 7 to 11), but each rule rests on only 8 to 11 expeditions, so these patterns should be read with caution.
Other interesting rules:
#Summit reached
rules.summitreached<-apriori(data=expedition, parameter=list(supp=0.01, conf=0.08,maxlen=3), appearance=list(default="lhs", rhs="summit reached"), control=list(verbose=F))
rules.summitreached.byconf<-sort(rules.summitreached, by="confidence", decreasing=TRUE)
inspect(head(rules.summitreached.byconf,10))
## lhs rhs support confidence coverage lift count
## [1] {PUMO,
## seven-thousanders} => {summit reached} 0.01251582 0.9673913 0.01293770 1.595343 89
## [2] {HIML,
## seven-thousanders} => {summit reached} 0.01476586 0.9459459 0.01560962 1.559977 105
## [3] {eight-thousanders,
## more than 20 people hired} => {summit reached} 0.01082829 0.9390244 0.01153143 1.548563 77
## [4] {CHOY,
## eight-thousanders} => {summit reached} 0.10997047 0.9298454 0.11826747 1.533426 782
## [5] {11-20 people hired,
## eight-thousanders} => {summit reached} 0.03192237 0.9265306 0.03445366 1.527959 227
## [6] {more than 20 people hired,
## with oxygen} => {summit reached} 0.01096892 0.9176471 0.01195331 1.513309 78
## [7] {16 and more members,
## 3-4 camps} => {summit reached} 0.02756293 0.9116279 0.03023485 1.503383 196
## [8] {16 and more members,
## eight-thousanders} => {summit reached} 0.03023485 0.9110169 0.03318802 1.502375 215
## [9] {9-15 members,
## eight-thousanders} => {summit reached} 0.08761074 0.9081633 0.09647026 1.497669 623
## [10] {11-20 people hired,
## with oxygen} => {summit reached} 0.03164112 0.9036145 0.03501617 1.490168 225
#Solo expedition
rules.solo<-apriori(data=expedition, parameter=list(supp=0.01, conf=0.08,maxlen=3), appearance=list(default="lhs", rhs="solo expedition"), control=list(verbose=F))
rules.solo.byconf<-sort(rules.solo, by="confidence", decreasing=TRUE)
inspect(head(rules.solo.byconf,10))
## lhs rhs support confidence coverage lift count
## [1] {2000-2009,
## no camp} => {solo expedition} 0.01237519 0.5986395 0.02067220 2.035832 88
## [2] {LHOT,
## without oxygen} => {solo expedition} 0.01251582 0.5933333 0.02109408 2.017787 89
## [3] {35-40 days,
## without oxygen} => {solo expedition} 0.01026579 0.5703125 0.01800028 1.939499 73
## [4] {eight-thousanders,
## without oxygen} => {solo expedition} 0.08957952 0.5622242 0.15933061 1.911992 637
## [5] {2-8 members,
## PUMO} => {solo expedition} 0.01181268 0.5562914 0.02123471 1.891816 84
## [6] {CHOY,
## solo} => {solo expedition} 0.01842216 0.5550847 0.03318802 1.887713 131
## [7] {ANN1,
## without oxygen} => {solo expedition} 0.01096892 0.5454545 0.02010969 1.854963 78
## [8] {China,
## without oxygen} => {solo expedition} 0.07734496 0.5445545 0.14203347 1.851902 550
## [9] {25-35 days,
## without oxygen} => {solo expedition} 0.03585994 0.5437100 0.06595416 1.849030 255
## [10] {no camp,
## without oxygen} => {solo expedition} 0.03656307 0.5295316 0.06904795 1.800813 260
#Everest
rules.everest<-apriori(data=expedition, parameter=list(supp=0.01, conf=0.08,maxlen=3), appearance=list(default="lhs", rhs="EVER"), control=list(verbose=F))
rules.everest.byconf<-sort(rules.everest, by="confidence", decreasing=TRUE)
inspect(head(rules.everest.byconf,10))
## lhs rhs support confidence coverage lift count
## [1] {40-50 days,
## China} => {EVER} 0.02334411 0.9940120 0.02348474 3.765807 166
## [2] {more than 20 people hired,
## spring} => {EVER} 0.01110955 0.9753086 0.01139080 3.694949 79
## [3] {35-40 days,
## China} => {EVER} 0.02545352 0.9731183 0.02615666 3.686651 181
## [4] {50-60 days,
## with oxygen} => {EVER} 0.01350021 0.9230769 0.01462523 3.497070 96
## [5] {50-60 days,
## eight-thousanders} => {EVER} 0.01378147 0.9074074 0.01518774 3.437706 98
## [6] {3-4 camps,
## 50-60 days} => {EVER} 0.01195331 0.9042553 0.01321896 3.425764 85
## [7] {11-20 people hired,
## spring} => {EVER} 0.02868795 0.9026549 0.03178175 3.419701 204
## [8] {eight-thousanders,
## more than 20 people hired} => {EVER} 0.01040641 0.9024390 0.01153143 3.418883 74
## [9] {more than 20 people hired,
## with oxygen} => {EVER} 0.01068767 0.8941176 0.01195331 3.387358 76
## [10] {more than 20 people hired} => {EVER} 0.01139080 0.8901099 0.01279707 3.372174 81
The rules suggest that expeditions with certain combinations, such as “PUMO with seven-thousanders”, “CHOY with eight-thousanders”, or “16 and more members with 3-4 camps”, are more likely to have reached the summit. Using oxygen or hiring a larger staff (more than 20 people) is also associated with a higher share of summits, particularly in the eight-thousander band. Some of these links are partly built into the data: if the altitude band reflects the highest point reached, as the transactions suggest, then an expedition on PUMO in the seven-thousander band or on CHOY in the eight-thousander band has already climbed close to the top.
For the solo rules, the LHS highlights conditions such as time period (2000-2009), oxygen use, and specific peaks (e.g., LHOT, PUMO, CHOY), while the RHS is “solo expedition”. These expeditions are more common when no camp is set, when oxygen isn’t used, and on certain peaks and in certain periods; they are relatively frequent in the eight-thousander band without oxygen (confidence 0.56, lift 1.91). Note that in the transactions shown earlier, “solo expedition” appears together with “2-8 members” and “9-15 members”, which suggests that the label describes expeditions without hired personnel rather than single climbers.
For the Everest rules, the LHS describes expedition conditions such as duration (35-60 days), host country (China), number of people hired, and oxygen use, while the RHS indicates that the expedition took place on Mount Everest (“EVER”), not that it was successful. These rules suggest that long expeditions, expeditions from the Chinese side lasting 35-50 days, and expeditions with a large hired staff are characteristic of Everest (lift of roughly 3.4 to 3.8).
plot(rules_apriori_1, method="graph", measure="support", shading="lift", engine="html")
Rule 20: {eight-thousanders, with oxygen} => {3-4 camps}
The rule indicates that climbing in the eight-thousander band with oxygen is associated with using 3 or 4 camps.
Support = 0.30: about 30% of all expeditions combine eight-thousanders, oxygen use, and 3-4 camps.
Confidence = 0.872: if an expedition was in the eight-thousander band and used oxygen, there is an 87.2% chance that it used 3 or 4 camps.
Lift = 1.59: 3-4 camps occur 1.59 times more often among eight-thousander expeditions with oxygen than they would if the two sides of the rule were independent.
Apriori is not the only algorithm for generating association rules; ECLAT can also be used. ECLAT uses straightforward intersection operations for equivalence class clustering together with a bottom-up traversal of the lattice. One of its key advantages is its speed, as it avoids repeatedly scanning the data to compute individual support values. Below I compare the two algorithms using the same parameters.
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.75 0.1 1 none FALSE TRUE 5 0.25 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1777
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[7191 item(s), 7111 transaction(s)] done [0.04s].
## sorting and recoding items ... [19 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [142 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.25 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 1777
##
## create itemset ...
## set transactions ...[7191 item(s), 7111 transaction(s)] done [0.04s].
## sorting and recoding items ... [19 item(s)] done [0.00s].
## creating bit matrix ... [19 row(s), 7111 column(s)] done [0.00s].
## writing ... [136 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
## [1] "Apriori time: 0.0700000000000003 seconds"
## [1] "Eclat time: 0.0600000000000023 seconds"
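In the logs above, Apriori reports 142 rules while Eclat reports 136 sets, because the Eclat run targets frequent itemsets rather than rules. To compare the two on equal terms, rules can be derived from the Eclat itemsets in a second step. The sketch below illustrates this with `ruleInduction()` from arules on hypothetical toy transactions; the data and thresholds are assumptions chosen only so that the example is self-contained.

```r
library(arules)

# Hypothetical toy transactions; labels mimic the items used in this report
toy <- as(list(
  c("spring", "with oxygen", "eight-thousanders"),
  c("spring", "with oxygen", "eight-thousanders"),
  c("autumn", "without oxygen", "six-thousanders"),
  c("spring", "without oxygen", "six-thousanders"),
  c("autumn", "with oxygen", "eight-thousanders")
), "transactions")

# Step 1: Eclat mines frequent itemsets only
itemsets <- eclat(toy,
                  parameter = list(supp = 0.3, minlen = 1),
                  control = list(verbose = FALSE))

# Step 2: derive rules from those itemsets at a chosen confidence
rules_eclat <- ruleInduction(itemsets, toy, confidence = 0.75)

# Apriori with the same thresholds, for comparison
rules_apriori <- apriori(toy,
                         parameter = list(supp = 0.3, conf = 0.75, minlen = 2),
                         control = list(verbose = FALSE))

print(c(eclat_itemsets = length(itemsets),
        eclat_rules    = length(rules_eclat),
        apriori_rules  = length(rules_apriori)))
```

Comparing the number of rules from both routes, rather than rules against itemsets, gives a like-for-like picture of the two algorithms.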
Why did I choose Apriori then? My dataset contains numerous categorical variables (e.g., season, host, o2used, death). The data describes expedition attributes (e.g., peakid, year, success) and does not follow the typical structure of transactions such as shopping lists. I found Apriori well suited to this type of data because:
• Once the dataset is transformed into a transactional format (e.g., through one-hot encoding of the categorical variables), Apriori returns rules directly, while the Eclat run above returns frequent itemsets (136 sets, compared with 142 rules from Apriori), so the two counts are not directly comparable.
• The resulting association rules can be interpreted directly in terms of the expedition attributes.
Eclat is usually presented in the context of traditional transactional datasets (e.g., “shopping baskets”), where it works by intersecting lists of transactions, and my data does not have a classic “basket” structure. Additionally, certain items such as deaths, 4,000 m mountains, or two-month-long expeditions are rare in the dataset. Eclat’s lists of items and transaction intersections can increase memory usage, although on a dataset of this size the difference is negligible: the two run times above are almost identical (0.07 s versus 0.06 s). Although Eclat might be faster in some scenarios, I chose Apriori because it fits the structure of my data and the goal of generating rules.
In the future, this report could be expanded to include specific individuals who participated in the expeditions, providing insights into the roles and experiences of team members over time. Additionally, extending the date range of the analysis could offer a comparison of how the association rules evolve over the years, potentially highlighting changes in expedition strategies, success rates, or safety measures. Given the richness of this dataset, there are numerous opportunities for further analysis, such as exploring trends in the use of oxygen, the impact of different types of hires (e.g., sherpas or guides), or the frequency of accidents and fatalities over time.
In this analysis, the Apriori algorithm was applied to a dataset of Himalayan expeditions. The analysis generated several association rules, with key factors such as the use of oxygen, the number of hired people, and the type of mountain being prominent in the right-hand side (rhs). By adjusting parameters such as support and confidence, more specific and insightful patterns could be uncovered, shedding light on different expedition characteristics, success rates, and safety factors. This approach can be further refined to explore trends and behaviors over time, providing valuable insights into the evolution of Himalayan expeditions.
IBM. (2021a). Confidence in an association rule. Retrieved from [https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.7.0/com.ibm.im.model.doc/c_confidence_in_an_association_rule.html]
IBM. (2021b). Lift in an association rule. Retrieved from [https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.1.0/com.ibm.im.model.doc/c_lift_in_an_association_rule.html]