The Titanic disaster remains one of the most analyzed historical events, both because of its tragic loss of life and because of the notable patterns among survivors (and probably also because the dataset is so popular and easy to obtain). This project uses association rules, mined with the Apriori algorithm, to uncover relationships between passenger attributes and survival outcomes, looking at patterns in demographics, ticket class, family structure, and embarkation points. It is based on the Titanic dataset from Kaggle (https://www.kaggle.com/c/titanic/data?select=train.csv), transformed into a transaction-based format for association rule mining.
str(titanic_data)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(titanic_data)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
To prepare the dataset for analysis I performed several transformations. First, I replaced the NA values in the “Age” variable with the median to reduce their impact, then converted “Age” into a discrete variable “AgeGroup” with levels “Child” (0-12 years), “Teenager” (13-18 years), “Adult” (19-60 years) and “Senior” (over 60 years). For “SibSp” (the number of siblings or spouses a passenger had on board) and “Parch” (the same for parents or children) I replaced the counts with “yes” if the passenger had any and “no” if the value was 0, for easier interpretation. Missing values in “Embarked” were replaced with the mode, and for readability the values 0 and 1 in “Survived” were relabeled “No” and “Yes”. Only the seven variables shown in the summary below were kept.
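The original preprocessing code is not shown, so the following is only a minimal sketch of how such transformations could be written in base R. The toy data frame `toy` and its values are made up for illustration, and the exact bin edges (for example, where an age of 18.5 falls) are an assumption.

```r
# Toy stand-in for the relevant train.csv columns (values are invented)
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1),
  Age      = c(22, NA, 8, 65, 15),
  SibSp    = c(1, 0, 3, 0, 0),
  Parch    = c(0, 0, 1, 0, 2),
  Embarked = c("S", "", "C", "Q", "S"),
  stringsAsFactors = FALSE
)

# Impute missing ages with the median, then bin into four age groups
# (assumed edges: (0,12], (12,18], (18,60], (60,Inf])
toy$Age[is.na(toy$Age)] <- median(toy$Age, na.rm = TRUE)
toy$AgeGroup <- cut(toy$Age,
                    breaks = c(0, 12, 18, 60, Inf),
                    labels = c("Child", "Teenager", "Adult", "Senior"))

# Collapse family counts into yes/no indicators
toy$SibSp <- factor(ifelse(toy$SibSp > 0, "yes", "no"))
toy$Parch <- factor(ifelse(toy$Parch > 0, "yes", "no"))

# Replace blank ports with the most frequent port, then spell out the names
known <- toy$Embarked[toy$Embarked != ""]
toy$Embarked[toy$Embarked == ""] <- names(which.max(table(known)))
toy$Embarked <- factor(toy$Embarked, levels = c("C", "Q", "S"),
                       labels = c("Cherbourg", "Queenstown", "Southampton"))

# Recode the outcome: 0 -> "No", 1 -> "Yes"
toy$Survived <- factor(toy$Survived, levels = c(0, 1), labels = c("No", "Yes"))
toy$Age <- NULL  # keep only the categorical columns
```

Every remaining column is a factor, which is the form needed before a data frame can be coerced to transactions.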
summary(titanic_data)
## Survived Pclass Sex SibSp Parch Embarked
## No :549 1:216 female:314 no :608 no :678 Cherbourg :168
## Yes:342 2:184 male :577 yes:283 yes:213 Queenstown : 77
## 3:491 Southampton:646
##
## AgeGroup
## Child : 69
## Teenager: 70
## Adult :730
## Senior : 22
titanic_transactions <- as(titanic_data, "transactions")
I chose the Apriori algorithm because the Titanic dataset consists mostly of categorical attributes, the resulting rules are easy to visualize and interpret, and the method does not require a pre-defined dependent variable, which makes it possible to find many different kinds of rules.
After some experimenting with the parameters I settled on support = 0.3 and confidence = 0.8 for a general exploration of the rules, which yields 119 rules in total.
rules <- apriori(titanic_transactions,
parameter = list(support = 0.3, confidence = 0.8, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.3 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 267
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[18 item(s), 891 transaction(s)] done [0.00s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [119 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(sort(rules, by="lift")[1:20])
## lhs rhs support confidence coverage lift count
## [1] {Survived=No,
## SibSp=no,
## Parch=no,
## Embarked=Southampton} => {Sex=male} 0.3063973 0.9349315 0.3277217 1.443716 273
## [2] {Survived=No,
## SibSp=no,
## Parch=no,
## AgeGroup=Adult} => {Sex=male} 0.3524130 0.9317507 0.3782267 1.438804 314
## [3] {Survived=No,
## SibSp=no,
## Parch=no} => {Sex=male} 0.3894501 0.9278075 0.4197531 1.432715 347
## [4] {Survived=No,
## SibSp=no,
## Embarked=Southampton} => {Sex=male} 0.3153760 0.9213115 0.3423120 1.422684 281
## [5] {Survived=No,
## Parch=no,
## Embarked=Southampton,
## AgeGroup=Adult} => {Sex=male} 0.3198653 0.9163987 0.3490460 1.415097 285
## [6] {Survived=No,
## SibSp=no,
## AgeGroup=Adult} => {Sex=male} 0.3670034 0.9159664 0.4006734 1.414430 327
## [7] {Survived=No,
## Parch=no,
## AgeGroup=Adult} => {Sex=male} 0.4118967 0.9152120 0.4500561 1.413265 367
## [8] {Survived=No,
## Parch=no,
## Embarked=Southampton} => {Sex=male} 0.3546577 0.9132948 0.3883277 1.410304 316
## [9] {Pclass=3,
## Sex=male} => {Survived=No} 0.3367003 0.8645533 0.3894501 1.403128 300
## [10] {Survived=No,
## Parch=no} => {Sex=male} 0.4534231 0.9078652 0.4994388 1.401920 404
## [11] {Survived=No,
## SibSp=no} => {Sex=male} 0.4051627 0.9070352 0.4466891 1.400638 361
## [12] {Sex=male,
## Embarked=Southampton,
## AgeGroup=Adult} => {Survived=No} 0.3512907 0.8505435 0.4130191 1.380390 313
## [13] {Sex=male,
## SibSp=no,
## Parch=no,
## Embarked=Southampton} => {Survived=No} 0.3063973 0.8504673 0.3602694 1.380267 273
## [14] {Sex=male,
## Parch=no,
## Embarked=Southampton} => {Survived=No} 0.3546577 0.8471850 0.4186308 1.374940 316
## [15] {Sex=male,
## SibSp=no,
## Parch=no} => {Survived=No} 0.3894501 0.8442822 0.4612795 1.370229 347
## [16] {Sex=male,
## SibSp=no,
## Embarked=Southampton} => {Survived=No} 0.3153760 0.8438438 0.3737374 1.369517 281
## [17] {Sex=male,
## Parch=no,
## Embarked=Southampton,
## AgeGroup=Adult} => {Survived=No} 0.3198653 0.8431953 0.3793490 1.368464 285
## [18] {Sex=male,
## SibSp=no,
## AgeGroup=Adult} => {Survived=No} 0.3670034 0.8406170 0.4365881 1.364280 327
## [19] {Sex=male,
## SibSp=no,
## Parch=no,
## AgeGroup=Adult} => {Survived=No} 0.3524130 0.8395722 0.4197531 1.362584 314
## [20] {Sex=male,
## Parch=no} => {Survived=No} 0.4534231 0.8347107 0.5432099 1.354694 404
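To make the quality columns concrete, the values for rule [9], {Pclass=3, Sex=male} => {Survived=No}, can be recomputed by hand. This is a small sketch using counts read from the outputs above: 300 matching passengers, 347 passengers on the left-hand side (coverage times 891), and 549 passengers with Survived=No. The variable names are mine.

```r
n      <- 891  # all transactions (passengers)
n_lhs  <- 347  # Pclass=3 and Sex=male
n_rhs  <- 549  # Survived=No
n_both <- 300  # lhs and rhs together (the count column)

support    <- n_both / n                 # share of all passengers matching the whole rule
confidence <- n_both / n_lhs             # share of lhs passengers who also match the rhs
coverage   <- n_lhs / n                  # share of all passengers matching the lhs
lift       <- confidence / (n_rhs / n)   # confidence relative to the baseline rhs rate

round(c(support = support, confidence = confidence,
        coverage = coverage, lift = lift), 4)
```

A lift above 1 means the right-hand side occurs more often among passengers matching the left-hand side than among passengers overall.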
As the “Sex” variable doesn’t make much sense as the right-hand side (rhs) of a rule, I adjusted the code to exclude it.
rules_filtered <- subset(rules, !(rhs %pin% "Sex=male" | rhs %pin% "Sex=female"))
length(rules_filtered)
## [1] 101
inspect(sort(rules_filtered, by="lift")[1:20])
## lhs rhs support confidence coverage lift count
## [1] {Pclass=3,
## Sex=male} => {Survived=No} 0.3367003 0.8645533 0.3894501 1.403128 300
## [2] {Sex=male,
## Embarked=Southampton,
## AgeGroup=Adult} => {Survived=No} 0.3512907 0.8505435 0.4130191 1.380390 313
## [3] {Sex=male,
## SibSp=no,
## Parch=no,
## Embarked=Southampton} => {Survived=No} 0.3063973 0.8504673 0.3602694 1.380267 273
## [4] {Sex=male,
## Parch=no,
## Embarked=Southampton} => {Survived=No} 0.3546577 0.8471850 0.4186308 1.374940 316
## [5] {Sex=male,
## SibSp=no,
## Parch=no} => {Survived=No} 0.3894501 0.8442822 0.4612795 1.370229 347
## [6] {Sex=male,
## SibSp=no,
## Embarked=Southampton} => {Survived=No} 0.3153760 0.8438438 0.3737374 1.369517 281
## [7] {Sex=male,
## Parch=no,
## Embarked=Southampton,
## AgeGroup=Adult} => {Survived=No} 0.3198653 0.8431953 0.3793490 1.368464 285
## [8] {Sex=male,
## SibSp=no,
## AgeGroup=Adult} => {Survived=No} 0.3670034 0.8406170 0.4365881 1.364280 327
## [9] {Sex=male,
## SibSp=no,
## Parch=no,
## AgeGroup=Adult} => {Survived=No} 0.3524130 0.8395722 0.4197531 1.362584 314
## [10] {Sex=male,
## Parch=no} => {Survived=No} 0.4534231 0.8347107 0.5432099 1.354694 404
## [11] {Sex=male,
## SibSp=no} => {Survived=No} 0.4051627 0.8317972 0.4870932 1.349966 361
## [12] {Sex=male,
## Parch=no,
## AgeGroup=Adult} => {Survived=No} 0.4118967 0.8303167 0.4960718 1.347563 367
## [13] {Sex=male,
## AgeGroup=Adult} => {Survived=No} 0.4534231 0.8295688 0.5465769 1.346349 404
## [14] {Sex=male,
## Embarked=Southampton} => {Survived=No} 0.4085297 0.8253968 0.4949495 1.339578 364
## [15] {Sex=male} => {Survived=No} 0.5252525 0.8110919 0.6475870 1.316362 468
## [16] {Pclass=3,
## Embarked=Southampton} => {Survived=No} 0.3209877 0.8101983 0.3961841 1.314912 286
## [17] {Sex=male,
## SibSp=no,
## Embarked=Southampton,
## AgeGroup=Adult} => {Parch=no} 0.3243547 0.9730640 0.3333333 1.278761 289
## [18] {Survived=No,
## Sex=male,
## SibSp=no,
## Embarked=Southampton} => {Parch=no} 0.3063973 0.9715302 0.3153760 1.276746 273
## [19] {Sex=male,
## SibSp=no,
## Embarked=Southampton} => {Parch=no} 0.3602694 0.9639640 0.3737374 1.266802 321
## [20] {Survived=No,
## Sex=male,
## Parch=no,
## Embarked=Southampton} => {SibSp=no} 0.3063973 0.8639241 0.3546577 1.266047 273
This leaves 101 rules. They still need to be filtered by lift, since in this study I don’t want to focus on rules whose two sides are close to uncorrelated.
hist(quality(rules)$lift,
  breaks = 30,
  col = "navy",
  main = "Lift distribution",
  xlab = "Lift",
  ylab = "Number of rules"
)
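There are few negatively correlated rules (lift below 1), and those are all close to 1, so I decided to keep only rules with a lift of at least 1.2. Note that the filter below is applied to the full `rules` set rather than to `rules_filtered`, so rules with “Sex” on the right-hand side are included again.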
rules_uncorr <- subset(rules, lift >= 1.2)
hist(quality(rules_uncorr)$lift,
  breaks = 30,
  col = "navy",
  main = "Lift distribution",
  xlab = "Lift",
  ylab = "Number of rules"
)
length(rules_uncorr)
## [1] 62
In the end this gives 62 rules, a much more manageable number to visualize.
summary(rules_uncorr)
## set of 62 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 2 18 29 13
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.855 4.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.3064 Min. :0.8029 Min. :0.3154 Min. :1.202
## 1st Qu.:0.3277 1st Qu.:0.8434 1st Qu.:0.3799 1st Qu.:1.246
## Median :0.3620 Median :0.8598 Median :0.4153 Median :1.278
## Mean :0.3758 Mean :0.8814 Mean :0.4280 Mean :1.309
## 3rd Qu.:0.4085 3rd Qu.:0.9266 3rd Qu.:0.4593 3rd Qu.:1.369
## Max. :0.5421 Max. :0.9731 Max. :0.6476 Max. :1.444
## count
## Min. :273.0
## 1st Qu.:292.0
## Median :322.5
## Mean :334.9
## 3rd Qu.:364.0
## Max. :483.0
##
## mining info:
## data ntransactions support confidence
## titanic_transactions 891 0.3 0.8
## call
## apriori(data = titanic_transactions, parameter = list(support = 0.3, confidence = 0.8, minlen = 2))
plot(rules_uncorr,
method = "graph",
measure = "support",
colors = c("lightblue", "navy")
)
The chart suggests that men who embarked at Southampton and traveled alone (SibSp=no, Parch=no) in 3rd class were the least likely to survive.
plot(rules_uncorr, method="paracoord", control=list(reorder=TRUE))
This chart leads to a very similar conclusion: men traveling alone from Southampton in 3rd class had poor survival chances.
Why Southampton?
Most passengers boarded there (646 of 891, roughly 72.5%), so this is not a surprising result.
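The role of Southampton can be checked directly from the port counts in the summary of the prepared data. The short sketch below (variable names are mine) computes each port's share and compares it with the minimum support of 0.3 used in the general exploration: an item that is rarer than the support threshold cannot appear in any rule.

```r
# Passenger counts per port, taken from the summary output
embarked <- c(Cherbourg = 168, Queenstown = 77, Southampton = 646)

share <- embarked / sum(embarked)
round(share, 3)

# Only items at or above the support threshold can occur in the mined rules
share >= 0.3
```

With support = 0.3, Southampton is therefore the only port that can show up in the rules at all, regardless of how survival differed between ports.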
rules.Survived <- apriori(titanic_transactions,
  parameter = list(support = 0.08, confidence = 0.6, minlen = 2),
  appearance = list(default = "lhs", rhs = "Survived=Yes"),
  control = list(verbose = FALSE))
rules.Survived.bylift <- sort(rules.Survived, by = "lift", decreasing = TRUE)
inspect(head(rules.Survived.bylift))
## lhs rhs support confidence coverage lift count
## [1] {Pclass=1,
## Sex=female,
## AgeGroup=Adult} => {Survived=Yes} 0.08866442 0.9753086 0.09090909 2.540936 79
## [2] {Pclass=1,
## Sex=female} => {Survived=Yes} 0.10213244 0.9680851 0.10549944 2.522116 91
## [3] {Sex=female,
## Parch=no,
## AgeGroup=Adult} => {Survived=Yes} 0.14927048 0.7964072 0.18742985 2.074850 133
## [4] {Sex=female,
## SibSp=no,
## Parch=no,
## AgeGroup=Adult} => {Survived=Yes} 0.09652076 0.7889908 0.12233446 2.055529 86
## [5] {Sex=female,
## Parch=no} => {Survived=Yes} 0.17171717 0.7886598 0.21773288 2.054666 153
## [6] {Sex=female,
## SibSp=no} => {Survived=Yes} 0.15375982 0.7873563 0.19528620 2.051270 137
Adult women travelling in 1st class had the best chances of survival (confidence of about 98%). Because support is measured against all 891 passengers, about 15% of all passengers were surviving adult women with no parents or children on board, about 15% were surviving women with no spouses or siblings on board, and about 10% were surviving adult women travelling alone (without siblings, spouses, parents or children).
Children and teenagers were not strongly represented on board (around 16% of all passengers), so I had to lower the thresholds considerably.
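Support is always a share of all transactions, not of survivors. To express the same groups as a share of the 342 survivors, the rule counts can be divided by the number of survivors instead. This is a small sketch using the count column of rules [3], [6] and [4] above; the labels and variable names are mine.

```r
n_passengers <- 891
n_survivors  <- 342  # Survived=Yes in the summary

# Count column of the corresponding rules
counts <- c("female, Parch=no, Adult"           = 133,
            "female, SibSp=no"                  = 137,
            "female, SibSp=no, Parch=no, Adult" = 86)

data.frame(share_of_passengers = round(counts / n_passengers, 3),
           share_of_survivors  = round(counts / n_survivors, 3))
```

The first column reproduces the support values, while the second shows how large each group is among the survivors only.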
## lhs rhs support confidence coverage lift count
## [1] {Pclass=2,
## AgeGroup=Child} => {Survived=Yes} 0.01907969 1.0000000 0.01907969 2.605263 17
## [2] {Pclass=2,
## Parch=yes,
## AgeGroup=Child} => {Survived=Yes} 0.01907969 1.0000000 0.01907969 2.605263 17
## [3] {Pclass=2,
## Embarked=Southampton,
## AgeGroup=Child} => {Survived=Yes} 0.01683502 1.0000000 0.01683502 2.605263 15
## [4] {Pclass=2,
## Parch=yes,
## Embarked=Southampton,
## AgeGroup=Child} => {Survived=Yes} 0.01683502 1.0000000 0.01683502 2.605263 15
## [5] {SibSp=no,
## AgeGroup=Child} => {Survived=Yes} 0.01571268 0.8235294 0.01907969 2.145511 14
## [6] {Sex=female,
## SibSp=no,
## AgeGroup=Teenager} => {Survived=Yes} 0.02020202 0.7826087 0.02581369 2.038902 18
## [7] {Sex=female,
## AgeGroup=Teenager} => {Survived=Yes} 0.03030303 0.7500000 0.04040404 1.953947 27
## [8] {Sex=female,
## Embarked=Southampton,
## AgeGroup=Teenager} => {Survived=Yes} 0.01571268 0.7368421 0.02132435 1.919668 14
Young children had better chances than teenagers, which is not surprising. There is also a visible association with 2nd class, but probably because not many children traveled in 1st class. Among teenagers, girls had better chances of survival, similarly to the general analysis. Teenage girls who survived make up around 3% of all passengers, which is about 39% of all teenagers.
rules.FamilySurvived <- apriori(titanic_transactions,
parameter = list(support = 0.02, confidence = 0.4, minlen = 2),
appearance = list(lhs=c("SibSp=yes","SibSp=no", "Parch=yes", "Parch=no"), rhs="Survived=Yes"),
control = list(verbose = FALSE))
rules.FamilySurvived.bylift <- sort(rules.FamilySurvived, by="lift", decreasing=TRUE)
inspect(rules.FamilySurvived.bylift)
## lhs rhs support confidence coverage lift count
## [1] {SibSp=no, Parch=yes} => {Survived=Yes} 0.05274972 0.6619718 0.07968575 1.724611 47
## [2] {Parch=yes} => {Survived=Yes} 0.12233446 0.5117371 0.23905724 1.333210 109
## [3] {SibSp=yes, Parch=no} => {Survived=Yes} 0.07856341 0.4964539 0.15824916 1.293393 70
## [4] {SibSp=yes} => {Survived=Yes} 0.14814815 0.4664311 0.31762065 1.215176 132
## [5] {SibSp=yes, Parch=yes} => {Survived=Yes} 0.06958474 0.4366197 0.15937149 1.137509 62
From this we can see that people with no siblings or spouse but with a parent or child on board had better chances of survival (confidence of about 66%, against an overall survival rate of about 38%). Around 12% of all passengers survived and had a parent or child on board, and around 15% survived and had a sibling or spouse on board. This may make some sense in terms of motivation to rescue someone, but neither confidence nor support is high enough to treat these as valuable insights.
rules.1stClassDied <- apriori(titanic_transactions,
parameter = list(support = 0.01, confidence = 0.6, minlen = 2),
appearance = list(default="lhs", rhs="Survived=No"),
control = list(verbose = FALSE))
rules.1stClassDied <- subset(rules.1stClassDied, lhs %pin% "Pclass=1")
rules.1stClassDied.bylift <- sort(rules.1stClassDied, by="lift", decreasing=TRUE)
inspect(head(rules.1stClassDied.bylift))
## lhs rhs support confidence coverage lift count
## [1] {Pclass=1,
## Sex=male,
## AgeGroup=Senior} => {Survived=No} 0.01234568 0.9166667 0.01346801 1.487705 11
## [2] {Pclass=1,
## Sex=male,
## SibSp=no,
## AgeGroup=Senior} => {Survived=No} 0.01010101 0.9000000 0.01122334 1.460656 9
## [3] {Pclass=1,
## SibSp=no,
## AgeGroup=Senior} => {Survived=No} 0.01010101 0.8181818 0.01234568 1.327869 9
## [4] {Pclass=1,
## AgeGroup=Senior} => {Survived=No} 0.01234568 0.7857143 0.01571268 1.275176 11
## [5] {Pclass=1,
## Sex=male,
## SibSp=no,
## Parch=yes} => {Survived=No} 0.01010101 0.6923077 0.01459035 1.123581 9
## [6] {Pclass=1,
## Sex=male,
## Parch=yes,
## AgeGroup=Adult} => {Survived=No} 0.01234568 0.6875000 0.01795735 1.115779 11
Again the results are as expected. Senior men in first class, with or without a sibling or spouse on board, had the worst chances of survival. However, support is very low here, since not that many people traveled in first class and, more importantly, there were few seniors on the Titanic (22 of 891 passengers).
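How fragile a rule resting on about a dozen passengers is can be illustrated with an exact binomial confidence interval for its confidence value. This is only a sketch and not part of the original analysis: it takes rule [1] above, where 11 of the 12 matching passengers died (count 11, coverage times 891 gives 12), and uses `binom.test` from base R's stats package.

```r
# Rule [1]: {Pclass=1, Sex=male, AgeGroup=Senior} => {Survived=No}
died    <- 11  # count column
matched <- 12  # passengers matching the lhs (coverage * 891)

# Exact (Clopper-Pearson) 95% interval for the proportion who died
ci <- binom.test(x = died, n = matched)$conf.int
round(ci, 3)
```

The wide interval shows that the point estimate of about 92% should be read with caution for such a small group.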
The application of association rules effectively uncovered key survival patterns among Titanic passengers. The results confirmed well-known trends, such as the higher survival rates of women and children, particularly in first and second class, while also revealing the impact (or maybe rather correlation) of embarkation city and family presence. Notably, passengers from Cherbourg had a better survival rate, likely due to a higher proportion of first-class travelers, whereas men traveling alone in third class from Southampton had the lowest chances of survival.
While the Apriori method proved valuable in identifying meaningful relationships, further improvements could enhance the analysis. Incorporating additional features such as ticket pricing or cabin location could provide further insights. Future work could also compare association rule mining with predictive models, which would offer a broader perspective on survival probabilities and more room for experimentation.
Overall, even though I’m aware that this is one of the most basic choices of dataset, I’m not disappointed by the results of the analysis. I find some of the insights about ports of departure and travel companions interesting.