The aim of this report is to introduce the reader to the basics of association rules. We will also use the CBA (Classification Based on Associations) algorithm to build a model that tries to predict whether a passenger survived the sinking of the Titanic.

Introduction

Association rules describe associations or correlations between itemsets. The direct motivation for their development comes from the market basket problem. We would like to arrange the products in a store so that customers can shop as comfortably as possible and the owner earns as much profit as possible. To do so, we look for implications of the form: \[ client \ buys \ A \implies client \ buys \ B \] In other words, an association rule is a conditional statement: if event A has occurred, event B is likely to occur as well. Association rules can be compared using three metrics, namely support, confidence and lift:

\[ support(A \implies B) = P(A \cap B), \quad confidence(A \implies B) = \frac{support(A \implies B)}{support(A)}, \quad lift(A \implies B) = \frac{confidence(A \implies B)}{support(B)} \]

To understand these measures better, we can state a few points: high support means that the rule applies to a large share of cases; high confidence means that the rule is correct often; and a lift well above 1 means that the rule is unlikely to be just a coincidence, because A and B occur together more often than they would if they were independent.
One of the most commonly used association rule mining algorithms is Apriori. It first identifies the frequent individual items in the data set and then extends them to larger and larger itemsets for as long as those itemsets appear sufficiently often in the database. The frequent itemsets found by Apriori can then be used to derive association rules that highlight general trends in the data. Apriori uses breadth-first search (it examines all itemsets of one size before moving on to the next size) and a hash tree structure (a tree whose nodes are hash tables) to count candidate itemsets efficiently.
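
The level-wise search described above can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions: the toy baskets and the function names `support` and `frequent_itemsets` are made up for this example, candidates are counted by a plain scan instead of a hash tree, and candidates are generated by extending each frequent itemset with a later item rather than by the full subset-pruning step of the real algorithm.

```r
# Toy transactions (hypothetical data): each element is one basket.
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter"),
  c("bread", "butter", "milk")
)

# Fraction of transactions that contain every item of `itemset`.
support <- function(itemset, transactions) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Level-wise (breadth-first) search: size-1 itemsets first, then size 2, ...
frequent_itemsets <- function(transactions, min_support) {
  items <- sort(unique(unlist(transactions)))
  candidates <- as.list(items)
  result <- list()
  while (length(candidates) > 0) {
    # Keep only the candidates that appear often enough.
    supports <- vapply(candidates, support, numeric(1),
                       transactions = transactions)
    frequent <- candidates[supports >= min_support]
    result <- c(result, frequent)
    # Extend each frequent itemset with items that sort after its last item,
    # so every larger candidate is built from a frequent smaller one.
    next_level <- list()
    for (itemset in frequent) {
      last_item <- itemset[length(itemset)]
      for (item in items[items > last_item]) {
        next_level[[length(next_level) + 1]] <- c(itemset, item)
      }
    }
    candidates <- next_level
  }
  result
}

found <- frequent_itemsets(transactions, min_support = 0.5)
```

With a minimum support of 0.5, the search keeps the three single items and the three pairs, and stops once the only triple turns out to be too rare to extend further.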

CBA algorithm and its application to the Titanic data set

In this part we will show how association rules can be used to build a classifier that predicts whether a passenger survived the sinking of the Titanic. As a first step, let us load all required libraries and the data set. We are going to use the following packages: arulesViz, arulesCBA and caret (together with arules, on which the first two build).

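# Packages used throughout the report
library(arules)      # apriori(), inspect(), is.redundant()
library(arulesViz)   # plot() for rule sets
library(arulesCBA)   # CBA()
library(caret)       # createDataPartition()

load('../Downloads/titanic.raw.rdata')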

titanic_data <- titanic.raw
summary(titanic_data)
##   Class         Sex          Age       Survived  
##  1st :325   Female: 470   Adult:2092   No :1490  
##  2nd :285   Male  :1731   Child: 109   Yes: 711  
##  3rd :706                                        
##  Crew:885

Our data contains 2201 observations of four variables. The key one is Survived, which indicates whether a person survived the sinking. We would like to use the other variables (Class, Sex and Age) to predict survival. Let us first look at the strongest association rules in our data:

rules <- apriori(
  titanic_data,
  parameter = list(supp = 0.5, conf = 0.9, target = "rules"),
  control = list(verbose = FALSE)
)
inspect(rules)
##     lhs                        rhs         support   confidence lift    
## [1] {}                      => {Age=Adult} 0.9504771 0.9504771  1.000000
## [2] {Survived=No}           => {Sex=Male}  0.6197183 0.9154362  1.163995
## [3] {Survived=No}           => {Age=Adult} 0.6533394 0.9651007  1.015386
## [4] {Sex=Male}              => {Age=Adult} 0.7573830 0.9630272  1.013204
## [5] {Sex=Male,Survived=No}  => {Age=Adult} 0.6038164 0.9743402  1.025106
## [6] {Age=Adult,Survived=No} => {Sex=Male}  0.6038164 0.9242003  1.175139
##     count
## [1] 2092 
## [2] 1364 
## [3] 1438 
## [4] 1667 
## [5] 1329 
## [6] 1329
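
As a sanity check, the metrics of rule [2], {Survived=No} => {Sex=Male}, can be reproduced by hand from the counts printed above. The sketch below uses only numbers already shown in the summary and in the rule table; the variable names are illustrative.

```r
n_total <- 2201        # all observations
n_no <- 1490           # Survived = No (from the summary)
n_male <- 1731         # Sex = Male (from the summary)
n_no_and_male <- 1364  # count column of rule [2]

# support: share of all cases in which both sides of the rule hold
rule_support <- n_no_and_male / n_total
# confidence: share of the lhs cases in which the rhs also holds
rule_confidence <- n_no_and_male / n_no
# lift: confidence relative to the overall frequency of the rhs
rule_lift <- rule_confidence / (n_male / n_total)

round(c(support = rule_support,
        confidence = rule_confidence,
        lift = rule_lift), 4)
```

Up to rounding, these agree with the values 0.6197183, 0.9154362 and 1.163995 reported for rule [2].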

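The Apriori algorithm found six rules whose support and confidence exceed the declared thresholds (we set support and confidence quite high in order to get only the strongest rules). The two rules with Sex=Male on the right-hand side (rules 2 and 6) have a noticeably higher lift, about 1.16 to 1.18, than the remaining rules, whose lift stays close to 1: those who did not survive were disproportionately male. Rule 1 has an empty left-hand side and a lift of exactly 1, so it only restates that about 95% of the people on board were adults. To develop our model further, let's lower the support and confidence thresholds and keep only rules with Survived on the right-hand side: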

rules_2 <- apriori(
  data = titanic_data,
  parameter = list(supp = 0.001, conf = 0.08),
  appearance = list(default = "lhs", rhs = c("Survived=Yes", "Survived=No")),
  control = list(verbose = FALSE)
)
rules_2
## set of 75 rules

With these settings we obtain more rules, but one has to remember that most of them are not relevant. Let's remove the redundant rules and plot only the important ones:

rules_2_sorted <- sort(rules_2, by = "confidence", decreasing = TRUE)
redundant_rules <- is.redundant(rules_2_sorted)
relevant_rules <- rules_2_sorted[!redundant_rules]
plot(relevant_rules)

One can notice that we have five rules with very high confidence and lift but low support, and roughly four rules with both high confidence and high support but very low lift. In the last step let us use the CBA algorithm to predict survival. For this purpose we will use the caret package to create a data partition, that is, to split the data into a training set and a test set. Then, using the arulesCBA package, we will define two classifiers with different bounds on support and confidence and check the classification accuracy of each.

indexes <- createDataPartition(titanic_data$Survived, p = 0.8, list = FALSE)
train <- titanic_data[indexes,]
test <- titanic_data[-indexes,]

classificator_strong <- CBA(
  Survived ~ ., data = train, supp = 0.1, conf=0.3, verbose = FALSE
  )
## [1] "Survived"
## [1] "Class"    "Sex"      "Age"      "Survived"
classificator_weak <- CBA(
  Survived ~ ., train, supp = 0.001, conf = 0.8, verbose = FALSE
  )
## [1] "Survived"
## [1] "Class"    "Sex"      "Age"      "Survived"

Having fitted both models, let’s compare their performance:

predicted_strong <- predict(classificator_strong, test)
predicted_weak <- predict(classificator_weak, test)
 
cross_tab_strong <- table(predicted = predicted_strong, true = test$Survived)
cross_tab_weak <- table(predicted = predicted_weak, true = test$Survived)

# Accuracy = correctly classified cases (the diagonal) / all cases
accuracy_strong <- sum(diag(cross_tab_strong)) / sum(cross_tab_strong)
accuracy_weak <- sum(diag(cross_tab_weak)) / sum(cross_tab_weak)

Accuracies for both estimators are:

paste(accuracy_strong, accuracy_weak)
## [1] "0.781818181818182 0.695454545454545"
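
To put these accuracies in context, it helps to compare them with a trivial classifier that always predicts Survived = No. The sketch below computes that baseline from the class counts in the summary; it assumes that the test set has roughly the same class proportions as the full data, which is plausible because createDataPartition samples within each class. The variable names are illustrative.

```r
n_total <- 2201  # all observations (from the summary)
n_no <- 1490     # Survived = No (from the summary)

# Accuracy of always predicting the majority class
baseline_accuracy <- n_no / n_total
round(baseline_accuracy, 3)  # 0.677
```

Measured against this baseline of about 68%, the first classifier (about 78%) is a clear improvement, while the second (about 70%) is only slightly better than always predicting Survived = No.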

One can notice that the model with the lower bound on support is substantially less accurate (about 70% compared with about 78%). The best result, about 78% accuracy, is obtained for the model that considers rules with support greater than 0.1 and confidence above 0.3. Let’s check which rules this classifier selected:

##     lhs                               rhs           support   confidence
## [1] {Class=3rd,Sex=Male,Age=Adult} => {Survived=No} 0.1794435 0.8494624 
## [2] {Sex=Male,Age=Adult}           => {Survived=No} 0.6007950 0.7972871 
##     lift     count
## [1] 1.254952  316 
## [2] 1.177871 1058

Two rules turned out to be the most relevant: both indicate a low probability of survival for adult males. The first one also takes the passenger class (3rd) into account.
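
A CBA classifier applies its rules in order: the first rule whose left-hand side matches a case determines the prediction, and a default class is used when no rule matches. The base R sketch below mimics this for the two rules listed above. It is an illustration only and not the arulesCBA implementation: the helper `predict_first_match` is hypothetical, and the default class "Yes" is an assumption, since the report does not show the default chosen by the fitted model.

```r
# The two rules reported above, in the order printed.
cba_rules <- list(
  list(lhs = c(Class = "3rd", Sex = "Male", Age = "Adult"), rhs = "No"),
  list(lhs = c(Sex = "Male", Age = "Adult"), rhs = "No")
)

# ASSUMPTION: the default class is not shown in the report.
default_class <- "Yes"

# Hypothetical helper: return the rhs of the first rule that matches.
predict_first_match <- function(passenger, rules, default) {
  for (rule in rules) {
    # A rule fires when every attribute on its left-hand side matches.
    if (all(passenger[names(rule$lhs)] == rule$lhs)) {
      return(rule$rhs)
    }
  }
  default
}

adult_male_1st <- c(Class = "1st", Sex = "Male", Age = "Adult")
female_child_3rd <- c(Class = "3rd", Sex = "Female", Age = "Child")

pred_male <- predict_first_match(adult_male_1st, cba_rules, default_class)
pred_child <- predict_first_match(female_child_3rd, cba_rules, default_class)
```

Here the adult male in first class is matched by the second rule and classified as "No", while the female child matches neither rule and receives the assumed default class.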

Summary

In this report we introduced the reader to the concept of association rules and showed that they can be used to build classification models, much like other machine learning approaches. On the Titanic data, the better of our two classifiers reached about 78% accuracy on the test set.