Association rules are "if-then" statements that describe how likely items are to occur together in large data sets, whatever kind of database holds them. This study uses them to examine survivors and non-survivors of the sinking of the Titanic.
library(arules)
library(arulesViz)
The following data is taken from Kaggle.
# Read the data, treating every column as a factor
titanic <- read.csv("tested.csv", header = TRUE, colClasses = "factor")
# Remove all NA data and make sure there are no more missing values
titanic <- na.omit(titanic)
sum(is.na(titanic))
## [1] 0
# Remove the columns I don't use
titanic <- titanic[-c(7,8,9,10,11,12)]
# Use random values to fill in the blank age column
# Note: rand_sample is not defined anywhere in this write-up
titanic$Age <- ifelse(is.na(titanic$Age), rand_sample, titanic$Age)
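The write-up does not show how rand_sample is built. As a minimal sketch, assuming the intent is to replace each missing age with a value drawn at random from the observed ages, the step could look like the following. The vector `age` and the seed are hypothetical stand-ins, and the approach needs a numeric column, whereas the data here is read with every column as a factor.

```r
# Hypothetical numeric age vector with gaps; not taken from tested.csv
age <- c(22, NA, 38, NA, 4)

set.seed(42)                 # arbitrary seed, only for reproducibility
missing  <- is.na(age)       # positions that need a value
observed <- age[!missing]    # pool of known ages to draw from

# Draw one observed age (with replacement) for every missing position
age[missing] <- sample(observed, size = sum(missing), replace = TRUE)
```

Sampling from the observed values keeps the filled-in ages within the range that actually occurs in the data, which is one possible reading of "random values".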
# Provide age categories to facilitate analysis
category <- function(age) {
  if (age <= 15) {
    return("children")
  } else if (age <= 20) {
    return("teenager")
  } else if (age <= 60) {
    return("adult")
  } else {
    return("oldster")
  }
}
# Apply the function to the age column in the dataframe
titanic$Age_group <- sapply(titanic$Age, category)
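Because category() handles one age at a time, it has to be applied element by element with sapply(). A vectorized sketch using base R's cut() yields the same four groups with the same boundaries, assuming the ages are numeric. The `ages` vector below is made up for illustration.

```r
# Hypothetical ages chosen to hit every boundary of the four groups
ages <- c(10, 15, 16, 20, 21, 60, 61)

# right = TRUE makes each interval closed on the right, matching the
# "<=" comparisons in category(): (-Inf,15], (15,20], (20,60], (60,Inf)
age_group <- cut(
  ages,
  breaks = c(-Inf, 15, 20, 60, Inf),
  labels = c("children", "teenager", "adult", "oldster"),
  right  = TRUE
)
```

cut() returns a factor, which is also the column type that apriori() expects when it is given a data frame.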
# Create the rules based on non-survivors
rules <- apriori(
  data = titanic,
  parameter = list(minlen = 2, support = 0.02, confidence = 0.1, target = "rules"),
  appearance = list(rhs = c("Survived=0"), default = "lhs"),
  control = list(verbose = FALSE)
)
# Sort the rules
rules.sorted <- sort(rules, by = "support")
# Inspect the first 5 rows of the sorted rules
inspect(rules.sorted[1:5], linebreak = FALSE)
##     lhs                            rhs          support   confidence coverage  lift     count
## [1] {Sex=male}                  => {Survived=0} 0.6363636 1.0000000  0.6363636 1.571429 266
## [2] {Age_group=adult}           => {Survived=0} 0.3660287 0.6455696  0.5669856 1.014467 153
## [3] {Sex=male, Age_group=adult} => {Survived=0} 0.3660287 1.0000000  0.3660287 1.571429 153
## [4] {Pclass=3}                  => {Survived=0} 0.3492823 0.6697248  0.5215311 1.052425 146
## [5] {Pclass=3, Sex=male}        => {Survived=0} 0.3492823 1.0000000  0.3492823 1.571429 146
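The columns of the inspect() output can be checked by hand. The sketch below recomputes support, coverage, confidence and lift for rule [2], {Age_group=adult} => {Survived=0}, from plain counts. The counts are not read from the data here; they are back-calculated from the printed output (for example, 266 / 0.6363636 gives 418 passengers in total, and 0.5669856 * 418 gives 237 adults), so treat them as assumptions derived from that table.

```r
# Counts back-calculated from the inspect() output (assumptions, see above)
n      <- 418  # all passengers: count / support of rule [1]
n_lhs  <- 237  # passengers with Age_group=adult: coverage * n
n_rhs  <- 266  # passengers with Survived=0: rule [1] has confidence 1,
               # so its lift 1.571429 equals n / n_rhs
n_both <- 153  # adults who did not survive: count column of rule [2]

support    <- n_both / n               # share of all rows matching lhs and rhs
coverage   <- n_lhs / n                # share of all rows matching the lhs
confidence <- n_both / n_lhs           # share of lhs rows that also match the rhs
lift       <- confidence / (n_rhs / n) # confidence relative to the base rate of the rhs

round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 7)
```

A lift near 1, as for this rule, means that knowing the left-hand side barely changes the probability of the right-hand side.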
# Visualizing Association Rules of non-survivors
plot(rules.sorted, jitter=0)
# Create the rules based on survivors
rules <- apriori(
  data = titanic,
  parameter = list(minlen = 2, support = 0.02, confidence = 0.1, target = "rules"),
  appearance = list(rhs = c("Survived=1"), default = "lhs"),
  control = list(verbose = FALSE)
)
# Sort the rules
rules.sorted <- sort(rules, by = "support")
# Inspect the first 5 rows of the sorted rules
inspect(rules.sorted[1:5], linebreak = FALSE)
##     lhs                              rhs          support   confidence coverage  lift      count
## [1] {Sex=female}                  => {Survived=1} 0.3636364 1.0000000  0.3636364 2.7500000 152
## [2] {Age_group=adult}             => {Survived=1} 0.2009569 0.3544304  0.5669856 0.9746835 84
## [3] {Sex=female, Age_group=adult} => {Survived=1} 0.2009569 1.0000000  0.2009569 2.7500000 84
## [4] {Pclass=3}                    => {Survived=1} 0.1722488 0.3302752  0.5215311 0.9082569 72
## [5] {Pclass=3, Sex=female}        => {Survived=1} 0.1722488 1.0000000  0.1722488 2.7500000 72
# Visualizing Association Rules of survivors
plot(rules.sorted, jitter=0)
To sum up, the Titanic data set was analyzed with the apriori algorithm in two parts: one set of rules for passengers who did not survive and one for those who did. Each rule set was sorted by support and visualized. Among the top rules shown, every rule with Sex=male on the left-hand side predicts Survived=0 with confidence 1, and every rule with Sex=female predicts Survived=1 with confidence 1, while age group and passenger class on their own give a lift close to 1. Measured by support, adult males who did not survive make up about 37% of all passengers and third-class males who did not survive about 35%, whereas adult females who survived make up about 20% and third-class females who survived about 17%.