Association rules - Study of Titanic survivors and non-survivors

Data

The data set used below is taken from Kaggle.

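# Load the packages that provide apriori(), inspect() and the rules plot
library(arules)
library(arulesViz)

# Read every column as a factor
titanic <- read.csv("tested.csv", header = TRUE, colClasses = "factor")

# Drop rows containing NA and check that no missing values remain
# (blank fields in factor columns are read as empty strings, not as NA)
titanic <- na.omit(titanic)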
sum(is.na(titanic))

## [1] 0
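The zero above does not by itself mean that every age is present. As a minimal sketch, using a made-up three-row CSV rather than the Kaggle file, the following shows how a blank field is treated when a column is read as a factor compared with the default numeric reading. The names `csv_text`, `as_factor` and `as_default` are illustrative only.

```r
# Hypothetical three-row CSV; the second row has a blank Age field
csv_text <- "Sex,Age\nmale,34.5\nfemale,\nmale,62\n"

# Read as factors: the blank becomes an empty-string level, not NA
as_factor <- read.csv(text = csv_text, colClasses = "factor")
na_as_factor <- sum(is.na(as_factor$Age))

# Read with default types: Age is numeric, so the blank becomes NA
as_default <- read.csv(text = csv_text)
na_as_default <- sum(is.na(as_default$Age))

c(factor = na_as_factor, default = na_as_default)
```

This is why the blank ages still have to be handled in a later step even though `na.omit()` found nothing to remove.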

# Drop the unused columns 7 to 12
titanic <- titanic[, -(7:12)]

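# Age was read as a factor; convert it to numeric so that blank entries become NA
titanic$Age <- as.numeric(as.character(titanic$Age))

# Use random values to fill in the blank ages
# (assumption: rand_sample is not defined in the original listing; here it is
# drawn with replacement from the observed ages)
rand_sample <- sample(titanic$Age[!is.na(titanic$Age)], nrow(titanic), replace = TRUE)
titanic$Age <- ifelse(is.na(titanic$Age), rand_sample, titanic$Age)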

# Provide age categories to facilitate analysis
category <- function(age){
  if (age <= 15){
    return("children")
  } else if (age <= 20){
    return("teenager")
  } else if (age <= 60){
    return("adult")
  } else {
    return("oldster")
  }
}

# Apply the function to the age column and store the result as a factor,
# since apriori() expects categorical columns
titanic$Age_group <- factor(sapply(titanic$Age, category))
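The same grouping can also be written without an element-wise function call. The sketch below uses base R's `cut()` on a made-up vector of ages (the vector `ages` is hypothetical and chosen only to hit each boundary); it is an alternative formulation, not the code used to produce the results in this report.

```r
# Hypothetical ages chosen to hit every boundary of the categories above
ages <- c(2, 15, 16, 20, 21, 60, 61)

# right = TRUE (the default) closes each interval on the right,
# which matches the `<=` comparisons in category()
age_group <- cut(
  ages,
  breaks = c(-Inf, 15, 20, 60, Inf),
  labels = c("children", "teenager", "adult", "oldster")
)

table(age_group)
```

`cut()` returns a factor directly, so no extra conversion is needed before mining rules.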

Rules

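Apriori is a widely used data-mining algorithm. It finds frequent itemsets, that is, combinations of items that occur together often, and derives association rules from them. Each rule is described by its support, confidence and lift.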
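To make the three rule measures concrete, here is a minimal sketch in base R that computes them by hand for a single rule, {Sex=male} => {Survived=0}. The six-row data frame `toy` is invented for illustration and is not the Titanic file; the calculation follows the standard definitions of support, confidence, coverage and lift.

```r
# Toy data (hypothetical, not the Titanic file): six passengers
toy <- data.frame(
  Sex      = c("male", "male", "male", "female", "female", "male"),
  Survived = c("0",    "0",    "1",    "1",      "1",      "0")
)

lhs <- toy$Sex == "male"       # left-hand side: {Sex=male}
rhs <- toy$Survived == "0"     # right-hand side: {Survived=0}

support    <- mean(lhs & rhs)          # share of all rows matching both sides
coverage   <- mean(lhs)                # share of all rows matching the lhs
confidence <- support / coverage       # share of lhs rows that also match the rhs
lift       <- confidence / mean(rhs)   # confidence relative to the base rate of the rhs

round(c(support = support, confidence = confidence,
        coverage = coverage, lift = lift), 3)
```

A lift above 1 means the left-hand side makes the right-hand side more likely than its base rate; a lift near 1 means the two are close to independent.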

# Create the rules with "Survived=0" (non-survivors) on the right-hand side
rules <- apriori(
  data = titanic,
  parameter = list(minlen = 2, support = 0.02, confidence = 0.1, target = "rules"),
  appearance = list(rhs = c("Survived=0"), default = "lhs"),
  control = list(verbose = FALSE)
)

# Sort the rules by support, highest first
rules.sorted <- sort(rules, by = "support")

# Inspect the top 5 sorted rules
inspect(rules.sorted[1:5], linebreak = FALSE)

##     lhs                            rhs          support   confidence coverage 
## [1] {Sex=male}                  => {Survived=0} 0.6363636 1.0000000  0.6363636
## [2] {Age_group=adult}           => {Survived=0} 0.3660287 0.6455696  0.5669856
## [3] {Sex=male, Age_group=adult} => {Survived=0} 0.3660287 1.0000000  0.3660287
## [4] {Pclass=3}                  => {Survived=0} 0.3492823 0.6697248  0.5215311
## [5] {Pclass=3, Sex=male}        => {Survived=0} 0.3492823 1.0000000  0.3492823
##     lift     count
## [1] 1.571429 266  
## [2] 1.014467 153  
## [3] 1.571429 153  
## [4] 1.052425 146  
## [5] 1.571429 146

# Visualizing Association Rules of non-survivors
plot(rules.sorted, jitter=0)

Support is measured against all passengers, so the first rule says that 63.6% of the passengers in this file were men who did not survive. Its confidence of 1 means that every man in the data is recorded as a non-survivor. Likewise, 36.6% of all passengers were adult men who did not survive, and 34.9% were third-class men who did not survive. In the plot, the redder a point is, the higher the lift of the rule. The solid red point lies at a support above 0.6 and a confidence of about 1. This corresponds to the first line of the inspect output, where the support is 0.64, the confidence is 1, and the lift is 1.57.
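The figures in the first row of the non-survivor output can be cross-checked with a little arithmetic. The sketch below copies the four values from that row and recovers the total number of passengers and the overall share of non-survivors; the variable names are illustrative only.

```r
# Values copied from the first row of the inspect() output for non-survivors
count      <- 266         # passengers matching both sides of the rule
support    <- 0.6363636
confidence <- 1.0000000
lift       <- 1.571429

# support = count / n, so the number of passengers follows directly
n_total <- round(count / support)

# lift = confidence / support(rhs), so the share of non-survivors is
rhs_support <- confidence / lift

c(n_total = n_total, rhs_support = round(rhs_support, 4))
```

Because the implied share of non-survivors equals the support of the rule itself, the non-survivors in this file are exactly the men.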

# Create the rules with "Survived=1" (survivors) on the right-hand side
rules <- apriori(
  data = titanic,
  parameter = list(minlen = 2, support = 0.02, confidence = 0.1, target = "rules"),
  appearance = list(rhs = c("Survived=1"), default = "lhs"),
  control = list(verbose = FALSE)
)

# Sort the rules by support, highest first
rules.sorted <- sort(rules, by = "support")

# Inspect the top 5 sorted rules
inspect(rules.sorted[1:5], linebreak = FALSE)

##     lhs                              rhs          support   confidence
## [1] {Sex=female}                  => {Survived=1} 0.3636364 1.0000000 
## [2] {Age_group=adult}             => {Survived=1} 0.2009569 0.3544304 
## [3] {Sex=female, Age_group=adult} => {Survived=1} 0.2009569 1.0000000 
## [4] {Pclass=3}                    => {Survived=1} 0.1722488 0.3302752 
## [5] {Pclass=3, Sex=female}        => {Survived=1} 0.1722488 1.0000000 
##     coverage  lift      count
## [1] 0.3636364 2.7500000 152  
## [2] 0.5669856 0.9746835  84  
## [3] 0.2009569 2.7500000  84  
## [4] 0.5215311 0.9082569  72  
## [5] 0.1722488 2.7500000  72

# Visualizing Association Rules of survivors
plot(rules.sorted, jitter=0)

Here the first rule says that 36.4% of all passengers were women who survived, and its confidence of 1 means that every woman in the data is recorded as a survivor. In addition, 20.1% of all passengers were adult women who survived, and 17.2% were third-class women who survived. As in the previous plot, the redder a point is, the higher the lift. The solid red point lies at a support above 0.3 and a confidence of 1. This is consistent with the first line of the inspect output, where the support is 0.36, the confidence is 1, and the lift is 2.75.

Conclusion

To sum up, the Titanic data set was analysed with the apriori algorithm in two parts, one for passengers who did not survive and one for those who did. In each part the rules were sorted by support and visualized. In this file sex determines the outcome completely: the rules {Sex=male} => {Survived=0} and {Sex=female} => {Survived=1} both have a confidence of 1. Among the rules shown, age group and passenger class add little on their own, since their lift stays close to 1 (between 0.91 and 1.05). In terms of support, third-class men who did not survive make up about 35% of all passengers and adult men who did not survive about 37%, whereas adult women who survived make up about 20% and third-class women who survived about 17%.

Amabel Nabila
