1. Introduction

This project analyzes the Titanic dataset to discover patterns related to passenger survival. We use the Apriori algorithm to generate association rules and identify which passenger attributes (class, sex, port of embarkation) were most strongly associated with survival.

2. Data Preparation

We load the dataset directly from an external repository and perform necessary cleaning steps:

  1. Converting numerical variables to categorical factors.

  2. Removing missing values (NA).

  3. Transforming the dataframe into a transaction matrix.

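# Load required packages
library(dplyr)         # %>% and select()
library(arules)        # transactions class, apriori(), inspect()
library(arulesViz)     # plot() methods for rules
library(RColorBrewer)  # brewer.pal()

# Load data directly from URL
url <- "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df <- read.csv(url)

# Select relevant columns
df_clean <- df %>% select(Survived, Pclass, Sex, Embarked)

# Convert numeric codes and strings to factors (categories)
df_clean$Survived <- factor(df_clean$Survived, labels = c("Died", "Survived"))
df_clean$Pclass <- factor(df_clean$Pclass)
df_clean$Sex <- factor(df_clean$Sex)
df_clean$Embarked <- factor(df_clean$Embarked)

# Remove rows with missing values (NA).
# Note: read.csv() reads blank text fields as "" rather than NA by default,
# so na.omit() does not drop rows whose Embarked value is blank.
df_clean <- na.omit(df_clean)

# Convert to transactions format
trans <- as(df_clean, "transactions")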
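One caveat about the missing-value step above: na.omit() only drops rows that contain NA, while read.csv() by default keeps blank text fields as empty strings. The sketch below uses a small made-up data frame (`toy`, not the Titanic data) to show the difference and one possible way to handle it. Whether blanks should be dropped at all is a judgment call for the analysis.

```r
# Hypothetical toy data standing in for df_clean (values are made up)
toy <- data.frame(
  Sex      = c("male", "female", "female", "male"),
  Embarked = c("S", "", "C", "Q"),
  stringsAsFactors = FALSE
)

# A blank string is not NA, so na.omit() keeps all four rows
n_before <- nrow(na.omit(toy))

# Recode blanks as NA, then drop incomplete rows
toy$Embarked[toy$Embarked == ""] <- NA
n_after <- nrow(na.omit(toy))

print(c(before = n_before, after = n_after))
```

On the real data, a similar effect could be obtained by passing na.strings = c("", "NA") to read.csv(). Doing so would change the number of transactions, and therefore the supports and counts reported in the Results section.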

3. Data Exploration

Before generating rules, we visualize the most frequent items in the dataset. This helps us understand the composition of the passengers.

itemFrequencyPlot(trans,
                  topN=10,
                  type="absolute",
                  col=brewer.pal(8, "Pastel2"),
                  main="Top 10 Most Frequent Passenger Traits")

4. Modeling (Apriori Algorithm)

We apply the Apriori algorithm with a minimum support of 0.5% and a minimum confidence of 10%. We keep only rules whose right-hand side (RHS) is Survived=Survived, and minlen=2 requires at least one item on the left-hand side (LHS).

# Run Algorithm
rules<-apriori(trans, 
                 parameter=list(supp = 0.005,conf=0.1, minlen=2),
                 appearance=list(default="lhs",rhs="Survived=Survived"),
                 control=list(verbose=FALSE))

# Sort by Lift to find the strongest patterns
rules_sorted<-sort(rules,by="lift",decreasing=TRUE)
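
To make the support threshold concrete, the following base R sketch converts it into a passenger count. The total of 891 transactions is not stated in the text; it is inferred from the Results section (count divided by support), so treat it as an assumption.

```r
n_transactions <- 891    # assumption: inferred from count / support in the Results section
min_support    <- 0.005  # supp value passed to apriori()

# Smallest number of passengers a rule must cover to reach the minimum support
min_count <- ceiling(min_support * n_transactions)
print(min_count)
```

Under that assumption a rule needs to cover at least 5 passengers, so rules resting on very small groups can still pass the filter.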

5. Results

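The table below displays the top 10 rules sorted by lift. Lift is a rule's confidence divided by the overall survival rate, so a lift greater than 1.0 means that passengers matching the left-hand side survived more often than passengers overall.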

inspect(head(rules_sorted,10))
##      lhs              rhs                     support confidence    coverage     lift count
## [1]  {Pclass=2,                                                                            
##       Sex=female,                                                                          
##       Embarked=C}  => {Survived=Survived} 0.007856341  1.0000000 0.007856341 2.605263     7
## [2]  {Pclass=1,                                                                            
##       Sex=female,                                                                          
##       Embarked=C}  => {Survived=Survived} 0.047138047  0.9767442 0.048260382 2.544676    42
## [3]  {Pclass=1,                                                                            
##       Sex=female}  => {Survived=Survived} 0.102132435  0.9680851 0.105499439 2.522116    91
## [4]  {Pclass=1,                                                                            
##       Sex=female,                                                                          
##       Embarked=S}  => {Survived=Survived} 0.051627385  0.9583333 0.053872054 2.496711    46
## [5]  {Pclass=2,                                                                            
##       Sex=female}  => {Survived=Survived} 0.078563412  0.9210526 0.085297419 2.399584    70
## [6]  {Pclass=2,                                                                            
##       Sex=female,                                                                          
##       Embarked=S}  => {Survived=Survived} 0.068462402  0.9104478 0.075196409 2.371956    61
## [7]  {Sex=female,                                                                          
##       Embarked=C}  => {Survived=Survived} 0.071829405  0.8767123 0.081930415 2.284066    64
## [8]  {Sex=female,                                                                          
##       Embarked=Q}  => {Survived=Survived} 0.030303030  0.7500000 0.040404040 1.953947    27
## [9]  {Sex=female}  => {Survived=Survived} 0.261503928  0.7420382 0.352413019 1.933205   233
## [10] {Pclass=3,                                                                            
##       Sex=female,                                                                          
##       Embarked=Q}  => {Survived=Survived} 0.026936027  0.7272727 0.037037037 1.894737    24
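
As a check on how to read the columns, the metrics for rule [9] ({Sex=female} => {Survived=Survived}) can be reproduced by hand. The count of 233 comes straight from the table. The other three totals are not printed anywhere; they are back-calculated from the support, coverage, and lift columns, so they are assumptions of this sketch.

```r
n_total           <- 891  # assumption: count / support = 233 / 0.261503928
n_female          <- 314  # assumption: coverage * n_total
n_survived        <- 342  # assumption: implied by confidence / lift
n_female_survived <- 233  # count column of rule [9]

support    <- n_female_survived / n_total          # P(female and survived)
coverage   <- n_female / n_total                   # P(female)
confidence <- n_female_survived / n_female         # P(survived | female)
lift       <- confidence / (n_survived / n_total)  # confidence relative to the base survival rate

print(round(c(support = support, coverage = coverage,
              confidence = confidence, lift = lift), 6))
```

The same arithmetic applies to every row: confidence is support divided by coverage, and lift compares that confidence with the overall survival rate.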

6. Visualizations

6.1 Network Graph

The network graph visualizes the connections between passenger traits and survival. With the arulesViz defaults, the size of each rule's circle reflects its support and the color intensity reflects its lift.

# Plot the top 10 rules as a network graph.
# The original call passed control = list(type = "items") and main = "...",
# which this plot method does not recognize (it responded by listing the
# available control parameters). The title is added with ggtitle() instead,
# assuming the default ggplot2 engine, where plot() returns a ggplot object.
plot(head(rules_sorted, 10), method = "graph") +
  ggplot2::ggtitle("Network Graph: Predictors of Survival")

6.2 Grouped Matrix

This matrix groups similar rules to reveal broader patterns in survival factors.

# Plot the top 15 rules as a grouped matrix.
# The original call passed main = "...", which this plot method does not
# recognize (it responded by listing the available control parameters).
# The title is added with ggtitle() instead, assuming the default ggplot2
# engine, where plot() returns a ggplot object.
plot(head(rules_sorted, 15), method = "grouped") +
  ggplot2::ggtitle("Grouped Matrix of Survival Factors")

7. Conclusion

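The results are consistent with the “women first” part of the “Women and Children First” protocol: every one of the top 10 rules includes Sex=female. Age was not among the selected variables, so the analysis cannot say anything about children. The highest lift (2.61) belongs to {Pclass=2, Sex=female, Embarked=C}, but that rule covers only 7 passengers. Among the better-supported rules, {Pclass=1, Sex=female} => {Survived=Survived} has a lift of 2.52 and a confidence of 96.8% across 91 passengers, which suggests that sex and passenger class were the attributes most strongly associated with survival among those examined.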
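
Because lift can be inflated for rules that cover very few passengers, it may help to rank rules only after applying a minimum count. The sketch below does this in base R on the first five rows transcribed from the results table, where the top rule by lift rests on 7 passengers. The threshold of 20 passengers is an arbitrary illustrative choice, not something the analysis specifies.

```r
# First five rules transcribed from the results table (lhs, lift, count)
top_rules <- data.frame(
  lhs   = c("{Pclass=2, Sex=female, Embarked=C}",
            "{Pclass=1, Sex=female, Embarked=C}",
            "{Pclass=1, Sex=female}",
            "{Pclass=1, Sex=female, Embarked=S}",
            "{Pclass=2, Sex=female}"),
  lift  = c(2.605263, 2.544676, 2.522116, 2.496711, 2.399584),
  count = c(7, 42, 91, 46, 70),
  stringsAsFactors = FALSE
)

min_count <- 20  # hypothetical threshold, chosen only for illustration

# Keep well-supported rules, then rank the remainder by lift
supported <- top_rules[top_rules$count >= min_count, ]
supported <- supported[order(supported$lift, decreasing = TRUE), ]
print(supported)
```

With the arules objects from the modeling step, a comparable filter could likely be written as subset(rules_sorted, count >= 20), since count is one of the quality measures shown by inspect().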