This project analyzes the Titanic dataset to discover patterns related to passenger survival. We use the Apriori algorithm to generate association rules that identify which demographic factors (passenger class, gender, port of embarkation) were most strongly linked to survival.
We load the dataset directly from an external repository and perform necessary cleaning steps:
Converting numerical variables to categorical factors.
Removing missing values (NA).
Transforming the dataframe into a transaction matrix.
# Load required packages
library(dplyr)         # %>% and select()
library(arules)        # transactions class, apriori(), itemFrequencyPlot()
library(arulesViz)     # plot() methods for rules
library(RColorBrewer)  # brewer.pal()
# Load data directly from URL
url <- "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df <- read.csv(url)
# Select relevant columns
df_clean <- df %>% select(Survived, Pclass, Sex, Embarked)
# Convert numbers to factors (categories)
df_clean$Survived <- factor(df_clean$Survived, labels = c("Died", "Survived"))
df_clean$Pclass <- factor(df_clean$Pclass)
df_clean$Sex <- factor(df_clean$Sex)
df_clean$Embarked <- factor(df_clean$Embarked)
# Remove missing data (na.omit() drops only true NA values, not blank strings)
df_clean <- na.omit(df_clean)
# Convert to transactions format
trans <- as(df_clean, "transactions")

Before generating rules, we visualize the most frequent items in the dataset. This helps us understand the composition of the passengers.
itemFrequencyPlot(trans,
                  topN = 10,
                  type = "absolute",
                  col = brewer.pal(8, "Pastel2"),
                  main = "Top 10 Most Frequent Passenger Traits")

We apply the Apriori algorithm with a minimum confidence of 10% and a minimum support of 0.5%. We specifically target rules whose right-hand side (RHS) is “Survived”.
# Run algorithm
rules <- apriori(trans,
                 parameter = list(supp = 0.005, conf = 0.1, minlen = 2),
                 appearance = list(default = "lhs", rhs = "Survived=Survived"),
                 control = list(verbose = FALSE))
# Sort by lift to find the strongest patterns
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)

The table below displays the top 10 rules sorted by lift. A lift greater than 1.0 means that passengers matching the left-hand side (LHS) survived more often than passengers overall, i.e. a positive association with survival.
##      lhs                                   rhs                 support     confidence coverage    lift     count
## [1]  {Pclass=2, Sex=female, Embarked=C} => {Survived=Survived} 0.007856341 1.0000000  0.007856341 2.605263     7
## [2]  {Pclass=1, Sex=female, Embarked=C} => {Survived=Survived} 0.047138047 0.9767442  0.048260382 2.544676    42
## [3]  {Pclass=1, Sex=female}             => {Survived=Survived} 0.102132435 0.9680851  0.105499439 2.522116    91
## [4]  {Pclass=1, Sex=female, Embarked=S} => {Survived=Survived} 0.051627385 0.9583333  0.053872054 2.496711    46
## [5]  {Pclass=2, Sex=female}             => {Survived=Survived} 0.078563412 0.9210526  0.085297419 2.399584    70
## [6]  {Pclass=2, Sex=female, Embarked=S} => {Survived=Survived} 0.068462402 0.9104478  0.075196409 2.371956    61
## [7]  {Sex=female, Embarked=C}           => {Survived=Survived} 0.071829405 0.8767123  0.081930415 2.284066    64
## [8]  {Sex=female, Embarked=Q}           => {Survived=Survived} 0.030303030 0.7500000  0.040404040 1.953947    27
## [9]  {Sex=female}                       => {Survived=Survived} 0.261503928 0.7420382  0.352413019 1.933205   233
## [10] {Pclass=3, Sex=female, Embarked=Q} => {Survived=Survived} 0.026936027 0.7272727  0.037037037 1.894737    24
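To make the columns of the table concrete, the following sketch recomputes support, coverage, confidence, and lift for rule [9], {Sex=female} => {Survived=Survived}, from raw counts. The totals (891 passengers, 314 women, 342 survivors) are not stated in the text; they are back-calculated from the table (for example, 233 / 0.261503928 = 891), so treat them as assumptions. The variable names are illustrative.

```r
# Counts behind rule [9]: {Sex=female} => {Survived=Survived}
n_total       <- 891  # assumed: count / support = 233 / 0.261503928
n_female      <- 314  # assumed: coverage * n_total = 0.352413019 * 891
n_survived    <- 342  # assumed: n_total * confidence / lift (same for every row)
n_female_surv <- 233  # "count" column of rule [9]

support    <- n_female_surv / n_total               # share of all passengers matching LHS and RHS
coverage   <- n_female / n_total                    # share of all passengers matching the LHS
confidence <- n_female_surv / n_female              # survival rate among women
lift       <- confidence / (n_survived / n_total)   # relative to the overall survival rate

round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 4)
```

Under these assumed counts, the four values agree with row [9] of the table up to rounding. A lift near 1.93 says that women survived at roughly 1.9 times the overall rate.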
6.1 Network Graph
The network graph visualizes the connections between passenger traits and survival. With the arulesViz defaults, each rule is drawn as a circle whose size reflects its support and whose color intensity reflects its lift.
# The default ggplot2 engine does not recognize the `type` or `main` control
# parameters, so the title is added with ggtitle() instead
plot(head(rules_sorted, 10), method = "graph") +
  ggplot2::ggtitle("Network Graph: Predictors of Survival")
6.2 Grouped Matrix
This matrix groups similar rules to reveal broader patterns in survival factors.
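A grouped matrix of this kind can be drawn with the arulesViz `plot()` method using `method = "grouped"`. The sketch below is a self-contained approximation, not the code behind the original figure: it rebuilds the rules with the same thresholds used earlier, and it assumes that the `arules` and `arulesViz` packages are installed and that the dataset URL is reachable. The value `k = 5` (the number of LHS groups) is an arbitrary choice for illustration.

```r
library(arules)
library(arulesViz)

url <- "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df  <- read.csv(url)[, c("Survived", "Pclass", "Sex", "Embarked")]

# Every column must be a factor before coercion to transactions
df$Survived <- factor(df$Survived, labels = c("Died", "Survived"))
df[] <- lapply(df, factor)

trans <- as(df, "transactions")
rules <- apriori(trans,
                 parameter  = list(supp = 0.005, conf = 0.1, minlen = 2),
                 appearance = list(default = "lhs", rhs = "Survived=Survived"),
                 control    = list(verbose = FALSE))

# Cluster rule antecedents into k groups and plot them against the RHS
plot(rules, method = "grouped", control = list(k = 5))
```

A smaller `k` merges more antecedents into each group, which makes broad patterns easier to see at the cost of detail.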
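The results are consistent with the “women first” part of the “Women and Children First” protocol: every one of the top ten rules includes Sex=female. Age was not among the variables analyzed, so the rules say nothing about children. The highest lift (2.61) belongs to {Pclass=2, Sex=female, Embarked=C} => {Survived=Survived}, but that rule rests on only 7 passengers. The better-supported rule {Pclass=1, Sex=female} => {Survived=Survived} (91 passengers, confidence 0.97, lift 2.52) indicates that, among the factors examined, gender and passenger class were the ones most strongly associated with survival on the Titanic.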
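Because lift can be inflated for rules that cover very few passengers, one way to check a conclusion like this is to rank rules by lift only among those above a minimum count. The sketch below does this in base R with a handful of rows copied from the top-10 table; the cutoff of 50 passengers is an arbitrary assumption chosen for illustration, not a threshold used in the analysis.

```r
# Selected rows copied from the top-10 table (lhs, lift, count)
top_rules <- data.frame(
  lhs   = c("Pclass=2, Sex=female, Embarked=C",
            "Pclass=1, Sex=female, Embarked=C",
            "Pclass=1, Sex=female",
            "Pclass=2, Sex=female",
            "Sex=female"),
  lift  = c(2.605263, 2.544676, 2.522116, 2.399584, 1.933205),
  count = c(7, 42, 91, 70, 233)
)

min_count <- 50  # hypothetical cutoff, not taken from the analysis

# Keep only well-supported rules, then rank them by lift (highest first)
well_supported <- top_rules[top_rules$count >= min_count, ]
well_supported <- well_supported[order(-well_supported$lift), ]
print(well_supported, row.names = FALSE)
```

With this cutoff, {Pclass=1, Sex=female} ranks first. On the actual rules object, a comparable filter should be possible with the arules `subset()` method, for example `subset(rules_sorted, count >= 50)`.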