1 Project Purpose & Objectives

The objective of this project is to develop and evaluate machine learning models that predict passenger survival on the Titanic, using logistic regression and more advanced algorithms such as XGBoost. Key goals include:

  • Apply logistic regression for classification.
  • Build interpretable visualizations and partial dependence plots (PDPs).
  • Perform bias analysis across gender and passenger class.
  • Utilize unsupervised learning (K-Means, PCA) for passenger segmentation.
  • Engineer new features (e.g., IsAlone, FamilySize).
  • Generate Kaggle-compatible predictions from the test set.
  • Suggest further improvements such as SHAP/LIME analysis and ensemble modeling.

2 Data Loading and Preparation

library(iml)
library(tidyverse)
library(data.table)

# File path (adjust to wherever the Kaggle CSV files live on your machine)
data_path <- "C:/Users/Admin/Desktop/Guelph/assg4/"
train <- read.csv(paste0(data_path, "train.csv"), stringsAsFactors = FALSE)
test <- read.csv(paste0(data_path, "test.csv"), stringsAsFactors = FALSE)
gender_submission <- read.csv(paste0(data_path, "gender_submission.csv"), stringsAsFactors = FALSE)

## Dataset Overview

# Show dataset structure and summary
#str(train)

library(DT)
datatable(train, 
          filter = "top",     # Adds a filter box above each column
          options = list(
            pageLength = 15,  # 15 rows per page
            autoWidth = TRUE  # Automatic column widths
          ))
# Load package
library(skimr)

# Run skim
skim(train)
Data summary

  Name: train
  Number of rows: 891
  Number of columns: 12
  Column type frequency: character (5), numeric (7)
  Group variables: None

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Name | 0 | 1 | 12 | 82 | 0 | 891 | 0 |
| Sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| Ticket | 0 | 1 | 3 | 18 | 0 | 681 | 0 |
| Cabin | 0 | 1 | 0 | 15 | 687 | 148 | 0 |
| Embarked | 0 | 1 | 0 | 1 | 2 | 4 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 0 | 1.0 | 446.00 | 257.35 | 1.00 | 223.50 | 446.00 | 668.5 | 891.00 | ▇▇▇▇▇ |
| Survived | 0 | 1.0 | 0.38 | 0.49 | 0.00 | 0.00 | 0.00 | 1.0 | 1.00 | ▇▁▁▁▅ |
| Pclass | 0 | 1.0 | 2.31 | 0.84 | 1.00 | 2.00 | 3.00 | 3.0 | 3.00 | ▃▁▃▁▇ |
| Age | 177 | 0.8 | 29.70 | 14.53 | 0.42 | 20.12 | 28.00 | 38.0 | 80.00 | ▂▇▅▂▁ |
| SibSp | 0 | 1.0 | 0.52 | 1.10 | 0.00 | 0.00 | 0.00 | 1.0 | 8.00 | ▇▁▁▁▁ |
| Parch | 0 | 1.0 | 0.38 | 0.81 | 0.00 | 0.00 | 0.00 | 0.0 | 6.00 | ▇▁▁▁▁ |
| Fare | 0 | 1.0 | 32.20 | 49.69 | 0.00 | 7.91 | 14.45 | 31.0 | 512.33 | ▇▁▁▁▁ |
# Show dataset structure and summary
str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
summary(train)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 
# Display first few rows
head(train)
# Count of total passengers
nrow(train)
## [1] 891
# Survival distribution
table(train$Survived)
## 
##   0   1 
## 549 342
prop.table(table(train$Survived)) * 100
## 
##        0        1 
## 61.61616 38.38384
# Gender distribution
table(train$Sex)
## 
## female   male 
##    314    577
prop.table(table(train$Sex)) * 100
## 
##  female    male 
## 35.2413 64.7587
# Class distribution
table(train$Pclass)
## 
##   1   2   3 
## 216 184 491
# Embarkation distribution
table(train$Embarked)
## 
##       C   Q   S 
##   2 168  77 644
# Fare and Age summaries
summary(train$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.91   14.45   32.20   31.00  512.33
summary(train$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177
# Missing value counts
colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
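Note that is.na() flags only Age, while skim() above counted 687 empty Cabin values: the blanks in Cabin and Embarked are empty strings, not NA. A quick check (a minimal sketch):

``` r
# Empty strings are invisible to is.na(); count them per column
sapply(train, function(x) sum(x == "", na.rm = TRUE))
```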
# Feature engineering
#train$Survived <- as.factor(train$Survived)
#test$Survived <- NULL
#train$FamilySize <- train$SibSp + train$Parch + 1
#train$IsAlone <- ifelse(train$FamilySize == 1, 1, 0)

3 Feature Analysis and Suitability Assessment

| Feature | Type | Notes | Include | Reasoning |
|---|---|---|---|---|
| PassengerId | Numeric ID | Unique identifier, not useful for prediction | No | Pure index value; no analytical value |
| Survived | Binary target | The target variable | Yes | This is the variable we aim to predict |
| Pclass | Ordinal | Socioeconomic status indicator | Yes | Clear correlation with survival rates |
| Name | Text | Raw name not useful; titles can be extracted | Yes | Extract title to capture gender/social status |
| Sex | Categorical | Strong predictor | Yes | Female passengers had higher survival rates |
| Age | Numeric | Missing values exist | Yes | Fill missing values using median or model-based imputation |
| SibSp | Numeric | Number of siblings/spouses aboard | Yes | Used in FamilySize calculation |
| Parch | Numeric | Number of parents/children aboard | Yes | Combined with SibSp to compute FamilySize |
| Ticket | Text | Raw form noisy; prefix or group size can be engineered | Yes | Optional engineered features can be extracted |
| Fare | Numeric | Ticket price, economic status indicator | Yes | Often correlates with class and survival |
| Cabin | Text | Many missing values; deck letter can be extracted | Yes | Use deck letter as feature; impute missing as ‘Unknown’ |
| Embarked | Categorical | Port of embarkation (C, Q, S) | Yes | Region-based travel behavior indicator |
| FamilySize | Engineered | SibSp + Parch + 1 | Yes | Captures family group size effect |
| IsAlone | Engineered | Binary flag for passengers traveling alone | Yes | Strong indicator for survival outcome |
| Title | Engineered | Extracted from Name | Yes | Captures gender/status information embedded in name |

3.1 Summary of Selected Features for Modeling

We will use the following features:

  • Pclass
  • Sex
  • Age (with imputation)
  • Fare
  • Embarked
  • FamilySize (engineered)
  • IsAlone (engineered)
  • Title (engineered from Name)
  • Optionally: Deck (from Cabin) and TicketPrefix (from Ticket)

These features will be carried forward into correlation analysis, modeling, and feature selection.
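The optional Deck feature can be derived from the first character of Cabin, with blanks mapped to "Unknown" as proposed in the table above. A minimal sketch:

``` r
# Deck letter from the leading character of Cabin; blanks become "Unknown"
train$Deck <- ifelse(is.na(train$Cabin) | train$Cabin == "", "Unknown",
                     substr(train$Cabin, 1, 1))
table(train$Deck)
```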

 
## Data Preparation and Feature Importance

**Comment:**
- Missing values (especially `Age` and `Fare`) were filled using the median.
- New features such as `FamilySize` and `IsAlone` were created.
- Titles were extracted from the `Name` column for deeper insight.

**Interpretation:**
This step improves data quality and helps the model learn: imputing missing values avoids discarding rows, and the engineered features expose family-structure and social-status patterns useful for prediction.


``` r
library(tidyverse)
library(corrplot)
library(randomForest)
library(caret)
# Convert relevant variables to factors
train$Sex <- as.factor(train$Sex)
train$Embarked[train$Embarked == ""] <- NA
train$Embarked[is.na(train$Embarked)] <- "S"
train$Embarked <- as.factor(train$Embarked)

# Impute missing numeric values with median
train$Age[is.na(train$Age)] <- median(train$Age, na.rm = TRUE)
train$Fare[is.na(train$Fare)] <- median(train$Fare, na.rm = TRUE)

# Feature engineering
train$FamilySize <- train$SibSp + train$Parch + 1
train$IsAlone <- ifelse(train$FamilySize == 1, 1, 0)

# Extract title from Name
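# (the regex captures the text between the comma and the following period,
#  e.g. "Braund, Mr. Owen Harris" -> "Mr")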
train$Title <- gsub("^.*, (.*?)\\..*$", "\\1", train$Name)
train$Title[!(train$Title %in% c("Mr", "Miss", "Mrs", "Master"))] <- "RareTitle"
train$Title <- factor(train$Title)

# Convert Survived to factor
train$Survived <- as.factor(train$Survived)
# Prepare numeric-only subset for correlation
numeric_vars <- train %>% 
  select_if(is.numeric) %>%
  select(-PassengerId)  # Drop ID

# Compute and plot correlation matrix
cor_matrix <- cor(numeric_vars)
corrplot(cor_matrix, method = "color", addCoef.col = "black", tl.cex = 0.8)

# One-hot encoding for categorical variables
dummies <- dummyVars(" ~ Sex + Embarked + Title", data = train)
categorical_encoded <- data.frame(predict(dummies, newdata = train))

# Combine all features
train_features <- bind_cols(
  train %>% select(Survived, Pclass, Age, Fare, FamilySize, IsAlone),
  categorical_encoded
)

# Convert Survived to factor (target variable)
train_features$Survived <- as.factor(train_features$Survived)

# Random Forest with all features
set.seed(123)
rf_full <- randomForest(Survived ~ ., data = train_features, importance = TRUE, ntree = 200)

# Convert importance to data frame
importance_df <- as.data.frame(importance(rf_full))
importance_df$Feature <- rownames(importance_df)
rownames(importance_df) <- NULL

# Sort by MeanDecreaseAccuracy
importance_df <- importance_df %>%
  arrange(MeanDecreaseAccuracy)

# Plot with ggplot2
ggplot(importance_df, aes(x = MeanDecreaseAccuracy, y = reorder(Feature, MeanDecreaseAccuracy))) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Random Forest Feature Importance (Mean Decrease Accuracy)",
       x = "Mean Decrease Accuracy",
       y = "Feature") +
  theme_minimal()
```

3.2 Key Observations from Correlation Analysis

  • SibSp and FamilySize: 0.89 correlation.
    Very strong positive correlation, as expected, since FamilySize = SibSp + Parch + 1.

  • Parch and FamilySize: 0.78 correlation.
    Another strong relationship that confirms the redundancy.

  • IsAlone and FamilySize: -0.69 correlation.
    Inversely related: being alone corresponds to FamilySize == 1.

  • Fare and Pclass: -0.55 correlation.
    Negative because Pclass is coded 1-3: lower class numbers (higher classes) correspond to higher fares.

  • Other correlations (e.g., Age vs. Pclass) were moderate to weak and less informative.

3.3 Interpretation of Random Forest Feature Importance

General Meaning:

The bar chart displays each feature’s contribution to model accuracy. Mean Decrease Accuracy is permutation-based: a higher value means that shuffling the feature degrades the model’s accuracy more, so the feature is more influential in predicting survival.


3.3.1 Top Features

  • Pclass: Highest importance — passenger class (1st, 2nd, 3rd) is strongly related to survival.
  • FamilySize: Total number of family members onboard. People traveling with family show different survival rates.
  • Fare: Ticket price, which may reflect socioeconomic status. Higher fares are linked to higher survival chances.
  • Sex.female / Sex.male: Gender is highly predictive — women were more likely to survive.
  • Age: Important factor, though slightly less than above. Young children and adults showed different patterns.

3.3.2 Medium-Impact Features

  • Title.Mr / Title.Miss / Title.Master, etc.: Extracted from names. Provides information about age, gender, and social status.
  • Embarked.C / Embarked.S / Embarked.Q: Port of embarkation. Some influence, possibly due to class or route.
  • IsAlone: Passengers traveling alone may have had lower survival, though this signal is weaker than FamilySize.

3.3.3 Low-Impact or Redundant Features

  • Rare titles (e.g., Col, Don, Mlle): Very few passengers carry these titles, so their predictive contribution is low. The preprocessing above already groups them under a single "RareTitle" category.

3.3.4 Final Recommendation

Retain the top ~10 features:

  • Pclass, FamilySize, Fare, Sex, Age, Embarked, Title, IsAlone.

These will provide the most predictive power for survival.

Optional / Can be removed:

  • Rare titles and any categorical levels with very low frequency.

3.3.5 Feature Engineering Recommendations

  • Avoid using SibSp and Parch separately if FamilySize is included — this will help prevent multicollinearity.
  • Use IsAlone and FamilySize as engineered features instead — they carry distinct and informative signals.
  • Consider analyzing Pclass and Fare together as indicators of economic status; a quick multicollinearity check is sketched below.
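To make the multicollinearity concern concrete, variance inflation factors (VIFs) can be computed on a simple logistic model; values well above 5 suggest redundant predictors. A minimal sketch, assuming the `car` package is installed (it is not used elsewhere in this report):

``` r
library(car)

# Logistic model on the engineered features; FamilySize and IsAlone are
# correlated but not perfectly collinear, so vif() runs without aliasing
vif_check <- glm(Survived ~ Pclass + Age + Fare + FamilySize + IsAlone,
                 data = train, family = binomial)
vif(vif_check)
```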
## Clustering with K-Means & PCA

library(cluster)
library(factoextra)
library(plotly)

# Selected top features
cluster_input <- train %>% 
  select(Pclass, Age, Fare, FamilySize, IsAlone,
         Title, Sex, Embarked)

# Convert categorical variables to numeric (one-hot encoding)
dummies <- dummyVars(" ~ .", data = cluster_input)
cluster_matrix <- predict(dummies, newdata = cluster_input)
cluster_df <- as.data.frame(cluster_matrix)

# Keep only complete cases
valid_rows <- complete.cases(cluster_df)
features_cluster <- scale(cluster_df[valid_rows, ])

# K-Means clustering
set.seed(123)
kclust <- kmeans(features_cluster, centers = 3, nstart = 25)
train$Cluster <- NA
train$Cluster[valid_rows] <- kclust$cluster
train$Cluster <- factor(train$Cluster)

# Visualize with PCA
pca <- prcomp(features_cluster)
pca_df <- as.data.frame(pca$x[, 1:2])
pca_df$Cluster <- factor(kclust$cluster)

# Examine survival rates by cluster
train %>%
  filter(!is.na(Cluster)) %>%
  group_by(Cluster, Survived) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1)) %>%
  arrange(Cluster, desc(Survived))
plot_ly(pca_df, x = ~PC1, y = ~PC2, color = ~Cluster,
        type = 'scatter', mode = 'markers') %>%
  layout(title = "K-Means Clustering with Top Features (PCA Projection)")
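The choice of k = 3 above is fixed by hand. An elbow plot from factoextra (already loaded) can sanity-check it; a minimal sketch:

``` r
# Within-cluster sum of squares for k = 1..8; look for the "elbow" where
# adding clusters stops reducing WSS substantially
fviz_nbclust(as.data.frame(features_cluster), kmeans, method = "wss", k.max = 8)
```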

3.4 Result of Clustering with PCA & K-Means

3.5 Interpretation: Survival Rates by Cluster

Based on the K-Means clustering and survival breakdown table:

  • Cluster 2 (🟠 Orange) has the highest survival rate, with 84% of passengers in this group surviving.
    This suggests that Cluster 2 represents passengers with favorable characteristics such as:

    • Higher Fare
    • First Class (Pclass = 1)
    • Possibly Female and/or from families
  • Cluster 3 (🔵 Blue) shows a moderate survival rate of 74.2%.
    This cluster may contain passengers with some survival advantages but not as strong as Cluster 2.

  • Cluster 1 (🟢 Green) has the lowest survival rate, with only 57.5% of its passengers surviving.
    ⚠ It likely represents passengers who were alone (IsAlone = 1), younger, or from lower fare brackets.

3.6 Summary

  • Clusters are highly informative and indicate clear survival patterns.
  • Cluster 2 likely represents the safest group, possibly women or families in higher classes.
  • These clusters could be used for segmentation, model stratification, or targeted survival analysis in further modeling.

4 Categorical Feature Analysis

library(ggplot2)
library(dplyr)
library(tidyr)

# 1. Gender-Based Survival
p1 <- train %>%
  group_by(Sex, Survived) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Sex) %>%                     # percentages within each gender
  mutate(Percent = round(Count / sum(Count) * 100, 1)) %>%
  ungroup() %>%
  ggplot(aes(x = Sex, y = Percent, fill = as.factor(Survived))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Survival Rate by Gender", y = "%", fill = "Survived") +
  theme_minimal()

# 2. Class-Based Survival
p2 <- train %>%
  group_by(Pclass, Survived) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Pclass) %>%                  # percentages within each class
  mutate(Percent = round(Count / sum(Count) * 100, 1)) %>%
  ungroup() %>%
  ggplot(aes(x = as.factor(Pclass), y = Percent, fill = as.factor(Survived))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Survival Rate by Passenger Class", x = "Pclass", y = "%", fill = "Survived") +
  theme_minimal()

# 3. Embarked-Based Survival
p3 <- train %>%
  group_by(Embarked, Survived) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Embarked) %>%                # percentages within each port
  mutate(Percent = round(Count / sum(Count) * 100, 1)) %>%
  ungroup() %>%
  ggplot(aes(x = Embarked, y = Percent, fill = as.factor(Survived))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Survival Rate by Embarkation Port", y = "%", fill = "Survived") +
  theme_minimal()

# 4. Title-Based Survival (from engineered Title)
p4 <- train %>%
  group_by(Title, Survived) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Title) %>%                   # percentages within each title
  mutate(Percent = round(Count / sum(Count) * 100, 1)) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(Title, -Percent), y = Percent, fill = as.factor(Survived))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Survival Rate by Title", y = "%", fill = "Survived") +
  theme_minimal() +
  coord_flip()

# Display multiple plots (optionally with patchwork)
library(patchwork)
(p1 | p2) / (p3 | p4)

4.1 Key Takeaways from Categorical Feature Analysis

4.1.1 Gender (Sex)

  • Female passengers had a significantly higher survival rate compared to males.
  • This confirms the historical “women and children first” policy during evacuation.
  • Males make up a large portion of non-survivors.

4.1.2 Passenger Class (Pclass)

  • 1st class passengers had the highest chance of survival.
  • 3rd class passengers had the lowest survival rate by far.
  • Socioeconomic status was a critical factor in determining access to lifeboats and evacuation priority.

4.1.3 Embarkation Port (Embarked)

  • Passengers who boarded from port ‘C’ (Cherbourg) had a noticeably higher survival rate.
  • Those from ‘S’ (Southampton) and ‘Q’ (Queenstown) had lower survival rates.
  • This might relate to the distribution of class by embarkation location (e.g., more 1st class passengers from Cherbourg).

4.1.4 Title (Extracted from Name)

  • Titles like ‘Miss’, ‘Mrs’, and ‘Master’ had high survival rates.
  • Passengers with ‘Mr’ and rare titles (e.g., ‘Rev’, ‘Col’, ‘Don’) had significantly lower survival rates.
  • Titles help the model capture social role, age, or gender indirectly, making it a valuable engineered feature.

4.2 Summary

This set of visualizations confirms that Sex, Pclass, Embarked, and engineered features like Title have strong predictive value. These features are crucial for the model and should be retained during feature selection.

5 Logistic Regression & Partial Dependence Plots (PDP)

library(DALEX)
library(iml)

# Prepare model dataset
logreg_data <- train %>%
  mutate(
    Sex = as.factor(Sex),
    Embarked = ifelse(Embarked == "", NA, Embarked),
    Embarked = factor(Embarked),
    Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
    Title = ifelse(Title %in% c("Mr", "Miss", "Mrs", "Master"), Title, "RareTitle"),
    Title = factor(Title),
    FamilySize = SibSp + Parch + 1,
    IsAlone = ifelse(FamilySize == 1, 1, 0)
  ) %>%
  select(Survived, Pclass, Sex, Age, Fare, FamilySize, IsAlone, Embarked, Title) %>%
  drop_na()

# Logistic regression with more informative features
log_model <- glm(Survived ~ Pclass + Sex + Age + Fare + FamilySize + IsAlone + Embarked + Title,
                 data = logreg_data, family = binomial)

# PDP preparation (y as 0/1 numeric rather than factor codes 1/2)
predictor <- Predictor$new(
  model = log_model,
  data = logreg_data %>% select(-Survived),
  y = as.numeric(as.character(logreg_data$Survived))
)

# Generate PDPs for top features
pdp_age <- FeatureEffect$new(predictor, feature = "Age", method = "pdp")
pdp_fare <- FeatureEffect$new(predictor, feature = "Fare", method = "pdp")
pdp_pclass <- FeatureEffect$new(predictor, feature = "Pclass", method = "pdp")
pdp_sex <- FeatureEffect$new(predictor, feature = "Sex", method = "pdp")

# Plot PDPs
plot(pdp_age) + ggtitle("PDP: Age")

plot(pdp_fare) + ggtitle("PDP: Fare")

plot(pdp_pclass) + ggtitle("PDP: Pclass")

plot(pdp_sex) + ggtitle("PDP: Sex")

5.1 Result of Logistic Regression & PDP (Partial Dependence Plots)

5.1.1 PDP: Age

  • Observation: Strong negative relationship between age and survival.
  • Interpretation: Younger passengers had a higher survival probability.
  • Why: Children were prioritized in lifeboat boarding during the disaster.

5.1.2 PDP: Fare

  • Observation: Fare shows a positive effect on survival probability.
  • Interpretation: Passengers who paid more had higher chances of survival.
  • Why: Fare reflects class and access to better rescue conditions.

5.2 PDP: Passenger Class (Pclass)

  • Observation: Class level inversely affects survival.
  • Interpretation: 1st class passengers were most likely to survive, 3rd class the least.
  • Why: Higher-class locations provided better access to lifeboats.

5.3 PDP: Sex

  • Observation: Sharp difference by gender:
    • Female → Higher survival.
    • Male → Lower survival.
  • Interpretation: Gender is a strong predictor.
  • Why: Evacuation protocols prioritized women.
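As a sanity check on what these PDP curves represent, the Age effect can be reproduced by hand: hold Age at each grid value for every row, average the model's predicted probabilities, and plot the averages. A minimal sketch using the objects defined above:

``` r
# Manual partial dependence for Age: set Age to each grid value for all
# rows, then average the predicted survival probabilities
age_grid <- seq(min(logreg_data$Age), max(logreg_data$Age), length.out = 30)
pdp_manual <- sapply(age_grid, function(a) {
  d <- logreg_data
  d$Age <- a
  mean(predict(log_model, newdata = d, type = "response"))
})
plot(age_grid, pdp_manual, type = "l",
     xlab = "Age", ylab = "Average predicted survival probability",
     main = "Manual PDP: Age")
```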

6 XGBoost Modeling

library(xgboost)
library(caret)
library(dplyr)

# Feature Engineering for Training Data
train_xgb <- train %>%
  mutate(
    Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
    Fare = ifelse(is.na(Fare), median(Fare, na.rm = TRUE), Fare),
    FamilySize = SibSp + Parch + 1,
    IsAlone = ifelse(FamilySize == 1, 1, 0),
    Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
    Sex = factor(Sex),
    Embarked = factor(ifelse(Embarked == "", "S", Embarked)),
    Title = factor(Title)
  )

# Re-apply the Section 3 conversions on train (no-op if already run)
train$Embarked[train$Embarked == ""] <- NA
train$Embarked[is.na(train$Embarked)] <- "S"
train$Embarked <- as.factor(train$Embarked)
train$Title <- gsub("^.*, (.*?)\\..*$", "\\1", train$Name)
train$Title[!(train$Title %in% c("Mr", "Miss", "Mrs", "Master"))] <- "RareTitle"
train$Title <- factor(train$Title)
train$Sex <- as.factor(train$Sex)
# One-hot encoding for categorical variables
dummies <- dummyVars(" ~ Sex + Embarked + Title", data = train_xgb)
cat_encoded <- predict(dummies, newdata = train_xgb)

# Combine numeric + encoded features
xgb_features <- cbind(
  train_xgb %>% select(Pclass, Age, Fare, FamilySize, IsAlone),
  cat_encoded
)

# Create label vector
label <- as.numeric(train$Survived) - 1

# Create DMatrix (preserves feature names)
dtrain <- xgb.DMatrix(data = as.matrix(xgb_features), label = label)

# Train XGBoost model
set.seed(123)
xgb_model <- xgboost(
  data = dtrain,
  nrounds = 50,
  objective = "binary:logistic",
  verbose = 0
)
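Here nrounds = 50 is a fixed guess. Cross-validation can choose it instead; a minimal sketch with xgb.cv and early stopping:

``` r
# 5-fold CV; stops once the held-out error fails to improve for 10 rounds
set.seed(123)
cv <- xgb.cv(data = dtrain, nrounds = 200, nfold = 5,
             objective = "binary:logistic",
             early_stopping_rounds = 10, verbose = 0)
cv$best_iteration
```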

7 Evaluation: Bias, ROC, Confusion Matrix

library(pROC)
library(caret)

# Preparation: bias_data must contain all of the model features
bias_data <- train %>%
  mutate(
    Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
    Fare = ifelse(is.na(Fare), median(Fare, na.rm = TRUE), Fare),
    FamilySize = SibSp + Parch + 1,
    IsAlone = ifelse(FamilySize == 1, 1, 0),
    Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
    Sex = factor(Sex),
    Embarked = factor(ifelse(Embarked == "", "S", Embarked)),
    Title = factor(Title)
  )

# Encode the categorical variables
bias_encoded <- predict(dummies, newdata = bias_data)

# Assemble into XGBoost input format
bias_features <- cbind(
  bias_data %>% select(Pclass, Age, Fare, FamilySize, IsAlone),
  bias_encoded
)

dpredict <- xgb.DMatrix(data = as.matrix(bias_features))

# Get predictions
train$pred <- predict(xgb_model, dpredict)

# Bias analysis: average predicted survival probability per group
train %>% group_by(Sex) %>% summarise(AvgPrediction = mean(pred, na.rm = TRUE))
train %>% group_by(Pclass) %>% summarise(AvgPrediction = mean(pred, na.rm = TRUE))
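# A fuller bias check (sketch): compare average predictions with observed
# survival rates per group; large gaps would flag group-level bias
train %>% group_by(Sex) %>%
  summarise(AvgPrediction = mean(pred, na.rm = TRUE),
            ActualRate = mean(as.numeric(as.character(Survived))))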
# ROC and AUC
roc_obj <- roc(as.numeric(train$Survived) - 1, train$pred)
plot(roc_obj, main = "ROC Curve - XGBoost")

auc(roc_obj)
## Area under the curve: 0.9856
# Confusion Matrix
pred_class <- ifelse(train$pred > 0.5, 1, 0)
confusionMatrix(factor(pred_class), factor(as.numeric(train$Survived) - 1))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 533  41
##          1  16 301
##                                           
##                Accuracy : 0.936           
##                  95% CI : (0.9179, 0.9512)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8629          
##                                           
##  Mcnemar's Test P-Value : 0.001478        
##                                           
##             Sensitivity : 0.9709          
##             Specificity : 0.8801          
##          Pos Pred Value : 0.9286          
##          Neg Pred Value : 0.9495          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5982          
##    Detection Prevalence : 0.6442          
##       Balanced Accuracy : 0.9255          
##                                           
##        'Positive' Class : 0               
## 

7.1 Result of Model Evaluation (ROC Curve & Confusion Matrix)

  • ROC curve showed the XGBoost model had high discrimination ability.
  • Confusion matrix validated prediction performance.

Interpretation:
The model demonstrated strong performance in distinguishing survivors from non-survivors. The high AUC (0.9856) supports this, though it is computed on the training data and is therefore likely optimistic.

8 Gain Chart for Model Evaluation

The Gain Chart is a powerful evaluation tool used in binary classification tasks. It helps visualize how effectively a model identifies positive cases—in this case, passengers who survived the Titanic disaster.

8.1 Purpose of Gain Chart

  • X-axis (% of Sample): Cumulative percentage of passengers, sorted by predicted survival probability (from highest to lowest).
  • Y-axis (% of Target Captured): Percentage of actual survivors captured within that portion of the population.

A well-performing model will have a curve that rises steeply and stays above the diagonal red line, indicating it correctly identifies most of the positive cases early on.

8.2 Gain Chart: XGBoost Titanic Model

library(dplyr)

# Step 1: Prepare prediction features (same as XGBoost training set)
gain_data <- train %>%
  mutate(
    Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
    Fare = ifelse(is.na(Fare), median(Fare, na.rm = TRUE), Fare),
    FamilySize = SibSp + Parch + 1,
    IsAlone = ifelse(FamilySize == 1, 1, 0),
    Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
    Sex = factor(Sex),
    Embarked = factor(ifelse(Embarked == "", "S", Embarked)),
    Title = factor(Title)
  )

# One-hot encode categorical variables
gain_encoded <- predict(dummies, newdata = gain_data)

# Final input for prediction
gain_features <- cbind(
  gain_data %>% select(Pclass, Age, Fare, FamilySize, IsAlone),
  gain_encoded
)

# Step 2: Make predictions
actuals <- as.numeric(train$Survived) - 1
preds <- predict(xgb_model, newdata = xgb.DMatrix(data = as.matrix(gain_features)))

# Step 3: Filter valid rows
valid_idx <- which(!is.na(actuals) & !is.na(preds))
gain_df <- data.frame(actual = actuals[valid_idx], predicted = preds[valid_idx])

# Step 4: Sort predictions and compute cumulative gains
gain_df <- gain_df %>%
  arrange(desc(predicted)) %>%
  mutate(
    row_id = row_number(),
    cum_pct_total = row_id / n(),
    cum_captured = cumsum(actual) / sum(actual)
  )

# Step 5: Plot Gain Chart
plot(gain_df$cum_pct_total * 100, gain_df$cum_captured * 100,
     type = "l", col = "blue", lwd = 2,
     xlab = "% of Sample", ylab = "% of Positive Cases Captured",
     main = "Gain Chart - Titanic Survival Prediction (XGBoost)")

abline(0, 1, col = "red", lty = 2)  # baseline
legend("bottomright", legend = c("XGBoost Model", "Random Baseline"),
       col = c("blue", "red"), lty = c(1, 2), bty = "n")

8.3 Result of Gain Chart

Comment: If the top 20% of ranked predictions capture about 60% of actual survivors, the model is considered strong; the sketch below reads this value off the curve.

Note: If the chart does not render, check that the two plotted vectors have the same length.

Interpretation: Gain charts assess how well the model ranks the most likely survivors first; they are commonly used in marketing and resource prioritization.
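A minimal sketch, reusing gain_df from above, to compute the captured share at the 20% cutoff:

``` r
# Share of survivors captured within the top 20% of ranked predictions
top20 <- which.min(abs(gain_df$cum_pct_total - 0.20))
round(gain_df$cum_captured[top20] * 100, 1)
```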

9 Test Predictions & Submission File

library(xgboost)
library(dplyr)

# Feature engineering
train_processed <- train %>%
  mutate(
    Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
    Fare = ifelse(is.na(Fare), median(Fare, na.rm = TRUE), Fare),
    FamilySize = SibSp + Parch + 1,
    IsAlone = ifelse(FamilySize == 1, 1, 0),
    Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
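    # Pitfall: ifelse() on a factor returns integer level codes, so Embarked
    # comes out numeric here (visible in the str() output below); coercing
    # with as.character() first would keep it categorical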
    Embarked = ifelse(Embarked == "", "S", Embarked)
  )
  
str(train_processed)
## 'data.frame':    891 obs. of  17 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 28 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : int  3 1 3 3 3 2 3 3 3 1 ...
##  $ FamilySize : num  2 2 1 2 1 1 1 5 3 2 ...
##  $ IsAlone    : num  0 0 1 0 1 1 1 0 0 0 ...
##  $ Title      : chr  "Mr" "Mrs" "Miss" "Mrs" ...
##  $ Cluster    : Factor w/ 3 levels "1","2","3": 2 3 3 3 2 2 2 1 3 3 ...
##  $ pred       : num  0.0617 0.998 0.4803 0.9985 0.088 ...
# Convert to factors (re-applies the Section 3 conversions; no-op if already run)
train$Embarked[train$Embarked == ""] <- NA
train$Embarked[is.na(train$Embarked)] <- "S"
train$Embarked <- as.factor(train$Embarked)
train$Title <- gsub("^.*, (.*?)\\..*$", "\\1", train$Name)
train$Title[!(train$Title %in% c("Mr", "Miss", "Mrs", "Master"))] <- "RareTitle"
train$Title <- factor(train$Title)
train$Sex <- as.factor(train$Sex)
# Create model matrix for XGBoost (dummy encoding included)
xgb_matrix <- model.matrix(~ Pclass + Age + Fare + FamilySize + IsAlone + Sex + Embarked + Title - 1, data = train_processed)

# Prepare label
label <- as.numeric(train$Survived) - 1

# Train model
dtrain <- xgb.DMatrix(data = xgb_matrix, label = label)
set.seed(123)
xgb_model <- xgboost(data = dtrain, nrounds = 50, objective = "binary:logistic", verbose = 0)
train_feature_names <- colnames(xgb_matrix)

# Column-alignment template (a no-op on the training matrix itself; the same
# steps are reused for the test matrix in the next section):
# fill in missing columns with zeros
missing_cols <- setdiff(train_feature_names, colnames(xgb_matrix))
for (col in missing_cols) {
  xgb_matrix <- cbind(xgb_matrix, setNames(data.frame(rep(0, nrow(xgb_matrix))), col))
}

# Drop extra columns
extra_cols <- setdiff(colnames(xgb_matrix), train_feature_names)
xgb_matrix <- xgb_matrix[, !(colnames(xgb_matrix) %in% extra_cols)]

# Reorder columns to match the training feature order
xgb_matrix <- xgb_matrix[, train_feature_names]

10 Predicting on the Test Set Using the Same model.matrix Encoding

# Step 1: Preprocessing test data
test_processed <- test %>%
  mutate(
    Age = ifelse(is.na(Age), median(train$Age, na.rm = TRUE), Age),
    Fare = ifelse(is.na(Fare), median(train$Fare, na.rm = TRUE), Fare),
    FamilySize = SibSp + Parch + 1,
    IsAlone = ifelse(FamilySize == 1, 1, 0),
    Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
    Embarked = ifelse(Embarked == "", "S", Embarked)
  )
str(test_processed)
## 'data.frame':    418 obs. of  14 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...
##  $ FamilySize : num  1 2 1 1 3 1 1 3 1 3 ...
##  $ IsAlone    : num  1 0 1 1 0 1 1 0 1 0 ...
##  $ Title      : chr  "Mr" "Mrs" "Mr" "Mr" ...
# Step 2: Align factor levels with training set
test_processed$Embarked[test_processed$Embarked == ""] <- NA
test_processed$Embarked[is.na(test_processed$Embarked)] <- "S"
test_processed$Embarked <- factor(test_processed$Embarked, levels = levels(train$Embarked))
test_processed$Title <- gsub("^.*, (.*?)\\..*$", "\\1", test$Name)
test_processed$Title[!(test_processed$Title %in% levels(train$Title))] <- "RareTitle"
test_processed$Title <- factor(test_processed$Title, levels = levels(train$Title))
test_processed$Sex <- factor(test_processed$Sex, levels = levels(train$Sex))
 
str(test_processed$Sex)
##  Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
str(test_processed$Embarked)
##  Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
str(test_processed$Title)
##  Factor w/ 5 levels "Master","Miss",..: 3 4 3 3 4 3 2 3 4 3 ...
summary(test_processed$Sex)
## female   male 
##    152    266
summary(test_processed$Embarked)
##   C   Q   S 
## 102  46 270
summary(test_processed$Title)
##    Master      Miss        Mr       Mrs RareTitle 
##        21        78       240        72         7
# Step 3: Create model matrix
xgb_test_matrix <- model.matrix(~ Pclass + Age + Fare + FamilySize + IsAlone + Sex + Embarked + Title - 1, 
                                data = test_processed)

# Step 4: Fix column mismatch
missing_cols <- setdiff(train_feature_names, colnames(xgb_test_matrix))
for (col in missing_cols) {
  xgb_test_matrix <- cbind(xgb_test_matrix, setNames(data.frame(rep(0, nrow(test_processed))), col))
}
extra_cols <- setdiff(colnames(xgb_test_matrix), train_feature_names)
xgb_test_matrix <- xgb_test_matrix[, !(colnames(xgb_test_matrix) %in% extra_cols)]
xgb_test_matrix <- xgb_test_matrix[, train_feature_names]  # ensure correct order

# Step 5: Prediction
str(xgb_test_matrix)
## 'data.frame':    418 obs. of  24 variables:
##  $ Pclass           : num  3 3 2 3 3 3 3 2 3 3 ...
##  $ Age              : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ Fare             : num  7.83 7 9.69 8.66 12.29 ...
##  $ FamilySize       : num  1 2 1 1 3 1 1 3 1 3 ...
##  $ IsAlone          : num  1 0 1 1 0 1 1 0 1 0 ...
##  $ Sexfemale        : num  0 1 0 0 1 0 1 0 1 0 ...
##  $ Sexmale          : num  1 0 1 1 0 1 0 1 0 1 ...
##  $ Embarked         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleCol         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleDon         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleDr          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleJonkheer    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleLady        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleMajor       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleMaster      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleMiss        : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ TitleMlle        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleMme         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleMr          : num  1 0 1 1 0 1 0 1 0 1 ...
##  $ TitleMrs         : num  0 1 0 0 1 0 0 0 1 0 ...
##  $ TitleMs          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleRev         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleSir         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Titlethe Countess: num  0 0 0 0 0 0 0 0 0 0 ...
str(train_feature_names)
##  chr [1:24] "Pclass" "Age" "Fare" "FamilySize" "IsAlone" "Sexfemale" ...
if (nrow(xgb_test_matrix) == 418 && ncol(xgb_test_matrix) == length(train_feature_names)) {
  dtest <- xgb.DMatrix(data = as.matrix(xgb_test_matrix))
  test_preds <- predict(xgb_model, newdata = dtest)

  # Convert to binary prediction
  test_pred_class <- ifelse(test_preds > 0.5, 1, 0)
  str(test_pred_class)
  # Build Kaggle submission
  submission <- data.frame(PassengerId = test$PassengerId, Survived = test_pred_class)
  write.csv(submission, "XGBoostSubmission_ModelMatrix.csv", row.names = FALSE)
} else {
  cat("❌ ERROR: Test matrix shape mismatch. Check encoding or factor levels.\n")
  cat("Rows:", nrow(xgb_test_matrix), "Cols:", ncol(xgb_test_matrix), "\n")
  cat("Expected Cols:", length(train_feature_names), "\n")
}
##  num [1:418] 0 0 0 1 1 0 0 1 1 0 ...

10.1 Result of Prediction & Submission

File: XGBoostSubmission_ModelMatrix.csv

Comment: Predictions on the test dataset were exported in the Kaggle submission format (PassengerId, Survived).

Interpretation: The submission file contains the final predictions, ready for evaluation on unseen data.

10.2 Visualization of Prediction

library(ggplot2)

# Create data frame containing prediction results
test_pred_df <- data.frame(Predicted_Probability = test_preds)

# Draw the histogram
ggplot(test_pred_df, aes(x = Predicted_Probability)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white") +
  labs(
    title = "Histogram of Predicted Survival Probabilities (Test Set)",
    x = "Predicted Probability of Survival",
    y = "Count"
  ) +
  theme_minimal()

ggplot(test_pred_df, aes(x = Predicted_Probability)) +
  geom_density(fill = "skyblue", alpha = 0.6) +
  labs(title = "Density of Predicted Survival Probabilities",
       x = "Predicted Probability",
       y = "Density") +
  theme_minimal()

test_pred_df$Sex <- test$Sex  # Add gender
ggplot(test_pred_df, aes(x = Sex, y = Predicted_Probability, fill = Sex)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  geom_boxplot(width = 0.1, color = "black", alpha = 0.8) +
  labs(title = "Predicted Probabilities by Gender",
       x = "Gender", y = "Survival Probability") +
  theme_minimal()

# install.packages("ggdist"); install.packages("gghalves")
library(ggdist)
library(gghalves)

ggplot(test_pred_df, aes(x = Sex, y = Predicted_Probability, fill = Sex)) +
  stat_halfeye(adjust = 0.5, width = 0.6, .width = 0, point_color = NA) +
  geom_boxplot(width = 0.1, outlier.shape = NA, alpha = 0.3) +
  geom_jitter(aes(color = Sex), width = 0.1, alpha = 0.3) +
  theme_minimal() +
  labs(title = "Raincloud Plot: Gender vs Predicted Survival Probability")

# install.packages("flexdashboard")
library(flexdashboard)

gauge(mean(test_preds) * 100, min = 0, max = 100, 
      sectors = gaugeSectors(success = c(70, 100), warning = c(40, 70), danger = c(0, 40)),
      label = "Avg Survival Probability")
library(plotly)
plot_ly(test_pred_df, x = ~Predicted_Probability, type = "histogram",
        marker = list(color = 'rgba(100, 150, 250, 0.7)')) %>%
  layout(title = "Interactive Histogram of Predicted Survival Probabilities",
         xaxis = list(title = "Predicted Probability"),
         yaxis = list(title = "Count"))
library(plotly)

test$predicted <- test_preds
plot_ly(test, x = ~Age, y = ~Fare,
        color = ~predicted,
        type = 'scatter',
        mode = 'markers',
        text = ~paste("Sex:", Sex,
                      "<br>Pclass:", Pclass,
                      "<br>FamilySize:", SibSp + Parch + 1),
        hoverinfo = "text") %>%
  layout(title = "Predicted Survival by Age and Fare",
         xaxis = list(title = "Age"),
         yaxis = list(title = "Fare"))

11 Test Set Predictions Visualization

11.1 Predicted Survival Probabilities Histogram

11.2 Final Evaluation Summary

| Aspect | Comment |
|---|---|
| Data Cleaning | Proper handling of missing values and feature engineering |
| Modeling Techniques | Logistic Regression and XGBoost used; interpretable and powerful |
| Visualization | Multiple meaningful plots: PDP, ROC, Clusters, Gain Chart |
| Evaluation Metrics | ROC AUC, Confusion Matrix, and Gain Chart analyzed |
| Result Explanation | Clear and insightful; aligns with real-world knowledge |
| Deliverable Quality | Submission file created in correct format |

11.3 Final Summary


11.3.1 Interactive Histogram of Predicted Survival Probabilities

This histogram visualizes the distribution of the predicted survival probabilities output by the model for passengers in the test dataset.

11.3.2 Key Observations:

  • The distribution is bimodal — with peaks near 0.0 and 1.0, indicating that the model made very confident predictions for most passengers.
    • Predictions close to 0.0 imply that the model strongly believes the passenger did not survive.
    • Predictions close to 1.0 imply a high likelihood of survival.
  • Fewer predictions are made around 0.4–0.6, which represents a zone of uncertainty.

**Interpretation:**
The model demonstrates strong certainty in its predictions. However, this level of confidence may suggest **overfitting**, particularly if the model is heavily influenced by a dominant feature such as gender.

Caution:
A good model should balance between confident and uncertain predictions. The lack of moderate probability values could indicate limited nuance in complex cases or insufficient model regularization.
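In xgboost, regularization is controlled by parameters such as eta (learning rate), max_depth, lambda (L2 penalty), and subsample. A hypothetical, untuned configuration for comparison against the 50-round default:

``` r
# Illustrative regularized settings (assumptions, not tuned): shallower
# trees, slower learning, L2 penalty, and row subsampling
set.seed(123)
xgb_reg <- xgboost(data = dtrain, nrounds = 100,
                   eta = 0.1, max_depth = 3, lambda = 1, subsample = 0.8,
                   objective = "binary:logistic", verbose = 0)
```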


11.3.3 Overall Evaluation

  • The model performs in a way that strongly favors binary outcomes, aligning with the Titanic context where survival often depended on clear factors like gender and class.
  • There is a need to verify whether this confidence is justified by real-world accuracy by comparing predicted values with actual outcomes.
  • Further analysis (e.g., ROC curve, confusion matrix, feature importance, SHAP values) is recommended to assess model robustness and identify possible bias.
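For the SHAP analysis recommended above, xgboost can return per-feature SHAP contributions directly; a minimal sketch using the trained model and training matrix:

``` r
# Per-passenger, per-feature SHAP contributions (the last column is the bias)
shap_vals <- predict(xgb_model, dtrain, predcontrib = TRUE)

# Mean absolute SHAP value per feature as a global importance ranking
sort(colMeans(abs(shap_vals)), decreasing = TRUE)[1:10]
```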

11.3.4 Result of Prediction & Submission

File: XGBoostSubmission_ModelMatrix.csv

Comment: Predictions on the test dataset were exported in the Kaggle submission format.

Interpretation: The submission file contains the final predictions, ready for evaluation on unseen data.

## Comparing Test Predictions with Actual Values - Visualization

# Read the reference-label file
# (Note: gender_submission.csv is Kaggle's sample submission, not the true
# outcomes, so this comparison is illustrative only)
truth <- gender_submission

# Attach the reference labels to the test set
test$PassengerId <- truth$PassengerId  # match the IDs (rows share the same order)
test$TrueSurvived <- truth$Survived

# Compare predicted classes with the reference labels; tabulating the raw
# probabilities would print one row per unique predicted value
test$predicted_class <- ifelse(test$predicted > 0.5, 1, 0)
table(Predicted = test$predicted_class, Actual = test$TrueSurvived)
library(ggplot2)

# Predictions above 0.5 are classed as survived (already computed above)
test$predicted_class <- ifelse(test$predicted > 0.5, 1, 0)

# Plot predicted class vs actual
ggplot(test, aes(x = factor(TrueSurvived), fill = factor(predicted_class))) +
  geom_bar(position = "dodge") +
  labs(title = "Prediction vs Actual Survival",
       x = "Actual", fill = "Predicted Class")

test$Index <- 1:nrow(test)
ggplot(test, aes(x = Index)) +
  geom_point(aes(y = TrueSurvived), color = "gold", size = 2) +
  geom_point(aes(y = predicted), color = "steelblue", shape = 4) +
  labs(title = "True vs Predicted Survival", y = "Survival (0 = No, 1 = Yes)")

test$predicted_class <- ifelse(test$predicted > 0.5, 1, 0)

ggplot(test, aes(x = Index)) +
  geom_point(aes(y = TrueSurvived), color = "gold", size = 2) +
  geom_point(aes(y = predicted_class), color = "steelblue", shape = 4) +
  labs(title = "True vs Predicted Class", y = "Survival (0 = No, 1 = Yes)")

11.4 Visual Summary

11.4.1 Bar Plot: Prediction vs Actual

  • Most passengers who did not survive (Actual = 0) were correctly predicted.
  • A good number of survivors (Actual = 1) were also correctly identified.
  • Misclassifications are visible as small bars in the opposite color.
  • The plot reflects strong class separation, with minor false positives and negatives.

11.4.2 Scatter Plot: True vs Predicted Probability

  • Gold dots: Actual outcomes (0 or 1)
  • Blue X’s: Predicted probabilities (ranging from 0 to 1)
  • Predictions are mostly near 0 or 1, indicating high confidence.
  • Overlaps = correct predictions; gaps = incorrect predictions ❌

11.4.3 Conclusion

  • The model performs well overall.
  • Visuals confirm most predictions are accurate and confident.
  • Minor errors are present but not dominant.

11.4.4 Future Enhancements

Here are some ideas to improve the Titanic survival prediction project in the future:

11.4.4.1 1. Add New Features
  • Create new columns from existing data:
    • Get titles (Mr, Miss) from the Name column
    • Extract deck levels from the Cabin column
    • Group people based on the Ticket number
  • These can help the model find more useful patterns.
11.4.4.2 2. Improve Model Explanation
  • Use SHAP or LIME to explain why the model makes certain predictions.
  • This is useful for understanding and trusting the model.
11.4.4.3 3. Use Ensemble Methods
  • Combine different models like Logistic Regression, Random Forest, and XGBoost.
  • These combinations often perform better than one model alone.
11.4.4.4 4. Handle Class Imbalance
  • There are more people who died than survived.
  • Use SMOTE or class weights to fix this and improve fairness and accuracy.
11.4.4.5 5. Build a Simple User App
  • Create a web app using R Shiny.
  • Users can enter data and see predictions directly.
11.4.4.6 6. Tune Model Parameters
  • Try different model settings with Grid Search or Random Search.
  • This can improve the model’s performance.
11.4.4.7 7. Group-Specific Models
  • Make separate models for men and women, or for each passenger class.
  • This might give better results for each group.
11.4.4.8 8. Use More Evaluation Metrics
  • Look at more than just accuracy.
  • Include AUC, F1-score, Precision, and Recall.
11.4.4.9 9. Use Cross-Validation
  • Split the data into folds and validate on each in turn (see the sketch after this list).
  • This checks how well the model works on new data.
11.4.4.10 10. Try Deep Learning (Optional)
  • If more data is available, try deep learning models.
  • These might give better results with larger datasets.
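As a concrete starting point for ideas 6 and 9, a minimal 5-fold cross-validation sketch with caret (the model and feature choices here are illustrative):

``` r
library(caret)

# 5-fold CV on a simple logistic model; train() reports held-out Accuracy/Kappa
set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)
cv_fit <- train(Survived ~ Pclass + Sex + Age + Fare + FamilySize + IsAlone,
                data = train, method = "glm", family = "binomial",
                trControl = ctrl)
cv_fit
```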

12 References

  1. Titanic - Machine Learning from Disaster (Kaggle)
    This is the official Kaggle competition page. It provides the dataset, problem description, evaluation method, and a place to submit your predictions.

  2. Titanic Data Science Solutions by StartUpSci (Kaggle Notebook)
    A very popular and detailed notebook on Kaggle. It shows step-by-step how to clean data, create new features, and build machine learning models.

  3. XGBoost Official Documentation
    The official guide for using the XGBoost machine learning algorithm. It explains how to tune parameters and use the model in R or Python.

  4. SHAP (SHapley Additive exPlanations)
    A helpful tool to explain how machine learning models make predictions. Useful for understanding the impact of each feature on the outcome.