1 Project Purpose & Objectives

The objective of this project is to develop and evaluate machine learning models that predict passenger survival on the Titanic, using logistic regression and more advanced algorithms such as XGBoost. Key goals include:

  • Apply logistic regression for classification.
  • Build interpretable visualizations and partial dependence plots (PDPs).
  • Perform bias analysis across gender and passenger class.
  • Utilize unsupervised learning (K-Means, PCA) for passenger segmentation.
  • Engineer new features (e.g., IsAlone, FamilySize).
  • Generate Kaggle-compatible predictions from the test set.
  • Suggest further improvements such as SHAP/LIME analysis and ensemble modeling.

2 Data Loading and Preparation

library(iml)
library(tidyverse)
library(data.table)

# File path (adjust to wherever the Kaggle CSV files live on your machine)
data_path <- "C:/Users/Admin/Desktop/Guelph/assg4/"
train <- read.csv(paste0(data_path, "train.csv"), stringsAsFactors = FALSE)
test <- read.csv(paste0(data_path, "test.csv"), stringsAsFactors = FALSE)
gender_submission <- read.csv(paste0(data_path, "gender_submission.csv"), stringsAsFactors = FALSE)

## Dataset Overview

# Show dataset structure and summary
#str(train)

library(DT)
datatable(train, 
          filter = "top",     # Adds a filter box above each column
          options = list(
            pageLength = 15,  # 15 rows per page
            autoWidth = TRUE  # Automatic column widths
          ))
# Load package
library(skimr)

# Run skim
skim(train)
Data summary

  Name: train
  Number of rows: 891
  Number of columns: 12
  Column type frequency: character (5), numeric (7)
  Group variables: None

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Name | 0 | 1 | 12 | 82 | 0 | 891 | 0 |
| Sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| Ticket | 0 | 1 | 3 | 18 | 0 | 681 | 0 |
| Cabin | 0 | 1 | 0 | 15 | 687 | 148 | 0 |
| Embarked | 0 | 1 | 0 | 1 | 2 | 4 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 0 | 1.0 | 446.00 | 257.35 | 1.00 | 223.50 | 446.00 | 668.5 | 891.00 | ▇▇▇▇▇ |
| Survived | 0 | 1.0 | 0.38 | 0.49 | 0.00 | 0.00 | 0.00 | 1.0 | 1.00 | ▇▁▁▁▅ |
| Pclass | 0 | 1.0 | 2.31 | 0.84 | 1.00 | 2.00 | 3.00 | 3.0 | 3.00 | ▃▁▃▁▇ |
| Age | 177 | 0.8 | 29.70 | 14.53 | 0.42 | 20.12 | 28.00 | 38.0 | 80.00 | ▂▇▅▂▁ |
| SibSp | 0 | 1.0 | 0.52 | 1.10 | 0.00 | 0.00 | 0.00 | 1.0 | 8.00 | ▇▁▁▁▁ |
| Parch | 0 | 1.0 | 0.38 | 0.81 | 0.00 | 0.00 | 0.00 | 0.0 | 6.00 | ▇▁▁▁▁ |
| Fare | 0 | 1.0 | 32.20 | 49.69 | 0.00 | 7.91 | 14.45 | 31.0 | 512.33 | ▇▁▁▁▁ |
# Show dataset structure and summary
str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
summary(train)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 
# Display first few rows
head(train)
# Count of total passengers
nrow(train)
## [1] 891
# Survival distribution
table(train$Survived)
## 
##   0   1 
## 549 342
prop.table(table(train$Survived)) * 100
## 
##        0        1 
## 61.61616 38.38384
# Gender distribution
table(train$Sex)
## 
## female   male 
##    314    577
prop.table(table(train$Sex)) * 100
## 
##  female    male 
## 35.2413 64.7587
# Class distribution
table(train$Pclass)
## 
##   1   2   3 
## 216 184 491
# Embarkation distribution
table(train$Embarked)
## 
##       C   Q   S 
##   2 168  77 644
# Fare and Age summaries
summary(train$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.91   14.45   32.20   31.00  512.33
summary(train$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177
# Missing value counts
colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
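Note that is.na() flags only Age, while skim() above counted 687 empty Cabin values: the blanks in Cabin and Embarked are empty strings, not NA. A quick check (a minimal sketch):

``` r
# Empty strings are invisible to is.na(); count them per column
sapply(train, function(x) sum(x == "", na.rm = TRUE))
```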
# Feature engineering
#train$Survived <- as.factor(train$Survived)
#test$Survived <- NULL
#train$FamilySize <- train$SibSp + train$Parch + 1
#train$IsAlone <- ifelse(train$FamilySize == 1, 1, 0)

3 Feature Analysis and Suitability Assessment

| Feature | Type | Notes | Include | Reasoning |
|---|---|---|---|---|
| PassengerId | Numeric ID | Unique identifier, not useful for prediction | No | Pure index value; no analytical value |
| Survived | Binary target | The target variable | Yes | This is the variable we aim to predict |
| Pclass | Ordinal | Socioeconomic status indicator | Yes | Clear correlation with survival rates |
| Name | Text | Raw name not useful; titles can be extracted | Yes | Extract title to capture gender/social status |
| Sex | Categorical | Strong predictor | Yes | Female passengers had higher survival rates |
| Age | Numeric | Missing values exist | Yes | Fill missing values using median or model-based imputation |
| SibSp | Numeric | Number of siblings/spouses aboard | Yes | Used in FamilySize calculation |
| Parch | Numeric | Number of parents/children aboard | Yes | Combined with SibSp to compute FamilySize |
| Ticket | Text | Raw form noisy; prefix or group size can be engineered | Yes | Optional engineered features can be extracted |
| Fare | Numeric | Ticket price, economic status indicator | Yes | Often correlates with class and survival |
| Cabin | Text | Many missing values; deck letter can be extracted | Yes | Use deck letter as feature; impute missing as ‘Unknown’ |
| Embarked | Categorical | Port of embarkation (C, Q, S) | Yes | Region-based travel behavior indicator |
| FamilySize | Engineered | SibSp + Parch + 1 | Yes | Captures family group size effect |
| IsAlone | Engineered | Binary flag for passengers traveling alone | Yes | Strong indicator for survival outcome |
| Title | Engineered | Extracted from Name | Yes | Captures gender/status information embedded in name |

3.1 Summary of Selected Features for Modeling

We will use the following features:

  • Pclass
  • Sex
  • Age (with imputation)
  • Fare
  • Embarked
  • FamilySize (engineered)
  • IsAlone (engineered)
  • Title (engineered from Name)
  • Optionally: Deck (from Cabin) and TicketPrefix (from Ticket)

These features will be carried forward into correlation analysis, modeling, and feature selection.
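The optional Deck feature can be derived from the first character of Cabin, with blanks mapped to "Unknown" as proposed in the table above. A minimal sketch:

``` r
# Deck letter from the leading character of Cabin; blanks become "Unknown"
train$Deck <- ifelse(is.na(train$Cabin) | train$Cabin == "", "Unknown",
                     substr(train$Cabin, 1, 1))
table(train$Deck)
```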

 
## Data Preparation and Feature Importance

**Comment:**
- Missing values (especially `Age` and `Fare`) were filled using the median.
- New features such as `FamilySize` and `IsAlone` were created.
- Titles were extracted from the `Name` column for deeper insight.

**Interpretation:**
This step improves data quality and helps the model learn: imputing missing values avoids discarding rows, and the engineered features expose family-structure and social-status patterns useful for prediction.


``` r
library(tidyverse)
library(corrplot)
library(randomForest)
library(caret)
# Convert relevant variables to factors
train$Sex <- as.factor(train$Sex)
train$Embarked[train$Embarked == ""] <- NA
train$Embarked[is.na(train$Embarked)] <- "S"
train$Embarked <- as.factor(train$Embarked)

# Impute missing numeric values with median
train$Age[is.na(train$Age)] <- median(train$Age, na.rm = TRUE)
train$Fare[is.na(train$Fare)] <- median(train$Fare, na.rm = TRUE)

# Feature engineering
train$FamilySize <- train$SibSp + train$Parch + 1
train$IsAlone <- ifelse(train$FamilySize == 1, 1, 0)

# Extract title from Name
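# (the regex captures the text between the comma and the following period,
#  e.g. "Braund, Mr. Owen Harris" -> "Mr")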
train$Title <- gsub("^.*, (.*?)\\..*$", "\\1", train$Name)
train$Title[!(train$Title %in% c("Mr", "Miss", "Mrs", "Master"))] <- "RareTitle"
train$Title <- factor(train$Title)

# Convert Survived to factor
train$Survived <- as.factor(train$Survived)
# Prepare numeric-only subset for correlation
numeric_vars <- train %>% 
  select_if(is.numeric) %>%
  select(-PassengerId)  # Drop ID

# Compute and plot correlation matrix
cor_matrix <- cor(numeric_vars)
corrplot(cor_matrix, method = "color", addCoef.col = "black", tl.cex = 0.8)

# One-hot encoding for categorical variables
dummies <- dummyVars(" ~ Sex + Embarked + Title", data = train)
categorical_encoded <- data.frame(predict(dummies, newdata = train))

# Combine all features
train_features <- bind_cols(
  train %>% select(Survived, Pclass, Age, Fare, FamilySize, IsAlone),
  categorical_encoded
)

# Convert Survived to factor (target variable)
train_features$Survived <- as.factor(train_features$Survived)

# Random Forest with all features
set.seed(123)
rf_full <- randomForest(Survived ~ ., data = train_features, importance = TRUE, ntree = 200)

# Convert importance to data frame
importance_df <- as.data.frame(importance(rf_full))
importance_df$Feature <- rownames(importance_df)
rownames(importance_df) <- NULL

# Sort by MeanDecreaseAccuracy
importance_df <- importance_df %>%
  arrange(MeanDecreaseAccuracy)

# Plot with ggplot2
ggplot(importance_df, aes(x = MeanDecreaseAccuracy, y = reorder(Feature, MeanDecreaseAccuracy))) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Random Forest Feature Importance (Mean Decrease Accuracy)",
       x = "Mean Decrease Accuracy",
       y = "Feature") +
  theme_minimal()
```

3.2 Key Observations from Correlation Analysis

  • SibSp and FamilySize: 0.89 correlation.
    Very strong positive correlation, as expected, since FamilySize = SibSp + Parch + 1.

  • Parch and FamilySize: 0.78 correlation.
    Another strong relationship that confirms the redundancy.

  • IsAlone and FamilySize: -0.69 correlation.
    Inversely related: being alone corresponds to FamilySize == 1.

  • Fare and Pclass: -0.55 correlation.
    Negative because Pclass is coded 1-3: lower class numbers (higher classes) correspond to higher fares.

  • Other correlations (e.g., Age vs. Pclass) were moderate to weak and less informative.

3.3 Interpretation of Random Forest Feature Importance

General Meaning:

The bar chart displays each feature’s contribution to model accuracy. Mean Decrease Accuracy is permutation-based: a higher value means that shuffling the feature degrades the model’s accuracy more, so the feature is more influential in predicting survival.


3.3.1 Top Features

  • Pclass: Highest importance — passenger class (1st, 2nd, 3rd) is strongly related to survival.
  • FamilySize: Total number of family members onboard. People traveling with family show different survival rates.
  • Fare: Ticket price, which may reflect socioeconomic status. Higher fares are linked to higher survival chances.
  • Sex.female / Sex.male: Gender is highly predictive — women were more likely to survive.
  • Age: Important factor, though slightly less than above. Young children and adults showed different patterns.

3.3.2 Medium-Impact Features

  • Title.Mr / Title.Miss / Title.Master, etc.: Extracted from names. Provides information about age, gender, and social status.
  • Embarked.C / Embarked.S / Embarked.Q: Port of embarkation. Some influence, possibly due to class or route.
  • IsAlone: Passengers traveling alone may have had lower survival, though this signal is weaker than FamilySize.

3.3.3 Low-Impact or Redundant Features

  • Rare titles (e.g., Col, Don, Mlle): Very few passengers carry these titles, so their predictive contribution is low. The preprocessing above already groups them under a single "RareTitle" category.

3.3.4 Final Recommendation

Retain the top ~10 features:

  • Pclass, FamilySize, Fare, Sex, Age, Embarked, Title, IsAlone.

These will provide the most predictive power for survival.

Optional / Can be removed:

  • Rare titles and any categorical levels with very low frequency.

3.3.5 Feature Engineering Recommendations

  • Avoid using SibSp and Parch separately if FamilySize is included — this will help prevent multicollinearity.
  • Use IsAlone and FamilySize as engineered features instead — they carry distinct and informative signals.
  • Consider analyzing Pclass and Fare together as indicators of economic status; a quick multicollinearity check is sketched below.
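To make the multicollinearity concern concrete, variance inflation factors (VIFs) can be computed on a simple logistic model; values well above 5 suggest redundant predictors. A minimal sketch, assuming the `car` package is installed (it is not used elsewhere in this report):

``` r
library(car)

# Logistic model on the engineered features; FamilySize and IsAlone are
# correlated but not perfectly collinear, so vif() runs without aliasing
vif_check <- glm(Survived ~ Pclass + Age + Fare + FamilySize + IsAlone,
                 data = train, family = binomial)
vif(vif_check)
```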
## Clustering with K-Means & PCA

library(cluster)
library(factoextra)
library(plotly)

# Selected top features
cluster_input <- train %>% 
  select(Pclass, Age, Fare, FamilySize, IsAlone,
         Title, Sex, Embarked)

# Convert categorical variables to numeric (one-hot encoding)
dummies <- dummyVars(" ~ .", data = cluster_input)
cluster_matrix <- predict(dummies, newdata = cluster_input)
cluster_df <- as.data.frame(cluster_matrix)

# Keep only complete cases
valid_rows <- complete.cases(cluster_df)
features_cluster <- scale(cluster_df[valid_rows, ])

# K-Means clustering
set.seed(123)
kclust <- kmeans(features_cluster, centers = 3, nstart = 25)
train$Cluster <- NA
train$Cluster[valid_rows] <- kclust$cluster
train$Cluster <- factor(train$Cluster)

# Visualize with PCA
pca <- prcomp(features_cluster)
pca_df <- as.data.frame(pca$x[, 1:2])
pca_df$Cluster <- factor(kclust$cluster)

# Examine survival rates by cluster
train %>%
  filter(!is.na(Cluster)) %>%
  group_by(Cluster, Survived) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1)) %>%
  arrange(Cluster, desc(Survived))
plot_ly(pca_df, x = ~PC1, y = ~PC2, color = ~Cluster,
        type = 'scatter', mode = 'markers') %>%
  layout(title = "K-Means Clustering with Top Features (PCA Projection)")
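The choice of k = 3 above is fixed by hand. An elbow plot from factoextra (already loaded) can sanity-check it; a minimal sketch:

``` r
# Within-cluster sum of squares for k = 1..8; look for the "elbow" where
# adding clusters stops reducing WSS substantially
fviz_nbclust(as.data.frame(features_cluster), kmeans, method = "wss", k.max = 8)
```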

3.4 Result of Clustering with PCA & K-Means

3.5 Interpretation: Survival Rates by Cluster

Based on the K-Means clustering and survival breakdown table:

  • Cluster 2 (🟠 Orange) has the highest survival rate, with 84% of passengers in this group surviving.
    This suggests that Cluster 2 represents passengers with favorable characteristics such as:

    • Higher Fare
    • First Class (Pclass = 1)
    • Possibly Female and/or from families
  • Cluster 3 (🔵 Blue) shows a moderate survival rate of 74.2%.
    This cluster may contain passengers with some survival advantages but not as strong as Cluster 2.

  • Cluster 1 (🟢 Green) has the lowest survival rate, with only 57.5% of its passengers surviving.
    ⚠ It likely represents passengers who were alone (IsAlone = 1), younger, or from lower fare brackets.

3.6 Summary

  • Clusters are highly informative and indicate clear survival patterns.
  • Cluster 2 likely represents the safest group, possibly women or families in higher classes.
  • These clusters could be used for segmentation, model stratification, or targeted survival analysis in further modeling.

4 Categorical Feature Analysis

library(ggplot2)
library(dplyr)
library(tidyr)

# 1. Gender-Based Survival
p1 <- train %>%
  group_by(Sex, Survived) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Sex) %>%                     # percentages within each gender
  mutate(Percent = round(Count / sum(Count) * 100, 1)) %>%
  ungroup() %>%
  ggplot(aes(x = Sex, y = Percent, fill = as.factor(Survived))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Survival Rate by Gender", y = "%", fill = "Survived") +
  theme_minimal()

# 2. Class-Based Survival
p2 <- train %>%
  group_by(Pclass, Survived) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Pclass) %>%                  # percentages within each class
  mutate(Percent = round(Count / sum(Count) * 100, 1)) %>%
  ungroup() %>%
  ggplot(aes(x = as.factor(Pclass), y = Percent, fill = as.factor(Survived))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Survival Rate by Passenger Class", x = "Pclass", y = "%", fill = "Survived") +
  theme_minimal()

# 3. Embarked-Based Survival
p3 <- train %>%
  group_by(Embarked, Survived) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Embarked) %>%                # percentages within each port
  mutate(Percent = round(Count / sum(Count) * 100, 1)) %>%
  ungroup() %>%
  ggplot(aes(x = Embarked, y = Percent, fill = as.factor(Survived))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Survival Rate by Embarkation Port", y = "%", fill = "Survived") +
  theme_minimal()

# 4. Title-Based Survival (from engineered Title)
p4 <- train %>%
  group_by(Title, Survived) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Title) %>%                   # percentages within each title
  mutate(Percent = round(Count / sum(Count) * 100, 1)) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(Title, -Percent), y = Percent, fill = as.factor(Survived))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Survival Rate by Title", y = "%", fill = "Survived") +
  theme_minimal() +
  coord_flip()

# Display multiple plots (optionally with patchwork)
library(patchwork)
(p1 | p2) / (p3 | p4)

4.1 Key Takeaways from Categorical Feature Analysis

4.1.1 Gender (Sex)

  • Female passengers had a significantly higher survival rate compared to males.
  • This confirms the historical “women and children first” policy during evacuation.
  • Males make up a large portion of non-survivors.

4.1.2 Passenger Class (Pclass)

  • 1st class passengers had the highest chance of survival.
  • 3rd class passengers had the lowest survival rate by far.
  • Socioeconomic status was a critical factor in determining access to lifeboats and evacuation priority.

4.1.3 Embarkation Port (Embarked)

  • Passengers who boarded from port ‘C’ (Cherbourg) had a noticeably higher survival rate.
  • Those from ‘S’ (Southampton) and ‘Q’ (Queenstown) had lower survival rates.
  • This might relate to the distribution of class by embarkation location (e.g., more 1st class passengers from Cherbourg).

4.1.4 Title (Extracted from Name)

  • Titles like ‘Miss’, ‘Mrs’, and ‘Master’ had high survival rates.
  • Passengers with ‘Mr’ and rare titles (e.g., ‘Rev’, ‘Col’, ‘Don’) had significantly lower survival rates.
  • Titles help the model capture social role, age, or gender indirectly, making it a valuable engineered feature.

4.2 Summary

This set of visualizations confirms that Sex, Pclass, Embarked, and engineered features like Title have strong predictive value. These features are crucial for the model and should be retained during feature selection.

5 Logistic Regression & Partial Dependence Plots (PDP)

library(DALEX)
library(iml)

# Prepare model dataset
logreg_data <- train %>%
  mutate(
    Sex = as.factor(Sex),
    Embarked = ifelse(Embarked == "", NA, Embarked),
    Embarked = factor(Embarked),
    Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
    Title = ifelse(Title %in% c("Mr", "Miss", "Mrs", "Master"), Title, "RareTitle"),
    Title = factor(Title),
    FamilySize = SibSp + Parch + 1,
    IsAlone = ifelse(FamilySize == 1, 1, 0)
  ) %>%
  select(Survived, Pclass, Sex, Age, Fare, FamilySize, IsAlone, Embarked, Title) %>%
  drop_na()

# Logistic regression with more informative features
log_model <- glm(Survived ~ Pclass + Sex + Age + Fare + FamilySize + IsAlone + Embarked + Title,
                 data = logreg_data, family = binomial)

# PDP preparation (y as 0/1 numeric rather than factor codes 1/2)
predictor <- Predictor$new(
  model = log_model,
  data = logreg_data %>% select(-Survived),
  y = as.numeric(as.character(logreg_data$Survived))
)

# Generate PDPs for top features
pdp_age <- FeatureEffect$new(predictor, feature = "Age", method = "pdp")
pdp_fare <- FeatureEffect$new(predictor, feature = "Fare", method = "pdp")
pdp_pclass <- FeatureEffect$new(predictor, feature = "Pclass", method = "pdp")
pdp_sex <- FeatureEffect$new(predictor, feature = "Sex", method = "pdp")

# Plot PDPs
plot(pdp_age) + ggtitle("PDP: Age")

plot(pdp_fare) + ggtitle("PDP: Fare")

plot(pdp_pclass) + ggtitle("PDP: Pclass")

plot(pdp_sex) + ggtitle("PDP: Sex")

5.1 Result of Logistic Regression & PDP (Partial Dependence Plots)

5.1.1 PDP: Age

  • Observation: Strong negative relationship between age and survival.
  • Interpretation: Younger passengers had a higher survival probability.
  • Why: Children were prioritized in lifeboat boarding during the disaster.

5.1.2 PDP: Fare

  • Observation: Fare shows a positive effect on survival probability.
  • Interpretation: Passengers who paid more had higher chances of survival.
  • Why: Fare reflects class and access to better rescue conditions.

5.2 PDP: Passenger Class (Pclass)

  • Observation: Class level inversely affects survival.
  • Interpretation: 1st class passengers were most likely to survive, 3rd class the least.
  • Why: Higher-class locations provided better access to lifeboats.

5.3 PDP: Sex

  • Observation: Sharp difference by gender:
    • Female → Higher survival.
    • Male → Lower survival.
  • Interpretation: Gender is a strong predictor.
  • Why: Evacuation protocols prioritized women.
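As a sanity check on what these PDP curves represent, the Age effect can be reproduced by hand: hold Age at each grid value for every row, average the model's predicted probabilities, and plot the averages. A minimal sketch using the objects defined above:

``` r
# Manual partial dependence for Age: set Age to each grid value for all
# rows, then average the predicted survival probabilities
age_grid <- seq(min(logreg_data$Age), max(logreg_data$Age), length.out = 30)
pdp_manual <- sapply(age_grid, function(a) {
  d <- logreg_data
  d$Age <- a
  mean(predict(log_model, newdata = d, type = "response"))
})
plot(age_grid, pdp_manual, type = "l",
     xlab = "Age", ylab = "Average predicted survival probability",
     main = "Manual PDP: Age")
```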

6 XGBoost Modeling

library(xgboost)
library(caret)
library(dplyr)

# Feature Engineering for Training Data
train_xgb <- train %>%
  mutate(
    Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
    Fare = ifelse(is.na(Fare), median(Fare, na.rm = TRUE), Fare),
    FamilySize = SibSp + Parch + 1,
    IsAlone = ifelse(FamilySize == 1, 1, 0),
    Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
    Sex = factor(Sex),
    Embarked = factor(ifelse(Embarked == "", "S", Embarked)),
    Title = factor(Title)
  )

# Re-apply the Section 3 conversions on train (no-op if already run)
train$Embarked[train$Embarked == ""] <- NA
train$Embarked[is.na(train$Embarked)] <- "S"
train$Embarked <- as.factor(train$Embarked)
train$Title <- gsub("^.*, (.*?)\\..*$", "\\1", train$Name)
train$Title[!(train$Title %in% c("Mr", "Miss", "Mrs", "Master"))] <- "RareTitle"
train$Title <- factor(train$Title)
train$Sex <- as.factor(train$Sex)
# One-hot encoding for categorical variables
dummies <- dummyVars(" ~ Sex + Embarked + Title", data = train_xgb)
cat_encoded <- predict(dummies, newdata = train_xgb)

# Combine numeric + encoded features
xgb_features <- cbind(
  train_xgb %>% select(Pclass, Age, Fare, FamilySize, IsAlone),
  cat_encoded
)

# Create label vector
label <- as.numeric(train$Survived) - 1

# Create DMatrix (preserves feature names)
dtrain <- xgb.DMatrix(data = as.matrix(xgb_features), label = label)

# Train XGBoost model
set.seed(123)
xgb_model <- xgboost(
  data = dtrain,
  nrounds = 50,
  objective = "binary:logistic",
  verbose = 0
)
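Here nrounds = 50 is a fixed guess. Cross-validation can choose it instead; a minimal sketch with xgb.cv and early stopping:

``` r
# 5-fold CV; stops once the held-out error fails to improve for 10 rounds
set.seed(123)
cv <- xgb.cv(data = dtrain, nrounds = 200, nfold = 5,
             objective = "binary:logistic",
             early_stopping_rounds = 10, verbose = 0)
cv$best_iteration
```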

7 Evaluation: Bias, ROC, Confusion Matrix

library(pROC)
library(caret)

# Preparation: bias_data must contain all of the model features
bias_data <- train %>%
  mutate(
    Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
    Fare = ifelse(is.na(Fare), median(Fare, na.rm = TRUE), Fare),
    FamilySize = SibSp + Parch + 1,
    IsAlone = ifelse(FamilySize == 1, 1, 0),
    Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
    Sex = factor(Sex),
    Embarked = factor(ifelse(Embarked == "", "S", Embarked)),
    Title = factor(Title)
  )

# Encode the categorical variables
bias_encoded <- predict(dummies, newdata = bias_data)

# Assemble into XGBoost input format
bias_features <- cbind(
  bias_data %>% select(Pclass, Age, Fare, FamilySize, IsAlone),
  bias_encoded
)

dpredict <- xgb.DMatrix(data = as.matrix(bias_features))

# Get predictions
train$pred <- predict(xgb_model, dpredict)

# Bias analysis: average predicted survival probability per group
train %>% group_by(Sex) %>% summarise(AvgPrediction = mean(pred, na.rm = TRUE))
train %>% group_by(Pclass) %>% summarise(AvgPrediction = mean(pred, na.rm = TRUE))
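# A fuller bias check (sketch): compare average predictions with observed
# survival rates per group; large gaps would flag group-level bias
train %>% group_by(Sex) %>%
  summarise(AvgPrediction = mean(pred, na.rm = TRUE),
            ActualRate = mean(as.numeric(as.character(Survived))))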
# ROC and AUC
roc_obj <- roc(as.numeric(train$Survived) - 1, train$pred)
plot(roc_obj, main = "ROC Curve - XGBoost")

auc(roc_obj)
## Area under the curve: 0.9856
# Confusion Matrix
pred_class <- ifelse(train$pred > 0.5, 1, 0)
confusionMatrix(factor(pred_class), factor(as.numeric(train$Survived) - 1))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 533  41
##          1  16 301
##                                           
##                Accuracy : 0.936           
##                  95% CI : (0.9179, 0.9512)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8629          
##                                           
##  Mcnemar's Test P-Value : 0.001478        
##                                           
##             Sensitivity : 0.9709          
##             Specificity : 0.8801          
##          Pos Pred Value : 0.9286          
##          Neg Pred Value : 0.9495          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5982          
##    Detection Prevalence : 0.6442          
##       Balanced Accuracy : 0.9255          
##                                           
##        'Positive' Class : 0               
## 

7.1 Result of Model Evaluation (ROC Curve & Confusion Matrix)

  • ROC curve showed the XGBoost model had high discrimination ability.
  • Confusion matrix validated prediction performance.

Interpretation:
The model demonstrated strong performance in distinguishing survivors from non-survivors. The high AUC (0.9856) supports this, though it is computed on the training data and is therefore likely optimistic.

8 Gain Chart for Model Evaluation

The Gain Chart is a powerful evaluation tool used in binary classification tasks. It helps visualize how effectively a model identifies positive cases—in this case, passengers who survived the Titanic disaster.

8.1 Purpose of Gain Chart

  • X-axis (% of Sample): Cumulative percentage of passengers, sorted by predicted survival probability (from highest to lowest).
  • Y-axis (% of Target Captured): Percentage of actual survivors captured within that portion of the population.

A well-performing model will have a curve that rises steeply and stays above the diagonal red line, indicating it correctly identifies most of the positive cases early on.

8.2 Gain Chart: XGBoost Titanic Model

library(dplyr)

# Step 1: Prepare prediction features (same as XGBoost training set)
gain_data <- train %>%
  mutate(
    Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
    Fare = ifelse(is.na(Fare), median(Fare, na.rm = TRUE), Fare),
    FamilySize = SibSp + Parch + 1,
    IsAlone = ifelse(FamilySize == 1, 1, 0),
    Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
    Sex = factor(Sex),
    Embarked = factor(ifelse(Embarked == "", "S", Embarked)),
    Title = factor(Title)
  )

# One-hot encode categorical variables
gain_encoded <- predict(dummies, newdata = gain_data)

# Final input for prediction
gain_features <- cbind(
  gain_data %>% select(Pclass, Age, Fare, FamilySize, IsAlone),
  gain_encoded
)

# Step 2: Make predictions
actuals <- as.numeric(train$Survived) - 1
preds <- predict(xgb_model, newdata = xgb.DMatrix(data = as.matrix(gain_features)))

# Step 3: Filter valid rows
valid_idx <- which(!is.na(actuals) & !is.na(preds))
gain_df <- data.frame(actual = actuals[valid_idx], predicted = preds[valid_idx])

# Step 4: Sort predictions and compute cumulative gains
gain_df <- gain_df %>%
  arrange(desc(predicted)) %>%
  mutate(
    row_id = row_number(),
    cum_pct_total = row_id / n(),
    cum_captured = cumsum(actual) / sum(actual)
  )

# Step 5: Plot Gain Chart
plot(gain_df$cum_pct_total * 100, gain_df$cum_captured * 100,
     type = "l", col = "blue", lwd = 2,
     xlab = "% of Sample", ylab = "% of Positive Cases Captured",
     main = "Gain Chart - Titanic Survival Prediction (XGBoost)")

abline(0, 1, col = "red", lty = 2)  # baseline
legend("bottomright", legend = c("XGBoost Model", "Random Baseline"),
       col = c("blue", "red"), lty = c(1, 2), bty = "n")

8.3 Result of Gain Chart

Comment: If the top 20% of ranked predictions capture about 60% of actual survivors, the model is considered strong; the sketch below reads this value off the curve.

Note: If the chart does not render, check that the two plotted vectors have the same length.

Interpretation: Gain charts assess how well the model ranks the most likely survivors first; they are commonly used in marketing and resource prioritization.
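A minimal sketch, reusing gain_df from above, to compute the captured share at the 20% cutoff:

``` r
# Share of survivors captured within the top 20% of ranked predictions
top20 <- which.min(abs(gain_df$cum_pct_total - 0.20))
round(gain_df$cum_captured[top20] * 100, 1)
```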

9 Test Predictions & Submission File

library(xgboost)
library(dplyr)

# Feature engineering
train_processed <- train %>%
  mutate(
    Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
    Fare = ifelse(is.na(Fare), median(Fare, na.rm = TRUE), Fare),
    FamilySize = SibSp + Parch + 1,
    IsAlone = ifelse(FamilySize == 1, 1, 0),
    Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
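    # Pitfall: ifelse() on a factor returns integer level codes, so Embarked
    # comes out numeric here (visible in the str() output below); coercing
    # with as.character() first would keep it categorical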
    Embarked = ifelse(Embarked == "", "S", Embarked)
  )
  
str(train_processed)
## 'data.frame':    891 obs. of  17 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 28 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : int  3 1 3 3 3 2 3 3 3 1 ...
##  $ FamilySize : num  2 2 1 2 1 1 1 5 3 2 ...
##  $ IsAlone    : num  0 0 1 0 1 1 1 0 0 0 ...
##  $ Title      : chr  "Mr" "Mrs" "Miss" "Mrs" ...
##  $ Cluster    : Factor w/ 3 levels "1","2","3": 2 3 3 3 2 2 2 1 3 3 ...
##  $ pred       : num  0.0617 0.998 0.4803 0.9985 0.088 ...
# Convert to factors (re-applies the Section 3 conversions; no-op if already run)
train$Embarked[train$Embarked == ""] <- NA
train$Embarked[is.na(train$Embarked)] <- "S"
train$Embarked <- as.factor(train$Embarked)
train$Title <- gsub("^.*, (.*?)\\..*$", "\\1", train$Name)
train$Title[!(train$Title %in% c("Mr", "Miss", "Mrs", "Master"))] <- "RareTitle"
train$Title <- factor(train$Title)
train$Sex <- as.factor(train$Sex)
# Create model matrix for XGBoost (dummy encoding included)
xgb_matrix <- model.matrix(~ Pclass + Age + Fare + FamilySize + IsAlone + Sex + Embarked + Title - 1, data = train_processed)

# Prepare label
label <- as.numeric(train$Survived) - 1

# Train model
dtrain <- xgb.DMatrix(data = xgb_matrix, label = label)
set.seed(123)
xgb_model <- xgboost(data = dtrain, nrounds = 50, objective = "binary:logistic", verbose = 0)
train_feature_names <- colnames(xgb_matrix)

# Column-alignment template (a no-op on the training matrix itself; the same
# steps are reused for the test matrix in the next section):
# fill in missing columns with zeros
missing_cols <- setdiff(train_feature_names, colnames(xgb_matrix))
for (col in missing_cols) {
  xgb_matrix <- cbind(xgb_matrix, setNames(data.frame(rep(0, nrow(xgb_matrix))), col))
}

# Drop extra columns
extra_cols <- setdiff(colnames(xgb_matrix), train_feature_names)
xgb_matrix <- xgb_matrix[, !(colnames(xgb_matrix) %in% extra_cols)]

# Reorder columns to match the training feature order
xgb_matrix <- xgb_matrix[, train_feature_names]

10 Predicting on the Test Set Using the Same model.matrix Encoding

# Step 1: Preprocessing test data
test_processed <- test %>%
  mutate(
    Age = ifelse(is.na(Age), median(train$Age, na.rm = TRUE), Age),
    Fare = ifelse(is.na(Fare), median(train$Fare, na.rm = TRUE), Fare),
    FamilySize = SibSp + Parch + 1,
    IsAlone = ifelse(FamilySize == 1, 1, 0),
    Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
    Embarked = ifelse(Embarked == "", "S", Embarked)
  )
str(test_processed)
## 'data.frame':    418 obs. of  14 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...
##  $ FamilySize : num  1 2 1 1 3 1 1 3 1 3 ...
##  $ IsAlone    : num  1 0 1 1 0 1 1 0 1 0 ...
##  $ Title      : chr  "Mr" "Mrs" "Mr" "Mr" ...
# Step 2: Align factor levels with training set
test_processed$Embarked[test_processed$Embarked == ""] <- NA
test_processed$Embarked[is.na(test_processed$Embarked)] <- "S"
test_processed$Embarked <- factor(test_processed$Embarked, levels = levels(train$Embarked))
test_processed$Title <- gsub("^.*, (.*?)\\..*$", "\\1", test$Name)
test_processed$Title[!(test_processed$Title %in% levels(train$Title))] <- "RareTitle"
test_processed$Title <- factor(test_processed$Title, levels = levels(train$Title))
test_processed$Sex <- factor(test_processed$Sex, levels = levels(train$Sex))
 
str(test_processed$Sex)
##  Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
str(test_processed$Embarked)
##  Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
str(test_processed$Title)
##  Factor w/ 5 levels "Master","Miss",..: 3 4 3 3 4 3 2 3 4 3 ...
summary(test_processed$Sex)
## female   male 
##    152    266
summary(test_processed$Embarked)
##   C   Q   S 
## 102  46 270
summary(test_processed$Title)
##    Master      Miss        Mr       Mrs RareTitle 
##        21        78       240        72         7
# Step 3: Create model matrix
xgb_test_matrix <- model.matrix(~ Pclass + Age + Fare + FamilySize + IsAlone + Sex + Embarked + Title - 1, 
                                data = test_processed)

# Step 4: Fix column mismatch
missing_cols <- setdiff(train_feature_names, colnames(xgb_test_matrix))
for (col in missing_cols) {
  xgb_test_matrix <- cbind(xgb_test_matrix, setNames(data.frame(rep(0, nrow(test_processed))), col))
}
extra_cols <- setdiff(colnames(xgb_test_matrix), train_feature_names)
xgb_test_matrix <- xgb_test_matrix[, !(colnames(xgb_test_matrix) %in% extra_cols)]
xgb_test_matrix <- xgb_test_matrix[, train_feature_names]  # ensure correct order

# Step 5: Prediction
str(xgb_test_matrix)
## 'data.frame':    418 obs. of  24 variables:
##  $ Pclass           : num  3 3 2 3 3 3 3 2 3 3 ...
##  $ Age              : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ Fare             : num  7.83 7 9.69 8.66 12.29 ...
##  $ FamilySize       : num  1 2 1 1 3 1 1 3 1 3 ...
##  $ IsAlone          : num  1 0 1 1 0 1 1 0 1 0 ...
##  $ Sexfemale        : num  0 1 0 0 1 0 1 0 1 0 ...
##  $ Sexmale          : num  1 0 1 1 0 1 0 1 0 1 ...
##  $ Embarked         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleCol         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleDon         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleDr          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleJonkheer    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleLady        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleMajor       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleMaster      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleMiss        : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ TitleMlle        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleMme         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleMr          : num  1 0 1 1 0 1 0 1 0 1 ...
##  $ TitleMrs         : num  0 1 0 0 1 0 0 0 1 0 ...
##  $ TitleMs          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleRev         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TitleSir         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Titlethe Countess: num  0 0 0 0 0 0 0 0 0 0 ...
str(train_feature_names)
##  chr [1:24] "Pclass" "Age" "Fare" "FamilySize" "IsAlone" "Sexfemale" ...
if (nrow(xgb_test_matrix) == 418 && ncol(xgb_test_matrix) == length(train_feature_names)) {
  dtest <- xgb.DMatrix(data = as.matrix(xgb_test_matrix))
  test_preds <- predict(xgb_model, newdata = dtest)

  # Convert to binary prediction
  test_pred_class <- ifelse(test_preds > 0.5, 1, 0)
  str(test_pred_class)
  # Build Kaggle submission
  submission <- data.frame(PassengerId = test$PassengerId, Survived = test_pred_class)
  write.csv(submission, "XGBoostSubmission_ModelMatrix.csv", row.names = FALSE)
} else {
  cat("❌ ERROR: Test matrix shape mismatch. Check encoding or factor levels.\n")
  cat("Rows:", nrow(xgb_test_matrix), "Cols:", ncol(xgb_test_matrix), "\n")
  cat("Expected Cols:", length(train_feature_names), "\n")
}
##  num [1:418] 0 0 0 1 1 0 0 1 1 0 ...

10.1 Result of Prediction & Submission

File: XGBoostSubmission_ModelMatrix.csv

Comment: Predictions on the test dataset were exported in the Kaggle submission format (PassengerId, Survived).

Interpretation: The submission file contains the final predictions, ready for evaluation on unseen data.

10.2 Visualization of Prediction

library(ggplot2)

# Create data frame containing prediction results
test_pred_df <- data.frame(Predicted_Probability = test_preds)

# Draw the histogram
ggplot(test_pred_df, aes(x = Predicted_Probability)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white") +
  labs(
    title = "Histogram of Predicted Survival Probabilities (Test Set)",
    x = "Predicted Probability of Survival",
    y = "Count"
  ) +
  theme_minimal()

ggplot(test_pred_df, aes(x = Predicted_Probability)) +
  geom_density(fill = "skyblue", alpha = 0.6) +
  labs(title = "Density of Predicted Survival Probabilities",
       x = "Predicted Probability",
       y = "Density") +
  theme_minimal()

test_pred_df$Sex <- test$Sex  # Add gender
ggplot(test_pred_df, aes(x = Sex, y = Predicted_Probability, fill = Sex)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  geom_boxplot(width = 0.1, color = "black", alpha = 0.8) +
  labs(title = "Predicted Probabilities by Gender",
       x = "Gender", y = "Survival Probability") +
  theme_minimal()

# install.packages("ggdist"); install.packages("gghalves")
library(ggdist)
library(gghalves)

ggplot(test_pred_df, aes(x = Sex, y = Predicted_Probability, fill = Sex)) +
  stat_halfeye(adjust = 0.5, width = 0.6, .width = 0, point_color = NA) +
  geom_boxplot(width = 0.1, outlier.shape = NA, alpha = 0.3) +
  geom_jitter(aes(color = Sex), width = 0.1, alpha = 0.3) +
  theme_minimal() +
  labs(title = "Raincloud Plot: Gender vs Predicted Survival Probability")

# install.packages("flexdashboard")
library(flexdashboard)

gauge(mean(test_preds) * 100, min = 0, max = 100, 
      sectors = gaugeSectors(success = c(70, 100), warning = c(40, 70), danger = c(0, 40)),
      label = "Avg Survival Probability")
library(plotly)
plot_ly(test_pred_df, x = ~Predicted_Probability, type = "histogram",
        marker = list(color = 'rgba(100, 150, 250, 0.7)')) %>%
  layout(title = "Interactive Histogram of Predicted Survival Probabilities",
         xaxis = list(title = "Predicted Probability"),
         yaxis = list(title = "Count"))
library(plotly)

test$predicted <- test_preds
plot_ly(test, x = ~Age, y = ~Fare,
        color = ~predicted,
        type = 'scatter',
        mode = 'markers',
        text = ~paste("Sex:", Sex,
                      "<br>Pclass:", Pclass,
                      "<br>FamilySize:", SibSp + Parch + 1),
        hoverinfo = "text") %>%
  layout(title = "Predicted Survival by Age and Fare",
         xaxis = list(title = "Age"),
         yaxis = list(title = "Fare"))

11 Test Set Predictions Visualization

11.1 Predicted Survival Probabilities Histogram

11.2 Final Evaluation Summary

| Aspect | Comment |
|---|---|
| Data Cleaning | Proper handling of missing values and feature engineering |
| Modeling Techniques | Logistic Regression and XGBoost used; interpretable and powerful |
| Visualization | Multiple meaningful plots: PDP, ROC, Clusters, Gain Chart |
| Evaluation Metrics | ROC AUC, Confusion Matrix, and Gain Chart analyzed |
| Result Explanation | Clear and insightful; aligns with real-world knowledge |
| Deliverable Quality | Submission file created in correct format |

11.3 Final Summary


11.3.1 Interactive Histogram of Predicted Survival Probabilities

This histogram visualizes the distribution of the predicted survival probabilities output by the model for passengers in the test dataset.

11.3.2 Key Observations:

  • The distribution is bimodal — with peaks near 0.0 and 1.0, indicating that the model made very confident predictions for most passengers.
    • Predictions close to 0.0 imply that the model strongly believes the passenger did not survive.
    • Predictions close to 1.0 imply a high likelihood of survival.
  • Fewer predictions are made around 0.4–0.6, which represents a zone of uncertainty.

**Interpretation:**
The model demonstrates strong certainty in its predictions. However, this level of confidence may suggest **overfitting**, particularly if the model is heavily influenced by a dominant feature such as gender.

Caution:
A good model should balance between confident and uncertain predictions. The lack of moderate probability values could indicate limited nuance in complex cases or insufficient model regularization.
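In xgboost, regularization is controlled by parameters such as eta (learning rate), max_depth, lambda (L2 penalty), and subsample. A hypothetical, untuned configuration for comparison against the 50-round default:

``` r
# Illustrative regularized settings (assumptions, not tuned): shallower
# trees, slower learning, L2 penalty, and row subsampling
set.seed(123)
xgb_reg <- xgboost(data = dtrain, nrounds = 100,
                   eta = 0.1, max_depth = 3, lambda = 1, subsample = 0.8,
                   objective = "binary:logistic", verbose = 0)
```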


11.3.3 Overall Evaluation

  • The model performs in a way that strongly favors binary outcomes, aligning with the Titanic context where survival often depended on clear factors like gender and class.
  • There is a need to verify whether this confidence is justified by real-world accuracy by comparing predicted values with actual outcomes.
  • Further analysis (e.g., ROC curve, confusion matrix, feature importance, SHAP values) is recommended to assess model robustness and identify possible bias.
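For the SHAP analysis recommended above, xgboost can return per-feature SHAP contributions directly; a minimal sketch using the trained model and training matrix:

``` r
# Per-passenger, per-feature SHAP contributions (the last column is the bias)
shap_vals <- predict(xgb_model, dtrain, predcontrib = TRUE)

# Mean absolute SHAP value per feature as a global importance ranking
sort(colMeans(abs(shap_vals)), decreasing = TRUE)[1:10]
```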

11.3.4 Result of Prediction & Submission

File: XGBoostSubmission_ModelMatrix.csv

Comment: Predictions on the test dataset were exported in the Kaggle submission format.

Interpretation: The submission file contains the final predictions, ready for evaluation on unseen data.

## Comparing Test Predictions with Actual Values - Visualization

# Read the reference-label file
# (Note: gender_submission.csv is Kaggle's sample submission, not the true
# outcomes, so this comparison is illustrative only)
truth <- gender_submission

# Attach the reference labels to the test set
test$PassengerId <- truth$PassengerId  # match the IDs (rows share the same order)
test$TrueSurvived <- truth$Survived

# Compare predicted classes with the reference labels; tabulating the raw
# probabilities would print one row per unique predicted value
test$predicted_class <- ifelse(test$predicted > 0.5, 1, 0)
table(Predicted = test$predicted_class, Actual = test$TrueSurvived)
library(ggplot2)

# Predictions above 0.5 are classed as survived (already computed above)
test$predicted_class <- ifelse(test$predicted > 0.5, 1, 0)

# Plot predicted class vs actual
ggplot(test, aes(x = factor(TrueSurvived), fill = factor(predicted_class))) +
  geom_bar(position = "dodge") +
  labs(title = "Prediction vs Actual Survival",
       x = "Actual", fill = "Predicted Class")

test$Index <- 1:nrow(test)
ggplot(test, aes(x = Index)) +
  geom_point(aes(y = TrueSurvived), color = "gold", size = 2) +
  geom_point(aes(y = predicted), color = "steelblue", shape = 4) +
  labs(title = "True vs Predicted Survival", y = "Survival (0 = No, 1 = Yes)")

test$predicted_class <- ifelse(test$predicted > 0.5, 1, 0)

ggplot(test, aes(x = Index)) +
  geom_point(aes(y = TrueSurvived), color = "gold", size = 2) +
  geom_point(aes(y = predicted_class), color = "steelblue", shape = 4) +
  labs(title = "True vs Predicted Class", y = "Survival (0 = No, 1 = Yes)")

11.4 Visual Summary

11.4.1 Bar Plot: Prediction vs Actual

  • Most passengers who did not survive (Actual = 0) were correctly predicted.
  • A good number of survivors (Actual = 1) were also correctly identified.
  • Misclassifications are visible as small bars in the opposite color.
  • The plot reflects strong class separation, with minor false positives and negatives.

11.4.2 Scatter Plot: True vs Predicted Probability

  • Gold dots: Actual outcomes (0 or 1)
  • Blue X’s: Predicted probabilities (ranging from 0 to 1)
  • Predictions are mostly near 0 or 1, indicating high confidence.
  • Overlaps = correct predictions; gaps = incorrect predictions ❌

11.4.3 Conclusion

  • The model performs well overall.
  • Visuals confirm most predictions are accurate and confident.
  • Minor errors are present but not dominant.

11.4.4 Future Enhancements

Here are some ideas to improve the Titanic survival prediction project in the future:

11.4.4.1 1. Add New Features
  • Create new columns from existing data:
    • Get titles (Mr, Miss) from the Name column
    • Extract deck levels from the Cabin column
    • Group people based on the Ticket number
  • These can help the model find more useful patterns.
11.4.4.2 2. Improve Model Explanation
  • Use SHAP or LIME to explain why the model makes certain predictions.
  • This is useful for understanding and trusting the model.
11.4.4.3 3. Use Ensemble Methods
  • Combine different models like Logistic Regression, Random Forest, and XGBoost.
  • These combinations often perform better than one model alone.
11.4.4.4 4. Handle Class Imbalance
  • There are more people who died than survived.
  • Use SMOTE or class weights to fix this and improve fairness and accuracy.
11.4.4.5 5. Build a Simple User App
  • Create a web app using R Shiny.
  • Users can enter data and see predictions directly.
11.4.4.6 6. Tune Model Parameters
  • Try different model settings with Grid Search or Random Search.
  • This can improve the model’s performance.
11.4.4.7 7. Group-Specific Models
  • Make separate models for men and women, or for each passenger class.
  • This might give better results for each group.
11.4.4.8 8. Use More Evaluation Metrics
  • Look at more than just accuracy.
  • Include AUC, F1-score, Precision, and Recall.
11.4.4.9 9. Use Cross-Validation
  • Split the data into folds and validate on each in turn (see the sketch after this list).
  • This checks how well the model works on new data.
11.4.4.10 10. Try Deep Learning (Optional)
  • If more data is available, try deep learning models.
  • These might give better results with larger datasets.
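As a concrete starting point for ideas 6 and 9, a minimal 5-fold cross-validation sketch with caret (the model and feature choices here are illustrative):

``` r
library(caret)

# 5-fold CV on a simple logistic model; train() reports held-out Accuracy/Kappa
set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)
cv_fit <- train(Survived ~ Pclass + Sex + Age + Fare + FamilySize + IsAlone,
                data = train, method = "glm", family = "binomial",
                trControl = ctrl)
cv_fit
```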

12 References

  1. Titanic - Machine Learning from Disaster (Kaggle)
    This is the official Kaggle competition page. It provides the dataset, problem description, evaluation method, and a place to submit your predictions.

  2. Titanic Data Science Solutions by StartUpSci (Kaggle Notebook)
    A very popular and detailed notebook on Kaggle. It shows step-by-step how to clean data, create new features, and build machine learning models.

  3. XGBoost Official Documentation
    The official guide for using the XGBoost machine learning algorithm. It explains how to tune parameters and use the model in R or Python.

  4. SHAP (SHapley Additive exPlanations)
    A helpful tool to explain how machine learning models make predictions. Useful for understanding the impact of each feature on the outcome.