The objective of this project is to develop and evaluate machine learning models to predict the survival of passengers aboard the Titanic using logistic regression and advanced algorithms. Key goals include:
IsAlone,
FamilySize).library(iml)
library(tidyverse)
library(data.table)
# File path
data_path <- "C:/Users/Admin/Desktop/Guelph/assg4/"
train <- read.csv(paste0(data_path, "train.csv"), stringsAsFactors = FALSE)
test <- read.csv(paste0(data_path, "test.csv"), stringsAsFactors = FALSE)
gender_submission <- read.csv(paste0(data_path, "gender_submission.csv"), stringsAsFactors = FALSE)
## Dataset Overview
# Show dataset structure and summary
#str(train)
library(DT)
datatable(train,
filter = "top", # Her sütunun üstüne filtre kutusu koyar
options = list(
pageLength = 15, # Sayfa başına 15 satır
autoWidth = TRUE # Otomatik sütun genişliği
))
# Load package
library(skimr)
# Run skim
skim(train)
| Name | train |
| Number of rows | 891 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Name | 0 | 1 | 12 | 82 | 0 | 891 | 0 |
| Sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| Ticket | 0 | 1 | 3 | 18 | 0 | 681 | 0 |
| Cabin | 0 | 1 | 0 | 15 | 687 | 148 | 0 |
| Embarked | 0 | 1 | 0 | 1 | 2 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 0 | 1.0 | 446.00 | 257.35 | 1.00 | 223.50 | 446.00 | 668.5 | 891.00 | ▇▇▇▇▇ |
| Survived | 0 | 1.0 | 0.38 | 0.49 | 0.00 | 0.00 | 0.00 | 1.0 | 1.00 | ▇▁▁▁▅ |
| Pclass | 0 | 1.0 | 2.31 | 0.84 | 1.00 | 2.00 | 3.00 | 3.0 | 3.00 | ▃▁▃▁▇ |
| Age | 177 | 0.8 | 29.70 | 14.53 | 0.42 | 20.12 | 28.00 | 38.0 | 80.00 | ▂▇▅▂▁ |
| SibSp | 0 | 1.0 | 0.52 | 1.10 | 0.00 | 0.00 | 0.00 | 1.0 | 8.00 | ▇▁▁▁▁ |
| Parch | 0 | 1.0 | 0.38 | 0.81 | 0.00 | 0.00 | 0.00 | 0.0 | 6.00 | ▇▁▁▁▁ |
| Fare | 0 | 1.0 | 32.20 | 49.69 | 0.00 | 7.91 | 14.45 | 31.0 | 512.33 | ▇▁▁▁▁ |
# Show dataset structure and summary
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(train)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
# Display first few rows
head(train)
# Count of total passengers
nrow(train)
## [1] 891
# Survival distribution
table(train$Survived)
##
## 0 1
## 549 342
prop.table(table(train$Survived)) * 100
##
## 0 1
## 61.61616 38.38384
# Gender distribution
table(train$Sex)
##
## female male
## 314 577
prop.table(table(train$Sex)) * 100
##
## female male
## 35.2413 64.7587
# Class distribution
table(train$Pclass)
##
## 1 2 3
## 216 184 491
# Embarkation distribution
table(train$Embarked)
##
## C Q S
## 2 168 77 644
# Fare and Age summaries
summary(train$Fare)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 7.91 14.45 32.20 31.00 512.33
summary(train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
# Missing value counts
colSums(is.na(train))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
# Feature engineering
#train$Survived <- as.factor(train$Survived)
#test$Survived <- NULL
#train$FamilySize <- train$SibSp + train$Parch + 1
#train$IsAlone <- ifelse(train$FamilySize == 1, 1, 0)
| Feature | Type | Notes | Include | Reasoning |
|---|---|---|---|---|
| PassengerId | Numeric ID | Unique identifier, not useful for prediction | No | Pure index value; no analytical value |
| Survived | Binary Target | The target variable | Yes | This is the variable we aim to predict |
| Pclass | Ordinal | Socioeconomic status indicator | Yes | Clear correlation with survival rates |
| Name | Text | Raw name not useful. Titles can be extracted | Yes | Extract title to capture gender/social status |
| Sex | Categorical | Strong predictor | Yes | Female passengers had higher survival rates |
| Age | Numeric | Missing values exist | Yes | Fill missing values using median or model-based imputation |
| SibSp | Numeric | Number of siblings/spouses aboard | Yes | Used in FamilySize calculation |
| Parch | Numeric | Number of parents/children aboard | Yes | Combined with SibSp to compute FamilySize |
| Ticket | Text | Raw form noisy; prefix or group size can be engineered | Yes | Optional engineered features can be extracted |
| Fare | Numeric | Ticket price, economic status indicator | Yes | Often correlates with class and survival |
| Cabin | Text | Many missing values; deck letter can be extracted | Yes | Use deck letter as feature; impute missing as ‘Unknown’ |
| Embarked | Categorical | Port of embarkation (C, Q, S) | Yes | Region-based travel behavior indicator |
| FamilySize | Engineered | SibSp + Parch + 1 | Yes | Captures family group size effect |
| IsAlone | Engineered | Binary flag for passengers traveling alone | Yes | Strong indicator for survival outcome |
| Title | Engineered | Extracted from Name | Yes | Captures gender/status information embedded in name |
We will use the following features:
PclassSexAge (with imputation)FareEmbarkedFamilySize (engineered)IsAlone (engineered)Title (engineered from Name)Deck (from Cabin) and
TicketPrefix (from Ticket)These features will be carried forward into correlation analysis, modeling, and feature selection.
## Data Preparation and Feature Importance
**Comment:**
- Missing values (especially `Age` and `Fare`) were filled using the median.
- New features such as `FamilySize` and `IsAlone` were created.
- Titles were extracted from the `Name` column for deeper insight.
**Interpretation:**
This step improves data quality and helps the model learn better. Handling missing values reduces data loss, and feature engineering adds meaningful patterns for prediction.
``` r
library(tidyverse)
library(corrplot)
library(randomForest)
library(caret)
# Convert relevant variables to factors
train$Sex <- as.factor(train$Sex)
train$Embarked[train$Embarked == ""] <- NA
train$Embarked[is.na(train$Embarked)] <- "S"
train$Embarked <- as.factor(train$Embarked)
# Impute missing numeric values with median
train$Age[is.na(train$Age)] <- median(train$Age, na.rm = TRUE)
train$Fare[is.na(train$Fare)] <- median(train$Fare, na.rm = TRUE)
# Feature engineering
train$FamilySize <- train$SibSp + train$Parch + 1
train$IsAlone <- ifelse(train$FamilySize == 1, 1, 0)
# Extract title from Name
train$Title <- gsub("^.*, (.*?)\\..*$", "\\1", train$Name)
train$Title[!(train$Title %in% c("Mr", "Miss", "Mrs", "Master"))] <- "RareTitle"
train$Title <- factor(train$Title)
# Convert Survived to factor
train$Survived <- as.factor(train$Survived)
# Prepare numeric-only subset for correlation
numeric_vars <- train %>%
select_if(is.numeric) %>%
select(-PassengerId) # Drop ID
# Compute and plot correlation matrix
cor_matrix <- cor(numeric_vars)
corrplot(cor_matrix, method = "color", addCoef.col = "black", tl.cex = 0.8)
# One-hot encoding for categorical variables
dummies <- dummyVars(" ~ Sex + Embarked + Title", data = train)
categorical_encoded <- data.frame(predict(dummies, newdata = train))
# Combine all features
train_features <- bind_cols(
train %>% select(Survived, Pclass, Age, Fare, FamilySize, IsAlone),
categorical_encoded
)
# Convert Survived to factor (target variable)
train_features$Survived <- as.factor(train_features$Survived)
# Random Forest with all features
set.seed(123)
rf_full <- randomForest(Survived ~ ., data = train_features, importance = TRUE, ntree = 200)
# Convert importance to data frame
importance_df <- as.data.frame(importance(rf_full))
importance_df$Feature <- rownames(importance_df)
rownames(importance_df) <- NULL
# Sort by MeanDecreaseAccuracy
importance_df <- importance_df %>%
arrange(MeanDecreaseAccuracy)
# Plot with ggplot2
ggplot(importance_df, aes(x = MeanDecreaseAccuracy, y = reorder(Feature, MeanDecreaseAccuracy))) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Random Forest Feature Importance (Mean Decrease Accuracy)",
x = "Mean Decrease Accuracy",
y = "Feature") +
theme_minimal()
SibSp and FamilySize → 0.89
correlation
Very strong positive correlation — this is expected, as
FamilySize = SibSp + Parch + 1.
Parch and FamilySize → 0.78
correlation
Strong relationship again confirms redundancy.
IsAlone and FamilySize → -0.69
correlation
Inversely related: being alone corresponds to
FamilySize == 1.
Fare and Pclass → -0.55
correlation
Negative correlation suggests that passengers in higher classes tended
to pay more for their tickets.
Other correlations (e.g., Age, Pclass)
were moderate to weak and not as significant.
General Meaning:
The bar chart displays each feature’s contribution to model accuracy. A higher Mean Decrease Accuracy means the feature is more influential in predicting survival. These features help the model make better decisions.
FamilySize.Title.Col,
Title.Don, Title.Mlle): Very few passengers
have these titles. Their predictive contribution is low. They can be
grouped under a new category like "RareTitle".Retain the top ~10 features:
Pclass, FamilySize, Fare,
Sex, Age, Embarked,
Title, IsAlone.These will provide the most predictive power for survival.
Optional / Can be removed:
SibSp and Parch
separately if FamilySize is included —
this will help prevent multicollinearity.IsAlone and FamilySize as engineered
features instead — they carry distinct and informative signals.Pclass and Fare
together as indicators of economic status. ##
Clustering with K-Means & PCAlibrary(cluster)
library(factoextra)
library(plotly)
# Seçilen en önemli özellikler
cluster_input <- train %>%
select(Pclass, Age, Fare, FamilySize, IsAlone,
Title, Sex, Embarked)
# Kategorik değişkenleri sayısala çevir (one-hot encoding)
dummies <- dummyVars(" ~ .", data = cluster_input)
cluster_matrix <- predict(dummies, newdata = cluster_input)
cluster_df <- as.data.frame(cluster_matrix)
# Sadece tam gözlemleri filtrele
valid_rows <- complete.cases(cluster_df)
features_cluster <- scale(cluster_df[valid_rows, ])
# KMeans kümeleme
set.seed(123)
kclust <- kmeans(features_cluster, centers = 3, nstart = 25)
train$Cluster <- NA
train$Cluster[valid_rows] <- kclust$cluster
train$Cluster <- factor(train$Cluster)
# PCA ile görselleştirme
pca <- prcomp(features_cluster)
pca_df <- as.data.frame(pca$x[, 1:2])
pca_df$Cluster <- factor(kclust$cluster)
# Survived oranlarını cluster'lara göre incele
train %>%
filter(!is.na(Cluster)) %>%
group_by(Cluster, Survived) %>%
summarise(Count = n()) %>%
mutate(Percentage = round(Count / sum(Count) * 100, 1)) %>%
arrange(Cluster, desc(Survived))
plot_ly(pca_df, x = ~PC1, y = ~PC2, color = ~Cluster,
type = 'scatter', mode = 'markers') %>%
layout(title = "K-Means Clustering with Top Features (PCA Projection)")
Based on the K-Means clustering and survival breakdown table:
Cluster 2 (🟠 Orange) has the highest
survival rate, with 84% of passengers in this
group surviving.
This suggests that Cluster 2 represents passengers with favorable
characteristics such as:
Cluster 3 (🔵 Blue) shows a moderate
survival rate of 74.2%.
This cluster may contain passengers with some survival advantages but
not as strong as Cluster 2.
Cluster 1 (🟢 Green) has the lowest
survival rate, with only 57.5% of its
passengers surviving.
⚠ It likely represents passengers who were alone
(IsAlone = 1), younger, or from lower fare
brackets.
library(ggplot2)
library(dplyr)
library(tidyr)
# 1. Gender-Based Survival
p1 <- train %>%
group_by(Sex, Survived) %>%
summarise(Count = n(), .groups = "drop") %>%
mutate(Percent = round(Count / sum(Count) * 100, 1)) %>%
ggplot(aes(x = Sex, y = Percent, fill = as.factor(Survived))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Survival Rate by Gender", y = "%", fill = "Survived") +
theme_minimal()
# 2. Class-Based Survival
p2 <- train %>%
group_by(Pclass, Survived) %>%
summarise(Count = n(), .groups = "drop") %>%
mutate(Percent = round(Count / sum(Count) * 100, 1)) %>%
ggplot(aes(x = as.factor(Pclass), y = Percent, fill = as.factor(Survived))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Survival Rate by Passenger Class", x = "Pclass", y = "%", fill = "Survived") +
theme_minimal()
# 3. Embarked-Based Survival
p3 <- train %>%
group_by(Embarked, Survived) %>%
summarise(Count = n(), .groups = "drop") %>%
mutate(Percent = round(Count / sum(Count) * 100, 1)) %>%
ggplot(aes(x = Embarked, y = Percent, fill = as.factor(Survived))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Survival Rate by Embarkation Port", y = "%", fill = "Survived") +
theme_minimal()
# 4. Title-Based Survival (from engineered Title)
p4 <- train %>%
group_by(Title, Survived) %>%
summarise(Count = n(), .groups = "drop") %>%
mutate(Percent = round(Count / sum(Count) * 100, 1)) %>%
ggplot(aes(x = reorder(Title, -Percent), y = Percent, fill = as.factor(Survived))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Survival Rate by Title", y = "%", fill = "Survived") +
theme_minimal() +
coord_flip()
# Display multiple plots (optionally with patchwork)
library(patchwork)
(p1 | p2) / (p3 | p4)
This set of visualizations confirms that Sex, Pclass, Embarked, and engineered features like Title have strong predictive value. These features are crucial for the model and should be retained during feature selection.
library(DALEX)
library(iml)
# Prepare model dataset
logreg_data <- train %>%
mutate(
Sex = as.factor(Sex),
Embarked = ifelse(Embarked == "", NA, Embarked),
Embarked = factor(Embarked),
Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
Title = factor(Title),
FamilySize = SibSp + Parch + 1,
IsAlone = ifelse(FamilySize == 1, 1, 0)
) %>%
select(Survived, Pclass, Sex, Age, Fare, FamilySize, IsAlone, Embarked, Title) %>%
drop_na()
# Logistic regression with more informative features
log_model <- glm(Survived ~ Pclass + Sex + Age + Fare + FamilySize + IsAlone + Embarked + Title,
data = logreg_data, family = binomial)
# PDP preparation
predictor <- Predictor$new(
model = log_model,
data = logreg_data %>% select(-Survived),
y = as.numeric(logreg_data$Survived)
)
# Generate PDPs for top features
pdp_age <- FeatureEffect$new(predictor, feature = "Age", method = "pdp")
pdp_fare <- FeatureEffect$new(predictor, feature = "Fare", method = "pdp")
pdp_pclass <- FeatureEffect$new(predictor, feature = "Pclass", method = "pdp")
pdp_sex <- FeatureEffect$new(predictor, feature = "Sex", method = "pdp")
# Plot PDPs
plot(pdp_age) + ggtitle("PDP: Age")
plot(pdp_fare) + ggtitle("PDP: Fare")
plot(pdp_pclass) + ggtitle("PDP: Pclass")
plot(pdp_sex) + ggtitle("PDP: Sex")
library(xgboost)
library(caret)
library(dplyr)
# Feature Engineering for Training Data
train_xgb <- train %>%
mutate(
Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
Fare = ifelse(is.na(Fare), median(Fare, na.rm = TRUE), Fare),
FamilySize = SibSp + Parch + 1,
IsAlone = ifelse(FamilySize == 1, 1, 0),
Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
Sex = factor(Sex),
Embarked = factor(ifelse(Embarked == "", "S", Embarked)),
Title = factor(Title)
)
train$Embarked[train$Embarked == ""] <- NA
train$Embarked[is.na(train$Embarked)] <- "S"
train$Embarked <- as.factor(train$Embarked)
train$Title <- gsub("^.*, (.*?)\\..*$", "\\1", train$Name)
train$Title[!(train$Title %in% c("Mr", "Miss", "Mrs", "Master"))] <- "RareTitle"
train$Title <- factor(train$Title)
train$Sex <- as.factor(train$Sex)
# One-hot encoding for categorical variables
dummies <- dummyVars(" ~ Sex + Embarked + Title", data = train_xgb)
cat_encoded <- predict(dummies, newdata = train_xgb)
# Combine numeric + encoded features
xgb_features <- cbind(
train_xgb %>% select(Pclass, Age, Fare, FamilySize, IsAlone),
cat_encoded
)
# Create label vector
label <- as.numeric(train$Survived) - 1
# Create DMatrix (preserves feature names)
dtrain <- xgb.DMatrix(data = as.matrix(xgb_features), label = label)
# Train XGBoost model
set.seed(123)
xgb_model <- xgboost(
data = dtrain,
nrounds = 50,
objective = "binary:logistic",
verbose = 0
)
library(pROC)
library(caret)
# Hazırlık: bias_data'nın tüm özellikleri içermesi lazım
bias_data <- train %>%
mutate(
Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
Fare = ifelse(is.na(Fare), median(Fare, na.rm = TRUE), Fare),
FamilySize = SibSp + Parch + 1,
IsAlone = ifelse(FamilySize == 1, 1, 0),
Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
Sex = factor(Sex),
Embarked = factor(ifelse(Embarked == "", "S", Embarked)),
Title = factor(Title)
)
# Kategorik değişkenleri encode et
bias_encoded <- predict(dummies, newdata = bias_data)
# XGBoost formatına uygun hale getir
bias_features <- cbind(
bias_data %>% select(Pclass, Age, Fare, FamilySize, IsAlone),
bias_encoded
)
dpredict <- xgb.DMatrix(data = as.matrix(bias_features))
# Tahmin al
train$pred <- predict(xgb_model, dpredict)
# Bias Analizi: ortalama tahmin
train %>% group_by(Sex) %>% summarise(AvgPrediction = mean(pred, na.rm = TRUE))
train %>% group_by(Pclass) %>% summarise(AvgPrediction = mean(pred, na.rm = TRUE))
# ROC ve AUC
roc_obj <- roc(as.numeric(train$Survived) - 1, train$pred)
plot(roc_obj, main = "ROC Curve - XGBoost")
auc(roc_obj)
## Area under the curve: 0.9856
# Confusion Matrix
pred_class <- ifelse(train$pred > 0.5, 1, 0)
confusionMatrix(factor(pred_class), factor(as.numeric(train$Survived) - 1))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 533 41
## 1 16 301
##
## Accuracy : 0.936
## 95% CI : (0.9179, 0.9512)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8629
##
## Mcnemar's Test P-Value : 0.001478
##
## Sensitivity : 0.9709
## Specificity : 0.8801
## Pos Pred Value : 0.9286
## Neg Pred Value : 0.9495
## Prevalence : 0.6162
## Detection Rate : 0.5982
## Detection Prevalence : 0.6442
## Balanced Accuracy : 0.9255
##
## 'Positive' Class : 0
##
Interpretation:
The model demonstrated strong predictive performance, especially in
distinguishing between survivors and non-survivors. The high AUC value
supports this claim.
The Gain Chart is a powerful evaluation tool used in binary classification tasks. It helps visualize how effectively a model identifies positive cases—in this case, passengers who survived the Titanic disaster.
A well-performing model will have a curve that rises steeply and stays above the diagonal red line, indicating it correctly identifies most of the positive cases early on.
library(dplyr)
# Step 1: Prepare prediction features (same as XGBoost training set)
gain_data <- train %>%
mutate(
Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
Fare = ifelse(is.na(Fare), median(Fare, na.rm = TRUE), Fare),
FamilySize = SibSp + Parch + 1,
IsAlone = ifelse(FamilySize == 1, 1, 0),
Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
Sex = factor(Sex),
Embarked = factor(ifelse(Embarked == "", "S", Embarked)),
Title = factor(Title)
)
# One-hot encode categorical variables
gain_encoded <- predict(dummies, newdata = gain_data)
# Final input for prediction
gain_features <- cbind(
gain_data %>% select(Pclass, Age, Fare, FamilySize, IsAlone),
gain_encoded
)
# Step 2: Make predictions
actuals <- as.numeric(train$Survived) - 1
preds <- predict(xgb_model, newdata = xgb.DMatrix(data = as.matrix(gain_features)))
# Step 3: Filter valid rows
valid_idx <- which(!is.na(actuals) & !is.na(preds))
gain_df <- data.frame(actual = actuals[valid_idx], predicted = preds[valid_idx])
# Step 4: Sort predictions and compute cumulative gains
gain_df <- gain_df %>%
arrange(desc(predicted)) %>%
mutate(
row_id = row_number(),
cum_pct_total = row_id / n(),
cum_captured = cumsum(actual) / sum(actual)
)
# Step 5: Plot Gain Chart
plot(gain_df$cum_pct_total * 100, gain_df$cum_captured * 100,
type = "l", col = "blue", lwd = 2,
xlab = "% of Sample", ylab = "% of Positive Cases Captured",
main = "Gain Chart - Titanic Survival Prediction (XGBoost)")
abline(0, 1, col = "red", lty = 2) # baseline
legend("bottomright", legend = c("XGBoost Model", "Random Baseline"),
col = c("blue", "red"), lty = c(1, 2), bty = "n")
Comment: - If the top 20% of predictions capture 60% of actual survivors, the model is considered strong.
Note: If the chart does not render, it may be due to axis length mismatch.
Interpretation: Gain charts assess how well the model identifies the most likely survivors. It’s commonly used in marketing and resource prioritization.
library(xgboost)
library(dplyr)
# Feature engineering
train_processed <- train %>%
mutate(
Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
Fare = ifelse(is.na(Fare), median(Fare, na.rm = TRUE), Fare),
FamilySize = SibSp + Parch + 1,
IsAlone = ifelse(FamilySize == 1, 1, 0),
Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
Embarked = ifelse(Embarked == "", "S", Embarked)
)
str(train_processed)
## 'data.frame': 891 obs. of 17 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 28 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : int 3 1 3 3 3 2 3 3 3 1 ...
## $ FamilySize : num 2 2 1 2 1 1 1 5 3 2 ...
## $ IsAlone : num 0 0 1 0 1 1 1 0 0 0 ...
## $ Title : chr "Mr" "Mrs" "Miss" "Mrs" ...
## $ Cluster : Factor w/ 3 levels "1","2","3": 2 3 3 3 2 2 2 1 3 3 ...
## $ pred : num 0.0617 0.998 0.4803 0.9985 0.088 ...
# Convert to factors
train$Embarked[train$Embarked == ""] <- NA
train$Embarked[is.na(train$Embarked)] <- "S"
train$Embarked <- as.factor(train$Embarked)
train$Title <- gsub("^.*, (.*?)\\..*$", "\\1", train$Name)
train$Title[!(train$Title %in% c("Mr", "Miss", "Mrs", "Master"))] <- "RareTitle"
train$Title <- factor(train$Title)
train$Sex <- as.factor(train$Sex)
# Create model matrix for XGBoost (dummy encoding included)
xgb_matrix <- model.matrix(~ Pclass + Age + Fare + FamilySize + IsAlone + Sex + Embarked + Title - 1, data = train_processed)
# Prepare label
label <- as.numeric(train$Survived) - 1
# Train model
dtrain <- xgb.DMatrix(data = xgb_matrix, label = label)
set.seed(123)
xgb_model <- xgboost(data = dtrain, nrounds = 50, objective = "binary:logistic", verbose = 0)
train_feature_names <- colnames(xgb_matrix)
# Eksik sütunları doldur
missing_cols <- setdiff(train_feature_names, colnames(xgb_matrix))
for (col in missing_cols) {
xgb_matrix <- cbind(xgb_matrix, setNames(data.frame(rep(0, nrow(xgb_matrix))), col))
}
# Fazla sütunları çıkar
extra_cols <- setdiff(colnames(xgb_matrix), train_feature_names)
xgb_matrix <- xgb_matrix[, !(colnames(xgb_matrix) %in% extra_cols)]
# Aynı sütun sırasına göre düzenle
xgb_matrix <- xgb_matrix[, train_feature_names]
# Step 1: Preprocessing test data
test_processed <- test %>%
mutate(
Age = ifelse(is.na(Age), median(train$Age, na.rm = TRUE), Age),
Fare = ifelse(is.na(Fare), median(train$Fare, na.rm = TRUE), Fare),
FamilySize = SibSp + Parch + 1,
IsAlone = ifelse(FamilySize == 1, 1, 0),
Title = gsub("^.*, (.*?)\\..*$", "\\1", Name),
Embarked = ifelse(Embarked == "", "S", Embarked)
)
str(test_processed )
## 'data.frame': 418 obs. of 14 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr "" "" "" "" ...
## $ Embarked : chr "Q" "S" "Q" "S" ...
## $ FamilySize : num 1 2 1 1 3 1 1 3 1 3 ...
## $ IsAlone : num 1 0 1 1 0 1 1 0 1 0 ...
## $ Title : chr "Mr" "Mrs" "Mr" "Mr" ...
# Step 2: Align factor levels with training set
test_processed$Embarked[test_processed$Embarked == ""] <- NA
test_processed$Embarked[is.na(test_processed$Embarked)] <- "S"
test_processed$Embarked <- factor(test_processed$Embarked, levels = levels(train$Embarked))
test_processed$Title <- gsub("^.*, (.*?)\\..*$", "\\1", test$Name)
test_processed$Title[!(test_processed$Title %in% levels(train$Title))] <- "RareTitle"
test_processed$Title <- factor(test_processed$Title, levels = levels(train$Title))
test_processed$Sex <- factor(test_processed$Sex, levels = levels(train$Sex))
str(test_processed$Sex)
## Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
str(test_processed$Embarked)
## Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
str(test_processed$Title)
## Factor w/ 5 levels "Master","Miss",..: 3 4 3 3 4 3 2 3 4 3 ...
summary(test_processed$Sex)
## female male
## 152 266
summary(test_processed$Embarked)
## C Q S
## 102 46 270
summary(test_processed$Title)
## Master Miss Mr Mrs RareTitle
## 21 78 240 72 7
# Step 3: Create model matrix
xgb_test_matrix <- model.matrix(~ Pclass + Age + Fare + FamilySize + IsAlone + Sex + Embarked + Title - 1,
data = test_processed)
# Step 4: Fix column mismatch
missing_cols <- setdiff(train_feature_names, colnames(xgb_test_matrix))
for (col in missing_cols) {
xgb_test_matrix <- cbind(xgb_test_matrix, setNames(data.frame(rep(0, nrow(test_processed))), col))
}
extra_cols <- setdiff(colnames(xgb_test_matrix), train_feature_names)
xgb_test_matrix <- xgb_test_matrix[, !(colnames(xgb_test_matrix) %in% extra_cols)]
xgb_test_matrix <- xgb_test_matrix[, train_feature_names] # ensure correct order
# Step 5: Prediction
str(xgb_test_matrix)
## 'data.frame': 418 obs. of 24 variables:
## $ Pclass : num 3 3 2 3 3 3 3 2 3 3 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ FamilySize : num 1 2 1 1 3 1 1 3 1 3 ...
## $ IsAlone : num 1 0 1 1 0 1 1 0 1 0 ...
## $ Sexfemale : num 0 1 0 0 1 0 1 0 1 0 ...
## $ Sexmale : num 1 0 1 1 0 1 0 1 0 1 ...
## $ Embarked : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TitleCol : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TitleDon : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TitleDr : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TitleJonkheer : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TitleLady : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TitleMajor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TitleMaster : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TitleMiss : num 0 0 0 0 0 0 1 0 0 0 ...
## $ TitleMlle : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TitleMme : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TitleMr : num 1 0 1 1 0 1 0 1 0 1 ...
## $ TitleMrs : num 0 1 0 0 1 0 0 0 1 0 ...
## $ TitleMs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TitleRev : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TitleSir : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Titlethe Countess: num 0 0 0 0 0 0 0 0 0 0 ...
str(train_feature_names)
## chr [1:24] "Pclass" "Age" "Fare" "FamilySize" "IsAlone" "Sexfemale" ...
if (nrow(xgb_test_matrix) == 418 && ncol(xgb_test_matrix) == length(train_feature_names)) {
dtest <- xgb.DMatrix(data = as.matrix(xgb_test_matrix))
test_preds <- predict(xgb_model, newdata = dtest)
# Convert to binary prediction
test_pred_class <- ifelse(test_preds > 0.5, 1, 0)
str(test_pred_class)
# Build Kaggle submission
submission <- data.frame(PassengerId = test$PassengerId, Survived = test_pred_class)
write.csv(submission, "XGBoostSubmission_ModelMatrix.csv", row.names = FALSE)
} else {
cat("❌ ERROR: Test matrix shape mismatch. Check encoding or factor levels.\n")
cat("Rows:", nrow(xgb_test_matrix), "Cols:", ncol(xgb_test_matrix), "\n")
cat("Expected Cols:", length(train_feature_names), "\n")
}
## num [1:418] 0 0 0 1 1 0 0 1 1 0 ...
File:
LogisticsRegressionSubmission.csv
Comment: - Predictions on the test dataset were exported in Kaggle submission format.
Interpretation: The submission file contains final predictions ready for evaluation. It proves the model’s capability on unseen data.
library(ggplot2)
# Create data frame containing prediction results
test_pred_df <- data.frame(Predicted_Probability = test_preds)
# Histogramı draw
ggplot(test_pred_df, aes(x = Predicted_Probability)) +
geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white") +
labs(
title = "Histogram of Predicted Survival Probabilities (Test Set)",
x = "Predicted Probability of Survival",
y = "Count"
) +
theme_minimal()
ggplot(test_pred_df, aes(x = Predicted_Probability)) +
geom_density(fill = "skyblue", alpha = 0.6) +
labs(title = "Density of Predicted Survival Probabilities",
x = "Predicted Probability",
y = "Density") +
theme_minimal()
test_pred_df$Sex <- test$Sex # Cinsiyet ekle
ggplot(test_pred_df, aes(x = Sex, y = Predicted_Probability, fill = Sex)) +
geom_violin(trim = FALSE, alpha = 0.5) +
geom_boxplot(width = 0.1, color = "black", alpha = 0.8) +
labs(title = "Predicted Probabilities by Gender",
x = "Gender", y = "Survival Probability") +
theme_minimal()
# install.packages("ggdist"); install.packages("gghalves")
library(ggdist)
library(gghalves)
ggplot(test_pred_df, aes(x = Sex, y = Predicted_Probability, fill = Sex)) +
stat_halfeye(adjust = 0.5, width = 0.6, .width = 0, point_color = NA) +
geom_boxplot(width = 0.1, outlier.shape = NA, alpha = 0.3) +
geom_jitter(aes(color = Sex), width = 0.1, alpha = 0.3) +
theme_minimal() +
labs(title = "Raincloud Plot: Gender vs Predicted Survival Probability")
# install.packages("flexdashboard")
library(flexdashboard)
gauge(mean(test_preds) * 100, min = 0, max = 100,
sectors = gaugeSectors(success = c(70, 100), warning = c(40, 70), danger = c(0, 40)),
label = "Avg Survival Probability")
library(plotly)
plot_ly(test_pred_df, x = ~Predicted_Probability, type = "histogram",
marker = list(color = 'rgba(100, 150, 250, 0.7)')) %>%
layout(title = "Interactive Histogram of Predicted Survival Probabilities",
xaxis = list(title = "Predicted Probability"),
yaxis = list(title = "Count"))
library(plotly)
test$predicted <- test_preds
plot_ly(test, x = ~Age, y = ~Fare,
color = ~predicted,
type = 'scatter',
mode = 'markers',
text = ~paste("Sex:", Sex,
"<br>Pclass:", Pclass,
"<br>FamilySize:", SibSp + Parch + 1),
hoverinfo = "text") %>%
layout(title = "Predicted Survival by Age and Fare",
xaxis = list(title = "Age"),
yaxis = list(title = "Fare"))
| Aspect | Comment |
|---|---|
| Data Cleaning | Proper handling of missing values and feature engineering |
| Modeling Techniques | Logistic Regression and XGBoost used; interpretable and powerful |
| Visualization | Multiple meaningful plots: PDP, ROC, Clusters, Gain Chart |
| Evaluation Metrics | ROC AUC, Confusion Matrix, and Gain Chart analyzed |
| Result Explanation | Clear and insightful; aligns with real-world knowledge |
| Deliverable Quality | Submission file created in correct format |
This histogram visualizes the distribution of the predicted survival probabilities output by the model for passengers in the test dataset.
*Interpretation:
The model demonstrates strong certainty in its
predictions. However, this level of confidence may suggest
overfitting**, particularly if the model is heavily influenced
by a dominant feature such as gender.
Caution:
A good model should balance between confident and uncertain predictions.
The lack of moderate probability values could indicate limited nuance in
complex cases or insufficient model regularization.
File:
LogisticsRegressionSubmission.csv
Comment: - Predictions on the test dataset were exported in Kaggle submission format.
Interpretation: The submission file contains final predictions ready for evaluation. It proves the model’s capability on unseen data. ## Test and Actual Values Compare - Visualization
# Read actual value file
truth <- gender_submission
# Suppose you add the test_pred vector to the test set
test$PassengerId <- truth$PassengerId # ID'leri eşleştir
test$TrueSurvived <- truth$Survived
# Compare
table(Predicted = test$predicted, Actual = test$TrueSurvived)
## Actual
## Predicted 0 1
## 0.00501341605558991 1 0
## 0.00534301018342376 1 0
## 0.0076636872254312 1 0
## 0.00829859636723995 1 0
## 0.0101103726774454 1 0
## 0.0167445708066225 1 0
## 0.0181489121168852 1 0
## 0.0186488442122936 1 0
## 0.0230183135718107 5 0
## 0.023526769131422 1 0
## 0.0239775534719229 1 0
## 0.027447484433651 0 1
## 0.0277628973126411 0 1
## 0.0314021483063698 2 0
## 0.0322321355342865 2 0
## 0.0324949808418751 1 0
## 0.0327299162745476 0 1
## 0.034887570887804 1 0
## 0.035640187561512 1 0
## 0.0367292277514935 1 0
## 0.0372420251369476 1 0
## 0.0399587824940681 1 0
## 0.0408413261175156 1 0
## 0.0426119267940521 1 0
## 0.0470128804445267 1 0
## 0.0478858537971973 1 0
## 0.0493488647043705 1 0
## 0.0493666082620621 1 0
## 0.0516273565590382 2 0
## 0.0531312637031078 2 0
## 0.0534789375960827 1 0
## 0.0555753707885742 1 0
## 0.0574772618710995 1 0
## 0.0576437823474407 1 0
## 0.059060450643301 0 1
## 0.061437826603651 1 0
## 0.0621099956333637 1 0
## 0.0622129440307617 3 0
## 0.0639677345752716 1 0
## 0.0640841647982597 1 0
## 0.0647494792938232 1 0
## 0.0657114312052727 1 0
## 0.0674889758229256 2 0
## 0.0680648237466812 1 0
## 0.06855708360672 1 0
## 0.06901615858078 1 0
## 0.0701743364334106 1 0
## 0.0742299780249596 1 0
## 0.0754449740052223 1 0
## 0.0756427496671677 1 0
## 0.0842391327023506 1 0
## 0.0853107869625092 1 0
## 0.087767593562603 1 0
## 0.0905502438545227 1 0
## 0.0912154912948608 0 1
## 0.0921662077307701 1 0
## 0.0933858901262283 0 1
## 0.0964819267392159 1 0
## 0.0976270213723183 1 0
## 0.0982456877827644 1 0
## 0.0984572097659111 1 0
## 0.100423686206341 1 0
## 0.106209143996239 1 0
## 0.1069725304842 1 0
## 0.107519745826721 1 0
## 0.109446182847023 1 0
## 0.115293174982071 1 0
## 0.116544790565968 1 0
## 0.117969430983067 1 0
## 0.119162052869797 5 0
## 0.122233964502811 1 0
## 0.122569389641285 1 0
## 0.123916767537594 1 0
## 0.126365810632706 1 0
## 0.13396780192852 1 0
## 0.137021660804749 1 0
## 0.138191729784012 0 1
## 0.140619248151779 1 0
## 0.141350731253624 1 0
## 0.144413217902184 1 0
## 0.14451976120472 1 0
## 0.14705865085125 0 1
## 0.147957116365433 1 0
## 0.148579686880112 1 0
## 0.14897096157074 1 0
## 0.150005728006363 2 0
## 0.15117309987545 1 0
## 0.154090464115143 1 0
## 0.154302254319191 1 0
## 0.15655392408371 0 1
## 0.157810509204865 1 0
## 0.158915534615517 1 0
## 0.159266337752342 1 0
## 0.162593245506287 1 0
## 0.162614554166794 1 0
## 0.167237609624863 1 0
## 0.1674515157938 1 0
## 0.169197306036949 1 0
## 0.171434640884399 2 0
## 0.173128470778465 1 0
## 0.173470556735992 1 0
## 0.174319937825203 1 0
## 0.175556063652039 1 0
## 0.176796332001686 2 0
## 0.176902815699577 1 0
## 0.181030318140984 1 0
## 0.18211804330349 1 0
## 0.184453964233398 7 0
## 0.184886753559113 1 0
## 0.185624122619629 1 0
## 0.192453324794769 1 0
## 0.195625185966492 0 1
## 0.195738241076469 1 0
## 0.196802169084549 1 0
## 0.206239119172096 1 0
## 0.210779875516891 1 0
## 0.215279683470726 1 0
## 0.217747837305069 1 0
## 0.218524619936943 1 0
## 0.222450107336044 0 1
## 0.22259546816349 0 1
## 0.230454593896866 1 0
## 0.232564836740494 1 0
## 0.236020535230637 1 0
## 0.236738473176956 2 0
## 0.239499643445015 1 0
## 0.242322847247124 1 0
## 0.244026839733124 1 0
## 0.245089367032051 1 0
## 0.249255791306496 1 0
## 0.251003921031952 1 0
## 0.251543194055557 1 0
## 0.252611219882965 1 0
## 0.253780335187912 1 0
## 0.257779449224472 1 0
## 0.259678810834885 9 0
## 0.261962652206421 1 0
## 0.263538420200348 1 0
## 0.270047545433044 1 0
## 0.272392094135284 1 0
## 0.273226946592331 0 1
## 0.275163173675537 1 0
## 0.281381517648697 0 2
## 0.285288333892822 1 0
## 0.286141574382782 1 0
## 0.292279243469238 1 0
## 0.295561015605927 1 0
## 0.297093838453293 1 0
## 0.299843579530716 1 0
## 0.300797283649445 1 0
## 0.307326287031174 3 0
## 0.308457106351852 1 0
## 0.312486231327057 1 0
## 0.320543646812439 1 0
## 0.32208576798439 1 0
## 0.325221598148346 1 0
## 0.327071785926819 0 1
## 0.328097134828568 1 0
## 0.328215479850769 0 1
## 0.329337388277054 1 0
## 0.330908626317978 1 0
## 0.336146861314774 1 0
## 0.337761223316193 2 0
## 0.344943583011627 1 0
## 0.347853899002075 0 1
## 0.348743826150894 1 0
## 0.350715458393097 1 0
## 0.361450433731079 1 0
## 0.363471984863281 1 0
## 0.364102870225906 1 0
## 0.372773796319962 1 0
## 0.373498797416687 1 0
## 0.374406516551971 0 1
## 0.37531241774559 1 0
## 0.377081900835037 1 0
## 0.377538442611694 1 0
## 0.386977434158325 0 1
## 0.388280838727951 1 0
## 0.393013149499893 1 0
## 0.404749721288681 0 1
## 0.417631298303604 1 0
## 0.42008438706398 0 2
## 0.429090857505798 1 0
## 0.429356157779694 1 0
## 0.431580781936646 1 0
## 0.433642774820328 1 0
## 0.433798849582672 1 0
## 0.436494320631027 1 0
## 0.442339420318604 1 0
## 0.446784287691116 1 0
## 0.446997076272964 1 0
## 0.447783887386322 1 0
## 0.448463469743729 1 0
## 0.449750810861588 2 0
## 0.45465412735939 0 1
## 0.458186328411102 1 0
## 0.461209833621979 3 0
## 0.471829295158386 1 0
## 0.472332924604416 1 0
## 0.473987877368927 1 0
## 0.480573028326035 1 0
## 0.48388659954071 1 0
## 0.486571192741394 1 0
## 0.49681493639946 1 0
## 0.501095473766327 1 0
## 0.51073807477951 1 0
## 0.517107546329498 1 0
## 0.519537150859833 0 1
## 0.532163202762604 1 0
## 0.532743275165558 1 0
## 0.53760689496994 1 0
## 0.543861150741577 1 0
## 0.558823704719543 2 0
## 0.559723734855652 1 0
## 0.56605738401413 1 0
## 0.568113327026367 1 0
## 0.571636915206909 1 0
## 0.584913611412048 1 0
## 0.60736495256424 0 1
## 0.608735799789429 1 0
## 0.621109127998352 1 0
## 0.624208569526672 1 0
## 0.629350781440735 1 0
## 0.635827779769897 1 0
## 0.643598020076752 1 0
## 0.659517228603363 2 0
## 0.671005964279175 0 1
## 0.68084055185318 1 0
## 0.681740343570709 0 1
## 0.688498497009277 0 1
## 0.692206382751465 0 1
## 0.69520491361618 0 1
## 0.709362804889679 1 0
## 0.728408813476562 0 1
## 0.731802463531494 1 0
## 0.741499960422516 1 0
## 0.74736875295639 1 0
## 0.752689063549042 0 1
## 0.753388822078705 1 0
## 0.757936716079712 1 0
## 0.760124027729034 0 1
## 0.763295888900757 1 0
## 0.764209985733032 2 0
## 0.76749461889267 0 1
## 0.787645399570465 0 1
## 0.789930522441864 0 1
## 0.793878972530365 0 1
## 0.80405580997467 1 0
## 0.813709855079651 1 0
## 0.821412622928619 0 1
## 0.825702726840973 0 2
## 0.830344021320343 0 1
## 0.836681604385376 0 1
## 0.838634371757507 0 4
## 0.839786410331726 1 0
## 0.844725966453552 0 2
## 0.845633804798126 0 1
## 0.853083074092865 0 1
## 0.85387110710144 0 1
## 0.859410524368286 1 0
## 0.859514057636261 2 0
## 0.861940085887909 0 1
## 0.87712162733078 0 1
## 0.88205486536026 0 1
## 0.885589063167572 0 1
## 0.887162744998932 0 1
## 0.89171838760376 2 0
## 0.892994403839111 0 1
## 0.893406391143799 0 1
## 0.896997570991516 0 1
## 0.900290668010712 0 1
## 0.90069967508316 0 1
## 0.902005195617676 1 0
## 0.906231284141541 0 1
## 0.909773826599121 0 1
## 0.91436094045639 0 1
## 0.916954696178436 1 0
## 0.924614369869232 0 1
## 0.925161957740784 0 1
## 0.926715672016144 0 1
## 0.927820086479187 0 1
## 0.928377866744995 0 1
## 0.932399809360504 0 1
## 0.933138787746429 0 1
## 0.934040427207947 0 1
## 0.935863196849823 0 1
## 0.943123281002045 0 1
## 0.946106433868408 0 2
## 0.947964072227478 0 1
## 0.949282705783844 0 1
## 0.949956953525543 0 1
## 0.953500032424927 0 1
## 0.956559717655182 0 1
## 0.959644556045532 0 1
## 0.960473477840424 1 0
## 0.961247742176056 0 1
## 0.96229749917984 0 1
## 0.962357878684998 1 0
## 0.967474639415741 1 0
## 0.970607340335846 0 1
## 0.970946609973907 0 1
## 0.971501111984253 0 1
## 0.972439587116241 0 1
## 0.972541689872742 0 1
## 0.973712503910065 0 1
## 0.974044680595398 0 1
## 0.974933087825775 0 1
## 0.975276052951813 0 1
## 0.976718544960022 0 1
## 0.977484881877899 0 1
## 0.978380501270294 0 1
## 0.978422403335571 0 1
## 0.978551924228668 0 1
## 0.980888545513153 0 1
## 0.982032239437103 0 1
## 0.982743144035339 0 1
## 0.982776880264282 0 1
## 0.984456181526184 0 1
## 0.984741508960724 0 1
## 0.98504376411438 0 1
## 0.985257565975189 0 1
## 0.985469698905945 0 1
## 0.985978782176971 0 1
## 0.986623823642731 0 1
## 0.986688315868378 0 2
## 0.987813055515289 0 1
## 0.988323926925659 0 1
## 0.988487482070923 0 1
## 0.990152776241302 0 1
## 0.990858614444733 0 1
## 0.990926265716553 0 1
## 0.991143822669983 0 1
## 0.991239249706268 0 1
## 0.991434633731842 0 1
## 0.991873800754547 0 1
## 0.992735147476196 0 2
## 0.993293642997742 0 1
## 0.993363261222839 0 1
## 0.994016408920288 0 1
## 0.994386315345764 0 1
## 0.994667053222656 0 1
## 0.994671642780304 0 1
## 0.994764447212219 0 1
## 0.994920551776886 0 1
## 0.995090484619141 0 1
## 0.995184242725372 0 1
## 0.995215773582458 0 1
## 0.995587348937988 0 1
## 0.995936393737793 0 1
## 0.996057152748108 0 1
## 0.996494829654694 0 1
## 0.997001111507416 0 1
## 0.997099995613098 0 1
## 0.99745911359787 0 2
## 0.997486472129822 0 1
## 0.997927665710449 0 1
## 0.998233556747437 0 1
## 0.998289287090302 0 1
## 0.998407542705536 0 1
## 0.998464345932007 0 1
## 0.998869717121124 0 1
## 0.998891294002533 0 1
## 0.999260604381561 0 1
library(ggplot2)
# 0.5 and higher is survived
test$predicted_class <- ifelse(test$predicted > 0.5, 1, 0)
# Graphic
ggplot(test, aes(x = factor(TrueSurvived), fill = factor(predicted_class))) +
geom_bar(position = "dodge") +
labs(title = "Prediction vs Actual Survival",
x = "Actual", fill = "Predicted Class")
test$Index <- 1:nrow(test)
ggplot(test, aes(x = Index)) +
geom_point(aes(y = TrueSurvived), color = "gold", size = 2) +
geom_point(aes(y = predicted), color = "steelblue", shape = 4) +
labs(title = "True vs Predicted Survival", y = "Survival (0 = No, 1 = Yes)")
test$predicted_class <- ifelse(test$predicted > 0.5, 1, 0)
ggplot(test, aes(x = Index)) +
geom_point(aes(y = TrueSurvived), color = "gold", size = 2) +
geom_point(aes(y = predicted_class), color = "steelblue", shape = 4) +
labs(title = "True vs Predicted Class", y = "Survival (0 = No, 1 = Yes)")
0 or
1)Here are some ideas to improve the Titanic survival prediction project in the future:
Titanic -
Machine Learning from Disaster (Kaggle)
This is the official Kaggle competition page. It provides the dataset,
problem description, evaluation method, and a place to submit your
predictions.
Titanic
Data Science Solutions by StartUpSci (Kaggle Notebook)
A very popular and detailed notebook on Kaggle. It shows step-by-step
how to clean data, create new features, and build machine learning
models.
XGBoost
Official Documentation
The official guide for using the XGBoost machine learning algorithm. It
explains how to tune parameters and use the model in R or
Python.
SHAP (SHapley
Additive exPlanations)
A helpful tool to explain how machine learning models make predictions.
Useful for understanding the impact of each feature on the
outcome.