Titanic Survival Prediction: Machine Learning Classification Analysis

Introduction

The Project

In this project, our primary objective is to build a model that accurately predicts whether a passenger on the Titanic survived. This is a binary classification task: the model assigns each passenger to one of two categories, survived or did not survive, based on recorded attributes such as age, sex, and passenger class.

The Dataset

The dataset we are working with comes from the Kaggle competition, Titanic: Machine Learning from Disaster. It includes passenger data from the Titanic, such as age, gender, passenger class, fare paid, number of siblings/spouses aboard, and whether or not the passenger survived.

Business goal

The business goal of this project is to understand the factors contributing to survival rate on the Titanic and to build a model that can accurately predict survival outcomes. This model, while based on historical data, can provide insights for safety and survival analysis in contemporary travel scenarios, particularly in the shipping and cruise industries.

Data Preparation

Import Library and Data

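This section loads the libraries used in the analysis: tidyverse for data manipulation and visualization, RColorBrewer for color palettes, caret for data partitioning and model evaluation, e1071 for the Naive Bayes model, rpart for the Decision Tree model, and randomForest for the Random Forest model. It also reads the three Kaggle files: the training set, the test set, and the sample submission.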

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RColorBrewer)
library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(e1071)
library(rpart)
library(randomForest)

## Warning: package 'randomForest' was built under R version 4.3.1

## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

train_data <- read.csv("./data/train.csv")
test_data <- read.csv("./data/test.csv")
gender_submission <- read.csv("./data/gender_submission.csv")

Train Overview

glimpse(train_data)

## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

Test Overview

glimpse(test_data)

## Rows: 418
## Columns: 11
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903…
## $ Pclass      <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3…
## $ Name        <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)", "M…
## $ Sex         <chr> "male", "female", "male", "male", "female", "male", "femal…
## $ Age         <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 21.0…
## $ SibSp       <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ticket      <chr> "330911", "363272", "240276", "315154", "3101298", "7538",…
## $ Fare        <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292, 2…
## $ Cabin       <chr> "", "", "", "", "", "", "", "", "", "", "", "", "B45", "",…
## $ Embarked    <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S", "S"…

gender_submission Overview

glimpse(gender_submission)

## Rows: 418
## Columns: 2
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903…
## $ Survived    <int> 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1…

Data Preparation

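In this part of the analysis, the data is cleaned before modeling. Missing values in the ‘Age’ column are imputed with the median age of the training set, and the same training median is applied to the test set so that no information from the test set enters the preprocessing. A missing ‘Fare’ in the test set is filled with the training median fare. Rows whose ‘Embarked’ value is an empty string are dropped, which reduces the training set from 891 to 889 rows.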

# Impute missing ages with the median age of the training set
train_data$Age[is.na(train_data$Age)] <- median(train_data$Age, na.rm = TRUE)
# Fill NA embarkation ports with the most frequent port
train_data$Embarked[is.na(train_data$Embarked)] <- names(which.max(table(train_data$Embarked)))
# Reuse the training medians for the test set
test_data$Age[is.na(test_data$Age)] <- median(train_data$Age, na.rm = TRUE)
test_data$Fare[is.na(test_data$Fare)] <- median(train_data$Fare, na.rm = TRUE)

# Drop rows where Embarked is an empty string
train_data <- train_data[train_data$Embarked != "",]
test_data <- test_data[test_data$Embarked != "",]
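
Because read.csv keeps blank text fields as empty strings rather than NA, it can help to count both kinds of gaps before imputing. The following is a minimal sketch on a small made-up data frame (the `toy` object and its values are assumptions for illustration, not the Kaggle data); it counts NA values and empty strings per column and then applies median imputation to the age column.

```r
# Toy data frame standing in for train_data (values are made up for illustration)
toy <- data.frame(
  Age      = c(22, NA, 35, NA, 54),
  Embarked = c("S", "", "C", "S", ""),
  stringsAsFactors = FALSE
)

# Count NA values and empty strings per column
na_counts    <- colSums(is.na(toy))
empty_counts <- colSums(toy == "", na.rm = TRUE)

# Impute missing ages with the median of the observed ages
age_median <- median(toy$Age, na.rm = TRUE)
toy$Age[is.na(toy$Age)] <- age_median

print(na_counts)
print(empty_counts)
print(toy$Age)
```

In this toy example the gaps in Embarked show up only in the empty-string count, which is why an is.na() check alone would miss them.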

Exploratory Data Analysis

summary(train_data)

##   PassengerId     Survived          Pclass          Name          
##  Min.   :  1   Min.   :0.0000   Min.   :1.000   Length:889        
##  1st Qu.:224   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446   Mean   :0.3825   Mean   :2.312                     
##  3rd Qu.:668   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891   Max.   :1.0000   Max.   :3.000                     
##      Sex                 Age            SibSp            Parch       
##  Length:889         Min.   : 0.42   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.:22.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.0000   Median :0.0000  
##                     Mean   :29.32   Mean   :0.5242   Mean   :0.3825  
##                     3rd Qu.:35.00   3rd Qu.:1.0000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.0000   Max.   :6.0000  
##     Ticket               Fare            Cabin             Embarked        
##  Length:889         Min.   :  0.000   Length:889         Length:889        
##  Class :character   1st Qu.:  7.896   Class :character   Class :character  
##  Mode  :character   Median : 14.454   Mode  :character   Mode  :character  
##                     Mean   : 32.097                                        
##                     3rd Qu.: 31.000                                        
##                     Max.   :512.329

Check for Survival count

ggplot(train_data, aes(Survived)) + geom_bar() + labs(title = "Survival count (1 = Survived)")

## Warning in check_font_path(italic, "italic"): 'italic' should be a length-one
## vector, using the first element

# Inspect the structure of the training data
str(train_data)

Exploring the distribution of ages

ggplot(train_data, aes(Age)) +
  geom_histogram(binwidth = 5, fill = "#69b3a2", color = "#e9ecef", alpha = 0.9) +
  labs(x = "Age", y = "Count", title = "Distribution of Ages") +
  theme_minimal()

Survival rate by gender

p1 <- ggplot(train_data, aes(x = Sex, fill = as.factor(Survived))) + 
      geom_bar(position = 'fill') +
      scale_fill_brewer(palette = "Pastel1") +
      labs(y = "Proportion", 
           x = "Sex", 
           fill = "Survived",
           title = "Survival by Sex")
p1
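
The proportions that position = 'fill' draws can also be tabulated as numbers. The sketch below uses a tiny made-up data frame (the `toy` object is an assumption for illustration, not the Titanic data) and base R's table() and prop.table() to compute the share of each outcome within each sex.

```r
# Toy data standing in for train_data (values are made up for illustration)
toy <- data.frame(
  Sex      = c("male", "female", "female", "male", "male", "female"),
  Survived = c(0, 1, 1, 0, 1, 0)
)

# Cross-tabulate sex against survival
counts <- table(toy$Sex, toy$Survived)

# margin = 1 normalizes each row, so the proportions within each sex sum to 1
props <- prop.table(counts, margin = 1)

print(round(props, 2))
```

The same two calls applied to train_data$Sex and train_data$Survived would give the numbers behind the stacked bars.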

Survival rate by passenger class

p2 <- ggplot(train_data, aes(x = as.factor(Pclass), fill = as.factor(Survived))) + 
      geom_bar(position = 'fill') +
      scale_fill_brewer(palette = "Pastel1") +
      labs(y = "Proportion", 
           x = "Pclass", 
           fill = "Survived",
           title = "Survival by Passenger Class")

p2

Survival rate by Embarkation Point

p3 <- ggplot(train_data, aes(x = Embarked, fill = as.factor(Survived))) + 
      geom_bar(position = 'fill') +
      scale_fill_brewer(palette = "Pastel1") +
      labs(y = "Proportion", 
           x = "Embarked", 
           fill = "Survived",
           title = "Survival by Embarkation Point")
p3

ggplot(train_data) +
  geom_boxplot(aes(x = Sex, y = Age, fill = as.factor(Pclass)), na.rm=TRUE) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(x = "Sex", 
       y = "Age",
       fill= "Pclass", 
       title = "Age distribution by Sex and Pclass") +
  theme_minimal()

Cross Validation

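Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. Its goal is to detect problems such as overfitting and to give insight into how the model will generalize to an independent dataset. The code below uses the simplest form of this idea, a single hold-out split: 80% of the training data is used to fit the models and the remaining 20% is held back to evaluate them.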

set.seed(123)
trainIndex <- createDataPartition(train_data$Survived, p = .8, 
                                  list = FALSE, 
                                  times = 1)
cv_train <- train_data[ trainIndex,]
cv_test  <- train_data[-trainIndex,]

# Drop identifier-like columns that are not used as predictors
drop_columns <- c("PassengerId", "Name", "Ticket", "Cabin")
cv_train <- cv_train %>% select(-all_of(drop_columns))
cv_test <- cv_test %>% select(-all_of(drop_columns))
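
For comparison with the single split above, here is a minimal sketch of k-fold cross-validation in base R. It runs on simulated data and uses logistic regression (glm) purely for brevity; the `sim` data frame, the predictor `x`, and the choice of k = 5 are assumptions for illustration and are not part of this analysis. In caret, a similar scheme can be requested with trainControl(method = "cv", number = 5).

```r
set.seed(123)

# Simulated data (hypothetical): one numeric predictor and a binary outcome
n   <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(1.5 * sim$x))

# Assign each row to one of k folds at random
k       <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold
fold_accuracy <- sapply(seq_len(k), function(i) {
  fit  <- glm(y ~ x, data = sim[fold_id != i, ], family = binomial)
  prob <- predict(fit, newdata = sim[fold_id == i, ], type = "response")
  pred <- as.integer(prob > 0.5)
  mean(pred == sim$y[fold_id == i])
})

print(fold_accuracy)
print(mean(fold_accuracy))
```

Averaging over the folds gives an accuracy estimate that depends less on which rows happen to land in a single 20% hold-out set.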

Modeling

Model 1 : Naive Bayes

cv_train$Survived <- factor(cv_train$Survived, levels = c(0, 1))
cv_test$Survived <- factor(cv_test$Survived, levels = c(0, 1))

nb_model <- naiveBayes(Survived ~ ., data = cv_train)
nb_predictions <- predict(nb_model, cv_test)
confusionMatrix(nb_predictions, cv_test$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 92 29
##          1 10 46
##                                           
##                Accuracy : 0.7797          
##                  95% CI : (0.7113, 0.8384)
##     No Information Rate : 0.5763          
##     P-Value [Acc > NIR] : 1.051e-08       
##                                           
##                   Kappa : 0.5332          
##                                           
##  Mcnemar's Test P-Value : 0.003948        
##                                           
##             Sensitivity : 0.9020          
##             Specificity : 0.6133          
##          Pos Pred Value : 0.7603          
##          Neg Pred Value : 0.8214          
##              Prevalence : 0.5763          
##          Detection Rate : 0.5198          
##    Detection Prevalence : 0.6836          
##       Balanced Accuracy : 0.7576          
##                                           
##        'Positive' Class : 0               
##

Model 2: Decision Tree

dt_model <- rpart(Survived ~ ., data = cv_train, method = "class")
dt_predictions <- predict(dt_model, cv_test, type = "class")
confusionMatrix(dt_predictions, cv_test$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 97 25
##          1  5 50
##                                          
##                Accuracy : 0.8305         
##                  95% CI : (0.767, 0.8826)
##     No Information Rate : 0.5763         
##     P-Value [Acc > NIR] : 4.425e-13      
##                                          
##                   Kappa : 0.6402         
##                                          
##  Mcnemar's Test P-Value : 0.0005226      
##                                          
##             Sensitivity : 0.9510         
##             Specificity : 0.6667         
##          Pos Pred Value : 0.7951         
##          Neg Pred Value : 0.9091         
##              Prevalence : 0.5763         
##          Detection Rate : 0.5480         
##    Detection Prevalence : 0.6893         
##       Balanced Accuracy : 0.8088         
##                                          
##        'Positive' Class : 0              
##

Model 3: Random Forest

rf_model <- randomForest(Survived ~ ., data = cv_train)
rf_predictions <- predict(rf_model, cv_test)
confusionMatrix(rf_predictions, cv_test$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 91 21
##          1 11 54
##                                           
##                Accuracy : 0.8192          
##                  95% CI : (0.7545, 0.8729)
##     No Information Rate : 0.5763          
##     P-Value [Acc > NIR] : 5.336e-12       
##                                           
##                   Kappa : 0.6232          
##                                           
##  Mcnemar's Test P-Value : 0.1116          
##                                           
##             Sensitivity : 0.8922          
##             Specificity : 0.7200          
##          Pos Pred Value : 0.8125          
##          Neg Pred Value : 0.8308          
##              Prevalence : 0.5763          
##          Detection Rate : 0.5141          
##    Detection Prevalence : 0.6328          
##       Balanced Accuracy : 0.8061          
##                                           
##        'Positive' Class : 0               
##

Model Evaluation

Model 1: Naive Bayes Confusion Matrix and Metrics

The confusion matrix of the Naive Bayes model along with the calculated metrics are as follows:

nb_confusion_matrix <- confusionMatrix(nb_predictions, cv_test$Survived)
nb_confusion_matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 92 29
##          1 10 46
##                                           
##                Accuracy : 0.7797          
##                  95% CI : (0.7113, 0.8384)
##     No Information Rate : 0.5763          
##     P-Value [Acc > NIR] : 1.051e-08       
##                                           
##                   Kappa : 0.5332          
##                                           
##  Mcnemar's Test P-Value : 0.003948        
##                                           
##             Sensitivity : 0.9020          
##             Specificity : 0.6133          
##          Pos Pred Value : 0.7603          
##          Neg Pred Value : 0.8214          
##              Prevalence : 0.5763          
##          Detection Rate : 0.5198          
##    Detection Prevalence : 0.6836          
##       Balanced Accuracy : 0.7576          
##                                           
##        'Positive' Class : 0               
##

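The model has an accuracy of 0.7797, meaning it correctly classified 77.97% of the held-out passengers (138 of 177). Because caret treats class 0 (did not survive) as the positive class here, sensitivity and precision describe how well non-survivors are identified. Other important metrics are as follows: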

Sensitivity: 0.9019608
Specificity: 0.6133333
Precision: 0.7603306
F1 Score: 0.8251121
Kappa: 0.533171
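
The metrics listed above follow directly from the four cells of the confusion matrix. As a worked sketch (the variable names tp, fn, fp and tn are chosen here for illustration), the Naive Bayes counts can be plugged into the standard definitions, keeping in mind that class 0 is the positive class:

```r
# Counts from the Naive Bayes confusion matrix; class 0 (did not survive) is "positive"
tp <- 92  # predicted 0, actual 0
fn <- 10  # predicted 1, actual 0
fp <- 29  # predicted 0, actual 1
tn <- 46  # predicted 1, actual 1

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of actual positives that were found
specificity <- tn / (tn + fp)   # share of actual negatives that were found
precision   <- tp / (tp + fp)   # share of positive predictions that were right
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision, f1 = f1), 4))
```

Rounded to four decimals, these reproduce the accuracy, sensitivity, specificity, precision and F1 values reported for this model.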

Model 2: Decision Tree Confusion Matrix and Metrics

The confusion matrix of the Decision Tree model along with the calculated metrics are as follows:

dt_confusion_matrix <- confusionMatrix(dt_predictions, cv_test$Survived)
dt_confusion_matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 97 25
##          1  5 50
##                                          
##                Accuracy : 0.8305         
##                  95% CI : (0.767, 0.8826)
##     No Information Rate : 0.5763         
##     P-Value [Acc > NIR] : 4.425e-13      
##                                          
##                   Kappa : 0.6402         
##                                          
##  Mcnemar's Test P-Value : 0.0005226      
##                                          
##             Sensitivity : 0.9510         
##             Specificity : 0.6667         
##          Pos Pred Value : 0.7951         
##          Neg Pred Value : 0.9091         
##              Prevalence : 0.5763         
##          Detection Rate : 0.5480         
##    Detection Prevalence : 0.6893         
##       Balanced Accuracy : 0.8088         
##                                          
##        'Positive' Class : 0              
##

The model has an accuracy of 0.8305, meaning it correctly classified 83.05% of the held-out passengers (147 of 177). Other important metrics are as follows:

Sensitivity: 0.9509804
Specificity: 0.6666667
Precision: 0.795082
F1 Score: 0.8660714
Kappa: 0.6402439

Model 3: Random Forest Confusion Matrix and Metrics

The confusion matrix of the Random Forest model along with the calculated metrics are as follows:

rf_confusion_matrix <- confusionMatrix(rf_predictions, cv_test$Survived)
rf_confusion_matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 91 21
##          1 11 54
##                                           
##                Accuracy : 0.8192          
##                  95% CI : (0.7545, 0.8729)
##     No Information Rate : 0.5763          
##     P-Value [Acc > NIR] : 5.336e-12       
##                                           
##                   Kappa : 0.6232          
##                                           
##  Mcnemar's Test P-Value : 0.1116          
##                                           
##             Sensitivity : 0.8922          
##             Specificity : 0.7200          
##          Pos Pred Value : 0.8125          
##          Neg Pred Value : 0.8308          
##              Prevalence : 0.5763          
##          Detection Rate : 0.5141          
##    Detection Prevalence : 0.6328          
##       Balanced Accuracy : 0.8061          
##                                           
##        'Positive' Class : 0               
##

The model has an accuracy of 0.8192, meaning it correctly classified 81.92% of the held-out passengers (145 of 177). Other important metrics are as follows:

Sensitivity: 0.8921569
Specificity: 0.72
Precision: 0.8125
F1 Score: 0.8504673
Kappa: 0.6231537

Conclusion

From the analysis of the three models: Naive Bayes, Decision Tree, and Random Forest, we can make the following observations:

The Decision Tree model has the highest accuracy of 83.05%, slightly higher than the Random Forest model’s accuracy of 81.92% and much higher than the Naive Bayes model’s accuracy of 77.97%.
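
To judge how much weight these accuracy gaps can bear, one option is to put a confidence interval around each figure. The sketch below assumes an exact binomial interval from base R's binom.test, which should agree with the 95% CI lines in the confusion matrix output up to rounding; the counts of correct predictions are read off the diagonals of the three confusion matrices, and the object names are chosen here for illustration.

```r
# Correct predictions (diagonal sums) out of 177 held-out passengers
correct   <- c(naive_bayes = 138, decision_tree = 147, random_forest = 145)
n_holdout <- 177

# Exact (Clopper-Pearson) 95% confidence interval for each accuracy
ci <- sapply(correct, function(k) binom.test(k, n_holdout)$conf.int)
rownames(ci) <- c("lower", "upper")

print(round(correct / n_holdout, 4))
print(round(ci, 4))
```

If the intervals overlap, as they do in the reported output, the hold-out set is too small to separate the models on accuracy alone.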

In terms of sensitivity, the Decision Tree model again tops the list with a score of 0.95, followed by the Naive Bayes model with 0.90 and the Random Forest model with 0.89. Sensitivity measures the proportion of actual positives that are correctly identified; because the positive class here is 0 (did not survive), it reflects how well each model identifies passengers who died.

However, when we consider specificity, which measures the proportion of actual negatives (here, survivors) that are correctly identified, the Random Forest model outperforms the others with a score of 0.72, followed by the Decision Tree model with a score of 0.67, and the Naive Bayes model with 0.61.

In terms of precision, which is the proportion of positive predictions that were actually correct, the Random Forest model performs best with a score of 0.81, followed closely by the Decision Tree model with 0.80 and the Naive Bayes model with 0.76.

Finally, when considering Kappa, which is a measure of how much better the classifier is performing over the performance of a classifier that simply guesses at random according to the frequency of each class, the Decision Tree model has the highest score of 0.64, followed by the Random Forest model with a score of 0.62 and the Naive Bayes model with 0.53.
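
As a worked sketch of that definition, Cohen's Kappa can be computed from a confusion matrix as (observed agreement - expected agreement) / (1 - expected agreement), where the expected agreement comes from the row and column totals. The helper name cohen_kappa is hypothetical and written here for illustration; the counts are those of the three confusion matrices in this report.

```r
# Cohen's Kappa from a square confusion matrix (rows = predicted, columns = reference)
cohen_kappa <- function(m) {
  n        <- sum(m)
  observed <- sum(diag(m)) / n                      # plain accuracy
  expected <- sum(rowSums(m) * colSums(m)) / n^2    # agreement expected by chance
  (observed - expected) / (1 - expected)
}

# matrix() fills column by column: first the "actual 0" column, then "actual 1"
nb <- matrix(c(92, 10, 29, 46), nrow = 2)
dt <- matrix(c(97,  5, 25, 50), nrow = 2)
rf <- matrix(c(91, 11, 21, 54), nrow = 2)

kappas <- c(naive_bayes   = cohen_kappa(nb),
            decision_tree = cohen_kappa(dt),
            random_forest = cohen_kappa(rf))
print(round(kappas, 4))
```

The results match the Kappa values reported by confusionMatrix for the three models.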

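Therefore, based on these metrics, the Decision Tree model seems to give the best overall performance: it leads on accuracy, sensitivity, and Kappa, while the Random Forest model leads on specificity and precision. The gaps are small, however. The 95% confidence intervals for accuracy overlap across all three models on this 177-passenger hold-out set, so the ranking should be treated as tentative. The choice of model would depend on the needs of your prediction task. For instance, if specificity were of paramount importance, the Random Forest model may be the preferred choice. The performance of the models may also be improved with further parameter tuning and k-fold cross-validation.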

As with any machine learning task, it’s important to remember that the choice of model can significantly affect the outcome, and that different models may be more suited to different tasks depending on the specifics of the data and the task at hand.

Reference

Kaggle: The Titanic Dataset