In this project, our primary objective is to build a model that can accurately predict whether a passenger on the Titanic would have survived or not. This is a binary classification task: the target variable, Survived, is 1 for passengers who survived and 0 for those who did not.
The dataset we are working with comes from the Kaggle competition, Titanic: Machine Learning from Disaster. It includes passenger data from the Titanic, such as age, gender, passenger class, fare paid, number of siblings/spouses aboard, and whether or not the passenger survived.
The business goal of this project is to understand the factors contributing to survival rate on the Titanic and to build a model that can accurately predict survival outcomes. This model, while based on historical data, can provide insights for safety and survival analysis in contemporary travel scenarios, particularly in the shipping and cruise industries.
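In this section, the libraries needed for the analysis are loaded: tidyverse for data manipulation and visualization, RColorBrewer for color palettes, caret for data partitioning and model evaluation, e1071 for the Naive Bayes model, rpart for the Decision Tree model, and randomForest for the Random Forest model. The three Kaggle files (train.csv, test.csv, and gender_submission.csv) are then read in and inspected.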
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(e1071)
library(rpart)
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.1
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
train_data <- read.csv("./data/train.csv")
test_data <- read.csv("./data/test.csv")
gender_submission <- read.csv("./data/gender_submission.csv")
glimpse(train_data)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
glimpse(test_data)
## Rows: 418
## Columns: 11
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903…
## $ Pclass <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3…
## $ Name <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)", "M…
## $ Sex <chr> "male", "female", "male", "male", "female", "male", "femal…
## $ Age <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 21.0…
## $ SibSp <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ticket <chr> "330911", "363272", "240276", "315154", "3101298", "7538",…
## $ Fare <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292, 2…
## $ Cabin <chr> "", "", "", "", "", "", "", "", "", "", "", "", "B45", "",…
## $ Embarked <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S", "S"…
glimpse(gender_submission)
## Rows: 418
## Columns: 2
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903…
## $ Survived <int> 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1…
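In this part of the analysis, the data is cleaned and transformed to facilitate further analysis. Missing values in the ‘Age’ column of both datasets are imputed with the training-set median age, and any missing ‘Fare’ values in the test set are imputed with the training-set median fare. Rows with an empty ‘Embarked’ value are dropped, which leaves 889 of the 891 training rows. This step matters because some of the models used later, such as randomForest, stop with an error by default when a predictor contains missing values.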
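# Impute missing ages with the training-set median.
train_data$Age[is.na(train_data$Age)] <- median(train_data$Age, na.rm = TRUE)
# read.csv() reads a blank port as "" rather than NA, so this line matches nothing here.
train_data$Embarked[is.na(train_data$Embarked)] <- names(which.max(table(train_data$Embarked)))
# Reuse the training-set medians for the test set.
test_data$Age[is.na(test_data$Age)] <- median(train_data$Age, na.rm = TRUE)
test_data$Fare[is.na(test_data$Fare)] <- median(train_data$Fare, na.rm = TRUE)
# Drop rows whose embarkation port is empty.
train_data <- train_data[train_data$Embarked != "",]
test_data <- test_data[test_data$Embarked != "",]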
summary(train_data)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:889
## 1st Qu.:224 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446 Median :0.0000 Median :3.000 Mode :character
## Mean :446 Mean :0.3825 Mean :2.312
## 3rd Qu.:668 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891 Max. :1.0000 Max. :3.000
## Sex Age SibSp Parch
## Length:889 Min. : 0.42 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.:22.00 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.0000 Median :0.0000
## Mean :29.32 Mean :0.5242 Mean :0.3825
## 3rd Qu.:35.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.0000 Max. :6.0000
## Ticket Fare Cabin Embarked
## Length:889 Min. : 0.000 Length:889 Length:889
## Class :character 1st Qu.: 7.896 Class :character Class :character
## Mode :character Median : 14.454 Mode :character Mode :character
## Mean : 32.097
## 3rd Qu.: 31.000
## Max. :512.329
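Before imputing, it helps to see where values are actually missing. The sketch below uses a small made-up data frame (the `toy` values are not from the Titanic files) to show that `is.na()` does not flag empty strings, which is how `read.csv()` represents a blank character field by default. It also shows, as an assumption about what one might prefer, a most-frequent-value fill as an alternative to dropping those rows.

```r
# Made-up stand-in for two Titanic columns; values are illustrative only.
toy <- data.frame(
  Age = c(22, NA, 35, NA),
  Embarked = c("S", "", "C", "S"),
  stringsAsFactors = FALSE
)

# Count NA values per column.
na_counts <- colSums(is.na(toy))

# Count empty strings per column (only character columns can contain them).
empty_counts <- sapply(toy, function(col) {
  if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L
})

# Median imputation for the numeric column.
toy$Age[is.na(toy$Age)] <- median(toy$Age, na.rm = TRUE)

# Most-frequent-value imputation for the character column, ignoring blanks.
known_ports <- toy$Embarked[toy$Embarked != ""]
toy$Embarked[toy$Embarked == ""] <- names(which.max(table(known_ports)))

print(na_counts)
print(empty_counts)
print(toy)
```

Counting blanks separately from `NA` makes it clear which imputation lines actually change anything.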
ggplot(train_data, aes(Survived)) + geom_bar() + labs(title = "Survival count (1 = Survived)")
## Warning in check_font_path(italic, "italic"): 'italic' should be a length-one
## vector, using the first element
str(train_data)
ggplot(train_data, aes(Age)) +
geom_histogram(binwidth = 5, fill = "#69b3a2", color = "#e9ecef", alpha = 0.9) +
labs(x = "Age", y = "Count", title = "Distribution of Ages") +
theme_minimal()
p1 <- ggplot(train_data, aes(x = Sex, fill = as.factor(Survived))) +
geom_bar(position = 'fill') +
scale_fill_brewer(palette = "Pastel1") +
labs(y = "Proportion",
x = "Sex",
fill = "Survived",
title = "Survival by Sex")
p1
p2 <- ggplot(train_data, aes(x = as.factor(Pclass), fill = as.factor(Survived))) +
geom_bar(position = 'fill') +
scale_fill_brewer(palette = "Pastel1") +
labs(y = "Proportion",
x = "Pclass",
fill = "Survived",
title = "Survival by Passenger Class")
p2
p3 <- ggplot(train_data, aes(x = Embarked, fill = as.factor(Survived))) +
geom_bar(position = 'fill') +
scale_fill_brewer(palette = "Pastel1") +
labs(y = "Proportion",
x = "Embarked",
fill = "Survived",
title = "Survival by Embarkation Point")
p3
ggplot(train_data) +
geom_boxplot(aes(x = Sex, y = Age, fill = as.factor(Pclass)), na.rm=TRUE) +
scale_fill_brewer(palette = "Pastel1") +
labs(x = "Sex",
y = "Age",
fill= "Pclass",
title = "Age distribution by Sex and Pclass") +
theme_minimal()
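The proportion bars in the plots above can also be read as numbers. The following is a minimal sketch on a made-up data frame (the `toy_passengers` values are invented for illustration, not taken from train.csv) showing how `prop.table()` turns a cross-tabulation into per-group survival rates.

```r
# Made-up passengers; values are illustrative only.
toy_passengers <- data.frame(
  Sex = c("female", "female", "female", "male", "male", "male", "male"),
  Survived = c(1, 1, 0, 0, 0, 1, 0)
)

# Cross-tabulate, then normalise each row so it sums to 1.
counts <- table(toy_passengers$Sex, toy_passengers$Survived)
rates <- prop.table(counts, margin = 1)

print(round(rates, 3))
```

Using `margin = 1` normalises within each row (each sex), which corresponds to `position = 'fill'` in the bar charts.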
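Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. Its goal is to limit problems like overfitting and to give insight into how the model will generalize to an independent dataset. This section uses the simplest form of that idea, a single hold-out split: 80% of the training data is used to fit each model and the remaining 20% is held out for evaluation. Identifier-like columns (PassengerId, Name, Ticket, Cabin) are dropped, and Survived is converted to a factor so that the models treat the task as classification.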
set.seed(123)
trainIndex <- createDataPartition(train_data$Survived, p = .8,
list = FALSE,
times = 1)
cv_train <- train_data[ trainIndex,]
cv_test <- train_data[-trainIndex,]
drop_columns <- c("PassengerId", "Name", "Ticket", "Cabin")
cv_train <- cv_train %>% select(-any_of(drop_columns))
cv_test <- cv_test %>% select(-any_of(drop_columns))
cv_train$Survived <- factor(cv_train$Survived, levels = c(0, 1))
cv_test$Survived <- factor(cv_test$Survived, levels = c(0, 1))
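For comparison with the single hold-out split above, k-fold cross-validation could be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this report: the data are simulated (`sim` is not Titanic data), the model is a plain logistic regression via `glm()` standing in for the classifiers used later, and `k = 5` is an arbitrary choice.

```r
set.seed(42)

# Simulated binary outcome (not Titanic data): y depends on x through a logistic link.
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(1.5 * sim$x))

# Randomly assign each row to one of k folds of equal size.
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, then score on the held-out fold.
fold_accuracy <- sapply(seq_len(k), function(i) {
  train_part <- sim[fold_id != i, ]
  test_part <- sim[fold_id == i, ]
  fit <- glm(y ~ x, data = train_part, family = binomial)
  pred <- as.integer(predict(fit, newdata = test_part, type = "response") > 0.5)
  mean(pred == test_part$y)
})

print(fold_accuracy)
print(mean(fold_accuracy))
```

Averaging over folds gives an estimate that depends less on one particular split. With caret already loaded, `trainControl(method = "cv", number = 5)` passed to `train()` expresses the same idea without the manual loop.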
nb_model <- naiveBayes(Survived ~ ., data = cv_train)
nb_predictions <- predict(nb_model, cv_test)
confusionMatrix(nb_predictions, cv_test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 92 29
## 1 10 46
##
## Accuracy : 0.7797
## 95% CI : (0.7113, 0.8384)
## No Information Rate : 0.5763
## P-Value [Acc > NIR] : 1.051e-08
##
## Kappa : 0.5332
##
## Mcnemar's Test P-Value : 0.003948
##
## Sensitivity : 0.9020
## Specificity : 0.6133
## Pos Pred Value : 0.7603
## Neg Pred Value : 0.8214
## Prevalence : 0.5763
## Detection Rate : 0.5198
## Detection Prevalence : 0.6836
## Balanced Accuracy : 0.7576
##
## 'Positive' Class : 0
##
dt_model <- rpart(Survived ~ ., data = cv_train, method = "class")
dt_predictions <- predict(dt_model, cv_test, type = "class")
confusionMatrix(dt_predictions, cv_test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 97 25
## 1 5 50
##
## Accuracy : 0.8305
## 95% CI : (0.767, 0.8826)
## No Information Rate : 0.5763
## P-Value [Acc > NIR] : 4.425e-13
##
## Kappa : 0.6402
##
## Mcnemar's Test P-Value : 0.0005226
##
## Sensitivity : 0.9510
## Specificity : 0.6667
## Pos Pred Value : 0.7951
## Neg Pred Value : 0.9091
## Prevalence : 0.5763
## Detection Rate : 0.5480
## Detection Prevalence : 0.6893
## Balanced Accuracy : 0.8088
##
## 'Positive' Class : 0
##
rf_model <- randomForest(Survived ~ ., data = cv_train)
rf_predictions <- predict(rf_model, cv_test)
confusionMatrix(rf_predictions, cv_test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 91 21
## 1 11 54
##
## Accuracy : 0.8192
## 95% CI : (0.7545, 0.8729)
## No Information Rate : 0.5763
## P-Value [Acc > NIR] : 5.336e-12
##
## Kappa : 0.6232
##
## Mcnemar's Test P-Value : 0.1116
##
## Sensitivity : 0.8922
## Specificity : 0.7200
## Pos Pred Value : 0.8125
## Neg Pred Value : 0.8308
## Prevalence : 0.5763
## Detection Rate : 0.5141
## Detection Prevalence : 0.6328
## Balanced Accuracy : 0.8061
##
## 'Positive' Class : 0
##
The confusion matrix of the Naive Bayes model, along with the calculated metrics, is shown below:
nb_confusion_matrix <- confusionMatrix(nb_predictions, cv_test$Survived)
nb_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 92 29
## 1 10 46
##
## Accuracy : 0.7797
## 95% CI : (0.7113, 0.8384)
## No Information Rate : 0.5763
## P-Value [Acc > NIR] : 1.051e-08
##
## Kappa : 0.5332
##
## Mcnemar's Test P-Value : 0.003948
##
## Sensitivity : 0.9020
## Specificity : 0.6133
## Pos Pred Value : 0.7603
## Neg Pred Value : 0.8214
## Prevalence : 0.5763
## Detection Rate : 0.5198
## Detection Prevalence : 0.6836
## Balanced Accuracy : 0.7576
##
## 'Positive' Class : 0
##
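The model has an accuracy of 0.7797, meaning that it correctly classified 77.97% of the held-out set (138 of 177 passengers). Other important metrics from the output above: sensitivity 0.9020, specificity 0.6133, positive predictive value 0.7603, and Kappa 0.5332. Note that caret treats class 0 (did not survive) as the positive class.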
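The metrics in the Naive Bayes output above can be recomputed by hand from the four cell counts, which makes their definitions concrete. The sketch below copies the counts from that confusion matrix and follows the output's convention that class 0 is the positive class; the variable names are illustrative.

```r
# Cell counts from the Naive Bayes confusion matrix (rows = prediction, columns = reference).
cm <- matrix(c(92, 10, 29, 46), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Class "0" is the positive class, as in the caret output.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # actual 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # actual 1s predicted as 1
ppv <- cm["0", "0"] / sum(cm["0", ])          # predicted 0s that are truly 0
npv <- cm["1", "1"] / sum(cm["1", ])          # predicted 1s that are truly 1

# Cohen's Kappa: observed agreement corrected for agreement expected by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              kappa = cohen_kappa), 4))
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value, and Kappa lines printed by `confusionMatrix()`.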
The confusion matrix of the Decision Tree model, along with the calculated metrics, is shown below:
dt_confusion_matrix <- confusionMatrix(dt_predictions, cv_test$Survived)
dt_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 97 25
## 1 5 50
##
## Accuracy : 0.8305
## 95% CI : (0.767, 0.8826)
## No Information Rate : 0.5763
## P-Value [Acc > NIR] : 4.425e-13
##
## Kappa : 0.6402
##
## Mcnemar's Test P-Value : 0.0005226
##
## Sensitivity : 0.9510
## Specificity : 0.6667
## Pos Pred Value : 0.7951
## Neg Pred Value : 0.9091
## Prevalence : 0.5763
## Detection Rate : 0.5480
## Detection Prevalence : 0.6893
## Balanced Accuracy : 0.8088
##
## 'Positive' Class : 0
##
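The model has an accuracy of 0.8305, meaning that it correctly classified 83.05% of the held-out set (147 of 177 passengers). Other important metrics from the output above: sensitivity 0.9510, specificity 0.6667, positive predictive value 0.7951, and Kappa 0.6402.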
The confusion matrix of the Random Forest model, along with the calculated metrics, is shown below:
rf_confusion_matrix <- confusionMatrix(rf_predictions, cv_test$Survived)
rf_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 91 21
## 1 11 54
##
## Accuracy : 0.8192
## 95% CI : (0.7545, 0.8729)
## No Information Rate : 0.5763
## P-Value [Acc > NIR] : 5.336e-12
##
## Kappa : 0.6232
##
## Mcnemar's Test P-Value : 0.1116
##
## Sensitivity : 0.8922
## Specificity : 0.7200
## Pos Pred Value : 0.8125
## Neg Pred Value : 0.8308
## Prevalence : 0.5763
## Detection Rate : 0.5141
## Detection Prevalence : 0.6328
## Balanced Accuracy : 0.8061
##
## 'Positive' Class : 0
##
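The model has an accuracy of 0.8192, meaning that it correctly classified 81.92% of the held-out set (145 of 177 passengers). Other important metrics from the output above: sensitivity 0.8922, specificity 0.7200, positive predictive value 0.8125, and Kappa 0.6232.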
From the analysis of the three models: Naive Bayes, Decision Tree, and Random Forest, we can make the following observations:
The Decision Tree model has the highest accuracy of 83.05%, slightly higher than the Random Forest model’s accuracy of 81.92% and much higher than the Naive Bayes model’s accuracy of 77.97%.
In terms of sensitivity, the Decision Tree model again tops the list with a score of 0.95, followed by the Naive Bayes model with 0.90 and the Random Forest model with 0.89. Because caret treats class 0 (did not survive) as the positive class here, sensitivity measures the proportion of actual non-survivors that are correctly identified.
However, when we consider specificity, which here measures the proportion of actual survivors that are correctly identified, the Random Forest model outperforms the others with a score of 0.72, followed by the Decision Tree model with 0.67 and the Naive Bayes model with 0.61.
In terms of precision (reported as Pos Pred Value), which is the proportion of positive predictions that were actually correct, the Random Forest model performs best with a score of 0.81, followed closely by the Decision Tree model with 0.80 and the Naive Bayes model with 0.76.
Finally, when considering Kappa, which measures how much better the classifier performs than one that simply guesses at random according to the frequency of each class, the Decision Tree model has the highest score of 0.64, followed by the Random Forest model with 0.62 and the Naive Bayes model with 0.53.
Therefore, based on these metrics, the Decision Tree model seems to give the best overall performance on this split: it leads on accuracy, sensitivity, and Kappa. However, the choice of model depends on the needs of the prediction task. For instance, if specificity or precision is of paramount importance, the Random Forest model may be the preferred choice. The differences are also small relative to the 177-passenger evaluation set (the 95% confidence intervals for accuracy overlap across all three models), so the ranking could change with a different split. Performance may be improved with further parameter tuning, and k-fold cross-validation would give a more stable comparison.
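To gauge how firmly the accuracy ranking holds, the confidence intervals in the outputs can be reproduced from the counts of correct predictions. The sketch below assumes the reported intervals are exact binomial (Clopper-Pearson) intervals, which is what base R's `binom.test()` returns; the counts are read off the three confusion matrices, and the variable names are illustrative.

```r
# Correct predictions per model (diagonal sums of the confusion matrices) and evaluation-set size.
correct <- c(naive_bayes = 92 + 46, decision_tree = 97 + 50, random_forest = 91 + 54)
n_eval <- 177

# Accuracy with an exact 95% binomial confidence interval for each model.
acc_ci <- t(sapply(correct, function(k) {
  ci <- binom.test(k, n_eval)$conf.int
  c(accuracy = k / n_eval, lower = ci[1], upper = ci[2])
}))
print(round(acc_ci, 4))

# The intervals all overlap if the largest lower bound is below the smallest upper bound.
intervals_overlap <- max(acc_ci[, "lower"]) < min(acc_ci[, "upper"])
print(intervals_overlap)
```

Overlapping intervals mean a hold-out set of this size cannot clearly separate the three models on accuracy alone.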
As with any machine learning task, it’s important to remember that the choice of model can significantly affect the outcome, and that different models may be more suited to different tasks depending on the specifics of the data and the task at hand.