The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the deaths of 1502 of the 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this project, we will build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data from Kaggle. The algorithms we will use are Naive Bayes, Decision Tree, and Random Forest.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(e1071)
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
library(ROCR)
library(partykit)
#> Loading required package: grid
#> Loading required package: libcoin
#> Loading required package: mvtnorm
titanic <- read.csv('data_input/train.csv')
glimpse(titanic)
#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
#> $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
#> $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
#> $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
#> $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal~
#> $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
#> $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
#> $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
#> $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
#> $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
#> $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
summary(titanic)
#> PassengerId Survived Pclass Name
#> Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
#> 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
#> Median :446.0 Median :0.0000 Median :3.000 Mode :character
#> Mean :446.0 Mean :0.3838 Mean :2.309
#> 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
#> Max. :891.0 Max. :1.0000 Max. :3.000
#>
#> Sex Age SibSp Parch
#> Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
#> Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
#> Mode :character Median :28.00 Median :0.000 Median :0.0000
#> Mean :29.70 Mean :0.523 Mean :0.3816
#> 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
#> Max. :80.00 Max. :8.000 Max. :6.0000
#> NA's :177
#> Ticket Fare Cabin Embarked
#> Length:891 Min. : 0.00 Length:891 Length:891
#> Class :character 1st Qu.: 7.91 Class :character Class :character
#> Mode :character Median : 14.45 Mode :character Mode :character
#> Mean : 32.20
#> 3rd Qu.: 31.00
#> Max. :512.33
#>
For Parch, every value from the minimum through the third quartile is 0, meaning the column carries little variation and is not very informative (it may not be used later in modeling).
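For instance, a quick tabulation of Parch makes this concrete (a minimal check on the same titanic data frame; output not shown):
# most passengers travelled with no parents/children aboard
table(titanic$Parch)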
colSums(is.na(titanic))
#> PassengerId Survived Pclass Name Sex Age
#> 0 0 0 0 0 177
#> SibSp Parch Ticket Fare Cabin Embarked
#> 0 0 0 0 0 0
The Age column has 177 missing values; we will fill them with the median value of Age.
Titanic data glossary:
- PassengerId : unique ID number for each Titanic passenger (1 to 891)
- Survived : survival (0 = No, 1 = Yes)
- Pclass : ticket class (1 = Upper, 2 = Middle, 3 = Lower)
- Name : name of each Titanic passenger
- Sex : female or male
- Age : age in years (up to 80)
- SibSp : number of siblings / spouses aboard the Titanic (up to 8)
- Parch : number of parents / children aboard the Titanic (up to 6)
- Ticket : ticket number
- Fare : passenger fare (up to 512.33)
- Cabin : cabin number
- Embarked : port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Data summary:
- The data has 891 rows and 12 columns.
- We can ignore PassengerId, Ticket, and Name because they are identifiers rather than categories or numbers that can affect the model.
- Our target variable is Survived.
- Survived, Pclass, and Sex can be converted to factors.
titanic <- titanic %>%
mutate(Embarked_S = ifelse(Embarked == "S", 1, 0), #create column Embarked_S
Embarked_C = ifelse(Embarked == "C", 1, 0), #create column Embarked_C
Embarked_Q = ifelse(Embarked == "Q", 1, 0), #create column Embarked_Q
Sex = ifelse(Sex == "female", 0, 1),
Age = replace_na(Age, replace = round(median(Age, na.rm = T))), # fill missing values with the median
Age = as.integer(Age)) %>%
dplyr::select(-c(PassengerId, Ticket, Cabin, Name, Embarked)) %>%
mutate_at(c('Survived', 'Pclass', 'Sex', 'Embarked_S', 'Embarked_C', 'Embarked_Q'), as.factor)
glimpse(titanic)
#> Rows: 891
#> Columns: 10
#> $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,~
#> $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3,~
#> $ Sex <fct> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,~
#> $ Age <int> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14, 5~
#> $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0,~
#> $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0,~
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, ~
#> $ Embarked_S <fct> 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,~
#> $ Embarked_C <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,~
#> $ Embarked_Q <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,~
anyNA(titanic)
#> [1] FALSE
prop.table(table(titanic$Survived))
#>
#> 0 1
#> 0.6161616 0.3838384
The target variable has a fairly balanced proportion, so it doesn’t need additional pre-processing.
Data Splitting.
RNGkind(sample.kind = "Rounding")
set.seed(303)
index <- sample(nrow(titanic), nrow(titanic)*0.7)
#Data Splitting
t_train <- titanic[index,]
t_test <- titanic[-index,]
label_ttrain <- t_train[, "Survived"]
label_ttest <- t_test[, "Survived"]
Target Variable Data Train Proportion
prop.table(table(label_ttrain))
#> label_ttrain
#> 0 1
#> 0.5971108 0.4028892
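As a quick sanity check, the test split can be inspected the same way (output not shown):
prop.table(table(label_ttest))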
model_naive <- naiveBayes(Survived ~ ., data = t_train)
preds_naive <- predict(model_naive, newdata = t_test)
head(data.frame(actual = label_ttest, prediction = preds_naive))
eval_naive <- confusionMatrix(data = as.factor(preds_naive),
reference = as.factor(label_ttest),
positive = "1")
eval_naive
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 156 32
#> 1 21 59
#>
#> Accuracy : 0.8022
#> 95% CI : (0.7494, 0.8482)
#> No Information Rate : 0.6604
#> P-Value [Acc > NIR] : 0.00000022
#>
#> Kappa : 0.5457
#>
#> Mcnemar's Test P-Value : 0.1696
#>
#> Sensitivity : 0.6484
#> Specificity : 0.8814
#> Pos Pred Value : 0.7375
#> Neg Pred Value : 0.8298
#> Prevalence : 0.3396
#> Detection Rate : 0.2201
#> Detection Prevalence : 0.2985
#> Balanced Accuracy : 0.7649
#>
#> 'Positive' Class : 1
#>
# prediction with probability output
naive_pprob <- predict(model_naive, newdata = t_test, type = "raw")
# creating ROC data frame
data_roc <- data.frame(pred_proba = naive_pprob[, '1'],
actual = ifelse(t_test$Survived == "1", 1, 0))
head(data_roc)
# create prediction object
naive_roc <- prediction(predictions = data_roc$pred_proba,
labels = data_roc$actual)
# ROC curve
plot(performance(naive_roc, "tpr", "fpr"),
main="ROC")
abline(a=0, b=1, col='red') # diagonal line
auc_naive <- performance(prediction.obj = naive_roc, measure = "auc")
auc_naive@y.values # AUC value from ROC plot
#> [[1]]
#> [1] 0.8123176
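One easy variant to try (not explored here) is Laplace smoothing, which guards against zero conditional probabilities when a category level never co-occurs with a class; a minimal sketch:
# a sketch: refit naive bayes with Laplace smoothing
model_naive_l <- naiveBayes(Survived ~ ., data = t_train, laplace = 1)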
model_dt <- ctree(Survived ~ ., data = t_train)
plot(model_dt, type="simple", gp = gpar(fontsize = 8))
From the model, we learn that the classification is based on the variables Sex, Pclass, Embarked_S, Fare, and Age.
preds_dt <- predict(model_dt, newdata = t_test, type = "response")
head(preds_dt)
#> 2 4 6 9 15 23
#> 1 1 0 1 1 1
#> Levels: 0 1
eval_dt <- confusionMatrix(data = preds_dt, reference = label_ttest, positive = "1")
eval_dt
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 155 24
#> 1 22 67
#>
#> Accuracy : 0.8284
#> 95% CI : (0.7778, 0.8715)
#> No Information Rate : 0.6604
#> P-Value [Acc > NIR] : 0.0000000006551
#>
#> Kappa : 0.6153
#>
#> Mcnemar's Test P-Value : 0.8828
#>
#> Sensitivity : 0.7363
#> Specificity : 0.8757
#> Pos Pred Value : 0.7528
#> Neg Pred Value : 0.8659
#> Prevalence : 0.3396
#> Detection Rate : 0.2500
#> Detection Prevalence : 0.3321
#> Balanced Accuracy : 0.8060
#>
#> 'Positive' Class : 1
#>
# prediction with probability output
dt_pprob <- predict(model_dt, newdata = t_test, type = "prob")
# creating ROC data frame
data2_roc <- data.frame(pred_proba = dt_pprob[, '1'],
actual = ifelse(t_test$Survived == "1", 1, 0))
#create prediction object
dt_roc <- prediction(predictions = data2_roc$pred_proba,
labels = data2_roc$actual)
# ROC curve
plot(performance(dt_roc, "tpr", "fpr"),
main="ROC")
abline(a=0, b=1, col='red') # diagonal line
auc_dt <- performance(prediction.obj = dt_roc, measure = "auc")
auc_dt@y.values # AUC value from ROC plot
#> [[1]]
#> [1] 0.8603092
Check for columns that have near-zero variance (we can delete near-zero variance columns, as they are less informative).
nearZeroVar(t_train)
#> integer(0)
We can see that we have no near-zero variance data.
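If we want more detail than just the offending column indices, caret can also report the underlying metrics (frequency ratio and percentage of unique values) for every column; a minimal sketch:
# inspect freqRatio, percentUnique, and the nzv flag for each column
nearZeroVar(t_train, saveMetrics = TRUE)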
RNGkind(sample.kind = "Rounding")
set.seed(212)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_forest <- train(Survived ~ ., data = t_train, method = "rf", trControl = ctrl)
model_forest
#> Random Forest
#>
#> 623 samples
#> 9 predictor
#> 2 classes: '0', '1'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times)
#> Summary of sample sizes: 498, 499, 498, 499, 498, 498, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.8116946 0.5929486
#> 6 0.7966968 0.5746334
#> 10 0.7779699 0.5360043
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
From the model summary, we learn that the optimal number of variables considered for splitting at each tree node (mtry) is 2; a sketch for testing other values explicitly follows the importance table below. We can also inspect the importance of each variable used in our random forest with varImp().
varImp(model_forest)
#> rf variable importance
#>
#> Overall
#> Sex1 100.000
#> Fare 53.841
#> Age 37.394
#> Pclass3 27.839
#> SibSp 12.102
#> Parch 9.964
#> Embarked_S1 4.911
#> Pclass2 4.315
#> Embarked_C1 3.409
#> Embarked_Q1 0.000
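If we wanted caret to explore hand-picked mtry values rather than its default grid, we could pass a custom tuneGrid; a minimal sketch (the candidate values are illustrative):
# hypothetical: evaluate a hand-picked set of mtry values
grid_mtry <- expand.grid(mtry = c(2, 3, 4, 5))
model_forest_grid <- train(Survived ~ ., data = t_train, method = "rf",
                           trControl = ctrl, tuneGrid = grid_mtry)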
Even though we have split our data into train and test sets, a separate test set is not strictly necessary for a random forest. Bootstrap sampling leaves some observations out of each tree; these are called out-of-bag (OOB) data, and the model can treat them as its own internal test set (see the sketch below).
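As a sketch of reading that OOB estimate off the fitted model directly (assuming the randomForest object that caret stores in finalModel, as used below): the per-tree error trajectory lives in err.rate, and its last row holds the final estimate.
# final OOB error estimate; should match the summary printed below
tail(model_forest$finalModel$err.rate, 1)[, "OOB"]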
plot(model_forest$finalModel)
legend("topright", colnames(model_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)model_forest$finalModel#>
#> Call:
#> randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 2
#>
#> OOB estimate of error rate: 19.58%
#> Confusion matrix:
#> 0 1 class.error
#> 0 342 30 0.08064516
#> 1 92 159 0.36653386
preds_forest <- predict(model_forest, newdata = t_test, type = "raw")
head(preds_forest)
#> [1] 1 1 0 0 1 1
#> Levels: 0 1
eval_forest <- confusionMatrix(data = preds_forest, reference = label_ttest, positive = "1")
eval_forest
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 162 32
#> 1 15 59
#>
#> Accuracy : 0.8246
#> 95% CI : (0.7737, 0.8682)
#> No Information Rate : 0.6604
#> P-Value [Acc > NIR] : 0.000000001618
#>
#> Kappa : 0.5904
#>
#> Mcnemar's Test P-Value : 0.0196
#>
#> Sensitivity : 0.6484
#> Specificity : 0.9153
#> Pos Pred Value : 0.7973
#> Neg Pred Value : 0.8351
#> Prevalence : 0.3396
#> Detection Rate : 0.2201
#> Detection Prevalence : 0.2761
#> Balanced Accuracy : 0.7818
#>
#> 'Positive' Class : 1
#>
# prediction with probability output
forest_pprob <- predict(model_forest, newdata = t_test, type = "prob")
# creating ROC data frame
data3_roc <- data.frame(pred_proba = forest_pprob[, '1'],
actual = ifelse(t_test$Survived == "1", 1, 0))
#create prediction object
forest_roc <- prediction(predictions = data3_roc$pred_proba,
labels = data3_roc$actual)
# ROC curve
plot(performance(forest_roc, "tpr", "fpr"),
main="ROC")
abline(a=0, b=1, col='red') # diagonal line
auc_forest <- performance(prediction.obj = forest_roc, measure = "auc")
auc_forest@y.values # AUC value from ROC plot
#> [[1]]
#> [1] 0.8908239
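Before tabulating the metrics, we can also overlay the three ROC curves for a direct visual comparison (a minimal sketch reusing the prediction objects created above; the colors are arbitrary):
# overlay the three ROC curves on a single plot
plot(performance(naive_roc, "tpr", "fpr"), col = "blue", main = "ROC Comparison")
plot(performance(dt_roc, "tpr", "fpr"), col = "darkgreen", add = TRUE)
plot(performance(forest_roc, "tpr", "fpr"), col = "purple", add = TRUE)
abline(a = 0, b = 1, col = 'red') # diagonal line
legend("bottomright", legend = c("Naive Bayes", "Decision Tree", "Random Forest"),
       col = c("blue", "darkgreen", "purple"), lwd = 1)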
# get the accuracy, sensitivity, specificity, precision and AUC of the naive bayes model
evals_naive <- tibble(Model = "Naive Bayes",
Accuracy = eval_naive$overall[1],
Recall = eval_naive$byClass[1],
Specificity = eval_naive$byClass[2],
Precision = eval_naive$byClass[3],
AUC = auc_naive@y.values[[1]])
# get the accuracy, sensitivity, specificity, precision and AUC of the decision tree model
evals_dt <- tibble(Model = "Decision Tree",
Accuracy = eval_dt$overall[1],
Recall = eval_dt$byClass[1],
Specificity = eval_dt$byClass[2],
Precision = eval_dt$byClass[3],
AUC = auc_dt@y.values[[1]])
# get the accuracy, sensitivity, specificity, precision and AUC of the random forest model
evals_forest <- tibble(Model = "Random Forest",
Accuracy = eval_forest$overall[1],
Recall = eval_forest$byClass[1],
Specificity = eval_forest$byClass[2],
Precision = eval_forest$byClass[3],
AUC = auc_forest@y.values[[1]])
rbind(evals_naive, evals_dt, evals_forest)
As a rescue team, we need to know which passengers should have survived the Titanic sinking. Therefore, we will focus on the Recall metric, to ensure that no survivors are missed in the rescue process.
From the results above, the Decision Tree has the best recall performance, at 73.6%. Let’s try to tune the model to get better performance.
In the train dataset, the proportion of our data was around 59.71% not survived and around 40.29% survived. Let’s do down-sampling to balance the proportion of the target variable.
RNGkind(sample.kind = "Rounding")
set.seed(212)
t_train2 <- downSample(x = t_train[, -1],
y = as.factor(t_train[, 1]),
yname = "Survived")
rmarkdown::paged_table(t_train2)
Check the target variable proportion.
prop.table(table(t_train2$Survived))
#>
#> 0 1
#> 0.5 0.5
model_dt2 <- ctree(Survived ~ ., data = t_train2)
plot(model_dt2, type="simple", gp = gpar(fontsize = 8))
preds_dt2 <- predict(model_dt2, newdata = t_test, type = "response")
head(preds_dt2)
#> 2 4 6 9 15 23
#> 1 1 0 1 1 1
#> Levels: 0 1
eval_dt2 <- confusionMatrix(data = preds_dt2, reference = label_ttest, positive = "1")
eval_dt2
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 130 18
#> 1 47 73
#>
#> Accuracy : 0.7575
#> 95% CI : (0.7016, 0.8075)
#> No Information Rate : 0.6604
#> P-Value [Acc > NIR] : 0.0003713
#>
#> Kappa : 0.4981
#>
#> Mcnemar's Test P-Value : 0.0005147
#>
#> Sensitivity : 0.8022
#> Specificity : 0.7345
#> Pos Pred Value : 0.6083
#> Neg Pred Value : 0.8784
#> Prevalence : 0.3396
#> Detection Rate : 0.2724
#> Detection Prevalence : 0.4478
#> Balanced Accuracy : 0.7683
#>
#> 'Positive' Class : 1
#>
# prediction with probability output
dt_pprob2 <- predict(model_dt2, newdata = t_test, type = "prob")
# creating ROC data frame
data4_roc <- data.frame(pred_proba = dt_pprob2[, '1'],
actual = ifelse(t_test$Survived == "1", 1, 0))
#create prediction object
dt_roc2 <- prediction(predictions = data4_roc$pred_proba,
labels = data4_roc$actual)
# ROC curve
plot(performance(dt_roc2, "tpr", "fpr"),
main="ROC")
abline(a=0, b=1, col='red') # diagonal line
auc_dt2 <- performance(prediction.obj = dt_roc2, measure = "auc")
auc_dt2@y.values # AUC value from ROC plot
#> [[1]]
#> [1] 0.8335196
evals_dt2 <- tibble(Model = "DT Tune Up",
Accuracy = eval_dt2$overall[1],
Recall = eval_dt2$byClass[1],
Specificity = eval_dt2$byClass[2],
Precision = eval_dt2$byClass[3],
AUC = auc_dt2@y.values[[1]])
rbind(evals_dt, evals_dt2)
Again, we focus on Recall: as a rescue team, we want to make sure that no one who could have survived is missed in the rescue process.
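If even higher recall were needed, one further knob (not explored above) is the classification threshold: lowering the default 0.5 cutoff trades precision for recall. A minimal sketch using the down-sampled tree’s predicted probabilities (the 0.4 cutoff is purely illustrative):
# hypothetical: predict survived when P(Survived = 1) exceeds 0.4 instead of 0.5
preds_cutoff <- factor(ifelse(dt_pprob2[, '1'] > 0.4, "1", "0"), levels = c("0", "1"))
confusionMatrix(data = preds_cutoff, reference = label_ttest, positive = "1")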
Based on the performance results, the Decision Tree with down-sampling has the highest recall, at around 80.22%. In my opinion, this is adequate for predicting the victims of a shipwreck as large as the Titanic. I have also done modeling with Logistic Regression and K-Nearest Neighbor, but those models only reached a Recall of 73.6% at best (you can check my RPubs here). However, if you expect better results, you can try further model tuning, such as better data pre-processing or the threshold adjustment sketched above. Thank you.