Titanic
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on-board, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.💡 In this challenge, we have been asked to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie = age, gender, socio-economic class, etc). The Answer will be shown in Conclusion focus only on gender and age
source , kaggle’s challange : https://www.kaggle.com/c/titaniclibrary(dplyr) # function of data wrangling
library(tidyr) # function of data wrangling 2
library(e1071) #function `naiveBayes()`
library(caret) #confusion matrix
library(partykit) #function decision tree
library(randomForest) #function random forest
library(ROCR) #function ROC AUC
Read dataset = train.csv
#train.csv
<- read.csv("titanic/train.csv", stringsAsFactors = T)
titanic glimpse(titanic)
#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
#> $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
#> $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
#> $ Name <fct> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
#> $ Sex <fct> male, female, female, female, male, male, male, male, fema…
#> $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
#> $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
#> $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
#> $ Ticket <fct> A/5 21171, PC 17599, STON/O2. 3101282, 113803, 373450, 330…
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
#> $ Cabin <fct> , C85, , C123, , , E46, , , , G6, C103, , , , , , , , , , …
#> $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…
Description of variabels/columns :
Survival
0 = No, 1 = Yespclass
(Ticket class) 1 = 1st, 2 = 2nd, 3 = 3rdsex
Age
= in yearssibsp
= Number of siblings / spouses aboard the
Titanicparch
= Number of parents / children aboard the
Titanicticket
= Ticket numberfare
= Passenger farecabin
= Cabin numberembarked
= Port of Embarkation (C = Cherbourg, Q =
Queenstown, S = Southampton)Insight :
Survived
,Pclass
,Sex
,Embarked
should be change to be factor<- titanic %>%
titanic mutate_at(vars(Survived,Pclass,Sex,Embarked), as.factor) %>%
select(-c(Name,Ticket,Cabin))
glimpse(titanic)
#> Rows: 891
#> Columns: 9
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
#> $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
#> $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
#> $ Sex <fct> male, female, female, female, male, male, male, male, fema…
#> $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
#> $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
#> $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
#> $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…
Check Missing Values
colSums(is.na(titanic))
#> PassengerId Survived Pclass Sex Age SibSp
#> 0 0 0 0 177 0
#> Parch Fare Embarked
#> 0 0 0
insight :
Age
have missing values 177 rows in data set. We are
going to do a imputation missing valuesImputation missing values
hist(titanic$Age)
Insight :
age
between 20-30. We need to do a mode to fill in NA with
mode as a simply statistic’s method.sex
,age
, etc# Make a function about Mode to avoid calculate NA in data-set
<- function(x) {
Mode <- na.omit(unique(x) )
ux <- tabulate(match(x, ux)); ux[tab == max(tab) ]
tab
}
# Replace NA with Mode Values in columns Age
<- titanic %>%
titanic mutate(Age = replace_na(titanic$Age, Mode(titanic$Age)))
colSums(is.na(titanic))
#> PassengerId Survived Pclass Sex Age SibSp
#> 0 0 0 0 0 0
#> Parch Fare Embarked
#> 0 0 0
Insight :
Do EDA to know well distribution and characteristics of the data-set and make sure target of the data is balance to make machine learning models work well
summary(titanic)
#> PassengerId Survived Pclass Sex Age SibSp
#> Min. : 1.0 0:549 1:216 female:314 Min. : 0.42 Min. :0.000
#> 1st Qu.:223.5 1:342 2:184 male :577 1st Qu.:22.00 1st Qu.:0.000
#> Median :446.0 3:491 Median :24.00 Median :0.000
#> Mean :446.0 Mean :28.57 Mean :0.523
#> 3rd Qu.:668.5 3rd Qu.:35.00 3rd Qu.:1.000
#> Max. :891.0 Max. :80.00 Max. :8.000
#> Parch Fare Embarked
#> Min. :0.0000 Min. : 0.00 : 2
#> 1st Qu.:0.0000 1st Qu.: 7.91 C:168
#> Median :0.0000 Median : 14.45 Q: 77
#> Mean :0.3816 Mean : 32.20 S:644
#> 3rd Qu.:0.0000 3rd Qu.: 31.00
#> Max. :6.0000 Max. :512.33
Insight :
We need to know about target variable is already balanced or not.
prop.table(table(titanic$Survived))
#>
#> 0 1
#> 0.6161616 0.3838384
Insight :
RNGkind(sample.kind = "Rounding")
set.seed(417)
# index sampling
<- sample(x = nrow(titanic), size = nrow(titanic)*0.8)
index # sample(x = nrow(df), size = nrow(df)*0.2)
# splitting
<- titanic[index,]
titanic_train <- titanic[-index,] titanic_test
Due to our data set almost 1000 observations , we need to do up-sample to make balance of target variable
# upsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
# menggunakan dplyr
<- upSample(x = titanic_train %>% select(-c(Survived)), # prediktor
titanic_train_up y = titanic_train$Survived, # targer -> kolom diabetes
yname = "Survived") # nama kolom target
head(titanic_train_up)
prop.table(table(titanic_train_up$Survived))
#>
#> 0 1
#> 0.5 0.5
insight :
Naive Bayes is a classification method that uses the Bayes theorem which discusses the probability of dependent events. Humans, basically using a bayesian mindset, are always changing our beliefs based on new information received.
#using laplace = 1 due to there is zero value between embarked because of there is 2 records not identified
<- naiveBayes(Survived ~ ., data = titanic_train_up, laplace = 1) model_naive
#Use type = "class" due to returns its class label (default threshold 0.5)
<- predict(model_naive, newdata = titanic_test,type = "class")
preds_naive # for the probability
<- predict(model_naive, titanic_test, type = "raw") preds_naive1
confusionMatrix(data = preds_naive , #predict
reference = titanic_test$Survived , # data actual
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 93 20
#> 1 16 50
#>
#> Accuracy : 0.7989
#> 95% CI : (0.7326, 0.855)
#> No Information Rate : 0.6089
#> P-Value [Acc > NIR] : 0.00000004117
#>
#> Kappa : 0.5734
#>
#> Mcnemar's Test P-Value : 0.6171
#>
#> Sensitivity : 0.7143
#> Specificity : 0.8532
#> Pos Pred Value : 0.7576
#> Neg Pred Value : 0.8230
#> Prevalence : 0.3911
#> Detection Rate : 0.2793
#> Detection Prevalence : 0.3687
#> Balanced Accuracy : 0.7837
#>
#> 'Positive' Class : 1
#>
<- predict(model_naive, newdata = titanic_train_up,type = "class")
preds_naive_train confusionMatrix(data = preds_naive_train , #predict
reference = titanic_train_up$Survived , # data actual
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 351 131
#> 1 89 309
#>
#> Accuracy : 0.75
#> 95% CI : (0.72, 0.7783)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.5
#>
#> Mcnemar's Test P-Value : 0.005706
#>
#> Sensitivity : 0.7023
#> Specificity : 0.7977
#> Pos Pred Value : 0.7764
#> Neg Pred Value : 0.7282
#> Prevalence : 0.5000
#> Detection Rate : 0.3511
#> Detection Prevalence : 0.4523
#> Balanced Accuracy : 0.7500
#>
#> 'Positive' Class : 1
#>
For your information :
Insight :
ROC is a curve that describes the relationship between the True Positive Rate (Sensitivity or Recall) and the False Positive Rate (1-Specificity) at each threshold. A good model should ideally have a high True Positive Rate and a low False Positive Rate
# ROC
<- data.frame(prediction=round(preds_naive1[,2],4),
naive_roc trueclass= as.numeric(titanic_test$Survived == 1))
<- prediction(naive_roc$prediction, naive_roc$trueclass)
naive_roc
# ROC curve
plot(performance(naive_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)
AUC shows the area under the ROC curve. The closer to 1, the better
the model performance in separating positive and negative classes. To
get the AUC value, use the parameter measure = "auc"
in the
performance()
function and then take the value
y.values
.
#Check AUC
<- performance(prediction.obj = naive_roc,
naive_value measure = "auc")
@y.values naive_value
#> [[1]]
#> [1] 0.8210354
Insight :
ROC Curve
Decision Tree is a fairly simple tree-based model with robust/powerful* performance for prediction. The Decision Tree produces a visualization in the form of a decision tree which can be interpreted easily.
Decision Tree additional characters:
Note: Decision Tree is not only limited to Classification cases, but can be used in Regression cases.
<- ctree(formula = Survived ~.,
dtree_model data = titanic_train_up,
control = ctree_control(mincriterion=0.95,
minsplit=30,
minbucket=10))
plot(dtree_model, type = "s")
Insight :
sex
due produce the greatest information gain.Pclass
between 1 - 2 around 190
survived, and error 4.7% it means around 4.7% of 190, decision tree has
wrong predicted.Pclass
in 1 with
passenger’s ID > 352, is survived from titanic’s accident.Pclass
in 2,3 with age more
than 12 definitely not survived 376 passengers.# class predict in data test
<- predict(object = dtree_model,
pred_titanic_test_tuned newdata = titanic_test,
type = "response")
# for the probability
<- predict(dtree_model, titanic_test, type = "prob") pred_titanic_test_tuned_prob
# confusion matrix data test
confusionMatrix(data = pred_titanic_test_tuned, # prediction
reference = titanic_test$Survived,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 85 12
#> 1 24 58
#>
#> Accuracy : 0.7989
#> 95% CI : (0.7326, 0.855)
#> No Information Rate : 0.6089
#> P-Value [Acc > NIR] : 0.00000004117
#>
#> Kappa : 0.5903
#>
#> Mcnemar's Test P-Value : 0.06675
#>
#> Sensitivity : 0.8286
#> Specificity : 0.7798
#> Pos Pred Value : 0.7073
#> Neg Pred Value : 0.8763
#> Prevalence : 0.3911
#> Detection Rate : 0.3240
#> Detection Prevalence : 0.4581
#> Balanced Accuracy : 0.8042
#>
#> 'Positive' Class : 1
#>
# class predict in data train
<- predict(object = dtree_model,
pred_titanic_train_tuned newdata = titanic_train_up,
type = "response")
# confusion matrix data train
confusionMatrix(data = pred_titanic_train_tuned, # prediksi
reference = titanic_train_up$Survived,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 364 87
#> 1 76 353
#>
#> Accuracy : 0.8148
#> 95% CI : (0.7875, 0.8399)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.6295
#>
#> Mcnemar's Test P-Value : 0.4335
#>
#> Sensitivity : 0.8023
#> Specificity : 0.8273
#> Pos Pred Value : 0.8228
#> Neg Pred Value : 0.8071
#> Prevalence : 0.5000
#> Detection Rate : 0.4011
#> Detection Prevalence : 0.4875
#> Balanced Accuracy : 0.8148
#>
#> 'Positive' Class : 1
#>
Insight :
# ROC
<- data.frame(prediction=round(pred_titanic_test_tuned_prob[,2],4),
dtree_roc trueclass= as.numeric(titanic_test$Survived == 1))
<- prediction(dtree_roc$prediction, dtree_roc$trueclass)
dtree_roc
# ROC curve
plot(performance(dtree_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)
#Check AUC
<- performance(prediction.obj = dtree_roc,
dtree_value measure = "auc")
@y.values dtree_value
#> [[1]]
#> [1] 0.8645478
Insight :
Random Forest is a type of Ensemble Method which consists of many Decision Trees. Each Decision Tree has its own characteristics and is not related to each other. Random Forest makes use of the Bagging (Bootstrap and Aggregation) concept in its creation. Here is the process:
mtry
parameter is used to randomly select the number of
predictor candidates (Automatic Feature Selection)# Check dimension of titanic
dim(titanic_train_up)
#> [1] 880 9
# feature selection using nearzerovar
<- nearZeroVar(titanic_train_up)
zero_var <- titanic_train_up %>%
titanic_rf select(-c(zero_var))
dim(titanic_rf)
#> [1] 880 9
insight :
#set.seed(417)
#ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 20)
# make a model of random forest
#titanic_rf_model <- train(Survived~., titanic_train_up, method = "rf", trControl = ctrl)
#titanic_rf_model
# save the model
#saveRDS(titanic_rf_model, "titanic_forest.RDS")
# read model
readRDS("model/titanic_forest.RDS")
#> Random Forest
#>
#> 880 samples
#> 8 predictor
#> 2 classes: '0', '1'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 20 times)
#> Summary of sample sizes: 792, 792, 792, 792, 792, 792, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.8272727 0.6545455
#> 6 0.8933523 0.7867045
#> 11 0.8845455 0.7690909
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 6.
insight :
<- readRDS("model/titanic_forest.RDS")
titanic_rf_model varImp(titanic_rf_model)
#> rf variable importance
#>
#> Overall
#> Sexmale 100.000
#> PassengerId 89.180
#> Fare 85.397
#> Age 72.721
#> Pclass3 16.644
#> SibSp 15.807
#> Parch 8.393
#> Pclass2 4.324
#> EmbarkedC 4.230
#> EmbarkedS 3.851
#> EmbarkedQ 0.000
insight :
Sexmale
#RAW
<- predict(object = titanic_rf_model,
pred_titanic_rf newdata = titanic_test,
type ="raw")
#Prob
<- predict(object = titanic_rf_model,
pred_titanic_rf_prob newdata = titanic_test,
type ="prob")
confusionMatrix(data = pred_titanic_rf,
reference = titanic_test$Survived,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 92 14
#> 1 17 56
#>
#> Accuracy : 0.8268
#> 95% CI : (0.7633, 0.8792)
#> No Information Rate : 0.6089
#> P-Value [Acc > NIR] : 0.0000000002333
#>
#> Kappa : 0.6391
#>
#> Mcnemar's Test P-Value : 0.7194
#>
#> Sensitivity : 0.8000
#> Specificity : 0.8440
#> Pos Pred Value : 0.7671
#> Neg Pred Value : 0.8679
#> Prevalence : 0.3911
#> Detection Rate : 0.3128
#> Detection Prevalence : 0.4078
#> Balanced Accuracy : 0.8220
#>
#> 'Positive' Class : 1
#>
<- predict(object = titanic_rf_model,
pred_titanic_rf_train newdata = titanic_train_up,
type ="raw")
confusionMatrix(data = pred_titanic_rf_train,
reference = titanic_train_up$Survived,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 440 0
#> 1 0 440
#>
#> Accuracy : 1
#> 95% CI : (0.9958, 1)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 1
#>
#> Mcnemar's Test P-Value : NA
#>
#> Sensitivity : 1.0
#> Specificity : 1.0
#> Pos Pred Value : 1.0
#> Neg Pred Value : 1.0
#> Prevalence : 0.5
#> Detection Rate : 0.5
#> Detection Prevalence : 0.5
#> Balanced Accuracy : 1.0
#>
#> 'Positive' Class : 1
#>
Insight :
$finalModel titanic_rf_model
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 6
#>
#> OOB estimate of error rate: 9.89%
#> Confusion matrix:
#> 0 1 class.error
#> 0 377 63 0.14318182
#> 1 24 416 0.05454545
Insight :
Pros of Random Forests:
# ROC
<- data.frame(prediction=round(pred_titanic_rf_prob[,2],4),
rf_roc trueclass= as.numeric(titanic_test$Survived == 1))
<- prediction(rf_roc$prediction, rf_roc$trueclass)
rf_roc
# ROC curve
plot(performance(rf_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)
#Check AUC
<- performance(prediction.obj = rf_roc,
rf_value measure = "auc")
@y.values rf_value
#> [[1]]
#> [1] 0.8775885
Insight :
Due to our the best model is Random Forest, we will use it to know the answer of the question.
# Using data-frame predicted by the best model
<- titanic_test %>%
titanic_answer mutate(predict = pred_titanic_rf) %>%
relocate(Survived,predict)
# what sorts of people were more likely to survive?
# Using data test with column "Survived"
<- titanic_answer %>%
titanic_answer1_fm group_by(Sex,Age,Survived) %>%
count() %>%
rename(total_survive = n) %>%
filter(Survived %in% 1) %>%
filter(Sex == 'female') %>%
arrange(-total_survive)
<- titanic_answer %>%
titanic_answer1_m group_by(Sex,Age,Survived) %>%
count() %>%
rename(total_survive = n) %>%
filter(Survived %in% 1) %>%
filter(Sex == 'male') %>%
arrange(-total_survive)
titanic_answer1_fm
titanic_answer1_m
Insight :
The best model have been through between Naive Bayes, Decision Tree, and Random forest is Random Forest with accuracy 82.68% and performance model 90.11% means that the model performance is good in classifying both positive and negative class.
For further research, even tough Random Forest is the best model so far, model Random Forest should be train in big size of data train, more learn the way more better.
Our goals above have already achieved, with get the best model
machine learning that is Random Forest and we could know the result of
predictive survived in our data test
(titanic_test
)