Titanic

1. Introduction

Background

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable,” sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the deaths of 1,502 of the 2,224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Business Case

💡 In this challenge, we have been asked to build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e., age, gender, socio-economic class, etc.). The answer, focusing on gender and age, is given in the Conclusion.

Source: Kaggle’s challenge, https://www.kaggle.com/c/titanic

Goals

  • As data scientists, build a machine learning model with good accuracy that predicts, from a limited set of predictors, whether each passenger on the Titanic survived
  • From the best model, learn which groups of people (by gender, age, etc.) were more likely to survive
📝 In this project we will try Naive Bayes, Decision Tree, and Random Forest as our machine learning models

2. Preparation

Library

library(dplyr)        # data wrangling (glimpse(), mutate(), select())
library(tidyr)        # data wrangling (replace_na())
library(e1071)        # naiveBayes()
library(caret)        # confusionMatrix(), upSample(), train()
library(partykit)     # ctree() decision tree
library(randomForest) # random forest
library(ROCR)         # ROC / AUC

Data Preparation

Read the dataset, train.csv:

#train.csv
titanic <- read.csv("titanic/train.csv", stringsAsFactors = T)
glimpse(titanic)
#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
#> $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
#> $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
#> $ Name        <fct> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
#> $ Sex         <fct> male, female, female, female, male, male, male, male, fema…
#> $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
#> $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
#> $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
#> $ Ticket      <fct> A/5 21171, PC 17599, STON/O2. 3101282, 113803, 373450, 330…
#> $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
#> $ Cabin       <fct> , C85, , C123, , , E46, , , , G6, C103, , , , , , , , , , …
#> $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…

Description of variables/columns:

  • Survived: 0 = No, 1 = Yes
  • Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Sex
  • Age: in years
  • SibSp: number of siblings/spouses aboard the Titanic
  • Parch: number of parents/children aboard the Titanic
  • Ticket: ticket number
  • Fare: passenger fare
  • Cabin: cabin number
  • Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Insight :

  • titanic has 12 columns and 891 passengers. Several columns need to be converted to factors: in this case Survived, Pclass, Sex, and Embarked. We also drop Name, Ticket, and Cabin, which we will not use as predictors.
titanic <- titanic %>% 
  mutate_at(vars(Survived,Pclass,Sex,Embarked), as.factor) %>% 
  select(-c(Name,Ticket,Cabin))
glimpse(titanic)
#> Rows: 891
#> Columns: 9
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
#> $ Survived    <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
#> $ Pclass      <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
#> $ Sex         <fct> male, female, female, female, male, male, male, male, fema…
#> $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
#> $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
#> $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
#> $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
#> $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…

Check Missing Values

colSums(is.na(titanic))
#> PassengerId    Survived      Pclass         Sex         Age       SibSp 
#>           0           0           0           0         177           0 
#>       Parch        Fare    Embarked 
#>           0           0           0

Insight :

  • Age has 177 missing values in the dataset. We will impute them

Data Wrangling

Imputation missing values

hist(titanic$Age)

Insight :

  • The histogram shows that the highest frequency of ages lies between 20 and 30. We will fill the NAs with the mode as a simple statistical method (a median-based alternative is sketched below).
  • We use the mode (the most frequent value) because mode imputation is normally the method of choice for nominal/categorical columns; here it preserves the most common age, which matters since our research looks at survival across predictors such as sex and age
# A Mode function that ignores NAs in the data
Mode <- function(x) {
  ux <- na.omit(unique(x))
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

# Replace NAs in the Age column with the mode (24 for this data)
titanic <- titanic %>% 
  mutate(Age = replace_na(Age, Mode(Age)))


colSums(is.na(titanic))
#> PassengerId    Survived      Pclass         Sex         Age       SibSp 
#>           0           0           0           0           0           0 
#>       Parch        Fare    Embarked 
#>           0           0           0

Insight :

  • There are no missing values left in the titanic dataset
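
For reference, a common alternative for a numeric column such as Age is median imputation, which is less sensitive to skew. A minimal sketch of what that would look like (illustrative only; not applied in this report, and a no-op here since the NAs are already filled):

# Alternative sketch: impute Age with the median instead of the mode
titanic_median_demo <- titanic %>% 
  mutate(Age = replace_na(Age, median(Age, na.rm = TRUE)))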

EDA (Exploratory Data Analysis)

We perform EDA to understand the distribution and characteristics of the dataset, and to check that the target variable is balanced so that the machine learning models will work well.

summary(titanic)
#>   PassengerId    Survived Pclass      Sex           Age            SibSp      
#>  Min.   :  1.0   0:549    1:216   female:314   Min.   : 0.42   Min.   :0.000  
#>  1st Qu.:223.5   1:342    2:184   male  :577   1st Qu.:22.00   1st Qu.:0.000  
#>  Median :446.0            3:491                Median :24.00   Median :0.000  
#>  Mean   :446.0                                 Mean   :28.57   Mean   :0.523  
#>  3rd Qu.:668.5                                 3rd Qu.:35.00   3rd Qu.:1.000  
#>  Max.   :891.0                                 Max.   :80.00   Max.   :8.000  
#>      Parch             Fare        Embarked
#>  Min.   :0.0000   Min.   :  0.00    :  2   
#>  1st Qu.:0.0000   1st Qu.:  7.91   C:168   
#>  Median :0.0000   Median : 14.45   Q: 77   
#>  Mean   :0.3816   Mean   : 32.20   S:644   
#>  3rd Qu.:0.0000   3rd Qu.: 31.00           
#>  Max.   :6.0000   Max.   :512.33

Insight :

  • Most passengers were male (577 of 891), so males account for the largest share of victims
  • The median passenger age is 24 years (mean 28.57)
  • The majority of passengers embarked at Southampton; 2 passengers have no recorded port

We also need to check whether the target variable is balanced.

prop.table(table(titanic$Survived))
#> 
#>         0         1 
#> 0.6161616 0.3838384

Insight :

  • The target classes are imbalanced: 61.6% did not survive and 38.4% survived. We should balance them to make sure our machine learning models work well

3. Cross Validation

RNGkind(sample.kind = "Rounding") 
set.seed(417)

# index sampling
index <- sample(x = nrow(titanic), size = nrow(titanic)*0.8)

# splitting
titanic_train <- titanic[index,]
titanic_test <- titanic[-index,]
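
As a quick sanity check on the split sizes (891 rows × 0.8 rounds down to 712 training rows, leaving 179 for testing):

# verify the 80/20 split: 712 train / 179 test out of 891 rows
nrow(titanic_train)
nrow(titanic_test)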

Up-Sample

Because our dataset has fewer than 1,000 observations, we up-sample (rather than down-sample) to balance the target variable, so that no training data is discarded.

# upsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)

# up-sample with caret::upSample()
titanic_train_up <- upSample(x = titanic_train %>% select(-Survived), # predictors
                       y = titanic_train$Survived, # target
                       yname = "Survived") # name of the target column

head(titanic_train_up)
prop.table(table(titanic_train_up$Survived))
#> 
#>   0   1 
#> 0.5 0.5

Insight :

  • The target variable is now balanced, with a 50:50 proportion

4. Model Fitting

Naive Bayes

Naive Bayes is a classification method based on Bayes’ theorem, which deals with the probability of dependent events. Humans essentially use a Bayesian mindset: we keep updating our beliefs based on new information we receive.

  • Naive Bayes is a machine learning model that applies Bayes’ Theorem to classification.
  • The predictors and the target variable are treated as dependent events.
  • It is called “naive” because each predictor is assumed to be independent of the others and to carry the same weight (the same importance or influence) in making predictions. This simplifies the formula and reduces the computational burden.
# laplace = 1 guards against zero probabilities: Embarked has an empty level
# (the 2 passengers with no recorded port)
model_naive <- naiveBayes(Survived ~ ., data = titanic_train_up, laplace = 1)
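
To sanity-check what the model has learned, we can inspect the fitted object; for a categorical predictor such as Sex, e1071 stores the conditional probabilities with the Laplace smoothing applied (a quick look, output omitted here):

# conditional probabilities P(Sex | Survived) learned by the model
model_naive$tables$Sex
# prior class distribution of Survived in the training data
model_naive$apriori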

Predict of Naive Bayes

# type = "class" returns the predicted class label (default threshold 0.5)
preds_naive <- predict(model_naive, newdata = titanic_test, type = "class")
# type = "raw" returns the class probabilities
preds_naive1 <- predict(model_naive, titanic_test, type = "raw")

Evaluation Naive Bayes

Confusion Matrix Data Test

confusionMatrix(data = preds_naive , #predict
                reference = titanic_test$Survived , # data actual
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 93 20
#>          1 16 50
#>                                          
#>                Accuracy : 0.7989         
#>                  95% CI : (0.7326, 0.855)
#>     No Information Rate : 0.6089         
#>     P-Value [Acc > NIR] : 0.00000004117  
#>                                          
#>                   Kappa : 0.5734         
#>                                          
#>  Mcnemar's Test P-Value : 0.6171         
#>                                          
#>             Sensitivity : 0.7143         
#>             Specificity : 0.8532         
#>          Pos Pred Value : 0.7576         
#>          Neg Pred Value : 0.8230         
#>              Prevalence : 0.3911         
#>          Detection Rate : 0.2793         
#>    Detection Prevalence : 0.3687         
#>       Balanced Accuracy : 0.7837         
#>                                          
#>        'Positive' Class : 1              
#> 

Confusion Matrix Data Train

preds_naive_train <- predict(model_naive, newdata = titanic_train_up,type = "class")
confusionMatrix(data = preds_naive_train , #predict
                reference = titanic_train_up$Survived , # data actual
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 351 131
#>          1  89 309
#>                                                
#>                Accuracy : 0.75                 
#>                  95% CI : (0.72, 0.7783)       
#>     No Information Rate : 0.5                  
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.5                  
#>                                                
#>  Mcnemar's Test P-Value : 0.005706             
#>                                                
#>             Sensitivity : 0.7023               
#>             Specificity : 0.7977               
#>          Pos Pred Value : 0.7764               
#>          Neg Pred Value : 0.7282               
#>              Prevalence : 0.5000               
#>          Detection Rate : 0.3511               
#>    Detection Prevalence : 0.4523               
#>       Balanced Accuracy : 0.7500               
#>                                                
#>        'Positive' Class : 1                    
#> 

For your information (the sketch below verifies these by hand):

  • Accuracy: the ability to correctly predict both classes out of all observations.
  • Precision: the ability to correctly predict the positive class out of all predicted-positive observations (low false positives).
  • Recall: the ability to correctly predict the positive class out of all actual-positive observations (low false negatives).
  • Specificity: the ability to correctly predict the negative class out of all actual-negative observations.
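
As a quick check, these four metrics can be recomputed by hand from the Naive Bayes test confusion matrix above (TP = 50, TN = 93, FP = 16, FN = 20):

# recompute the metrics manually from the test confusion matrix
TP <- 50; TN <- 93; FP <- 16; FN <- 20
(TP + TN) / (TP + TN + FP + FN) # Accuracy    = 0.7989
TP / (TP + FN)                  # Recall      = 0.7143 (Sensitivity)
TP / (TP + FP)                  # Precision   = 0.7576 (Pos Pred Value)
TN / (TN + FP)                  # Specificity = 0.8532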

Insight :

  • The accuracy of Naive Bayes is 79.89%, with the focus on the positive class (1 = survived)
  • Accuracy on the training data is 75%, so accuracy actually improves from 75% on seen data to 79.89% on unseen data (the test set); the model is not overfitting
  • P-Value < 0.05 means there is a significant difference between the model accuracy and the No Information Rate; that is, the model performs much better than simply guessing the majority class, so it gives very good results in predicting the target class
  • As best practice, Naive Bayes predictors should all be factor/categorical columns. If a predictor is numeric, Naive Bayes calculates the mean and standard deviation (sd) for each target level and obtains probabilities by assuming the numeric predictor follows a normal distribution; this variant is referred to as Gaussian Naive Bayes (see the sketch below)
  • McNemar’s Test P-Value is 0.6171 (> 0.05), so there is no significant asymmetry between the two error types (false positives vs. false negatives)
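
The Gaussian treatment of numeric predictors can be seen directly in the fitted object: for Age, e1071 stores the estimated per-class mean (first column) and standard deviation (second column), output omitted here:

# for numeric predictors, naiveBayes() stores a per-class Gaussian mean and sd
model_naive$tables$Age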

ROC

ROC is a curve that describes the relationship between the True Positive Rate (Sensitivity or Recall) and the False Positive Rate (1-Specificity) at each threshold. A good model should ideally have a high True Positive Rate and a low False Positive Rate

# ROC
naive_roc <- data.frame(prediction=round(preds_naive1[,2],4),
                      trueclass= as.numeric(titanic_test$Survived == 1))

naive_roc <- prediction(naive_roc$prediction, naive_roc$trueclass) 

# ROC curve
plot(performance(naive_roc, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1)

AUC

AUC shows the area under the ROC curve. The closer to 1, the better the model performance in separating positive and negative classes. To get the AUC value, use the parameter measure = "auc" in the performance() function and then take the value y.values.

#Check AUC
naive_value <- performance(prediction.obj = naive_roc, 
                         measure = "auc")
naive_value@y.values
#> [[1]]
#> [1] 0.8210354

Insight :

  • The AUC is 82.10%, meaning the model performs well in separating the positive and negative classes.
  • If we are not satisfied with the model, we can change the prediction threshold (not really recommended, because it can be forced), as sketched below:
    • Shifting it closer to 0 increases recall
    • Shifting it closer to 1 increases precision
  • Since this is for academic purposes, we do not change the threshold and keep the default of 0.5
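
If we did want to experiment with the cutoff, it only takes re-thresholding the raw probabilities; a minimal sketch with a hypothetical threshold of 0.4 (not used in this report):

# hypothetical re-threshold at 0.4 instead of the default 0.5
preds_naive_40 <- factor(ifelse(preds_naive1[, 2] > 0.4, "1", "0"),
                         levels = levels(titanic_test$Survived))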
To evaluate our model, we can also use the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC). The ROC curve plots the true positive rate (TPR, or Sensitivity) against the false positive rate (FPR, or 1 - Specificity). ROC is a probability curve, while AUC represents the degree or measure of separability: it indicates how well a model can distinguish between classes. The closer the curve gets to the upper-left corner of the plot (high true positive rate, low false positive rate), the better our model. The better our model separates the target classes, the higher the AUC score. Please see the illustration below.

ROC Curve

Decision Tree

Decision Tree is a fairly simple tree-based model with robust/powerful performance for prediction. It produces a visualization in the form of a decision tree that can be interpreted easily.

Additional characteristics of the Decision Tree:

  • Predictors are allowed to be mutually dependent, so multicollinearity is not a problem.
  • It can handle outliers in numeric predictors.

Note: the Decision Tree is not limited to classification cases; it can also be used for regression.

dtree_model <- ctree(formula = Survived ~ .,
                     data = titanic_train_up,
                     control = ctree_control(mincriterion = 0.95, # 1 - alpha: confidence required to split
                                             minsplit = 30,       # min. observations in a node to attempt a split
                                             minbucket = 10))     # min. observations in a terminal node
plot(dtree_model, type = "s")

Insight :

  • In this Titanic dataset, Sex is the root node because it produces the strongest split (the greatest information gain).
  • Around 190 females in Pclass 1-2 survived, with an error of 4.7%, meaning the decision tree misclassified about 4.7% of those 190.
  • Males in Pclass 1 with PassengerId > 352 mostly survived the accident.
  • Males in Pclass 2-3 older than 12 mostly did not survive (376 passengers).

Predict of Decision Tree

# class predict in data test
pred_titanic_test_tuned <- predict(object = dtree_model,
                                newdata = titanic_test,
                                type = "response")
# for the probability
pred_titanic_test_tuned_prob <- predict(dtree_model, titanic_test, type = "prob")

Evaluation Decision Tree

Confusion Matrix Data Test

# confusion matrix data test
confusionMatrix(data = pred_titanic_test_tuned, # prediction
                reference = titanic_test$Survived, 
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 85 12
#>          1 24 58
#>                                          
#>                Accuracy : 0.7989         
#>                  95% CI : (0.7326, 0.855)
#>     No Information Rate : 0.6089         
#>     P-Value [Acc > NIR] : 0.00000004117  
#>                                          
#>                   Kappa : 0.5903         
#>                                          
#>  Mcnemar's Test P-Value : 0.06675        
#>                                          
#>             Sensitivity : 0.8286         
#>             Specificity : 0.7798         
#>          Pos Pred Value : 0.7073         
#>          Neg Pred Value : 0.8763         
#>              Prevalence : 0.3911         
#>          Detection Rate : 0.3240         
#>    Detection Prevalence : 0.4581         
#>       Balanced Accuracy : 0.8042         
#>                                          
#>        'Positive' Class : 1              
#> 

Confusion Matrix Data Train

# class predict in data train
pred_titanic_train_tuned <- predict(object = dtree_model,
                                newdata = titanic_train_up,
                                type = "response")
# confusion matrix data train
confusionMatrix(data = pred_titanic_train_tuned, # prediction
                reference = titanic_train_up$Survived, 
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 364  87
#>          1  76 353
#>                                              
#>                Accuracy : 0.8148             
#>                  95% CI : (0.7875, 0.8399)   
#>     No Information Rate : 0.5                
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.6295             
#>                                              
#>  Mcnemar's Test P-Value : 0.4335             
#>                                              
#>             Sensitivity : 0.8023             
#>             Specificity : 0.8273             
#>          Pos Pred Value : 0.8228             
#>          Neg Pred Value : 0.8071             
#>              Prevalence : 0.5000             
#>          Detection Rate : 0.4011             
#>    Detection Prevalence : 0.4875             
#>       Balanced Accuracy : 0.8148             
#>                                              
#>        'Positive' Class : 1                  
#> 

Insight :

  • The accuracy of the Decision Tree is 79.89% with the focus on the positive class (1 = survived), the same as Naive Bayes
  • Before trusting the model, we compare it against the training data: accuracy decreases from 81.48% (train) to 79.89% (test). This drop on unseen data reflects known Decision Tree weaknesses: sensitivity to new data and to outliers
  • P-Value < 0.05 means there is a significant difference between the model accuracy and the No Information Rate; the model performs much better than simply guessing the majority class, so it gives very good results in predicting the target class
  • McNemar’s Test P-Value is 0.06675 (> 0.05), so there is no significant asymmetry between false positives and false negatives
  • Kappa is 0.5903, just below 0.6; values between 0.6 and 0.8 (or more) are generally considered good. However, kappa should always be evaluated in the context of the problem or data being studied: for example, when the prevalence of the condition being studied is very low, the kappa value may be lower than expected even though the model is reasonably accurate

ROC

# ROC
dtree_roc <- data.frame(prediction=round(pred_titanic_test_tuned_prob[,2],4),
                      trueclass= as.numeric(titanic_test$Survived == 1))

dtree_roc <- prediction(dtree_roc$prediction, dtree_roc$trueclass) 

# ROC curve
plot(performance(dtree_roc, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1)

AUC

#Check AUC
dtree_value <- performance(prediction.obj = dtree_roc, 
                         measure = "auc")
dtree_value@y.values
#> [[1]]
#> [1] 0.8645478

Insight :

  • The AUC is 86.45%, meaning the model performs well in classifying both the positive and negative classes.

Random Forest

Random Forest is a type of ensemble method consisting of many Decision Trees; each Decision Tree has its own characteristics and is not related to the others. Random Forest makes use of the Bagging (Bootstrap and Aggregation) concept. The process is as follows (a sketch of the bootstrap step follows the list):

  1. Bootstrap sampling: generate data by random sampling with replacement from the entire dataset, allowing duplicate rows.
  2. Build one decision tree for each bootstrap sample. The mtry parameter sets the number of predictor candidates randomly selected at each split (automatic feature selection).
  3. Make predictions on new observations with every decision tree.
  4. Aggregation: combine these into a single prediction.
    • Classification case: majority voting
    • Regression case: average of the target values
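
To make the bootstrap step concrete, here is a small illustrative sketch of one bootstrap draw and its out-of-bag remainder (randomForest does this internally for every tree; this sketch is not part of the modeling pipeline):

# one bootstrap sample, drawn with replacement (duplicate rows allowed)
set.seed(1)
boot_index <- sample(nrow(titanic_train_up), replace = TRUE)
boot_sample <- titanic_train_up[boot_index, ]
# rows never drawn form the out-of-bag (OOB) sample for this tree
oob_sample <- titanic_train_up[-unique(boot_index), ]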

Data Pre-processing

# Check dimension of titanic
dim(titanic_train_up)
#> [1] 880   9
# feature selection using nearZeroVar()
zero_var <- nearZeroVar(titanic_train_up)
titanic_rf <- titanic_train_up %>% 
  select(-c(zero_var))

dim(titanic_rf)
#> [1] 880   9

Insight :

  • No column has near-zero variance, so nearZeroVar() removes nothing and the dimensions are unchanged (880 × 9)

K-Fold Cross Validation

#set.seed(417)
#ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 20)

# make a model of random forest
#titanic_rf_model <- train(Survived~., titanic_train_up, method = "rf", trControl = ctrl)
#titanic_rf_model

# save the model
#saveRDS(titanic_rf_model, "model/titanic_forest.RDS")

# read model
readRDS("model/titanic_forest.RDS")
#> Random Forest 
#> 
#> 880 samples
#>   8 predictor
#>   2 classes: '0', '1' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 20 times) 
#> Summary of sample sizes: 792, 792, 792, 792, 792, 792, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.8272727  0.6545455
#>    6    0.8933523  0.7867045
#>   11    0.8845455  0.7690909
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 6.

Insight :

  • 880 samples -> the number of rows in the training data used to build the model
  • 8 predictors -> the number of predictor variables in our training data
  • 2 classes -> the number of target classes in our data
  • Summary of sample sizes -> the sizes of the training folds produced by the k-fold cross-validation
  • mtry and Accuracy show each mtry value tried and the accuracy of the corresponding model; this accuracy is the reference for choosing the best model by its mtry
titanic_rf_model <- readRDS("model/titanic_forest.RDS")
varImp(titanic_rf_model)
#> rf variable importance
#> 
#>             Overall
#> Sexmale     100.000
#> PassengerId  89.180
#> Fare         85.397
#> Age          72.721
#> Pclass3      16.644
#> SibSp        15.807
#> Parch         8.393
#> Pclass2       4.324
#> EmbarkedC     4.230
#> EmbarkedS     3.851
#> EmbarkedQ     0.000

Insight :

  • The most important variable is Sexmale
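
If we want the selected tuning value programmatically rather than reading it from the printout, caret keeps it in the fitted object (a quick sketch, output omitted):

# caret stores the winning hyperparameter and the per-mtry resampling results
titanic_rf_model$bestTune # mtry = 6
titanic_rf_model$results  # Accuracy / Kappa for each mtry tried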

Predict of Random Forest

#RAW
pred_titanic_rf <- predict(object = titanic_rf_model,
                   newdata = titanic_test,
                   type ="raw")

#Prob
pred_titanic_rf_prob <- predict(object = titanic_rf_model,
                   newdata = titanic_test,
                   type ="prob")

Evaluation of Random Forest

Confusion Matrix Data Test

confusionMatrix(data = pred_titanic_rf,
                reference = titanic_test$Survived,
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 92 14
#>          1 17 56
#>                                           
#>                Accuracy : 0.8268          
#>                  95% CI : (0.7633, 0.8792)
#>     No Information Rate : 0.6089          
#>     P-Value [Acc > NIR] : 0.0000000002333 
#>                                           
#>                   Kappa : 0.6391          
#>                                           
#>  Mcnemar's Test P-Value : 0.7194          
#>                                           
#>             Sensitivity : 0.8000          
#>             Specificity : 0.8440          
#>          Pos Pred Value : 0.7671          
#>          Neg Pred Value : 0.8679          
#>              Prevalence : 0.3911          
#>          Detection Rate : 0.3128          
#>    Detection Prevalence : 0.4078          
#>       Balanced Accuracy : 0.8220          
#>                                           
#>        'Positive' Class : 1               
#> 

Confusion Matrix Data Train

pred_titanic_rf_train <- predict(object = titanic_rf_model,
                   newdata = titanic_train_up,
                   type ="raw")

confusionMatrix(data = pred_titanic_rf_train,
                reference = titanic_train_up$Survived,
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 440   0
#>          1   0 440
#>                                                
#>                Accuracy : 1                    
#>                  95% CI : (0.9958, 1)          
#>     No Information Rate : 0.5                  
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 1                    
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#>                                                
#>             Sensitivity : 1.0                  
#>             Specificity : 1.0                  
#>          Pos Pred Value : 1.0                  
#>          Neg Pred Value : 1.0                  
#>              Prevalence : 0.5                  
#>          Detection Rate : 0.5                  
#>    Detection Prevalence : 0.5                  
#>       Balanced Accuracy : 1.0                  
#>                                                
#>        'Positive' Class : 1                    
#> 

Insight :

  • The accuracy of the Random Forest on the test data is 82.68% with the focus on the positive class (1 = survived), the highest of the three models.
  • The accuracy on the (up-sampled) training data is 100%, so accuracy drops significantly on unseen data, to 82.68%; a Random Forest evaluated on its own training data looks nearly perfect, which is why the out-of-bag estimate below is a fairer measure.
  • P-Value < 0.05 means there is a significant difference between the model accuracy and the No Information Rate; the model performs much better than simply guessing the majority class, so it gives very good results in predicting the target class.
  • Kappa is 0.6391; values between 0.6 and 0.8 (or more) are generally considered good. However, kappa should always be evaluated in the context of the problem or data being studied: when the prevalence of the condition being studied is very low, the kappa value may be lower than expected even though the model is reasonably accurate.
titanic_rf_model$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 6
#> 
#>         OOB estimate of  error rate: 9.89%
#> Confusion matrix:
#>     0   1 class.error
#> 0 377  63  0.14318182
#> 1  24 416  0.05454545

Insight :

  • Number of trees: 500 -> the random forest builds 500 trees
  • No. of variables tried at each split: 6 -> mtry = 6; this is the best mtry, the value with the highest accuracy when tested on the bootstrap-sampled data (which can be considered the training data of the individual decision trees in the random forest)
  • OOB estimate of error rate: 9.89% -> the out-of-bag error, measured on the out-of-bag samples (data unseen during bootstrap sampling). In other words, the model accuracy on this held-out data is 100% - 9.89% = 90.11%
  • Confusion matrix -> the confusion matrix for the out-of-bag samples (verified below)
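
As a quick arithmetic check, the 9.89% OOB error can be recovered from the final model's stored confusion matrix, since (63 + 24) / 880 ≈ 0.0989:

# verify the OOB error rate from the stored OOB confusion matrix
oob_conf <- titanic_rf_model$finalModel$confusion[, 1:2]
(oob_conf[1, 2] + oob_conf[2, 1]) / sum(oob_conf) # ~ 0.0989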

Pros of Random Forest:

  • It suppresses the bias and variance of the individual Decision Trees, resulting in better predictive performance.
  • Automatic feature selection: predictors are randomly selected when building each Decision Tree.
  • The out-of-bag error serves as a built-in model evaluation.

ROC

# ROC
rf_roc <- data.frame(prediction=round(pred_titanic_rf_prob[,2],4),
                      trueclass= as.numeric(titanic_test$Survived == 1))

rf_roc <- prediction(rf_roc$prediction, rf_roc$trueclass) 

# ROC curve
plot(performance(rf_roc, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1)

AUC

#Check AUC
rf_value <- performance(prediction.obj = rf_roc, 
                         measure = "auc")
rf_value@y.values
#> [[1]]
#> [1] 0.8775885

Insight :

  • The AUC is 87.76%, meaning the model performs well in classifying both the positive and negative classes.

5. Conclusion

Business Case

“What sorts of people were more likely to survive?” using passenger data (i.e., age, gender, socio-economic class, etc.).

Since our best model is the Random Forest, we use it to answer the question.

# Attach the predictions from the best model to the test data
titanic_answer <- titanic_test %>% 
  mutate(predict = pred_titanic_rf) %>% 
  relocate(Survived,predict)
# what sorts of people were more likely to survive?
# Using data test with column "Survived"

titanic_answer1_fm <- titanic_answer %>% 
  group_by(Sex,Age,Survived) %>% 
  count() %>% 
  rename(total_survive = n) %>%
  filter(Survived %in% 1) %>%
  filter(Sex == 'female') %>% 
  arrange(-total_survive)

titanic_answer1_m <- titanic_answer %>% 
  group_by(Sex,Age,Survived) %>% 
  count() %>% 
  rename(total_survive = n) %>%
  filter(Survived %in% 1) %>% 
  filter(Sex == 'male') %>%
  arrange(-total_survive)

titanic_answer1_fm
titanic_answer1_m

Insight :

  • Among females, 24-year-olds were the most likely to survive, with 9 passengers in the test set (note that missing ages were imputed with the mode, 24, which inflates this group)
  • Among males, the most frequent survivors are babies (2 passengers) and 25-year-olds (2 passengers)

Conclusion

  • The best model among Naive Bayes, Decision Tree, and Random Forest is the Random Forest, with 82.68% accuracy on the test set and 90.11% out-of-bag accuracy, meaning the model performs well in classifying both the positive and negative classes.

  • For further research, even though Random Forest is the best model so far, it should be trained on a larger training set; the more data it learns from, the better it performs.

  • Our goals above have been achieved: we obtained the best machine learning model, Random Forest, and we know the predicted survival outcomes for our test data (titanic_test).

6. Reference