1 Background

This is a learn by building project to predict the chance of survive of Titanic’s passsenger using Naive Bayes, Decision Tree & Random Forest Analysis method.

RMS Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in the early morning hours of April 15, 1912, after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking one of modern history’s deadliest peacetime commercial marine disasters.

https://en.wikipedia.org/wiki/RMS_Titanic

2 Source of Dataset

The analysis will use dataset from https://www.kaggle.com/c/titanic.

The data will be splitted into training and testing dataset.

Training dataset will be used to build machine learning models.

Testing dataset will be used to see how well machine learning models perform on unseen data.

3 Initialization Library

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.6.2

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'ggplot2' was built under R version 3.6.2

## Warning: package 'tibble' was built under R version 3.6.2

## Warning: package 'tidyr' was built under R version 3.6.2

## Warning: package 'readr' was built under R version 3.6.2

## Warning: package 'purrr' was built under R version 3.6.2

## Warning: package 'dplyr' was built under R version 3.6.2

## Warning: package 'stringr' was built under R version 3.6.2

## Warning: package 'forcats' was built under R version 3.6.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(plotly)

## Warning: package 'plotly' was built under R version 3.6.2

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(GGally)

## Warning: package 'GGally' was built under R version 3.6.2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## 
## Attaching package: 'GGally'

## The following object is masked from 'package:dplyr':
## 
##     nasa

library(ggplot2)
library(readr)
library(dplyr)
library(funModeling)

## Warning: package 'funModeling' was built under R version 3.6.2

## Loading required package: Hmisc

## Warning: package 'Hmisc' was built under R version 3.6.2

## Loading required package: lattice

## Loading required package: survival

## Warning: package 'survival' was built under R version 3.6.2

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:plotly':
## 
##     subplot

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## funModeling v.1.9.3 :)
## Examples and tutorials at livebook.datascienceheroes.com
##  / Now in Spanish: librovivodecienciadedatos.ai

## 
## Attaching package: 'funModeling'

## The following object is masked from 'package:GGally':
## 
##     range01

library(lmtest)

## Warning: package 'lmtest' was built under R version 3.6.2

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 3.6.2

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(car)

## Warning: package 'car' was built under R version 3.6.2

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

library(MLmetrics)

## Warning: package 'MLmetrics' was built under R version 3.6.2

## 
## Attaching package: 'MLmetrics'

## The following object is masked from 'package:base':
## 
##     Recall

library(e1071)

## Warning: package 'e1071' was built under R version 3.6.2

## 
## Attaching package: 'e1071'

## The following object is masked from 'package:Hmisc':
## 
##     impute

library(rsample)

## Warning: package 'rsample' was built under R version 3.6.2

library(caret)

## Warning: package 'caret' was built under R version 3.6.2

## 
## Attaching package: 'caret'

## The following objects are masked from 'package:MLmetrics':
## 
##     MAE, RMSE

## The following object is masked from 'package:survival':
## 
##     cluster

## The following object is masked from 'package:purrr':
## 
##     lift

library(ROCR)

## Warning: package 'ROCR' was built under R version 3.6.2

## Loading required package: gplots

## Warning: package 'gplots' was built under R version 3.6.2

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

library(partykit)

## Warning: package 'partykit' was built under R version 3.6.2

## Loading required package: grid

## Loading required package: libcoin

## Warning: package 'libcoin' was built under R version 3.6.2

## Loading required package: mvtnorm

4 Importing Dataset

titanic <- read.csv("titanic.csv")
str(titanic)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

4.1 Data Dictionary

The name of variables are as follows:
1. PassengerId : Id number
2. Survived : Survival (0 = No, 1 = Yes)
3. Pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
4. Name : Name
5. Sex : Sex
6. Age : Age in years
7. SibSp : # of siblings/spouses aboard the Titanic
8. Parch : # of parents/children aboard the Titanic
9. Ticket : Ticket number
10.Fare : Passenger fare
11.Cabin : Cabin number
12.Embarked : Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

4.2 Variable Notes

Pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5

Sibsp: The dataset defines family relations in this way
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

Parch: The dataset defines family relations in this way
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

5 Inspecting Dataset

titanic <- titanic %>% 
   select(-PassengerId, -Name, -Ticket, -Fare) %>% 
   mutate(Pclass = as.factor(Pclass),
          SibSp = as.factor(SibSp),
          Parch = as.factor(Parch))
str(titanic)

## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Cabin   : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

The variables PassengerId, Name, Ticket, Fare are excluded from predictor variable due to no relationship with target variable Survived.

titanic %>% 
   is.na() %>% 
  colSums(is.na(titanic))

## Survived   Pclass      Sex      Age    SibSp    Parch    Cabin Embarked 
##        0        0        0      177        0        0        0        0

There is 177 na or missing values in variable Age of dataset.

We will make a mean imputation of missing values, which is simply calculate the mean of the observed values for that variable for all individuals who are non-missing.

mean_impute<-function(x){
ifelse(is.na(x),mean(x,na.rm = T),x)
}

Age_1 <- titanic[,4]

Age_1 <- mean_impute(Age_1)
Age_1 <- as.data.frame(Age_1)

titanic_new <- c(titanic, Age_1)
titanic_new <- as.data.frame(titanic_new)
str(titanic_new)

## 'data.frame':    891 obs. of  9 variables:
##  $ Survived: int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Cabin   : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
##  $ Age_1   : num  22 38 26 35 35 ...

titanic_new %>% 
   is.na() %>% 
  colSums(is.na(titanic_new))

## Survived   Pclass      Sex      Age    SibSp    Parch    Cabin Embarked 
##        0        0        0      177        0        0        0        0 
##    Age_1 
##        0

titanic_new1 <- titanic_new %>% 
  select(-Age)

titanic_new1 %>% 
   is.na() %>% 
  colSums(is.na(titanic_new1))

## Survived   Pclass      Sex    SibSp    Parch    Cabin Embarked    Age_1 
##        0        0        0        0        0        0        0        0

We will divide variable Age_1 into 9 (nine) age groups and convert into factor data type.

age_group <- function(x){
if (x < 6) {
"Toddler"
}
else if (x > 5 & x < 12) {
"Child"
}
else if (x > 12 & x < 17) {
"Early Teenager"
}
else if (x > 17 & x < 26) {
"Late Teenager"
}
else if (x > 26 & x < 36) {
"Early Adult"
}
else if (x > 36 & x < 46) {
"Late Adult"
}
else if (x > 46 & x < 56) {
"Early Elderly"
}
else if (x > 56 & x < 66) {
"Late Elderly"
}
else
"Old Man"
}

Age_2 <- titanic_new1[,8]

Age_2 <- sapply(Age_2, age_group)
Age_2 <- as.data.frame(Age_2)

titanic_new2 <- c(titanic_new, Age_2)
titanic_new2 <- as.data.frame(titanic_new2)
str(titanic_new2)

## 'data.frame':    891 obs. of  10 variables:
##  $ Survived: int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Cabin   : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
##  $ Age_1   : num  22 38 26 35 35 ...
##  $ Age_2   : Factor w/ 9 levels "Child","Early Adult",..: 7 5 8 2 2 2 3 9 2 4 ...

titanic_data <- titanic_new2 %>% 
  select(-Age, -Age_1, -Cabin) %>% 
  mutate(Survived = as.factor(Survived))
str(titanic_data)

## 'data.frame':    891 obs. of  7 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
##  $ Age_2   : Factor w/ 9 levels "Child","Early Adult",..: 7 5 8 2 2 2 3 9 2 4 ...

6 Solving Business Problem

The model for solving business problem is to predict the chance of Titanic’s passengers getting survived.

The model will be developed using variables as follows:

Target variable: Survived
Predictor variable: Pclass, Sex, SibSp, Parch, Cabin, Embarked, Age_2

7 Exploratory Data Analysis

7.1 Checking data distribution

summary(titanic_data)

##  Survived Pclass      Sex      SibSp   Parch   Embarked           Age_2    
##  0:549    1:216   female:314   0:608   0:678    :  2    Early Adult  :355  
##  1:342    2:184   male  :577   1:209   1:118   C:168    Late Teenager:188  
##           3:491                2: 28   2: 80   Q: 77    Late Adult   : 94  
##                                3: 16   3:  5   S:644    Old Man      : 69  
##                                4: 18   4:  4            Early Elderly: 59  
##                                5:  5   5:  5            Toddler      : 44  
##                                8:  7   6:  1            (Other)      : 82

round(prop.table(table(titanic_data$Survived, titanic_data$Pclass)),2)

##    
##        1    2    3
##   0 0.09 0.11 0.42
##   1 0.15 0.10 0.13

Most of passengers were in the 3rd class (55%).

round(prop.table(table(titanic_data$Survived, titanic_data$Sex)),2)

##    
##     female male
##   0   0.09 0.53
##   1   0.26 0.12

Most of passengers were male (65%).

round(prop.table(table(titanic_data$Survived, titanic_data$SibSp)),2)

##    
##        0    1    2    3    4    5    8
##   0 0.45 0.11 0.02 0.01 0.02 0.01 0.01
##   1 0.24 0.13 0.01 0.00 0.00 0.00 0.00

None of passengers were with siblings/spouses (69%).

round(prop.table(table(titanic_data$Survived, titanic_data$Parch)),2)

##    
##        0    1    2    3    4    5    6
##   0 0.50 0.06 0.04 0.00 0.00 0.00 0.00
##   1 0.26 0.07 0.04 0.00 0.00 0.00 0.00

None of passengers were with parents/child (76%).

round(prop.table(table(titanic_data$Survived, titanic_data$Age_2)),2)

##    
##     Child Early Adult Early Elderly Early Teenager Late Adult Late Elderly
##   0  0.02        0.25          0.04           0.02       0.07         0.02
##   1  0.01        0.14          0.03           0.02       0.04         0.01
##    
##     Late Teenager Old Man Toddler
##   0          0.14    0.05    0.01
##   1          0.07    0.03    0.03

Most of passengers were early and late adult (50%).

7.2 Checking class-imbalance (cross validation)

round(prop.table(table(titanic_data$Survived)),2)

## 
##    0    1 
## 0.62 0.38

Based on the proportion of the target variable above, we can conclude that our target variable can be considered to be imbalance. Hence we will have to balance the same before using it for our models.

8 Preparing Train and Test Dataset (Cross Validation)

8.1 For Naive Bayes

# Splitting data
set.seed(100)
split_nb <- initial_split(data = titanic_data, prop = 0.8, strata = Survived)
train_nb <- training(split_nb)
test_nb <- testing(split_nb)

prop.table(table(titanic_data$Survived))

## 
##         0         1 
## 0.6161616 0.3838384

prop.table(table(train_nb$Survived))

## 
##         0         1 
## 0.6162465 0.3837535

prop.table(table(test_nb$Survived))

## 
##         0         1 
## 0.6158192 0.3841808

# Upsampling data
set.seed(100)
train_nb_up <- upSample(x= train_nb[,-1],y = train_nb$Survived, yname = "Survived")
prop.table(table(train_nb_up$Survived))

## 
##   0   1 
## 0.5 0.5

table(train_nb$Survived)

## 
##   0   1 
## 440 274

table(train_nb_up$Survived)

## 
##   0   1 
## 440 440

Now that we have the train (already upsampled) and test datasets. We will store the ground truth labels as well. They are necessary for the training of the classifer as well as simply model evaluation later.

# Preparing data labels
train_nb_up.Label <- train_nb_up[,1]

# Preparing data labels
test_nb.Label <- test_nb[,1]

8.2 For Decision Tree

# Splitting data
set.seed(100)
split_dt <- initial_split(data = titanic_data, prop = 0.8, strata = Survived)
train_dt <- training(split_dt)
test_dt <- testing(split_dt)

prop.table(table(titanic_data$Survived))

## 
##         0         1 
## 0.6161616 0.3838384

prop.table(table(train_dt$Survived))

## 
##         0         1 
## 0.6162465 0.3837535

prop.table(table(test_dt$Survived))

## 
##         0         1 
## 0.6158192 0.3841808

# Upsampling data
set.seed(100)
train_dt_up <- upSample(x= train_dt[,-1],y = train_dt$Survived, yname = "Survived")
prop.table(table(train_dt_up$Survived))

## 
##   0   1 
## 0.5 0.5

table(train_dt$Survived)

## 
##   0   1 
## 440 274

table(train_dt_up$Survived)

## 
##   0   1 
## 440 440

# Preparing data labels
train_dt_up.Label <- train_dt_up[,1]

# Preparing data labels
test_dt.Label <- test_dt[,1]

8.3 For Random Forest

# Splitting data
set.seed(100)
split_rf <- initial_split(data = titanic_data, prop = 0.8, strata = Survived)
train_rf <- training(split_rf)
test_rf <- testing(split_rf)

prop.table(table(titanic_data$Survived))

## 
##         0         1 
## 0.6161616 0.3838384

prop.table(table(train_rf$Survived))

## 
##         0         1 
## 0.6162465 0.3837535

prop.table(table(test_rf$Survived))

## 
##         0         1 
## 0.6158192 0.3841808

# Upsampling data
set.seed(100)
train_rf_up <- upSample(x= train_rf[,-1],y = train_rf$Survived, yname = "Survived")
prop.table(table(train_rf_up$Survived))

## 
##   0   1 
## 0.5 0.5

table(train_rf$Survived)

## 
##   0   1 
## 440 274

table(train_rf_up$Survived)

## 
##   0   1 
## 440 440

9 Developing Model

9.1 For Naive Bayes

model_nb <- naiveBayes(formula = Survived ~., data = train_nb_up, laplace = 1)
model_nb

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##   0   1 
## 0.5 0.5 
## 
## Conditional probabilities:
##    Pclass
## Y           1         2         3
##   0 0.1512415 0.1805869 0.6681716
##   1 0.3724605 0.2641084 0.3634312
## 
##    Sex
## Y      female      male
##   0 0.1425339 0.8574661
##   1 0.6764706 0.3235294
## 
##    SibSp
## Y             0           1           2           3           4           5
##   0 0.711409396 0.176733781 0.031319911 0.022371365 0.029082774 0.013422819
##   1 0.592841163 0.333333333 0.046979866 0.013422819 0.008948546 0.002237136
##    SibSp
## Y             8
##   0 0.015659955
##   1 0.002237136
## 
##    Parch
## Y             0           1           2           3           4           5
##   0 0.791946309 0.100671141 0.078299776 0.006711409 0.008948546 0.008948546
##   1 0.666666667 0.205816555 0.102908277 0.013422819 0.002237136 0.006711409
##    Parch
## Y             6
##   0 0.004474273
##   1 0.002237136
## 
##    Embarked
## Y                         C           Q           S
##   0 0.002252252 0.135135135 0.096846847 0.765765766
##   1 0.006756757 0.283783784 0.085585586 0.623873874
## 
##    Age_2
## Y        Child Early Adult Early Elderly Early Teenager Late Adult Late Elderly
##   0 0.03118040  0.39643653    0.05790646     0.03118040 0.11358575   0.03340757
##   1 0.02227171  0.35634744    0.08463252     0.05345212 0.11135857   0.02449889
##    Age_2
## Y   Late Teenager    Old Man    Toddler
##   0    0.22271715 0.08240535 0.03118040
##   1    0.18485523 0.06904232 0.09354120

9.2 For Decision Tree

model_dt <- ctree(formula = Survived ~., data = train_dt_up)
model_dt

## 
## Model formula:
## Survived ~ Pclass + Sex + SibSp + Parch + Embarked + Age_2
## 
## Fitted party:
## [1] root
## |   [2] Sex in female
## |   |   [3] Pclass in 1, 2: 1 (n = 208, err = 3.4%)
## |   |   [4] Pclass in 3
## |   |   |   [5] Age_2 in Child, Early Elderly, Late Adult: 0 (n = 18, err = 11.1%)
## |   |   |   [6] Age_2 in Early Adult, Early Teenager, Late Elderly, Late Teenager, Old Man, Toddler
## |   |   |   |   [7] Embarked in C, Q: 1 (n = 59, err = 15.3%)
## |   |   |   |   [8] Embarked in S: 1 (n = 75, err = 40.0%)
## |   [9] Sex in male
## |   |   [10] Pclass in 1: 0 (n = 120, err = 47.5%)
## |   |   [11] Pclass in 2, 3
## |   |   |   [12] Age_2 in Child, Early Adult, Early Elderly, Early Teenager, Late Adult, Late Elderly, Late Teenager, Old Man
## |   |   |   |   [13] Embarked in C
## |   |   |   |   |   [14] Parch in 0: 0 (n = 42, err = 26.2%)
## |   |   |   |   |   [15] Parch in 1: 1 (n = 10, err = 30.0%)
## |   |   |   |   [16] Embarked in Q, S: 0 (n = 321, err = 15.0%)
## |   |   |   [17] Age_2 in Toddler
## |   |   |   |   [18] SibSp in 0, 1, 2: 1 (n = 17, err = 0.0%)
## |   |   |   |   [19] SibSp in 3, 4, 5: 0 (n = 10, err = 20.0%)
## 
## Number of inner nodes:     9
## Number of terminal nodes: 10

plot(model_dt, type ="simple")

9.3 For Random Forest

set.seed(283)
ctrl <- trainControl(method = "cv", number = 5,repeats = 3)

## Warning: `repeats` has no meaning for this resampling method.

model_rf <- train(form = Survived ~., data = train_rf_up, trControl= ctrl)
model_rf

## Random Forest 
## 
## 880 samples
##   6 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 704, 704, 704, 704, 704 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7693182  0.5386364
##   14    0.8034091  0.6068182
##   26    0.7965909  0.5931818
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 14.

10 Predicting Target

10.1 For Naive Bayes

pred_nb <- predict(model_nb, test_nb)
head(pred_nb)

## [1] 1 0 0 1 1 1
## Levels: 0 1

10.2 For Decision Tree

pred_dt <- predict(model_dt, test_dt)
head(pred_dt)

##  2 14 22 25 29 31 
##  1  0  0  0  1  0 
## Levels: 0 1

pred_dt_prob <- predict(model_dt, test_dt, type = "prob")
head(pred_dt_prob)

##             0         1
## 2  0.03365385 0.9663462
## 14 0.85046729 0.1495327
## 22 0.85046729 0.1495327
## 25 0.88888889 0.1111111
## 29 0.15254237 0.8474576
## 31 0.52500000 0.4750000

10.3 For Random Forest

pred_rf <- predict(model_rf, test_rf)
head(pred_rf)

## [1] 1 0 0 0 1 1
## Levels: 0 1

11 Evaluating Target

11.1 For Naive Bayes

confusionMatrix(pred_nb, test_nb.Label, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 85 14
##          1 24 54
##                                           
##                Accuracy : 0.7853          
##                  95% CI : (0.7174, 0.8434)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 1.07e-06        
##                                           
##                   Kappa : 0.5585          
##                                           
##  Mcnemar's Test P-Value : 0.1443          
##                                           
##             Sensitivity : 0.7941          
##             Specificity : 0.7798          
##          Pos Pred Value : 0.6923          
##          Neg Pred Value : 0.8586          
##              Prevalence : 0.3842          
##          Detection Rate : 0.3051          
##    Detection Prevalence : 0.4407          
##       Balanced Accuracy : 0.7870          
##                                           
##        'Positive' Class : 1               
##

11.2 For Decision Tree

confusionMatrix(pred_dt, test_dt$Survived, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 93 17
##          1 16 51
##                                           
##                Accuracy : 0.8136          
##                  95% CI : (0.7483, 0.8681)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 1.058e-08       
##                                           
##                   Kappa : 0.6049          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7500          
##             Specificity : 0.8532          
##          Pos Pred Value : 0.7612          
##          Neg Pred Value : 0.8455          
##              Prevalence : 0.3842          
##          Detection Rate : 0.2881          
##    Detection Prevalence : 0.3785          
##       Balanced Accuracy : 0.8016          
##                                           
##        'Positive' Class : 1               
##

titan_roc <- ROCR::prediction(pred_dt_prob[,2], test_dt$Survived)  
plot(performance(titan_roc, "tpr", "fpr"))

titan_auc <- performance(titan_roc, "auc")
titan_auc@y.values[[1]]

## [1] 0.8600243

The AUC number 0.8579331, closed to 1, meaning that the classification model have a good prediction between two classes.

11.3 For Random Forest

confusionMatrix(pred_rf, test_rf$Survived, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 92 20
##          1 17 48
##                                           
##                Accuracy : 0.791           
##                  95% CI : (0.7236, 0.8483)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 4.548e-07       
##                                           
##                   Kappa : 0.5545          
##                                           
##  Mcnemar's Test P-Value : 0.7423          
##                                           
##             Sensitivity : 0.7059          
##             Specificity : 0.8440          
##          Pos Pred Value : 0.7385          
##          Neg Pred Value : 0.8214          
##              Prevalence : 0.3842          
##          Detection Rate : 0.2712          
##    Detection Prevalence : 0.3672          
##       Balanced Accuracy : 0.7750          
##                                           
##        'Positive' Class : 1               
##

12 Summary

The evaluation of confusion matrix of Naive Bayes method from data train are as follows:

Accuracy reveals value 0.7853, meaning that 78.53% of our data is correctly classified.
Sensitivity / Recall reveals value 0.7941, meaning that FN (False Negative) portion 79.41% of our positive outcomes are correctly classified.
Pos Pred Value / Precision reveals value 0.6923, meaning that FP (False Positive) portion 69.23 % of our positive predictions are correct.

The evaluation of confusion matrix of Decision Tree method from data train are as follows:

Accuracy reveals value 0.8192, meaning that 81.92% of our data is correctly classified.
Sensitivity / Recall reveals value 0.7500, meaning that FN (False Negative) portion 75% of our positive outcomes are correctly classified.
Pos Pred Value / Precision reveals value 0.7727, meaning that FP (False Positive) portion 77.27 % of our positive predictions are correct.

The evaluation of confusion matrix of Random Forest method from data train are as follows:

Accuracy reveals value 0.791, meaning that 79.1% of our data is correctly classified.
Sensitivity / Recall reveals value 0.7059, meaning that FN (False Negative) portion 70.59% of our positive outcomes are correctly classified.
Pos Pred Value / Precision reveals value 0.7385, meaning that FP (False Positive) portion 73.85 % of our positive predictions are correct.

From comparison of three confusion matrixes above, the accuracy and precision level of Decision Tree is the highest. Nevertheless, the recall level of Decision Tree is lower than Naive Bayes. The FN (False Negative) of Decision Tree 17, while the FN of Naive Bayes is 14. If we have a point of view that in such condition the probability to survive is small, we will choose Decision Tree method as our best prediction model.

Naive Bayes, Decision Tree & Random Forest Analysis - Titanic Dataset

Laurensius Wiwiek Winarta

2020-03-08