This report is part of LBB and will predict the survivability of the Titanic passenger in the incident that happened in 1912 using data provided by Kaggle Titanic Competition.

The RMS Titanic sank in the early morning hours of 15 April 1912 in the North Atlantic Ocean, four days into her maiden voyage from Southampton to New York City. The largest ocean liner in service at the time, Titanic had an estimated 2,224 people on board when she struck an iceberg at around 23:40 (ship’s time) on Sunday, 14 April 1912. Her sinking two hours and forty minutes later at 02:20 (ship’s time; 05:18 GMT) on Monday, 15 April, resulted in the deaths of more than 1,500 people, making it one of the deadliest peacetime maritime disasters in history.

Our aim is to predict what of passengers will likely to survive the Titanic shipwreck.

0. Library

library(dplyr)
library(tidyverse)
library(gtools)
library(ggplot2)
library(caret)
library(class)
library(e1071)
library(ROCR)
library(rpart)
library(rattle)
library(rpart.plot)
options(scipen=99)

1. Read Data & Understanding

We will use data provided by Kaggle from their machine learning competition.

titanic <- read.csv("train.csv")

glimpse(titanic)

## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal~
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~

The data set contains following columns:

Survived : Survival (0 = No, 1 = Yes)
Pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name : Name of the Passengers
Sex : Sex
Age : Age in years
SibSp : number of siblings / spouses aboard the Titanic
Parch : number of parents / children aboard the Titanic
Ticket : Ticket number
Fare : Passenger fare
Cabin : Cabin number
Embarked : Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

head(titanic)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

2. Data Manipulation

a. Data Types & Missing Values

We need to change data types of some variables to factor:

Survived
Pclass
Sex
Embarked

titanic_clean <- titanic %>%
  mutate(Survived=as.factor(Survived),
         Pclass=as.factor(Pclass),
         Sex=as.factor(Sex),
         Embarked=as.factor(Embarked)
         )

Also, we need to check whether there is missing value or not.

colSums(is.na(titanic_clean))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

There are 177 missing values in Age. We decided to fill the missing Age with median value.

summary(titanic_clean$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177

Replacing NA with median Age

titanic_clean <- titanic_clean %>%
  mutate(Age=replace_na(Age,28.00))

As we can see, there is no more missing values.

colSums(is.na(titanic_clean))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Below are the summary of the data set after initial wrangling.

summary(titanic_clean)

##   PassengerId    Survived Pclass      Name               Sex     
##  Min.   :  1.0   0:549    1:216   Length:891         female:314  
##  1st Qu.:223.5   1:342    2:184   Class :character   male  :577  
##  Median :446.0            3:491   Mode  :character               
##  Mean   :446.0                                                   
##  3rd Qu.:668.5                                                   
##  Max.   :891.0                                                   
##       Age            SibSp           Parch           Ticket         
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:22.00   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.36   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:35.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##       Fare           Cabin           Embarked
##  Min.   :  0.00   Length:891          :  2   
##  1st Qu.:  7.91   Class :character   C:168   
##  Median : 14.45   Mode  :character   Q: 77   
##  Mean   : 32.20                      S:644   
##  3rd Qu.: 31.00                              
##  Max.   :512.33

b. Predictor Modification

`title`

Beside that, we thought that there is necessity to extract title from Name that reflects their social status which also reflects their ability to afford higher class ticket in Titanic. We want to see their titles can increase their survivablity or not.

So, we extract the title using gsub.

titanic_clean <- titanic_clean %>% 
  mutate(title=gsub("(.*),\\s(.*)\\.(.*)", "\\2",Name)) %>% 
  mutate(title=ifelse(grepl("Martin", title), "Mrs", title))

After title extract we can see the Titanic passengers’ title.

table(titanic_clean$title)

## 
##         Capt          Col          Don           Dr     Jonkheer         Lady 
##            1            2            1            7            1            1 
##        Major       Master         Miss         Mlle          Mme           Mr 
##            2           40          182            2            1          517 
##          Mrs           Ms          Rev          Sir the Countess 
##          125            1            6            1            1

We can see that most common title are Mr, Mrs ans Miss. Other titles are less common/rare compared to Mr, Mrs ans Miss. To simplify we will categorize to only 4 categories which are Mr, Mrs, Miss, Rare.

titanic_clean <- titanic_clean %>% 
  mutate(title=ifelse(!grepl("Mr|Miss|Mrs", title), "Rare", title))

After simplifying title, we now have 4 categories.

table(titanic_clean$title)

## 
## Miss   Mr  Mrs Rare 
##  182  517  125   67

`alone`

We need to know whether the passengers are alone or not by looking at SibSp and Parch. If passenger with either SibSp or Parch that greater than zero, then the passenger are not travelling alone. And We want to see whether the solo passenger have greater chance of survive or not.

We are going to add column alone by using formula below:

titanic_clean <- titanic_clean %>% 
  mutate(alone=ifelse(SibSp>0 | Parch>0, 0, 1)) %>% 
  mutate(alone=as.factor(alone),
         title=as.factor(title))

head(titanic_clean)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  28     0     0
##             Ticket    Fare Cabin Embarked title alone
## 1        A/5 21171  7.2500              S    Mr     0
## 2         PC 17599 71.2833   C85        C   Mrs     0
## 3 STON/O2. 3101282  7.9250              S  Miss     1
## 4           113803 53.1000  C123        S   Mrs     0
## 5           373450  8.0500              S    Mr     1
## 6           330877  8.4583              Q    Mr     1

After modification done above, we will drop some of the variables that are unique or if converted as factors will not gain much insight.

Ticket (dropped due to unique value)
Name (dropped due to unique value)
PassangerId (dropped due to unique value)
Cabin (dropped due to unique value)
SibSp (dropped after add alone variables)
Parch (dropped after add alone variables)

titanic_clean <- titanic_clean %>%
  select(-Name,-Ticket,-PassengerId,-Cabin,-SibSp,-Parch)

Below are the glance of titanic_clean.

head(titanic_clean)

##   Survived Pclass    Sex Age    Fare Embarked title alone
## 1        0      3   male  22  7.2500        S    Mr     0
## 2        1      1 female  38 71.2833        C   Mrs     0
## 3        1      3 female  26  7.9250        S  Miss     1
## 4        1      1 female  35 53.1000        S   Mrs     0
## 5        0      3   male  35  8.0500        S    Mr     1
## 6        0      3   male  28  8.4583        Q    Mr     1

summary(titanic_clean)

##  Survived Pclass      Sex           Age             Fare        Embarked
##  0:549    1:216   female:314   Min.   : 0.42   Min.   :  0.00    :  2   
##  1:342    2:184   male  :577   1st Qu.:22.00   1st Qu.:  7.91   C:168   
##           3:491                Median :28.00   Median : 14.45   Q: 77   
##                                Mean   :29.36   Mean   : 32.20   S:644   
##                                3rd Qu.:35.00   3rd Qu.: 31.00           
##                                Max.   :80.00   Max.   :512.33           
##   title     alone  
##  Miss:182   0:354  
##  Mr  :517   1:537  
##  Mrs :125          
##  Rare: 67          
##                    
##

3. Logistic Regression

a. Pre-Processing

In this part, we want to see the distribution of target class which is Survived.

prop.table(table(titanic_clean$Survived))

## 
##         0         1 
## 0.6161616 0.3838384

table(titanic_clean$Survived)

## 
##   0   1 
## 549 342

As we can see above, the proportion is not that perfectly balance but still tolerable. So, we decided not to modify the proportion for this model training.

b. Splitting Data

We will split the data for train and test with proportion of 80% and 20% respectively.

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(909)
intrain <- sample(nrow(titanic_clean), nrow(titanic_clean)*0.8)
tntc_train <- titanic_clean[intrain,]
tntc_test <- titanic_clean[-intrain,]

Target variable proportion on training data.

prop.table(table(tntc_train$Survived))

## 
##        0        1 
## 0.630618 0.369382

Target variable proportion on test data.

prop.table(table(tntc_test$Survived))

## 
##         0         1 
## 0.5586592 0.4413408

From the proportion comparison above, the proportion of target variable are more or less the same.

c. Modelling

tn_model <- glm(formula=Survived~., data=tntc_train, family="binomial")

summary(tn_model)

## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = tntc_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3610  -0.6779  -0.3742   0.5982   2.5463  
## 
## Coefficients:
##                Estimate  Std. Error z value        Pr(>|z|)    
## (Intercept)   16.303183 1455.397603   0.011        0.991062    
## Pclass2       -0.767544    0.339520  -2.261        0.023779 *  
## Pclass3       -2.199526    0.327859  -6.709 0.0000000000196 ***
## Sexmale      -14.648009  525.370416  -0.028        0.977757    
## Age           -0.034023    0.009184  -3.704        0.000212 ***
## Fare           0.001362    0.002444   0.557        0.577380    
## EmbarkedC    -13.154767 1455.397578  -0.009        0.992788    
## EmbarkedQ    -13.194515 1455.397594  -0.009        0.992767    
## EmbarkedS    -13.797635 1455.397568  -0.009        0.992436    
## titleMr       12.213164  525.370523   0.023        0.981453    
## titleMrs       0.862217    0.377330   2.285        0.022310 *  
## titleRare     13.600986  525.370365   0.026        0.979346    
## alone1         0.446838    0.253354   1.764        0.077784 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 937.88  on 711  degrees of freedom
## Residual deviance: 624.21  on 699  degrees of freedom
## AIC: 650.21
## 
## Number of Fisher Scoring iterations: 14

d. Model Fitting

We will use backward step model fitting to improve the model.

tn_model_2 <- step(object=tn_model, direction="backward", trace=0)

summary(tn_model_2)

## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + Embarked + title + 
##     alone, family = "binomial", data = tntc_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3236  -0.6753  -0.3725   0.6002   2.5518  
## 
## Coefficients:
##                Estimate  Std. Error z value            Pr(>|z|)    
## (Intercept)   16.458576 1455.397573   0.011            0.990977    
## Pclass2       -0.841417    0.312579  -2.692            0.007106 ** 
## Pclass3       -2.283798    0.291288  -7.840 0.00000000000000449 ***
## Sexmale      -14.639543  525.688466  -0.028            0.977783    
## Age           -0.034302    0.009164  -3.743            0.000182 ***
## EmbarkedC    -13.158638 1455.397574  -0.009            0.992786    
## EmbarkedQ    -13.212855 1455.397590  -0.009            0.992756    
## EmbarkedS    -13.820439 1455.397563  -0.009            0.992423    
## titleMr       12.201338  525.688572   0.023            0.981483    
## titleMrs       0.849238    0.376589   2.255            0.024128 *  
## titleRare     13.575700  525.688413   0.026            0.979397    
## alone1         0.410959    0.244490   1.681            0.092786 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 937.88  on 711  degrees of freedom
## Residual deviance: 624.53  on 700  degrees of freedom
## AIC: 648.53
## 
## Number of Fisher Scoring iterations: 14

From model fitting above we can see that the Fare variable is dropped.

e. Prediction

After making the model using logistic regression after model fitting, we will predict the result using test data in tntc_test using predict using tn_model_2 model.

tntc_survival <- predict(tn_model_2,newdata = tntc_test, type = "response")

tntc_survival_label <- as.factor(ifelse(tntc_survival > 0.5, "1", "0"))

After we made the prediction, we will try to see the proportion of survivability between prediction and actual data.

Prediction

prop.table(table(tntc_survival_label))

## tntc_survival_label
##         0         1 
## 0.5865922 0.4134078

Actual Data

prop.table(table(tntc_test$Survived))

## 
##         0         1 
## 0.5586592 0.4413408

From the proportion table above, we can see that the survivability distribution between Prediction and Actual is slightly different. Looking by table above only we cannot know the accuracy, so we need to do model evaluation in the next section to see how accurate the model is.

f. Evaluation

In this section we will evaluate tn_model_2 using confusion matrix.

tntc_conf <- confusionMatrix(tntc_survival_label, tntc_test$Survived, positive="1")
tntc_conf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 90 15
##          1 10 64
##                                              
##                Accuracy : 0.8603             
##                  95% CI : (0.8008, 0.9075)   
##     No Information Rate : 0.5587             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.7149             
##                                              
##  Mcnemar's Test P-Value : 0.4237             
##                                              
##             Sensitivity : 0.8101             
##             Specificity : 0.9000             
##          Pos Pred Value : 0.8649             
##          Neg Pred Value : 0.8571             
##              Prevalence : 0.4413             
##          Detection Rate : 0.3575             
##    Detection Prevalence : 0.4134             
##       Balanced Accuracy : 0.8551             
##                                              
##        'Positive' Class : 1                  
##

TP = 64
TN = 90
FN = 15
FP = 10

Recall <- round((TP)/(TP+FN),2)
Specificity <- round((TN)/(TN+FP),2)
Accuracy <- round((TP+TN)/(TP+TN+FP+FN),2)
Precision <- round((TP)/(TP+FP),2)

performance <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance

##   Accuracy Recall Precision Specificity
## 1     0.86   0.81      0.86         0.9

Accuracy: How accurate our model predict target class globally
Sensitivity/ Recall: How well our model predict positive class, using actual data as reference
Specificity: How well our model predict negative class, using actual data as reference
Pos Pred Value/Precision: How precise our model to predict positive class

From the Confusion matrix above we can see that:

The model ability to predict the target value which is passengers’ survivability in Titanic incident is 86%
The model ability to predict passenger that survived in the incident, using actual data, is 81%
The model can predict passenger that perished in the incident, using actual data, is 90%
From all of predictions resulted from the model, the model ability to predict survived passengers correctly is 86%

g. Model Interpretation

exp(tn_model_2$coefficients) %>% 
  data.frame()

##                                  .
## (Intercept) 14056232.6202546972781
## Pclass2            0.4310990159640
## Pclass3            0.1018964764177
## Sexmale            0.0000004386593
## Age                0.9662798651412
## EmbarkedC          0.0000019287501
## EmbarkedQ          0.0000018269642
## EmbarkedS          0.0000009950836
## titleMr       199055.2946955674561
## titleMrs           2.3378657515973
## titleRare     786777.0621204958297
## alone1             1.5082636870279

We can interpret if the passenger have rare title the odds of surviving the accident is higher than other passengers.

4. K-NN

a. Pre-Processing

For this K-NN, we need to make dummy variables.

dummy <- dummyVars("~Survived+Pclass+Sex+Age+Fare+Embarked+title+alone", data=titanic_clean)
dummy <- data.frame(predict(dummy, newdata = titanic_clean))
str(dummy)

## 'data.frame':    891 obs. of  19 variables:
##  $ Survived.0: num  1 0 0 0 1 1 1 1 0 0 ...
##  $ Survived.1: num  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass.1  : num  0 1 0 1 0 0 1 0 0 0 ...
##  $ Pclass.2  : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Pclass.3  : num  1 0 1 0 1 1 0 1 1 0 ...
##  $ Sex.female: num  0 1 1 1 0 0 0 0 1 1 ...
##  $ Sex.male  : num  1 0 0 0 1 1 1 1 0 0 ...
##  $ Age       : num  22 38 26 35 35 28 54 2 27 14 ...
##  $ Fare      : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked. : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Embarked.C: num  0 1 0 0 0 0 0 0 0 1 ...
##  $ Embarked.Q: num  0 0 0 0 0 1 0 0 0 0 ...
##  $ Embarked.S: num  1 0 1 1 1 0 1 1 1 0 ...
##  $ title.Miss: num  0 0 1 0 0 0 0 0 0 0 ...
##  $ title.Mr  : num  1 0 0 0 1 1 1 0 0 0 ...
##  $ title.Mrs : num  0 1 0 1 0 0 0 0 1 1 ...
##  $ title.Rare: num  0 0 0 0 0 0 0 1 0 0 ...
##  $ alone.0   : num  1 1 0 1 0 0 0 1 1 1 ...
##  $ alone.1   : num  0 0 1 0 1 1 1 0 0 0 ...

Dropping variables with two categories so there will be no duplicate.

dummy <- dummy %>% 
  select(-c(Survived.0, Sex.female, alone.0))

We can see dummy table below for K-NN.

glimpse(dummy)

## Rows: 891
## Columns: 16
## $ Survived.1 <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,~
## $ Pclass.1   <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Pclass.2   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,~
## $ Pclass.3   <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1,~
## $ Sex.male   <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,~
## $ Age        <dbl> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14, 5~
## $ Fare       <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, ~
## $ Embarked.  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Embarked.C <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,~
## $ Embarked.Q <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,~
## $ Embarked.S <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,~
## $ title.Miss <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,~
## $ title.Mr   <dbl> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,~
## $ title.Mrs  <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,~
## $ title.Rare <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,~
## $ alone.1    <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1,~

dim(dummy)

## [1] 891  16

We will split the training and test data using intrain we generated in previous section.

set.seed(909)
dummy_tr <- dummy[intrain,2:16]
dummy_ts <- dummy[-intrain,2:16]

dummy_tr_lab <- dummy[intrain,1]
dummy_ts_lab <- dummy[-intrain,1]

Determining K for K-NN using square root of number of rows of training data.

dummy_tr %>% 
  nrow() %>% 
  sqrt()

## [1] 26.68333

b. K-NN Prediction

Running K-NN Prediction using knn.

knn_tntc <- knn(train=dummy_tr,
                test=dummy_ts,
                cl= dummy_tr_lab,
                k=27)

Running prediction of first knn_tntc model.

pred_knn_tntc <- confusionMatrix(data = knn_tntc,
                                 reference = as.factor(dummy_ts_lab),
                                 positive = "1")
pred_knn_tntc

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 85 42
##          1 15 37
##                                           
##                Accuracy : 0.6816          
##                  95% CI : (0.6079, 0.7491)
##     No Information Rate : 0.5587          
##     P-Value [Acc > NIR] : 0.0005182       
##                                           
##                   Kappa : 0.3302          
##                                           
##  Mcnemar's Test P-Value : 0.0005736       
##                                           
##             Sensitivity : 0.4684          
##             Specificity : 0.8500          
##          Pos Pred Value : 0.7115          
##          Neg Pred Value : 0.6693          
##              Prevalence : 0.4413          
##          Detection Rate : 0.2067          
##    Detection Prevalence : 0.2905          
##       Balanced Accuracy : 0.6592          
##                                           
##        'Positive' Class : 1               
##

As we can see from confusion matrix, most of the matrix are showing not that good accuracy compared to logistic regression above. So we need to do some tuning by scaling the dummy_tr and dummy_ts using Z-score method.

Below are the scaling process for dummy_tr and dummy_ts using scale.

dummy_tr_scaled <- scale(dummy_tr)
dummy_ts_scaled <- scale(dummy_ts,
                center = attr(dummy_tr_scaled,"scaled:center"),
                scale = attr(dummy_tr_scaled, "scaled:scale"))

c. KNN Using Scaled Predictors

We will run K-NN using scaled train and test data.

knn_tntc_scaled <- knn(train=dummy_tr_scaled,
                test=dummy_ts_scaled,
                cl= dummy_tr_lab,
                k=27)

pred_knn_tntc_scaled <- confusionMatrix(data = knn_tntc_scaled,
                                 reference = as.factor(dummy_ts_lab),
                                 positive = "1")
pred_knn_tntc_scaled

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 97 28
##          1  3 51
##                                              
##                Accuracy : 0.8268             
##                  95% CI : (0.7633, 0.8792)   
##     No Information Rate : 0.5587             
##     P-Value [Acc > NIR] : 0.00000000000002639
##                                              
##                   Kappa : 0.6367             
##                                              
##  Mcnemar's Test P-Value : 0.00001628657576648
##                                              
##             Sensitivity : 0.6456             
##             Specificity : 0.9700             
##          Pos Pred Value : 0.9444             
##          Neg Pred Value : 0.7760             
##              Prevalence : 0.4413             
##          Detection Rate : 0.2849             
##    Detection Prevalence : 0.3017             
##       Balanced Accuracy : 0.8078             
##                                              
##        'Positive' Class : 1                  
##

From the confusion matrix above we can see that the model is better than the knn_tntc and with following description:

K_TP = 51
K_TN = 97
K_FN = 28
K_FP = 3

K_Recall <- round((K_TP)/(K_TP+K_FN),2)
K_Specificity <- round((K_TN)/(K_TN+K_FP),2)
K_Accuracy <- round((K_TP+K_TN)/(K_TP+K_TN+K_FP+K_FN),2)
K_Precision <- round((K_TP)/(K_TP+K_FP),2)

K_performance <- cbind.data.frame(K_Accuracy, K_Recall, K_Precision, K_Specificity)
K_performance

##   K_Accuracy K_Recall K_Precision K_Specificity
## 1       0.83     0.65        0.94          0.97

The model ability to predict the target value which is passengers’ survivability in Titanic incident is 83%
The model ability to predict passenger that survived in the incident, using actual data, is 65%
The model can predict passenger that perished in the incident, using actual data, is 97%
From all of predictions resulted from the model, the model ability to predict survived passengers correctly is 94%

5. K-NN and Logistic Regression Comparison

K_performance # K-NN Performance

##   K_Accuracy K_Recall K_Precision K_Specificity
## 1       0.83     0.65        0.94          0.97

performance # Logistic Performance

##   Accuracy Recall Precision Specificity
## 1     0.86   0.81      0.86         0.9

Looking at table above, if we want to choose the model can predict the passengers that is going to survive in the incident correctly using K-NN is better model compared to logistic due to Precision in K-NN is 94% compared to Logistic 86%.

6. Naive Bayes

For this Naive Bayes, we are still using same tntc_train and tntc_test from logistic regression above.

a. Modelling

Running Naive Bayes model.

nb_tntc <- naiveBayes(formula=Survived~.,data=tntc_train)

Making Prediction using nb_tntc with tntc_test.

pred_nb_tntc <- predict(object=nb_tntc,newdata = tntc_test, type="class")

b. Model Evaluation

We will do evaluate model accuracy nb_tntc using confusionMatrix.

confusionMatrix(pred_nb_tntc, reference = tntc_test$Survived, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 88 18
##          1 12 61
##                                               
##                Accuracy : 0.8324              
##                  95% CI : (0.7695, 0.884)     
##     No Information Rate : 0.5587              
##     P-Value [Acc > NIR] : 0.000000000000006858
##                                               
##                   Kappa : 0.6574              
##                                               
##  Mcnemar's Test P-Value : 0.3613              
##                                               
##             Sensitivity : 0.7722              
##             Specificity : 0.8800              
##          Pos Pred Value : 0.8356              
##          Neg Pred Value : 0.8302              
##              Prevalence : 0.4413              
##          Detection Rate : 0.3408              
##    Detection Prevalence : 0.4078              
##       Balanced Accuracy : 0.8261              
##                                               
##        'Positive' Class : 1                   
##

nb_TP = 61
nb_TN = 88
nb_FN = 18
nb_FP = 12

nb_Recall <- round((nb_TP)/(nb_TP+nb_FN),2)
nb_Specificity <- round((nb_TN)/(nb_TN+nb_FP),2)
nb_Accuracy <- round((nb_TP+nb_TN)/(nb_TP+nb_TN+nb_FP+nb_FN),2)
nb_Precision <- round((nb_TP)/(nb_TP+nb_FP),2)

nb_performance <- cbind.data.frame(nb_Accuracy, nb_Recall, nb_Precision, nb_Specificity)
nb_performance

##   nb_Accuracy nb_Recall nb_Precision nb_Specificity
## 1        0.83      0.77         0.84           0.88

From the confusion matrix above we can see the model that:

The model ability to predict the target value which is passengers’ survivability in Titanic incident is 83%
The model ability to predict passenger that survived in the incident, using actual data, is 77%
The model can predict passenger that perished in the incident, using actual data, is 88%
From all of predictions resulted from the model, the model ability to predict survived passengers correctly is 84%

Other methods of model evaluation are ROC/AUC as shown below:

ROC

First, we need to change the prediction outcom to raw data so we can see the probability.

prob_nb_tntc <- predict(object=nb_tntc,newdata = tntc_test, type="raw")
head(prob_nb_tntc)

##              0          1
## [1,] 0.2121731 0.78782687
## [2,] 0.1914160 0.80858399
## [3,] 0.1689987 0.83100130
## [4,] 0.9726496 0.02735036
## [5,] 0.2101217 0.78987826
## [6,] 0.1143287 0.88567130

Then, we combine the postive probability with actual outcome in tntc_test as the base data to generate ROC plot.

tntc_nb_roc <- data.frame(pred=prob_nb_tntc[,2],
                         label=tntc_test$Survived)
head(tntc_nb_roc)

##         pred label
## 1 0.78782687     1
## 2 0.80858399     1
## 3 0.83100130     1
## 4 0.02735036     0
## 5 0.78987826     0
## 6 0.88567130     1

Below is the plotting of ROC.

tntc_nb_roc_perf <- prediction(predictions=tntc_nb_roc$pred,
                               labels=tntc_nb_roc$label)
plot(performance(prediction.obj=tntc_nb_roc_perf, measure="tpr", x.measure="fpr"))

From the line formed in chart above we can see the the model has high True Positive Rate low positive rate. We can say the model is sufficient to predict the passengers survival. To show how good the model perform in terms of number we need to check AUC.

AUC

We will perform AUC for nb_tntc.

tntc_nb_auc <- performance(prediction.obj=tntc_nb_roc_perf, measure="auc")
tntc_nb_auc@y.values

## [[1]]
## [1] 0.8512658

We can see the AUC of nb_tntc is 0.8512658 which is sufficient in terms of performance.

7. Decision Tree (DT)

For this decision tree model, we will use tntc_train and tntc_test from logistic model above.

a.Modelling

We will run the decision tree using rpart so we can neatly visualized the decision tree also to make interpretation easier.

dt_tntc <- rpart(formula=Survived~., data=tntc_train, method = "class")
dt_tntc

## n= 712 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 712 263 0 (0.63061798 0.36938202)  
##     2) title=Mr 413  64 0 (0.84503632 0.15496368)  
##       4) Fare< 26.26875 309  29 0 (0.90614887 0.09385113) *
##       5) Fare>=26.26875 104  35 0 (0.66346154 0.33653846)  
##        10) Fare>=26.775 92  26 0 (0.71739130 0.28260870) *
##        11) Fare< 26.775 12   3 1 (0.25000000 0.75000000) *
##     3) title=Miss,Mrs,Rare 299 100 1 (0.33444816 0.66555184)  
##       6) Pclass=3 148  67 0 (0.54729730 0.45270270)  
##        12) Fare>=20.825 43   5 0 (0.88372093 0.11627907) *
##        13) Fare< 20.825 105  43 1 (0.40952381 0.59047619)  
##          26) Sex=female 96  43 1 (0.44791667 0.55208333)  
##            52) Fare>=7.8875 61  28 0 (0.54098361 0.45901639)  
##             104) Fare< 15.3729 40  13 0 (0.67500000 0.32500000) *
##             105) Fare>=15.3729 21   6 1 (0.28571429 0.71428571) *
##            53) Fare< 7.8875 35  10 1 (0.28571429 0.71428571) *
##          27) Sex=male 9   0 1 (0.00000000 1.00000000) *
##       7) Pclass=1,2 151  19 1 (0.12582781 0.87417219)  
##        14) Sex=male 27  12 1 (0.44444444 0.55555556)  
##          28) Age>=15.5 17   5 0 (0.70588235 0.29411765) *
##          29) Age< 15.5 10   0 1 (0.00000000 1.00000000) *
##        15) Sex=female 124   7 1 (0.05645161 0.94354839) *

After we run the model, we will visualize it using fancyRpartPlot.

fancyRpartPlot(dt_tntc, sub=NULL,cex=0.6)

table(titanic_clean$Embarked)

## 
##       C   Q   S 
##   2 168  77 644

From plot above, we can interpret that:

If a passenger is has Mr. as title and buy ticket fare that less than $26, there is 43% chance that he will not survive.
Meanwhile, if the passenger not a Mr, and has 3rd class ticket and is a male, there is 17% the passenger will survive.

dt_tntc_pred<- predict(object=dt_tntc,newdata = tntc_test, type="class")

b. Evaluation

We will do evaluate model accuracy dt_tntc using confusionMatrix.

confusionMatrix(dt_tntc_pred, reference = tntc_test$Survived, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 94 23
##          1  6 56
##                                               
##                Accuracy : 0.838               
##                  95% CI : (0.7757, 0.8887)    
##     No Information Rate : 0.5587              
##     P-Value [Acc > NIR] : 0.000000000000001714
##                                               
##                   Kappa : 0.6639              
##                                               
##  Mcnemar's Test P-Value : 0.002967            
##                                               
##             Sensitivity : 0.7089              
##             Specificity : 0.9400              
##          Pos Pred Value : 0.9032              
##          Neg Pred Value : 0.8034              
##              Prevalence : 0.4413              
##          Detection Rate : 0.3128              
##    Detection Prevalence : 0.3464              
##       Balanced Accuracy : 0.8244              
##                                               
##        'Positive' Class : 1                   
##

dt_TP = 56
dt_TN = 94
dt_FN = 23
dt_FP = 6

dt_Recall <- round((dt_TP)/(dt_TP+dt_FN),2)
dt_Specificity <- round((dt_TN)/(dt_TN+dt_FP),2)
dt_Accuracy <- round((dt_TP+dt_TN)/(dt_TP+dt_TN+dt_FP+dt_FN),2)
dt_Precision <- round((dt_TP)/(dt_TP+dt_FP),2)

dt_performance <- cbind.data.frame(dt_Accuracy, dt_Recall, dt_Precision, dt_Specificity)
dt_performance

##   dt_Accuracy dt_Recall dt_Precision dt_Specificity
## 1        0.84      0.71          0.9           0.94

From the confusion matrix above we can see the model that:

The model ability to predict the target value which is passengers’ survivability in Titanic incident is 84%
The model ability to predict passenger that survived in the incident, using actual data, is 71%
The model can predict passenger that perished in the incident, using actual data, is 94%
From all of predictions resulted from the model, the model ability to predict survived passengers correctly is 90%

Other methods of model evaluation are ROC/AUC as shown below:

ROC

First, we need to change the prediction outcom to prob data so we can see the probability.

prob_dt_tntc <- predict(object=dt_tntc,newdata = tntc_test, type="prob")
head(prob_dt_tntc)

##             0          1
## 9  0.67500000 0.32500000
## 11 0.28571429 0.71428571
## 16 0.05645161 0.94354839
## 21 0.90614887 0.09385113
## 25 0.88372093 0.11627907
## 40 0.67500000 0.32500000

Then, we combine the postive probability with actual outcome in tntc_test as the base data to generate ROC plot.

tntc_dt_roc <- data.frame(pred=prob_dt_tntc[,2],
                         label=tntc_test$Survived)
head(tntc_dt_roc)

##          pred label
## 9  0.32500000     1
## 11 0.71428571     1
## 16 0.94354839     1
## 21 0.09385113     0
## 25 0.11627907     0
## 40 0.32500000     1

Below is the plotting of ROC.

tntc_dt_roc_perf <- prediction(predictions=tntc_dt_roc$pred,
                               labels=tntc_dt_roc$label)
plot(performance(prediction.obj=tntc_dt_roc_perf, measure="tpr", x.measure="fpr"))

AUC

tntc_dt_auc <- performance(prediction.obj=tntc_dt_roc_perf, measure="auc")
tntc_dt_auc@y.values

## [[1]]
## [1] 0.8837975

We can see the AUC of dt_tntc is 0.8837975 which is sligtly better than nb_tntc.

8. Naive Bayes and DT Comparison

Looking AUC, DT has better AUC (0.8837975) compared to Naive Bayes (0.8512658). Also, if we want to see the correct prediction of survival, DT has better performance, 0.9 for DT and 0.84 for Naive Bayes. DT model have better performance than Naive Bayes.

titanic

Yogie Pratama

2/26/2022

0. Library

1. Read Data & Understanding

2. Data Manipulation

a. Data Types & Missing Values

b. Predictor Modification

`title`

`alone`

3. Logistic Regression

a. Pre-Processing

b. Splitting Data

c. Modelling

d. Model Fitting

e. Prediction

f. Evaluation

g. Model Interpretation

4. K-NN

a. Pre-Processing

b. K-NN Prediction

c. KNN Using Scaled Predictors

5. K-NN and Logistic Regression Comparison

6. Naive Bayes

a. Modelling

b. Model Evaluation

ROC

AUC

7. Decision Tree (DT)

a.Modelling

b. Evaluation

ROC

AUC

8. Naive Bayes and DT Comparison

Reference