This report is part of LBB and will predict the survivability of the Titanic passenger in the incident that happened in 1912 using data provided by Kaggle Titanic Competition.
The RMS Titanic sank in the early morning hours of 15 April 1912 in the North Atlantic Ocean, four days into her maiden voyage from Southampton to New York City. The largest ocean liner in service at the time, Titanic had an estimated 2,224 people on board when she struck an iceberg at around 23:40 (ship’s time) on Sunday, 14 April 1912. Her sinking two hours and forty minutes later at 02:20 (ship’s time; 05:18 GMT) on Monday, 15 April, resulted in the deaths of more than 1,500 people, making it one of the deadliest peacetime maritime disasters in history.
Our aim is to predict what of passengers will likely to survive the Titanic shipwreck.
library(dplyr)
library(tidyverse)
library(gtools)
library(ggplot2)
library(caret)
library(class)
library(e1071)
library(ROCR)
library(rpart)
library(rattle)
library(rpart.plot)
options(scipen=99)We will use data provided by Kaggle from their machine learning competition.
titanic <- read.csv("train.csv")glimpse(titanic)## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal~
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
The data set contains following columns:
head(titanic)## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
We need to change data types of some variables to factor:
SurvivedPclassSexEmbarkedtitanic_clean <- titanic %>%
mutate(Survived=as.factor(Survived),
Pclass=as.factor(Pclass),
Sex=as.factor(Sex),
Embarked=as.factor(Embarked)
)Also, we need to check whether there is missing value or not.
colSums(is.na(titanic_clean))## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
There are 177 missing values in Age. We decided to fill the missing Age with median value.
summary(titanic_clean$Age)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
Replacing NA with median Age
titanic_clean <- titanic_clean %>%
mutate(Age=replace_na(Age,28.00))As we can see, there is no more missing values.
colSums(is.na(titanic_clean))## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
Below are the summary of the data set after initial wrangling.
summary(titanic_clean)## PassengerId Survived Pclass Name Sex
## Min. : 1.0 0:549 1:216 Length:891 female:314
## 1st Qu.:223.5 1:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:22.00 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.36 Mean :0.523 Mean :0.3816
## 3rd Qu.:35.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## Fare Cabin Embarked
## Min. : 0.00 Length:891 : 2
## 1st Qu.: 7.91 Class :character C:168
## Median : 14.45 Mode :character Q: 77
## Mean : 32.20 S:644
## 3rd Qu.: 31.00
## Max. :512.33
titleBeside that, we thought that there is necessity to extract title from Name that reflects their social status which also reflects their ability to afford higher class ticket in Titanic. We want to see their titles can increase their survivablity or not.
So, we extract the title using gsub.
titanic_clean <- titanic_clean %>%
mutate(title=gsub("(.*),\\s(.*)\\.(.*)", "\\2",Name)) %>%
mutate(title=ifelse(grepl("Martin", title), "Mrs", title))After title extract we can see the Titanic passengers’ title.
table(titanic_clean$title)##
## Capt Col Don Dr Jonkheer Lady
## 1 2 1 7 1 1
## Major Master Miss Mlle Mme Mr
## 2 40 182 2 1 517
## Mrs Ms Rev Sir the Countess
## 125 1 6 1 1
We can see that most common title are Mr, Mrs ans Miss. Other titles are less common/rare compared to Mr, Mrs ans Miss. To simplify we will categorize to only 4 categories which are Mr, Mrs, Miss, Rare.
titanic_clean <- titanic_clean %>%
mutate(title=ifelse(!grepl("Mr|Miss|Mrs", title), "Rare", title))After simplifying title, we now have 4 categories.
table(titanic_clean$title)##
## Miss Mr Mrs Rare
## 182 517 125 67
aloneWe need to know whether the passengers are alone or not by looking at SibSp and Parch. If passenger with either SibSp or Parch that greater than zero, then the passenger are not travelling alone. And We want to see whether the solo passenger have greater chance of survive or not.
We are going to add column alone by using formula below:
titanic_clean <- titanic_clean %>%
mutate(alone=ifelse(SibSp>0 | Parch>0, 0, 1)) %>%
mutate(alone=as.factor(alone),
title=as.factor(title))head(titanic_clean)## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male 28 0 0
## Ticket Fare Cabin Embarked title alone
## 1 A/5 21171 7.2500 S Mr 0
## 2 PC 17599 71.2833 C85 C Mrs 0
## 3 STON/O2. 3101282 7.9250 S Miss 1
## 4 113803 53.1000 C123 S Mrs 0
## 5 373450 8.0500 S Mr 1
## 6 330877 8.4583 Q Mr 1
After modification done above, we will drop some of the variables that are unique or if converted as factors will not gain much insight.
Ticket (dropped due to unique value)Name (dropped due to unique value)PassangerId (dropped due to unique value)Cabin (dropped due to unique value)SibSp (dropped after add alone variables)Parch (dropped after add alone variables)titanic_clean <- titanic_clean %>%
select(-Name,-Ticket,-PassengerId,-Cabin,-SibSp,-Parch)Below are the glance of titanic_clean.
head(titanic_clean)## Survived Pclass Sex Age Fare Embarked title alone
## 1 0 3 male 22 7.2500 S Mr 0
## 2 1 1 female 38 71.2833 C Mrs 0
## 3 1 3 female 26 7.9250 S Miss 1
## 4 1 1 female 35 53.1000 S Mrs 0
## 5 0 3 male 35 8.0500 S Mr 1
## 6 0 3 male 28 8.4583 Q Mr 1
summary(titanic_clean)## Survived Pclass Sex Age Fare Embarked
## 0:549 1:216 female:314 Min. : 0.42 Min. : 0.00 : 2
## 1:342 2:184 male :577 1st Qu.:22.00 1st Qu.: 7.91 C:168
## 3:491 Median :28.00 Median : 14.45 Q: 77
## Mean :29.36 Mean : 32.20 S:644
## 3rd Qu.:35.00 3rd Qu.: 31.00
## Max. :80.00 Max. :512.33
## title alone
## Miss:182 0:354
## Mr :517 1:537
## Mrs :125
## Rare: 67
##
##
In this part, we want to see the distribution of target class which is Survived.
prop.table(table(titanic_clean$Survived))##
## 0 1
## 0.6161616 0.3838384
table(titanic_clean$Survived)##
## 0 1
## 549 342
As we can see above, the proportion is not that perfectly balance but still tolerable. So, we decided not to modify the proportion for this model training.
We will split the data for train and test with proportion of 80% and 20% respectively.
RNGkind(sample.kind = "Rounding")## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(909)
intrain <- sample(nrow(titanic_clean), nrow(titanic_clean)*0.8)
tntc_train <- titanic_clean[intrain,]
tntc_test <- titanic_clean[-intrain,]Target variable proportion on training data.
prop.table(table(tntc_train$Survived))##
## 0 1
## 0.630618 0.369382
Target variable proportion on test data.
prop.table(table(tntc_test$Survived))##
## 0 1
## 0.5586592 0.4413408
From the proportion comparison above, the proportion of target variable are more or less the same.
tn_model <- glm(formula=Survived~., data=tntc_train, family="binomial")
summary(tn_model)##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = tntc_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3610 -0.6779 -0.3742 0.5982 2.5463
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.303183 1455.397603 0.011 0.991062
## Pclass2 -0.767544 0.339520 -2.261 0.023779 *
## Pclass3 -2.199526 0.327859 -6.709 0.0000000000196 ***
## Sexmale -14.648009 525.370416 -0.028 0.977757
## Age -0.034023 0.009184 -3.704 0.000212 ***
## Fare 0.001362 0.002444 0.557 0.577380
## EmbarkedC -13.154767 1455.397578 -0.009 0.992788
## EmbarkedQ -13.194515 1455.397594 -0.009 0.992767
## EmbarkedS -13.797635 1455.397568 -0.009 0.992436
## titleMr 12.213164 525.370523 0.023 0.981453
## titleMrs 0.862217 0.377330 2.285 0.022310 *
## titleRare 13.600986 525.370365 0.026 0.979346
## alone1 0.446838 0.253354 1.764 0.077784 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 937.88 on 711 degrees of freedom
## Residual deviance: 624.21 on 699 degrees of freedom
## AIC: 650.21
##
## Number of Fisher Scoring iterations: 14
We will use backward step model fitting to improve the model.
tn_model_2 <- step(object=tn_model, direction="backward", trace=0)summary(tn_model_2)##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + Embarked + title +
## alone, family = "binomial", data = tntc_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3236 -0.6753 -0.3725 0.6002 2.5518
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.458576 1455.397573 0.011 0.990977
## Pclass2 -0.841417 0.312579 -2.692 0.007106 **
## Pclass3 -2.283798 0.291288 -7.840 0.00000000000000449 ***
## Sexmale -14.639543 525.688466 -0.028 0.977783
## Age -0.034302 0.009164 -3.743 0.000182 ***
## EmbarkedC -13.158638 1455.397574 -0.009 0.992786
## EmbarkedQ -13.212855 1455.397590 -0.009 0.992756
## EmbarkedS -13.820439 1455.397563 -0.009 0.992423
## titleMr 12.201338 525.688572 0.023 0.981483
## titleMrs 0.849238 0.376589 2.255 0.024128 *
## titleRare 13.575700 525.688413 0.026 0.979397
## alone1 0.410959 0.244490 1.681 0.092786 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 937.88 on 711 degrees of freedom
## Residual deviance: 624.53 on 700 degrees of freedom
## AIC: 648.53
##
## Number of Fisher Scoring iterations: 14
From model fitting above we can see that the Fare variable is dropped.
After making the model using logistic regression after model fitting, we will predict the result using test data in tntc_test using predict using tn_model_2 model.
tntc_survival <- predict(tn_model_2,newdata = tntc_test, type = "response")tntc_survival_label <- as.factor(ifelse(tntc_survival > 0.5, "1", "0"))After we made the prediction, we will try to see the proportion of survivability between prediction and actual data.
Prediction
prop.table(table(tntc_survival_label))## tntc_survival_label
## 0 1
## 0.5865922 0.4134078
Actual Data
prop.table(table(tntc_test$Survived))##
## 0 1
## 0.5586592 0.4413408
From the proportion table above, we can see that the survivability distribution between Prediction and Actual is slightly different. Looking by table above only we cannot know the accuracy, so we need to do model evaluation in the next section to see how accurate the model is.
In this section we will evaluate tn_model_2 using confusion matrix.
tntc_conf <- confusionMatrix(tntc_survival_label, tntc_test$Survived, positive="1")
tntc_conf## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 90 15
## 1 10 64
##
## Accuracy : 0.8603
## 95% CI : (0.8008, 0.9075)
## No Information Rate : 0.5587
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.7149
##
## Mcnemar's Test P-Value : 0.4237
##
## Sensitivity : 0.8101
## Specificity : 0.9000
## Pos Pred Value : 0.8649
## Neg Pred Value : 0.8571
## Prevalence : 0.4413
## Detection Rate : 0.3575
## Detection Prevalence : 0.4134
## Balanced Accuracy : 0.8551
##
## 'Positive' Class : 1
##
TP = 64
TN = 90
FN = 15
FP = 10Recall <- round((TP)/(TP+FN),2)
Specificity <- round((TN)/(TN+FP),2)
Accuracy <- round((TP+TN)/(TP+TN+FP+FN),2)
Precision <- round((TP)/(TP+FP),2)
performance <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance## Accuracy Recall Precision Specificity
## 1 0.86 0.81 0.86 0.9
positive class, using actual data as referencenegative class, using actual data as referenceFrom the Confusion matrix above we can see that:
exp(tn_model_2$coefficients) %>%
data.frame()## .
## (Intercept) 14056232.6202546972781
## Pclass2 0.4310990159640
## Pclass3 0.1018964764177
## Sexmale 0.0000004386593
## Age 0.9662798651412
## EmbarkedC 0.0000019287501
## EmbarkedQ 0.0000018269642
## EmbarkedS 0.0000009950836
## titleMr 199055.2946955674561
## titleMrs 2.3378657515973
## titleRare 786777.0621204958297
## alone1 1.5082636870279
We can interpret if the passenger have rare title the odds of surviving the accident is higher than other passengers.
For this K-NN, we need to make dummy variables.
dummy <- dummyVars("~Survived+Pclass+Sex+Age+Fare+Embarked+title+alone", data=titanic_clean)
dummy <- data.frame(predict(dummy, newdata = titanic_clean))
str(dummy)## 'data.frame': 891 obs. of 19 variables:
## $ Survived.0: num 1 0 0 0 1 1 1 1 0 0 ...
## $ Survived.1: num 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass.1 : num 0 1 0 1 0 0 1 0 0 0 ...
## $ Pclass.2 : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Pclass.3 : num 1 0 1 0 1 1 0 1 1 0 ...
## $ Sex.female: num 0 1 1 1 0 0 0 0 1 1 ...
## $ Sex.male : num 1 0 0 0 1 1 1 1 0 0 ...
## $ Age : num 22 38 26 35 35 28 54 2 27 14 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Embarked.C: num 0 1 0 0 0 0 0 0 0 1 ...
## $ Embarked.Q: num 0 0 0 0 0 1 0 0 0 0 ...
## $ Embarked.S: num 1 0 1 1 1 0 1 1 1 0 ...
## $ title.Miss: num 0 0 1 0 0 0 0 0 0 0 ...
## $ title.Mr : num 1 0 0 0 1 1 1 0 0 0 ...
## $ title.Mrs : num 0 1 0 1 0 0 0 0 1 1 ...
## $ title.Rare: num 0 0 0 0 0 0 0 1 0 0 ...
## $ alone.0 : num 1 1 0 1 0 0 0 1 1 1 ...
## $ alone.1 : num 0 0 1 0 1 1 1 0 0 0 ...
Dropping variables with two categories so there will be no duplicate.
dummy <- dummy %>%
select(-c(Survived.0, Sex.female, alone.0))We can see dummy table below for K-NN.
glimpse(dummy)## Rows: 891
## Columns: 16
## $ Survived.1 <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,~
## $ Pclass.1 <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Pclass.2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,~
## $ Pclass.3 <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1,~
## $ Sex.male <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,~
## $ Age <dbl> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14, 5~
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, ~
## $ Embarked. <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Embarked.C <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,~
## $ Embarked.Q <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,~
## $ Embarked.S <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,~
## $ title.Miss <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,~
## $ title.Mr <dbl> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,~
## $ title.Mrs <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,~
## $ title.Rare <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,~
## $ alone.1 <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1,~
dim(dummy)## [1] 891 16
We will split the training and test data using intrain we generated in previous section.
set.seed(909)
dummy_tr <- dummy[intrain,2:16]
dummy_ts <- dummy[-intrain,2:16]
dummy_tr_lab <- dummy[intrain,1]
dummy_ts_lab <- dummy[-intrain,1]Determining K for K-NN using square root of number of rows of training data.
dummy_tr %>%
nrow() %>%
sqrt()## [1] 26.68333
Running K-NN Prediction using knn.
knn_tntc <- knn(train=dummy_tr,
test=dummy_ts,
cl= dummy_tr_lab,
k=27)Running prediction of first knn_tntc model.
pred_knn_tntc <- confusionMatrix(data = knn_tntc,
reference = as.factor(dummy_ts_lab),
positive = "1")
pred_knn_tntc## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 85 42
## 1 15 37
##
## Accuracy : 0.6816
## 95% CI : (0.6079, 0.7491)
## No Information Rate : 0.5587
## P-Value [Acc > NIR] : 0.0005182
##
## Kappa : 0.3302
##
## Mcnemar's Test P-Value : 0.0005736
##
## Sensitivity : 0.4684
## Specificity : 0.8500
## Pos Pred Value : 0.7115
## Neg Pred Value : 0.6693
## Prevalence : 0.4413
## Detection Rate : 0.2067
## Detection Prevalence : 0.2905
## Balanced Accuracy : 0.6592
##
## 'Positive' Class : 1
##
As we can see from confusion matrix, most of the matrix are showing not that good accuracy compared to logistic regression above. So we need to do some tuning by scaling the dummy_tr and dummy_ts using Z-score method.
Below are the scaling process for dummy_tr and dummy_ts using scale.
dummy_tr_scaled <- scale(dummy_tr)
dummy_ts_scaled <- scale(dummy_ts,
center = attr(dummy_tr_scaled,"scaled:center"),
scale = attr(dummy_tr_scaled, "scaled:scale"))We will run K-NN using scaled train and test data.
knn_tntc_scaled <- knn(train=dummy_tr_scaled,
test=dummy_ts_scaled,
cl= dummy_tr_lab,
k=27)pred_knn_tntc_scaled <- confusionMatrix(data = knn_tntc_scaled,
reference = as.factor(dummy_ts_lab),
positive = "1")
pred_knn_tntc_scaled## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 97 28
## 1 3 51
##
## Accuracy : 0.8268
## 95% CI : (0.7633, 0.8792)
## No Information Rate : 0.5587
## P-Value [Acc > NIR] : 0.00000000000002639
##
## Kappa : 0.6367
##
## Mcnemar's Test P-Value : 0.00001628657576648
##
## Sensitivity : 0.6456
## Specificity : 0.9700
## Pos Pred Value : 0.9444
## Neg Pred Value : 0.7760
## Prevalence : 0.4413
## Detection Rate : 0.2849
## Detection Prevalence : 0.3017
## Balanced Accuracy : 0.8078
##
## 'Positive' Class : 1
##
From the confusion matrix above we can see that the model is better than the knn_tntc and with following description:
K_TP = 51
K_TN = 97
K_FN = 28
K_FP = 3K_Recall <- round((K_TP)/(K_TP+K_FN),2)
K_Specificity <- round((K_TN)/(K_TN+K_FP),2)
K_Accuracy <- round((K_TP+K_TN)/(K_TP+K_TN+K_FP+K_FN),2)
K_Precision <- round((K_TP)/(K_TP+K_FP),2)
K_performance <- cbind.data.frame(K_Accuracy, K_Recall, K_Precision, K_Specificity)
K_performance## K_Accuracy K_Recall K_Precision K_Specificity
## 1 0.83 0.65 0.94 0.97
K_performance # K-NN Performance## K_Accuracy K_Recall K_Precision K_Specificity
## 1 0.83 0.65 0.94 0.97
performance # Logistic Performance## Accuracy Recall Precision Specificity
## 1 0.86 0.81 0.86 0.9
Looking at table above, if we want to choose the model can predict the passengers that is going to survive in the incident correctly using K-NN is better model compared to logistic due to Precision in K-NN is 94% compared to Logistic 86%.
For this Naive Bayes, we are still using same tntc_train and tntc_test from logistic regression above.
Running Naive Bayes model.
nb_tntc <- naiveBayes(formula=Survived~.,data=tntc_train)Making Prediction using nb_tntc with tntc_test.
pred_nb_tntc <- predict(object=nb_tntc,newdata = tntc_test, type="class")We will do evaluate model accuracy nb_tntc using confusionMatrix.
confusionMatrix(pred_nb_tntc, reference = tntc_test$Survived, positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 18
## 1 12 61
##
## Accuracy : 0.8324
## 95% CI : (0.7695, 0.884)
## No Information Rate : 0.5587
## P-Value [Acc > NIR] : 0.000000000000006858
##
## Kappa : 0.6574
##
## Mcnemar's Test P-Value : 0.3613
##
## Sensitivity : 0.7722
## Specificity : 0.8800
## Pos Pred Value : 0.8356
## Neg Pred Value : 0.8302
## Prevalence : 0.4413
## Detection Rate : 0.3408
## Detection Prevalence : 0.4078
## Balanced Accuracy : 0.8261
##
## 'Positive' Class : 1
##
nb_TP = 61
nb_TN = 88
nb_FN = 18
nb_FP = 12nb_Recall <- round((nb_TP)/(nb_TP+nb_FN),2)
nb_Specificity <- round((nb_TN)/(nb_TN+nb_FP),2)
nb_Accuracy <- round((nb_TP+nb_TN)/(nb_TP+nb_TN+nb_FP+nb_FN),2)
nb_Precision <- round((nb_TP)/(nb_TP+nb_FP),2)
nb_performance <- cbind.data.frame(nb_Accuracy, nb_Recall, nb_Precision, nb_Specificity)
nb_performance## nb_Accuracy nb_Recall nb_Precision nb_Specificity
## 1 0.83 0.77 0.84 0.88
From the confusion matrix above we can see the model that:
Other methods of model evaluation are ROC/AUC as shown below:
First, we need to change the prediction outcom to raw data so we can see the probability.
prob_nb_tntc <- predict(object=nb_tntc,newdata = tntc_test, type="raw")
head(prob_nb_tntc)## 0 1
## [1,] 0.2121731 0.78782687
## [2,] 0.1914160 0.80858399
## [3,] 0.1689987 0.83100130
## [4,] 0.9726496 0.02735036
## [5,] 0.2101217 0.78987826
## [6,] 0.1143287 0.88567130
Then, we combine the postive probability with actual outcome in tntc_test as the base data to generate ROC plot.
tntc_nb_roc <- data.frame(pred=prob_nb_tntc[,2],
label=tntc_test$Survived)
head(tntc_nb_roc)## pred label
## 1 0.78782687 1
## 2 0.80858399 1
## 3 0.83100130 1
## 4 0.02735036 0
## 5 0.78987826 0
## 6 0.88567130 1
Below is the plotting of ROC.
tntc_nb_roc_perf <- prediction(predictions=tntc_nb_roc$pred,
labels=tntc_nb_roc$label)
plot(performance(prediction.obj=tntc_nb_roc_perf, measure="tpr", x.measure="fpr"))From the line formed in chart above we can see the the model has high True Positive Rate low positive rate. We can say the model is sufficient to predict the passengers survival. To show how good the model perform in terms of number we need to check AUC.
We will perform AUC for nb_tntc.
tntc_nb_auc <- performance(prediction.obj=tntc_nb_roc_perf, measure="auc")
tntc_nb_auc@y.values## [[1]]
## [1] 0.8512658
We can see the AUC of nb_tntc is 0.8512658 which is sufficient in terms of performance.
For this decision tree model, we will use tntc_train and tntc_test from logistic model above.
We will run the decision tree using rpart so we can neatly visualized the decision tree also to make interpretation easier.
dt_tntc <- rpart(formula=Survived~., data=tntc_train, method = "class")
dt_tntc## n= 712
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 712 263 0 (0.63061798 0.36938202)
## 2) title=Mr 413 64 0 (0.84503632 0.15496368)
## 4) Fare< 26.26875 309 29 0 (0.90614887 0.09385113) *
## 5) Fare>=26.26875 104 35 0 (0.66346154 0.33653846)
## 10) Fare>=26.775 92 26 0 (0.71739130 0.28260870) *
## 11) Fare< 26.775 12 3 1 (0.25000000 0.75000000) *
## 3) title=Miss,Mrs,Rare 299 100 1 (0.33444816 0.66555184)
## 6) Pclass=3 148 67 0 (0.54729730 0.45270270)
## 12) Fare>=20.825 43 5 0 (0.88372093 0.11627907) *
## 13) Fare< 20.825 105 43 1 (0.40952381 0.59047619)
## 26) Sex=female 96 43 1 (0.44791667 0.55208333)
## 52) Fare>=7.8875 61 28 0 (0.54098361 0.45901639)
## 104) Fare< 15.3729 40 13 0 (0.67500000 0.32500000) *
## 105) Fare>=15.3729 21 6 1 (0.28571429 0.71428571) *
## 53) Fare< 7.8875 35 10 1 (0.28571429 0.71428571) *
## 27) Sex=male 9 0 1 (0.00000000 1.00000000) *
## 7) Pclass=1,2 151 19 1 (0.12582781 0.87417219)
## 14) Sex=male 27 12 1 (0.44444444 0.55555556)
## 28) Age>=15.5 17 5 0 (0.70588235 0.29411765) *
## 29) Age< 15.5 10 0 1 (0.00000000 1.00000000) *
## 15) Sex=female 124 7 1 (0.05645161 0.94354839) *
After we run the model, we will visualize it using fancyRpartPlot.
fancyRpartPlot(dt_tntc, sub=NULL,cex=0.6)table(titanic_clean$Embarked)##
## C Q S
## 2 168 77 644
From plot above, we can interpret that:
dt_tntc_pred<- predict(object=dt_tntc,newdata = tntc_test, type="class")We will do evaluate model accuracy dt_tntc using confusionMatrix.
confusionMatrix(dt_tntc_pred, reference = tntc_test$Survived, positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 94 23
## 1 6 56
##
## Accuracy : 0.838
## 95% CI : (0.7757, 0.8887)
## No Information Rate : 0.5587
## P-Value [Acc > NIR] : 0.000000000000001714
##
## Kappa : 0.6639
##
## Mcnemar's Test P-Value : 0.002967
##
## Sensitivity : 0.7089
## Specificity : 0.9400
## Pos Pred Value : 0.9032
## Neg Pred Value : 0.8034
## Prevalence : 0.4413
## Detection Rate : 0.3128
## Detection Prevalence : 0.3464
## Balanced Accuracy : 0.8244
##
## 'Positive' Class : 1
##
dt_TP = 56
dt_TN = 94
dt_FN = 23
dt_FP = 6dt_Recall <- round((dt_TP)/(dt_TP+dt_FN),2)
dt_Specificity <- round((dt_TN)/(dt_TN+dt_FP),2)
dt_Accuracy <- round((dt_TP+dt_TN)/(dt_TP+dt_TN+dt_FP+dt_FN),2)
dt_Precision <- round((dt_TP)/(dt_TP+dt_FP),2)
dt_performance <- cbind.data.frame(dt_Accuracy, dt_Recall, dt_Precision, dt_Specificity)
dt_performance## dt_Accuracy dt_Recall dt_Precision dt_Specificity
## 1 0.84 0.71 0.9 0.94
From the confusion matrix above we can see the model that:
Other methods of model evaluation are ROC/AUC as shown below:
First, we need to change the prediction outcom to prob data so we can see the probability.
prob_dt_tntc <- predict(object=dt_tntc,newdata = tntc_test, type="prob")
head(prob_dt_tntc)## 0 1
## 9 0.67500000 0.32500000
## 11 0.28571429 0.71428571
## 16 0.05645161 0.94354839
## 21 0.90614887 0.09385113
## 25 0.88372093 0.11627907
## 40 0.67500000 0.32500000
Then, we combine the postive probability with actual outcome in tntc_test as the base data to generate ROC plot.
tntc_dt_roc <- data.frame(pred=prob_dt_tntc[,2],
label=tntc_test$Survived)
head(tntc_dt_roc)## pred label
## 9 0.32500000 1
## 11 0.71428571 1
## 16 0.94354839 1
## 21 0.09385113 0
## 25 0.11627907 0
## 40 0.32500000 1
Below is the plotting of ROC.
tntc_dt_roc_perf <- prediction(predictions=tntc_dt_roc$pred,
labels=tntc_dt_roc$label)
plot(performance(prediction.obj=tntc_dt_roc_perf, measure="tpr", x.measure="fpr"))From the line formed in chart above we can see the the model has high True Positive Rate low positive rate. We can say the model is sufficient to predict the passengers survival. To show how good the model perform in terms of number we need to check AUC.
tntc_dt_auc <- performance(prediction.obj=tntc_dt_roc_perf, measure="auc")
tntc_dt_auc@y.values## [[1]]
## [1] 0.8837975
We can see the AUC of dt_tntc is 0.8837975 which is sligtly better than nb_tntc.
Looking AUC, DT has better AUC (0.8837975) compared to Naive Bayes (0.8512658). Also, if we want to see the correct prediction of survival, DT has better performance, 0.9 for DT and 0.84 for Naive Bayes. DT model have better performance than Naive Bayes.