The data set contains biographical information about the passengers of the Titanic: gender, survival status, ticket class, age, number of siblings aboard, number of parents aboard, passenger fare, and port of embarkation. The goal of this research is to predict the survival of Titanic passengers from those variables using two models, logistic regression and K-nearest neighbors (KNN), and to compare the performance of the two models at the end of the analysis.
The steps of the analysis are as follows:
# Read the data
titanic <- read.csv("train_and_test3.csv")
The data consist of 9 variables, coded as follows:

- `Sex`: 0 = male, 1 = female
- `Pclass`: 1 = first class, 2 = second class, 3 = third class
- `Embarked`: 0 = Cherbourg, 1 = Queenstown, 2 = Southampton
- `Survived`: 1 = survived, 0 = died

# To see the data structure
str(titanic)
## 'data.frame': 1309 obs. of 15 variables:
## $ Passengerid: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : num 22 38 26 35 35 28 54 2 27 14 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Sex : int 0 1 1 1 0 0 0 0 1 1 ...
## $ sibsp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Embarked : int 2 0 2 2 2 1 2 2 2 0 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Age1 : Factor w/ 5 levels "0-10","11-20",..: 3 3 3 3 3 3 4 1 3 2 ...
## $ X : logi NA NA NA NA NA NA ...
## $ X.1 : logi NA NA NA NA NA NA ...
## $ X.2 : logi NA NA NA NA NA NA ...
## $ X.3 : logi NA NA NA NA NA NA ...
## $ X.4 : logi NA NA NA NA NA NA ...
Most of the variables are integers or numerics; `Age1` is a factor, and the empty `X` columns (all `NA`) can be ignored.
# Based on the data structure and variation, we will not use `Passengerid` and `Fare`
library(dplyr)
titanic1 <- titanic %>%
select(Sex, sibsp, Parch, Pclass, Embarked, Survived, Age1)
# Check whether there are missing values in the data frame
colSums(is.na(titanic1))
## Sex sibsp Parch Pclass Embarked Survived Age1
## 0 0 0 0 2 0 0
There are 2 missing values in the `Embarked` variable, so we remove those observations.
# Then we use `dplyr` to recode the numeric codes into labelled categories
library(tidyverse)   # loads dplyr (mutate) and tidyr (drop_na)
titanic1 <- titanic1 %>%
drop_na(Embarked) %>%
mutate(Survived = case_when(Survived == 1 ~ "Yes",
Survived == 0 ~ "No"),
Survived = as.factor(Survived),
Pclass = case_when(Pclass == 1 ~ "First Class",
Pclass == 2 ~ "Second Class",
Pclass == 3 ~ "Third Class"),
Sex = case_when(Sex == 1 ~ "Female",
Sex == 0 ~ "Male"),
Embarked = case_when(Embarked == 0 ~ "Cherbourg",
Embarked == 1 ~ "Queenstown",
Embarked == 2 ~ "Southampton")
)
library(ggplot2)

# `Sex` is categorical, so a bar chart (not a density plot) is appropriate
titanic1 %>%
ggplot(mapping = aes(x = Sex, fill = Survived)) +
geom_bar(position = "dodge") +
theme_minimal()
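The plot compares survival counts between the sexes; the same relationship can be summarized numerically (a quick sketch using `dplyr`, which is already loaded):

# Share of survivors within each sex
titanic1 %>%
group_by(Sex) %>%
summarise(survival_rate = mean(Survived == "Yes"))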
set.seed(100)
idx_l <- sample(nrow(titanic1), size=nrow(titanic1)*0.8)
train <- titanic1[idx_l,]
test <- titanic1[-idx_l,]
# Check whether the target classes are balanced in the training data
prop.table(table(train$Survived))
##
## No Yes
## 0.7320574 0.2679426
The training data is imbalanced (about 73% "No" vs 27% "Yes"), so we upsample the minority class with `caret::upSample`:
library(caret)
train_up <- upSample(x = train %>%
select(-Survived),
y = as.factor(train$Survived),
yname = "Survived")
prop.table(table(train_up$Survived))
##
## No Yes
## 0.5 0.5
After upsampling, the two survival classes in the training data are balanced.

Next, we fit a logistic regression and select predictors with backward stepwise elimination (minimizing AIC):
model_backward_up <- step(glm(Survived~., data = train_up, family="binomial"), direction = "backward")
## Start: AIC=1684.28
## Survived ~ Sex + sibsp + Parch + Pclass + Embarked + Age1
##
## Df Deviance AIC
## - Embarked 2 1662.1 1682.1
## - Parch 1 1661.9 1683.9
## <none> 1660.3 1684.3
## - sibsp 1 1674.4 1696.4
## - Age1 4 1709.0 1725.0
## - Pclass 2 1737.6 1757.6
## - Sex 1 1919.7 1941.7
##
## Step: AIC=1682.07
## Survived ~ Sex + sibsp + Parch + Pclass + Age1
##
## Df Deviance AIC
## - Parch 1 1663.8 1681.8
## <none> 1662.1 1682.1
## - sibsp 1 1676.8 1694.8
## - Age1 4 1711.3 1723.3
## - Pclass 2 1753.2 1769.2
## - Sex 1 1938.0 1956.0
##
## Step: AIC=1681.75
## Survived ~ Sex + sibsp + Pclass + Age1
##
## Df Deviance AIC
## <none> 1663.8 1681.8
## - sibsp 1 1681.8 1697.8
## - Age1 4 1711.5 1721.5
## - Pclass 2 1755.3 1769.3
## - Sex 1 1942.0 1958.0
summary(model_backward_up)$call
## glm(formula = Survived ~ Sex + sibsp + Pclass + Age1, family = "binomial",
## data = train_up)
Backward elimination removed `Embarked` and `Parch`, so the final logistic model for predicting who survived on the Titanic is:

glm(formula = Survived ~ Sex + sibsp + Pclass + Age1, family = "binomial", data = train_up)

In other words, the port of embarkation and the number of parents aboard (`Parch`) do not enter the model.
library(gtools)

# Coefficient, odds ratio, and inverse-logit probability for each term
data.frame("coef" = coef(model_backward_up),
           "Odds_ratio" = exp(coef(model_backward_up)),
           "prob" = inv.logit(coef(model_backward_up)))
Interpretation (based on the coefficient table above):

- Class: compared with first-class passengers, second- and third-class passengers have lower odds of survival; the inverse-logit probabilities of their coefficients are about 38% and 23% respectively.
- Sex: male passengers have lower odds of survival than female passengers; the inverse-logit probability for the male coefficient is about 14%.
- sibsp (number of siblings aboard): the number of family members aboard also affects survival; passengers with more siblings aboard tended to have lower odds of survival (inverse-logit probability of the `sibsp` coefficient is about 45%).
- Age: the probability of a child (0-10) surviving is larger than that of an adult, and the probability of an adult (21-40) surviving is higher than that of an elder (61-80).

To sum up, passengers who were first class, female, young, and had fewer family members aboard had the highest probability of surviving.
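As a sanity check, a single coefficient can be converted to an odds ratio and a probability by hand (a minimal sketch; the dummy name `SexMale` is assumed, verify with `names(coef(model_backward_up))`):

b <- coef(model_backward_up)["SexMale"]  # coefficient of the male indicator (name assumed)
exp(b)             # odds ratio: survival odds for males relative to females
1 / (1 + exp(-b))  # inverse logit, equivalent to gtools::inv.logit(b)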
# Predict survival probabilities on the test data
test1 <- test
test1$probability_up <- predict(model_backward_up, newdata = test1,
                                type = "response")
# Compare the predicted probabilities with the actual outcomes
test1 %>%
select(probability_up, Survived)
# Convert probabilities to class labels using a 0.5 threshold
test1$prediction_up <- ifelse(test1$probability_up > 0.5, "Yes", "No")
test1 %>%
select(prediction_up, Survived)
table("prediction" = test1$prediction_up,
"actual" = test1$Survived)
## actual
## prediction No Yes
## No 139 16
## Yes 63 44
library(caret)
confusionMatrix(data = as.factor(test1$prediction_up),
reference = as.factor(test1$Survived),
positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 139 16
## Yes 63 44
##
## Accuracy : 0.6985
## 95% CI : (0.639, 0.7534)
## No Information Rate : 0.771
## P-Value [Acc > NIR] : 0.9973
##
## Kappa : 0.3305
##
## Mcnemar's Test P-Value : 0.0000002274
##
## Sensitivity : 0.7333
## Specificity : 0.6881
## Pos Pred Value : 0.4112
## Neg Pred Value : 0.8968
## Prevalence : 0.2290
## Detection Rate : 0.1679
## Detection Prevalence : 0.4084
## Balanced Accuracy : 0.7107
##
## 'Positive' Class : Yes
##
From the confusion matrix above, accuracy is about 70%, sensitivity (recall) is about 73%, and precision (positive predictive value) is about 41%. Since there is no particular preference between sensitivity and precision here, accuracy will be used as the comparison metric.
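These headline metrics can also be recomputed by hand from the 2x2 table above:

cm <- table(prediction = test1$prediction_up, actual = test1$Survived)
accuracy  <- sum(diag(cm)) / sum(cm)               # (139 + 44) / 262
recall    <- cm["Yes", "Yes"] / sum(cm[, "Yes"])   # sensitivity
precision <- cm["Yes", "Yes"] / sum(cm["Yes", ])   # positive predictive value
c(accuracy = accuracy, recall = recall, precision = precision)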
Next we prepare the predictor and target variables for KNN. KNN can only handle numeric predictors, therefore we need to convert the `Sex` variable back to a numeric variable.
train_up_knn <- train_up %>%
mutate(Sex = case_when(Sex == "Female" ~ 1,
Sex == "Male" ~ 0))
test_knn <- test %>%
mutate(Sex = case_when(Sex == "Female" ~ 1,
Sex == "Male" ~ 0))
# Prepare the predictor (X) data
train_x <- train_up_knn %>%
select_if(is.numeric)
test_x <- test_knn %>%
select_if(is.numeric)
# Prepare the target (y) data
train_y <- train_up_knn %>%
select(Survived)
test_y <- test_knn %>%
select(Survived)
# Standardize to z-scores; the test set uses the training set's center and scale
train_x <- train_x %>%
scale()
test_x <- test_x %>%
scale(center= attr(train_x, "scaled:center"),
scale = attr(train_x, "scaled:scale"))
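As a quick check that the standardization worked, the training predictors should now have mean 0 and standard deviation 1:

round(colMeans(train_x), 10)  # approximately 0 for every column
apply(train_x, 2, sd)         # exactly 1 for every column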
# A common heuristic chooses k near the square root of the number of training rows
sqrt(nrow(train_x))
## [1] 39.11521
dim(train_x)
## [1] 1530 3
dim(test_x)
## [1] 262 3
length(train_up$Survived)
## [1] 1530
# The training set has 1530 rows and sqrt(1530) is about 39.1, so we choose the odd k = 39 to avoid tied votes in binary classification
library(class)
knn_prediction <- knn(train = train_x,
test = test_x,
cl = train_y$Survived,
k= 39)
confusionMatrix(data = as.factor(knn_prediction),
reference = test_y$Survived,
positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 138 16
## Yes 64 44
##
## Accuracy : 0.6947
## 95% CI : (0.635, 0.7498)
## No Information Rate : 0.771
## P-Value [Acc > NIR] : 0.9983
##
## Kappa : 0.3251
##
## Mcnemar's Test P-Value : 0.0000001482
##
## Sensitivity : 0.7333
## Specificity : 0.6832
## Pos Pred Value : 0.4074
## Neg Pred Value : 0.8961
## Prevalence : 0.2290
## Detection Rate : 0.1679
## Detection Prevalence : 0.4122
## Balanced Accuracy : 0.7083
##
## 'Positive' Class : Yes
##
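The value k = 39 comes from the square-root heuristic alone; a small loop over odd values of k (a sketch reusing the objects above) would confirm whether it is a reasonable choice:

ks <- seq(1, 61, by = 2)  # candidate odd values of k
acc <- sapply(ks, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y$Survived, k = k)
  mean(pred == test_y$Survived)  # test-set accuracy at this k
})
ks[which.max(acc)]  # k with the highest test accuracy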
Based on the confusion matrices above, the accuracy of logistic regression and of KNN is 69.9% and 69.5% respectively, so the two models perform essentially the same. Their sensitivity is identical at 73.3%, and their precision is also very close (41.1% for logistic regression versus 40.7% for KNN).
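A compact side-by-side view of the two models (numbers copied from the confusion matrices above):

data.frame(metric   = c("Accuracy", "Sensitivity", "Precision"),
           logistic = c(0.6985, 0.7333, 0.4112),
           knn      = c(0.6947, 0.7333, 0.4074))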
Therefore, based on that performance, we would choose logistic regression over KNN to predict survival on the Titanic. Besides its slightly better metrics, logistic regression is interpretable through its coefficients and can use categorical predictors directly, so more variables are available to it than to KNN.