This page uses a dataset of Titanic passengers who were on board during the disaster. I obtained the data from kaggle.com/datasets.
I am going to predict whether each passenger of the ship survived or not, using machine learning algorithms in the R programming language.
Happy Reading!
First of all, I have to read the data and store it into train and test
Combine data
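A minimal sketch of these two steps, assuming the Kaggle files train.csv and test.csv sit in the working directory (the file names are my assumption):

library(dplyr)

# Read the Kaggle files (assumed names)
train <- read.csv("train.csv", stringsAsFactors = FALSE)
test <- read.csv("test.csv", stringsAsFactors = FALSE)

# test has no Survived column, so add it as NA before stacking the two sets
test$Survived <- NA
comb <- bind_rows(train, test)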
Now, I will fix the dataset
The following fixes the SibSp/Parch values for two passengers (Id=280 and Id=1284) according to this kernel, because a 16-year-old can’t have a 13-year-old son! So I will fix those values manually.
comb$SibSp[comb$PassengerId==280] = 0
comb$Parch[comb$PassengerId==280] = 2
comb$SibSp[comb$PassengerId==1284] = 1
comb$Parch[comb$PassengerId==1284] = 1

Next, I will fix the classes of the columns into the correct ones.
comb <- comb %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Embarked = as.factor(Embarked),
Sex = as.factor(Sex))

Check if there are any missing values
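The counts below come from a check like this (my reconstruction):

colSums(is.na(comb))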
## PassengerId Survived Pclass Name Sex Age
## 0 418 0 0 0 263
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 1 0 0
There are missing values which have to be imputed, and after that we can split our data again into training and testing. We can use the recipes library.
rec<- recipe(Survived~., training(split)) %>%
step_meanimpute(Age, Fare) %>%
step_modeimpute(Survived) %>%
prep()

Split the data into learn and exam
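A sketch of how learn and exam can be recovered from the prepped recipe, assuming the split object was created earlier with rsample::initial_split (that object and its proportion are not shown in this post):

library(rsample)
library(recipes)

# juice returns the imputed training portion; bake applies the same steps to the holdout
learn <- juice(rec)
exam <- bake(rec, new_data = testing(split))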
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 0 0
## Parch Ticket Fare Cabin Embarked Survived
## 0 0 0 0 0 0
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 0 0
## Parch Ticket Fare Cabin Embarked Survived
## 0 0 0 0 0 0
There are no NA or missing values in the datasets.
Let’s check the class proportions of learn first.
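The proportion below comes from a call like:

prop.table(table(learn$Survived))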
##
## 0 1
## 0.7429854 0.2570146
We found that the class ratio is imbalanced, hence it has to be balanced by using the upSample() function.
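A minimal sketch of the upsampling step, assuming caret’s upSample() is the function meant (the learn_up name matches the rest of this post):

library(caret)

# upSample resamples the minority class until both classes have equal counts;
# yname keeps the target column named Survived instead of the default Class
learn_up <- upSample(x = learn %>% select(-Survived),
                     y = learn$Survived,
                     yname = "Survived")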
Check again the proportion
##
## 0 1
## 0.5 0.5
Yup, the data is balanced now.
In this part, I am going to explore my data. I will look at the correlations between the predictors and the target, so I can decide how to build the model.
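The preview and the correlation plot below come from calls like these (ggcorr is from the GGally package; the exact ggcorr call appears in the warning further down):

library(GGally)

head(learn_up)
ggcorr(learn_up, label = T)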
## PassengerId Pclass Name Sex Age SibSp
## 1 5 3 Allen, Mr. William Henry male 35.00000 0
## 2 6 3 Moran, Mr. James male 30.15165 0
## 3 8 3 Palsson, Master. Gosta Leonard male 2.00000 3
## 4 13 3 Saundercock, Mr. William Henry male 20.00000 0
## 5 14 3 Andersson, Mr. Anders Johan male 39.00000 1
## 6 15 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.00000 0
## Parch Ticket Fare Cabin Embarked Survived
## 1 0 373450 8.0500 S 0
## 2 0 330877 8.4583 Q 0
## 3 1 349909 21.0750 S 0
## 4 0 A/5. 2151 8.0500 S 0
## 5 5 347082 31.2750 S 0
## 6 0 350406 7.8542 S 0
## Warning in ggcorr(learn_up, label = T): data in column(s) 'Pclass', 'Name',
## 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Survived' are not numeric and were
## ignored
It seems that PassengerId has no correlation with the other variables.
model1<- glm (formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare +
Embarked, family = "binomial",
data = learn_up)
summary(model1)

##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Fare + Embarked, family = "binomial", data = learn_up)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2261 -0.7977 -0.1501 0.8538 2.0261
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.290578 353.286877 0.046 0.96322
## Pclass2 -0.683230 0.208522 -3.277 0.00105 **
## Pclass3 -1.407711 0.201693 -6.979 2.96e-12 ***
## Sexmale -1.883793 0.136971 -13.753 < 2e-16 ***
## Age -0.031988 0.005621 -5.690 1.27e-08 ***
## SibSp -0.194806 0.076142 -2.558 0.01051 *
## Parch -0.020145 0.083601 -0.241 0.80958
## Fare -0.001161 0.001397 -0.831 0.40573
## EmbarkedC -13.107603 353.286786 -0.037 0.97040
## EmbarkedQ -13.532995 353.286830 -0.038 0.96944
## EmbarkedS -13.427360 353.286774 -0.038 0.96968
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1835.5 on 1323 degrees of freedom
## Residual deviance: 1469.4 on 1313 degrees of freedom
## AIC: 1491.4
##
## Number of Fisher Scoring iterations: 13
pred1<- predict(object = model1, newdata = exam, type ="response")
pred_round1 <- as.factor(ifelse(pred1 >= 0.5, "1", "0"))

Save the data frame to a data object
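A sketch of that step and the confusion-matrix call (the data1 name and its columns are my assumption; confusionMatrix is caret’s, and positive = "1" matches the output below):

library(caret)

data1 <- data.frame(prediction = pred_round1, actual = exam$Survived)
confusionMatrix(data = data1$prediction, reference = data1$actual, positive = "1")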
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 236 27
## 1 69 86
##
## Accuracy : 0.7703
## 95% CI : (0.727, 0.8098)
## No Information Rate : 0.7297
## P-Value [Acc > NIR] : 0.033
##
## Kappa : 0.4788
##
## Mcnemar's Test P-Value : 2.857e-05
##
## Sensitivity : 0.7611
## Specificity : 0.7738
## Pos Pred Value : 0.5548
## Neg Pred Value : 0.8973
## Prevalence : 0.2703
## Detection Rate : 0.2057
## Detection Prevalence : 0.3708
## Balanced Accuracy : 0.7674
##
## 'Positive' Class : 1
##
From the matrix above, it can be concluded that model1 already has a good prediction accuracy of 0.77 ~ 77%.
If 77% accuracy is not enough, we can still try to improve it. One way to do it is by adjusting the threshold of the prediction. Raising the threshold makes the model stricter about predicting survival, which trades sensitivity for specificity; I will try a threshold of 0.7.
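A sketch of the re-thresholded prediction (pred_round2 is my name for it; confusionMatrix is caret’s, as above):

pred_round2 <- as.factor(ifelse(pred1 >= 0.7, "1", "0"))
confusionMatrix(data = pred_round2, reference = exam$Survived, positive = "1")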
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 281 58
## 1 24 55
##
## Accuracy : 0.8038
## 95% CI : (0.7625, 0.8408)
## No Information Rate : 0.7297
## P-Value [Acc > NIR] : 0.0002688
##
## Kappa : 0.4507
##
## Mcnemar's Test P-Value : 0.0002682
##
## Sensitivity : 0.4867
## Specificity : 0.9213
## Pos Pred Value : 0.6962
## Neg Pred Value : 0.8289
## Prevalence : 0.2703
## Detection Rate : 0.1316
## Detection Prevalence : 0.1890
## Balanced Accuracy : 0.7040
##
## 'Positive' Class : 1
##
As we see, the accuracy of the model increased to 80%. Great! Note, though, that the sensitivity dropped to 0.49, which is the cost of raising the threshold.
Store it in data2
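A sketch of that object, assuming it holds the holdout passengers’ ids and the 0.7-threshold predictions (which would match the final output at the end of this post):

data2 <- data.frame(PassengerId = exam$PassengerId,
                    Survived = pred_round2)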
Besides logistic regression, there is a method called knn (k-nearest neighbors) which can be used for classification.
I will split my learn and exam data into learn_x (for predictors) and learn_y (for the target variable Survived), and exam_x (for predictors) and exam_y (for the target variable Survived).
First of all, I will scale my learn and exam data. Since the knn method is effective only with numeric variables, I eliminate all factor and character variables from the data.
learn_z <- learn_up %>%
select(-c(PassengerId, Pclass, Name , Sex, Ticket, Cabin, Embarked)) %>%
mutate_if(is.numeric, scale)
exam_z <- exam %>%
select(-c(PassengerId, Pclass, Name , Sex, Ticket, Cabin, Embarked)) %>%
mutate_if(is.numeric, scale)

Split each set into predictors and target, as sketched below.
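A minimal sketch of that split (the _x/_y names follow the prose above; strictly speaking, exam_z should be scaled with the training set’s means and standard deviations rather than its own):

learn_x <- learn_z %>% select(-Survived)
learn_y <- learn_z$Survived
exam_x <- exam_z %>% select(-Survived)
exam_y <- exam_z$Survived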
To get the k value for knn, I will take the square root of my total number of rows. The k is then adjusted against the number of levels of Survived: if the number of levels is even I will choose an odd k, and if it is odd I will choose an even k, so that votes can never tie.
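The value below comes from a call like:

sqrt(nrow(learn_x))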
## [1] 36.38681
Since my Survived variable consists of 2 levels, “0” and “1”, I will pick an odd k, which is 37.
Now build the knn model using the class library
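A sketch of the call (class::knn fits and predicts in one step, so there is no separate model object; pred_knn is my name for the result):

library(class)

pred_knn <- knn(train = learn_x, test = exam_x, cl = learn_y, k = 37)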
Using the confusion matrix once again
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 198 44
## 1 107 69
##
## Accuracy : 0.6388
## 95% CI : (0.5907, 0.6849)
## No Information Rate : 0.7297
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.221
##
## Mcnemar's Test P-Value : 4.524e-07
##
## Sensitivity : 0.6106
## Specificity : 0.6492
## Pos Pred Value : 0.3920
## Neg Pred Value : 0.8182
## Prevalence : 0.2703
## Detection Rate : 0.1651
## Detection Prevalence : 0.4211
## Balanced Accuracy : 0.6299
##
## 'Positive' Class : 1
##
As we see in the confusion matrix, the accuracy of this model is only 63%. It can be concluded that the knn method doesn’t work well on the Titanic data.
To make the predictions, I used two different machine learning classification methods: logistic regression and knn. Based on those results, logistic regression is the better choice here because of its higher prediction accuracy. This happens because the knn model uses only the numeric variables and discards all factor and character columns, including strong predictors such as Sex and Pclass; as a result, it produces poor predictions.
Here is the final result from data2, the model output with the best accuracy.
## PassengerId Survived
## 1 1 0
## 2 7 0
## 3 9 0
## 4 11 1
## 5 17 0
## 6 20 0
I will store it in a csv file
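A sketch of the export (the file name is my assumption):

write.csv(data2, "titanic_prediction.csv", row.names = FALSE)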