The RMS Titanic in port during 1912 (Source: Universal History Archive/Getty Images)

An overview

The infamous Titanic disaster of 1912 took the lives of at least 1,500 people, passengers and crew alike, leaving only 706 survivors. To quote the Titanic Kaggle Competition: “While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: what sorts of people were more likely to survive?”

To answer that question, we will try to predict whether a passenger survived or not, based on the given information about each of the Titanic passengers. There are three data sets that we will use here, namely the train, test and gender submission data sets. For Titanic Survival Prediction Part 1, we will compare the performance of two classification algorithms: logistic regression and k-Nearest Neighbor (kNN).

Let’s get started!

Import files and packages

# Import packages
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v stringr 1.4.0
## v tidyr   1.1.4     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tracerer) # to automatically calculate mode
## Warning: package 'tracerer' was built under R version 4.1.2
library(caret) # model evaluation
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(psych) # create dummy variables
## Warning: package 'psych' was built under R version 4.1.2
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(class) # kNN
## Warning: package 'class' was built under R version 4.1.2
# Import files
data_train <- read_csv("Titanic/train.csv")
## Rows: 891 Columns: 12
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_train

Columns description:

  • Survived: Survival. 0 = No, 1 = Yes.
  • Pclass: Ticket class. A proxy for socio-economic status (SES)
    • 1st = Upper
    • 2nd = Middle
    • 3rd = Lower
  • Sex: Sex
  • Age: Age in years. Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
  • SibSp: Number of siblings / spouses aboard the Titanic. The dataset defines family relations in this way:
    • Sibling = brother, sister, stepbrother, stepsister
    • Spouse = husband, wife (mistresses and fiancés were ignored)
  • Parch: Number of parents / children aboard the Titanic. The dataset defines family relations in this way:
    • Parent = mother, father
    • Child = daughter, son, stepdaughter, stepson
    • Some children travelled only with a nanny, therefore parch=0 for them.
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of embarkation.
    • C = Cherbourg
    • Q = Queenstown
    • S = Southampton

Data wrangling

head(data_train)
glimpse(data_train)
## Rows: 891
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Survived    <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
## $ Pclass      <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal~
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
## $ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
## $ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C~
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
# maximum number of missing values we tolerate for each column
nrow(data_train)*0.05
## [1] 44.55
# check missing values from all columns
colSums(is.na(data_train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

For data wrangling, we need to:

  1. Change columns to their correct data types.
  • Columns that need to be converted to factor: Survived, Pclass, Sex and Embarked.
  2. Drop the columns Cabin, Name, PassengerId and Ticket.
  • PassengerId only stores an identifier number for each passenger and doesn’t contain any valuable information for us to analyse later.
  • Cabin will be dropped since its missing values account for more than half of the total observations.
  • Name, to be honest, actually stores valuable information that we could use to explore the family relations among the Titanic passengers, and it would be a very useful variable with more advanced classification algorithms. We could have kept it as it is (see the sketch after this list), but since we are using much simpler models, namely logistic regression and kNN, we will not use it to create our models.
  • Ticket has a character data type and stores no valuable information.
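As an aside, here is a hedged sketch of the kind of feature that could be engineered from Name (illustrative only, not used in this post): extracting the honorific title with a base R regular expression.

# illustrative only: pull the title ("Mr", "Mrs", "Miss", ...) out of Name
data_train %>% 
  mutate(Title = sub(".*,\\s*([A-Za-z ]+)\\..*", "\\1", Name)) %>% 
  count(Title, sort = TRUE)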
# change columns data types into the correct ones
titanic_train <- data_train %>% 
  mutate_at(.vars = c("Survived", "Pclass", "Sex", "Embarked"), .funs = as.factor) %>% 
  select(-c(PassengerId, Cabin, Name, Ticket))

head(titanic_train)

Treatment for missing values

# check again for columns that have missing values
colSums(is.na(titanic_train))
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0      177        0        0        0        2

We have located the columns that have missing values, namely Age and Embarked. Since Embarked is a factor, we will replace its two missing values with the mode, the value that appears most frequently. As for the Age column, we can replace the 177 missing values with either the mean or the median: if there are no outliers we’ll use the mean, but if there are, it is better to use the median instead.

# check the mode for column `Embarked`
table(titanic_train$Embarked)
## 
##   C   Q   S 
## 168  77 644

Turns out, most of the passengers on the Titanic embarked from the port of Southampton (S), 644 passengers in total.

# replace missing values in column `Embarked` with the mode (Southampton port)
which(is.na(titanic_train$Embarked)) # locate missing values index in Embarked
## [1]  62 830
titanic_train <- titanic_train %>% 
  mutate(Embarked = if_else(is.na(Embarked), calc_mode(Embarked), Embarked))
# check outliers in Age
boxplot(titanic_train$Age)

Since we have plenty of outliers in the Age column, it is better to replace the missing values with the median.

# replace missing values in column `Age` with the median
median(titanic_train$Age, na.rm = T) # check the median age from all passengers without taking into account the NA
## [1] 28
titanic_train <- titanic_train %>% 
  mutate(Age = if_else(is.na(Age), median(Age, na.rm = T), Age))
# check again whether we still have missing values or not
anyNA(titanic_train)
## [1] FALSE

Great! All variables within the titanic_train dataframe are now stored in their correct data types and there are no more missing values. We have finished data cleansing and are ready to create classification models.

titanic_train

Machine Learning

As we have mentioned earlier, we will predict whether a passenger survived or not using two different classification models: logistic regression and k-Nearest Neighbor (kNN).

The first step in creating a prediction with both models is to decide which variables to set as the target and which as the predictors. Because our aim is to predict which passengers survived, we will use the Survived column as the target variable while the rest of the variables serve as the predictors.

  1. Target variable: Column Survived. We will also take “1” (Survived) as the positive class for our models.
  2. Predictor variable(s): All variables beside column Survived

Check whether the target variable has balanced or imbalanced classes.

prop.table(table(titanic_train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

The proportion of our target classes seems quite balanced, so at this stage we won’t downsample / upsample the target variable just yet.
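Had the classes been more severely imbalanced, one option would be caret’s downSample() (or upSample()) helper. A minimal sketch, not applied in this analysis:

# hypothetical rebalancing with caret: randomly drops majority-class rows
# until both classes are equal in size (not needed for this post)
balanced <- downSample(x = titanic_train %>% select(-Survived),
                       y = titanic_train$Survived,
                       yname = "Survived")
prop.table(table(balanced$Survived)) # would show 0.5 / 0.5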

Logistic Regression

Since we are already provided with separate .csv files for the training and test data sets, we don’t have to go through a cross-validation split ourselves and can directly import the test.csv and gender_submission.csv files into our environment.

# importing test.csv and the gender_submission.csv
data_test <- read_csv("Titanic/test.csv")
## Rows: 418 Columns: 11
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (6): PassengerId, Pclass, Age, SibSp, Parch, Fare
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
gender_sub <- read_csv("Titanic/gender_submission.csv")
## Rows: 418 Columns: 2
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (2): PassengerId, Survived
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# left join column 'Survived' from gender_sub onto data_test
data_test_merge <- merge(data_test, gender_sub, by = "PassengerId")
glimpse(data_test_merge)
## Rows: 418
## Columns: 12
## $ PassengerId <dbl> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903~
## $ Pclass      <dbl> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3~
## $ Name        <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)", "M~
## $ Sex         <chr> "male", "female", "male", "male", "female", "male", "femal~
## $ Age         <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 21.0~
## $ SibSp       <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0~
## $ Parch       <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Ticket      <chr> "330911", "363272", "240276", "315154", "3101298", "7538",~
## $ Fare        <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292, 2~
## $ Cabin       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "B45", NA,~
## $ Embarked    <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S", "S"~
## $ Survived    <dbl> 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1~
colSums(is.na(data_test_merge))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0          86           0 
##       Parch      Ticket        Fare       Cabin    Embarked    Survived 
##           0           0           1         327           0           0
# change columns data types into the correct ones
titanic_test <- data_test_merge %>% 
  mutate_at(.vars = c("Survived", "Pclass", "Sex", "Embarked"), .funs = as.factor) %>% 
  select(-c(PassengerId, Cabin, Name, Ticket))

head(titanic_test)

For the titanic_test data set, the missing values are located in the columns Age and Fare. Since Fare is numeric, we’ll check for outliers to decide whether to replace its missing values with the mean or the median.

# Check for outliers in columns 'Age' and 'Fare'
boxplot(titanic_test$Age)

boxplot(titanic_test$Fare)

Both columns have outliers.

# Check the range of fare
range(titanic_test$Fare, na.rm = T)
## [1]   0.0000 512.3292
# Check the range of passenger's age
range(titanic_test$Age, na.rm = T)
## [1]  0.17 76.00
# inspect the passengers who paid a fare of 0
titanic_test %>% 
  filter(Fare == 0)
# Treatment for missing values in columns `Age` and `Fare`

titanic_test <- titanic_test %>% 
  mutate(Age = if_else(is.na(Age), median(Age, na.rm = T), Age),
         Fare = if_else(is.na(Fare), median(Fare, na.rm = T), Fare))
anyNA(titanic_test)
## [1] FALSE

Create model

titanic_train
prop.table(table(titanic_train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384
model_lm <- glm(Survived~., data = titanic_train, family = "binomial")
summary(model_lm)
## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6199  -0.6089  -0.4176   0.6187   2.4514  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.064159   0.472813   8.596  < 2e-16 ***
## Pclass2     -0.919468   0.297326  -3.092  0.00199 ** 
## Pclass3     -2.150048   0.297720  -7.222 5.13e-13 ***
## Sexmale     -2.719444   0.200977 -13.531  < 2e-16 ***
## Age         -0.038517   0.007855  -4.903 9.43e-07 ***
## SibSp       -0.321794   0.109193  -2.947  0.00321 ** 
## Parch       -0.093329   0.118856  -0.785  0.43232    
## Fare         0.002339   0.002469   0.947  0.34346    
## EmbarkedQ   -0.056267   0.381471  -0.148  0.88274    
## EmbarkedS   -0.434226   0.239530  -1.813  0.06986 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  785.04  on 881  degrees of freedom
## AIC: 805.04
## 
## Number of Fisher Scoring iterations: 5

Several things to conclude from the summary above:

  1. The columns Pclass, Sex, Age and SibSp significantly affect our target variable, each with a p-value below 0.05, while Embarked (level S) is only marginally significant (p ≈ 0.07).
  2. The AIC, or information loss, of model_lm is 805.04, while its residual deviance is 785.04.

The Estimate column displays the log of odds for each predictor. To interpret them, we can use the exp() function to convert the log-odds into odds.

# log of odds -> odds (for each predictor variables)
data.frame(odds = exp(model_lm$coefficients))
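For instance, a rough reading of the Sexmale coefficient from the summary above, holding the other predictors fixed:

# odds ratio for a single coefficient, e.g. Sexmale
exp(coef(model_lm)[["Sexmale"]]) # exp(-2.719) is roughly 0.066:
# male passengers had about 6.6% of the odds of survival of female passengers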

Feature selection

# feature selection
model_all <- step(model_lm, direction = "backward", trace = F)
summary(model_all)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked, 
##     family = "binomial", data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6021  -0.6012  -0.4151   0.6218   2.4651  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.22353    0.41876  10.086  < 2e-16 ***
## Pclass2     -1.03964    0.26956  -3.857 0.000115 ***
## Pclass3     -2.30238    0.25453  -9.046  < 2e-16 ***
## Sexmale     -2.69944    0.19541 -13.814  < 2e-16 ***
## Age         -0.03874    0.00782  -4.954 7.26e-07 ***
## SibSp       -0.32720    0.10369  -3.155 0.001602 ** 
## EmbarkedQ   -0.05588    0.37822  -0.148 0.882536    
## EmbarkedS   -0.46779    0.23610  -1.981 0.047557 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  786.37  on 883  degrees of freedom
## AIC: 802.37
## 
## Number of Fisher Scoring iterations: 5
# model evaluation with aic
model_lm$aic
## [1] 805.0398
model_all$aic
## [1] 802.3658

Based on the AIC, model_all has a slightly lower value than model_lm. Therefore, we will use model_all instead of model_lm, since a lower AIC means less information loss and hence a relatively better model.
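As a quick sanity check, a GLM’s AIC is its residual deviance plus twice the number of estimated coefficients, which matches the numbers above:

# AIC = residual deviance + 2 * number of coefficients
785.04 + 2 * length(coef(model_lm))  # 10 coefficients -> 805.04
786.37 + 2 * length(coef(model_all)) #  8 coefficients -> 802.37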

Predict

Add a new variable to titanic_test that stores the predicted probability for each passenger.

titanic_test$Prediction.Prob <- predict(object = model_all, newdata = titanic_test, type = "response")

Create another variable that contains the survival prediction label for each passenger. If a passenger has a probability of more than 0.5, we label them “1” (positive class / survived); if their probability is 0.5 or less, they are put in the negative class (“0” / not survived).

titanic_test$Prediction.Label <- ifelse(titanic_test$Prediction.Prob > 0.5, "1","0")
head(titanic_test)
# change Prediction.Label variable to factor
titanic_test <- titanic_test %>% 
  mutate(Prediction.Label = as.factor(Prediction.Label))

Model evaluation

Now that we have made our predictions, we will evaluate the model using the confusion matrix from the caret package.

# from package `caret`
lm_model_eval <- confusionMatrix(data = titanic_test$Prediction.Label,
                                 reference = titanic_test$Survived,
                                 positive = "1")
lm_model_eval
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 246   8
##          1  20 144
##                                          
##                Accuracy : 0.933          
##                  95% CI : (0.9046, 0.955)
##     No Information Rate : 0.6364         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.8577         
##                                          
##  Mcnemar's Test P-Value : 0.03764        
##                                          
##             Sensitivity : 0.9474         
##             Specificity : 0.9248         
##          Pos Pred Value : 0.8780         
##          Neg Pred Value : 0.9685         
##              Prevalence : 0.3636         
##          Detection Rate : 0.3445         
##    Detection Prevalence : 0.3923         
##       Balanced Accuracy : 0.9361         
##                                          
##        'Positive' Class : 1              
## 

From the summary above, we can see that our model is able to correctly predict both its positive and negative classes. But unfortunately, the number of correct predictions for the positive class (144) is much lower than for the negative class (246), which suggests the model is better at predicting one class only, the negative one. Ideally, we want our model to learn equally well from both classes and not be limited to one class only.

Although the target classes are somewhat imbalanced, it turns out that our model scores quite high on both recall (Sensitivity) and precision (Pos Pred Value). Since the target classes aren’t perfectly balanced, we won’t use Accuracy as the metric for our model evaluation; instead, we will focus on the other two metrics, recall and precision.
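For reference, both metrics come straight from the confusion matrix counts above (TP = 144, FP = 20, FN = 8):

# recall (Sensitivity) = TP / (TP + FN); precision (Pos Pred Value) = TP / (TP + FP)
TP <- 144; FP <- 20; FN <- 8
TP / (TP + FN) # recall    = 144 / 152 ~ 0.9474
TP / (TP + FP) # precision = 144 / 164 ~ 0.8780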

k-Nearest Neighbor (kNN)

glimpse(titanic_train)
## Rows: 891
## Columns: 8
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0~
## $ Pclass   <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2~
## $ Sex      <fct> male, female, female, female, male, male, male, male, female,~
## $ Age      <dbl> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,~
## $ SibSp    <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0~
## $ Parch    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0~
## $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21~
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C, S~
titanic.train <- titanic_train %>%
  mutate(Sex = if_else(Sex == "male", 0, 1),
         Pclass = as.integer(Pclass)) %>%
  select(-Embarked)
         
head(titanic.train)
prop.table(table(titanic.train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

Before we proceed with creating a kNN model, we first need to check the range of each variable. If the predictor variables have a huge gap in terms of range, then we have to perform normalization or scaling. This is done particularly because kNN is a distance-based algorithm: without scaling, variables with large ranges (such as Fare) would dominate the distance calculation and drown out the rest.
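A minimal illustration with made-up values: an age gap of 10 years is negligible next to a fare gap of 100 once both enter the Euclidean distance unscaled.

# two hypothetical passengers: the unscaled distance is dominated by Fare
a <- c(Age = 22, Fare = 7)
b <- c(Age = 32, Fare = 107)
sqrt(sum((a - b)^2)) # ~100.5, almost entirely the Fare difference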

summary(titanic.train)
##  Survived     Pclass           Sex              Age            SibSp      
##  0:549    Min.   :1.000   Min.   :0.0000   Min.   : 0.42   Min.   :0.000  
##  1:342    1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:22.00   1st Qu.:0.000  
##           Median :3.000   Median :0.0000   Median :28.00   Median :0.000  
##           Mean   :2.309   Mean   :0.3524   Mean   :29.36   Mean   :0.523  
##           3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:35.00   3rd Qu.:1.000  
##           Max.   :3.000   Max.   :1.0000   Max.   :80.00   Max.   :8.000  
##      Parch             Fare       
##  Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:0.0000   1st Qu.:  7.91  
##  Median :0.0000   Median : 14.45  
##  Mean   :0.3816   Mean   : 32.20  
##  3rd Qu.:0.0000   3rd Qu.: 31.00  
##  Max.   :6.0000   Max.   :512.33

Summary from titanic.train:

  1. The ages of the titanic.train passengers range from 0.42 to 80 years old, with a median of 28 years old.
  2. Judging from the SibSp column, at least 50% of the passengers traveled without a sibling or spouse aboard.
  3. The median fare that the passengers paid is 14.45, while the mean, pulled up by a few very expensive tickets, is 32.20.

# data wrangling
# replace value "male" with 0 and "female" with 1 in column Sex
# remove the remaining categorical column and the prediction columns from the titanic_test data set used for logistic regression

titanic.test <- titanic_test %>% 
  mutate(Sex = if_else(Sex == "male", 0, 1),
         Pclass = as.integer(Pclass)) %>% 
  select(-c(Embarked, Prediction.Prob, Prediction.Label))

head(titanic.test)
# check total of columns for both data set
ncol(titanic.test)
## [1] 7
ncol(titanic.train)
## [1] 7
# filter out predictor variables
titanic_train_x <- titanic.train %>% select_if(is.numeric)
titanic_test_x <- titanic.test %>% select_if(is.numeric)

# filter out target variables
titanic_train_y <- titanic.train[,"Survived"]
titanic_test_y <- titanic.test[,"Survived"]
head(titanic_train_x)
head(titanic_train_y)
# scaling
titanic_train_xs <- scale(titanic_train_x)
  
titanic_test_xs <- scale(titanic_test_x, 
                      center = attr(titanic_train_xs,"scaled:center"), # center = mean
                      scale = attr(titanic_train_xs, "scaled:scale")) # scale = standard deviation
head(titanic_train_xs)
##          Pclass       Sex        Age      SibSp      Parch       Fare
## [1,]  0.8269128 -0.737281 -0.5654189  0.4325504 -0.4734077 -0.5021631
## [2,] -1.5652278  1.354813  0.6634884  0.4325504 -0.4734077  0.7864036
## [3,]  0.8269128  1.354813 -0.2581921 -0.4742788 -0.4734077 -0.4885799
## [4,] -1.5652278  1.354813  0.4330683  0.4325504 -0.4734077  0.4204941
## [5,]  0.8269128 -0.737281  0.4330683 -0.4742788 -0.4734077 -0.4860644
## [6,]  0.8269128 -0.737281 -0.1045787 -0.4742788 -0.4734077 -0.4778481
head(titanic_test_xs)
##          Pclass       Sex        Age      SibSp      Parch       Fare
## [1,]  0.8269128 -0.737281  0.3946649 -0.4742788 -0.4734077 -0.4905077
## [2,]  0.8269128  1.354813  1.3547487  0.4325504 -0.4734077 -0.5071940
## [3,] -0.3691575 -0.737281  2.5068493 -0.4742788 -0.4734077 -0.4531124
## [4,]  0.8269128 -0.737281 -0.1813854 -0.4742788 -0.4734077 -0.4737389
## [5,]  0.8269128  1.354813 -0.5654189  0.4325504  0.7671990 -0.4007916
## [6,]  0.8269128 -0.737281 -1.1798725 -0.4742788 -0.4734077 -0.4624195

Predict

# find optimum k
sqrt(nrow(titanic_train_xs))
## [1] 29.84962

When we have an even number of target classes, we should use an odd number for k (and vice versa), so that the majority vote among the k neighbors cannot end in a tie.

  • Total of target classes: 2 (even number)
  • k: 29 (the nearest odd number to 29.85)
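A minimal sketch of this rule of thumb, assuming a binary target as we have here: start from the square root of the number of training rows and step down to the nearest odd integer.

# rule-of-thumb k: sqrt(n), rounded down to the nearest odd integer
k <- floor(sqrt(nrow(titanic_train_xs)))
if (k %% 2 == 0) k <- k - 1
k # 29 for our 891-row training set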
dim(titanic_train_xs)
## [1] 891   6
dim(titanic_train_y)
## [1] 891   1
# convert titanic_train_y from a data frame to a vector
titanic_y_vec = titanic_train_y %>% pull("Survived")

titanic_pred <- knn(train = titanic_train_xs, # `train` : predictor variables in data train
                   test = titanic_test_xs, # `test` : predictor variables in data test
                   cl = titanic_y_vec, # target variables in data train
                   k = 29)
titanic_pred
##   [1] 0 1 0 0 0 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 1
##  [38] 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0
##  [75] 1 1 0 1 0 1 1 1 0 0 0 0 1 1 1 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
## [112] 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0
## [149] 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1
## [186] 0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
## [223] 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
## [260] 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0
## [297] 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
## [334] 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 1 1 0
## [371] 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0
## [408] 1 1 1 1 1 1 0 1 0 0 0
## Levels: 0 1
knn_model_eval <- confusionMatrix(data = as.factor(titanic_pred), reference = titanic_test_y, positive = "1")
knn_model_eval
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 247   8
##          1  19 144
##                                          
##                Accuracy : 0.9354         
##                  95% CI : (0.9074, 0.957)
##     No Information Rate : 0.6364         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.8626         
##                                          
##  Mcnemar's Test P-Value : 0.05429        
##                                          
##             Sensitivity : 0.9474         
##             Specificity : 0.9286         
##          Pos Pred Value : 0.8834         
##          Neg Pred Value : 0.9686         
##              Prevalence : 0.3636         
##          Detection Rate : 0.3445         
##    Detection Prevalence : 0.3900         
##       Balanced Accuracy : 0.9380         
##                                          
##        'Positive' Class : 1              
## 

We measure how well our kNN model learned to correctly predict the target variable with the same confusion matrix evaluation that we used for logistic regression. Based on the three metrics above, kNN performs quite well, with a recall of 94.74%, a precision of 88.34% and an accuracy of 93.54%. Now we’ll compare the performances of the two models, logistic regression and kNN.

# comparing the model performances between knn and logistic regression
# model evaluation when we use logistic regression
lm_model_eval
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 246   8
##          1  20 144
##                                          
##                Accuracy : 0.933          
##                  95% CI : (0.9046, 0.955)
##     No Information Rate : 0.6364         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.8577         
##                                          
##  Mcnemar's Test P-Value : 0.03764        
##                                          
##             Sensitivity : 0.9474         
##             Specificity : 0.9248         
##          Pos Pred Value : 0.8780         
##          Neg Pred Value : 0.9685         
##              Prevalence : 0.3636         
##          Detection Rate : 0.3445         
##    Detection Prevalence : 0.3923         
##       Balanced Accuracy : 0.9361         
##                                          
##        'Positive' Class : 1              
## 

  • Accuracy: logistic regression 93.3%; kNN 93.54% (slightly better)
  • Recall: logistic regression 94.74%; kNN 94.74% (equal)
  • Precision: logistic regression 87.80%; kNN 88.34% (slightly better)
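The same comparison can also be pulled programmatically from the two caret evaluation objects (assuming both are still in the environment):

# side-by-side metrics from the confusionMatrix objects
data.frame(
  logistic = c(accuracy  = lm_model_eval$overall[["Accuracy"]],
               recall    = lm_model_eval$byClass[["Sensitivity"]],
               precision = lm_model_eval$byClass[["Pos Pred Value"]]),
  knn      = c(accuracy  = knn_model_eval$overall[["Accuracy"]],
               recall    = knn_model_eval$byClass[["Sensitivity"]],
               precision = knn_model_eval$byClass[["Pos Pred Value"]])
)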

Conclusion

We have gone through the process of creating models with logistic regression and k-Nearest Neighbor (kNN) to find out which of the two can better predict whether a Titanic passenger survived or not. As it turns out, the kNN model predicted the survival of a passenger slightly better than the one we made with logistic regression. Since the target classes, and hence the proportion of true negatives (TN) to true positives (TP), are imbalanced, we chose to use both recall and precision as the metrics for our model evaluation.

Also, by using the logistic regression model, we found several parameters that contribute significantly to predicting whether a passenger was likely to survive:

  1. Ticket class (upper, middle or lower class), a proxy for the socio-economic status of a passenger
  2. Sex
  3. Age
  4. Number of siblings / spouses aboard the Titanic
  5. Port of embarkation