Would You Survive the Titanic?

Libraries

library(tidyverse)      # collection of packages for data wrangling and visualization
library(caret)          # machine learning functions
library(MLmetrics)      # machine learning metrics
library(car)            # VIF calculation
library(rpart)          # decision tree
library(class)          # k-NN

Problem Statement

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this project, we will build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data.

Dataset

The dataset was obtained from Kaggle, where the train and test sets are provided separately. Let's read them.

train <- read.csv('train.csv', na.strings=c('', 'NA'))
test <- read.csv('test.csv', na.strings=c('', 'NA'))
glimpse(train)
#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
#> $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
#> $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
#> $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
#> $ Sex         <chr> "male", "female", "female", "female", "male", "male", "...
#> $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
#> $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
#> $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
#> $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
#> $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
#> $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6",...
#> $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", ...
dim(train)
#> [1] 891  12

The train dataset has 12 columns as follows:

  1. Survived: survival (0 = No, 1 = Yes)
  2. Pclass: ticket class, a proxy for socio-economic status (1 = 1st, 2 = 2nd, 3 = 3rd)
  3. Sex: sex
  4. Age: age in years
  5. SibSp: number of siblings / spouses aboard the Titanic
  6. Parch: number of parents / children aboard the Titanic
  7. Ticket: ticket number
  8. Fare: passenger fare
  9. Cabin: cabin number
  10. Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
  11. PassengerId: passenger id
  12. Name: name

Data Cleaning

First, let's check whether there are any duplicated observations in the train dataset.

anyDuplicated(train)
#> [1] 0

Great! Now, we inspect missing values.

colSums(is.na(train))
#> PassengerId    Survived      Pclass        Name         Sex         Age 
#>           0           0           0           0           0         177 
#>       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
#>           0           0           0           0         687           2
colSums(is.na(test))
#> PassengerId      Pclass        Name         Sex         Age       SibSp 
#>           0           0           0           0          86           0 
#>       Parch      Ticket        Fare       Cabin    Embarked    Survived 
#>           0           0           1         327           0           0

There are 4 features with missing values: Cabin, Age, Embarked, and Fare. We will impute all of them, each with its own treatment.

  1. Cabin

unique(train$Cabin)
#>   [1] NA                "C85"             "C123"            "E46"            
#>   [5] "G6"              "C103"            "D56"             "A6"             
#>   [9] "C23 C25 C27"     "B78"             "D33"             "B30"            
#>  [13] "C52"             "B28"             "C83"             "F33"            
#>  [17] "F G73"           "E31"             "A5"              "D10 D12"        
#>  [21] "D26"             "C110"            "B58 B60"         "E101"           
#>  [25] "F E69"           "D47"             "B86"             "F2"             
#>  [29] "C2"              "E33"             "B19"             "A7"             
#>  [33] "C49"             "F4"              "A32"             "B4"             
#>  [37] "B80"             "A31"             "D36"             "D15"            
#>  [41] "C93"             "C78"             "D35"             "C87"            
#>  [45] "B77"             "E67"             "B94"             "C125"           
#>  [49] "C99"             "C118"            "D7"              "A19"            
#>  [53] "B49"             "D"               "C22 C26"         "C106"           
#>  [57] "C65"             "E36"             "C54"             "B57 B59 B63 B66"
#>  [61] "C7"              "E34"             "C32"             "B18"            
#>  [65] "C124"            "C91"             "E40"             "T"              
#>  [69] "C128"            "D37"             "B35"             "E50"            
#>  [73] "C82"             "B96 B98"         "E10"             "E44"            
#>  [77] "A34"             "C104"            "C111"            "C92"            
#>  [81] "E38"             "D21"             "E12"             "E63"            
#>  [85] "A14"             "B37"             "C30"             "D20"            
#>  [89] "B79"             "E25"             "D46"             "B73"            
#>  [93] "C95"             "B38"             "B39"             "B22"            
#>  [97] "C86"             "C70"             "A16"             "C101"           
#> [101] "C68"             "A10"             "E68"             "B41"            
#> [105] "A20"             "D19"             "D50"             "D9"             
#> [109] "A23"             "B50"             "A26"             "D48"            
#> [113] "E58"             "C126"            "B71"             "B51 B53 B55"    
#> [117] "D49"             "B5"              "B20"             "F G63"          
#> [121] "C62 C64"         "E24"             "C90"             "C45"            
#> [125] "E8"              "B101"            "D45"             "C46"            
#> [129] "D30"             "E121"            "D11"             "E77"            
#> [133] "F38"             "B3"              "D6"              "B82 B84"        
#> [137] "D17"             "A36"             "B102"            "B69"            
#> [141] "E49"             "C47"             "D28"             "E17"            
#> [145] "A24"             "C50"             "B42"             "C148"

As a categorical feature, Cabin has far too many categories, and it also has too many missing values. The simplest reasonable way to impute Cabin is to assign the missing values their own category, say X0.

train$Cabin <- replace_na(train$Cabin, 'X0')
test$Cabin <- replace_na(test$Cabin, 'X0')

  2. Age

We can impute Age with the mean or median, but we already have something that allows a more educated guess: Name. The Name feature follows the format "[Surname], [Title] [Name]". We can extract [Surname] and [Title] and discard the rest, since the full name is unique to each PassengerId and adds no further information. We then save [Surname] and [Title] as new features called Surname and Title. Age can be estimated from Title, while Surname may be useful in later analysis.
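
As a quick illustration (using the first passenger's name from the glimpse above), splitting on the comma separates the surname, and splitting the remainder on the period isolates the title:

str_split("Braund, Mr. Owen Harris", ',')[[1]]                  # "Braund"  " Mr. Owen Harris"
str_trim(sapply(str_split(" Mr. Owen Harris", '\\.'), `[`, 1))  # "Mr"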

train$Surname <- sapply(str_split(train$Name, ','), `[`, 1) %>% str_trim()
temp <- sapply(str_split(train$Name, ','), `[`, 2)
train$Title <- sapply(str_split(temp, '\\.'), `[`, 1) %>% str_trim()
train <- train %>% select(-Name)

test$Surname <- sapply(str_split(test$Name, ','), `[`, 1) %>% str_trim()
temp <- sapply(str_split(test$Name, ','), `[`, 2)
test$Title <- sapply(str_split(temp, '\\.'), `[`, 1) %>% str_trim()
test <- test %>% select(-Name)

Let's see what values Title takes.

unique(train$Title)
#>  [1] "Mr"           "Mrs"          "Miss"         "Master"       "Don"         
#>  [6] "Rev"          "Dr"           "Mme"          "Ms"           "Major"       
#> [11] "Lady"         "Sir"          "Mlle"         "Col"          "Capt"        
#> [16] "the Countess" "Jonkheer"
unique(test$Title)
#> [1] "Mr"     "Mrs"    "Miss"   "Master" "Ms"     "Col"    "Rev"    "Dr"    
#> [9] "Dona"

There is one Title in the test dataset that doesn't exist in the train dataset: Dona. Doña is a Spanish honorific that translates to Mrs in English, so we can map it to Mrs right away.

test[test$Title == 'Dona', 'Title'] = 'Mrs'

The strategy is to group Age by Title then take the median of each group. After that, assign the median to Age’s missing values as per the corresponding Title.

age_by_title <- train %>% 
  group_by(Title) %>% 
  summarise(median = median(Age, na.rm = TRUE))

train <- merge(train, age_by_title)
train[is.na(train$Age), 'Age'] <- train[is.na(train$Age), 'median']
train <- train %>% select(-median)

test <- merge(test, age_by_title)
test[is.na(test$Age), 'Age'] <- test[is.na(test$Age), 'median']
test <- test %>% select(-median)

  3. Embarked

Embarked is a categorical feature whose mode is 'S'.

table(train$Embarked)
#> 
#>   C   Q   S 
#> 168  77 644

Since the number of missing values in Embarked is pretty small, we can simply impute them with its mode.

train$Embarked <- replace_na(train$Embarked, 'S')
test$Embarked <- replace_na(test$Embarked, 'S')

  4. Fare

Passenger Fare should correlate with Pclass: the higher the class, the higher the fare. Just like imputing Age, the strategy is to group Fare by Pclass then take the median of each group. After that, assign the median to Fare’s missing values as per the corresponding Pclass.

fare_by_pclass <- train %>% 
  group_by(Pclass) %>% 
  summarise(median = median(Fare, na.rm = TRUE))

train <- merge(train, fare_by_pclass)
train[is.na(train$Fare), 'Fare'] <- train[is.na(train$Fare), 'median']
train <- train %>% select(-median)

test <- merge(test, fare_by_pclass)
test[is.na(test$Fare), 'Fare'] <- test[is.na(test$Fare), 'median']
test <- test %>% select(-median)

Now, let’s revisit the missing values.

colSums(is.na(train))
#>      Pclass       Title PassengerId    Survived         Sex         Age 
#>           0           0           0           0           0           0 
#>       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
#>           0           0           0           0           0           0 
#>     Surname 
#>           0
colSums(is.na(test))
#>      Pclass       Title PassengerId         Sex         Age       SibSp 
#>           0           0           0           0           0           0 
#>       Parch      Ticket        Fare       Cabin    Embarked    Survived 
#>           0           0           0           0           0           0 
#>     Surname 
#>           0

Awesome! There are no missing values left in either dataset. Lastly, convert each feature to its appropriate class.

train <- train %>% 
  mutate_at(vars(Pclass, Title, Survived, Sex, Cabin, Embarked), as.factor)

test <- test %>% 
  mutate_at(vars(Pclass, Title, Survived, Sex, Cabin, Embarked), as.factor)

glimpse(train)
#> Rows: 891
#> Columns: 13
#> $ Pclass      <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
#> $ Title       <fct> Capt, Col, Col, Don, Dr, Mr, Dr, Mr, Dr, Dr, Dr, Jonkhe...
#> $ PassengerId <int> 746, 695, 648, 31, 633, 371, 797, 494, 767, 661, 246, 8...
#> $ Survived    <fct> 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0...
#> $ Sex         <fct> male, male, male, male, male, male, female, male, male,...
#> $ Age         <dbl> 70.00, 60.00, 56.00, 40.00, 32.00, 25.00, 49.00, 71.00,...
#> $ SibSp       <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 1, 1...
#> $ Parch       <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
#> $ Ticket      <chr> "WE/P 5735", "113800", "13213", "PC 17601", "13214", "1...
#> $ Fare        <dbl> 71.0000, 26.5500, 35.5000, 27.7208, 30.5000, 55.4417, 2...
#> $ Cabin       <fct> B22, X0, A26, X0, B50, E50, D17, X0, X0, X0, C78, X0, A...
#> $ Embarked    <fct> S, S, C, C, C, C, S, C, C, S, Q, S, C, S, S, S, C, S, S...
#> $ Surname     <chr> "Crosby", "Weir", "Simonius-Blumer", "Uruchurtu", "Stah...

Exploratory Data Analysis

Quoting this Wikipedia page:

“Women and children first” […] is a code of conduct dating from 1852, whereby the lives of women and children were to be saved first in a life-threatening situation, typically abandoning ship, when survival resources such as lifeboats were limited.

For starters, let's ignore the children and see what happens when we predict that all women survived and all men died. To keep things simple, we'll use accuracy for the moment and consider other metrics later.

train$gender_class <- train$Sex
levels(train$gender_class) <- list('0' = 'male', '1' = 'female')
Accuracy(train$gender_class, train$Survived)
#> [1] 0.7867565
test$gender_class <- test$Sex
levels(test$gender_class) <- list('0' = 'male', '1' = 'female')
Accuracy(test$gender_class, test$Survived)
#> [1] 0.7655502

We got 79% accuracy on the train dataset and 76% on the test dataset. Not bad! It seems it really was "women first" when lives had to be saved.

Now, the question is: which males survived and which females didn’t?

  1. Male

ggplot(train %>% filter(Sex == 'male'),
       aes(x = Age, group = Survived, fill = Survived)) +
  geom_density(alpha = .5) + 
  annotate('text', x = 5, y = .02, label = "Master") + 
  ggtitle("Age Distribution of Male Passengers")

There's a clear tendency that males below the age of 15 were more likely to survive. Since the maximum Age of passengers titled Master is 12 years, as confirmed below, all of these Masters fall within the 0-15 year range. So perhaps "women and children first" really was the priority during the rescue.

max(train %>% filter(Title == 'Master') %>% select(Age))
#> [1] 12

  2. Female

for (sex in c('male', 'female')) {
  print(
    ggplot(train %>% 
             filter(Sex == sex) %>% 
             group_by(Pclass, Survived) %>% 
             count(name = 'passenger_count'),
           aes(x = Pclass, y = passenger_count, fill = Survived)) + 
      geom_bar(stat = 'identity', position = 'dodge') + 
      ggtitle(glue::glue("{sex} Passenger Count per Pclass"))
  )
}

There's something going on here. Compared to Pclass 1 or 2, females in Pclass 3 show almost equal numbers of survivors and casualties. Maybe Pclass 3 was located in the most accident-prone part of the ship? If so, why is the survival ratio for males in Pclass 3 so different? This needs further investigation.

  3. Master and Female

Finally, let's focus on Master and female passengers, who, as suggested above, were prioritized in the rescue. We would like to find out whether there is any relation between the two. Our approach is to use the Surname feature to check whether they survived or perished together as a family.

all_survived <- c()
all_deceased <- c()
combined <- c()

for (s in unique(train$Surname)) {
  temp <- train %>% filter((Title == 'Master' | Sex == 'female') & (Surname == s))
  if (nrow(temp) >= 2) {
    if (all(temp['Survived'] == 1)) {
      all_survived <- append(all_survived, s)
    } else if (!any(temp['Survived'] == 1)) {
      all_deceased <- append(all_deceased, s)
    } else {
      combined <- append(combined, s)
    }
  }
}
cat('Family who all survived          :', sort(all_survived), '\n')
#> Family who all survived          : Baclini Becker Brown Caldwell Collyer Coutts Doling Fortune Goldsmith Graham Hamalainen Harper Hart Hays Herman Hippach Johnson Kelly Laroche Mellinger Moor Moubarek Murphy Navratil Newell Nicola-Yarred Peter Quick Richards Ryerson Sandstrom Taussig West Wick
cat('Family who are all deceased      :', sort(all_deceased), '\n')
#> Family who are all deceased      : Barbara Boulos Bourke Ford Goodwin Jussila Lefebre Palsson Panula Rice Sage Skoog Strom Van Impe Vander Planke Zabour
cat('Family both survived and deceased:', sort(combined), '\n')
#> Family both survived and deceased: Allison Andersson Asplund Carter

As can be seen above, Masters and females in the same family tend to survive or perish together. Only four families break this pattern: Allison, Andersson, Asplund, and Carter. Let's take a closer look.

for (family in combined) {
  print(
    train %>% 
      filter((Title == 'Master' | Sex == 'female') & (Surname == family)) %>% 
      select(c('Surname', 'Title', 'Sex', 'Survived'))
  )
}
#>   Surname  Title    Sex Survived
#> 1 Allison Master   male        1
#> 2 Allison   Miss female        0
#> 3 Allison    Mrs female        0
#>   Surname  Title    Sex Survived
#> 1  Carter   Miss female        1
#> 2  Carter Master   male        1
#> 3  Carter    Mrs female        1
#> 4  Carter    Mrs female        0
#>     Surname  Title    Sex Survived
#> 1 Andersson Master   male        0
#> 2 Andersson   Miss female        0
#> 3 Andersson   Miss female        0
#> 4 Andersson   Miss female        1
#> 5 Andersson   Miss female        0
#> 6 Andersson   Miss female        0
#> 7 Andersson    Mrs female        0
#>   Surname  Title    Sex Survived
#> 1 Asplund Master   male        0
#> 2 Asplund   Miss female        1
#> 3 Asplund Master   male        1
#> 4 Asplund    Mrs female        1

In the whole train dataset, there are only 4 Masters who did not survive or perish together with the females in their family!

Metrics, Validation, and Class Imbalance

We are only interested in correctly classifying each passenger's survival, without emphasizing one class over the other. There is also no particular cost attached to false positives or false negatives in either class. Hence, the metric we choose is accuracy.
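
For reference, accuracy is simply the proportion of correct predictions. A toy illustration with made-up vectors (not taken from the dataset):

truth <- factor(c(0, 1, 1, 0, 1))
pred  <- factor(c(0, 1, 0, 0, 1))
mean(pred == truth)    # 0.8
Accuracy(pred, truth)  # same value, via MLmetrics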

We validate the performance of our model simply by applying it to the unseen data in test.csv and checking the resulting accuracy. We don't do k-fold cross-validation since the dataset is small.
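
For completeness, here is a minimal sketch of what 5-fold cross-validation with caret could look like if we wanted it later (not run as part of this analysis; the formula and fold count are illustrative assumptions):

set.seed(100)
ctrl <- trainControl(method = "cv", number = 5)
cv_model <- train(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                  data = train, method = "glm", family = "binomial",
                  trControl = ctrl)
cv_model$results  # accuracy and Kappa averaged over the 5 folds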

Finally, we can check as follows that the positive and negative classes make up 38% and 62% of the train dataset, respectively. This is tolerable and can be considered reasonably balanced.

prop.table(table(train$Survived))
#> 
#>         0         1 
#> 0.6161616 0.3838384

Modeling

We’ll build and compare 2 models for our dataset: Logistic Regression and k-Nearest Neighbors.

Logistic Regression

Before plugging the data into the model, we can discard features like Cabin and gender_class since they add little to no information.

train <- train %>% select(-c(Cabin, gender_class))
test <- test %>% select(-c(Cabin, gender_class))

glimpse(train)
#> Rows: 891
#> Columns: 12
#> $ Pclass      <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
#> $ Title       <fct> Capt, Col, Col, Don, Dr, Mr, Dr, Mr, Dr, Dr, Dr, Jonkhe...
#> $ PassengerId <int> 746, 695, 648, 31, 633, 371, 797, 494, 767, 661, 246, 8...
#> $ Survived    <fct> 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0...
#> $ Sex         <fct> male, male, male, male, male, male, female, male, male,...
#> $ Age         <dbl> 70.00, 60.00, 56.00, 40.00, 32.00, 25.00, 49.00, 71.00,...
#> $ SibSp       <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 1, 1...
#> $ Parch       <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
#> $ Ticket      <chr> "WE/P 5735", "113800", "13213", "PC 17601", "13214", "1...
#> $ Fare        <dbl> 71.0000, 26.5500, 35.5000, 27.7208, 30.5000, 55.4417, 2...
#> $ Embarked    <fct> S, S, C, C, C, C, S, C, C, S, Q, S, C, S, S, S, C, S, S...
#> $ Surname     <chr> "Crosby", "Weir", "Simonius-Blumer", "Uruchurtu", "Stah...

Logistic Regression will be implemented with automatic feature selection using backward elimination. Starting from all features, backward elimination iteratively discards predictors and re-evaluates the model until it finds the one with the lowest Akaike Information Criterion (AIC). Given a collection of candidate models, AIC estimates the quality of each relative to the others in terms of information loss; it is computed as AIC = 2k - 2*log(L), where k is the number of estimated parameters and L is the maximized likelihood, so a lower AIC indicates a better trade-off between fit and complexity. We'll use the step() function to apply backward elimination in a greedy manner.

log <- glm(formula = Survived ~ . - PassengerId - Surname - Ticket, data = train, family = "binomial")
step(log, direction = "backward")
#> Start:  AIC=768.44
#> Survived ~ (Pclass + Title + PassengerId + Sex + Age + SibSp + 
#>     Parch + Ticket + Fare + Embarked + Surname) - PassengerId - 
#>     Surname - Ticket
#> 
#>            Df Deviance    AIC
#> - Embarked  2   719.20 767.20
#> - Sex       1   718.16 768.16
#> <none>          716.44 768.44
#> - Fare      1   718.46 768.46
#> - Parch     1   723.51 773.51
#> - Age       1   726.17 776.17
#> - SibSp     1   741.40 791.40
#> - Title    16   780.85 800.85
#> - Pclass    2   765.34 813.34
#> 
#> Step:  AIC=767.2
#> Survived ~ Pclass + Title + Sex + Age + SibSp + Parch + Fare
#> 
#>          Df Deviance    AIC
#> - Sex     1   720.71 766.71
#> <none>        719.20 767.20
#> - Fare    1   722.12 768.12
#> - Parch   1   726.82 772.82
#> - Age     1   729.66 775.66
#> - SibSp   1   747.08 793.08
#> - Title  16   784.51 800.51
#> - Pclass  2   769.65 813.65
#> 
#> Step:  AIC=766.71
#> Survived ~ Pclass + Title + Age + SibSp + Parch + Fare
#> 
#>          Df Deviance     AIC
#> <none>        720.71  766.71
#> - Fare    1   723.49  767.49
#> - Parch   1   728.24  772.24
#> - Age     1   731.06  775.06
#> - SibSp   1   748.92  792.92
#> - Pclass  2   771.69  813.69
#> - Title  16  1022.25 1036.25
#> 
#> Call:  glm(formula = Survived ~ Pclass + Title + Age + SibSp + Parch + 
#>     Fare, family = "binomial", data = train)
#> 
#> Coefficients:
#>       (Intercept)            Pclass2            Pclass3           TitleCol  
#>        -13.790086          -1.135920          -2.165957          15.419467  
#>          TitleDon            TitleDr      TitleJonkheer          TitleLady  
#>         -1.677743          15.231237          -1.626590          32.239610  
#>        TitleMajor        TitleMaster          TitleMiss          TitleMlle  
#>         15.142188          17.862131          17.356322          30.843328  
#>          TitleMme            TitleMr           TitleMrs            TitleMs  
#>         30.802977          14.422186          18.195639          32.286637  
#>          TitleRev           TitleSir  Titlethe Countess                Age  
#>         -0.309327          32.200063          31.005928          -0.030247  
#>             SibSp              Parch               Fare  
#>         -0.591082          -0.353554           0.004028  
#> 
#> Degrees of Freedom: 890 Total (i.e. Null);  868 Residual
#> Null Deviance:       1187 
#> Residual Deviance: 720.7     AIC: 766.7

Taking the best combination of features, the following model is obtained.

step <- glm(formula = Survived ~ Pclass + Title + Age + SibSp + Parch + 
    Fare, family = "binomial", data = train)

There are several assumptions that need to be satisfied when we use a Logistic Regression model:

  1. No multicollinearity: there is no strong correlation between the predictors.

This can be checked with the Variance Inflation Factor (VIF). The VIF of a predictor measures how easily that predictor can be predicted from a linear regression on the other predictors. A general guideline is that a VIF above 5 or 10 is large, indicating that the model has trouble estimating the corresponding coefficient, although this generally does not degrade the quality of predictions. VIF can be calculated using the vif() function from the car package.

vif(step)
#>            GVIF Df GVIF^(1/(2*Df))
#> Pclass 2.048806  2        1.196397
#> Title  2.857972 16        1.033360
#> Age    1.970403  1        1.403710
#> SibSp  1.593886  1        1.262492
#> Parch  1.449790  1        1.204072
#> Fare   1.590052  1        1.260973

GVIF is a generalized VIF for predictors that span several coefficients, such as multi-level factors. To make GVIFs comparable across dimensions, we use GVIF^(1/(2*Df)), where Df is the number of coefficients associated with the predictor. For more detailed information, please refer to this paper. As we can see, all of these values are well below 5, which means our model passes the multicollinearity check.
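
As a quick sanity check of that adjustment, plugging the Title row (GVIF = 2.86, Df = 16) into the formula reproduces the last column above:

2.857972^(1 / (2 * 16))  # 1.03336, the GVIF^(1/(2*Df)) value reported for Title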

  2. Independence of observations: no observations come from repeated measurements. This clearly holds for our dataset since each observation corresponds to a single passenger.

  3. Linearity of predictors and the log of odds. We can inspect the coefficient summary of our model as follows.

data.frame(coefficient = round(coef(step),2),
           odds_ratio = round(exp(coef(step)),2)) %>% 
  arrange(odds_ratio)
#>                   coefficient         odds_ratio
#> (Intercept)            -13.79               0.00
#> Pclass3                 -2.17               0.11
#> TitleDon                -1.68               0.19
#> TitleJonkheer           -1.63               0.20
#> Pclass2                 -1.14               0.32
#> SibSp                   -0.59               0.55
#> Parch                   -0.35               0.70
#> TitleRev                -0.31               0.73
#> Age                     -0.03               0.97
#> Fare                     0.00               1.00
#> TitleMr                 14.42         1834322.34
#> TitleMajor              15.14         3768500.90
#> TitleDr                 15.23         4119476.59
#> TitleCol                15.42         4972668.43
#> TitleMiss               17.36        34494913.65
#> TitleMaster             17.86        57203833.26
#> TitleMrs                18.20        79848324.01
#> TitleMme                30.80  23854092362208.57
#> TitleMlle               30.84  24836308066697.64
#> Titlethe Countess       31.01  29221551357425.27
#> TitleSir                32.20  96451630265014.67
#> TitleLady               32.24 100342439367117.94
#> TitleMs                 32.29 105173926571503.95

The odds_ratio in the table above is the exponential of the coefficient and represents the multiplicative change in the odds of survival for a one-unit increase in the corresponding predictor, holding the other variables constant. For example, each additional year of Age multiplies the odds of survival by 0.97, which confirms that younger passengers were more likely to survive. Likewise, the odds ratios below 1 for SibSp and Parch suggest that a lone traveler was more likely to survive than a passenger travelling with family. Since every odds_ratio has a sensible interpretation of this kind, our model passes this third assumption.
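
We can read this directly off the fitted coefficients; for instance, the Age effect compounds over the years:

exp(coef(step)["Age"])       # ~0.97: each extra year multiplies the odds of survival by 0.97
exp(10 * coef(step)["Age"])  # ~0.74: ten extra years reduce the odds by roughly a quarter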

Now that we know our Logistic Regression model satisfies all the assumptions, we are ready to use it for prediction.

lr_pred <- predict(step, train, type = "response")
lr_pred <- ifelse(lr_pred > 0.5, 1, 0)
confusionMatrix(as.factor(lr_pred), train$Survived, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 486  90
#>          1  63 252
#>                                               
#>                Accuracy : 0.8283              
#>                  95% CI : (0.8019, 0.8525)    
#>     No Information Rate : 0.6162              
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.6315              
#>                                               
#>  Mcnemar's Test P-Value : 0.03556             
#>                                               
#>             Sensitivity : 0.7368              
#>             Specificity : 0.8852              
#>          Pos Pred Value : 0.8000              
#>          Neg Pred Value : 0.8438              
#>              Prevalence : 0.3838              
#>          Detection Rate : 0.2828              
#>    Detection Prevalence : 0.3535              
#>       Balanced Accuracy : 0.8110              
#>                                               
#>        'Positive' Class : 1                   
#> 
lr_pred <- predict(step, test, type = "response")
lr_pred <- ifelse(lr_pred > 0.5, 1, 0)
lr <- confusionMatrix(as.factor(lr_pred), test$Survived, positive = "1")
lr
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 209  44
#>          1  51 114
#>                                           
#>                Accuracy : 0.7727          
#>                  95% CI : (0.7295, 0.812) 
#>     No Information Rate : 0.622           
#>     P-Value [Acc > NIR] : 0.00000000003173
#>                                           
#>                   Kappa : 0.5208          
#>                                           
#>  Mcnemar's Test P-Value : 0.5382          
#>                                           
#>             Sensitivity : 0.7215          
#>             Specificity : 0.8038          
#>          Pos Pred Value : 0.6909          
#>          Neg Pred Value : 0.8261          
#>              Prevalence : 0.3780          
#>          Detection Rate : 0.2727          
#>    Detection Prevalence : 0.3947          
#>       Balanced Accuracy : 0.7627          
#>                                           
#>        'Positive' Class : 1               
#> 

The Logistic Regression model gives 83% accuracy on the train dataset and 77% on the test dataset, so it overfits slightly. Remember that the gender-only classification we did earlier already gives 76% accuracy on the test dataset, which means Logistic Regression improves the prediction only marginally and is hardly worth the effort so far. We will try to polish the model later.

k-Nearest Neighbors

Since k-NN is a distance-based model, the dataset has to be scaled prior to modeling so that every feature is treated equally; otherwise, a feature with relatively large values would dominate the distance calculation when selecting a data point's neighbors. We will use the scale() function on the train dataset and apply the resulting means and standard deviations to scale the test dataset.

The second limitation is that k-NN only works with numerical features, which means we can plug only 4 features into the model: Age, SibSp, Parch, and Fare.
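
As for the number of neighbors, the calls below use the common k ≈ sqrt(n) rule of thumb; this is a heuristic rather than a tuned value:

round(sqrt(nrow(train)))  # about 30 neighbors for the 891 training rows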

train_scaled <- scale(x = train %>% select(c('Age', 'SibSp', 'Parch', 'Fare')))
test_scaled <- scale(x = test %>% select(c('Age', 'SibSp', 'Parch', 'Fare')),
                     center = attr(train_scaled, "scaled:center"),
                     scale = attr(train_scaled, "scaled:scale"))

knn_pred_train <- knn(train = train_scaled,
                      test = train_scaled,
                      cl = train$Survived,
                      k = sqrt(nrow(train_scaled)))

knn_pred <- knn(train = train_scaled,
               test = test_scaled,
               cl = train$Survived,
               k = sqrt(nrow(train_scaled)))

confusionMatrix(knn_pred_train, train$Survived, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 467 158
#>          1  82 184
#>                                             
#>                Accuracy : 0.7306            
#>                  95% CI : (0.7002, 0.7595)  
#>     No Information Rate : 0.6162            
#>     P-Value [Acc > NIR] : 0.0000000000003963
#>                                             
#>                   Kappa : 0.4056            
#>                                             
#>  Mcnemar's Test P-Value : 0.0000012903844525
#>                                             
#>             Sensitivity : 0.5380            
#>             Specificity : 0.8506            
#>          Pos Pred Value : 0.6917            
#>          Neg Pred Value : 0.7472            
#>              Prevalence : 0.3838            
#>          Detection Rate : 0.2065            
#>    Detection Prevalence : 0.2985            
#>       Balanced Accuracy : 0.6943            
#>                                             
#>        'Positive' Class : 1                 
#> 
knn <- confusionMatrix(knn_pred, test$Survived, positive = "1")
knn
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 208  79
#>          1  52  79
#>                                           
#>                Accuracy : 0.6866          
#>                  95% CI : (0.6397, 0.7308)
#>     No Information Rate : 0.622           
#>     P-Value [Acc > NIR] : 0.003439        
#>                                           
#>                   Kappa : 0.3104          
#>                                           
#>  Mcnemar's Test P-Value : 0.023109        
#>                                           
#>             Sensitivity : 0.5000          
#>             Specificity : 0.8000          
#>          Pos Pred Value : 0.6031          
#>          Neg Pred Value : 0.7247          
#>              Prevalence : 0.3780          
#>          Detection Rate : 0.1890          
#>    Detection Prevalence : 0.3134          
#>       Balanced Accuracy : 0.6500          
#>                                           
#>        'Positive' Class : 1               
#> 

Clearly 4 features are not enough. The k-NN model gives a poor result: 73% accuracy on the train dataset and 69% on the test dataset, even worse than the earlier gender-only classification.

Feature Engineering

Now let's try to improve the performance of the models above. First, concatenate train and test and save the result in an object df. Then create a new feature, FamilySize, which is simply the sum of SibSp and Parch.

test <- test %>% arrange(PassengerId)
test_Survived <- test$Survived
test_wo_Survived <- data.frame(test)
test_wo_Survived$Survived <- NA
df <- rbind(train, test_wo_Survived)
df$FamilySize <- df$SibSp + df$Parch
df <- df %>% arrange(PassengerId)

We can leverage our finding from the EDA that families (especially Masters and females) tend to survive or perish together. Create a new feature, FamilySurvival, which takes the value 0, 1, or 0.5 as follows:

  1. Fill FamilySurvival with the default value 0.5 (0.5 means we don't know whether any other family member of that passenger survived).
  2. Group df by Surname and Fare, and discard any group that consists of only one person.
  3. Iterate over each passenger in each group:
  • if any other member of the passenger's group survived, set that passenger's FamilySurvival to 1
  • else, if any other member of the group did not survive, set that passenger's FamilySurvival to 0
  • else, keep that passenger's FamilySurvival at 0.5

DEFAULT_SURVIVAL_VALUE <- 0.5
df$FamilySurvival <- DEFAULT_SURVIVAL_VALUE

df$Survived <- as.integer(df$Survived) - 1

groups <- split(df, list(df$Surname, df$Fare), drop = TRUE)
for (grp_df in groups) {
  if (nrow(grp_df) != 1) {
    for (ind in 1:nrow(grp_df)) {
      smax <- max(grp_df[-ind, 'Survived'], na.rm = TRUE)
      smin <- min(grp_df[-ind, 'Survived'], na.rm = TRUE)
      passId <- grp_df[ind, 'PassengerId']
      if (smax == 1.0) {
        df[df$PassengerId == passId, 'FamilySurvival'] <- 1
      } else if (smin == 0.0) {
        df[df$PassengerId == passId, 'FamilySurvival'] <- 0
      }
    }
  }
}

cat("Number of passengers with family survival information:", 
    nrow(df[df$FamilySurvival != 0.5, ]))
#> Number of passengers with family survival information: 420

In the end, 420 of the passengers in df have FamilySurvival information (either 0 or 1). But we are not done yet: we can gather even more FamilySurvival information from Ticket, as follows.

  1. Group df by Ticket, and discard any group that consists of only one person.
  2. Iterate over each passenger in each group whose FamilySurvival is not already 1:
  • if any other member of the passenger's group survived, set that passenger's FamilySurvival to 1
  • else, if any other member of the group did not survive, set that passenger's FamilySurvival to 0
  • else, leave that passenger's FamilySurvival as it is

groups <- split(df, df$Ticket, drop = TRUE)
for (grp_df in groups) {
  if (nrow(grp_df) != 1) {
    for (ind in 1:nrow(grp_df)) {
      if (grp_df[ind, 'FamilySurvival'] != 1.0) {
        smax <- max(grp_df[-ind, 'Survived'], na.rm = TRUE)
        smin <- min(grp_df[-ind, 'Survived'], na.rm = TRUE)
        passId <- grp_df[ind, 'PassengerId']
        if (smax == 1.0) {
          df[df$PassengerId == passId, 'FamilySurvival'] <- 1
        } else if (smin == 0.0) {
          df[df$PassengerId == passId, 'FamilySurvival'] <- 0
        }
      }
    }
  }
}

cat("Number of passenger with family survival information:",
    nrow(df[df$FamilySurvival != 0.5, ]))
#> Number of passenger with family survival information: 546

Now the number of passengers with FamilySurvival information has increased to 546.

Next, let's convert Sex to an integer and discretize Fare and Age into 5 bins each, saving the results as new features FareBin and AgeBin, respectively.

df$FareBin <- ntile(x = df$Fare, n = 5)
df$AgeBin <- ntile(x = df$Age, n = 5)
df$Sex <- as.integer(df$Sex) - 1

Here’s the final df dataset.

glimpse(df)
#> Rows: 1,309
#> Columns: 16
#> $ Pclass         <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2...
#> $ Title          <fct> Mr, Mrs, Miss, Mrs, Mr, Mr, Mr, Master, Mrs, Mrs, Mi...
#> $ PassengerId    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
#> $ Survived       <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1...
#> $ Sex            <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1...
#> $ Age            <dbl> 22, 38, 26, 35, 35, 30, 54, 2, 27, 14, 4, 58, 20, 39...
#> $ SibSp          <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0...
#> $ Parch          <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0...
#> $ Ticket         <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803...
#> $ Fare           <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51...
#> $ Embarked       <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S...
#> $ Surname        <chr> "Braund", "Cumings", "Heikkinen", "Futrelle", "Allen...
#> $ FamilySize     <int> 1, 1, 0, 1, 0, 0, 0, 4, 2, 1, 2, 0, 0, 6, 0, 0, 5, 0...
#> $ FamilySurvival <dbl> 0.5, 0.5, 0.5, 0.0, 0.5, 0.5, 0.5, 0.0, 1.0, 0.0, 1....
#> $ FareBin        <int> 1, 5, 2, 5, 2, 2, 5, 3, 3, 4, 3, 4, 2, 4, 1, 3, 4, 3...
#> $ AgeBin         <int> 2, 4, 2, 4, 4, 3, 5, 1, 3, 1, 1, 5, 1, 4, 1, 5, 1, 3...

Modeling (Part 2)

Logistic Regression

First of all, reproduce train and test from df. Don't forget to convert the feature classes appropriately.

train <- df %>% 
  mutate_at(c('Survived', 'Sex', 'FareBin', 'AgeBin'), as.factor) %>% 
  slice_head(n = nrow(train))

test <- df %>% 
  mutate_at(c('Survived', 'Sex', 'FareBin', 'AgeBin'), as.factor) %>% 
  slice_tail(n = nrow(test))

test$Survived <- test_Survived

Again, we rely on the step() function for feature selection.

log <- glm(formula = Survived ~ . - PassengerId - Surname - Ticket, data = train, family = "binomial")
step(log, direction = "backward")
#> Start:  AIC=705.36
#> Survived ~ (Pclass + Title + PassengerId + Sex + Age + SibSp + 
#>     Parch + Ticket + Fare + Embarked + Surname + FamilySize + 
#>     FamilySurvival + FareBin + AgeBin) - PassengerId - Surname - 
#>     Ticket
#> 
#> 
#> Step:  AIC=705.36
#> Survived ~ Pclass + Title + Sex + Age + SibSp + Parch + Fare + 
#>     Embarked + FamilySurvival + FareBin + AgeBin
#> 
#>                  Df Deviance    AIC
#> - FareBin         4   637.48 699.48
#> - AgeBin          4   640.44 702.44
#> - Embarked        2   636.70 702.70
#> - Fare            1   635.69 703.69
#> - Age             1   636.35 704.35
#> <none>                635.36 705.36
#> - Sex             1   637.42 705.42
#> - Parch           1   645.20 713.20
#> - Pclass          2   648.74 714.74
#> - SibSp           1   648.98 716.98
#> - Title          16   687.99 725.99
#> - FamilySurvival  1   705.24 773.24
#> 
#> Step:  AIC=699.48
#> Survived ~ Pclass + Title + Sex + Age + SibSp + Parch + Fare + 
#>     Embarked + FamilySurvival + AgeBin
#> 
#>                  Df Deviance    AIC
#> - AgeBin          4   642.42 696.42
#> - Embarked        2   638.94 696.94
#> - Fare            1   637.76 697.76
#> - Age             1   638.53 698.53
#> <none>                637.48 699.48
#> - Sex             1   639.70 699.70
#> - Parch           1   645.63 705.63
#> - SibSp           1   649.97 709.97
#> - Title          16   691.27 721.27
#> - Pclass          2   675.76 733.76
#> - FamilySurvival  1   709.49 769.49
#> 
#> Step:  AIC=696.42
#> Survived ~ Pclass + Title + Sex + Age + SibSp + Parch + Fare + 
#>     Embarked + FamilySurvival
#> 
#>                  Df Deviance    AIC
#> - Embarked        2   643.65 693.65
#> - Fare            1   642.57 694.57
#> <none>                642.42 696.42
#> - Sex             1   644.54 696.54
#> - Age             1   650.86 702.86
#> - Parch           1   651.38 703.38
#> - SibSp           1   654.93 706.93
#> - Title          16   703.04 725.04
#> - Pclass          2   681.22 731.22
#> - FamilySurvival  1   716.44 768.44
#> 
#> Step:  AIC=693.65
#> Survived ~ Pclass + Title + Sex + Age + SibSp + Parch + Fare + 
#>     FamilySurvival
#> 
#>                  Df Deviance    AIC
#> - Fare            1   643.70 691.70
#> - Sex             1   645.62 693.62
#> <none>                643.65 693.65
#> - Age             1   652.78 700.78
#> - Parch           1   652.84 700.84
#> - SibSp           1   657.35 705.35
#> - Title          16   705.26 723.26
#> - Pclass          2   684.92 730.92
#> - FamilySurvival  1   719.20 767.20
#> 
#> Step:  AIC=691.7
#> Survived ~ Pclass + Title + Sex + Age + SibSp + Parch + FamilySurvival
#> 
#>                  Df Deviance    AIC
#> - Sex             1   645.68 691.68
#> <none>                643.70 691.70
#> - Age             1   652.78 698.78
#> - Parch           1   653.60 699.60
#> - SibSp           1   657.83 703.83
#> - Title          16   705.59 721.59
#> - Pclass          2   698.10 742.10
#> - FamilySurvival  1   722.12 768.12
#> 
#> Step:  AIC=691.68
#> Survived ~ Pclass + Title + Age + SibSp + Parch + FamilySurvival
#> 
#>                  Df Deviance    AIC
#> <none>                645.68 691.68
#> - Age             1   654.61 698.61
#> - Parch           1   655.49 699.49
#> - SibSp           1   660.30 704.30
#> - Pclass          2   700.41 742.41
#> - FamilySurvival  1   723.49 767.49
#> - Title          16   965.25 979.25
#> 
#> Call:  glm(formula = Survived ~ Pclass + Title + Age + SibSp + Parch + 
#>     FamilySurvival, family = "binomial", data = train)
#> 
#> Coefficients:
#>       (Intercept)            Pclass2            Pclass3           TitleCol  
#>         -16.42260           -1.17899           -1.96986           16.70775  
#>          TitleDon            TitleDr      TitleJonkheer          TitleLady  
#>          -0.40487           16.12237           -0.46560           33.49021  
#>        TitleMajor        TitleMaster          TitleMiss          TitleMlle  
#>          16.41929           19.23280           18.50439           31.66991  
#>          TitleMme            TitleMr           TitleMrs            TitleMs  
#>          30.76548           15.44193           19.83730           33.54188  
#>          TitleRev           TitleSir  Titlethe Countess                Age  
#>           0.67104           32.04461           31.03875           -0.03036  
#>             SibSp              Parch     FamilySurvival  
#>          -0.52004           -0.44988            2.95193  
#> 
#> Degrees of Freedom: 890 Total (i.e. Null);  868 Residual
#> Null Deviance:       1187 
#> Residual Deviance: 645.7     AIC: 691.7

Here's the best model found by the backward elimination process above.

step <- glm(formula = Survived ~ Pclass + Title + Age + SibSp + Parch + 
    FamilySurvival, family = "binomial", data = train)

Again, we can check whether this model satisfies all three assumptions explained before.

vif(step)
#>                    GVIF Df GVIF^(1/(2*Df))
#> Pclass         1.427430  2        1.093047
#> Title          2.923928 16        1.034097
#> Age            1.796540  1        1.340351
#> SibSp          1.528237  1        1.236219
#> Parch          1.318792  1        1.148387
#> FamilySurvival 1.264384  1        1.124448
data.frame(coefficient = round(coef(step),2),
           odds_ratio = round(exp(coef(step)),2)) %>% 
  arrange(odds_ratio)
#>                   coefficient         odds_ratio
#> (Intercept)            -16.42               0.00
#> Pclass3                 -1.97               0.14
#> Pclass2                 -1.18               0.31
#> SibSp                   -0.52               0.59
#> TitleJonkheer           -0.47               0.63
#> Parch                   -0.45               0.64
#> TitleDon                -0.40               0.67
#> Age                     -0.03               0.97
#> TitleRev                 0.67               1.96
#> FamilySurvival           2.95              19.14
#> TitleMr                 15.44         5085639.56
#> TitleDr                 16.12        10042816.52
#> TitleMajor              16.42        13514706.00
#> TitleCol                16.71        18033592.93
#> TitleMiss               18.50       108731201.14
#> TitleMaster             19.23       225268465.32
#> TitleMrs                19.84       412317649.76
#> TitleMme                30.77  22976113262919.57
#> Titlethe Countess       31.04  30196650158218.64
#> TitleMlle               31.67  56763502420145.39
#> TitleSir                32.04  82565439021119.67
#> TitleLady               33.49 350440419629764.88
#> TitleMs                 33.54 369024586621355.25

Great! Now, prediction time.

lr_pred <- predict(step, train, type = "response")
lr_pred <- ifelse(lr_pred > 0.5, 1, 0)
confusionMatrix(as.factor(lr_pred), as.factor(train$Survived), positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 499  84
#>          1  50 258
#>                                                
#>                Accuracy : 0.8496               
#>                  95% CI : (0.8244, 0.8725)     
#>     No Information Rate : 0.6162               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.676                
#>                                                
#>  Mcnemar's Test P-Value : 0.004361             
#>                                                
#>             Sensitivity : 0.7544               
#>             Specificity : 0.9089               
#>          Pos Pred Value : 0.8377               
#>          Neg Pred Value : 0.8559               
#>              Prevalence : 0.3838               
#>          Detection Rate : 0.2896               
#>    Detection Prevalence : 0.3457               
#>       Balanced Accuracy : 0.8317               
#>                                                
#>        'Positive' Class : 1                    
#> 
lr_pred <- predict(step, test, type = "response")
lr_pred <- ifelse(lr_pred > 0.5, 1, 0)
lr_fe <- confusionMatrix(as.factor(lr_pred), as.factor(test$Survived), positive = "1")
lr_fe
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 216  45
#>          1  44 113
#>                                             
#>                Accuracy : 0.7871            
#>                  95% CI : (0.7447, 0.8254)  
#>     No Information Rate : 0.622             
#>     P-Value [Acc > NIR] : 0.0000000000003036
#>                                             
#>                   Kappa : 0.5466            
#>                                             
#>  Mcnemar's Test P-Value : 1                 
#>                                             
#>             Sensitivity : 0.7152            
#>             Specificity : 0.8308            
#>          Pos Pred Value : 0.7197            
#>          Neg Pred Value : 0.8276            
#>              Prevalence : 0.3780            
#>          Detection Rate : 0.2703            
#>    Detection Prevalence : 0.3756            
#>       Balanced Accuracy : 0.7730            
#>                                             
#>        'Positive' Class : 1                 
#> 

Awesome! We successfully increased the performance of the Logistic Regression model to 85% accuracy on the train dataset and 79% on the test dataset.

k-Nearest Neighbors

Let's peek at df one more time.

glimpse(df)
#> Rows: 1,309
#> Columns: 16
#> $ Pclass         <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2...
#> $ Title          <fct> Mr, Mrs, Miss, Mrs, Mr, Mr, Mr, Master, Mrs, Mrs, Mi...
#> $ PassengerId    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
#> $ Survived       <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1...
#> $ Sex            <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1...
#> $ Age            <dbl> 22, 38, 26, 35, 35, 30, 54, 2, 27, 14, 4, 58, 20, 39...
#> $ SibSp          <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0...
#> $ Parch          <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0...
#> $ Ticket         <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803...
#> $ Fare           <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51...
#> $ Embarked       <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S...
#> $ Surname        <chr> "Braund", "Cumings", "Heikkinen", "Futrelle", "Allen...
#> $ FamilySize     <int> 1, 1, 0, 1, 0, 0, 0, 4, 2, 1, 2, 0, 0, 6, 0, 0, 5, 0...
#> $ FamilySurvival <dbl> 0.5, 0.5, 0.5, 0.0, 0.5, 0.5, 0.5, 0.0, 1.0, 0.0, 1....
#> $ FareBin        <int> 1, 5, 2, 5, 2, 2, 5, 3, 3, 4, 3, 4, 2, 4, 1, 3, 4, 3...
#> $ AgeBin         <int> 2, 4, 2, 4, 4, 3, 5, 1, 3, 1, 1, 5, 1, 4, 1, 5, 1, 3...

Reproduce train and test from df. This time, we will select features manually following the rules below.

  1. Drop PassengerId as it adds no information to the data
  2. Drop Title as its information is assumed to be captured by Age and Sex
  3. Drop Surname and Ticket as they are already used to determine FamilySurvival
  4. Drop SibSp and Parch as they are combined to create FamilySize
  5. Drop Age and Fare as they are already represented by AgeBin and FareBin
  6. Drop Embarked as it adds little information to the data

Eventually, we are left with 7 features: Survived, Pclass, Sex, FamilySize, FamilySurvival, FareBin, and AgeBin, with Survived as the target. Since we are working with k-NN, we convert all of these features to numeric. Note that numerical conversion makes sense for Pclass, FareBin, and AgeBin since they are ordinal, but not for Sex, which is nominal. However, for now we'll just continue and see what happens.
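
As a side note, if we ever wanted to include a nominal feature with more than two levels (Embarked, for example), one-hot encoding via model.matrix would be safer than integer coding. A minimal sketch, not used in the modeling below:

head(model.matrix(~ Embarked - 1, data = df))  # one 0/1 indicator column per port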

train <- df %>% 
  select(c(Survived, Pclass, Sex, FamilySize, FamilySurvival, FareBin, AgeBin)) %>% 
  mutate_all(., as.numeric) %>% 
  slice_head(n = nrow(train))

test <- df %>% 
  select(c(Survived, Pclass, Sex, FamilySize, FamilySurvival, FareBin, AgeBin)) %>% 
  mutate_all(., as.numeric) %>% 
  slice_tail(n = nrow(test))

test$Survived <- test_Survived

As before, we scale the predictors, fit the model, and predict the results.

train_scaled <- scale(x = train %>% select(-Survived))
test_scaled <- scale(x = test %>% select(-Survived),
                     center = attr(train_scaled, "scaled:center"),
                     scale = attr(train_scaled, "scaled:scale"))

knn_pred_train <- knn(train = train_scaled,
                      test = train_scaled,
                      cl = train$Survived,
                      k = sqrt(nrow(train_scaled)))

knn_pred <- knn(train = train_scaled,
               test = test_scaled,
               cl = train$Survived,
               k = sqrt(nrow(train_scaled)))

confusionMatrix(knn_pred_train, as.factor(train$Survived), positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 495  89
#>          1  54 253
#>                                                
#>                Accuracy : 0.8395               
#>                  95% CI : (0.8137, 0.863)      
#>     No Information Rate : 0.6162               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.654                
#>                                                
#>  Mcnemar's Test P-Value : 0.004466             
#>                                                
#>             Sensitivity : 0.7398               
#>             Specificity : 0.9016               
#>          Pos Pred Value : 0.8241               
#>          Neg Pred Value : 0.8476               
#>              Prevalence : 0.3838               
#>          Detection Rate : 0.2840               
#>    Detection Prevalence : 0.3446               
#>       Balanced Accuracy : 0.8207               
#>                                                
#>        'Positive' Class : 1                    
#> 
knn_fe <- confusionMatrix(knn_pred, as.factor(test$Survived), positive = "1")
knn_fe
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 225  47
#>          1  35 111
#>                                                
#>                Accuracy : 0.8038               
#>                  95% CI : (0.7625, 0.8408)     
#>     No Information Rate : 0.622                
#>     P-Value [Acc > NIR] : 0.0000000000000007055
#>                                                
#>                   Kappa : 0.5765               
#>                                                
#>  Mcnemar's Test P-Value : 0.2245               
#>                                                
#>             Sensitivity : 0.7025               
#>             Specificity : 0.8654               
#>          Pos Pred Value : 0.7603               
#>          Neg Pred Value : 0.8272               
#>              Prevalence : 0.3780               
#>          Detection Rate : 0.2656               
#>    Detection Prevalence : 0.3493               
#>       Balanced Accuracy : 0.7840               
#>                                                
#>        'Positive' Class : 1                    
#> 

The k-NN model reaches 84% accuracy on the train dataset, slightly lower than the Logistic Regression model, but also performs really well on the test dataset with 80% accuracy. We have found our winner!

Conclusion

rbind(
  "Logistic Regression" = lr$overall['Accuracy'], 
  "k-Nearest Neighbors" = knn$overall['Accuracy'], 
  "Logistic Regression with Feature Engineering" = lr_fe$overall['Accuracy'], 
  "k-Nearest Neighbors with Feature Engineering" = knn_fe$overall['Accuracy']
)
#>                                               Accuracy
#> Logistic Regression                          0.7727273
#> k-Nearest Neighbors                          0.6866029
#> Logistic Regression with Feature Engineering 0.7870813
#> k-Nearest Neighbors with Feature Engineering 0.8038278

The k-NN model reaches as high as 80% accuracy in predicting the survival of Titanic passengers, beating the Logistic Regression model. However, k-NN on the raw numerical features alone performs poorly, even worse than the simple gender-only classification. Feature engineering is what boosts the k-NN model's performance to the top.