The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “What sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Given a training set of samples labeling passengers as having survived or not survived the Titanic disaster, can our model determine, for a test dataset that contains no survival information, whether each passenger in that test dataset survived?
We may also want to develop some early understanding about the domain of our problem.
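The functions used throughout this write-up come from several packages; based on the calls that appear later (read_csv, mutate, replace_na, dummyVars, initial_split, ggplot, prediction), the session presumably begins with something like:
library(readr)     # read_csv()
library(dplyr)     # mutate(), select(), bind_rows()
library(tidyr)     # replace_na()
library(caret)     # dummyVars(), preProcess(), confusionMatrix()
library(rsample)   # initial_split(), training(), testing()
library(ggplot2)   # visualizations
library(GGally)    # ggcorr()
library(ROCR)      # prediction(), performance()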
train <- read_csv("data_input/train.csv")
test <- read_csv("data_input/test.csv")
solution <- read_csv("data_input/solution.csv")
glimpse(train)
#> Observations: 891
#> Variables: 12
#> $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
#> $ Survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
#> $ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
#> $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
#> $ Sex <chr> "male", "female", "female", "female", "male", "male", "...
#> $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
#> $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
#> $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
#> $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
#> $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6",...
#> $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", ...
summary(train)
#> PassengerId Survived Pclass Name
#> Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
#> 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
#> Median :446.0 Median :0.0000 Median :3.000 Mode :character
#> Mean :446.0 Mean :0.3838 Mean :2.309
#> 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
#> Max. :891.0 Max. :1.0000 Max. :3.000
#>
#> Sex Age SibSp Parch
#> Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
#> Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
#> Mode :character Median :28.00 Median :0.000 Median :0.0000
#> Mean :29.70 Mean :0.523 Mean :0.3816
#> 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
#> Max. :80.00 Max. :8.000 Max. :6.0000
#> NA's :177
#> Ticket Fare Cabin Embarked
#> Length:891 Min. : 0.00 Length:891 Length:891
#> Class :character 1st Qu.: 7.91 Class :character Class :character
#> Mode :character Median : 14.45 Mode :character Mode :character
#> Mean : 32.20
#> 3rd Qu.: 31.00
#> Max. :512.33
#>
Data Explanation:
After loading the datasets, we can combine the train and test data by creating a new Survived column on the test data. The Kaggle competition test set does not include the survival information, so we fill this column with NA values and store the combined data in an object called combine.
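The combining step itself is not shown in the original; a minimal sketch, assuming dplyr's bind_rows():
test$Survived <- NA                 # test set carries no survival labels
combine <- bind_rows(train, test)
dim(combine)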
#> [1] 1309 12
combine <- combine %>%
  select(-c(1, 4, 9)) %>%                  # drop PassengerId, Name, and Ticket
  mutate(Survived = as.factor(Survived)) %>%
  mutate(Pclass = as.factor(Pclass)) %>%
  mutate_if(is.character, as.factor)       # remaining character columns become factors
levels(combine$Pclass)
#> [1] "1" "2" "3"
First of all, we can check the overall survival rate in the training dataset.
prop.table(table(train$Survived))
#>
#> 0 1
#> 0.6161616 0.3838384
colSums(is.na(combine))
#> Survived Pclass Sex Age SibSp Parch Fare Cabin
#> 418 0 0 263 0 0 1 1014
#> Embarked
#> 2
These will require correcting.
In the training dataset, Cabin, Age, and Embarked contain null values, in that order of frequency; in the test dataset, Cabin, Age, and Fare are incomplete. Replace the missing Fare and Embarked values with the mean and the mode:
summary(combine$Fare)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 0.000 7.896 14.454 33.295 31.275 512.329 1
summary(combine$Embarked)
#> C Q S NA's
#> 270 123 914 2
summary(combine$Fare[combine$Pclass == "3"])   # presumably restricted to 3rd class; the values match
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 0.00 7.75 8.05 13.30 15.25 69.55 1
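The hard-coded values in the next chunk (13.3 and "S") can also be derived from the summaries above. A sketch, under the assumption that the fare figure is the mean 3rd-class fare:
fare_mean     <- mean(combine$Fare[combine$Pclass == "3"], na.rm = TRUE)  # ~13.3
embarked_mode <- names(which.max(table(combine$Embarked)))                # "S"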
combine <- combine %>%
  mutate(Fare = replace_na(data = Fare, replace = 13.3),
         Embarked = replace_na(data = Embarked, replace = "S"))
NA Handling in Age column
There are numerous ways to deal with missing values. Imputing the mean of the feature can sometimes be an easy way to deal with it, but may lead to some bias in the model.
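For contrast, the naive mean imputation mentioned above would be a one-liner (shown for illustration only, not run here):
# Not run: impute every missing Age with the overall mean
combine$Age[is.na(combine$Age)] <- mean(combine$Age, na.rm = TRUE)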
Another way to impute missing values could be using another type of model, such as linear regression, to predict what the missing Age values would be based on the other features in the dataset.
summary(combine$Age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 0.17 21.00 28.00 29.88 39.00 80.00 263
combine$MissingAge <- ifelse(is.na(combine$Age), "Y", "N")
combine$MissingAge <- as.factor(combine$MissingAge)
# Caret supports a number of mechanisms for imputing (i.e., predicting) missing values.
# Leverage bagged decision trees to impute missing values for the Age feature.
# First, transform all features to dummy variables.
dummy.vars <- dummyVars(~ ., data = combine[, -c(1,8)])     # exclude Survived and Cabin
combine.dummy <- predict(dummy.vars, combine[, -c(1,8)])
# Now, impute!
pre.process <- preProcess(combine.dummy, method = "bagImpute")
imputed.data <- predict(pre.process, combine.dummy)
combine$Age <- imputed.data[, 6]   # column 6 holds the imputed Age
combine <- combine %>%
  mutate(Age = round(Age))
summary(combine$Age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0 22.0 28.0 29.7 37.0 80.0
NA Handling in Cabin column
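The output below includes a FamilySize column, but the code that created it does not appear in this write-up. Judging from the values in the later glimpse() output (always SibSp + Parch + 1), it was presumably:
combine <- combine %>%
  mutate(FamilySize = SibSp + Parch + 1)   # inferred: self plus siblings/spouses plus parents/children
colSums(is.na(combine))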
#> Survived Pclass Sex Age SibSp Parch Fare
#> 418 0 0 0 0 0 0
#> Cabin Embarked MissingAge FamilySize
#> 1014 0 0 0
mean(is.na(train$Cabin))
#> [1] 0.7710438
Let's make this even clearer, particularly for Cabin: about 77% of the Cabin values in the training dataset are missing. (Recall that the overall survival rate in the training set is 38%.)
I created a new variable called CabinL storing the first letter of each cabin. Based on the Titanic deck plans, I saw that cabins were named according to the deck (floor) of the ship. I thought this could be useful.
combine$CabinL <- NA
combine$CabinL <- factor(substr(combine$Cabin, start=1, stop=1))
## Check if it worked
table(combine$CabinL)
#>
#> A B C D E F G T
#> 22 65 94 46 41 21 5 1
table(is.na(combine$CabinL))
#>
#> FALSE TRUE
#> 295 1014
Using the Titanic deck plan, I assigned each cabin letter its relative floor: cabins beginning with A were on the uppermost part of the ship, hence the value 1, while G cabins were the lowest on the ship and received the value 7. The single cabin with letter T, which I could not find in the diagram, was assigned floor 0: that passenger was traveling in 1st class, and 1st class occupied the upper decks according to the diagram. These floor numbers were stored in a new variable called CabinFloor. As an integer, I figured I would have better success estimating CabinFloor than I would using unordered factor levels (i.e. letters).
combine$CabinFloor <- NA
combine$CabinFloor <- ifelse(combine$CabinL=="A", 1,
                      ifelse(combine$CabinL=="B", 2,
                      ifelse(combine$CabinL=="C", 3,
                      ifelse(combine$CabinL=="D", 4,
                      ifelse(combine$CabinL=="E", 5,
                      ifelse(combine$CabinL=="F", 6,
                      ifelse(combine$CabinL=="G", 7,
                      ifelse(combine$CabinL=="T", 0, NA))))))))
combine$CabinFloor <- as.integer(combine$CabinFloor)
We can use ggcorr to determine which numeric variables are most correlated with CabinFloor. Before computing the correlations, we change the Survived and Pclass columns to numeric.
combine <- combine %>%
  mutate(Survived = as.numeric(Survived)) %>%
  mutate(Pclass = as.numeric(Pclass))
glimpse(combine)
#> Observations: 1,309
#> Variables: 13
#> $ Survived <dbl> 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1,...
#> $ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3,...
#> $ Sex <fct> male, female, female, female, male, male, male, male, fe...
#> $ Age <dbl> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14...
#> $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1,...
#> $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0,...
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.862...
#> $ Cabin <fct> NA, C85, NA, C123, NA, NA, E46, NA, NA, NA, G6, C103, NA...
#> $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S,...
#> $ MissingAge <fct> N, N, N, N, N, Y, N, N, N, N, N, N, N, N, N, N, N, Y, N,...
#> $ FamilySize <dbl> 2, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, 2,...
#> $ CabinL <fct> NA, C, NA, C, NA, NA, E, NA, NA, NA, G, C, NA, NA, NA, N...
#> $ CabinFloor <int> NA, 3, NA, 3, NA, NA, 5, NA, NA, NA, 7, 3, NA, NA, NA, N...
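The correlation plot itself is not reproduced in this write-up. A minimal call that would generate it, assuming GGally's ggcorr():
ggcorr(select_if(combine, is.numeric), label = TRUE)   # correlation heatmap of the numeric columns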
As the correlation plot shows, CabinFloor correlated most strongly with Pclass, so we can impute the missing CabinFloor values using the per-Pclass mean and the ave() function.
combine$CabinFloor[is.na(combine$CabinFloor)] <- with(combine, ave(CabinFloor, Pclass,
    FUN = function(x) mean(x, na.rm = TRUE)))[is.na(combine$CabinFloor)]
## Check if it worked
summary(combine$CabinFloor)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.000 5.000 6.125 5.176 6.125 7.000
Next, we can discretize CabinFloor into upper-half and lower-half values. We could create a new column for this, or simply overwrite the CabinFloor column.
combine$CabinFloor <- ifelse(combine$CabinFloor < 4, "Upper Half", "Lower Half")
combine$CabinFloor <- as.factor(combine$CabinFloor)
table(combine$CabinFloor)
#>
#> Lower Half Upper Half
#> 1060 249
Next, we can re-split our data into train and test datasets. We also remove the columns we will not use to build our model.
combine <- combine %>%
  mutate(Survived = as.factor(ifelse(Survived == 1, "0", "1"))) %>%   # map 1/2 back to factor levels 0/1
  mutate(Pclass = as.factor(Pclass))
glimpse(combine)
#> Observations: 1,309
#> Variables: 13
#> $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0,...
#> $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3,...
#> $ Sex <fct> male, female, female, female, male, male, male, male, fe...
#> $ Age <dbl> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14...
#> $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1,...
#> $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0,...
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.862...
#> $ Cabin <fct> NA, C85, NA, C123, NA, NA, E46, NA, NA, NA, G6, C103, NA...
#> $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S,...
#> $ MissingAge <fct> N, N, N, N, N, Y, N, N, N, N, N, N, N, N, N, N, N, Y, N,...
#> $ FamilySize <dbl> 2, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, 2,...
#> $ CabinL <fct> NA, C, NA, C, NA, NA, E, NA, NA, NA, G, C, NA, NA, NA, N...
#> $ CabinFloor <fct> Lower Half, Upper Half, Lower Half, Upper Half, Lower Ha...
colSums(is.na(combine))
#> Survived Pclass Sex Age SibSp Parch Fare
#> 418 0 0 0 0 0 0
#> Cabin Embarked MissingAge FamilySize CabinL CabinFloor
#> 1014 0 0 0 1014 0
train <- combine[1:891, -c(8, 10, 12)]     # drop Cabin, MissingAge, CabinL
test  <- combine[892:1309, -c(8, 10, 12)]
colSums(is.na(train))
#> Survived Pclass Sex Age SibSp Parch Fare
#> 0 0 0 0 0 0 0
#> Embarked FamilySize CabinFloor
#> 0 0 0
We can explore the train data using the ggplot2 package for visualizations.
names(train)
#> [1] "Survived" "Pclass" "Sex" "Age" "SibSp"
#> [6] "Parch" "Fare" "Embarked" "FamilySize" "CabinFloor"
Survival Rate by Pclass
ggplot(train, aes(x = Survived)) +
  geom_bar(aes(fill = Pclass), position = "dodge") +
  labs(title = "Survival by Pclass",
       y = "Frequency") +
  theme_minimal()
Passengers in 3rd class were much more likely to die, while passengers in 1st class were more likely to survive.
Survival Rate by Gender
ggplot(train, aes(x = Survived)) +
  geom_bar(aes(fill = Sex), position = "fill") +
  labs(title = "Survival by Gender",
       y = "Percentage") +
  theme_minimal()
Females were much more likely to survive than their male counterparts.
Survival Rate by Age
ggplot(train, aes(x = Age, y = as.numeric(Survived))) +
  geom_point(alpha = 0.1) +
  geom_smooth(color = "red", fill = "#69b3a2", se = TRUE) +
  scale_y_continuous(breaks = seq(1, 2, 0.5), labels = c(0, 0.5, 1)) +
  labs(title = "Survival by Age",
       y = "Survived") +
  theme_minimal()
Older passengers were less likely to survive.
Survival Rate by Fare
ggplot(train, aes(x = Fare, y = as.numeric(Survived))) +
  geom_point(alpha = 0.1) +
  geom_smooth(color = "red", fill = "#69b3a2", se = TRUE) +
  scale_y_continuous(breaks = seq(1, 2.5, 0.5), labels = c(0, 0.5, 1, 1.5)) +
  labs(title = "Survival by Fare",
       y = "Survived") +
  theme_minimal()
Passengers who paid more for their fare were more likely to survive.
Survival Rate by Embarked
ggplot(train, aes(x = Survived)) +
  geom_bar(aes(fill = Embarked), position = "stack") +
  labs(title = "Survival by Embarked",
       y = "Frequency") +
  theme_minimal()
Passengers who embarked at port “S” were less likely to survive, but then again, “S” was the mode for these data.
Survival Rate by Family Size
ggplot(train, aes(x = FamilySize, y = as.numeric(Survived))) +
  geom_point(alpha = 0.1) +
  geom_smooth(color = "red", fill = "#69b3a2", se = TRUE) +
  scale_y_continuous(breaks = seq(1, 2, 0.5), labels = c(0, 0.5, 1)) +
  labs(title = "Survival by Family Size",
       y = "Survived") +
  theme_minimal()
Passengers with more siblings, spouses, parents, and children on board were less likely to survive. However, for family sizes between 1 and approximately 4, there was an increase in passenger survival.
Survival Rate by CabinFloor
ggplot(train, aes(x = Survived)) +
  geom_bar(aes(fill = CabinFloor), position = "dodge") +
  labs(title = "Survival by Cabin Floor",
       y = "Frequency") +
  theme_minimal()
Passengers with a cabin in the lower half of the ship were less likely to survive than passengers in the upper half.
glimpse(train)
#> Observations: 891
#> Variables: 10
#> $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0,...
#> $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3,...
#> $ Sex <fct> male, female, female, female, male, male, male, male, fe...
#> $ Age <dbl> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14...
#> $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1,...
#> $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0,...
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.862...
#> $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S,...
#> $ FamilySize <dbl> 2, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, 2,...
#> $ CabinFloor <fct> Lower Half, Upper Half, Lower Half, Upper Half, Lower Ha...
prop.table(table(train$Survived))
#>
#> 0 1
#> 0.6161616 0.3838384
set.seed(100)
idx <- initial_split(data = train, prop = 0.8, strata = "Survived")
titanic_train <- training(idx)
titanic_test <- testing(idx)
# Split labels
train_label <- titanic_train$Survived
test_label <- titanic_test$Survived
# Check class proportions in the split data
prop.table(table(titanic_train$Survived))
#>
#> 0 1
#> 0.6162465 0.3837535
prop.table(table(titanic_test$Survived))
#>
#> 0 1
#> 0.6158192 0.3841808
model_logistic_base <- glm(formula = Survived ~ ., data = titanic_train, family = "binomial")
summary(model_logistic_base)
#>
#> Call:
#> glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.5957 -0.5999 -0.4046 0.5895 2.5191
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 5.318657 0.689120 7.718 1.18e-14 ***
#> Pclass2 -1.615219 0.478560 -3.375 0.000738 ***
#> Pclass3 -2.942136 0.484207 -6.076 1.23e-09 ***
#> Sexmale -2.789570 0.231299 -12.060 < 2e-16 ***
#> Age -0.044567 0.009206 -4.841 1.29e-06 ***
#> SibSp -0.336739 0.119984 -2.807 0.005008 **
#> Parch -0.135076 0.135273 -0.999 0.318016
#> Fare 0.001534 0.002541 0.604 0.546131
#> EmbarkedQ -0.362026 0.424061 -0.854 0.393264
#> EmbarkedS -0.723907 0.276873 -2.615 0.008934 **
#> FamilySize NA NA NA NA
#> CabinFloorUpper Half -0.797421 0.465473 -1.713 0.086687 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 950.86 on 713 degrees of freedom
#> Residual deviance: 619.06 on 703 degrees of freedom
#> AIC: 641.06
#>
#> Number of Fisher Scoring iterations: 5
step(model_logistic_base, direction = "backward")
#> Start: AIC=641.06
#> Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked +
#> FamilySize + CabinFloor
#>
#>
#> Step: AIC=641.06
#> Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked +
#> CabinFloor
#>
#> Df Deviance AIC
#> - Fare 1 619.44 639.44
#> - Parch 1 620.09 640.09
#> <none> 619.06 641.06
#> - CabinFloor 1 622.06 642.06
#> - Embarked 2 626.26 644.26
#> - SibSp 1 628.39 648.39
#> - Age 1 644.80 664.80
#> - Pclass 2 671.30 689.30
#> - Sex 1 805.45 825.45
#>
#> Step: AIC=639.44
#> Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked + CabinFloor
#>
#> Df Deviance AIC
#> - Parch 1 620.26 638.26
#> <none> 619.44 639.44
#> - CabinFloor 1 622.18 640.18
#> - Embarked 2 627.28 643.28
#> - SibSp 1 628.47 646.47
#> - Age 1 645.87 663.87
#> - Pclass 2 678.23 694.23
#> - Sex 1 807.94 825.94
#>
#> Step: AIC=638.26
#> Survived ~ Pclass + Sex + Age + SibSp + Embarked + CabinFloor
#>
#> Df Deviance AIC
#> <none> 620.26 638.26
#> - CabinFloor 1 623.02 639.02
#> - Embarked 2 628.30 642.30
#> - SibSp 1 632.27 648.27
#> - Age 1 646.24 662.24
#> - Pclass 2 679.18 693.18
#> - Sex 1 812.58 828.58
#>
#> Call: glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked +
#> CabinFloor, family = "binomial", data = titanic_train)
#>
#> Coefficients:
#> (Intercept) Pclass2 Pclass3
#> 5.3508 -1.6655 -3.0146
#> Sexmale Age SibSp
#> -2.7512 -0.0444 -0.3602
#> EmbarkedQ EmbarkedS CabinFloorUpper Half
#> -0.3459 -0.7493 -0.7538
#>
#> Degrees of Freedom: 713 Total (i.e. Null); 705 Residual
#> Null Deviance: 950.9
#> Residual Deviance: 620.3 AIC: 638.3
model_logistic <- glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked +
                        CabinFloor, family = "binomial", data = titanic_train)
summary(model_logistic)
#>
#> Call:
#> glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked +
#> CabinFloor, family = "binomial", data = titanic_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6198 -0.6154 -0.4017 0.6023 2.5313
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 5.35084 0.65580 8.159 3.37e-16 ***
#> Pclass2 -1.66551 0.46962 -3.547 0.00039 ***
#> Pclass3 -3.01458 0.46833 -6.437 1.22e-10 ***
#> Sexmale -2.75115 0.22445 -12.257 < 2e-16 ***
#> Age -0.04440 0.00913 -4.864 1.15e-06 ***
#> SibSp -0.36024 0.11536 -3.123 0.00179 **
#> EmbarkedQ -0.34588 0.42035 -0.823 0.41060
#> EmbarkedS -0.74926 0.27368 -2.738 0.00619 **
#> CabinFloorUpper Half -0.75375 0.45779 -1.647 0.09966 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 950.86 on 713 degrees of freedom
#> Residual deviance: 620.26 on 705 degrees of freedom
#> AIC: 638.26
#>
#> Number of Fisher Scoring iterations: 5
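The code that produced the confusion matrix below does not appear in the write-up. A minimal sketch, assuming a 0.5 probability cutoff and caret's confusionMatrix(); pred_logistic, the vector of predicted probabilities, is also reused for the ROC curve later:
pred_logistic <- predict(model_logistic, newdata = titanic_test, type = "response")
pred_class <- as.factor(ifelse(pred_logistic > 0.5, "1", "0"))
Log_eval <- confusionMatrix(data = pred_class, reference = test_label, positive = "1")
Log_eval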
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 100 23
#> 1 9 45
#>
#> Accuracy : 0.8192
#> 95% CI : (0.7545, 0.8729)
#> No Information Rate : 0.6158
#> P-Value [Acc > NIR] : 3.784e-09
#>
#> Kappa : 0.6025
#>
#> Mcnemar's Test P-Value : 0.02156
#>
#> Sensitivity : 0.6618
#> Specificity : 0.9174
#> Pos Pred Value : 0.8333
#> Neg Pred Value : 0.8130
#> Prevalence : 0.3842
#> Detection Rate : 0.2542
#> Detection Prevalence : 0.3051
#> Balanced Accuracy : 0.7896
#>
#> 'Positive' Class : 1
#>
AUC (Area Under the ROC Curve): the ROC curve is a performance measurement for classification problems across threshold settings, and the AUC summarizes it in a single number.
pred <- prediction(predictions = pred_logistic, labels = titanic_test$Survived)
perf <- performance(prediction.obj = pred, measure = "tpr", x.measure = "fpr")
plot(perf)
performance(prediction.obj = pred, measure = "auc")@y.values[[1]]
#> [1] 0.8505127
For the k-NN model, the predictors are converted to dummy variables and scaled; the class checks below show the dummyVars object and the resulting data frame.
#> [1] "dummyVars"
#> [1] "data.frame"
After scaling the train and test data, we choose k. A common heuristic is the square root of the number of training observations, sqrt(714):
#> [1] 26.72078
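The k-NN fitting code is also missing from the original. A sketch of what it plausibly looked like, assuming class::knn(), hypothetical object names for the encoded matrices, and k = 27 (the heuristic above, rounded):
library(class)   # knn()
# Dummy-encode the predictors (Survived is column 1), identically for train and test
dummy_knn <- dummyVars(~ ., data = titanic_train[, -1])
train_knn <- predict(dummy_knn, titanic_train[, -1])
test_knn  <- predict(dummy_knn, titanic_test[, -1])
# Scale the test set with the training set's centering/scaling parameters
train_scaled <- scale(train_knn)
test_scaled  <- scale(test_knn,
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))
pred_knn <- knn(train = train_scaled, test = test_scaled, cl = train_label, k = 27)
KNN_eval <- confusionMatrix(data = pred_knn, reference = test_label, positive = "1")
KNN_eval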
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 103 26
#> 1 6 42
#>
#> Accuracy : 0.8192
#> 95% CI : (0.7545, 0.8729)
#> No Information Rate : 0.6158
#> P-Value [Acc > NIR] : 3.784e-09
#>
#> Kappa : 0.5955
#>
#> Mcnemar's Test P-Value : 0.0007829
#>
#> Sensitivity : 0.6176
#> Specificity : 0.9450
#> Pos Pred Value : 0.8750
#> Neg Pred Value : 0.7984
#> Prevalence : 0.3842
#> Detection Rate : 0.2373
#> Detection Prevalence : 0.2712
#> Balanced Accuracy : 0.7813
#>
#> 'Positive' Class : 1
#>
# Collect the logistic regression metrics into a small summary table
a <- as.data.frame(Log_eval$overall[[1]])
names(a) <- c("Value")
a$Metrics <- "Accuracy"
b <- as.data.frame(rbind(Log_eval$byClass[[1]], Log_eval$byClass[[2]],
                         Log_eval$byClass[[3]], Log_eval$byClass[[4]]))
names(b) <- c("Value")
b$Metrics <- c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value")
c <- rbind(a, b)
c <- c[, c(2, 1)]
Log.Regression <- c
Log.Regression$Model <- "Logistic Regression"
# Collect the k-NN metrics in the same format as the logistic table
x <- as.data.frame(KNN_eval$overall[[1]])
names(x) <- c("Value")
x$Metrics <- "Accuracy"
y <- as.data.frame(rbind(KNN_eval$byClass[[1]], KNN_eval$byClass[[2]],
                         KNN_eval$byClass[[3]], KNN_eval$byClass[[4]]))
names(y) <- c("Value")
y$Metrics <- c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value")
z <- rbind(x, y)
z <- z[, c(2, 1)]
KNN <- z
KNN$Model <- "K-Nearest Neighbour"
For the logistic model I used the backward stepwise method, while the k-NN model used all variables converted to dummy variables. Both models, logistic regression and k-NN, return the same accuracy of 0.8192: out of 100 passengers, the model predicts roughly 82 passengers' survival status correctly.
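With both tables in the same format, a side-by-side comparison is a single rbind away (a small addition not present in the original):
rbind(Log.Regression, KNN)   # stack the two metric tables for comparison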
The sensitivity and precision of the logistic regression model are 0.6618 and 0.8333, while the sensitivity and precision of the k-NN model are 0.6176 and 0.8750.
In this classification task, it is important for a model to have high accuracy so that it predicts passengers' survival status correctly and avoids misclassification. A model with high sensitivity is also preferable, since sensitivity describes the model's ability to correctly identify the positive class out of all truly positive cases.
From the above evaluation, the logistic model gave the same accuracy score and a higher sensitivity, making it the more appropriate model for predicting passengers' survival status.
Although logistic regression was picked as the best model in our case, it only achieves 0.8192 accuracy and 0.6618 sensitivity. Higher accuracy may well be attainable with other classification models and/or additional data-wrangling steps. Better still, build models with several different classification methods and compare them to find the one that best fits our data.
Thank you for reading!