1 Introduction

RMS Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912 after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking one of modern history’s deadliest peacetime commercial marine disasters. More than 100 years after the sinking, the Titanic still turns up in many different contexts. One of the newer ones is predicting whether a given passenger would have survived this legendary disaster.

library(dplyr)      # for data wrangling
library(ggplot2)    # to visualize data
library(gridExtra)  # to display multiple graphs
library(inspectdf)  # for EDA
library(tidymodels) # to build tidy models
library(caret)      # to pre-process data and build confusion matrices
library(yardstick)  # for model performance metrics

2 Reading Data

The source of the data is https://www.kaggle.com/c/titanic

Titanic_Survival <- read.csv("train (1).csv")

The data contain the following information:

  • Survived : survival, 0 = No, 1 = Yes
  • Pclass : ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd
  • Sex : male or female
  • Age : age in years
  • SibSp : number of siblings / spouses aboard the Titanic
  • Parch : number of parents / children aboard the Titanic
  • Ticket : ticket number
  • Fare : passenger fare
  • Cabin : cabin number
  • Embarked : port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton

rmarkdown::paged_table(Titanic_Survival)

3 Data Pre-Processing

Before we build a model, we need to check whether the data contain any NA values.

colSums(is.na(Titanic_Survival))
#> PassengerId    Survived      Pclass        Name         Sex         Age 
#>           0           0           0           0           0         177 
#>       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
#>           0           0           0           0           0           0
sum(is.na(Titanic_Survival$Age))/nrow(Titanic_Survival)*100
#> [1] 19.86532
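
As a side note, the inspectdf package loaded above gives the same per-column missingness summary in one call; a minimal sketch, output omitted:

# NA counts and percentages per column via inspectdf
inspect_na(Titanic_Survival)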

The Age column alone has 177 missing values, almost 20% of the dataset. Next, we look for blank (empty-string) values in the data.

# count empty strings per column; Age yields NA because NA == '' evaluates to NA
sapply(Titanic_Survival, function(x){sum(x == '')})
#> PassengerId    Survived      Pclass        Name         Sex         Age 
#>           0           0           0           0           0          NA 
#>       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
#>           0           0           0           0         687           2

The Cabin variable has a lot of blank entries. We will remove this variable along with the two observations that have a blank Embarked value.

As a justification, the observations with a missing Age will be dropped since they make up less than 20% of the dataset. A further recommendation: missing values are often better handled by imputation with a suitable technique than by simply removing observations. We also remove the variables Name, PassengerId, and Ticket, although it might be interesting to analyze characteristics of the passenger names (such as titles) against individual survival rates.
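
As a minimal sketch of that imputation alternative, not used in the rest of this analysis, a median imputation of Age could look like this (Titanic_imputed is a hypothetical name):

# alternative to dropping rows: fill missing Age with the median age
Titanic_imputed <- Titanic_Survival %>% 
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age))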

Titanic_new <- Titanic_Survival %>% 
  dplyr::select(-c(Cabin, PassengerId, Name, Ticket)) %>% # drop unused variables
  filter(!is.na(Age)) %>%   # drop observations with a missing Age
  filter(Embarked != '')    # drop the two blank Embarked observations
glimpse(Titanic_new)
#> Rows: 712
#> Columns: 8
#> $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1...
#> $ Pclass   <int> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3...
#> $ Sex      <chr> "male", "female", "female", "female", "male", "male", "mal...
#> $ Age      <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, ...
#> $ SibSp    <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0...
#> $ Parch    <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0...
#> $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750...
#> $ Embarked <chr> "S", "C", "S", "S", "S", "S", "S", "S", "C", "S", "S", "S"...
colSums(is.na(Titanic_new))
#> Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
#>        0        0        0        0        0        0        0        0

We need to convert the variables into their proper types: Sex and Embarked should be factors. (Strictly speaking, Survived and Pclass are categorical as well; we keep them numeric here and coerce them to factor where needed, as in the confusion matrices below.)

Titanic_clean <- Titanic_new %>% 
  mutate(Sex = as.factor(Sex),
         Embarked = as.factor(Embarked))
glimpse(Titanic_clean)
#> Rows: 712
#> Columns: 8
#> $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1...
#> $ Pclass   <int> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3...
#> $ Sex      <fct> male, female, female, female, male, male, male, female, fe...
#> $ Age      <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, ...
#> $ SibSp    <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0...
#> $ Parch    <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0...
#> $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750...
#> $ Embarked <fct> S, C, S, S, S, S, S, S, C, S, S, S, S, S, S, Q, S, S, S, Q...

4 Data Exploration

4.1 Age

titanic_survived <- Titanic_clean %>% 
  filter(Survived == 1)

titanic_not_survived <- Titanic_clean %>% 
  filter(Survived == 0)

hist(titanic_survived$Age, breaks = 20, col = "green", xlab = "Age", main = "Distribution of Survived Passengers")

hist(titanic_not_survived$Age, breaks = 20, col = "red", xlab = "Age", main = "Distribution of Non-Survived Passengers")

We see that many young passengers survived the sinking; children may have been given priority during the evacuation.
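
To put a rough number on this impression, here is a minimal sketch; the cutoff of 16 years for a child is an arbitrary choice:

# survival rate for children vs. adults (Survived is 0/1, so the mean is the rate)
Titanic_clean %>% 
  mutate(age_group = ifelse(Age < 16, "child", "adult")) %>% 
  group_by(age_group) %>% 
  summarise(survival_rate = mean(Survived))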

4.2 Sex

Titanic_clean %>% group_by(Sex) %>% summarise(avgAge = mean(Age), stdev = sd(Age))
ggplot(data=Titanic_clean, aes(x=Age, fill=Sex)) + geom_density(alpha=0.5)

The male age distribution extends further to the right: passengers aged roughly 60 to 80 are mostly male, and the fraction of males in this range is higher than the fraction of females.

prop.table(table(Titanic_clean$Survived, Titanic_clean$Sex))
#>    
#>         female       male
#>   0 0.08988764 0.50561798
#>   1 0.27387640 0.13061798

Male passengers were much less likely to survive the tragedy: male non-survivors alone account for more than 50% of all passengers.
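
The table above shows joint proportions over all passengers. Conditioning on sex makes the contrast sharper; a quick sketch:

# column-wise proportions: survival rate within each sex
prop.table(table(Titanic_clean$Survived, Titanic_clean$Sex), margin = 2)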

4.3 Pclass

hist(titanic_survived$Pclass, col = "green", main = "Distribution of Survived Passengers Based on Class", xlab = "Class")

hist(titanic_not_survived$Pclass, col = "red", main = "Distribution of Non-Survived Passengers Based on Class", xlab = "Class")

The passengers who died came mostly from third class, the lowest class, while first-class passengers made up the largest share of the survivors.
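
The histograms show raw counts; a short sketch of the survival rate within each class complements them:

# survival rate by ticket class (Survived is 0/1, so the mean is the rate)
Titanic_clean %>% 
  group_by(Pclass) %>% 
  summarise(survival_rate = mean(Survived))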

4.4 Embarked

ggplot(data = Titanic_clean, aes(x = Embarked, fill = as.factor(Survived))) + 
  geom_bar(position = "fill") + 
  labs(y = "proportion", fill = "Survived")
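
The same comparison in numbers, as a small sketch:

# survival rate and passenger count by port of embarkation
Titanic_clean %>% 
  group_by(Embarked) %>% 
  summarise(survival_rate = mean(Survived), n = n())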

5 Splitting Training-testing

We will build the model on the training data. Once we have the model, we test it on the testing dataset to evaluate its performance.

set.seed(11)

index <- sample(nrow(Titanic_clean), nrow(Titanic_clean)*0.7)

data_train <- Titanic_clean[index, ]
data_test <- Titanic_clean[-index, ]
prop.table(table(data_train$Survived))
#> 
#>         0         1 
#> 0.5763052 0.4236948

The proportion of each class is reasonably balanced, so we proceed to build a decision tree model.
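
As a side note, since tidymodels is already loaded, a stratified split would guarantee that training and testing sets keep similar class proportions; a minimal sketch, not used below:

# alternative 70/30 split, stratified on the target, via rsample (tidymodels)
set.seed(11)
titanic_split <- initial_split(Titanic_clean, prop = 0.7, strata = Survived)
data_train_strat <- training(titanic_split)
data_test_strat  <- testing(titanic_split)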

library(rpart)      # to fit decision trees
library(rattle)     # for fancyRpartPlot()
library(rpart.plot) # to plot rpart trees

6 Building Model

tree.survival <-  rpart(Survived~., data=data_train, method = "class")
fancyRpartPlot(tree.survival)

We obtain a decision tree that predicts the survival of a Titanic passenger. Next, we investigate the quality of this model.
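
Before scoring the model on held-out data, rpart’s complexity-parameter table gives a first look at the tree’s size and at candidate prunings; a quick sketch:

# cross-validated error at each complexity-parameter value of the fitted tree
printcp(tree.survival)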

6.1 Evaluation

6.1.1 Data Test

tree.survival.predict <- predict(tree.survival, data_test, type="class")
confusionMatrix(as.factor(tree.survival.predict), as.factor(data_test$Survived))
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 126  33
#>          1  11  44
#>                                          
#>                Accuracy : 0.7944         
#>                  95% CI : (0.734, 0.8465)
#>     No Information Rate : 0.6402         
#>     P-Value [Acc > NIR] : 7.015e-07      
#>                                          
#>                   Kappa : 0.5239         
#>                                          
#>  Mcnemar's Test P-Value : 0.001546       
#>                                          
#>             Sensitivity : 0.9197         
#>             Specificity : 0.5714         
#>          Pos Pred Value : 0.7925         
#>          Neg Pred Value : 0.8000         
#>              Prevalence : 0.6402         
#>          Detection Rate : 0.5888         
#>    Detection Prevalence : 0.7430         
#>       Balanced Accuracy : 0.7456         
#>                                          
#>        'Positive' Class : 0              
#> 

6.1.2 Data Train

tree.survival.predict_train <- predict(tree.survival, data_train, type="class")
confusionMatrix(as.factor(tree.survival.predict_train), as.factor(data_train$Survived))
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 266  54
#>          1  21 157
#>                                           
#>                Accuracy : 0.8494          
#>                  95% CI : (0.8149, 0.8797)
#>     No Information Rate : 0.5763          
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.6851          
#>                                           
#>  Mcnemar's Test P-Value : 0.0002199       
#>                                           
#>             Sensitivity : 0.9268          
#>             Specificity : 0.7441          
#>          Pos Pred Value : 0.8312          
#>          Neg Pred Value : 0.8820          
#>              Prevalence : 0.5763          
#>          Detection Rate : 0.5341          
#>    Detection Prevalence : 0.6426          
#>       Balanced Accuracy : 0.8355          
#>                                           
#>        'Positive' Class : 0               
#> 

Here, we find that the model achieves higher accuracy on the training dataset (84%) than on the testing dataset (79%). This is a sign that our model is overfitting. To improve it, we could use an ensemble method or tune the tree's control parameters, such as minsplit, minbucket, or the complexity parameter cp in rpart.control(). One other weakness of a decision tree is that all predictor terms are by nature assumed to interact. Intuitively, to predict a “high risk of cancer”, the “age” variable has to combine with the “exposure to radiation” variable one node above it, which in turn has to combine with the “genetic” variable even further up the tree. This rule-based system can be too rigid for cases where the variables have close to no interactions.
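
As a minimal sketch of the parameter-tuning idea (the values below are illustrative, not tuned):

# a more conservative tree: require larger nodes before and after a split,
# and a higher complexity threshold for keeping a split
tree.tuned <- rpart(Survived ~ ., data = data_train, method = "class",
                    control = rpart.control(minsplit = 30, minbucket = 10, cp = 0.02))

# an already fitted tree can also be pruned back to a chosen cp value
tree.pruned <- prune(tree.survival, cp = 0.02)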