Kaggle Introduction

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Data Dictionary

Variable   Definition                                   Key
survival   Survival                                     0 = No, 1 = Yes
pclass     Ticket class                                 1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex
Age        Age in years
sibsp      # of siblings/spouses aboard the Titanic
parch      # of parents/children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).

parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch = 0 for them.

Executive summary

This is my first kernel, and below are the main points that characterize my work:

Getting data

Library import

library(kableExtra)
library(ggplot2) # Visualization
library(tibble)
library(tidyr)
library(dplyr) # Data manipulation
library(corrplot) # Correlation plot
library(pROC) # ROC curve
library(caret)
# library(party)
# library(HDoutliers)

Importing and cleaning Data

# Importing train/test dataset in R
train <- read.csv('../input/train.csv', header = TRUE, stringsAsFactors = FALSE, na.strings = c(""))
train$Set <- "train"
test <- read.csv('../input/test.csv', header = TRUE, stringsAsFactors = FALSE, na.strings = c(""))
test$Set <- "test"

# To streamline the cleaning process, the train and test datasets are bound together. The Set column added above keeps track of the original data source
# The test set has no outcome, so a placeholder Survived column is added to make the columns match
test$Survived <- NA
full <- rbind(train, test)
# Coercing features from char to factor
full$Sex <- as.factor(full$Sex)
full$Embarked <- as.factor(full$Embarked)
full$Pclass <- as.factor(full$Pclass)
# The levels of the Survived feature (outcome) are temporarily changed. The original ones are kept in a variable
full$Survived <- as.factor(full$Survived)
origLevels <- levels(full$Survived)
levels(full$Survived) <- c("dead", "survived")

Survival distribution

The distribution of the training passengers by Pclass and Sex is shown below:

# x = Pclass, bars split by Sex. Note that this counts all training passengers, not only the survivors
g <- ggplot(filter(full, !is.na(Survived)), aes(Pclass))
g <- g + geom_bar(colour = "black", aes(fill = Sex), position = "dodge")
g <- g + labs(title = "Survival distribution", x = "Pclass", y = "# passengers") + theme_bw()
g

At first glance it seems that males in the third class have a better chance of surviving than females in the same class. However, the bars count passengers rather than survival rates, and third class carried far more men than women, so this impression should be checked against the survival rate within each group.
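Raw counts and survival rates can tell different stories, so a per-group rate is a useful check on the impression above. The following is only a minimal sketch: the `toy` data frame is made up for illustration (it is not the Titanic data), and it assumes the `dead`/`survived` levels set earlier in this kernel.

```r
library(dplyr)

# Made-up toy rows, NOT real Titanic records: four men and two women in third class
toy <- data.frame(
  Pclass   = factor(c(3, 3, 3, 3, 3, 3)),
  Sex      = factor(c("male", "male", "male", "male", "female", "female")),
  Survived = factor(c("dead", "dead", "dead", "survived", "survived", "dead"),
                    levels = c("dead", "survived"))
)

# Passenger count and share of survivors within each Pclass/Sex group
survival_rates <- toy %>%
  group_by(Pclass, Sex) %>%
  summarise(passengers    = n(),
            survival_rate = mean(Survived == "survived"),
            .groups       = "drop")

print(survival_rates)
```

In this toy example the male group is twice as large as the female group, yet its survival rate is half as high. On the kernel's data the same summary could be run on `filter(full, Set == "train")` instead of `toy`.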

Missing value analysis:

Let’s first have a look at the percentage of missing values for each feature:

missingValue <- sapply(full, function(feature){
    # Share of NA entries, scaled to a percentage to match the plot labels
    100 * sum(is.na(feature))/length(feature)
})
missingValue <- data.frame(feature = names(missingValue), percentage = missingValue) %>% filter(percentage != 0)

g <- ggplot(missingValue,
            aes(reorder(feature, percentage),percentage))
g <- g + geom_bar(stat = "identity", fill="dark green", color="black")
g <- g + labs(title = "Missing values Overview [%]", x = "Feature", y = "Missing Values [%]")
g <- g + coord_flip() + theme_bw()
g

Almost 80% of the Cabin values are NA. These NAs may simply mean that the passenger had no cabin.

Cleaning data

Name: extracting Title and Family Name

A bunch of entries from the Name feature:

head(full$Name)
## [1] "Braund, Mr. Owen Harris"                            
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"                             
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
## [5] "Allen, Mr. William Henry"                           
## [6] "Moran, Mr. James"

Each Name has the form "Family name, Title. Given names", where the first string is the family name.

The idea behind this extraction is not to use these fields for training the model, but to use them to:

  • understand the family status (married or maiden);
  • determine whether or not the passenger is part of the crew;
  • handle NAs for the Age.

Regular expressions are used to extract Name.family and Name.title:

full$Name.family <- gsub(pattern = "([\\w|\\-| |']*), ([\\w| ]*\\.) (.*)", x=full$Name, replacement="\\1", ignore.case=TRUE, perl = TRUE)
full$Name.title <- gsub(pattern = "([\\w|\\-| |']*), ([\\w| ]*\\.) (.*)", x=full$Name, replacement="\\2", ignore.case=TRUE, perl = TRUE)
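To see what the three capture groups pick up, the same pattern can be tried on a few of the names printed above. This is a small sketch; `name_pattern`, `sample_names`, `family` and `title` are illustrative names, not part of the kernel.

```r
# Same pattern as in the kernel: (family name), (title ending in a dot) (rest of the name)
name_pattern <- "([\\w|\\-| |']*), ([\\w| ]*\\.) (.*)"

sample_names <- c("Braund, Mr. Owen Harris",
                  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
                  "Heikkinen, Miss. Laina")

# Replacing the whole match with one capture group keeps only that group
family <- gsub(pattern = name_pattern, x = sample_names, replacement = "\\1", perl = TRUE)
title  <- gsub(pattern = name_pattern, x = sample_names, replacement = "\\2", perl = TRUE)

print(data.frame(family, title))
```

Note that inside square brackets the `|` is a literal character rather than an alternation, so the classes simply allow word characters, hyphens, spaces, apostrophes and pipes. The title keeps its trailing dot, which is why the later comparisons are written against values such as "Mr." and "Miss.".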

Feature engineering: family name count:

Name.family <- full %>% group_by(Name.family) %>% summarise(Name.family.count = n())
full <- inner_join(full,Name.family, by = "Name.family")
knitr::kable(table(full$Name.title),"html") %>%
    kable_styling() %>%
    scroll_box(height = "200px")
Var1 Freq
Capt. 1
Col. 4
Don. 1
Dona. 1
Dr. 8
Jonkheer. 1
Lady. 1
Major. 2
Master. 61
Miss. 260
Mlle. 2
Mme. 1
Mr. 757
Mrs. 197
Ms. 2
Rev. 8
Sir. 1
the Countess. 1

As we can see from the last output, many titles have only a few occurrences. The less frequent titles are pooled into the more common ones:

full <- full %>% mutate(
    Title = as.factor(case_when(
        # Young boys less than 18 years old
        Name.title == "Master."                         ~ "Master",
        
        # Girls or unmarried women
        Name.title == "Miss." | 
            Name.title == "Mlle."                   ~ "Miss",
        
        # Married women
        Name.title == "Mrs."  | 
            Name.title == "Mme."  |
            Name.title == "Dona."                   ~ "Mrs",
        
        # Women with unspecified status
        Name.title == "Ms."           |
            Name.title == "Lady."         |
            Name.title == "the Countess." |
            Name.title == "Dr." & Sex == "female"   ~ "Mrs", # In this group because there are too few observations
        
        # Men
        Name.title == "Mr."       |
            Name.title == "Sir."      |
            Name.title == "Rev."      |
            Name.title == "Don."      |
            Name.title == "Jonkheer." |            
            Name.title == "Dr." & Sex == "male"     ~ "Mr",
        
        # Crew
        Name.title == "Capt." |
            Name.title == "Col."  |
            Name.title == "Major."                  ~ "Mr" # In this group because it wasn't significant in the final model
    ))
)

The bar chart below shows the distribution of the newly created Title feature:

g <- ggplot(full, aes(Title))
g <- g + geom_bar(color="black", aes(fill=Title))
g <- g + coord_flip() + labs(title = "Passenger titles", x = "Title", y = "Count") + theme_bw()
g

Ticket: Ticket number

Using the same approach as for the Title (checking the structure of the entries and using a regex), the Ticket.number is extracted:

full$Ticket.number <- gsub(pattern = "^((.*) )?(\\d*)?$", x=full$Ticket, replacement="\\3")
# Since we want to treat ticket numbers as integers, the tickets that consist only of LINE are replaced with 0
full$Ticket.number <- gsub(pattern = "^LINE$", x=full$Ticket.number, replacement="0")
full$Ticket.number <- as.integer(full$Ticket.number)
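The effect of the two substitutions is easier to follow on a handful of values. The sketch below wraps them in a hypothetical helper, `extract_ticket_number` (a name introduced here, not part of the kernel), and applies it to illustrative ticket strings chosen to cover the three shapes: prefix plus number, number only, and the special LINE entry.

```r
# Hypothetical helper wrapping the two substitutions used in the kernel
extract_ticket_number <- function(ticket) {
  # Drop an optional "prefix + space" and keep the trailing digits (capture group 3)
  number <- gsub(pattern = "^((.*) )?(\\d*)?$", x = ticket, replacement = "\\3")
  # Entries that are exactly "LINE" do not match the pattern above and become 0 here
  number <- gsub(pattern = "^LINE$", x = number, replacement = "0")
  as.integer(number)
}

# Illustrative ticket strings
sample_tickets <- c("A/5 21171", "PC 17599", "113803", "LINE")
ticket_numbers <- extract_ticket_number(sample_tickets)

print(ticket_numbers)
```

Any ticket that matches neither pattern would be left as text by the substitutions and turned into NA (with a warning) by `as.integer`, so it is worth checking `sum(is.na(full$Ticket.number))` after the conversion.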

Cabin: extracting the deck

full <- full %>% mutate(Cabin.Deck = as.factor(case_when(
    grepl("^A",Cabin) ~ "A",
    grepl("^B",Cabin) ~ "B",
    grepl("^C",Cabin) ~ "C",
    grepl("^D",Cabin) ~ "D",
    grepl("^E",Cabin) ~ "E",
    grepl("^F",Cabin) ~ "F",
    grepl("^G",Cabin) ~ "G",
    grepl("^T",Cabin) ~ "T",
    TRUE ~ "NONE"
)))
g <- ggplot(filter(full,Cabin.Deck != "NONE" & Set=="train"), aes(Cabin.Deck))
g <- g + geom_bar(color="black", aes(fill=Survived, alpha=Pclass), position = "dodge")
g <- g + labs(title = "Cabin Deck distribution", x = "Cabin Deck", y = "Count") + theme_bw()
g
## Warning: Using alpha for a discrete variable is not advised.

Missing Value imputation

Age

Two different imputation techniques are tried:

  • fitting a tree-based model using Age and Title;
  • replacing missing values with the median age of the passenger's Title group.

Median inside the Title group

Looking at the Age distribution for each Title, it seems reasonable to use the median Age of the passenger's Title group.
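A group-wise median imputation of this kind can be sketched with base R's `ave`. This is one possible way to do it, not necessarily the code used in this kernel; the `toy` data frame and the `Age.imputed` column are made up for illustration.

```r
# Made-up toy rows; in the kernel this would be the `full` data frame
toy <- data.frame(
  Title = c("Mr", "Mr", "Mr", "Miss", "Miss", "Miss"),
  Age   = c(30, NA, 40, 4, 20, NA)
)

# Within each Title group, replace NA ages with that group's median age
toy$Age.imputed <- ave(toy$Age, toy$Title, FUN = function(age) {
  ifelse(is.na(age), median(age, na.rm = TRUE), age)
})

print(toy)
```

Here the missing "Mr" age becomes 35 and the missing "Miss" age becomes 12, while known ages are left untouched. If every age in a group were missing, the median would itself be NA, so such groups would need a fallback such as the overall median.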