Variable	Definition	Key
survival	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
sex	Sex
Age	Age in years
sibsp	# of siblings/spouses aboard the Titanic
parch	# of parents/children aboard the Titanic
ticket	Ticket number
fare	Passenger fare
cabin	Cabin number
embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

Getting data

Library import

library(kableExtra)
library(ggplot2) # Visualization

## Warning: The package `vctrs` (>= 0.3.8) is required as of rlang 1.0.0.

library(tibble)
library(tidyr)

## Warning: package 'tidyr' was built under R version 3.5.2

library(dplyr) # Data manipulation
library(corrplot) # Correlation plot
library(pROC) # ROC curve
library(caret)

## Warning: package 'caret' was built under R version 3.5.2

# library(party)
# library(HDoutliers)

Importing and cleaning Data

# Importing train/test dataset in R
train <- read.csv('../input/train.csv', header = TRUE, stringsAsFactors = FALSE, na.strings = c(""))
train$Set = "train"
test <- read.csv('../input/test.csv', header = TRUE,stringsAsFactors = FALSE, na.strings = c(""))
test$Set = "test"

# To streamline the cleaning process, the train and test datasets are bound together.
# The Set column added above keeps track of the original data source; Survived is set to NA
# for the test rows so that both data frames have the same columns.
test$Survived <- NA
full <- rbind(train,test)

# Coercing features from char to factor
full$Sex <- as.factor(full$Sex)
full$Embarked <- as.factor(full$Embarked)
full$Pclass <- as.factor(full$Pclass)
# The levels of the Survived feature (outcome) are temporarily renamed. The original ones are kept in a variable
full$Survived <- as.factor(full$Survived)
origLevels <- levels(full$Survived)
levels(full$Survived) <- c("dead", "survived")

Survival distribution

The survival distribution for each Pclass and Sex is shown below:

# Only rows with a known outcome (the training set) are plotted, with one panel per outcome
g <- ggplot(filter(full, !is.na(Survived)), aes(x = Pclass, fill = Sex))
g <- g + geom_bar(colour = "black", position = "dodge") + facet_wrap(~ Survived)
g <- g + labs(title = "Survival distribution", x = "Pclass", y = "# passengers") + theme_bw()
g

Third class carried far more men than women, so raw counts alone can suggest that men in third class had a better chance of surviving than women in the same class. Splitting the bars by outcome guards against that reading: the comparison that matters is the survival rate within each Pclass and Sex group.

Missing value analysis:

Let’s first have a look at the percentage of missing values for each feature:

missingValue <- sapply(full, function(feature){
    # Share of NA entries, expressed as a percentage
    100 * sum(is.na(feature)) / length(feature)
})
missingValue <- data.frame(feature = names(missingValue), percentage = missingValue) %>% filter(percentage != 0)

g <- ggplot(missingValue,
            aes(reorder(feature, percentage),percentage))
g <- g + geom_bar(stat = "identity", fill="dark green", color="black")
g <- g + labs(title = "Missing values Overview [%]", x = "Feature", y = "Missing Values [%]")
g <- g + coord_flip() + theme_bw()
g

The Cabin feature has almost 80% NA values. The NA’s here may simply mean that the passenger had no cabin. Survived also shows up in the plot, but only because it was set to NA for the test rows.
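
The per-feature computation can be checked on a small example. The sketch below uses base R only and a made-up data frame (toy and missing_pct are illustrative names, and the values are not taken from the Titanic data):

```r
# Toy data frame standing in for `full` (values are made up for illustration)
toy <- data.frame(
  Age   = c(22, NA, 26, NA),
  Cabin = c(NA, "C85", NA, NA),
  Fare  = c(7.25, 71.28, 7.92, 8.05),
  stringsAsFactors = FALSE
)

# mean() of a logical vector is the share of TRUE values, so this is the
# percentage of NA entries in each column
missing_pct <- 100 * sapply(toy, function(feature) mean(is.na(feature)))

# Keep only the features that actually have missing values
missing_pct <- missing_pct[missing_pct > 0]
print(missing_pct)
```

Fare has no missing entries in the toy data, so it is dropped by the final filter, just as complete features are dropped from the plot.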

Cleaning data

Name: extracting Title and Family Name

A bunch of entries from the Name feature:

head(full$Name)

## [1] "Braund, Mr. Owen Harris"                            
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"                             
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
## [5] "Allen, Mr. William Henry"                           
## [6] "Moran, Mr. James"

Each Name is of the form String, Title. String, where the first String is the family name and the Title always ends with a period.

The idea behind this extraction is not to use them for training the model, but to use them in order to:

understand the family status (married or maiden);
determine whether or not the passenger is part of the crew;
handle NA’s for the Age.

Regular expressions are used in order to extract the Name.family and Name.title

full$Name.family <- gsub(pattern = "([\\w|\\-| |']*), ([\\w| ]*\\.) (.*)", x=full$Name, replacement="\\1", ignore.case=TRUE, perl = TRUE)
full$Name.title <- gsub(pattern = "([\\w|\\-| |']*), ([\\w| ]*\\.) (.*)", x=full$Name, replacement="\\2", ignore.case=TRUE, perl = TRUE)
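
To see what the two capture groups return, the same pattern can be applied to the first names printed above. This is a small base-R sketch; nms, family and title are illustrative variable names, not part of the original analysis:

```r
# First entries of full$Name, as printed above
nms <- c("Braund, Mr. Owen Harris",
         "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
         "Heikkinen, Miss. Laina")

# Same pattern as in the text:
# group 1 = family name, group 2 = title (ends with "."), group 3 = the rest
pattern <- "([\\w|\\-| |']*), ([\\w| ]*\\.) (.*)"

# The pattern matches the whole string, so replacing the match with a single
# group leaves just that group
family <- gsub(pattern, "\\1", nms, perl = TRUE)
title  <- gsub(pattern, "\\2", nms, perl = TRUE)

print(data.frame(family, title))
```

The title keeps its trailing period, which is why the frequency table of Name.title lists values such as "Mr." and "Mrs.".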

Feature engineering: family name count:

Name.family <- full %>% group_by(Name.family) %>% summarise(Name.family.count = n())
full <- inner_join(full,Name.family, by = "Name.family")

knitr::kable(table(full$Name.title),"html") %>%
    kable_styling() %>%
    scroll_box(height = "200px")

Var1	Freq
Capt.	1
Col.	4
Don.	1
Dona.	1
Dr.	8
Jonkheer.	1
Lady.	1
Major.	2
Master.	61
Miss.	260
Mlle.	2
Mme.	1
Mr.	757
Mrs.	197
Ms.	2
Rev.	8
Sir.	1
the Countess.	1

As we can see from the last output, many titles have only a few occurrences. The less frequent titles are pooled into the more common ones:

full <- full %>% mutate(
    Title = as.factor(case_when(
        # Young boys (less than 18 years old)
        Name.title == "Master." ~ "Master",

        # Girls or unmarried women
        Name.title %in% c("Miss.", "Mlle.") ~ "Miss",

        # Married women
        Name.title %in% c("Mrs.", "Mme.", "Dona.") ~ "Mrs",

        # Women with unspecified status
        # (pooled with "Mrs" because there are too few observations)
        Name.title %in% c("Ms.", "Lady.", "the Countess.") |
            (Name.title == "Dr." & Sex == "female") ~ "Mrs",

        # Men
        Name.title %in% c("Mr.", "Sir.", "Rev.", "Don.", "Jonkheer.") |
            (Name.title == "Dr." & Sex == "male") ~ "Mr",

        # Crew
        # (pooled with "Mr" because a separate level was not significant in the final model)
        Name.title %in% c("Capt.", "Col.", "Major.") ~ "Mr"
    ))
)

The bar chart of the newly created Title feature is shown below:

g <- ggplot(full, aes(Title))
g <- g + geom_bar(color="black", aes(fill=Title))
g <- g + coord_flip() + labs(title = "Passenger titles", x = "Title", y = "Count") + theme_bw()
g

Ticket: Ticket number

Using the same approach as for the Title (checking the structure of the entries and using a regex), the Ticket.number is extracted:

full$Ticket.number <- gsub(pattern = "^((.*) )?(\\d*)?$", x=full$Ticket, replacement="\\3")
# Since we want to treat them as numbers, tickets that consist only of the string LINE are replaced with 0
full$Ticket.number <- gsub(pattern = "^LINE$", x=full$Ticket.number, replacement="0")
full$Ticket.number <- as.integer(full$Ticket.number)
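
The behaviour of the ticket pattern is easiest to follow on a few values. The sketch below reuses the two substitutions from the text on a hand-picked vector (tickets and number are illustrative names, and the sample strings are chosen to cover a prefixed ticket, a bare number and a LINE ticket):

```r
# Sample ticket strings: with a prefix, without a prefix, and the special LINE value
tickets <- c("A/5 21171", "PC 17599", "113803", "LINE", "STON/O2. 3101282")

# Group 3 captures the trailing digits; an optional prefix followed by a space is discarded.
# "LINE" does not match the pattern at all, so it is left unchanged by this step.
number <- gsub(pattern = "^((.*) )?(\\d*)?$", x = tickets, replacement = "\\3")

# LINE tickets carry no number, so they are mapped to 0 before the conversion
number <- gsub(pattern = "^LINE$", x = number, replacement = "0")
number <- as.integer(number)

print(number)
```

Without the second substitution, as.integer() would turn "LINE" into NA with a coercion warning.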

Cabin: extracting the deck

full <- full %>% mutate(Cabin.Deck = as.factor(case_when(
    grepl("^A",Cabin) ~ "A",
    grepl("^B",Cabin) ~ "B",
    grepl("^C",Cabin) ~ "C",
    grepl("^D",Cabin) ~ "D",
    grepl("^E",Cabin) ~ "E",
    grepl("^F",Cabin) ~ "F",
    grepl("^G",Cabin) ~ "G",
    grepl("^T",Cabin) ~ "T",
    TRUE ~ "NONE"
)))
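
The deck is simply the first character of the Cabin string. A more compact base-R sketch of the same mapping is shown here, under the assumption that every non-missing Cabin starts with one of the deck letters listed above (cabins and deck are illustrative names; unlike the case_when version, an unexpected first letter would be kept rather than mapped to "NONE"):

```r
# Sample Cabin values, including a missing one and a multi-part entry
cabins <- c("C85", NA, "E46", "F G73", "T")

# The first character is the deck letter; substr() returns NA for missing cabins
deck <- substr(cabins, 1, 1)

# Missing cabins get their own level, mirroring the TRUE ~ "NONE" fallback
deck[is.na(deck)] <- "NONE"
deck <- as.factor(deck)

print(table(deck))
```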

g <- ggplot(filter(full,Cabin.Deck != "NONE" & Set=="train"), aes(Cabin.Deck))
g <- g + geom_bar(color="black", aes(fill=Survived, alpha=Pclass), position = "dodge")
g <- g + labs(title = "Cabin Deck distribution", x = "Cabin Deck", y = "Count") + theme_bw()
g

## Warning: Using alpha for a discrete variable is not advised.

Missing Value imputation

Age

Two different imputation techniques are tried:

fitting a tree-based model using Age and Title;
replacing missing values with the median Age of the Title group the passenger belongs to.

Median inside the Title group

Looking at the Age distribution for each Title, it seems reasonable to use the median Age of the Title group the passenger belongs to.
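
A minimal sketch of this group-wise median imputation is given below. It uses base R and a made-up data frame rather than the author's code; toy, impute_median and Age.imputed are hypothetical names introduced for illustration:

```r
# Toy data standing in for `full` (ages are made up for illustration)
toy <- data.frame(
  Title = c("Mr", "Mr", "Mr", "Miss", "Miss", "Master", "Master"),
  Age   = c(30, NA, 40, 20, NA, 4, NA),
  stringsAsFactors = FALSE
)

# Replace the NAs of one group with that group's median (computed without the NAs)
impute_median <- function(age) {
  age[is.na(age)] <- median(age, na.rm = TRUE)
  age
}

# ave() applies the function within each Title group and keeps the original row order
toy$Age.imputed <- ave(toy$Age, toy$Title, FUN = impute_median)

print(toy)
```

Writing the result to a new column keeps the original Age available, which makes it easy to compare this approach with the tree-based one.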
