The Data

The below kernel summarises an approach I took to the infamous Titanic challenge

A number of predictive techniques are implemented, comparing cross-validated results to submission results:

Data set-up

First, lets load-in our dependencies :

library(plyr) # means
library(dplyr) # data-cleaning
library(stringr) # data-cleaning
library(naniar) # missing data visualisation
library(ggplot2) # visualising predictors
library(rpart) # decision trees
library(rattle) # visualise decision trees
library(rpart.plot) # visualise decision trees
library(randomForest) # random forests
library(party) # conditional inference trees
library(partykit) # conditional inference trees
library(caret) # support vector machines 
library(e1071) # support vector machines
library(RColorBrewer)
library(Hmisc)
library(caTools) # cross validation 
library(kableExtra) # markdown tables 

Then load in our train & test data-sets

# Read in .csv files
train <- read.csv("train.csv")
test <- read.csv("test.csv")

# Create survived column in test set
test$Survived <- NA
# Bind test and train data to create new variables in both sets
combined <- rbind(train, test)
# Check data
str(combined)

Data Cleaning

Let’s first examine which variables are missing data

# Convert all blanks to NA
combined[combined==""]  <- NA 
# Check our missing data situation 
miss <- sapply(combined[-2], function(x) sum(is.na(x)))
x
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 263
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 1014
Embarked 2

“Age” & “Cabin” are missing significant amounts of data, with a slightly higher proportion of cabin details missing for males

We can check whether missing cabin information is spread evenly across passenger “class”

Perhaps unsuprisingly, there is more missing information in the lower classes. I’m going to presume better historical records are maintained for affluent passengers

Let’s examine missing data for Cabin x SibSp (i.e. number of siblings)

Missing data for Cabin appears to rise as SibSp rises (i.e. more siblings == more missing data). This may suggest a systematic bias in record keeping of higher-occupancy cabins.

Generally, a feature with this much missing is a lost cause, & risks significantly biasing a model. Potentially a feature could be extracted utilising cabin details for first class passengers as this data is more complete. However, I opted not to proceed further with this variable

Passenger’s age has ~ 20% missing data, & therefore is suitable for missing value imputation. Below I have fitted a decision tree to predict missing values, utilising other variables as predictors of missing values. A decision tree was Previous favored over linear regression as previous attempts were sub-optimal, producing negative age values

# Use a decision tree to predict missing values
# Create DF with missing age values removed 
NoMissing <- combined %>% filter(!is.na(Age))
# Create a predictive model for age
fit.missing <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked, 
                  data = NoMissing)
# Examine the model 
summary(fit.missing)
# Examine the coefficients
fit.missing
# Create 'age predicted' column from fit.missing, into initial data-set
combined <- combined %>% 
  mutate(age.predicted = predict(fit.missing, .)) %>% 
  # Replace NAs in combinedd$Age, with age.predicted 
  mutate(Age = ifelse(is.na(Age), age.predicted, Age))
# Remove predicted age column 
combined$age.predicted <- NULL
print(fit.missing)

# Replace any values below 1 with 1 to keep consistent with age-data
combined$Age[combined$Age < 1] <- 1

A small number of data-points were missing from port of embarkment, which have been corrected

# replace missing embarked values
combined$Embarked[c(62,830)] = "S"
combined$Embarked <- factor(combined$Embarked)

With one missing fare value, we can safely replace this with the median without biasing our model

# replace missing fare value 
combined$Fare[1044] <- median(combined$Fare, na.rm = TRUE)

Finally, I have made some corrections to some implausible values that previous kernels have identified

# Data-set errors
# The following fixes SibSp/Parch values for two passengers (Id=280 and Id=1284) according to this kernel 
# because a 16 year old can’t have a 13 year old son! 

combined$SibSp[combined$PassengerId == 280] = 0
combined$Parch[combined$PassengerId == 280] = 2
combined$SibSp[combined$PassengerId == 1284] = 1
combined$Parch[combined$PassengerId == 1284] = 1

Data exploration and visualisation

The below section utilises density plots to visualise the relationships between various predictors of survival

Survival x Age

Below you can see that passengers survival status was fairly evenly distributed across the age-span of passengers, with a slightly higher density of survivors under 10 years of age

Sex x Survival

Using a stacked bar chart we can visualise passengers survival status x sex

Fairly clearly, (a) there were more male passengers on the titanic, & (b) survival rates were much higher for female passengers

Sex x Age x Survival

Below are two density plots plotting passenger Age x Sex for (a) surviving passengers, & (b) non-surviving passengers

For survivors, the distribution is fairly similar for males & females, with slightly more young males surviving

For non-survivors, the distribution of age x gender is more disparate. Proportionally more females than males under 20 years of age died, whilst over the age of 20, more males died

Class x Survival

There were three passengers classes on the titanic - 1, 2 & 3. Somewhat unsuprisingly, more passengers died in the poorer passenger classes

Class x Age x Survival

The distribution of passenger’s age for each cabin class is close to identical for survivors and non-survivors

Some noteworthy differences are a higher density of surviving passengers < 10 years of age in class (2), and a higher density of non-surviving passengers > 60 in class (1)

Embarked x Survival

Passengers boarded the Titanic from three ports of embarkment - Southampton, Cherbourg & Queenstown. The vast majority of passengers were from the first port of embarkment (Southampton).

There were more non-surviving Passengers embarking from Southampton & Queenstown, but more surviving passengers from Cherbourg. As such, port of embarkment appears related to survival probability

Embarked x Age x Survival

Density plots of port of embarkment by age show some marked differences between survivors & non-survivors

For Southampton & Cherbourg passengers, a flatter age distribution is notable for survivors compared to non-survivors. Those who survived were randomly distributed across the age spectrum, compared to deceased passengers who were more concentrated around the mean age (~ 30 for these two ports).

This pattern does not exist for Queenstown, with narrow age distributions observed for both survivors and non-survivors. In fact, the distribution for survivors from Queenstown is especially narrow, with all survivors being aged between 10 & 40. Non-survivors from Queenstown cover the entire age spectrum from infants through to the elderly

Why this is the case is unclear. With only 77 passengers embarking from Queenstown, perhaps certain travel parties survived or died together. The surviving parties potentially having no dependents (i.e. very young children or very elderly relatives). In any case, it shows port of embarkment is potentially an important predictor

Feature Extraction

Title

An obvious, and routinely extracted feature from the data-set is passengers title. It can be fairly simply derived via regex

# change to character
combined$Name <- as.character(combined$Name)

# Extract title from passenger name
combined$Title <- gsub("^.*, (.*?)\\..*$", "\\1", combined$Name)

Let’s examine the breakdown of passenger titles by sex:

Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme Mr Mrs Ms Rev Sir the Countess
female 0 0 0 1 1 0 1 0 0 260 2 1 0 197 2 0 0 1
male 1 4 1 0 7 1 0 2 61 0 0 0 757 0 0 8 1 0

Some of the rarer titles can be aggregated

# Re-assign female categories 
combined$Title[combined$Title == 'Mlle' | combined$Title == 'Ms'] <- 'Miss' 
combined$Title[combined$Title == 'Mme']  <- 'Mrs' 

# Concatenate rare titles, potential proxi for high spcoa; standing 
Other <- c('Dona', 'Dr', 'Lady', 'the Countess','Capt', 'Col', 'Don', 'Jonkheer', 'Major', 'Rev', 'Sir')

combined$Title[combined$Title %in% Other]  <- 'Other'

Let’s examine the breakdown …

Master Miss Mr Mrs Other
female 0 264 0 198 4
male 61 0 757 0 25

Much nicer!

Let’s also look at the number of survivors & non-survivors by title

In essence, ‘title’ nicely conveys some of the interactive effects of Age x Sex

We know males are more likely to have died on the titanic than females, but evidently, young males with title “master” had a far better chance of survival than adult males

Family size

# Combine siblings + parents/children 
combined$FamilySize <- combined$SibSp + combined$Parch + 1

Family unit

# Family Unit
# Extract Surnames 
combined$Surname <- sapply(combined$Name, 
                           FUN=function(x) 
                           {strsplit(x, split='[,]')[[1]][1]})

# Combine with Family Size
combined$FamilyUnit <- paste(combined$Surname, "-", combined$FamilySize)

# How many unique families?
length(unique(combined$FamilyUnit))
## [1] 928

As expected there are many different family combinations … 928 infact! This is inpractical for predictive purposes, so let’s try and reduce the number of levels of this feature by combining ‘small’ and ‘medium’ sized families

# Re-code families undr 2 as 'small families'
combined$FamilyUnit[combined$FamilySize <= 2] <- "Small"
combined$FamilyUnit[combined$FamilySize > 2 & combined$FamilySize <= 4] <- 'Medium'

# remove surname variable 
combined$Surname <- NULL

Let’s examine the results of combining all family sizes under 2, and between 2 & 4 people:

Var1 Freq
Andersson - 7 9
Asplund - 7 7
Ford - 5 5
Fortune - 6 6
Goodwin - 8 8
Hocking - 5 1
Kink-Heilmann - 5 1
Lefebre - 5 5
Medium 202
Palsson - 5 5
Panula - 6 6
Rice - 6 6
Richards - 6 1
Ryerson - 5 5
Sage - 11 11
Skoog - 6 6
Small 1025

Much better!

Age Group

# Age Group
combined$AgeGroup <- NA
combined$AgeGroup[combined$Age <= 12] <- "Child"
combined$AgeGroup[combined$Age > 12 & combined$Age <= 18] <- "Adolescent"
combined$AgeGroup[combined$Age > 18 & combined$Age <= 59] <- "Adult"
combined$AgeGroup[combined$Age >= 60] <- "Elderly Adult"

Below is a table summarising the percentage of surviving passengers by AgeGroup & Sex

Interestingly, all elderly women (arbitrarily defined as people over 60) survived. By contrast, elderly men were over-whelmingly unlikely to surive, whilst male children were slightly more likely to survive than female children. We will use this variable in some later feature engineering

AgeGroup Sex Survived
Adolescent female 75.00
Adult female 77.54
Child female 50.00
Elderly Adult female 100.00
Adolescent male 8.82
Adult male 17.12
Child male 50.00
Elderly Adult male 13.64

Travel party size

Below we can approximate groups of passengers who travelled together, by extracting common string-names in the ticket, and common ticket numbers. This can be achieved by grouping passengers with commmon ticket details, aggregating the number of passengers with these ticket details, then re-combining this detail as a grouping variable with our original dataframe

# Extracts out any letters before first space
# If no letters, no extraction 
reg.ex2 <- ".*\\s|^[A-Z].*(?![0-9])$" 
combined$TicketPre <- str_extract(combined$Ticket, reg.ex2)

# For tickets that do not begin with a character string ...
combined$TicketPre <- gsub("\\.|\\s|/", "", combined$TicketPre)
# Impute "_"
combined$TicketPre[is.na(combined$TicketPre)] <- "_"
## Convert to factor
combined$TicketPre <- factor(combined$TicketPre)

# Second half of ticket string 
reg.ex3 <- "\\s[0-9]{1,}|^(?![A-Z]).*"
combined$TicketNumber <- str_extract(combined$Ticket, reg.ex3)
combined$TicketNumber[is.na(combined$TicketNumber)] <- "0"
combined$TicketNumber <- as.numeric(combined$TicketNumber)
combined$TicketNumber <- as.factor(combined$TicketNumber)

# Based upon common ticket numbers between passengers
# Calculate size of travel party 
detach("package:plyr")
library(dplyr)
Ticket.Agg <- combined %>% 
  group_by(TicketNumber) %>% 
  summarise(Travel.Party.Size = n())

# Convert to dataframe
Ticket.Agg <- data.frame(Ticket.Agg)

Ticket.Agg$TicketNumber <- as.factor(Ticket.Agg$TicketNumber)

# Left Join
combined <- left_join(combined, Ticket.Agg, by = "TicketNumber")

# Seems highly unlikley this ticket is a travel party of 17
# Change to 1
combined$Travel.Party.Size[combined$TicketNumber== 2] = 1

combined$TicketPre <- NULL 
combined$TicketNumber <- NULL

Let’s examine our breakdown of travel parties on the Titanic

Var1 Freq
1 707
2 266
3 147
4 68
5 35
6 24
7 35
8 16
11 11

Change variable types

# As factor
combined$Title <- factor(combined$Title)
combined$AgeGroup <- factor(combined$AgeGroup)
combined$Pclass <- factor(combined$Pclass)
combined$FamilyUnit <- factor(combined$FamilyUnit)
# As numeric
combined$SibSp <- as.numeric(combined$SibSp)
combined$Parch <- as.numeric(combined$Parch)
combined$PassengerId <- as.numeric(combined$PassengerId)

Feature Engineering

Finally, lets create two further binary predictors, by combining a number of our existing variables

Firstly, a factor variable delineating if a passenger is female, and in a travel part of more than two people

# Female Group ***
combined$Female.Group <- as.factor(ifelse(combined$Sex == "female" &
                                          combined$Travel.Party.Size > 2, 1,0))

Below you can see 72% of females who travelled in groups survived

Female.Group Survived
0 33.07
1 72.50

Secondly, a binary predictor of males who had the title ‘master’ (a proxi for being a child)

# Male & Master
combined$Master.Male <- as.factor(ifelse(combined$Sex == "male" &
                                         combined$Title == "Master", 1,0))

In the dataset, 57% of young males survived, much better than the average survival rate of males

Master.Male Survived
0 37.49
1 57.50