The below kernel summarises an approach I took to the infamous Titanic challenge
A number of predictive techniques are implemented, comparing cross-validated results to submission results:
First, lets load-in our dependencies :
library(plyr) # means
library(dplyr) # data-cleaning
library(stringr) # data-cleaning
library(naniar) # missing data visualisation
library(ggplot2) # visualising predictors
library(rpart) # decision trees
library(rattle) # visualise decision trees
library(rpart.plot) # visualise decision trees
library(randomForest) # random forests
library(party) # conditional inference trees
library(partykit) # conditional inference trees
library(caret) # support vector machines
library(e1071) # support vector machines
library(RColorBrewer)
library(Hmisc)
library(caTools) # cross validation
library(kableExtra) # markdown tables
Then load in our train & test data-sets
# Read in .csv files
train <- read.csv("train.csv")
test <- read.csv("test.csv")
# Create survived column in test set
test$Survived <- NA
# Bind test and train data to create new variables in both sets
combined <- rbind(train, test)
# Check data
str(combined)
Let’s first examine which variables are missing data
# Convert all blanks to NA
combined[combined==""] <- NA
# Check our missing data situation
miss <- sapply(combined[-2], function(x) sum(is.na(x)))
| x | |
|---|---|
| PassengerId | 0 |
| Pclass | 0 |
| Name | 0 |
| Sex | 0 |
| Age | 263 |
| SibSp | 0 |
| Parch | 0 |
| Ticket | 0 |
| Fare | 1 |
| Cabin | 1014 |
| Embarked | 2 |
“Age” & “Cabin” are missing significant amounts of data, with a slightly higher proportion of cabin details missing for males
We can check whether missing cabin information is spread evenly across passenger “class”
Perhaps unsuprisingly, there is more missing information in the lower classes. I’m going to presume better historical records are maintained for affluent passengers
Let’s examine missing data for Cabin x SibSp (i.e. number of siblings)
Missing data for Cabin appears to rise as SibSp rises (i.e. more siblings == more missing data). This may suggest a systematic bias in record keeping of higher-occupancy cabins.
Generally, a feature with this much missing is a lost cause, & risks significantly biasing a model. Potentially a feature could be extracted utilising cabin details for first class passengers as this data is more complete. However, I opted not to proceed further with this variable
Passenger’s age has ~ 20% missing data, & therefore is suitable for missing value imputation. Below I have fitted a decision tree to predict missing values, utilising other variables as predictors of missing values. A decision tree was Previous favored over linear regression as previous attempts were sub-optimal, producing negative age values
# Use a decision tree to predict missing values
# Create DF with missing age values removed
NoMissing <- combined %>% filter(!is.na(Age))
# Create a predictive model for age
fit.missing <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked,
data = NoMissing)
# Examine the model
summary(fit.missing)
# Examine the coefficients
fit.missing
# Create 'age predicted' column from fit.missing, into initial data-set
combined <- combined %>%
mutate(age.predicted = predict(fit.missing, .)) %>%
# Replace NAs in combinedd$Age, with age.predicted
mutate(Age = ifelse(is.na(Age), age.predicted, Age))
# Remove predicted age column
combined$age.predicted <- NULL
print(fit.missing)
# Replace any values below 1 with 1 to keep consistent with age-data
combined$Age[combined$Age < 1] <- 1
A small number of data-points were missing from port of embarkment, which have been corrected
# replace missing embarked values
combined$Embarked[c(62,830)] = "S"
combined$Embarked <- factor(combined$Embarked)
With one missing fare value, we can safely replace this with the median without biasing our model
# replace missing fare value
combined$Fare[1044] <- median(combined$Fare, na.rm = TRUE)
Finally, I have made some corrections to some implausible values that previous kernels have identified
# Data-set errors
# The following fixes SibSp/Parch values for two passengers (Id=280 and Id=1284) according to this kernel
# because a 16 year old can’t have a 13 year old son!
combined$SibSp[combined$PassengerId == 280] = 0
combined$Parch[combined$PassengerId == 280] = 2
combined$SibSp[combined$PassengerId == 1284] = 1
combined$Parch[combined$PassengerId == 1284] = 1
The below section utilises density plots to visualise the relationships between various predictors of survival
Below you can see that passengers survival status was fairly evenly distributed across the age-span of passengers, with a slightly higher density of survivors under 10 years of age
Using a stacked bar chart we can visualise passengers survival status x sex
Fairly clearly, (a) there were more male passengers on the titanic, & (b) survival rates were much higher for female passengers
Below are two density plots plotting passenger Age x Sex for (a) surviving passengers, & (b) non-surviving passengers
For survivors, the distribution is fairly similar for males & females, with slightly more young males surviving
For non-survivors, the distribution of age x gender is more disparate. Proportionally more females than males under 20 years of age died, whilst over the age of 20, more males died
There were three passengers classes on the titanic - 1, 2 & 3. Somewhat unsuprisingly, more passengers died in the poorer passenger classes
The distribution of passenger’s age for each cabin class is close to identical for survivors and non-survivors
Some noteworthy differences are a higher density of surviving passengers < 10 years of age in class (2), and a higher density of non-surviving passengers > 60 in class (1)
Passengers boarded the Titanic from three ports of embarkment - Southampton, Cherbourg & Queenstown. The vast majority of passengers were from the first port of embarkment (Southampton).
There were more non-surviving Passengers embarking from Southampton & Queenstown, but more surviving passengers from Cherbourg. As such, port of embarkment appears related to survival probability
Density plots of port of embarkment by age show some marked differences between survivors & non-survivors
For Southampton & Cherbourg passengers, a flatter age distribution is notable for survivors compared to non-survivors. Those who survived were randomly distributed across the age spectrum, compared to deceased passengers who were more concentrated around the mean age (~ 30 for these two ports).
This pattern does not exist for Queenstown, with narrow age distributions observed for both survivors and non-survivors. In fact, the distribution for survivors from Queenstown is especially narrow, with all survivors being aged between 10 & 40. Non-survivors from Queenstown cover the entire age spectrum from infants through to the elderly
Why this is the case is unclear. With only 77 passengers embarking from Queenstown, perhaps certain travel parties survived or died together. The surviving parties potentially having no dependents (i.e. very young children or very elderly relatives). In any case, it shows port of embarkment is potentially an important predictor
An obvious, and routinely extracted feature from the data-set is passengers title. It can be fairly simply derived via regex
# change to character
combined$Name <- as.character(combined$Name)
# Extract title from passenger name
combined$Title <- gsub("^.*, (.*?)\\..*$", "\\1", combined$Name)
Let’s examine the breakdown of passenger titles by sex:
| Capt | Col | Don | Dona | Dr | Jonkheer | Lady | Major | Master | Miss | Mlle | Mme | Mr | Mrs | Ms | Rev | Sir | the Countess | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| female | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 260 | 2 | 1 | 0 | 197 | 2 | 0 | 0 | 1 |
| male | 1 | 4 | 1 | 0 | 7 | 1 | 0 | 2 | 61 | 0 | 0 | 0 | 757 | 0 | 0 | 8 | 1 | 0 |
Some of the rarer titles can be aggregated
# Re-assign female categories
combined$Title[combined$Title == 'Mlle' | combined$Title == 'Ms'] <- 'Miss'
combined$Title[combined$Title == 'Mme'] <- 'Mrs'
# Concatenate rare titles, potential proxi for high spcoa; standing
Other <- c('Dona', 'Dr', 'Lady', 'the Countess','Capt', 'Col', 'Don', 'Jonkheer', 'Major', 'Rev', 'Sir')
combined$Title[combined$Title %in% Other] <- 'Other'
Let’s examine the breakdown …
| Master | Miss | Mr | Mrs | Other | |
|---|---|---|---|---|---|
| female | 0 | 264 | 0 | 198 | 4 |
| male | 61 | 0 | 757 | 0 | 25 |
Much nicer!
Let’s also look at the number of survivors & non-survivors by title
In essence, ‘title’ nicely conveys some of the interactive effects of Age x Sex
We know males are more likely to have died on the titanic than females, but evidently, young males with title “master” had a far better chance of survival than adult males
# Combine siblings + parents/children
combined$FamilySize <- combined$SibSp + combined$Parch + 1
# Family Unit
# Extract Surnames
combined$Surname <- sapply(combined$Name,
FUN=function(x)
{strsplit(x, split='[,]')[[1]][1]})
# Combine with Family Size
combined$FamilyUnit <- paste(combined$Surname, "-", combined$FamilySize)
# How many unique families?
length(unique(combined$FamilyUnit))
## [1] 928
As expected there are many different family combinations … 928 infact! This is inpractical for predictive purposes, so let’s try and reduce the number of levels of this feature by combining ‘small’ and ‘medium’ sized families
# Re-code families undr 2 as 'small families'
combined$FamilyUnit[combined$FamilySize <= 2] <- "Small"
combined$FamilyUnit[combined$FamilySize > 2 & combined$FamilySize <= 4] <- 'Medium'
# remove surname variable
combined$Surname <- NULL
Let’s examine the results of combining all family sizes under 2, and between 2 & 4 people:
| Var1 | Freq |
|---|---|
| Andersson - 7 | 9 |
| Asplund - 7 | 7 |
| Ford - 5 | 5 |
| Fortune - 6 | 6 |
| Goodwin - 8 | 8 |
| Hocking - 5 | 1 |
| Kink-Heilmann - 5 | 1 |
| Lefebre - 5 | 5 |
| Medium | 202 |
| Palsson - 5 | 5 |
| Panula - 6 | 6 |
| Rice - 6 | 6 |
| Richards - 6 | 1 |
| Ryerson - 5 | 5 |
| Sage - 11 | 11 |
| Skoog - 6 | 6 |
| Small | 1025 |
Much better!
# Age Group
combined$AgeGroup <- NA
combined$AgeGroup[combined$Age <= 12] <- "Child"
combined$AgeGroup[combined$Age > 12 & combined$Age <= 18] <- "Adolescent"
combined$AgeGroup[combined$Age > 18 & combined$Age <= 59] <- "Adult"
combined$AgeGroup[combined$Age >= 60] <- "Elderly Adult"
Below is a table summarising the percentage of surviving passengers by AgeGroup & Sex
Interestingly, all elderly women (arbitrarily defined as people over 60) survived. By contrast, elderly men were over-whelmingly unlikely to surive, whilst male children were slightly more likely to survive than female children. We will use this variable in some later feature engineering
| AgeGroup | Sex | Survived |
|---|---|---|
| Adolescent | female | 75.00 |
| Adult | female | 77.54 |
| Child | female | 50.00 |
| Elderly Adult | female | 100.00 |
| Adolescent | male | 8.82 |
| Adult | male | 17.12 |
| Child | male | 50.00 |
| Elderly Adult | male | 13.64 |
Below we can approximate groups of passengers who travelled together, by extracting common string-names in the ticket, and common ticket numbers. This can be achieved by grouping passengers with commmon ticket details, aggregating the number of passengers with these ticket details, then re-combining this detail as a grouping variable with our original dataframe
# Extracts out any letters before first space
# If no letters, no extraction
reg.ex2 <- ".*\\s|^[A-Z].*(?![0-9])$"
combined$TicketPre <- str_extract(combined$Ticket, reg.ex2)
# For tickets that do not begin with a character string ...
combined$TicketPre <- gsub("\\.|\\s|/", "", combined$TicketPre)
# Impute "_"
combined$TicketPre[is.na(combined$TicketPre)] <- "_"
## Convert to factor
combined$TicketPre <- factor(combined$TicketPre)
# Second half of ticket string
reg.ex3 <- "\\s[0-9]{1,}|^(?![A-Z]).*"
combined$TicketNumber <- str_extract(combined$Ticket, reg.ex3)
combined$TicketNumber[is.na(combined$TicketNumber)] <- "0"
combined$TicketNumber <- as.numeric(combined$TicketNumber)
combined$TicketNumber <- as.factor(combined$TicketNumber)
# Based upon common ticket numbers between passengers
# Calculate size of travel party
detach("package:plyr")
library(dplyr)
Ticket.Agg <- combined %>%
group_by(TicketNumber) %>%
summarise(Travel.Party.Size = n())
# Convert to dataframe
Ticket.Agg <- data.frame(Ticket.Agg)
Ticket.Agg$TicketNumber <- as.factor(Ticket.Agg$TicketNumber)
# Left Join
combined <- left_join(combined, Ticket.Agg, by = "TicketNumber")
# Seems highly unlikley this ticket is a travel party of 17
# Change to 1
combined$Travel.Party.Size[combined$TicketNumber== 2] = 1
combined$TicketPre <- NULL
combined$TicketNumber <- NULL
Let’s examine our breakdown of travel parties on the Titanic
| Var1 | Freq |
|---|---|
| 1 | 707 |
| 2 | 266 |
| 3 | 147 |
| 4 | 68 |
| 5 | 35 |
| 6 | 24 |
| 7 | 35 |
| 8 | 16 |
| 11 | 11 |
# As factor
combined$Title <- factor(combined$Title)
combined$AgeGroup <- factor(combined$AgeGroup)
combined$Pclass <- factor(combined$Pclass)
combined$FamilyUnit <- factor(combined$FamilyUnit)
# As numeric
combined$SibSp <- as.numeric(combined$SibSp)
combined$Parch <- as.numeric(combined$Parch)
combined$PassengerId <- as.numeric(combined$PassengerId)
Finally, lets create two further binary predictors, by combining a number of our existing variables
Firstly, a factor variable delineating if a passenger is female, and in a travel part of more than two people
# Female Group ***
combined$Female.Group <- as.factor(ifelse(combined$Sex == "female" &
combined$Travel.Party.Size > 2, 1,0))
Below you can see 72% of females who travelled in groups survived
| Female.Group | Survived |
|---|---|
| 0 | 33.07 |
| 1 | 72.50 |
Secondly, a binary predictor of males who had the title ‘master’ (a proxi for being a child)
# Male & Master
combined$Master.Male <- as.factor(ifelse(combined$Sex == "male" &
combined$Title == "Master", 1,0))
In the dataset, 57% of young males survived, much better than the average survival rate of males
| Master.Male | Survived |
|---|---|
| 0 | 37.49 |
| 1 | 57.50 |