The Titanic left the port of Queenstown (now Cobh), Ireland, on April 11, 1912, with 2224 passengers and crew on board 1. In the early morning of Monday, 15 April 1912, the “unsinkable” ship sank after a collision with an iceberg in the Atlantic Ocean; consequently, more than two thirds of the people on board lost their lives, as only a very limited number of them were able to get into the available lifeboats.
Given the insufficient number of means of survival, getting a place in one of these lifeboats was governed by considerations ranging from the ethical “women and children first” (based on the physical and personal situation of each passenger) to the inequitable “Only money can build roads over the seas/oceans” 2 (based on the social class or economic situation of each passenger).
Based on these considerations, the purpose of this project is to predict the survival of a test subset of passengers using machine learning. To this end, I “dive” into the training dataset provided by kaggle.com within the framework of its prediction competitions to identify and engineer the most important predictors.
trainVarTypes <- c("integer", "factor", "factor", "character", "factor", "numeric", "integer", "integer", "character", "numeric", "character", "factor") # respectively: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
testVarTypes <- trainVarTypes[-2]
train <- read.csv("train.csv", colClasses = trainVarTypes, na.strings = c("NA", ""))
test <- read.csv("test.csv", colClasses = testVarTypes, na.strings = c("NA", ""))
train$Survived <- factor(train$Survived, levels = c(0,1), labels = c("no", "yes"))
The train dataset consists of 891 observations of 12 variables describing each passenger. From these variables we know: whether the passenger survived (yes/no), his/her class on board (1st, 2nd or 3rd), name (including the title), sex (female/male), age, ticket number, fare, cabin number, and where he/she embarked.
In addition, the variables SibSp and Parch represent, respectively, the number of siblings/spouses and the number of parents/children on board. Both variables indicate implicitly whether a passenger had family members on board and how many. (More details can be found here.)
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "no","yes": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
head(train, n = 4)
## PassengerId Survived Pclass
## 1 1 no 3
## 2 2 yes 1
## 3 3 yes 3
## 4 4 yes 1
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 <NA> S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 <NA> S
## 4 0 113803 53.1000 C123 S
summary(train)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 no :549 1:216 Length:891 female:314
## 1st Qu.:223.5 yes:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Fare Cabin Embarked
## Min. : 0.00 Length:891 C :168
## 1st Qu.: 7.91 Class :character Q : 77
## Median : 14.45 Mode :character S :644
## Mean : 32.20 NA's: 2
## 3rd Qu.: 31.00
## Max. :512.33
##
One important aspect when predicting using machine learning is the assessment of the completeness of the data (the proportion of observations with one or more missing values).
missmap(train, main = "Missing Values Analysis", col = c("red", "gray")) # missmap() from the Amelia package: map of missing observations
sapply(train, function(x) sum(is.na(x))) # count of missing values per variable
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
The analysis shows that two variables suffer strongly from missing data: Cabin and Age.
While a missing Cabin number simply means that the passenger had no cabin, the variable Age is crucial and its missing observations should be imputed.
The preliminary data analysis also highlights the fact that the passengers’ titles are embedded in the variable Name. While the passengers’ names are not of great importance here, their titles are, in contrast, very informative and could represent an important predictor for machine learning.
I thus chose to isolate the titles and classify them in four categories according to their meanings: “Miss” for young ladies, “Mrs” for older/married ones, “Mil” for military passengers and “Noble” for aristocratic titles. The categories “Rev”, “Mr”, “Dr” and “Master” are kept unchanged.
# 1. train$Name - Replace titles in Name
x <- sub(".*, ", "", as.character(train$Name)) # drop everything up to "Surname, "
train$NameCat <- sub("\\..*", "", x) # keep the title: everything before the first "."
train$NameCat[train$NameCat %in% c("Miss", "Mlle", "Ms")] <- "Miss"
train$NameCat[train$NameCat %in% c("Mme", "Mrs")] <- "Mrs"
train$NameCat[train$NameCat %in% c("Capt", "Col", "Major")] <- "Mil"
train$NameCat[train$NameCat %in% c("Don", "Jonkheer", "Lady", "Sir", "the Countess")] <- "Noble"
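A quick tabulation (added as a check; it is not part of the original output) confirms that only the intended categories remain:
table(train$NameCat) # expect: Dr, Master, Mil, Miss, Mr, Mrs, Noble, Rev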
Note, however, that contrary to what one might think, the title “Master” was used at the time to identify male children.
ggplot(train, aes(x = NameCat, y = Age)) + geom_boxplot(aes(fill=Survived)) + theme_bw()
For efficiency, it is more appropriate to classify the passengers into specific age ranges based on the variable Age. In this way, one can directly identify whether a passenger is a “child”, “adult” or “senior”.
For passengers with no age information, I temporarily use the category “missing”.
Note also that, even though the Titanic’s Certificates of Clearance indicate that passengers over 14 were considered adults 3, I consider that during the evacuation of the Titanic only visible signs of puberty could be used to tell whether a passenger was a child or not. For this reason, I chose the age of 14 as the limit distinguishing a child from an adult passenger.
# 2. train$Age - Categories
train$AgeCat[train$Age > 0 & train$Age <= 14 & !is.na(train$Age)] <- "child"
train$AgeCat[train$Age > 14 & train$Age <= 60 & !is.na(train$Age)] <- "adult"
train$AgeCat[train$Age > 60 & !is.na(train$Age)] <- "senior"
train$AgeCat[is.na(train$Age)] <- "missing"
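Tabulating the new variable shows how the passengers are distributed over the brackets; the count for “missing” should match the 177 missing Age values seen above (an added check):
table(train$AgeCat) # "missing" should equal the 177 NA's in Age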
A new categorical variable is set according to whether a passenger has a cabin or not.
# 3. train$Cabin
train$HasCab[!is.na(train$Cabin)] <- "yes"
train$HasCab[is.na(train$Cabin)] <- "no"
It is known that many passengers were crossing the Atlantic with other members of their families. When leaving the sinking Titanic, children would obviously have been accompanied into the lifeboats by at least one of their parents. Thus, being on board with other family members could play an important role.
# 4. train$HasFam
train$HasFam[train$SibSp != 0 | train$Parch != 0] <- "yes"
train$HasFam[train$SibSp == 0 & train$Parch == 0] <- "no"
Nevertheless, the question is: to what extent? Did big families have the same chance of getting into lifeboats as smaller ones?
# 5. train$FamCat
# Check the relationship between SibSp, Parch and survival
ggplot(train, aes(y = SibSp, x = Parch)) +
geom_jitter(aes(color = Survived)) +
theme_bw() +
geom_vline(xintercept = 3, color = 'black', lty=5) +
geom_hline(yintercept = 3, color = 'black', lty=5)
It appears here that passengers with SibSp or Parch greater than 3 (i.e. passengers belonging to families of more than 4 members) had a very small chance to survive. I thus introduce another predictor variable which classifies the passengers according to the count of their relatives on board.
train$FamCat[train$SibSp + train$Parch <= 4] <- "small" # small family (note: 0 is overwritten as "single" below)
train$FamCat[train$SibSp + train$Parch > 4] <- "big" # big family
train$FamCat[train$SibSp + train$Parch == 0] <- "single" # the passenger was alone
Finally, all these engineered variables are converted to factors, as they will be used as categorical predictors:
train$NameCat <- factor(train$NameCat)
train$AgeCat <- factor(train$AgeCat)
train$HasCab <- factor(train$HasCab)
train$HasFam <- factor(train$HasFam)
train$FamCat <- factor(train$FamCat)
According to the train dataset, the proportion of survivors is a bit greater than the global one (about one third). Not surprisingly, more women and children survived the shipwreck than men.
One can also notice that the overall proportion of survivors in the category “missing” is very close to that of the category “adult”, which leads me to believe (even if subjectively) that their mean ages would not be very different.
# barplot: survived or not
p1 <- ggplot(data = train, aes(Survived, fill = Survived)) + geom_bar() + guides(fill = FALSE) + theme_bw()
# barplot: survival according to sex
p2 <- ggplot(data = train, aes(Sex, fill = Survived)) + geom_bar(position = "dodge") + guides(fill = FALSE) + theme_bw()
# barplot: survival or not according to age category
p3 <- ggplot(data = train, aes(x = AgeCat, fill = Survived)) + geom_bar(position = "dodge") + theme_bw()
grid.arrange(p1, p2, p3, layout_matrix = cbind(c(1,3), c(2,3)), widths=c(2,3))
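The figures quoted below come from a chunk whose code is not echoed in the report; a sketch that would reproduce them, using the same aggregate() pattern as the survival tables later on:
paste0("The percentage of survival is about ", round(mean(train$Survived == "yes") * 100, 1), "%")
aggregate(train$Survived ~ train$Sex, FUN = function(x) sum(x == "yes") / length(x) * 100)
aggregate(train$Survived ~ train$AgeCat, FUN = function(x) sum(x == "yes") / length(x) * 100)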
## [1] "The percentage of survival is about 38.4%"
## train$Sex train$Survived
## 1 female 74.20382
## 2 male 18.89081
## train$AgeCat train$Survived
## 1 adult 39.02439
## 2 child 58.44156
## 3 missing 29.37853
## 4 senior 22.72727
In light of the first plot, it seems that having family members on board the Titanic was not crucial for survival. However, when looking at the second plot, it appears clearly that being a member of a small family was in fact very important for survival, mainly because small children were always accompanied by at least one of their parents.
# barplot: survived or not according to has a family on board or not
p1 <- ggplot(data = train, aes(x = HasFam, fill = Survived)) + geom_bar(position = "dodge") + theme_bw()
# barplot: survived or not according to the category of the family
p2 <- ggplot(data = train, aes(x = FamCat, fill = Survived)) + geom_bar(position = "dodge") + guides(fill = FALSE) + theme_bw()
grid.arrange(p1, p2)
## train$HasFam train$Survived
## 1 no 30.35382
## 2 yes 50.56497
## train$FamCat train$Survived
## 1 big 14.89362
## 2 single 30.35382
## 3 small 56.02606
Wealth also seems to have an important impact on survival. The first two plots show that having a cabin or being a first-class passenger clearly increased the chance of survival.
# barplot: survived or not according to having a cabin or not
p2 <- ggplot(data = train, aes(x = HasCab, fill = Survived)) + geom_bar(position = "dodge") + theme_bw()
# barplot: survival or not according to passenger class
p1 <- ggplot(data = train, aes(x = Pclass, fill = Survived)) + geom_bar(position = "dodge") + guides(fill = FALSE) + theme_bw()
grid.arrange(p1, p2)
## train$Pclass train$Survived
## 1 1 62.96296
## 2 2 47.28261
## 3 3 24.23625
## train$HasCab train$Survived
## 1 no 29.98544
## 2 yes 66.66667
On the other hand, it also appears clearly that the overwhelming majority of women who paid more than 50 pounds survived. Since this is not the case for men, the reason could be that the women from first class were the first to leave the Titanic on lifeboats, while the men of first class were forced to wait for women and children from the other classes.
# Had the Fare any impact?
ggplot(train, aes(x = PassengerId, y = Fare, color = Survived)) + geom_point() + theme_bw() + facet_grid(. ~ Sex)
Concerning the passengers’ titles, the proportions of survival seem to be related to age categories and sex. One can also notice that certain categories such as doctors, servicemen or clerics did not (for the most part) even try to leave the Titanic.
# Relationship between survival and name categories
ggplot(data = train, aes(x = NameCat, fill = Survived)) + geom_bar(position = "dodge") + theme_bw()
aggregate(train$Survived ~ train$NameCat, FUN= function(x) sum(x == "yes") / length(x) * 100)
## train$NameCat train$Survived
## 1 Dr 42.85714
## 2 Master 57.50000
## 3 Mil 40.00000
## 4 Miss 70.27027
## 5 Mr 15.66731
## 6 Mrs 79.36508
## 7 Noble 60.00000
## 8 Rev 0.00000
For the machine learning it is important to impute all missing values of the predictor variables. There are many imputation methods, each with advantages and drawbacks. In this project, two variables show missing values: Embarked and Age. Since only two values are missing in Embarked, I do a simple investigation to deduce them based on the ticket number, which should be related to the place where it was issued.
# 1. train$Embarked
# passengerID with missing "Embarked"
train$PassengerId[is.na(train$Embarked)]
## [1] 62 830
# both passengers have the same ticket: 113572 => they embarked together on the same ticket
train$Ticket[train$PassengerId == 62 | train$PassengerId == 830]
## [1] "113572" "113572"
# check other passengers whose tickets share the same prefix => they most probably bought their tickets at the same embarkation point
train$Embarked[grep("^1135", train$Ticket)]
## [1] C <NA> S S S S C S <NA>
## Levels: C Q S
# since most of these cases have Embarked = "S"
train$Embarked[train$PassengerId %in% c(62, 830)] <- "S" # matching on PassengerId (identical to the row index here) is safer than hard-coded rows
For the machine learning, I decide to use AgeCat instead of Age. Its missing values are themselves imputed using machine learning, since, as seen previously, the age category of a passenger is directly related to many other variables in our dataset. Of course, for the prediction of missing AgeCat, the variable Survived is excluded when training the model.
The chosen predictor variables are Pclass, Sex, Embarked, NameCat, HasCab, HasFam, and FamCat, while the target variable is “AgeCat” with three different levels: “child”, “adult”, and “senior”.
The trainControl was set to use k-fold cross-validation, as it represents a robust method to estimate the model’s accuracy. The choice of k = 5 has been empirically shown to avoid both high bias and high variance when estimating the test error rate 4.
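Note that allowParallel = TRUE in the chunks below only takes effect if a parallel backend has been registered beforehand; the setup chunk is not shown, so the following is only a minimal sketch using the doParallel package (an assumption on my part):
library(doParallel) # assumed available; provides a foreach backend for caret
cl <- makeCluster(max(1, detectCores() - 1)) # leave one core free for the OS
registerDoParallel(cl) # caret's allowParallel = TRUE now dispatches to these workers
# ... run the train() calls ...
# stopCluster(cl) # release the workers once training is finished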
# train$AgeCat
# The missing categories are imputed by prediction
# 1. Select the predictors - "Survived" should not be selected
trainAge <- train[c(1, 3, 5, 12, 13, 14, 15, 16, 17)] # PassengerId, Pclass, Sex, Embarked, NameCat, AgeCat, HasCab, HasFam, FamCat
# keep only observations with AgeCat = "missing" in the test subset
trainAgetest <- trainAge[trainAge$AgeCat == "missing",]
# exclude AgeCat from the test subset
trainAgetest <- trainAgetest[, -6]
# exclude observations with AgeCat = "missing" in the train subset
trainAge <- trainAge[trainAge$AgeCat != "missing",]
trainAge <- trainAge[, -1]
# drop the unused level "missing"
trainAge$AgeCat <- factor(trainAge$AgeCat)
# Subset the train data set
set.seed(1962)
subsets <- createDataPartition(y=trainAge$AgeCat, p=0.75, list=FALSE)
subTraining <- trainAge[subsets, ]
subTesting <- trainAge[-subsets, ]
# train the model
x <- subTraining[, -5] # column 5 is the target variable
y <- subTraining[, 5]
control <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
modelGBM <- train(x, y, method = "rf", trControl = control) # note: method = "rf" (random forest), despite the object name
#modelGBM$results
predGBM <- predict(modelGBM, subTesting)
cfMatGBM <- confusionMatrix(subTesting$AgeCat, predGBM)
cfMatGBM$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9378531 0.6888782 0.8915243 0.9685713 0.9152542
## AccuracyPValue McnemarPValue
## 0.1730900 NaN
cfMatGBM$table
## Reference
## Prediction adult child senior
## adult 152 1 0
## child 5 14 0
## senior 5 0 0
A gradient boosting (GBM) model was also tested for this imputation; the best accuracy, however, was obtained with the random forest shown above, which is kept despite its modelGBM name.
The missing values of AgeCat in the train data set are then predicted using this model.
# predict for the missing AgeCat in train data set
predAge <- predict(modelGBM, trainAgetest)
# values are then injected into train
train$AgeCat[train$PassengerId %in% trainAgetest$PassengerId] <- predAge
train$AgeCat <- factor(train$AgeCat)
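A quick check (added here) confirms that the category “missing” has disappeared after imputation:
table(train$AgeCat) # the level "missing" should no longer appear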
In order to do the machine learning, it is mandatory to apply the same modifications to the test data set. The same steps applied to the train data are thus applied to the test data.
sapply(test, function(x) sum(is.na(x))) # how many observations are missing
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 86 0
## Parch Ticket Fare Cabin Embarked
## 0 0 1 327 0
We notice that there are missing values in Age and Fare. Concerning Fare, since only one value is missing, one can simply assign to this observation the median Fare of the corresponding class.
The missing values of AgeCat are imputed using the exact same method as for the train data.
Many titles seen in the train file are absent from the test file. At the same time, a new one, Dona, appears and is classified as a noble title.
# Data Engineering
# 1. test$Name
x <- sub(".*, ", "", as.character(test$Name))
test$NameCat <- sub("\\..*", "", x)
test$NameCat[test$NameCat %in% c("Miss", "Ms")] <- "Miss"
test$NameCat[test$NameCat == "Col"] <- "Mil" # use the same label as in train
test$NameCat[test$NameCat == "Dona"] <- "Noble" # use the same label as in train
# 2. test$Age
test$AgeCat[test$Age > 0 & test$Age <= 14 & !is.na(test$Age)] <- "child"
test$AgeCat[test$Age > 14 & test$Age <= 60 & !is.na(test$Age)] <- "adult"
test$AgeCat[test$Age > 60 & !is.na(test$Age)] <- "senior"
test$AgeCat[is.na(test$Age)] <- "missing"
# 3. test$Cabin
test$HasCab[!is.na(test$Cabin)] <- "yes" # same labels as in train ("yes"/"no")
test$HasCab[is.na(test$Cabin)] <- "no"
# 4. Add test$HasFam
test$HasFam[test$SibSp != 0 | test$Parch != 0] <- "yes"
test$HasFam[test$SibSp == 0 & test$Parch == 0] <- "no"
# 5. Add test$Famcat
test$FamCat[test$SibSp + test$Parch <= 4] <- "small"
test$FamCat[test$SibSp + test$Parch > 4] <- "big"
test$FamCat[test$SibSp + test$Parch == 0] <- "single"
# 6. missing Fare: PassengerId = 1044 with Pclass = 3 -> use the median Fare of this class
test$Fare[test$PassengerId == 1044] <- median(test$Fare[test$Pclass == 3], na.rm = TRUE)
test$NameCat <- factor(test$NameCat)
test$AgeCat <- factor(test$AgeCat)
test$HasCab <- factor(test$HasCab)
test$HasFam <- factor(test$HasFam)
test$FamCat <- factor(test$FamCat)
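Since the final model will be trained on train and used to predict on test, the engineered factors must carry the same labels in both data sets (hence “yes”/“no” above rather than 1/0). A small added sanity check:
# added check: engineered factor levels must agree between train and test
stopifnot(identical(levels(train$HasCab), levels(test$HasCab)),
          identical(levels(train$HasFam), levels(test$HasFam)),
          identical(levels(train$FamCat), levels(test$FamCat)))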
# Imputation of missing test$AgeCat (same procedure as for the train data)
testAge <- test[c(1, 2, 4, 11, 12, 13, 14, 15, 16)]
testAgetest <- testAge[testAge$AgeCat == "missing",]
testAgetest <- testAgetest[, -6] # test file
testAge <- testAge[testAge$AgeCat != "missing",]
testAge <- testAge[, -1] # train file
testAge$AgeCat <- factor(testAge$AgeCat) # in case of different levels
set.seed(1962)
subsets <- createDataPartition(y=testAge$AgeCat, p=0.75, list=FALSE)
subTraining <- testAge[subsets, ]
subTesting <- testAge[-subsets, ]
x <- subTraining[, -5] # column 5 is the target variable
y <- subTraining[, 5]
control <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
modelGBMtest <- train(x, y, method = "rf", trControl = control) # again a random forest, despite the object name
#modelGBMtest$results
predGBMtest <- predict(modelGBMtest, subTesting)
cfMatGBMtest <- confusionMatrix(subTesting$AgeCat, predGBMtest)
cfMatGBMtest$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.8641975 0.4095427 0.7699904 0.9302113 0.8518519
## AccuracyPValue McnemarPValue
## 0.4523181 NaN
cfMatGBMtest$table
## Reference
## Prediction adult child senior
## adult 65 7 0
## child 2 5 0
## senior 2 0 0
predAgetest <- predict(modelGBMtest, testAgetest)
test$AgeCat[test$PassengerId %in% testAgetest$PassengerId] <- predAgetest
test$AgeCat <- factor(test$AgeCat)
Now that all the data are processed and the missing values imputed, let’s build the final model and predict survival among the passengers of the test set.
The predictor variables are: “Pclass”, “Sex”, “Embarked”, “NameCat”, “AgeCat”, “HasCab”, “HasFam”, and “FamCat”.
The target variable is “Survived”.
The train data consists of 891 observations, while the test data consists of 418.
# RF for train and test
trainFin <- train[c(2, 3, 5, 12, 13, 14, 15, 16, 17)]
testFin <- test[c(1, 2, 4, 11, 12, 13, 14, 15, 16)]
set.seed(1962)
subsets <- createDataPartition(y=trainFin$AgeCat, p=0.75, list=FALSE) # note: the split is stratified on AgeCat rather than on the target Survived
subTraining <- trainFin[subsets, ]
subTesting <- trainFin[-subsets, ]
x <- subTraining[, -1] # column 1 is the target variable
y <- subTraining[, 1]
control = trainControl(method = "boot", number = 5, allowParallel = TRUE)
modelGBMFin <- train(x, y, method = "gbm", trControl = control)
modelGBMFin$results
## shrinkage interaction.depth n.minobsinnode n.trees Accuracy Kappa
## 1 0.1 1 10 50 0.8144104 0.5807197
## 4 0.1 2 10 50 0.8135026 0.5825964
## 7 0.1 3 10 50 0.8144249 0.5835697
## 2 0.1 1 10 100 0.8129420 0.5808665
## 5 0.1 2 10 100 0.8159136 0.5892643
## 8 0.1 3 10 100 0.8239166 0.6035873
## 3 0.1 1 10 150 0.8203495 0.6037245
## 6 0.1 2 10 150 0.8206616 0.5987273
## 9 0.1 3 10 150 0.8278934 0.6097493
## AccuracySD KappaSD
## 1 0.017017093 0.04787700
## 4 0.005442876 0.01228053
## 7 0.015013440 0.03670176
## 2 0.020246036 0.05357968
## 5 0.006297675 0.02207858
## 8 0.001446437 0.01990235
## 3 0.021378438 0.05007471
## 6 0.008710776 0.02909173
## 9 0.007754714 0.03187313
predGBMFin <- predict(modelGBMFin, subTesting)
cfMatGBMFin <- confusionMatrix(subTesting$Survived, predGBMFin)
cfMatGBMFin$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.692308e-01 5.340829e-01 7.079997e-01 8.230873e-01 5.610860e-01
## AccuracyPValue McnemarPValue
## 8.964825e-11 5.754030e-01
cfMatGBMFin$table
## Reference
## Prediction no yes
## no 96 23
## yes 28 74
#importanceGBMFin <- varImp(modelGBMFin, scale = FALSE)
#plot(importanceGBMFin, top = 10)
predGBMFin <- predict(modelGBMFin, testFin)
#sol <- ifelse(predGBMFin == "yes", "1", "0")
#z <- data.frame(test$PassengerId, sol)
#colnames(z) <- c("PassengerId", "Survived")
#write.csv(z, "Res.csv", row.names=FALSE)
1. Mersey, Lord (1999) [1912]. The Loss of the Titanic, 1912. The Stationery Office. ISBN 978-0-11-702403-8.↩
2. Algerian proverb.↩
3. https://www.encyclopedia-titanica.org/titanic-statistics.html↩
4. http://www.sthda.com/english/articles/38-regression-model-validation/157-cross-validation-essentials-in-r/↩