The goal of the Titanic competition is to predict which passengers survived the Titanic shipwreck. On that occasion, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 of the 2224 passengers and crew (about 68%). While there was some element of luck involved in surviving, some groups of people seem to have been more likely to survive than others. So, using passenger data, what sorts of people were more likely to survive? The passenger data includes variables such as class, sex, age, number of siblings/spouses and parents/children aboard, ticket, fare, cabin, and port of embarkation.
First, let’s clean and prepare our R workspace:
# clean everything done before
rm(list=ls())
# packages that will be used
packages <- c("easypackages","caret","dplyr","VIM","missForest","caretEnsemble")
# install packages if necessary
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# load packages
library(easypackages)
libraries("caret","dplyr","VIM","missForest","caretEnsemble")
Kaggle provides two data sets: a training set and a testing set. I will refer to this first training set simply as “data”, because the real training data set will be built later on. So, let’s load both into R (to run the code below, you first need to download the data sets to your computer; they can be found here):
# load data (read.csv2 assumes "," as the decimal separator, so Age and
# Fare are read as character; they will be converted to numeric later)
data <- read.csv2(unz("./titanic.zip", "train.csv"),
sep=",", na.strings=c("","NA"))
testing <- read.csv2(unz("./titanic.zip", "test.csv"),
sep=",", na.strings=c("","NA"))
First, let’s view data structure:
# view data structure
str(data)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : chr "22" "38" "26" "35" ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : chr "7.25" "71.2833" "7.925" "53.1" ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Looking at the data, we can conclude that some variables are obviously useful for predicting survival (for instance, Pclass, Sex, Age, SibSp, Parch, Fare). In contrast, it doesn’t seem that someone would survive because of his/her name. But he/she could survive because of his/her surname (perhaps belonging to some important family) or title (for instance, belonging to nobility). So, we will create the variables “Surname” and “Title” on the data and testing sets at once. First, we bind the two data sets:
# bind training and testing data set
testing$Survived <- NA
data$Type <- "data"
testing$Type <- "testing"
total <- rbind(data, testing)
Second, we will create variables “Title” and “Surname” (and remove “Name”):
# decompose variable Name into Surname and Title
Title <- gsub("[.].*$", "", total$Name)
Title <- sub("^[^,]*,", "", Title)
Title <- gsub(" ", "", Title)
Surname <- gsub(",.*$", "", total$Name)
Surname <- gsub(" ", ".", Surname, fixed=TRUE)
total <- cbind(total, Title, Surname)
total$Name <- NULL
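To see what these substitutions do, it may help to trace one name from the str() output above through the pipeline (the values in the comments follow directly from the regular expressions):
# trace one name through the pipeline
name <- "Braund, Mr. Owen Harris"
step1 <- gsub("[.].*$", "", name)   # "Braund, Mr" (drop everything from the first dot)
step2 <- sub("^[^,]*,", "", step1)  # " Mr" (drop the surname and the comma)
gsub(" ", "", step2)                # "Mr"
gsub(",.*$", "", name)              # "Braund" (the surname)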
But, as there are a lot of surnames, we will keep only the most frequent ones and collapse the rest into “Other”:
# count surnames
surname1 <- total %>%
group_by(Surname) %>%
count() %>%
arrange(desc(n)) %>%
as.data.frame()
head(surname1,25)
## Surname n
## 1 Andersson 11
## 2 Sage 11
## 3 Asplund 8
## 4 Goodwin 8
## 5 Davies 7
## 6 Brown 6
## 7 Carter 6
## 8 Ford 6
## 9 Fortune 6
## 10 Johnson 6
## 11 Panula 6
## 12 Rice 6
## 13 Skoog 6
## 14 Smith 6
## 15 Kelly 5
## 16 Lefebre 5
## 17 Palsson 5
## 18 Ryerson 5
## 19 Thomas 5
## 20 Williams 5
## 21 Allison 4
## 22 Baclini 4
## 23 Becker 4
## 24 Boulos 4
## 25 Cacic 4
# most frequent surnames
frequent1 <- c("Andersson","Sage","Asplund","Goodwin","Davies","Brown","Carter","Ford","Fortune","Johnson","Panula","Rice","Skoog","Smith","Kelly","Lefebre","Palsson","Ryerson","Thomas","Williams")
# manipulate variable surname
total$Surname <- ifelse(total$Surname %in% frequent1, total$Surname,"Other")
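As a side note, frequent1 could also be derived programmatically from the counts above, since these are exactly the surnames that appear at least five times:
# equivalent, derived from the counts
frequent1 <- as.character(surname1$Surname[surname1$n >= 5])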
A similar procedure will be applied to the variable “Title”. The difference is that rare titles will be grouped into broader categories. For instance, “Capt”, “Major” and “Col” will all be grouped into the category “Militar” (military).
# count titles
title1 <- total %>%
group_by(Title) %>%
count() %>%
arrange(desc(n)) %>%
as.data.frame()
title1
## Title n
## 1 Mr 757
## 2 Miss 260
## 3 Mrs 197
## 4 Master 61
## 5 Dr 8
## 6 Rev 8
## 7 Col 4
## 8 Major 2
## 9 Mlle 2
## 10 Ms 2
## 11 Capt 1
## 12 Don 1
## 13 Dona 1
## 14 Jonkheer 1
## 15 Lady 1
## 16 Mme 1
## 17 Sir 1
## 18 theCountess 1
# manipulate variable title
militar <- c("Col","Major","Capt")
nobility <- c("Sir","theCountess","Don","Jonkheer","Dona")
other.woman <- c("Mlle","Mme","Ms","Lady")
total$Title <- ifelse(total$Title %in% militar, "Militar", total$Title)
total$Title <- ifelse(total$Title %in% other.woman, "Other.Woman", total$Title)
total$Title <- ifelse(total$Title %in% nobility, "Nobility", total$Title)
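A quick check confirms the collapse (the new counts can be read off the table above, e.g. Militar = 4 + 2 + 1 = 7):
# check collapsed categories
table(total$Title)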
Again, we will apply the same procedure to the variable “Ticket” (the most frequent tickets will be kept, and the others will be replaced by the level “Other”). But first, there are dots, spaces and slashes in the variable. Let’s remove them:
# remove dots, spaces and slashes from variable Ticket
Ticket <- gsub(" ", "", total$Ticket, fixed = TRUE)
Ticket <- gsub(".", "", Ticket, fixed = TRUE)
Ticket <- gsub("/", "", Ticket, fixed = TRUE)
total$Ticket <- Ticket
Now we can transform the variable:
# count tickets
ticket1 <- total %>%
group_by(Ticket) %>%
count() %>%
arrange(desc(n)) %>%
as.data.frame()
head(ticket1,20)
## Ticket n
## 1 CA2343 11
## 2 1601 8
## 3 CA2144 8
## 4 3101295 7
## 5 347077 7
## 6 347082 7
## 7 PC17608 7
## 8 SOC14879 7
## 9 113781 6
## 10 19950 6
## 11 347088 6
## 12 382652 6
## 13 113503 5
## 14 16966 5
## 15 220845 5
## 16 349909 5
## 17 4133 5
## 18 PC17757 5
## 19 WC6608 5
## 20 113760 4
# most frequent tickets
frequent2 <- c("CA2343","1601","CA2144","3101295","347077","347082","PC17608","SOC14879","113781","19950","347088","382652","113503","16966","220845","349909","4133","PC17757","WC6608")
# manipulate variable ticket
total$Ticket <- ifelse(total$Ticket %in% frequent2, total$Ticket, "Other")
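The same shortcut applies here: frequent2 is exactly the set of tickets that appear at least five times, so it could also be derived from the counts:
# equivalent, derived from the counts
frequent2 <- ticket1$Ticket[ticket1$n >= 5]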
We are getting close to our goal. Now let’s set the proper class for each variable:
# set class properly
total$Survived <- factor(total$Survived,
levels=c(0,1),
labels=c("Not.Survived", "Survived"))
total$Pclass <- factor(total$Pclass,
levels=c(1,2,3),
labels=c("First","Second","Third"))
total$Sex <- factor(total$Sex,
labels=c("Female","Male"))
total$Ticket <- factor(total$Ticket)
total$Embarked <- factor(total$Embarked,
labels=c("Cherbourg","Queenstown","Southampton"))
total$Title <- as.factor(total$Title)
total$Surname <- as.factor(total$Surname)
total$Age <- as.numeric(total$Age)
total$Fare <- as.numeric(total$Fare)
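We can verify the conversions with a quick look at the column classes:
# verify column classes
sapply(total, class)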
Finally, we can go back to the “original” data sets:
# subset again in data and testing
data <- filter(total, Type == "data")
testing <- filter(total, Type == "testing")
data$Type <- NULL
testing$Type <- NULL
testing$Survived <- NULL
The first procedure in machine learning modeling is to split the data into training, validation and testing data sets. We already have the testing data set. What we need is to partition “data” into a training and a validation set. Let’s do it:
# create data subsets
set.seed(6376)
inTrain <- createDataPartition(y=data$Survived,
p=0.7, list=FALSE) # partitionate data
training <- data[inTrain,] # create a training data set
validation <- data[-inTrain,] # create a validation data set
# view rows and columns
cbind(c("Training","Validation","Testing"),
rbind(dim(training), dim(validation), dim(testing)))
## [,1] [,2] [,3]
## [1,] "Training" "625" "13"
## [2,] "Validation" "266" "13"
## [3,] "Testing" "418" "12"
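Since createDataPartition samples within each level of the outcome, the class balance of “Survived” should be approximately preserved across the subsets; a quick sanity check:
# check that the outcome distribution is preserved
round(rbind(Training = prop.table(table(training$Survived)),
Validation = prop.table(table(validation$Survived))), 2)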
The first step will be to inspect missing values:
# explore missing values
mice_plot <- aggr(training, col=c("navyblue","yellow"),
numbers=TRUE, sortVars=TRUE,
labels=names(training), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Cabin 0.7792
## Age 0.2016
## Embarked 0.0016
## PassengerId 0.0000
## Survived 0.0000
## Pclass 0.0000
## Sex 0.0000
## SibSp 0.0000
## Parch 0.0000
## Ticket 0.0000
## Fare 0.0000
## Title 0.0000
## Surname 0.0000
As we can see, there are missing values in the variables “Cabin”, “Age” and “Embarked”. We cannot take advantage of the variable “Cabin”, because it has too many missing values (it is not a good idea to impute a variable that is mostly missing). Moreover, “PassengerId” is just an identifier; it is not useful for prediction. Let’s remove both variables:
# remove variables that will not be used
training$PassengerId <- NULL
training$Cabin <- NULL
Now, let’s impute the missing values in the variables “Age” and “Embarked”:
# impute missing values
survived <- training$Survived
training.imp <- missForest(select(training, -"Survived"),
ntree = 1000, verbose = T)
## missForest iteration 1 in progress...done!
## estimated error(s): 0.1942052 0.03846154
## difference(s): 0.003847561 0.0002666667
## time: 11.38 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.1934052 0.03685897
## difference(s): 6.45031e-06 0
## time: 10.14 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.1946736 0.03685897
## difference(s): 8.505943e-06 0
## time: 9.42 seconds
training <- data.frame(training.imp$ximp)
training$Survived <- survived
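After imputation there should be no missing values left; we can confirm:
# confirm that no missing values remain
colSums(is.na(training))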
What types of relations can we see between the continuous predictors and the outcome?
# boxplot
featurePlot(x = select(training, c("Age","SibSp","Parch","Fare")),
y = training$Survived,
plot = "box",
scales = list(y = list(relation="free"),
x = list(rot = 90)),
layout = c(4,1),
auto.key = list(columns = 2))
Passengers who survived were younger (children, unsurprisingly, had preference). Fares are higher among those who survived; in other words, richer passengers seem to have been privileged. The boxplots do not show the relation between “SibSp”, “Parch” and survival very well. We can visualize it better in a density plot:
# density plot
featurePlot(x = select(training, c("Age","SibSp","Parch","Fare")),
y = training$Survived,
plot = "density",
scales = list(x = list(relation="free"),
y = list(relation="free")),
adjust = 1.5,
pch = "|",
layout = c(4,1),
auto.key = list(columns = 2))
Passengers with one or two siblings or spouses aboard survived at higher rates, but having more than that was a problem. The same can be said about the number of parents or children aboard. The relation between the categorical variables and survival is better seen in tables:
# 2x2 table (predictor "sex")
t1 <- table(training$Sex, training$Survived)
round(prop.table(t1,1),2)
##
## Not.Survived Survived
## Female 0.27 0.73
## Male 0.81 0.19
# 2x2 table (predictor "pclass")
t2 <- table(training$Pclass, training$Survived)
round(prop.table(t2,1),2)
##
## Not.Survived Survived
## First 0.35 0.65
## Second 0.52 0.48
## Third 0.76 0.24
t3 <- table(training$Embarked, training$Survived)
round(prop.table(t3,1),2)
##
## Not.Survived Survived
## Cherbourg 0.49 0.51
## Queenstown 0.63 0.37
## Southampton 0.65 0.35
Women survived at a much higher rate than men; first-class passengers survived more than third-class ones; and passengers who embarked at Queenstown and Southampton tended to die more often than those who embarked at Cherbourg.
On the validation data set, we will repeat the transformations done on the training data set; we will: 1) remove the variables that will not be used; 2) impute missing values:
# remove variables that will not be used
validation$PassengerId <- NULL
validation$Cabin <- NULL
# impute missing values
survived <- validation$Survived
validation.imp <- missForest(select(validation, -"Survived"),
ntree = 1000, verbose = T)
## missForest iteration 1 in progress...done!
## estimated error(s): 0.1638357 0.03522013
## difference(s): 0.002180458 0.0006265664
## time: 4.58 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.1626721 0.03333333
## difference(s): 3.593081e-06 0
## time: 4.15 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.1639058 0.03333333
## difference(s): 2.965022e-06 0
## time: 4.25 seconds
##
## missForest iteration 4 in progress...done!
## estimated error(s): 0.163997 0.03333333
## difference(s): 3.540082e-06 0
## time: 4.05 seconds
validation <- data.frame(validation.imp$ximp)
validation$Survived <- survived
We will (somewhat arbitrarily) fit a Random Forest as the first model. The basic steps are: 1) fit the model on the training set; 2) predict values on the validation data set; 3) build a confusion matrix (again, based on the validation set).
# fit a random forest model
set.seed(431)
modRF <- train(Survived ~ ., data=training,
method="rf",
prox=TRUE)
modRF
## Random Forest
##
## 625 samples
## 10 predictor
## 2 classes: 'Not.Survived', 'Survived'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 625, 625, 625, 625, 625, 625, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7951357 0.5306890
## 29 0.8176757 0.6079478
## 56 0.7964694 0.5647731
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 29.
# predict on new samples
predRF <- predict(modRF, newdata = validation)
# evaluate model
confMatRF <- confusionMatrix(predRF, validation$Survived)
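Before moving on, it can be informative to check which predictors the forest leans on (caret’s varImp works directly on the fitted train object):
# inspect variable importance
varImp(modRF)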
Now, suppose that we want to find other models uncorrelated with Random Forest, in order to later build a combined model (a combined model works better with uncorrelated base models). How can we do that? Max Kuhn, on the website The caret Package, provides a csv file which summarizes the main features of the algorithms covered by the caret package. The explanation of how to use this file to find uncorrelated models is here. Let’s do it:
# assign URL to an R object
URL <- "http://topepo.github.io/caret/tag_data.csv"
# download data
download.file(URL, destfile = "tag_data.csv")
# load data
tag <- read.csv("tag_data.csv", row.names = 1)
# convert data frame to a matrix
tag <- as.matrix(tag)
# select only models for classification
regModels <- tag[tag[,"Classification"] == 1,]
all <- 1:nrow(regModels)
# seed the analysis with the random forest model
start <- grep("(rf)", rownames(regModels), fixed = TRUE)
pool <- all[all != start]
# select 2 more models by maximizing the Jaccard dissimilarity between sets of models
nextMods <- maxDissim(regModels[start,,drop = FALSE],
regModels[pool, ],
method = "Jaccard",
n = 2)
modelsClassification <- rownames(regModels)[c(start, nextMods)]
modelsClassification
## [1] "Random Forest (rf)"
## [2] "Linear Distance Weighted Discrimination (dwdLinear)"
## [3] "Bagged CART (treebag)"
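For intuition, the Jaccard dissimilarity used by maxDissim between two binary tag vectors is one minus the number of tags both models share, divided by the number of tags either has; a toy illustration:
# toy Jaccard dissimilarity between two binary tag vectors
a <- c(1, 1, 0, 1)
b <- c(1, 0, 0, 1)
1 - sum(a & b) / sum(a | b)   # 1 - 2/3 = 0.33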
The models that we will fit and later combine are: Random Forest (already fitted); Linear Distance Weighted Discrimination; and Bagged Classification and Regression Trees (Bagged CART). First, we will fit the Linear Distance Weighted Discrimination model:
# fit a Linear Distance Weighted Discrimination model
set.seed(431)
modLDWD <- train(Survived ~ ., method="dwdLinear", data=training)
modLDWD
## Linear Distance Weighted Discrimination
##
## 625 samples
## 10 predictor
## 2 classes: 'Not.Survived', 'Survived'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 625, 625, 625, 625, 625, 625, ...
## Resampling results across tuning parameters:
##
## lambda Accuracy Kappa
## 1e-05 0.8072708 0.5886795
## 1e-02 0.8203765 0.6169484
## 1e+01 0.6694264 0.1950262
##
## Tuning parameter 'qval' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were lambda = 0.01 and qval = 1.
# predict on new samples
predLDWD <- predict(modLDWD, newdata=validation)
# evaluate model
confMatLDWD <- confusionMatrix(predLDWD, validation$Survived)
Second, we will fit a Bagged CART model:
# fit a Bagged CART model (treebag)
set.seed(431)
modBCART <- train(Survived ~ ., method="treebag", data=training)
modBCART
## Bagged CART
##
## 625 samples
## 10 predictor
## 2 classes: 'Not.Survived', 'Survived'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 625, 625, 625, 625, 625, 625, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7972226 0.5668496
# predict on new samples
predBCART <- predict(modBCART, newdata=validation)
# evaluate model
confMatBCART <- confusionMatrix(predBCART,validation$Survived)
Now we are going to fit a final model; it will be a combination of the three previous ones:
# fit multiple models
trainControl <- trainControl(method="repeatedcv",
number=10,
repeats=10,
classProbs=TRUE,
savePredictions=TRUE)
algorithmList <- c("rf","dwdLinear","treebag")
set.seed(431)
models <- caretList(Survived ~ ., data=training,
trControl = trainControl,
methodList = algorithmList)
# ensemble models
set.seed(431)
stackControl <- trainControl(method="repeatedcv",
number=10,
repeats=10,
savePredictions=TRUE,
classProbs=TRUE)
modStack <- caretStack(models,
method="glm",
metric="Accuracy",
trControl=stackControl)
modStack
## A glm ensemble of 3 base models: rf, dwdLinear, treebag
##
## Ensemble results:
## Generalized Linear Model
##
## 6250 samples
## 3 predictor
## 2 classes: 'Not.Survived', 'Survived'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 5625, 5625, 5625, 5625, 5625, 5625, ...
## Resampling results:
##
## Accuracy Kappa
## 0.824512 0.6211911
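Since stacking pays off most when the base learners disagree, it is worth checking how correlated their resampling results are (caret’s modelCor; values well below 1 suggest the models make different errors):
# check correlation between base models
modelCor(resamples(models))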
# predict on validation data set
predStack <- predict(modStack, newdata = validation)
# evaluate model
confMatStack <- confusionMatrix(predStack, validation$Survived)
And, finally, let’s see how these four models behave on a new sample (the validation data set). We will use accuracy to evaluate model performance:
cbind(c("Random Forest","Linear Distance Weighted Model","Bagged Tree",
"Stacked Model"),
rbind(confMatRF$overall["Accuracy"],
confMatLDWD$overall["Accuracy"],
confMatBCART$overall["Accuracy"],
confMatStack$overall["Accuracy"]))
## Accuracy
## [1,] "Random Forest" "0.857142857142857"
## [2,] "Linear Distance Weighted Model" "0.849624060150376"
## [3,] "Bagged Tree" "0.815789473684211"
## [4,] "Stacked Model" "0.868421052631579"
As expected, the combined model performed better than the others, but only by a little. Anyway, we will use this final model to predict survival on the testing data set provided by Kaggle. First, we will transform the testing set in the same way we did the training and validation data sets:
# create vector with variable PassengerId
PassengerId <- testing$PassengerId
# remove variables that will not be used
testing$PassengerId <- NULL
testing$Cabin <- NULL
# impute missing values
testing.imp <- missForest(testing, ntree = 1000, verbose = T)
## missForest iteration 1 in progress...done!
## estimated error(s): 0.549955 0
## difference(s): 0.002299855 0
## time: 2.06 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.5534937 0
## difference(s): 3.594727e-06 0
## time: 2.47 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.5610569 0
## difference(s): 6.533115e-06 0
## time: 2.25 seconds
testing <- data.frame(testing.imp$ximp)
And finally, we will get the predictions on it:
# predict on testing data set
predFinal <- predict(modStack, newdata = testing)
# prepare file (cbind coerces the factor predictions to their integer
# codes: 1 = Not.Survived, 2 = Survived; recode them as 0/1)
submission <- as.data.frame(cbind(PassengerId, predFinal))
submission$predFinal <- ifelse(submission$predFinal == 2, 1, 0)
colnames(submission)[2] <- "Survived"
write.csv(submission,"./submission.csv", row.names = FALSE)
Now, all you have to do is submit the csv file called “submission” to Kaggle. The accuracy that I got on the testing data set was 0.76076 (which, admittedly, is not great, but it was the best I could do).