This is my attempt to follow Trevor Stephens' tutorial on Kaggle's Titanic contest, which introduces Machine Learning concepts.
setwd("D:/Code/Kaggle/titanic")
train <- read.csv('train.csv')
test <- read.csv('test.csv')
We will use the training set to build our classifier. Let's start by taking a look at the outcome variable, train$Survived.
table(train$Survived)
##
## 0 1
## 549 342
prop.table(table(train$Survived))
##
## 0 1
## 0.6161616 0.3838384
Since most of the people on the Titanic died, let's make the morbid assumption that everyone died.
test$Survived <- rep(0, 418)
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = 'yallperish.csv', row.names = FALSE)
Hooray! My first prediction!
In the first prediction, I only looked at train$Survived to predict the fate of the passengers. Trevor focuses on the fact that the Titanic disaster was famous for 'women and children first'. Let's take a look at survival rates between the genders,
prop.table(table(train$Sex, train$Survived), 1)
##
## 0 1
## female 0.2579618 0.7420382
## male 0.8110919 0.1889081
Looks like most of the women survived while most of the men died. Let's alter my last prediction and assume that only the men died.
test$Survived <- 0
test$Survived[test$Sex == 'female'] <- 1
# write to disk
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = 'diemendie.csv', row.names = FALSE)
0.76555 isn't that bad, but I'm still in the bottom 20th percentile.
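As a rough sanity check (this is accuracy against the training labels, not the leaderboard score), we can score the same women-survive rule on the data we already have,
# accuracy of the "all women survive, all men die" rule on the training set
mean(ifelse(train$Sex == 'female', 1, 0) == train$Survived)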
Let’s see if being a minor has any effect on chances of survival.
summary(train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
# Age as a categorical variable
train$Child <- 0
train$Child[train$Age < 18] <- 1
#aggregate(Survived ~ Child + Sex, data = train, FUN = sum)
#aggregate(Survived ~ Child + Sex, data = train, FUN = length)
aggregate(Survived ~ Child + Sex, data = train, FUN = function(x) {sum(x)/length(x)})
## Child Sex Survived
## 1 0 female 0.7528958
## 2 1 female 0.6909091
## 3 0 male 0.1657033
## 4 1 male 0.3965517
That's tragic; it looks like the chances of survival still weren't bright for the boys. Let's take a look at how passengers with expensive fares fared in this catastrophe,
train$Fare2 <- cut(train$Fare, c(-Inf, 10, 20, 30, Inf), labels = c("<10", "10-20", "20-30", "30+"))
aggregate(Survived ~ Child + Fare2 + Pclass + Sex, data = train, FUN = function(x) {sum(x) / length(x)})
## Child Fare2 Pclass Sex Survived
## 1 0 20-30 1 female 0.85714286
## 2 0 30+ 1 female 0.98734177
## 3 1 30+ 1 female 0.87500000
## 4 0 10-20 2 female 0.90625000
## 5 1 10-20 2 female 1.00000000
## 6 0 20-30 2 female 0.88461538
## 7 1 20-30 2 female 1.00000000
## 8 0 30+ 2 female 1.00000000
## 9 1 30+ 2 female 1.00000000
## 10 0 <10 3 female 0.56140351
## 11 1 <10 3 female 0.85714286
## 12 0 10-20 3 female 0.50000000
## 13 1 10-20 3 female 0.73333333
## 14 0 20-30 3 female 0.40000000
## 15 1 20-30 3 female 0.16666667
## 16 0 30+ 3 female 0.11111111
## 17 1 30+ 3 female 0.14285714
## 18 0 <10 1 male 0.00000000
## 19 0 20-30 1 male 0.44117647
## 20 0 30+ 1 male 0.33333333
## 21 1 30+ 1 male 1.00000000
## 22 0 <10 2 male 0.00000000
## 23 0 10-20 2 male 0.11864407
## 24 1 10-20 2 male 0.75000000
## 25 0 20-30 2 male 0.04761905
## 26 1 20-30 2 male 0.75000000
## 27 0 30+ 2 male 0.00000000
## 28 1 30+ 2 male 1.00000000
## 29 0 <10 3 male 0.10931174
## 30 1 <10 3 male 0.15384615
## 31 0 10-20 3 male 0.12903226
## 32 1 10-20 3 male 0.71428571
## 33 0 20-30 3 male 0.07142857
## 34 1 20-30 3 male 0.20000000
## 35 0 30+ 3 male 0.41666667
## 36 1 30+ 3 male 0.07692308
It looks like Class 3 females who paid more than $20 didn't do well. I'll mark them as not surviving, along with the men.
test$Survived <- 0
test$Survived[test$Sex == 'female'] <- 1
test$Survived[test$Sex == 'female' & test$Pclass == 3 & test$Fare > 20] <- 0
# write to disk
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = 'allmensomewomendie.csv', row.names = FALSE)
0.77990 this time! Top 45th percentile.
Trevor gave an introduction to decision trees. I'm basically going to automate the process I've been doing so far using an efficient algorithm. To build the model,
library(rpart)   # decision trees
library(rattle)  # fancyRpartPlot(); the tutorial also loads rpart.plot and RColorBrewer

fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
data = train,
method = "class")
plot(fit)
text(fit)
fancyRpartPlot(fit)
As we can see, the process has been automated. Note that a decision tree algorithm is essentially greedy; it doesn't always lead to the globally optimal solution. I'll use predict() to arrive at the results without messing around with subsetting or overwriting values.
Prediction <- predict(fit, test, type = "class")
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = 'decisiontree.csv', row.names = FALSE)
This was okay.
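If I wanted to steer how far the tree grows, rpart takes a control argument; the values below are just an illustration on my part, not something from the tutorial.
# sketch: loosen the complexity and split thresholds to grow a deeper tree
# (beware: very small cp values tend to overfit the training data)
fit2 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
              data = train,
              method = "class",
              control = rpart.control(minsplit = 2, cp = 0.005))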
Feature Engineering brings back the importance of the human element in machine learning. It involves refining existing pieces of information or discovering new ones.
So far, we have depended on the variables provided with the dataset as-is. We haven't yet taken a look at Names.
train <- read.csv('train.csv')
test <- read.csv('test.csv')
train$Name[1]
## [1] Braund, Mr. Owen Harris
## 891 Levels: Abbing, Mr. Anthony ... Zimmerman, Mr. Leo
Each name has a title, a first name, a middle name and a last name. The title and the last name can give us a different insight into the data. Let's explore them. Before I do that, I am going to merge the training and test datasets because Trevor does the same. I am not sure why yet.
test$Survived <- NA
combi <- rbind(train, test)
combi$Name <- as.character(combi$Name)
combi$Name[1]
## [1] "Braund, Mr. Owen Harris"
names <- strsplit(combi$Name, split = "[,.\\s]+", perl = TRUE)
#head(data.frame(names))
combi$Title <- sapply(names, "[", 2) # gets the second element from every vector in the list of vectors
table(combi$Title)
##
## Billiard Brito Capt Carlo Col Cruyssen
## 3 1 1 2 4 1
## der Don Dr Gordon Impe Jonkheer
## 1 1 8 2 3 1
## Khalil Major Master Melkebeke Messemaeker Miss
## 1 2 59 1 2 256
## Mlle Mme Mr Mrs Ms Mulder
## 2 1 736 191 2 1
## Palmquist Pelsmaeker Planke Rev Shawah Steen
## 1 1 4 8 1 1
## the Velde Walle y
## 1 1 1 8
This regex does not handle the strings inside parentheses. However, that is not our concern, as we are only after the second token (the title); the surname is the first.
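If we did want to get the bracketed parts of the names out of the way first, a simple substitution would do it (a sketch on my part, not part of the tutorial),
# drop any parenthesised part of the name before splitting
cleaned <- gsub("\\s*\\([^)]*\\)", "", combi$Name, perl = TRUE)
head(cleaned)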
Alternatively, we can take the approach Trevor takes,
combi$Title <- sapply(combi$Name, FUN=function(x) {strsplit(x, split = "[.,]")[[1]][2]})
combi$Title <- sub(' ', '', combi$Title)
table(combi$Title)
##
## Capt Col Don Dona Dr
## 1 4 1 1 8
## Jonkheer Lady Major Master Miss
## 1 1 2 61 260
## Mlle Mme Mr Mrs Ms
## 2 1 757 197 2
## Rev Sir the Countess
## 8 1 1
The latter approach does a little better.
combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
combi$Title[combi$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'
combi$Title <- factor(combi$Title)
Another variable we can derive is the family size, using SibSp and Parch,
combi$FamilySize <- combi$SibSp + combi$Parch + 1
Extract surnames and build a FamilyID,
combi$Surname <- sapply(combi$Name, FUN=function(x) {strsplit(x, split = "[.,]")[[1]][1]})
combi$FamilyID <- paste(as.character(combi$FamilySize), combi$Surname, sep = '')
At this point, I’m not sure what’s going on in which variables. I’ll keep on going though. Maybe it’s like making your way to the light switch in a dark room.
Anything more? Well we just thought about a large family having issues getting to lifeboats together, but maybe specific families had more trouble than others? We could try to extract the Surname of the passengers and group them to find families, but a common last name such as Johnson might have a few extra non-related people aboard. In fact there are three Johnsons in a family with size 3, and another three probably unrelated Johnsons all travelling solo.
This is where I was lost, just to note.
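To see what Trevor means about the Johnsons, we can just look at those rows (a quick inspection using the columns created above),
# the three related Johnsons share the FamilyID "3Johnson"; the solo ones get "1Johnson"
combi[combi$Surname == 'Johnson', c('Name', 'FamilySize', 'FamilyID')]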
combi$FamilyID[combi$FamilySize <= 2] <- "Small"
table(combi$FamilyID)
##
## 11Sage 3Abbott 3Appleton 3Beckwith
## 11 3 1 2
## 3Boulos 3Bourke 3Brown 3Caldwell
## 3 3 4 3
## 3Christy 3Collyer 3Compton 3Cornell
## 2 3 3 1
## 3Coutts 3Crosby 3Danbom 3Davies
## 3 3 3 5
## 3Dodge 3Douglas 3Drew 3Elias
## 3 1 3 3
## 3Frauenthal 3Frolicher 3Frolicher-Stehli 3Goldsmith
## 1 1 2 3
## 3Gustafsson 3Hamalainen 3Hansen 3Hart
## 2 2 1 3
## 3Hays 3Hickman 3Hiltunen 3Hirvonen
## 2 3 1 1
## 3Jefferys 3Johnson 3Kink 3Kink-Heilmann
## 2 3 2 2
## 3Klasen 3Lahtinen 3Mallet 3McCoy
## 3 2 3 3
## 3Minahan 3Moubarek 3Nakid 3Navratil
## 1 3 3 3
## 3Newell 3Newsom 3Nicholls 3Peacock
## 1 1 1 3
## 3Peter 3Quick 3Richards 3Rosblom
## 3 3 2 3
## 3Samaan 3Sandstrom 3Silven 3Spedden
## 3 3 1 3
## 3Strom 3Taussig 3Thayer 3Thomas
## 1 3 3 1
## 3Touma 3van Billiard 3Van Impe 3Vander Planke
## 3 3 3 2
## 3Wells 3Wick 3Widener 4Allison
## 3 3 3 4
## 4Backstrom 4Baclini 4Becker 4Carter
## 1 4 4 4
## 4Davidson 4Dean 4Herman 4Hocking
## 1 4 4 2
## 4Jacobsohn 4Johnston 4Laroche 4Renouf
## 1 4 4 1
## 4Vander Planke 4West 5Ford 5Hocking
## 1 4 5 1
## 5Kink-Heilmann 5Lefebre 5Palsson 5Ryerson
## 1 5 5 5
## 6Fortune 6Panula 6Rice 6Richards
## 6 6 6 1
## 6Skoog 7Andersson 7Asplund 8Goodwin
## 6 9 7 8
## Small
## 1025
There is a problem that shows up at this point: something is still wrong with the FamilyID values.
famIDs <- data.frame(table(combi$FamilyID))
I think the trouble is that, after the statement combi$FamilyID[combi$FamilySize <= 2] <- "Small", we shouldn't be seeing any FamilyID shared by only one or two passengers. However, we do.
famIDs <- famIDs[famIDs$Freq <= 2, ]
This gives us the list of all the families that look small but didn't get converted, so we overwrite them too,
combi$FamilyID[combi$FamilyID %in% famIDs$Var1] <- 'Small'
combi$FamilyID <- factor(combi$FamilyID)
Alright. Trevor also answers why we merged the two datasets: it keeps the model consistent, in a way I don't completely understand right now.
We are now ready to split the test and training sets back into their original states, carrying our fancy new engineered variables with them. The nicest part of what we just did is how the factors are treated in R. Behind the scenes, factors are basically stored as integers, but masked with their text names for us to look at. If you create the above factors on the isolated test and train sets separately, there is no guarantee that both groups exist in both sets.
For instance, the family "3Johnson" previously discussed does not exist in the test set. We know that all three of them survive from the training set data. If we had built our factors in isolation, there would be no factor "3Johnson" for the test set. This would upset any machine learning model because the factors between the training set used to build the model and the test set it is asked to predict for are not consistent, i.e. R will throw errors at you if you try.
Because we built the factors on a single dataframe, and then split it apart after we built them, R will give all factor levels to both new dataframes, even if the factor doesn’t exist in one. It will still have the factor level, but no actual observations of it in the set. Neat trick right? Let me assure you that manually updating factor levels is a pain.
train <- combi[1:891, ]
test <- combi[892:1309, ]
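A quick check that the factor levels really did carry over to both halves; as the explanation above says, "3Johnson" should be a level of test$FamilyID even though no test passenger carries it,
# both data frames share the full set of FamilyID levels
identical(levels(train$FamilyID), levels(test$FamilyID))
# the level exists in the test set, but no rows use it
'3Johnson' %in% levels(test$FamilyID)
sum(test$FamilyID == '3Johnson')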
Time to make predictions now; this is just like the earlier prediction.
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID,
data = train,
method = "class")
Prediction <- predict(fit, test, type = "class")
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = 'featureengineering.csv', row.names = FALSE)
Yay! This got me into the top 1000. The competition deadline is in December 2016, so I'll try to reach a sub-500 rank by then.
If we grow a lot of decision trees on randomized samples from the dataset, each using a different subset of variables, we get a forest. To predict a result, we ask this forest to vote. This is a random forest.
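The "randomized samples" part is just bootstrapping: rows are drawn with replacement, so some appear more than once and some not at all. A toy illustration,
# bootstrap sample of ten row indices; duplicates and gaps are expected
sample(1:10, replace = TRUE)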
Before we take the problem head on, let's massage the data. The random forest algorithm in R will not tolerate NA values. We could replace them with the mean, but about 20% of the values in combi$Age are missing. Instead, we will grow a decision tree to predict the missing ages.
Agefit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize,
data = combi[!is.na(combi$Age), ],
method = 'anova')
combi$Age[is.na(combi$Age)] <- predict(Agefit, combi[is.na(combi$Age), ])
Take a look at the statistics before we lift off,
summary(combi)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.3838 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
## NA's :418
## Sex Age SibSp Parch
## female:466 Min. : 0.17 Min. :0.0000 Min. :0.000
## male :843 1st Qu.:22.00 1st Qu.:0.0000 1st Qu.:0.000
## Median :28.86 Median :0.0000 Median :0.000
## Mean :29.70 Mean :0.4989 Mean :0.385
## 3rd Qu.:36.50 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
##
## Ticket Fare Cabin Embarked
## CA. 2343: 11 Min. : 0.000 :1014 : 2
## 1601 : 8 1st Qu.: 7.896 C23 C25 C27 : 6 C:270
## CA 2144 : 8 Median : 14.454 B57 B59 B63 B66: 5 Q:123
## 3101295 : 7 Mean : 33.295 G6 : 5 S:914
## 347077 : 7 3rd Qu.: 31.275 B96 B98 : 4
## 347082 : 7 Max. :512.329 C22 C26 : 4
## (Other) :1261 NA's :1 (Other) : 271
## Title FamilySize Surname FamilyID
## Mr :757 Min. : 1.000 Length:1309 Small :1074
## Miss :260 1st Qu.: 1.000 Class :character 11Sage : 11
## Mrs :197 Median : 1.000 Mode :character 7Andersson: 9
## Master : 61 Mean : 1.884 8Goodwin : 8
## Dr : 8 3rd Qu.: 2.000 7Asplund : 7
## Rev : 8 Max. :11.000 6Fortune : 6
## (Other): 18 (Other) : 194
#combi$Embarked <- as.character(combi$Embarked)
combi$Embarked[which(combi$Embarked == '')] <- 'S'
combi$Embarked <- factor(combi$Embarked)
combi$Fare[which(is.na(combi$Fare))] <- median(combi$Fare, na.rm = TRUE)
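A quick check that Age and Fare are now NA-free (Survived is of course still NA for the 418 test rows),
sum(is.na(combi$Age))
sum(is.na(combi$Fare))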
I've got rid of the NA values in the predictors. There's another problem here: the random forest implementation in R cannot handle factor variables with more than 32 levels, and combi$FamilyID has roughly double that many. To cut them down, I will fold the three-member families into 'Small' as well.
combi$FamilyID2 <- combi$FamilyID
combi$FamilyID2 <- as.character(combi$FamilyID2)
combi$FamilyID2[combi$FamilySize <= 3] <- 'Small'
combi$FamilyID2 <- factor(combi$FamilyID2)
train <- combi[1:891,]
test <- combi[892:1309,]
Let’s begin by setting a seed to ensure reproducibility
set.seed(42)
library(randomForest)

fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID2,
data = train,
importance = TRUE,
ntree = 2000)
varImpPlot(fit)
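The same importances can also be printed as numbers rather than plotted (same fitted forest),
# mean decrease in accuracy and in Gini impurity for each predictor
importance(fit)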
Generating the prediction is just like it has always been,
prediction <- predict(fit, test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
write.csv(submit, file = "randomForest.csv", row.names = FALSE)
This was a bummer. Our score didn’t improve at all!
Somewhere in this, a statistical test is involved; let's roll before we dive into the specifics, using library(party).
library(party)

set.seed(42)
fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID,
data = train,
controls = cforest_unbiased(ntree=2000, mtry=3))
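party also has a varimp() function for cforest objects, if you want to see which variables the conditional inference forest leans on (it can take a while with 2000 trees),
# variable importance for the conditional inference forest (slow on large forests)
varimp(fit)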
To make the prediction, same drill.
prediction <- predict(fit, test, OOB = TRUE, type = 'response')
submit <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
write.csv(submit, file = 'conditionalInferenceTrees.csv', row.names = FALSE)
This gets me into the top 500 of the leaderboard with a score of 0.80861. Trevor leaves us at this point, so I'll have to tread the rest of the way alone.