This is my attempt to follow Trevor Stephens' tutorial on Kaggle's Titanic contest, which introduces Machine Learning concepts.
setwd("D:/Code/Kaggle/titanic")
train <- read.csv('train.csv')
test <- read.csv('test.csv')
We will use the training set to build our classifier. Let's start by taking a look at the outcome variable, train$Survived.
table(train$Survived)
##
## 0 1
## 549 342
prop.table(table(train$Survived))
##
## 0 1
## 0.6161616 0.3838384
Since most of the people on the Titanic died, let's make the morbid assumption that everyone died.
test$Survived <- rep(0, 418)
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = 'yallperish.csv', row.names = FALSE)
Hooray! My first prediction!
In the first prediction, I only looked at train$Survived to predict the fate of the passengers. Trevor focuses on the fact that the Titanic disaster was famous for 'women and children first'. Let's take a look at survival rates between the genders,
prop.table(table(train$Sex, train$Survived), 1)
##
## 0 1
## female 0.2579618 0.7420382
## male 0.8110919 0.1889081
Looks like most of the women survived while most of the men died. Let's alter my last prediction and assume that only the men died.
test$Survived <- 0
test$Survived[test$Sex == 'female'] <- 1
# write to disk
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = 'diemendie.csv', row.names = FALSE)
0.76555 isn't that bad, but I'm still in the bottom 20th percentile.
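As a rough sanity check (this is accuracy against the training labels, not the leaderboard score), we can score the same women-survive rule on the data we already have,
# accuracy of the "all women survive, all men die" rule on the training set
mean(ifelse(train$Sex == 'female', 1, 0) == train$Survived)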
Let’s see if being a minor has any effect on chances of survival.
summary(train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
# Age as a categorical variable
train$Child <- 0
train$Child[train$Age < 18] <- 1
#aggregate(Survived ~ Child + Sex, data = train, FUN = sum)
#aggregate(Survived ~ Child + Sex, data = train, FUN = length)
aggregate(Survived ~ Child + Sex, data = train, FUN = function(x) {sum(x)/length(x)})
## Child Sex Survived
## 1 0 female 0.7528958
## 2 1 female 0.6909091
## 3 0 male 0.1657033
## 4 1 male 0.3965517
That's tragic; it looks like the chances of survival still weren't bright for the boys. Let's take a look at how passengers with expensive fares fared in this catastrophe,
train$Fare2 <- cut(train$Fare, c(-Inf, 10, 20, 30, Inf), labels = c("<10", "10-20", "20-30", "30+"))
aggregate(Survived ~ Child + Fare2 + Pclass + Sex, data = train, FUN = function(x) {sum(x) / length(x)})
## Child Fare2 Pclass Sex Survived
## 1 0 20-30 1 female 0.85714286
## 2 0 30+ 1 female 0.98734177
## 3 1 30+ 1 female 0.87500000
## 4 0 10-20 2 female 0.90625000
## 5 1 10-20 2 female 1.00000000
## 6 0 20-30 2 female 0.88461538
## 7 1 20-30 2 female 1.00000000
## 8 0 30+ 2 female 1.00000000
## 9 1 30+ 2 female 1.00000000
## 10 0 <10 3 female 0.56140351
## 11 1 <10 3 female 0.85714286
## 12 0 10-20 3 female 0.50000000
## 13 1 10-20 3 female 0.73333333
## 14 0 20-30 3 female 0.40000000
## 15 1 20-30 3 female 0.16666667
## 16 0 30+ 3 female 0.11111111
## 17 1 30+ 3 female 0.14285714
## 18 0 <10 1 male 0.00000000
## 19 0 20-30 1 male 0.44117647
## 20 0 30+ 1 male 0.33333333
## 21 1 30+ 1 male 1.00000000
## 22 0 <10 2 male 0.00000000
## 23 0 10-20 2 male 0.11864407
## 24 1 10-20 2 male 0.75000000
## 25 0 20-30 2 male 0.04761905
## 26 1 20-30 2 male 0.75000000
## 27 0 30+ 2 male 0.00000000
## 28 1 30+ 2 male 1.00000000
## 29 0 <10 3 male 0.10931174
## 30 1 <10 3 male 0.15384615
## 31 0 10-20 3 male 0.12903226
## 32 1 10-20 3 male 0.71428571
## 33 0 20-30 3 male 0.07142857
## 34 1 20-30 3 male 0.20000000
## 35 0 30+ 3 male 0.41666667
## 36 1 30+ 3 male 0.07692308
It looks like Class 3 females who paid more than $20 didn't do well. I'll mark them as not surviving, along with the men.
test$Survived <- 0
test$Survived[test$Sex == 'female'] <- 1
test$Survived[test$Sex == 'female' & test$Pclass == 3 & test$Fare > 20] <- 0
# write to disk
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = 'allmensomewomendie.csv', row.names = FALSE)
0.77990 this time! Top 45th percentile.
Trevor gave an introduction to decision trees. I'm basically going to automate the process I've been doing so far using an efficient algorithm. To build the model,
library(rpart)   # decision trees
library(rattle)  # fancyRpartPlot(); the tutorial also loads rpart.plot and RColorBrewer

fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
data = train,
method = "class")
plot(fit)
text(fit)
fancyRpartPlot(fit)
As we can see, the process has been automated. Note that a decision tree algorithm is essentially greedy; it doesn't always lead to the globally optimal solution. I'll use predict() to arrive at the results without messing around with subsetting or overwriting values.
Prediction <- predict(fit, test, type = "class")
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = 'decisiontree.csv', row.names = FALSE)
This was okay.
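If I wanted to steer how far the tree grows, rpart takes a control argument; the values below are just an illustration on my part, not something from the tutorial.
# sketch: loosen the complexity and split thresholds to grow a deeper tree
# (beware: very small cp values tend to overfit the training data)
fit2 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
              data = train,
              method = "class",
              control = rpart.control(minsplit = 2, cp = 0.005))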
Feature Engineering brings back the importance of the human element in machine learning. It involves refining existing pieces of information or discovering new ones.
So far, we have depended on the variables provided with the dataset as-is. We haven't yet taken a look at Names.
train <- read.csv('train.csv')
test <- read.csv('test.csv')
train$Name[1]
## [1] Braund, Mr. Owen Harris
## 891 Levels: Abbing, Mr. Anthony ... Zimmerman, Mr. Leo
Each name has a title, a first name, a middle name and a last name. The title and the last name can give us a different insight into the data. Let's explore them. Before I do that, I am going to merge the training and test datasets because Trevor does the same. I am not sure why yet.
test$Survived <- NA
combi <- rbind(train, test)
combi$Name <- as.character(combi$Name)
combi$Name[1]
## [1] "Braund, Mr. Owen Harris"
names <- strsplit(combi$Name, split = "[,.\\s]+", perl = TRUE)
#head(data.frame(names))
combi$Title <- sapply(names, "[", 2) # gets the second element from every vector in the list of vectors
table(combi$Title)
##
## Billiard Brito Capt Carlo Col Cruyssen
## 3 1 1 2 4 1
## der Don Dr Gordon Impe Jonkheer
## 1 1 8 2 3 1
## Khalil Major Master Melkebeke Messemaeker Miss
## 1 2 59 1 2 256
## Mlle Mme Mr Mrs Ms Mulder
## 2 1 736 191 2 1
## Palmquist Pelsmaeker Planke Rev Shawah Steen
## 1 1 4 8 1 1
## the Velde Walle y
## 1 1 1 8
This regex does not handle the strings inside parentheses. However, that is not our concern, as we are only after the second token (the title); the surname is the first.
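If we did want to get the bracketed parts of the names out of the way first, a simple substitution would do it (a sketch on my part, not part of the tutorial),
# drop any parenthesised part of the name before splitting
cleaned <- gsub("\\s*\\([^)]*\\)", "", combi$Name, perl = TRUE)
head(cleaned)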
Alternatively, we can take the approach Trevor takes,
combi$Title <- sapply(combi$Name, FUN=function(x) {strsplit(x, split = "[.,]")[[1]][2]})
combi$Title <- sub(' ', '', combi$Title)
table(combi$Title)
##
## Capt Col Don Dona Dr
## 1 4 1 1 8
## Jonkheer Lady Major Master Miss
## 1 1 2 61 260
## Mlle Mme Mr Mrs Ms
## 2 1 757 197 2
## Rev Sir the Countess
## 8 1 1
The latter approach does a little better.
combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
combi$Title[combi$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'
combi$Title <- factor(combi$Title)
Another variable we can derive is the family size, using SibSp and Parch,
combi$FamilySize <- combi$SibSp + combi$Parch + 1
Extract surnames and build a FamilyID,
combi$Surname <- sapply(combi$Name, FUN=function(x) {strsplit(x, split = "[.,]")[[1]][1]})
combi$FamilyID <- paste(as.character(combi$FamilySize), combi$Surname, sep = '')
At this point, I’m not sure what’s going on in which variables. I’ll keep on going though. Maybe it’s like making your way to the light switch in a dark room.
Anything more? Well we just thought about a large family having issues getting to lifeboats together, but maybe specific families had more trouble than others? We could try to extract the Surname of the passengers and group them to find families, but a common last name such as Johnson might have a few extra non-related people aboard. In fact there are three Johnsons in a family with size 3, and another three probably unrelated Johnsons all travelling solo.
This is where I was lost, just to note.
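To see what Trevor means about the Johnsons, we can just look at those rows (a quick inspection using the columns created above),
# the three related Johnsons share the FamilyID "3Johnson"; the solo ones get "1Johnson"
combi[combi$Surname == 'Johnson', c('Name', 'FamilySize', 'FamilyID')]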
combi$FamilyID[combi$FamilySize <= 2] <- "Small"
table(combi$FamilyID)
##
## 11Sage 3Abbott 3Appleton 3Beckwith
## 11 3 1 2
## 3Boulos 3Bourke 3Brown 3Caldwell
## 3 3 4 3
## 3Christy 3Collyer 3Compton 3Cornell
## 2 3 3 1
## 3Coutts 3Crosby 3Danbom 3Davies
## 3 3 3 5
## 3Dodge 3Douglas 3Drew 3Elias
## 3 1 3 3
## 3Frauenthal 3Frolicher 3Frolicher-Stehli 3Goldsmith
## 1 1 2 3
## 3Gustafsson 3Hamalainen 3Hansen 3Hart
## 2 2 1 3
## 3Hays 3Hickman 3Hiltunen 3Hirvonen
## 2 3 1 1
## 3Jefferys 3Johnson 3Kink 3Kink-Heilmann
## 2 3 2 2
## 3Klasen 3Lahtinen 3Mallet 3McCoy
## 3 2 3 3
## 3Minahan 3Moubarek 3Nakid 3Navratil
## 1 3 3 3
## 3Newell 3Newsom 3Nicholls 3Peacock
## 1 1 1 3
## 3Peter 3Quick 3Richards 3Rosblom
## 3 3 2 3
## 3Samaan 3Sandstrom 3Silven 3Spedden
## 3 3 1 3
## 3Strom 3Taussig 3Thayer 3Thomas
## 1 3 3 1
## 3Touma 3van Billiard 3Van Impe 3Vander Planke
## 3 3 3 2
## 3Wells 3Wick 3Widener 4Allison
## 3 3 3 4
## 4Backstrom 4Baclini 4Becker 4Carter
## 1 4 4 4
## 4Davidson 4Dean 4Herman 4Hocking
## 1 4 4 2
## 4Jacobsohn 4Johnston 4Laroche 4Renouf
## 1 4 4 1
## 4Vander Planke 4West 5Ford 5Hocking
## 1 4 5 1
## 5Kink-Heilmann 5Lefebre 5Palsson 5Ryerson
## 1 5 5 5
## 6Fortune 6Panula 6Rice 6Richards
## 6 6 6 1
## 6Skoog 7Andersson 7Asplund 8Goodwin
## 6 9 7 8
## Small
## 1025
There is a problem that shows up at this point: something is still wrong with the FamilyID values.
famIDs <- data.frame(table(combi$FamilyID))
I think the trouble is that, after the statement combi$FamilyID[combi$FamilySize <= 2] <- "Small", we shouldn't be seeing any FamilyID shared by only one or two passengers. However, we do.
famIDs <- famIDs[famIDs$Freq <= 2, ]
This gives us the list of all the families that look small but didn't get converted, so we overwrite them too,
combi$FamilyID[combi$FamilyID %in% famIDs$Var1] <- 'Small'
combi$FamilyID <- factor(combi$FamilyID)
Alright. Trevor also answers why we merged the two datasets: it keeps the model consistent, in a way I don't completely understand right now.
We are now ready to split the test and training sets back into their original states, carrying our fancy new engineered variables with them. The nicest part of what we just did is how the factors are treated in R. Behind the scenes, factors are basically stored as integers, but masked with their text names for us to look at. If you create the above factors on the isolated test and train sets separately, there is no guarantee that both groups exist in both sets.
For instance, the family "3Johnson" previously discussed does not exist in the test set. We know that all three of them survive from the training set data. If we had built our factors in isolation, there would be no factor "3Johnson" for the test set. This would upset any machine learning model because the factors between the training set used to build the model and the test set it is asked to predict for are not consistent, i.e. R will throw errors at you if you try.
Because we built the factors on a single dataframe, and then split it apart after we built them, R will give all factor levels to both new dataframes, even if the factor doesn’t exist in one. It will still have the factor level, but no actual observations of it in the set. Neat trick right? Let me assure you that manually updating factor levels is a pain.
train <- combi[1:891, ]
test <- combi[892:1309, ]
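A quick check that the factor levels really did carry over to both halves; as the explanation above says, "3Johnson" should be a level of test$FamilyID even though no test passenger carries it,
# both data frames share the full set of FamilyID levels
identical(levels(train$FamilyID), levels(test$FamilyID))
# the level exists in the test set, but no rows use it
'3Johnson' %in% levels(test$FamilyID)
sum(test$FamilyID == '3Johnson')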
Time to make predictions now; this is just like the earlier prediction.
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID,
data = train,
method = "class")
Prediction <- predict(fit, test, type = "class")
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = 'featureengineering.csv', row.names = FALSE)
Yay! This got me into the top 1000. The competition deadline is in December 2016, so I'll try to reach a sub-500 rank by then.
If we grow a lot of decision trees on randomized samples from the dataset, each using a different subset of variables, we get a forest. To predict a result, we ask this forest to vote. This is a random forest.
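The "randomized samples" part is just bootstrapping: rows are drawn with replacement, so some appear more than once and some not at all. A toy illustration,
# bootstrap sample of ten row indices; duplicates and gaps are expected
sample(1:10, replace = TRUE)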
Before we take the problem head on, let's massage the data. The random forest algorithm in R will not tolerate NA values. We could replace them with the mean, but about 20% of the values in combi$Age are missing. Instead, we will grow a decision tree to predict the missing ages.
Agefit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize,
data = combi[!is.na(combi$Age), ],
method = 'anova')
combi$Age[is.na(combi$Age)] <- predict(Agefit, combi[is.na(combi$Age), ])
Take a look at the statistics before we lift off,
summary(combi)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.3838 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
## NA's :418
## Sex Age SibSp Parch
## female:466 Min. : 0.17 Min. :0.0000 Min. :0.000
## male :843 1st Qu.:22.00 1st Qu.:0.0000 1st Qu.:0.000
## Median :28.86 Median :0.0000 Median :0.000
## Mean :29.70 Mean :0.4989 Mean :0.385
## 3rd Qu.:36.50 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
##
## Ticket Fare Cabin Embarked
## CA. 2343: 11 Min. : 0.000 :1014 : 2
## 1601 : 8 1st Qu.: 7.896 C23 C25 C27 : 6 C:270
## CA 2144 : 8 Median : 14.454 B57 B59 B63 B66: 5 Q:123
## 3101295 : 7 Mean : 33.295 G6 : 5 S:914
## 347077 : 7 3rd Qu.: 31.275 B96 B98 : 4
## 347082 : 7 Max. :512.329 C22 C26 : 4
## (Other) :1261 NA's :1 (Other) : 271
## Title FamilySize Surname FamilyID
## Mr :757 Min. : 1.000 Length:1309 Small :1074
## Miss :260 1st Qu.: 1.000 Class :character 11Sage : 11
## Mrs :197 Median : 1.000 Mode :character 7Andersson: 9
## Master : 61 Mean : 1.884 8Goodwin : 8
## Dr : 8 3rd Qu.: 2.000 7Asplund : 7
## Rev : 8 Max. :11.000 6Fortune : 6
## (Other): 18 (Other) : 194
#combi$Embarked <- as.character(combi$Embarked)
combi$Embarked[which(combi$Embarked == '')] <- 'S'
combi$Embarked <- factor(combi$Embarked)
combi$Fare[which(is.na(combi$Fare))] <- median(combi$Fare, na.rm = TRUE)
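A quick check that Age and Fare are now NA-free (Survived is of course still NA for the 418 test rows),
sum(is.na(combi$Age))
sum(is.na(combi$Fare))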
I've got rid of the NA values in the predictors. There's another problem here: the random forest implementation in R cannot handle factor variables with more than 32 levels, and combi$FamilyID has roughly double that many. To cut them down, I will fold the three-member families into 'Small' as well.
combi$FamilyID2 <- combi$FamilyID
combi$FamilyID2 <- as.character(combi$FamilyID2)
combi$FamilyID2[combi$FamilySize <= 3] <- 'Small'
combi$FamilyID2 <- factor(combi$FamilyID2)
train <- combi[1:891,]
test <- combi[892:1309,]
Let’s begin by setting a seed to ensure reproducibility
set.seed(42)
library(randomForest)

fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID2,
data = train,
importance = TRUE,
ntree = 2000)
varImpPlot(fit)
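The same importances can also be printed as numbers rather than plotted (same fitted forest),
# mean decrease in accuracy and in Gini impurity for each predictor
importance(fit)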
Generating the prediction is just like it has always been,
prediction <- predict(fit, test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
write.csv(submit, file = "randomForest.csv", row.names = FALSE)
This was a bummer. Our score didn’t improve at all!
Somewhere in this, a statistical test is involved; let's roll before we dive into the specifics, using library(party).
library(party)

set.seed(42)
fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID,
data = train,
controls = cforest_unbiased(ntree=2000, mtry=3))
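party also has a varimp() function for cforest objects, if you want to see which variables the conditional inference forest leans on (it can take a while with 2000 trees),
# variable importance for the conditional inference forest (slow on large forests)
varimp(fit)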
To make the prediction, same drill.
prediction <- predict(fit, test, OOB = TRUE, type = 'response')
submit <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
write.csv(submit, file = 'conditionalInferenceTrees.csv', row.names = FALSE)
This gets me into the top 500 of the leaderboard with a score of 0.80861. Trevor leaves us at this point, so I'll have to tread the rest of the way alone.