This is my short summary of my investigation of the Titanic data set. It consists of three parts:
- Data loading and preprocessing
- Feature engineering
- Machine learning with Decision Tree
1 Manipulating Data
First, lets load data
train <- read.csv("files/train.csv", header = TRUE,stringsAsFactors = FALSE)
test <- read.csv("files/test.csv", header = TRUE,stringsAsFactors = FALSE)
We combine the two data sets, train and test, into one data set data_full. dplyr library provides handy command bind_rows:
data_full <- bind_rows(train,test)
Let’s check our data
str(data_full)
'data.frame': 1309 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
survived and pclass features have 2 and 3 distinct values respectively. They better be converted into factors
data_full$Survived <- as.factor(data_full$Survived)
data_full$Pclass <- as.factor(data_full$Pclass)
pclass is an important feature. Rich passengers traveled in higher pclass survived as higher rate as can be seen from the following plot
p<-ggplot(data_full[1:891,], aes(x = Pclass, fill = Survived)) +
geom_bar(stat = 'count',position='dodge') +
xlab("Pclass") +
ylab("Count") +
labs(fill = "Survived")+theme_light()
ggplotly(p)
2 Feature engineering
One powerful desired feature is title. It can be extracted from existing feature name. Each name in the full data set has a title in it. For example, the first passenger Owen Haris was titled as ‘Mr.’:
data_full$Name[1]
[1] "Braund, Mr. Owen Harris"
Braund, Mr. Owen Harris
So, let’s extract titles from name using regular expressions:
data_full$Title <- gsub('(.*, )|(\\..*)','',data_full$Name)
table(data_full$Sex,data_full$Title)
Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme
female 0 0 0 1 1 0 1 0 0 260 2 1
male 1 4 1 0 7 1 0 2 61 0 0 0
Mr Mrs Ms Rev Sir the Countess
female 0 197 2 0 0 1
male 757 0 0 8 1 0
Too many titles. Let’s group them
data_full[data_full$Title %in% c("Dona", "the Countess"),'Title'] <- "Lady"
data_full[data_full$Title %in% c("Ms", "Mlle"),'Title'] <- "Miss"
data_full[data_full$Title == "Mme",'Title'] <- "Mrs"
data_full[data_full$Title %in% c("Jonkheer", "Don"),'Title'] <- "Sir"
data_full[data_full$Title %in% c("Col", "Capt", "Major"),'Title'] <- "Officer"
data_full[data_full$Title == "Lady",'Title'] <- "Mrs"
data_full[data_full$Title %in% c("Rev", "Sir", "Officer"),'Title'] <- "Mr"
The title Dr is a bit tricky. There is one female with this title. Let’s fix this:
table(data_full$Sex,data_full$Title)
Dr Master Miss Mr Mrs
female 1 0 264 0 201
male 7 61 0 775 0
data_full[data_full$Sex=='female' & data_full$Title=='Dr','title']<-'Mrs'
data_full[data_full$Title=='Dr','Title'] <- 'Mr'
table(data_full$Sex,data_full$Title)
Master Miss Mr Mrs
female 0 264 1 201
male 61 0 782 0
data_full$Title <- as.factor(data_full$Title)
p<-ggplot(data_full[1:891,], aes(x = Title, fill = Survived)) +
geom_bar(stat='count') + facet_grid(.~Pclass)+
xlab("Title") +
ylab("Count") +
labs(fill = "Survived")+theme_light()
ggplotly(p)
mosaicplot(table(data_full$Title,data_full$Survived),main='title by survival')

mosaicplot(table(data_full$Pclass,data_full$Survived),main='pclass by survival')

Main conclusions so far:
- people with the title ‘Mr’ had high probability to perish.
- people from the 3d class had high probability to perish.
One more feature which appeared to be useful is family size. It can be extracted by adding sibsp + parch + 1. We further bin the family size into 3 categories:
data_full$familysize <- data_full$SibSp + data_full$Parch + 1
data_full$fsize[data_full$familysize==1] <- 'single'
data_full$fsize[data_full$familysize <= 4 & data_full$familysize > 1] <- 'small'
data_full$fsize[data_full$familysize >= 5] <- 'large'
data_full$fsize <- as.factor(data_full$fsize)
mosaicplot(table(data_full$fsize,data_full$Survived),main='famility size by survival')

There is a survival penalty for traveling with a large family.
3 Decision Tree
Random Forests are more powerful than single trees, but single trees have the advantage of being easier to understand.
X <- data_full[1:891,c('Pclass','fsize','Title')]
y <- as.factor(data_full[1:891,'Survived'])
folds <- createMultiFolds(y, k = 3, times = 10)
ctrl <- trainControl(method = "repeatedcv", number = 3, repeats = 10,
index = folds)
rpart.cv <- train(x = X, y = y, method = "rpart", tuneLength = 10,
trControl = ctrl)
prp(rpart.cv$finalModel,type = 0, extra = 1, under = TRUE) # display the results

LS0tCnRpdGxlOiAiTWFjaGluZSBMZWFybmluZyBmcm9tIERpc2FzdGVyOiBUaXRhbmljIgphdXRob3I6ICJPbGVrc2FuZHIgRmlhbGtvIgpkYXRlOiAiYHIgU3lzLkRhdGUoKWAiCm91dHB1dDogaHRtbF9ub3RlYm9vawphbHdheXNfYWxsb3dfaHRtbDogeWVzCi0tLQoKVGhpcyBpcyBteSBzaG9ydCBzdW1tYXJ5IG9mIG15IGludmVzdGlnYXRpb24gb2YgdGhlIFRpdGFuaWMgZGF0YSBzZXQuCkl0IGNvbnNpc3RzIG9mIHRocmVlIHBhcnRzOgoKLSBEYXRhIGxvYWRpbmcgYW5kIHByZXByb2Nlc3NpbmcgCi0gRmVhdHVyZSBlbmdpbmVlcmluZwotIE1hY2hpbmUgbGVhcm5pbmcgd2l0aCBEZWNpc2lvbiBUcmVlCgojIyMgMSBNYW5pcHVsYXRpbmcgRGF0YQpGaXJzdCwgbGV0cyBsb2FkIGRhdGEKYGBge3IgaW5jbHVkZT1GQUxTRX0KbGlicmFyeShkcGx5cikKYGBgCgoKYGBge3J9CnRyYWluIDwtIHJlYWQuY3N2KCJmaWxlcy90cmFpbi5jc3YiLCBoZWFkZXIgPSBUUlVFLHN0cmluZ3NBc0ZhY3RvcnMgPSBGQUxTRSkKdGVzdCA8LSByZWFkLmNzdigiZmlsZXMvdGVzdC5jc3YiLCBoZWFkZXIgPSBUUlVFLHN0cmluZ3NBc0ZhY3RvcnMgPSBGQUxTRSkKYGBgCgpXZSBjb21iaW5lIHRoZSB0d28gZGF0YSBzZXRzLCBgdHJhaW5gIGFuZCBgdGVzdGAsIGludG8gb25lIGRhdGEgc2V0IGBkYXRhX2Z1bGxgLgpgZHBseXJgIGxpYnJhcnkgcHJvdmlkZXMgaGFuZHkgY29tbWFuZCBgYmluZF9yb3dzYDogIAoKYGBge3J9CmRhdGFfZnVsbCA8LSBiaW5kX3Jvd3ModHJhaW4sdGVzdCkKCmBgYAoKTGV0J3MgY2hlY2sgb3VyIGRhdGEKCmBgYHtyfQpzdHIoZGF0YV9mdWxsKQoKYGBgCgpgc3Vydml2ZWRgIGFuZCBgcGNsYXNzYCBmZWF0dXJlcyBoYXZlIDIgYW5kIDMgZGlzdGluY3QgdmFsdWVzIHJlc3BlY3RpdmVseS4gVGhleSBiZXR0ZXIgYmUgY29udmVydGVkIGludG8gZmFjdG9ycwoKYGBge3J9CmRhdGFfZnVsbCRTdXJ2aXZlZCA8LSBhcy5mYWN0b3IoZGF0YV9mdWxsJFN1cnZpdmVkKQpkYXRhX2Z1bGwkUGNsYXNzIDwtIGFzLmZhY3RvcihkYXRhX2Z1bGwkUGNsYXNzKQpgYGAKCmBwY2xhc3NgIGlzIGFuIGltcG9ydGFudCBmZWF0dXJlLiBSaWNoIHBhc3NlbmdlcnMgdHJhdmVsZWQgaW4gaGlnaGVyIGBwY2xhc3NgIHN1cnZpdmVkIGFzIGhpZ2hlciByYXRlIGFzIGNhbiBiZSBzZWVuIGZyb20gdGhlIGZvbGxvd2luZyBwbG90CmBgYHtyIGluY2x1ZGU9RkFMU0V9CmxpYnJhcnkoZ2dwbG90MikKbGlicmFyeShwbG90bHkpCmBgYAoKYGBge3J9CnA8LWdncGxvdChkYXRhX2Z1bGxbMTo4OTEsXSwgYWVzKHggPSBQY2xhc3MsIGZpbGwgPSBTdXJ2aXZlZCkpICsKICBnZW9tX2JhcihzdGF0ID0gJ2NvdW50Jyxwb3NpdGlvbj0nZG9kZ2UnKSArCiAgeGxhYigiUGNsYXNzIikgKwogIHlsYWIoIkNvdW50IikgKwogIGxhYnMoZmlsbCA9ICJTdXJ2aXZlZCIpK3RoZW1lX2xpZ2h0KCkKZ2dwbG90bHkocCkKYGBgCgojIyMgMiBGZWF0dXJlIGVuZ2luZWVyaW5nCgpPbmUgcG93ZXJmdWwgZGVzaXJlZCBmZWF0dXJlIGlzIGB0aXRsZWAuIEl0IGNhbiBiZSBleHRyYWN0ZWQgZnJvbSBleGlzdGluZyBmZWF0dXJlIGBuYW1lYC4gRWFjaCBuYW1lIGluIHRoZSBmdWxsIGRhdGEgc2V0IGhhcyBhIHRpdGxlIGluIGl0LiBGb3IgZXhhbXBsZSwgdGhlIGZpcnN0IHBhc3NlbmdlciBgT3dlbiBIYXJpc2Agd2FzIHRpdGxlZCBhcyAnTXIuJzoKYGBge3IgaW5jbHVkZT1GQUxTRX0KbGlicmFyeShzdHJpbmdpKQpgYGAKCmBgYHtyfQpkYXRhX2Z1bGwkTmFtZVsxXQpgYGAKClNvLCBsZXQncyBleHRyYWN0IHRpdGxlcyBmcm9tIGBuYW1lYCB1c2luZyByZWd1bGFyIGV4cHJlc3Npb25zOgpgYGB7cn0KZGF0YV9mdWxsJFRpdGxlIDwtIGdzdWIoJyguKiwgKXwoXFwuLiopJywnJyxkYXRhX2Z1bGwkTmFtZSkKdGFibGUoZGF0YV9mdWxsJFNleCxkYXRhX2Z1bGwkVGl0bGUpCmBgYAoKVG9vIG1hbnkgdGl0bGVzLiBMZXQncyBncm91cCB0aGVtCmBgYHtyfQpkYXRhX2Z1bGxbZGF0YV9mdWxsJFRpdGxlICVpbiUgYygiRG9uYSIsICJ0aGUgQ291bnRlc3MiKSwnVGl0bGUnXSA8LSAiTGFkeSIKZGF0YV9mdWxsW2RhdGFfZnVsbCRUaXRsZSAlaW4lIGMoIk1zIiwgIk1sbGUiKSwnVGl0bGUnXSA8LSAiTWlzcyIKZGF0YV9mdWxsW2RhdGFfZnVsbCRUaXRsZSA9PSAiTW1lIiwnVGl0bGUnXSA8LSAiTXJzIgpkYXRhX2Z1bGxbZGF0YV9mdWxsJFRpdGxlICVpbiUgYygiSm9ua2hlZXIiLCAiRG9uIiksJ1RpdGxlJ10gPC0gIlNpciIKZGF0YV9mdWxsW2RhdGFfZnVsbCRUaXRsZSAlaW4lIGMoIkNvbCIsICJDYXB0IiwgIk1ham9yIiksJ1RpdGxlJ10gPC0gIk9mZmljZXIiCgpkYXRhX2Z1bGxbZGF0YV9mdWxsJFRpdGxlID09ICJMYWR5IiwnVGl0bGUnXSA8LSAiTXJzIgpkYXRhX2Z1bGxbZGF0YV9mdWxsJFRpdGxlICVpbiUgYygiUmV2IiwgIlNpciIsICJPZmZpY2VyIiksJ1RpdGxlJ10gPC0gIk1yIgpgYGAKClRoZSB0aXRsZSBgRHJgIGlzIGEgYml0IHRyaWNreS4gVGhlcmUgaXMgb25lIGZlbWFsZSB3aXRoIHRoaXMgdGl0bGUuIExldCdzIGZpeCB0aGlzOgoKYGBge3J9CnRhYmxlKGRhdGFfZnVsbCRTZXgsZGF0YV9mdWxsJFRpdGxlKQpkYXRhX2Z1bGxbZGF0YV9mdWxsJFNleD09J2ZlbWFsZScgJiBkYXRhX2Z1bGwkVGl0bGU9PSdEcicsJ3RpdGxlJ108LSdNcnMnCmRhdGFfZnVsbFtkYXRhX2Z1bGwkVGl0bGU9PSdEcicsJ1RpdGxlJ10gPC0gJ01yJwp0YWJsZShkYXRhX2Z1bGwkU2V4LGRhdGFfZnVsbCRUaXRsZSkKZGF0YV9mdWxsJFRpdGxlIDwtIGFzLmZhY3RvcihkYXRhX2Z1bGwkVGl0bGUpCmBgYAoKYGBge3Igd2FybmluZz1GQUxTRX0KcDwtZ2dwbG90KGRhdGFfZnVsbFsxOjg5MSxdLCBhZXMoeCA9IFRpdGxlLCBmaWxsID0gU3Vydml2ZWQpKSArCiAgZ2VvbV9iYXIoc3RhdD0nY291bnQnKSArIGZhY2V0X2dyaWQoLn5QY2xhc3MpKwogIHhsYWIoIlRpdGxlIikgKwogIHlsYWIoIkNvdW50IikgKwogIGxhYnMoZmlsbCA9ICJTdXJ2aXZlZCIpK3RoZW1lX2xpZ2h0KCkKZ2dwbG90bHkocCkKYGBgCgpgYGB7cn0KbW9zYWljcGxvdCh0YWJsZShkYXRhX2Z1bGwkVGl0bGUsZGF0YV9mdWxsJFN1cnZpdmVkKSxtYWluPSd0aXRsZSBieSBzdXJ2aXZhbCcpCmBgYAoKYGBge3J9Cm1vc2FpY3Bsb3QodGFibGUoZGF0YV9mdWxsJFBjbGFzcyxkYXRhX2Z1bGwkU3Vydml2ZWQpLG1haW49J3BjbGFzcyBieSBzdXJ2aXZhbCcpCmBgYAoKTWFpbiBjb25jbHVzaW9ucyBzbyBmYXI6CgotIHBlb3BsZSB3aXRoIHRoZSB0aXRsZSAnTXInIGhhZCBoaWdoIHByb2JhYmlsaXR5IHRvIHBlcmlzaC4KLSBwZW9wbGUgZnJvbSB0aGUgM2QgY2xhc3MgaGFkIGhpZ2ggcHJvYmFiaWxpdHkgdG8gcGVyaXNoLgoKT25lIG1vcmUgZmVhdHVyZSB3aGljaCBhcHBlYXJlZCB0byBiZSB1c2VmdWwgaXMgYGZhbWlseSBzaXplYC4gSXQgY2FuIGJlIGV4dHJhY3RlZCBieSBhZGRpbmcgYHNpYnNwYCArIGBwYXJjaGAgKyAxLiBXZSBmdXJ0aGVyIGJpbiB0aGUgZmFtaWx5IHNpemUgaW50byAzIGNhdGVnb3JpZXM6CgpgYGB7cn0KZGF0YV9mdWxsJGZhbWlseXNpemUgPC0gZGF0YV9mdWxsJFNpYlNwICsgZGF0YV9mdWxsJFBhcmNoICsgMQoKZGF0YV9mdWxsJGZzaXplW2RhdGFfZnVsbCRmYW1pbHlzaXplPT0xXSA8LSAnc2luZ2xlJwpkYXRhX2Z1bGwkZnNpemVbZGF0YV9mdWxsJGZhbWlseXNpemUgPD0gNCAmIGRhdGFfZnVsbCRmYW1pbHlzaXplID4gMV0gPC0gJ3NtYWxsJwpkYXRhX2Z1bGwkZnNpemVbZGF0YV9mdWxsJGZhbWlseXNpemUgPj0gNV0gPC0gJ2xhcmdlJwpkYXRhX2Z1bGwkZnNpemUgPC0gYXMuZmFjdG9yKGRhdGFfZnVsbCRmc2l6ZSkKbW9zYWljcGxvdCh0YWJsZShkYXRhX2Z1bGwkZnNpemUsZGF0YV9mdWxsJFN1cnZpdmVkKSxtYWluPSdmYW1pbGl0eSBzaXplIGJ5IHN1cnZpdmFsJykKYGBgCgpUaGVyZSBpcyBhIHN1cnZpdmFsIHBlbmFsdHkgZm9yIHRyYXZlbGluZyB3aXRoIGEgbGFyZ2UgZmFtaWx5LgoKIyMjIDMgRGVjaXNpb24gVHJlZQoKUmFuZG9tIEZvcmVzdHMgYXJlIG1vcmUgcG93ZXJmdWwgdGhhbiBzaW5nbGUgdHJlZXMsCmJ1dCBzaW5nbGUgdHJlZXMgaGF2ZSB0aGUgYWR2YW50YWdlIG9mIGJlaW5nIGVhc2llciB0byB1bmRlcnN0YW5kLgoKYGBge3IgaW5jbHVkZT1GfQpsaWJyYXJ5KGNhcmV0KQpsaWJyYXJ5KHJwYXJ0KQpsaWJyYXJ5KHJwYXJ0LnBsb3QpCmBgYApgYGB7cn0KWCA8LSBkYXRhX2Z1bGxbMTo4OTEsYygnUGNsYXNzJywnZnNpemUnLCdUaXRsZScpXQp5IDwtIGFzLmZhY3RvcihkYXRhX2Z1bGxbMTo4OTEsJ1N1cnZpdmVkJ10pCgpmb2xkcyA8LSBjcmVhdGVNdWx0aUZvbGRzKHksIGsgPSAzLCB0aW1lcyA9IDEwKQpjdHJsICA8LSB0cmFpbkNvbnRyb2wobWV0aG9kID0gInJlcGVhdGVkY3YiLCBudW1iZXIgPSAzLCByZXBlYXRzID0gMTAsCiAgICAgICAgICAgICAgICAgICAgICAgaW5kZXggPSBmb2xkcykKCnJwYXJ0LmN2IDwtIHRyYWluKHggPSBYLCB5ID0geSwgbWV0aG9kID0gInJwYXJ0IiwgdHVuZUxlbmd0aCA9IDEwLCAKICAgICAgICAgICAgICAgICAgICB0ckNvbnRyb2wgPSBjdHJsKQoKCnBycChycGFydC5jdiRmaW5hbE1vZGVsLHR5cGUgPSAwLCBleHRyYSA9IDEsIHVuZGVyID0gVFJVRSkgIyBkaXNwbGF5IHRoZSByZXN1bHRzIApgYGA=