This is my short summary of my investigation of the Titanic data set. It consists of three parts:

1 Manipulating Data

First, lets load data

train <- read.csv("files/train.csv", header = TRUE,stringsAsFactors = FALSE)
test <- read.csv("files/test.csv", header = TRUE,stringsAsFactors = FALSE)

We combine the two data sets, train and test, into one data set data_full. dplyr library provides handy command bind_rows:

data_full <- bind_rows(train,test)

Let’s check our data

str(data_full)
'data.frame':   1309 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...

survived and pclass features have 2 and 3 distinct values respectively. They better be converted into factors

data_full$Survived <- as.factor(data_full$Survived)
data_full$Pclass <- as.factor(data_full$Pclass)

pclass is an important feature. Rich passengers traveled in higher pclass survived as higher rate as can be seen from the following plot

p<-ggplot(data_full[1:891,], aes(x = Pclass, fill = Survived)) +
  geom_bar(stat = 'count',position='dodge') +
  xlab("Pclass") +
  ylab("Count") +
  labs(fill = "Survived")+theme_light()
ggplotly(p)

2 Feature engineering

One powerful desired feature is title. It can be extracted from existing feature name. Each name in the full data set has a title in it. For example, the first passenger Owen Haris was titled as ‘Mr.’:

data_full$Name[1]
[1] "Braund, Mr. Owen Harris"
Braund, Mr. Owen Harris

So, let’s extract titles from name using regular expressions:

data_full$Title <- gsub('(.*, )|(\\..*)','',data_full$Name)
table(data_full$Sex,data_full$Title)
        
         Capt Col Don Dona  Dr Jonkheer Lady Major Master Miss Mlle Mme
  female    0   0   0    1   1        0    1     0      0  260    2   1
  male      1   4   1    0   7        1    0     2     61    0    0   0
        
          Mr Mrs  Ms Rev Sir the Countess
  female   0 197   2   0   0            1
  male   757   0   0   8   1            0

Too many titles. Let’s group them

data_full[data_full$Title %in% c("Dona", "the Countess"),'Title'] <- "Lady"
data_full[data_full$Title %in% c("Ms", "Mlle"),'Title'] <- "Miss"
data_full[data_full$Title == "Mme",'Title'] <- "Mrs"
data_full[data_full$Title %in% c("Jonkheer", "Don"),'Title'] <- "Sir"
data_full[data_full$Title %in% c("Col", "Capt", "Major"),'Title'] <- "Officer"

data_full[data_full$Title == "Lady",'Title'] <- "Mrs"
data_full[data_full$Title %in% c("Rev", "Sir", "Officer"),'Title'] <- "Mr"

The title Dr is a bit tricky. There is one female with this title. Let’s fix this:

table(data_full$Sex,data_full$Title)
        
          Dr Master Miss  Mr Mrs
  female   1      0  264   0 201
  male     7     61    0 775   0
data_full[data_full$Sex=='female' & data_full$Title=='Dr','title']<-'Mrs'
data_full[data_full$Title=='Dr','Title'] <- 'Mr'
table(data_full$Sex,data_full$Title)
        
         Master Miss  Mr Mrs
  female      0  264   1 201
  male       61    0 782   0
data_full$Title <- as.factor(data_full$Title)
p<-ggplot(data_full[1:891,], aes(x = Title, fill = Survived)) +
  geom_bar(stat='count') + facet_grid(.~Pclass)+
  xlab("Title") +
  ylab("Count") +
  labs(fill = "Survived")+theme_light()
ggplotly(p)
mosaicplot(table(data_full$Title,data_full$Survived),main='title by survival')

mosaicplot(table(data_full$Pclass,data_full$Survived),main='pclass by survival')

Main conclusions so far:

One more feature which appeared to be useful is family size. It can be extracted by adding sibsp + parch + 1. We further bin the family size into 3 categories:

data_full$familysize <- data_full$SibSp + data_full$Parch + 1

data_full$fsize[data_full$familysize==1] <- 'single'
data_full$fsize[data_full$familysize <= 4 & data_full$familysize > 1] <- 'small'
data_full$fsize[data_full$familysize >= 5] <- 'large'
data_full$fsize <- as.factor(data_full$fsize)
mosaicplot(table(data_full$fsize,data_full$Survived),main='famility size by survival')

There is a survival penalty for traveling with a large family.

3 Decision Tree

Random Forests are more powerful than single trees, but single trees have the advantage of being easier to understand.

X <- data_full[1:891,c('Pclass','fsize','Title')]
y <- as.factor(data_full[1:891,'Survived'])

folds <- createMultiFolds(y, k = 3, times = 10)
ctrl  <- trainControl(method = "repeatedcv", number = 3, repeats = 10,
                       index = folds)

rpart.cv <- train(x = X, y = y, method = "rpart", tuneLength = 10, 
                    trControl = ctrl)


prp(rpart.cv$finalModel,type = 0, extra = 1, under = TRUE) # display the results 

LS0tCnRpdGxlOiAiTWFjaGluZSBMZWFybmluZyBmcm9tIERpc2FzdGVyOiBUaXRhbmljIgphdXRob3I6ICJPbGVrc2FuZHIgRmlhbGtvIgpkYXRlOiAiYHIgU3lzLkRhdGUoKWAiCm91dHB1dDogaHRtbF9ub3RlYm9vawphbHdheXNfYWxsb3dfaHRtbDogeWVzCi0tLQoKVGhpcyBpcyBteSBzaG9ydCBzdW1tYXJ5IG9mIG15IGludmVzdGlnYXRpb24gb2YgdGhlIFRpdGFuaWMgZGF0YSBzZXQuCkl0IGNvbnNpc3RzIG9mIHRocmVlIHBhcnRzOgoKLSBEYXRhIGxvYWRpbmcgYW5kIHByZXByb2Nlc3NpbmcgCi0gRmVhdHVyZSBlbmdpbmVlcmluZwotIE1hY2hpbmUgbGVhcm5pbmcgd2l0aCBEZWNpc2lvbiBUcmVlCgojIyMgMSBNYW5pcHVsYXRpbmcgRGF0YQpGaXJzdCwgbGV0cyBsb2FkIGRhdGEKYGBge3IgaW5jbHVkZT1GQUxTRX0KbGlicmFyeShkcGx5cikKYGBgCgoKYGBge3J9CnRyYWluIDwtIHJlYWQuY3N2KCJmaWxlcy90cmFpbi5jc3YiLCBoZWFkZXIgPSBUUlVFLHN0cmluZ3NBc0ZhY3RvcnMgPSBGQUxTRSkKdGVzdCA8LSByZWFkLmNzdigiZmlsZXMvdGVzdC5jc3YiLCBoZWFkZXIgPSBUUlVFLHN0cmluZ3NBc0ZhY3RvcnMgPSBGQUxTRSkKYGBgCgpXZSBjb21iaW5lIHRoZSB0d28gZGF0YSBzZXRzLCBgdHJhaW5gIGFuZCBgdGVzdGAsIGludG8gb25lIGRhdGEgc2V0IGBkYXRhX2Z1bGxgLgpgZHBseXJgIGxpYnJhcnkgcHJvdmlkZXMgaGFuZHkgY29tbWFuZCBgYmluZF9yb3dzYDogIAoKYGBge3J9CmRhdGFfZnVsbCA8LSBiaW5kX3Jvd3ModHJhaW4sdGVzdCkKCmBgYAoKTGV0J3MgY2hlY2sgb3VyIGRhdGEKCmBgYHtyfQpzdHIoZGF0YV9mdWxsKQoKYGBgCgpgc3Vydml2ZWRgIGFuZCBgcGNsYXNzYCBmZWF0dXJlcyBoYXZlIDIgYW5kIDMgZGlzdGluY3QgdmFsdWVzIHJlc3BlY3RpdmVseS4gVGhleSBiZXR0ZXIgYmUgY29udmVydGVkIGludG8gZmFjdG9ycwoKYGBge3J9CmRhdGFfZnVsbCRTdXJ2aXZlZCA8LSBhcy5mYWN0b3IoZGF0YV9mdWxsJFN1cnZpdmVkKQpkYXRhX2Z1bGwkUGNsYXNzIDwtIGFzLmZhY3RvcihkYXRhX2Z1bGwkUGNsYXNzKQpgYGAKCmBwY2xhc3NgIGlzIGFuIGltcG9ydGFudCBmZWF0dXJlLiBSaWNoIHBhc3NlbmdlcnMgdHJhdmVsZWQgaW4gaGlnaGVyIGBwY2xhc3NgIHN1cnZpdmVkIGFzIGhpZ2hlciByYXRlIGFzIGNhbiBiZSBzZWVuIGZyb20gdGhlIGZvbGxvd2luZyBwbG90CmBgYHtyIGluY2x1ZGU9RkFMU0V9CmxpYnJhcnkoZ2dwbG90MikKbGlicmFyeShwbG90bHkpCmBgYAoKYGBge3J9CnA8LWdncGxvdChkYXRhX2Z1bGxbMTo4OTEsXSwgYWVzKHggPSBQY2xhc3MsIGZpbGwgPSBTdXJ2aXZlZCkpICsKICBnZW9tX2JhcihzdGF0ID0gJ2NvdW50Jyxwb3NpdGlvbj0nZG9kZ2UnKSArCiAgeGxhYigiUGNsYXNzIikgKwogIHlsYWIoIkNvdW50IikgKwogIGxhYnMoZmlsbCA9ICJTdXJ2aXZlZCIpK3RoZW1lX2xpZ2h0KCkKZ2dwbG90bHkocCkKYGBgCgojIyMgMiBGZWF0dXJlIGVuZ2luZWVyaW5nCgpPbmUgcG93ZXJmdWwgZGVzaXJlZCBmZWF0dXJlIGlzIGB0aXRsZWAuIEl0IGNhbiBiZSBleHRyYWN0ZWQgZnJvbSBleGlzdGluZyBmZWF0dXJlIGBuYW1lYC4gRWFjaCBuYW1lIGluIHRoZSBmdWxsIGRhdGEgc2V0IGhhcyBhIHRpdGxlIGluIGl0LiBGb3IgZXhhbXBsZSwgdGhlIGZpcnN0IHBhc3NlbmdlciBgT3dlbiBIYXJpc2Agd2FzIHRpdGxlZCBhcyAnTXIuJzoKYGBge3IgaW5jbHVkZT1GQUxTRX0KbGlicmFyeShzdHJpbmdpKQpgYGAKCmBgYHtyfQpkYXRhX2Z1bGwkTmFtZVsxXQpgYGAKClNvLCBsZXQncyBleHRyYWN0IHRpdGxlcyBmcm9tIGBuYW1lYCB1c2luZyByZWd1bGFyIGV4cHJlc3Npb25zOgpgYGB7cn0KZGF0YV9mdWxsJFRpdGxlIDwtIGdzdWIoJyguKiwgKXwoXFwuLiopJywnJyxkYXRhX2Z1bGwkTmFtZSkKdGFibGUoZGF0YV9mdWxsJFNleCxkYXRhX2Z1bGwkVGl0bGUpCmBgYAoKVG9vIG1hbnkgdGl0bGVzLiBMZXQncyBncm91cCB0aGVtCmBgYHtyfQpkYXRhX2Z1bGxbZGF0YV9mdWxsJFRpdGxlICVpbiUgYygiRG9uYSIsICJ0aGUgQ291bnRlc3MiKSwnVGl0bGUnXSA8LSAiTGFkeSIKZGF0YV9mdWxsW2RhdGFfZnVsbCRUaXRsZSAlaW4lIGMoIk1zIiwgIk1sbGUiKSwnVGl0bGUnXSA8LSAiTWlzcyIKZGF0YV9mdWxsW2RhdGFfZnVsbCRUaXRsZSA9PSAiTW1lIiwnVGl0bGUnXSA8LSAiTXJzIgpkYXRhX2Z1bGxbZGF0YV9mdWxsJFRpdGxlICVpbiUgYygiSm9ua2hlZXIiLCAiRG9uIiksJ1RpdGxlJ10gPC0gIlNpciIKZGF0YV9mdWxsW2RhdGFfZnVsbCRUaXRsZSAlaW4lIGMoIkNvbCIsICJDYXB0IiwgIk1ham9yIiksJ1RpdGxlJ10gPC0gIk9mZmljZXIiCgpkYXRhX2Z1bGxbZGF0YV9mdWxsJFRpdGxlID09ICJMYWR5IiwnVGl0bGUnXSA8LSAiTXJzIgpkYXRhX2Z1bGxbZGF0YV9mdWxsJFRpdGxlICVpbiUgYygiUmV2IiwgIlNpciIsICJPZmZpY2VyIiksJ1RpdGxlJ10gPC0gIk1yIgpgYGAKClRoZSB0aXRsZSBgRHJgIGlzIGEgYml0IHRyaWNreS4gVGhlcmUgaXMgb25lIGZlbWFsZSB3aXRoIHRoaXMgdGl0bGUuIExldCdzIGZpeCB0aGlzOgoKYGBge3J9CnRhYmxlKGRhdGFfZnVsbCRTZXgsZGF0YV9mdWxsJFRpdGxlKQpkYXRhX2Z1bGxbZGF0YV9mdWxsJFNleD09J2ZlbWFsZScgJiBkYXRhX2Z1bGwkVGl0bGU9PSdEcicsJ3RpdGxlJ108LSdNcnMnCmRhdGFfZnVsbFtkYXRhX2Z1bGwkVGl0bGU9PSdEcicsJ1RpdGxlJ10gPC0gJ01yJwp0YWJsZShkYXRhX2Z1bGwkU2V4LGRhdGFfZnVsbCRUaXRsZSkKZGF0YV9mdWxsJFRpdGxlIDwtIGFzLmZhY3RvcihkYXRhX2Z1bGwkVGl0bGUpCmBgYAoKYGBge3Igd2FybmluZz1GQUxTRX0KcDwtZ2dwbG90KGRhdGFfZnVsbFsxOjg5MSxdLCBhZXMoeCA9IFRpdGxlLCBmaWxsID0gU3Vydml2ZWQpKSArCiAgZ2VvbV9iYXIoc3RhdD0nY291bnQnKSArIGZhY2V0X2dyaWQoLn5QY2xhc3MpKwogIHhsYWIoIlRpdGxlIikgKwogIHlsYWIoIkNvdW50IikgKwogIGxhYnMoZmlsbCA9ICJTdXJ2aXZlZCIpK3RoZW1lX2xpZ2h0KCkKZ2dwbG90bHkocCkKYGBgCgpgYGB7cn0KbW9zYWljcGxvdCh0YWJsZShkYXRhX2Z1bGwkVGl0bGUsZGF0YV9mdWxsJFN1cnZpdmVkKSxtYWluPSd0aXRsZSBieSBzdXJ2aXZhbCcpCmBgYAoKYGBge3J9Cm1vc2FpY3Bsb3QodGFibGUoZGF0YV9mdWxsJFBjbGFzcyxkYXRhX2Z1bGwkU3Vydml2ZWQpLG1haW49J3BjbGFzcyBieSBzdXJ2aXZhbCcpCmBgYAoKTWFpbiBjb25jbHVzaW9ucyBzbyBmYXI6CgotIHBlb3BsZSB3aXRoIHRoZSB0aXRsZSAnTXInIGhhZCBoaWdoIHByb2JhYmlsaXR5IHRvIHBlcmlzaC4KLSBwZW9wbGUgZnJvbSB0aGUgM2QgY2xhc3MgaGFkIGhpZ2ggcHJvYmFiaWxpdHkgdG8gcGVyaXNoLgoKT25lIG1vcmUgZmVhdHVyZSB3aGljaCBhcHBlYXJlZCB0byBiZSB1c2VmdWwgaXMgYGZhbWlseSBzaXplYC4gSXQgY2FuIGJlIGV4dHJhY3RlZCBieSBhZGRpbmcgYHNpYnNwYCArIGBwYXJjaGAgKyAxLiBXZSBmdXJ0aGVyIGJpbiB0aGUgZmFtaWx5IHNpemUgaW50byAzIGNhdGVnb3JpZXM6CgpgYGB7cn0KZGF0YV9mdWxsJGZhbWlseXNpemUgPC0gZGF0YV9mdWxsJFNpYlNwICsgZGF0YV9mdWxsJFBhcmNoICsgMQoKZGF0YV9mdWxsJGZzaXplW2RhdGFfZnVsbCRmYW1pbHlzaXplPT0xXSA8LSAnc2luZ2xlJwpkYXRhX2Z1bGwkZnNpemVbZGF0YV9mdWxsJGZhbWlseXNpemUgPD0gNCAmIGRhdGFfZnVsbCRmYW1pbHlzaXplID4gMV0gPC0gJ3NtYWxsJwpkYXRhX2Z1bGwkZnNpemVbZGF0YV9mdWxsJGZhbWlseXNpemUgPj0gNV0gPC0gJ2xhcmdlJwpkYXRhX2Z1bGwkZnNpemUgPC0gYXMuZmFjdG9yKGRhdGFfZnVsbCRmc2l6ZSkKbW9zYWljcGxvdCh0YWJsZShkYXRhX2Z1bGwkZnNpemUsZGF0YV9mdWxsJFN1cnZpdmVkKSxtYWluPSdmYW1pbGl0eSBzaXplIGJ5IHN1cnZpdmFsJykKYGBgCgpUaGVyZSBpcyBhIHN1cnZpdmFsIHBlbmFsdHkgZm9yIHRyYXZlbGluZyB3aXRoIGEgbGFyZ2UgZmFtaWx5LgoKIyMjIDMgRGVjaXNpb24gVHJlZQoKUmFuZG9tIEZvcmVzdHMgYXJlIG1vcmUgcG93ZXJmdWwgdGhhbiBzaW5nbGUgdHJlZXMsCmJ1dCBzaW5nbGUgdHJlZXMgaGF2ZSB0aGUgYWR2YW50YWdlIG9mIGJlaW5nIGVhc2llciB0byB1bmRlcnN0YW5kLgoKYGBge3IgaW5jbHVkZT1GfQpsaWJyYXJ5KGNhcmV0KQpsaWJyYXJ5KHJwYXJ0KQpsaWJyYXJ5KHJwYXJ0LnBsb3QpCmBgYApgYGB7cn0KWCA8LSBkYXRhX2Z1bGxbMTo4OTEsYygnUGNsYXNzJywnZnNpemUnLCdUaXRsZScpXQp5IDwtIGFzLmZhY3RvcihkYXRhX2Z1bGxbMTo4OTEsJ1N1cnZpdmVkJ10pCgpmb2xkcyA8LSBjcmVhdGVNdWx0aUZvbGRzKHksIGsgPSAzLCB0aW1lcyA9IDEwKQpjdHJsICA8LSB0cmFpbkNvbnRyb2wobWV0aG9kID0gInJlcGVhdGVkY3YiLCBudW1iZXIgPSAzLCByZXBlYXRzID0gMTAsCiAgICAgICAgICAgICAgICAgICAgICAgaW5kZXggPSBmb2xkcykKCnJwYXJ0LmN2IDwtIHRyYWluKHggPSBYLCB5ID0geSwgbWV0aG9kID0gInJwYXJ0IiwgdHVuZUxlbmd0aCA9IDEwLCAKICAgICAgICAgICAgICAgICAgICB0ckNvbnRyb2wgPSBjdHJsKQoKCnBycChycGFydC5jdiRmaW5hbE1vZGVsLHR5cGUgPSAwLCBleHRyYSA9IDEsIHVuZGVyID0gVFJVRSkgIyBkaXNwbGF5IHRoZSByZXN1bHRzIApgYGA=