1 Introduction
- 1.1 Loading datasets
2 Feature Engineering
3 Preprocessing of Features
- 3.1 Treating Missing values
- 3.2 Fare Improvement
4 Building the Model
5 Prediction

I made this RMarkdown as part of my work on the Titanic beginner challenge on Kaggle. I published it as a kernel that contains this markdown along with all the datasets, including the output dataset and the script behind this RMarkdown. Kernels open up avenues for collaboration with other Kagglers and let others fork any work they find useful.

This is my first document published on RPubs, and I will submit more as I learn and explore this vast domain of data science.

The basis of this kernel is a tutorial that I would highly recommend to anyone beginning on Kaggle with R. Trevor Stephens has created an excellent Titanic walkthrough that covers everything from setting up your R environment to reaching the upper end of the leaderboard. I have made a few changes wherever I felt the need to experiment, and you should too, because you learn when you try something new! I am a beginner myself, and with this tutorial I think I have found a direction and a decent rank in the top 3%. For this kernel, I will try to keep things as concise and easy to follow as possible.

1 Introduction

The Titanic problem is a binary classification problem: every passenger either survives or perishes. Many techniques and models can be used in this scenario, but achieving higher accuracy may require an ensemble model. I am going to use cforest, or Conditional Random Forests, to build my model. For now, it is enough to know that cforest is similar to randomForest but uses conditional inference trees as base learners instead of the CART-style decision trees that randomForest grows.

1.1 Loading datasets

We read our datasets as follows:

train <- read.csv('C:/Users/admin/Desktop/R/Kaggle/Titanic/CSVs/train.csv')
test <- read.csv('C:/Users/admin/Desktop/R/Kaggle/Titanic/CSVs/test.csv')

The read.csv() function reads each dataset into its own data frame. You will notice that the test data frame has one fewer column than train: it is missing the Survived column, which is what our job is to predict.
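A minimal sketch of the same loading step on toy files, so that it runs without the Kaggle CSVs. The temporary folder, the file contents, and the names train_toy and test_toy are assumptions made for illustration. Note that from R 4.0 onward read.csv() no longer converts text columns to factors by default, so stringsAsFactors = TRUE is passed here to get factor columns like the ones printed later in this walkthrough.

```r
# Toy stand-ins for train.csv and test.csv, written to a temporary folder
# (folder, contents, and column subset are assumptions for illustration).
data_dir <- tempdir()
write.csv(data.frame(PassengerId = 1:3, Survived = c(0, 1, 1),
                     Sex = c("male", "female", "female")),
          file.path(data_dir, "train.csv"), row.names = FALSE)
write.csv(data.frame(PassengerId = 4:5, Sex = c("male", "female")),
          file.path(data_dir, "test.csv"), row.names = FALSE)

# stringsAsFactors = TRUE keeps text columns as factors on R >= 4.0
train_toy <- read.csv(file.path(data_dir, "train.csv"), stringsAsFactors = TRUE)
test_toy  <- read.csv(file.path(data_dir, "test.csv"),  stringsAsFactors = TRUE)

# Which column does the test set lack compared to the training set?
missing_cols <- setdiff(names(train_toy), names(test_toy))
print(missing_cols)
```

Building the paths with file.path() relative to one folder variable also makes it easier to run the script on another machine than hard-coded absolute paths.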

After this, we load some useful packages.

#for decision trees
library(rpart) 

#for cforest
library(party)

2 Feature Engineering

We have been provided with a bunch of variables for prediction, but not all of them are going to be valuable to us. Some that don’t look useful may still carry importance that is not immediately obvious. Further, the model that we build will perform better if it has more useful features. This calls for what is known as Feature Engineering.

In Trevor’s own words, “Feature engineering really boils down to the human element in machine learning.” Anyone can build models from the available features using ready-made packages and make predictions, which makes modeling quite an objective task. Feature Engineering brings the subjective part to the table. In this stage, your creativity is your most important companion.

So, going ahead, we will first row-bind the two datasets to form a single, consolidated dataset. To do this, both data frames need the same columns, so we first add the missing Survived column to test.

#Adding a column of NA (missing) values in test
test$Survived <- NA     

# row binding train and test
combi  <- rbind(train, test) 

# taking a look at the structure of combi
str(combi) #or View(combi)

## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

2.1 Family Size

First, we will put our focus on the two variables SibSp and Parch. Doesn’t their sum make the size of the family of a passenger? We can quickly make this a new feature!

# 1 is for the passenger himself/herself
combi$FamilySize <- combi$SibSp + combi$Parch + 1

Feature Engineering sounds quite easy, right? Well, this was just the beginning. Let’s move forward and derive more features.

2.2 Titles

If you take a look at the Name column, you may notice that the names also carry their titles along with them. See, for example,

#to display the first 6 records in the Name column
head(combi$Name)

## [1] Braund, Mr. Owen Harris                            
## [2] Cumings, Mrs. John Bradley (Florence Briggs Thayer)
## [3] Heikkinen, Miss. Laina                             
## [4] Futrelle, Mrs. Jacques Heath (Lily May Peel)       
## [5] Allen, Mr. William Henry                           
## [6] Moran, Mr. James                                   
## 1307 Levels: Abbing, Mr. Anthony ... Zakarian, Mr. Ortin

These titles can become our second feature. What stands in the way is that they are embedded within the Name feature, so we need to extract the title from every passenger’s name. If we look closely, the titles follow a pattern: they always occur between a comma ‘,’ and a period ‘.’ A nice little function at our disposal makes separating the titles from the names quite easy. Meet our friend, strsplit():

# first converting to character type
combi$Name <- as.character(combi$Name)

strsplit(combi$Name[1], split = '[,.]')

## [[1]]
## [1] "Braund"       " Mr"          " Owen Harris"

The pattern '[,.]' is a regular expression; the square brackets define a character class that matches either a comma or a period, so the name is split at both. We can pick the title component out of the result like this:

strsplit(combi$Name[1], split = '[,.]')[[1]][2]

## [1] " Mr"

Now, to apply this function to each element of the Name feature, we use another function known as sapply():

combi$Title <- sapply(combi$Name, FUN = function(x){ strsplit(x, split = '[,.]')[[1]][2]})

Each name goes into our anonymous function as the variable x, where strsplit() operates on it. Our new feature is now almost ready. Only one more operation is required: getting rid of the white space that precedes every title. We use the sub() function for that (sub = substitute):

#substitute the first occurrence of a white space with nothing
combi$Title <- sub(' ', '', combi$Title)
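As an alternative sketch, the split-and-trim steps above can be collapsed into a single vectorized call. This is not the approach used in this walkthrough, only an illustration; the vector sample_names is made up from the names printed earlier, and the regular expression is my own assumption about the "between the first comma and the next period" pattern.

```r
# Sample names taken from the head(combi$Name) output shown earlier
sample_names <- c("Braund, Mr. Owen Harris",
                  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
                  "Heikkinen, Miss. Laina")

# Capture everything between the first comma and the next period,
# then strip the surrounding white space with trimws()
sample_titles <- trimws(sub("^[^,]*,([^.]*)\\..*$", "\\1", sample_names))
print(sample_titles)
```

Because sub() is already vectorized over its input, no sapply() call is needed in this variant.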

We now have our second feature ready! You can tabulate it to see how often each title occurs.

table(combi$Title)

## 
##         Capt          Col          Don         Dona           Dr 
##            1            4            1            1            8 
##     Jonkheer         Lady        Major       Master         Miss 
##            1            1            2           61          260 
##         Mlle          Mme           Mr          Mrs           Ms 
##            2            1          757          197            2 
##          Rev          Sir the Countess 
##            8            1            1

We see there are many more titles than we had expected. Moreover, some titles belong to only a single passenger. To make the data more manageable, we will merge some titles into a single, broader category.

# 'Mme' is first merged into 'Mlle'; rare male titles are then merged into 'Sir' and rare female titles (including 'Mlle') into 'Lady'.
# %in% tests membership: the condition is TRUE wherever the title is one of the titles
# in the given vector, which acts like a chain of logical ORs. Below is the way:
    
combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
combi$Title[combi$Title %in% c('Capt', 'Don', 'Jonkheer', 'Major', 'Sir')] <- 'Sir'
combi$Title[combi$Title %in% c('Dona', 'Lady', 'Ms', 'the Countess', 'Mlle')] <- 'Lady'

#To change into factor (datatype that contains categories)
combi$Title <- factor(combi$Title) 
table(combi$Title)

## 
##    Col     Dr   Lady Master   Miss     Mr    Mrs    Rev    Sir 
##      4      8      8     61    260    757    197      8      6
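The same merging can also be sketched with a named lookup vector, which keeps all the old-to-new pairs in one place. The vector titles_toy and the name title_map are hypothetical; the mapping itself mirrors the three assignments above.

```r
# A few example titles (made up for illustration)
titles_toy <- c("Mr", "Mme", "Capt", "Ms", "the Countess", "Miss")

# Names are the rare titles, values are the broader category they map to
title_map <- c("Mme" = "Lady", "Mlle" = "Lady", "Ms" = "Lady",
               "Dona" = "Lady", "the Countess" = "Lady",
               "Capt" = "Sir", "Don" = "Sir", "Jonkheer" = "Sir",
               "Major" = "Sir")

# Replace only the titles that appear in the lookup; leave the rest untouched
rare <- titles_toy %in% names(title_map)
titles_toy[rare] <- unname(title_map[titles_toy[rare]])
print(titles_toy)
```

Experimenting with different groupings then only requires editing title_map.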

At last, we have our newly engineered Title feature. You can go ahead and experiment with the number of categories of titles according to your own liking by submitting different outputs at the end of this walkthrough.

2.3 FamilyId

The Name feature has more to offer than just the titles. It also carries the passengers’ surnames, which are categorical in nature. We may want to extract those too, for they may make a valuable contribution to our model.

combi$Surname <- sapply(combi$Name, FUN = function(x){ strsplit(x, split = '[,.]')[[1]][1]})

We can also go one level higher by creating an artificial feature. We can combine FamilySize and Surname to create a FamilyId that is, ideally, unique for each family. Perhaps some families were more likely to survive than others because of their FamilySize. Pasting FamilySize together with the Surname also helps to tell apart unrelated families that happen to share a surname. We mark families with fewer than 3 members as ‘Small’, which saves us from having too many FamilyId categories.

# pasting the two columns together
combi$FamilyId <- paste(as.character(combi$FamilySize), combi$Surname, sep = '') 

# passengers with a family size of 2 or less are designated as Small
combi$FamilyId[combi$FamilySize <= 2] <- 'Small'

But wait: when we run table(combi$FamilyId), we see some FamilyId values whose frequency does not match the FamilySize pasted into them (the output is not printed here because it takes up too much space). For instance, ‘3Strom’ should have had at least 3 passengers with the same FamilyId, but it has only 1, which could mean that only 1 passenger with that FamilyId was present on the ship. So we still need to assign such FamilyId values to ‘Small’.

# famId, a temporary variable, will store the FamilyId with their frequencies
famId <- data.frame(table(combi$FamilyId))

# Now, famId will only contain records with frequency or number of passengers in that family 
# less than or equal to 2
famId <- famId[famId$Freq <= 2, ]

# Any FamilyId existing in famId would then be declared 'Small'
combi$FamilyId[combi$FamilyId %in% famId$Var1] <- 'Small'

# finally converting to factor
combi$FamilyId <- factor(combi$FamilyId)

table(combi$FamilyId)

## 
##        11Sage       3Abbott       3Boulos       3Bourke        3Brown 
##            11             3             3             3             4 
##     3Caldwell      3Collyer      3Compton       3Coutts       3Crosby 
##             3             3             3             3             3 
##       3Danbom       3Davies        3Dodge         3Drew        3Elias 
##             3             5             3             3             3 
##    3Goldsmith         3Hart      3Hickman      3Johnson       3Klasen 
##             3             3             3             3             3 
##       3Mallet        3McCoy     3Moubarek        3Nakid     3Navratil 
##             3             3             3             3             3 
##      3Peacock        3Peter        3Quick      3Rosblom       3Samaan 
##             3             3             3             3             3 
##    3Sandstrom      3Spedden      3Taussig       3Thayer        3Touma 
##             3             3             3             3             3 
## 3van Billiard     3Van Impe        3Wells         3Wick      3Widener 
##             3             3             3             3             3 
##      4Allison      4Baclini       4Becker       4Carter         4Dean 
##             4             4             4             4             4 
##       4Herman     4Johnston      4Laroche         4West         5Ford 
##             4             4             4             4             5 
##      5Lefebre      5Palsson      5Ryerson      6Fortune       6Panula 
##             5             5             5             6             6 
##         6Rice        6Skoog    7Andersson      7Asplund      8Goodwin 
##             6             6             9             7             8 
##         Small 
##          1074
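To see the two-stage collapse in isolation, here is a small sketch on a made-up data frame (toy_fam and its surnames are assumptions, not rows from the Titanic data). The first stage uses FamilySize, the second uses the observed frequency of each FamilyId.

```r
# Made-up passengers: one family of 3, one lone member of a "family of 3",
# one solo traveller, and one couple
toy_fam <- data.frame(
  Surname    = c("Sage", "Sage", "Sage", "Strom", "Lone", "Pair", "Pair"),
  FamilySize = c(3, 3, 3, 3, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Stage 1: paste size and surname, then collapse families of size <= 2
toy_fam$FamilyId <- paste0(toy_fam$FamilySize, toy_fam$Surname)
toy_fam$FamilyId[toy_fam$FamilySize <= 2] <- "Small"

# Stage 2: collapse ids that occur for 2 or fewer passengers
id_counts <- table(toy_fam$FamilyId)
rare_ids  <- names(id_counts[id_counts <= 2])
toy_fam$FamilyId[toy_fam$FamilyId %in% rare_ids] <- "Small"

print(table(toy_fam$FamilyId))
```

Here ‘3Strom’ survives stage 1 because its FamilySize is 3, and is only caught in stage 2 because a single passenger carries it.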

Let’s look at other potentially useful columns.

2.4 Deck

We have a column called Cabin that carries values like ‘C85’, ‘G6’, etc. You can use head(combi$Cabin, n), with n being any big enough value, to see that. You will also notice a lot of blank values, which turn into NA in the feature we derive below. We will deal with them later; what is important here is a pattern in the Cabin values. The first character repeats across cabins, and one may guess that it represents the deck of the ship. We are going to move ahead with this notion and create another feature called Deck.

#creating a column of NA values first
combi$Deck <- NA  

# split = NULL breaks the string into single characters and [1] keeps the first one; a blank Cabin yields NA
combi$Deck <- sapply(as.character(combi$Cabin), function(x){ strsplit(x, NULL)[[1]][1]}) 
combi$Deck <- factor(combi$Deck)
summary(combi$Deck)

##    A    B    C    D    E    F    G    T NA's 
##   22   65   94   46   41   21    5    1 1014

This feature too has been successfully created. We will definitely address the issue of NA values later.
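A shorter way to take the first character is substr(). The sketch below is an alternative to the strsplit() call above, shown on a made-up vector cabin_toy; unlike strsplit(), substr() returns an empty string for a blank cabin, so the blanks have to be turned into NA explicitly.

```r
# Made-up Cabin values, including a blank and a multi-cabin entry
cabin_toy <- c("C85", "", "G6", "B57 B59 B63 B66")

# First character of each value; blanks stay as "" and are then marked NA
deck_toy <- substr(cabin_toy, 1, 1)
deck_toy[deck_toy == ""] <- NA
deck_toy <- factor(deck_toy)

print(summary(deck_toy))
```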

So, now we have a total of 4 features that we have engineered ourselves from the already existing features. However, we are not yet in a position to use them to construct our predictive model. Some of the features need a little retouching before that.

3 Preprocessing of Features

Some features have NA values in them, and it is beneficial to fill in those missing values before modeling. Moreover, some features may still not contain data in a form suitable for our model. This part of the walkthrough deals with such issues.

3.1 Treating Missing values

First, we look at the features that we would actually want to use in our model and identify those that contain NA values.

summary(combi)

##   PassengerId      Survived          Pclass          Name          
##  Min.   :   1   Min.   :0.0000   Min.   :1.000   Length:1309       
##  1st Qu.: 328   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median : 655   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   : 655   Mean   :0.3838   Mean   :2.295                     
##  3rd Qu.: 982   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :1309   Max.   :1.0000   Max.   :3.000                     
##                 NA's   :418                                        
##      Sex           Age            SibSp            Parch      
##  female:466   Min.   : 0.17   Min.   :0.0000   Min.   :0.000  
##  male  :843   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000  
##               Median :28.00   Median :0.0000   Median :0.000  
##               Mean   :29.88   Mean   :0.4989   Mean   :0.385  
##               3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000  
##               Max.   :80.00   Max.   :8.0000   Max.   :9.000  
##               NA's   :263                                     
##       Ticket          Fare                     Cabin      Embarked
##  CA. 2343:  11   Min.   :  0.000                  :1014    :  2   
##  1601    :   8   1st Qu.:  7.896   C23 C25 C27    :   6   C:270   
##  CA 2144 :   8   Median : 14.454   B57 B59 B63 B66:   5   Q:123   
##  3101295 :   7   Mean   : 33.295   G6             :   5   S:914   
##  347077  :   7   3rd Qu.: 31.275   B96 B98        :   4           
##  347082  :   7   Max.   :512.329   C22 C26        :   4           
##  (Other) :1261   NA's   :1         (Other)        : 271           
##    FamilySize         Title       Surname                FamilyId   
##  Min.   : 1.000   Mr     :757   Length:1309        Small     :1074  
##  1st Qu.: 1.000   Miss   :260   Class :character   11Sage    :  11  
##  Median : 1.000   Mrs    :197   Mode  :character   7Andersson:   9  
##  Mean   : 1.884   Master : 61                      8Goodwin  :   8  
##  3rd Qu.: 2.000   Dr     :  8                      7Asplund  :   7  
##  Max.   :11.000   Lady   :  8                      6Fortune  :   6  
##                   (Other): 18                      (Other)   : 194  
##       Deck     
##  C      :  94  
##  B      :  65  
##  D      :  46  
##  E      :  41  
##  A      :  22  
##  (Other):  27  
##  NA's   :1014

So, Age, Fare, Embarked and Deck are those features (Embarked has 2 blanks, which are tantamount to NA here). We will address them one at a time. First, for Age, we will use the decision tree function rpart(), whose package we have already loaded. It can be used for both classification and regression problems; predicting Age is a regression problem.

# first fitting the decision tree only on the rows that contain a value in the Age column
fitAge <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize, data = combi[!(is.na(combi$Age)), ], method = "anova")

# now putting the predicted Age values in the rows that have NA in their Age column
combi$Age[is.na(combi$Age)] <- predict(fitAge, combi[is.na(combi$Age), ])
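The fit-on-known, predict-on-missing pattern can be tried on synthetic data first. The sketch below assumes a made-up data frame toy_age in which age depends strongly on the title; the column values, the seed, and the two blanked rows are all assumptions for illustration.

```r
library(rpart)

set.seed(1)
# Synthetic passengers: "Master" is around 5 years old, "Mr" around 35
toy_age <- data.frame(
  Title  = factor(rep(c("Master", "Mr"), each = 50)),
  Pclass = sample(1:3, 100, replace = TRUE)
)
toy_age$Age <- ifelse(toy_age$Title == "Master", 5, 35) + rnorm(100)

# Blank out two ages to imitate missing values
toy_age$Age[c(3, 60)] <- NA
age_missing <- is.na(toy_age$Age)

# Fit a regression tree on the rows with a known Age ...
fit_toy <- rpart(Age ~ Title + Pclass, data = toy_age[!age_missing, ],
                 method = "anova")

# ... and fill the gaps with its predictions
toy_age$Age[age_missing] <- predict(fit_toy, toy_age[age_missing, ])
print(toy_age$Age[c(3, 60)])
```

Storing the logical index once (age_missing) avoids recomputing is.na() after the column has already been partly filled.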

For Fare, we find that only a single NA is present. We can assign the median value of the Fare values to the NA.

# which row(s) has NA in the Fare column
which(is.na(combi$Fare))

## [1] 1044

# median value assigned to the NA value.
combi$Fare[1044] <- median(combi$Fare, na.rm = TRUE)

In the case of Embarked, we observe just two missing values. Although they are blank strings rather than NA, they are equivalent to NA and we should clean them up too. To do that, we replace these few blanks with the value that appears most frequently, “S”, which stands for Southampton.

# which row has blank as Embarked
which(combi$Embarked == '')

## [1]  62 830

combi$Embarked[c(62, 830)] <- 'S'
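If you would rather not hard-code the row numbers and the replacement value, the most frequent level can be looked up programmatically. This is a sketch on a made-up factor embarked_toy, not the code used for the submission.

```r
# Made-up Embarked values with two blanks
embarked_toy <- factor(c("S", "C", "", "S", "Q", ""))

# Find the most frequent non-blank value
is_blank    <- embarked_toy == ""
port_counts <- table(embarked_toy[!is_blank])
most_common <- names(which.max(port_counts))

# Fill the blanks and drop the now-unused "" level
embarked_toy[is_blank] <- most_common
embarked_toy <- droplevels(embarked_toy)
print(table(embarked_toy))
```

The same logical-index idea works for Fare: combi$Fare[is.na(combi$Fare)] can be assigned the median without knowing that the gap sits in row 1044.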

Now, in the case of the Deck column, we have a lot of NA values. We will again use rpart(), but this time with method = "class" instead of "anova", since predicting Deck is a classification problem rather than a regression problem like Age.

fitDeck <- rpart(Deck ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize + Age, data = combi[!(is.na(combi$Deck)), ], method = "class")

combi$Deck[is.na(combi$Deck)] <- predict(fitDeck, combi[is.na(combi$Deck), ], type = "class")

Go ahead and experiment with the rpart() function by adding or removing the predictor variables.

3.2 Fare Improvement

I got this idea from one of the commenters on Trevor’s blog. When I implemented it, my accuracy improved marginally, which was enough to jump ahead of a few hundred people on the leaderboard. I want to thank you, vruizext (Disqus name), if you ever see this. It works like this: multiple passengers can share the same ticket number and hence the same fare on the ticket. Those passengers may not be listed together in the dataset, so this may not be immediately obvious. What we would like to do here is determine the fare per passenger. To find it, we divide each Fare amount by the total number of passengers sharing the corresponding Ticket number.

For example, if there are three passengers having a Ticket value equal to 12345 and, therefore, the same Fare amount of, let’s say, 66, then the fare per passenger will be equal to 66 divided by 3, which is 22. We will apply this to every ticket that is shared by more than one passenger.

# create a temporary data frame storing Ticket values
tempo <- data.frame(table(combi$Ticket))

# store only those Ticket values that belong to more than one passenger
tempo <- tempo[tempo$Freq > 1, ]

tempo$Fare <- 0

# run a nested loop to extract the Fare values for the Ticket values in tempo
for (i in seq_len(nrow(combi))) {
  for (j in seq_len(nrow(tempo))) {
    if (combi$Ticket[i] == tempo$Var1[j]) {
      tempo$Fare[j] <- combi$Fare[i]
    }
  }
}

# calculate the Fare per passenger
tempo$Fare <- tempo$Fare / tempo$Freq

# put back the Fare per passenger values to the combi data frame
for (i in seq_len(nrow(combi))) {
  for (j in seq_len(nrow(tempo))) {
    if (combi$Ticket[i] == tempo$Var1[j]) {
      combi$Fare[i] <- tempo$Fare[j]
    }
  }
}

And we are done with our Fare improvement!
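The worked example above (66 divided by 3 is 22) can also be sketched without loops using ave(), which computes a value per group and returns it aligned with the original rows. The data frame toy_fare is made up, and this variant assumes every passenger on a ticket carries the same Fare; it is an alternative illustration, not the code that produced my submission.

```r
# Made-up tickets: one shared by three passengers, one single, one shared by two
toy_fare <- data.frame(
  Ticket = c("12345", "12345", "12345", "A1", "B2", "B2"),
  Fare   = c(66, 66, 66, 10, 30, 30),
  stringsAsFactors = FALSE
)

# Number of passengers on each row's ticket, aligned with the rows
n_on_ticket <- ave(rep(1, nrow(toy_fare)), toy_fare$Ticket, FUN = sum)

# Fare per passenger; single-ticket passengers are divided by 1 and unchanged
toy_fare$FarePerPassenger <- toy_fare$Fare / n_on_ticket
print(toy_fare)
```

On the full dataset this avoids the nested loops, which compare every passenger with every shared ticket.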

4 Building the Model

As said before, we are going to build an ensemble model known as Conditional Random Forest. We have already loaded the required package for cforest() known as party. But before building our model, we will split our train and test datasets from combi.

# 1:891 means 1 to 891(included)
train <- combi[1:891, ]
test <- combi[892:1309, ]
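A sketch of an index-free way to split, assuming (as in this walkthrough) that Survived is NA exactly for the test rows. The data frame combi_toy and the names train_part and test_part are made up for illustration.

```r
# Made-up combined data: the last two rows play the role of the test set
combi_toy <- data.frame(PassengerId = 1:5, Survived = c(0, 1, 1, NA, NA))

# Rows with a known outcome form the training part, the rest the test part
train_part <- combi_toy[!is.na(combi_toy$Survived), ]
test_part  <- combi_toy[is.na(combi_toy$Survived), ]

print(nrow(train_part))
print(nrow(test_part))
```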

Now we are in the position to build our model, the last step before making the predictions!

# setting the seed so that every time the model is created, the random numbers produced are the same
set.seed(400)

# the way of building a model is quite similar to how we made a couple of decision trees
# ntree specifies the number of trees used as base learners and mtry is the number of variables randomly sampled as candidates at each split
fit <- cforest(factor(Survived) ~ Pclass + Age + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyId + Deck, data = train, controls = cforest_unbiased(ntree = 2000, mtry = 3))

# making a bar plot of the importance of predictor variables for the model
par(las = 2)
barplot(varimp(fit))

The bar plot is a testimony to the success of our engineered features, one of which (Title) is the most important of all, i.e., randomly permuting the values of that variable causes the largest decrease in prediction accuracy.

We will now move forward to making the predictions on the test set.

5 Prediction

Since our model is ready, we are all set to predict the survival chances of the rest of the passengers in the test set. Along with the prediction, we will also prepare our submission CSV file holding 0s and 1s as Survived column values for the passengers.

# our model operating upon the test set
Prediction <- predict(fit, test, OOB = TRUE, type = "response")

# create the submit data frame
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)

# write the data frame as a csv file without the row numbers as a separate column
write.csv(submit, file = "RFcondInfTrees.csv", row.names = FALSE)
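Before uploading, it can be worth reading the file back and checking its shape. The sketch below does this for a made-up three-row submission written to a temporary file; prediction_toy, submit_toy and the file name are assumptions. One detail to keep in mind: predictions from a classification model are a factor, and calling as.integer() directly on a factor returns the level codes (1 and 2), not the labels 0 and 1, whereas write.csv() writes the labels.

```r
# Made-up predictions stored as a factor, like the output of a classifier
prediction_toy <- factor(c(0, 1, 1), levels = c(0, 1))
submit_toy <- data.frame(PassengerId = 892:894, Survived = prediction_toy)

# Write to a temporary file and read it back
out_file <- file.path(tempdir(), "submission_check.csv")
write.csv(submit_toy, file = out_file, row.names = FALSE)
check <- read.csv(out_file)

# Basic sanity checks on the round-tripped file
stopifnot(identical(names(check), c("PassengerId", "Survived")))
stopifnot(!anyNA(check$Survived), all(check$Survived %in% c(0, 1)))

# Level codes versus labels of the factor
print(as.integer(prediction_toy))
print(as.integer(as.character(prediction_toy)))
```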

So, this is it! Go ahead and make your submission. If you have already submitted before and didn’t get a good score, you may get it this time.

This was my first kernel on RMarkdown and I hope you liked it. Please give me feedback if you have any suggestions or remarks on how I can improve it. If you have any questions, please feel free to post them. And even if I don’t know the answer myself, we’ll hunt for it together.

Thanks for visiting!

Titanic is back in R!

Vikas Dhyani

29 August 2017