I made this RMarkdown as a part of my work solving the Titanic beginner challenge on Kaggle. I presented this in the form of a kernel that contains this markdown along with all the datasets, including the output dataset and the script of this RMarkdown. Kernels increase the avenues for collaboration with other Kagglers and let others fork our work that they find useful.
This will be my first document to be published on RPubs and I will be submitting more as I learn and explore more in this vast domain of data science.
The basis of this kernel is a tutorial that I would highly recommend to those who are beginning on Kaggle with R. Trevor Stephens has created an excellent Titanic walkthrough that covers everything from setting up your R environment to reaching the upper end of the leaderboard. I have made a few changes wherever I felt the need to experiment, and you should too, because you learn when you try something new! I am a beginner myself, and with this tutorial I think I have found a direction and a decent rank in the top 3%. For this kernel, I will try to keep things as concise and easy to follow as possible.
The Titanic problem is a binary classification problem where every passenger either survives or perishes. There are a lot of techniques and models that can be used in this scenario. However, to achieve higher accuracy, an ensemble model may be needed. I am going to use cforest, or Conditional Random Forests, to build my model. For now, it is enough to know that cforest is similar to randomForest but uses conditional inference trees as base learners instead of classical CART-style decision trees.
We read our datasets as follows
train <- read.csv('C:/Users/admin/Desktop/R/Kaggle/Titanic/CSVs/train.csv')
test <- read.csv('C:/Users/admin/Desktop/R/Kaggle/Titanic/CSVs/test.csv')
The read.csv() function reads each dataset into its own data frame. You will notice that the test data frame has one fewer column than train: it is missing the Survived column, which is what we have to predict.
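One caveat if you are following along on a recent R: since R 4.0.0, read.csv() no longer converts strings to factors by default, whereas the str() output in this walkthrough shows Name, Sex and others as factors. Later steps such as cforest() will likely expect categorical predictors to be factors, so you may want to pass stringsAsFactors = TRUE. A minimal sketch on a made-up two-row CSV (the inline text and its values are illustrative, not the Kaggle data):

```r
# Toy CSV text standing in for train.csv (illustrative values only)
csv_text <- "PassengerId,Sex,Age\n1,male,22\n2,female,38"

# Strings kept as character (this is also the default on R >= 4.0.0)
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Strings converted to factors, matching the factor columns shown in this walkthrough
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$Sex)
class(as_fct$Sex)
levels(as_fct$Sex)
```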
After this, we load some useful packages.
#for decision trees
library(rpart)
#for cforest
library(party)
We have been provided with a bunch of variables for prediction, but not all of them are going to be valuable for us. Some that don’t look useful may still carry importance that is not immediately obvious. Further, the model we build will perform better if it has more useful features. This calls for what is known as Feature Engineering.
In Trevor’s own words, “Feature engineering really boils down to the human element in machine learning.” Anyone can build models from the available features using ready-made packages and make predictions, which makes that part quite an objective task. Feature Engineering brings the subjective part to the table. In this stage, your creativity is your most important companion.
So, going ahead, we will first row-bind the two datasets to form a single, consolidated dataset. To do this, we first need both data frames to have the same columns.
#Adding a column of NA (missing) values in test
test$Survived <- NA
# row binding train and test
combi <- rbind(train, test)
# taking a look at the structure of combi
str(combi) #or View(combi)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
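For data frames, rbind() matches columns by name, which is why the placeholder Survived column is added to test first. A small sketch with made-up rows (toy_train, toy_test and toy_combi are hypothetical names used only for illustration):

```r
# Two tiny data frames standing in for train and test (made-up values)
toy_train <- data.frame(PassengerId = 1:2, Survived = c(0L, 1L), Pclass = c(3L, 1L))
toy_test  <- data.frame(PassengerId = 3:4, Pclass = c(2L, 3L))

# Without this line rbind() would stop with an error about mismatched columns
toy_test$Survived <- NA

# Columns are matched by name, so the differing column order does not matter
toy_combi <- rbind(toy_train, toy_test)
toy_combi

# Rows that came from the test set can be recognised by their missing label
is.na(toy_combi$Survived)
```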
First, we will put our focus on the two variables SibSp and Parch. Doesn’t their sum make the size of the family of a passenger? We can quickly make this a new feature!
# 1 is for the passenger himself/herself
combi$FamilySize <- combi$SibSp + combi$Parch + 1
Feature Engineering sounds quite easy, right? Well, this was just the beginning. Let’s move forward and derive more features.
If you take a look at the Name column, you may notice that the names also carry their titles along with them. See, for example,
#to display the first 6 records in the Name column
head(combi$Name)
## [1] Braund, Mr. Owen Harris
## [2] Cumings, Mrs. John Bradley (Florence Briggs Thayer)
## [3] Heikkinen, Miss. Laina
## [4] Futrelle, Mrs. Jacques Heath (Lily May Peel)
## [5] Allen, Mr. William Henry
## [6] Moran, Mr. James
## 1307 Levels: Abbing, Mr. Anthony ... Zakarian, Mr. Ortin
These titles can become our second feature. What stands in the way is that they are embedded within the Name feature, so we need to extract the title from each passenger's name. If we look closely, the titles follow a consistent pattern: they always occur between a comma ',' and a period '.'. We have a nice little function at our disposal that makes separating the titles from the names quite easy. Meet our friend, strsplit():
# first converting to character type
combi$Name <- as.character(combi$Name)
strsplit(combi$Name[1], split = '[,.]')
## [[1]]
## [1] "Braund" " Mr" " Owen Harris"
The pattern passed as split is a regular expression: the square brackets define a character class, so '[,.]' matches either a comma or a period. We can pick the title component out of the result like this:
strsplit(combi$Name[1], split = '[,.]')[[1]][2]
## [1] " Mr"
Now to apply this function to each row of the Name feature, we will have to use another function known as sapply() —
combi$Title <- sapply(combi$Name, FUN = function(x){ strsplit(x, split = '[,.]')[[1]][2]})
Each name is passed to our anonymous function as x, where strsplit() operates on it. Our new feature is now almost ready. Only one more operation is required: getting rid of the white space that precedes every title. We use the sub() function for that (sub = substitute):
#substitute the first occurrence of a white space with nothing
combi$Title <- sub(' ', '', combi$Title)
We now have our second feature ready! Let's tabulate it to see how the titles are distributed.
table(combi$Title)
##
## Capt Col Don Dona Dr
## 1 4 1 1 8
## Jonkheer Lady Major Master Miss
## 1 1 2 61 260
## Mlle Mme Mr Mrs Ms
## 2 1 757 197 2
## Rev Sir the Countess
## 8 1 1
We see there are many more titles than we had expected. Moreover, some titles belong to only a single passenger. To make the data more manageable, we will merge some titles into a single, broader category.
# The titles 'Mme' and 'Mlle' are merged into one 'Mlle', and similarly for the others.
# %in% acts like a chain of logical ORs: the condition is TRUE if the title is any one
# of the titles in the given vector.
combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
combi$Title[combi$Title %in% c('Capt', 'Don', 'Jonkheer', 'Major', 'Sir')] <- 'Sir'
combi$Title[combi$Title %in% c('Dona', 'Lady', 'Ms', 'the Countess', 'Mlle')] <- 'Lady'
#To change into factor (datatype that contains categories)
combi$Title <- factor(combi$Title)
table(combi$Title)
##
## Col Dr Lady Master Miss Mr Mrs Rev Sir
## 4 8 8 61 260 757 197 8 6
At last, we have our newly engineered Title feature. You can go ahead and experiment with the number of categories of titles according to your own liking by submitting different outputs at the end of this walkthrough.
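If you want to experiment with different groupings, the extraction and merging steps can also be written in vectorized form. The sketch below is an alternative under stated assumptions, not the method used above: toy_names holds a few sample names in the dataset's format, and merge_map is a hypothetical lookup vector that mirrors the merges in this walkthrough.

```r
# A few sample names in the "Surname, Title. Given names" format
toy_names <- c("Braund, Mr. Owen Harris",
               "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
               "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
               "Sagesser, Mlle. Emma")

# Keep only what lies between the first comma and the following period
toy_title <- sub("^[^,]*, *([^.]*)[.].*$", "\\1", toy_names)
toy_title

# Hypothetical lookup: rare title -> broader category
merge_map <- c(Mme = "Lady", Mlle = "Lady", Ms = "Lady", Dona = "Lady",
               `the Countess` = "Lady", Capt = "Sir", Don = "Sir",
               Jonkheer = "Sir", Major = "Sir")

# Replace only the titles that appear in the lookup, leave the rest untouched
hit <- toy_title %in% names(merge_map)
toy_title[hit] <- unname(merge_map[toy_title[hit]])

toy_title <- factor(toy_title)
table(toy_title)
```

Changing the grouping then only means editing merge_map, which makes it easier to compare a few variants.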
You may notice that the Name feature has more to offer than just the titles. It also carries the surnames of the passengers, which can be treated as a categorical variable. We may want to extract those too, since they could contribute to our model.
combi$Surname <- sapply(combi$Name, FUN = function(x){ strsplit(x, split = '[,.]')[[1]][1]})
We can also go one level higher by creating a derived feature. We can combine FamilySize and Surname to create a FamilyId that should be unique for each family. Perhaps some families were more likely to survive than others, and pasting FamilySize onto Surname helps tell apart unrelated families that happen to share a surname. We will mark families with fewer than 3 members as ‘Small’, which saves us from having too many FamilyId categories.
# pasting the two columns together
combi$FamilyId <- paste(as.character(combi$FamilySize), combi$Surname, sep = '')
# as is evident below, those with a family size less than or equal to 2 will be designated as Small
combi$FamilyId[combi$FamilySize <= 2] <- 'Small'
But wait. When we run table(combi$FamilyId), we see some FamilyId values whose frequency does not match the FamilySize pasted into them, although the two should be the same (the output is not printed here because it takes too much space). For instance, ‘3Strom’ should have at least 3 passengers with that FamilyId, but it has only 1, which could mean that only 1 passenger with that FamilyId was on board. We therefore need to assign such FamilyId values to ‘Small’ as well.
# famId, a temporary variable, will store the FamilyId with their frequencies
famId <- data.frame(table(combi$FamilyId))
# Now, famId will only contain records with frequency or number of passengers in that family
# less than or equal to 2
famId <- famId[famId$Freq <= 2, ]
# Any FamilyId existing in famId would then be declared 'Small'
combi$FamilyId[combi$FamilyId %in% famId$Var1] <- 'Small'
# finally converting to factor
combi$FamilyId <- factor(combi$FamilyId)
table(combi$FamilyId)
##
## 11Sage 3Abbott 3Boulos 3Bourke 3Brown
## 11 3 3 3 4
## 3Caldwell 3Collyer 3Compton 3Coutts 3Crosby
## 3 3 3 3 3
## 3Danbom 3Davies 3Dodge 3Drew 3Elias
## 3 5 3 3 3
## 3Goldsmith 3Hart 3Hickman 3Johnson 3Klasen
## 3 3 3 3 3
## 3Mallet 3McCoy 3Moubarek 3Nakid 3Navratil
## 3 3 3 3 3
## 3Peacock 3Peter 3Quick 3Rosblom 3Samaan
## 3 3 3 3 3
## 3Sandstrom 3Spedden 3Taussig 3Thayer 3Touma
## 3 3 3 3 3
## 3van Billiard 3Van Impe 3Wells 3Wick 3Widener
## 3 3 3 3 3
## 4Allison 4Baclini 4Becker 4Carter 4Dean
## 4 4 4 4 4
## 4Herman 4Johnston 4Laroche 4West 5Ford
## 4 4 4 4 5
## 5Lefebre 5Palsson 5Ryerson 6Fortune 6Panula
## 5 5 5 6 6
## 6Rice 6Skoog 7Andersson 7Asplund 8Goodwin
## 6 6 9 7 8
## Small
## 1074
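The two-stage clean-up of FamilyId can be easier to follow on a handful of rows. The sketch below uses made-up surnames and family sizes (the toy data frame is purely illustrative) and takes the rare ids straight from the names of the frequency table instead of going through a temporary data frame:

```r
# Made-up passengers: "Strom" claims a family of 3 but appears only once
toy <- data.frame(
  Surname    = c("Sage", "Sage", "Sage", "Strom", "Allen", "Hart", "Hart", "Hart"),
  FamilySize = c(3, 3, 3, 3, 1, 3, 3, 3),
  stringsAsFactors = FALSE
)

# Stage 1: paste size and surname, then collapse families of size 2 or less
toy$FamilyId <- paste0(toy$FamilySize, toy$Surname)
toy$FamilyId[toy$FamilySize <= 2] <- "Small"

# Stage 2: count how many rows actually carry each FamilyId
id_counts <- table(toy$FamilyId)

# Ids seen on two rows or fewer are folded into "Small" as well
rare_ids <- names(id_counts)[id_counts <= 2]
toy$FamilyId[toy$FamilyId %in% rare_ids] <- "Small"

table(toy$FamilyId)
```

Here "3Strom" ends up as "Small" because only one row carries it, which is the same correction applied to the real data above.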
Let’s look at other potentially useful columns.
We can see that we have a column called Cabin that carries values like ‘C85’, ‘G6’, etc. You can use head(combi$Cabin, n), with n being any big enough value, to see that. You will also notice a lot of blank values, which turn into NA once we extract the first character. We will deal with them later; what is important here is a pattern in the Cabin values. The first character repeats across cabins, and one may guess that it represents the deck of the ship. We are going to move ahead with this notion and create another feature called Deck.
#creating a column of NA values first
combi$Deck <- NA
#NULL splits at each position
combi$Deck <- sapply(as.character(combi$Cabin), function(x){ strsplit(x, NULL)[[1]][1]})
combi$Deck <- factor(combi$Deck)
summary(combi$Deck)
## A B C D E F G T NA's
## 22 65 94 46 41 21 5 1 1014
This feature too has been successfully created. We will definitely address the issue of NA values later.
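The NA values above arise because splitting an empty string yields nothing to index. An alternative sketch, assuming you prefer to mark the blanks explicitly, uses substr() on a few made-up cabin values (toy_cabin and toy_deck are illustrative names):

```r
# Made-up cabin values, including blanks and a multi-cabin entry
toy_cabin <- c("", "C85", "", "G6", "B57 B59 B63 B66")

# substr() takes the first character; blank cabins yield "" rather than NA
toy_deck <- substr(toy_cabin, 1, 1)

# Mark the blanks as missing explicitly
toy_deck[toy_deck == ""] <- NA

toy_deck <- factor(toy_deck)
summary(toy_deck)
```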
So, now we have a total of 4 features that we have engineered ourselves from the existing ones. However, we are not yet in a position to use them to build our predictive model. Some of the features need a little retouching first.
There are some features that have NA values in them. It is beneficial for us to get rid of those missing values. Moreover, some features may still not contain data in the form suitable for our model. This part of the walkthrough will deal with such issues.
First, we look at the features that we would actually want to use in our model and identify those that contain NA values.
summary(combi)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.3838 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
## NA's :418
## Sex Age SibSp Parch
## female:466 Min. : 0.17 Min. :0.0000 Min. :0.000
## male :843 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Ticket Fare Cabin Embarked
## CA. 2343: 11 Min. : 0.000 :1014 : 2
## 1601 : 8 1st Qu.: 7.896 C23 C25 C27 : 6 C:270
## CA 2144 : 8 Median : 14.454 B57 B59 B63 B66: 5 Q:123
## 3101295 : 7 Mean : 33.295 G6 : 5 S:914
## 347077 : 7 3rd Qu.: 31.275 B96 B98 : 4
## 347082 : 7 Max. :512.329 C22 C26 : 4
## (Other) :1261 NA's :1 (Other) : 271
## FamilySize Title Surname FamilyId
## Min. : 1.000 Mr :757 Length:1309 Small :1074
## 1st Qu.: 1.000 Miss :260 Class :character 11Sage : 11
## Median : 1.000 Mrs :197 Mode :character 7Andersson: 9
## Mean : 1.884 Master : 61 8Goodwin : 8
## 3rd Qu.: 2.000 Dr : 8 7Asplund : 7
## Max. :11.000 Lady : 8 6Fortune : 6
## (Other): 18 (Other) : 194
## Deck
## C : 94
## B : 65
## D : 46
## E : 41
## A : 22
## (Other): 27
## NA's :1014
So, Age, Fare, Embarked and Deck are those features (Embarked has 2 blanks which are tantamount to NA here). We will address them one at a time. First, for Age, we will use the decision tree function rpart(). We have already loaded its package. Decision trees can be used for both classification and regression problems (Age being a regression case).
# first creating the decision tree on the training data, or only the rows that contain a value for Age column
fitAge <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize, data = combi[!(is.na(combi$Age)), ], method = "anova")
# now putting the predicted Age values in the rows that have NA in their Age column
combi$Age[is.na(combi$Age)] <- predict(fitAge, combi[is.na(combi$Age), ])
For Fare, we find that only a single NA is present. We can assign the median of the Fare values to it.
# which row(s) has NA in the Fare column
which(is.na(combi$Fare))
## [1] 1044
# median value assigned to the NA value.
combi$Fare[1044] <- median(combi$Fare, na.rm = TRUE)
In the case of Embarked, there are just two missing values. Although they are blank strings rather than NA, they are equivalent to NA and we should clean them up too. We will replace these few blanks with the value that appears most frequently, Southampton, which is coded as “S”.
# which row has blank as Embarked
which(combi$Embarked == '')
## [1] 62 830
combi$Embarked[c(62,830)] <- 'S'
Now, in the case of the Deck column, we have a lot of NA values. We will again use rpart(), but this time with method = "class" instead of the “anova” method we used for the regression on Age.
fitDeck <- rpart(Deck ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize + Age, data = combi[!(is.na(combi$Deck)), ], method = "class")
combi$Deck[is.na(combi$Deck)] <- predict(fitDeck, combi[is.na(combi$Deck), ], type = "class")
Go ahead and experiment with the rpart() function by adding or removing predictor variables.
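If you would like to try the fit-on-known, predict-on-missing pattern without the Titanic files, here is a small sketch on the airquality dataset that ships with R. Its Ozone column has missing values while Wind, Temp and Month do not; the choice of predictors is an arbitrary assumption for illustration only.

```r
library(rpart)

# airquality ships with R; Ozone contains NAs, the chosen predictors do not
aq <- airquality
missing_rows <- is.na(aq$Ozone)

# Fit a regression tree only on the rows where the target is known
fit_ozone <- rpart(Ozone ~ Wind + Temp + Month,
                   data = aq[!missing_rows, ],
                   method = "anova")

# Fill the gaps with the tree's predictions
aq$Ozone[missing_rows] <- predict(fit_ozone, aq[missing_rows, ])

# Number of missing Ozone values left after imputation
sum(is.na(aq$Ozone))
```

For a categorical target such as Deck, the same pattern applies with method = "class" in rpart() and type = "class" in predict().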
I got this idea from one of the commenters on Trevor’s blog. When I implemented it, my accuracy improved marginally, which is enough to jump ahead of a few hundred people on the leaderboard. I want to thank you, vruizext (Disqus name), if you ever see this. It works like this: you can find multiple passengers sharing the same ticket number and hence the same fare on the ticket. Those passengers may not be listed together in the dataset, so this may not be immediately obvious. What we would like to do here is determine the fare per passenger. To find it, we will divide each Fare amount by the total number of passengers sharing the corresponding Ticket number.
For example, if there are three passengers having Ticket value equal to 12345 and, therefore, having the same Fare amount of, let’s say, 66, then the fare per passenger will be equal to 66 divided by 3, which is 22. We will implement this to all the rows.
# create a temporary data frame storing Ticket values
tempo <- data.frame(table(combi$Ticket))
# store only those Ticket values that belong to more than one passenger
tempo <- tempo[tempo$Freq > 1, ]
tempo$Fare <- 0
# run a nested loop to extract the Fare values for the Ticket values in tempo
for (i in seq_len(nrow(combi))) {
  for (j in seq_len(nrow(tempo))) {
    if (combi$Ticket[i] == tempo$Var1[j]) {
      tempo$Fare[j] <- combi$Fare[i]
    }
  }
}
# calculate the Fare per passenger
tempo$Fare <- tempo$Fare / tempo$Freq
# put back the Fare per passenger values to the combi data frame
for (i in seq_len(nrow(combi))) {
  for (j in seq_len(nrow(tempo))) {
    if (combi$Ticket[i] == tempo$Var1[j]) {
      combi$Fare[i] <- tempo$Fare[j]
    }
  }
}
And we are done with our Fare improvement!
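The nested loops compare every passenger with every shared ticket, which can feel slow. As a sketch of a vectorized alternative (not the code used for the submission), ave() can count the passengers per ticket in one pass. The toy data frame and the TicketCount and FarePerPassenger columns are made up for illustration, and the sketch assumes everyone on a ticket carries the same Fare:

```r
# Made-up tickets: three passengers share "12345", two share "B2"
toy <- data.frame(
  Ticket = c("12345", "A1", "12345", "12345", "B2", "B2"),
  Fare   = c(66, 10, 66, 66, 30, 30),
  stringsAsFactors = FALSE
)

# Number of passengers sharing each row's ticket
toy$TicketCount <- ave(toy$Fare, toy$Ticket, FUN = length)

# Fare per passenger; tickets held by a single person are divided by 1
toy$FarePerPassenger <- toy$Fare / toy$TicketCount
toy
```

The shared ticket with a fare of 66 yields 22 per passenger, matching the worked example earlier.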
As said before, we are going to build an ensemble model known as Conditional Random Forest. We have already loaded the required package for cforest() known as party. But before building our model, we will split our train and test datasets from combi.
# 1:891 means 1 to 891(included)
train <- combi[1:891, ]
test <- combi[892:1309, ]
Now we are in a position to build our model, the last step before making the predictions!
# setting the seed so that every time the model is created, the random numbers produced are the same
set.seed(400)
# the way of building a model is quite similar to how we made a couple of decision trees
# ntree specifies the number of trees used as base learners and mtry is the number of variables randomly sampled as candidates at each split
fit <- cforest(factor(Survived) ~ Pclass + Age + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyId + Deck, data = train, controls = cforest_unbiased(ntree = 2000, mtry = 3))
# making a bar plot of the importance of predictor variables for the model
par(las = 2)
barplot(varimp(fit))
The bar plot is a testimony to the success of our engineered features. One of them, Title, is the most important variable of all: removing it would likely cause a large drop in prediction accuracy.
We will now move forward to making the predictions on the test set.
Since our model is ready, we are all set to predict the survival chances of the rest of the passengers in the test set. Along with the prediction, we will also prepare our submission CSV file holding 0s and 1s as Survived column values for the passengers.
# our model operating upon the test set
Prediction <- predict(fit, test, OOB = TRUE, type = "response")
# create the submit data frame
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
# write the data frame as a csv file without the row numbers as a separate column
write.csv(submit, file = "RFcondInfTrees.csv", row.names = FALSE)
So, this is it! Go ahead and make your submission. If you have submitted before and didn’t get a good score, you may get a better one this time.
This was my first kernel in RMarkdown and I hope you liked it. Please leave feedback if you have any suggestions or remarks on how I can improve it. If you have any questions, please feel free to post them. Even if I don’t know the answer myself, we’ll hunt for it together.
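Before uploading, it can be worth reading the file back to confirm it has the shape you expect. The sketch below does this on a made-up submission written to a temporary file. It assumes that predict() returned a factor with levels 0 and 1, and toy_prediction, toy_submit and check are hypothetical names.

```r
# Stand-in for the factor returned by predict(); the values are made up
toy_prediction <- factor(c(0, 1, 1, 0), levels = c(0, 1))
toy_submit <- data.frame(PassengerId = 892:895, Survived = toy_prediction)

# Write to a temporary file so nothing in the working directory is touched
out_file <- tempfile(fileext = ".csv")
write.csv(toy_submit, file = out_file, row.names = FALSE)

# Read the file back and inspect it
check <- read.csv(out_file)
names(check)
nrow(check)
all(check$Survived %in% c(0, 1))
anyNA(check)
```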
Thanks for visiting!