library(data.table)
library(ggplot2)
library(dplyr)
library(randomForest)
library(easyGgplot2)
library(knitr)

About the Data

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

For more information about the titanic data set visit the data page.

Data Import

The data has been split into two groups:

  1. training set (train.csv)
  2. test set (test.csv)

The training set will be used to train our machine learning model. Both train and test set come with social information (Name, Age, Number of Parents/Siblings aboard, etc) and economic information (Ticket Class, ticket fare, port of embarkation, etc), but only the train set comes with the Survived variable. The test set should be submitted to kaggle to see how well your model performs on unseen data.

# File Download
if (!file.exists("./Initial Data/test.csv")) {
        download.file(url = "https://www.kaggle.com/c/titanic/download/test.csv", "./Initial Data/test.csv")
        download.file(url = "https://www.kaggle.com/c/titanic/download/train.csv", "./Initial Data/train.csv")
}

# Data Import
raw.train <- fread("./Initial Data/train.csv", sep = ",", stringsAsFactors = F)
raw.test <- fread("./Initial Data/test.csv", sep = ",", stringsAsFactors = F)

Data Transformation

Now that we have imported both tables, we want to take a look at all the information that we are handling. Let’s start by setting up a table containing both train and test datasets.

# Bind both train and test datasets, ensuring that we leave out the  survival column in the train set
fulldt <- rbind(raw.train[,-2], raw.test)

# Set a variable to differenciate train and set data
fulldt$Origin <- c(rep("train", times = dim(raw.train)[1]), rep("test", times = dim(raw.test)[1]))

# Let's take a look at the data
str(fulldt)
## Classes 'data.table' and 'data.frame':   1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
##  $ Origin     : chr  "train" "train" "train" "train" ...
##  - attr(*, ".internal.selfref")=<externalptr>

It appears that several of these variables should be represented as factors and thus should be reclassified.

# Ensure that some variables are factors
cols <- c("Pclass", "Sex", "SibSp", "Parch", "Embarked", "Cabin")
setDT(fulldt)
for(j in cols){
  set(fulldt, i=NULL, j=j, value=factor(fulldt[[j]]))
}

Now we will take a further look into the data.

summary(fulldt)
##   PassengerId   Pclass      Name               Sex           Age       
##  Min.   :   1   1:323   Length:1309        female:466   Min.   : 0.17  
##  1st Qu.: 328   2:277   Class :character   male  :843   1st Qu.:21.00  
##  Median : 655   3:709   Mode  :character                Median :28.00  
##  Mean   : 655                                           Mean   :29.88  
##  3rd Qu.: 982                                           3rd Qu.:39.00  
##  Max.   :1309                                           Max.   :80.00  
##                                                         NA's   :263    
##  SibSp       Parch         Ticket               Fare        
##  0:891   0      :1002   Length:1309        Min.   :  0.000  
##  1:319   1      : 170   Class :character   1st Qu.:  7.896  
##  2: 42   2      : 113   Mode  :character   Median : 14.454  
##  3: 20   3      :   8                      Mean   : 33.295  
##  4: 22   4      :   6                      3rd Qu.: 31.275  
##  5:  6   5      :   6                      Max.   :512.329  
##  8:  9   (Other):   4                      NA's   :1        
##              Cabin      Embarked    Origin         
##                 :1014    :  2    Length:1309       
##  C23 C25 C27    :   6   C:270    Class :character  
##  B57 B59 B63 B66:   5   Q:123    Mode  :character  
##  G6             :   5   S:914                      
##  B96 B98        :   4                              
##  C22 C26        :   4                              
##  (Other)        : 271

As any raw dataset, it seems that a lot of variables have missing or empty values. Before any modelling is done, these cases needs to be addressed:

1. NAs values in Age

The summary of the data shows us that the Age variable has 263 NA values. How is this comparable with our data?

mean(is.na(fulldt$Age)) # Which percentage of Age is NA values?
## [1] 0.2009167

Around 20%, that’s relevant. Which means that we can not discard the NA rows.

As we want to focus on predicting the passengers’ survival in this exercise, for now we will assign the median age to all NA values. But we must keep in mind that we might improve our accuracy by better predicting a passengers’ age.

fulldt[is.na(Age), "Age"] <- round(median(fulldt$Age, na.rm = T), 0) # Plug Age median on NA values

2. Missing values in Embarked

Embarked has only two missing values, so we will replace them with the most common value.

fulldt[which(fulldt$Embarked == ""),"Embarked"] <- "S"
fulldt$Embarked <- factor(fulldt$Embarked) # Make sure that Embarked is a factor

3. NAs values in Fare

fulldt %>%
        filter(is.na(Fare)) %>%
        select(Name, Pclass, Embarked, Age)
##                 Name Pclass Embarked  Age
## 1 Storey, Mr. Thomas      3        S 60.5

It seems that we only have one NA value on the fare variable. We are not dealing with an factor variable, so we can’t use the same approach as before. In this case, we will try to find similar cases to Mr. Thomas’ and then take a mean of all of its fare values.

# Let's filter all those who are 3rd class passengers and embarked from Southampton
        x <- fulldt %>%
                filter(Pclass == "3", Embarked == "S") %>%
                select(PassengerId, Pclass, Embarked, Fare, Age)

        kable(head(x, 10), format = 'markdown')
PassengerId Pclass Embarked Fare Age
1 3 S 7.2500 22
3 3 S 7.9250 26
5 3 S 8.0500 35
8 3 S 21.0750 2
9 3 S 11.1333 27
11 3 S 16.7000 4
13 3 S 8.0500 20
14 3 S 31.2750 39
15 3 S 7.8542 14
19 3 S 18.0000 31
        # Replace NA values by mean of 3rd class passengers embarked from Southampton
        fulldt[is.na(Fare), "Fare"] <- round(mean(x[, "Fare"], na.rm = T),0)

4. PassengerId, Name and Ticket variables

Let’s set a table to verify how much of these variable are duplicate values. We have a hunch that variables like PassengerId, Name and Ticket might not help us, because factor variables with close to zero duplicate values do not help in predicting.

# Check for duplicates
apply(X = fulldt[,c("PassengerId","Name","Ticket")], MARGIN = 2, FUN = function(x) mean(duplicated(x)))
## PassengerId        Name      Ticket 
## 0.000000000 0.001527884 0.290297937

It seems that what we suspected is true. PassengerId, Name and Ticket don’t bring any information to the table as they are, so we won’t use them as features. They are not to be discarded, though. We still can try to extract some information by feature engineering.

5. Empty values in Cabin

Lastly, it seems that most of Cabin values are empty, so we are going to discard it as an feature.

length(which(fulldt$Cabin == ""))/length(fulldt$Cabin) # How much of "Cabin" is empty?
## [1] 0.7746371

Exploring the variables

Now that we have dealt with the missing/NA Values in the dataset, we want to take a further look into how the predictor variables relate with our target variable.

But first we must prepare the data.

# Explit again into test and train set
test <- fulldt %>%
        filter(Origin == "test")
train <- fulldt %>%
        filter(Origin == "train")

# Assign Survived column to train
train <- as.data.table(train)
train$Survived <- raw.train[PassengerId %in% train$PassengerId, "Survived"]
train$Survived <- factor(train$Survived)

Now we are ready to start exploring the variables.

From those plots we can take a few points:

ML Models

Okay, we have taken a look at the variables. Now we want to start building and improving models.

Model 1

In this model we will use the features that we have at hand to build a random forest.

# Setting up the train table
train.1 <- fulldt %>%
        filter(PassengerId <=  891) %>%
        select(Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)
train.1$Survived <- raw.train$Survived

# Setting up the test table
test.1 <- fulldt %>%
        filter(PassengerId >  891) %>%
        select(Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)

Although every ML algorithm with high complexity can overfit, ensemble methods like random forest reduces the prediction variance to almost nothing, improving the accuracy of the result. This is why we won’t be partitioning the data to do cross-validation.

set.seed(1234)
fit.1 <- randomForest(as.factor(Survived) ~ ., data = train.1, ntree = 1000, importance = T)
fit.1
## 
## Call:
##  randomForest(formula = as.factor(Survived) ~ ., data = train.1,      ntree = 1000, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 16.84%
## Confusion matrix:
##     0   1 class.error
## 0 499  50  0.09107468
## 1 100 242  0.29239766

The unbiased estimate of the test set error is done by the Out of the Bag (OOB) error, which we can see is around 17% in this model. Also, we seem to be much better in predicting if a person has died rather than lived. We might also want to check if our number of trees chosen was sufficient.

plot(fit.1, ylim=c(0,0.36))

The black line shows the OOB error by the number of trees used. Notice how it is stable at the thousandth tree. Similarly, the green and red lines represent the classification error for the survival and non survival outcomes.

Now let’s take a look at the features.

varImpPlot(fit.1)

The variable importance is calculated by randomly permuting the values of variables in the oob cases and then checking comparing its result with the untouched data. This is important because give us an idea of how an feature might impact the overall result.

For instance, it seems that the port of embarkment and the number of siblings/spouses (SibSp) or parents/children (Parch) aboard did not have a great impact on the overall result.

# Write a submission table
predict.1 <- predict(fit.1, test.1)
submit.1 <- data.table(PassengerId = test$PassengerId, Survived = predict.1)
write.csv(submit.1, file = "./submission1.csv", row.names = F)

Kaggle Accuracy: 0.76555.

Model 2

In the previous model, we have used all features available do predict the result. But if we take a look at the importance plot we will notice that Parch, SibSp and Embarked don’t seem to have a significant impact on the overall result. So, next, we want to try a model removing them as features.

# Setting up the train table
train.2 <- fulldt %>%
        filter(PassengerId <=  891) %>%
        select(Pclass, Sex, Age, Fare) # Removed Parch, SibSp and Embarked
train.2$Survived <- raw.train$Survived

# Setting up the test table
test.2 <- fulldt %>%
        filter(PassengerId >  891) %>%
        select(Pclass, Sex, Age, Fare) # Removed Parch, SibSp and Embarked
set.seed(1234)
fit.2 <- randomForest(as.factor(Survived) ~ ., data = train.2, ntree = 1000, importance = T)
fit.2
## 
## Call:
##  randomForest(formula = as.factor(Survived) ~ ., data = train.2,      ntree = 1000, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 16.05%
## Confusion matrix:
##     0   1 class.error
## 0 503  46  0.08378871
## 1  97 245  0.28362573

We can notice already a decrease in the OOB error estimate, which might show that the removed variables might have little explanatory power.

varImpPlot(fit.2)

It seems that by removing these features, we might have increased the others’ importance. I wonder how that might have impacted our results.

# Write a submission table
predict.2 <- predict(fit.2, test.2)
submit.2 <- data.table(PassengerId = test$PassengerId, Survived = predict.2)
write.csv(submit.2, file = "./submission2.csv", row.names = F)

Kaggle Accuracy: 0.77033.

Wow! Looks like removing variables with little predictive power has increased our accuracy.

Model 3

Now, in order to increase our model’s accuracy, we have to do some feature engineering.

Feature Engineering

In feature engineering we transform the data available to create additional relevant features from the existing raw features, increasing the predictive power of the learning algorithm.

Family Size

We will try to increase the importance of both SibSp and Parch features by summing them up. This will result in a “family size” feature that might have a stronger predictive power.

fulldt$Family.Size <- as.numeric(fulldt$SibSp) + as.numeric(fulldt$Parch) - 1 # Factor to Numeric convertion adds 1 to the original number

Let’s take a better look at our new feature.

After analyzing the plot we notice that Single (one person aboard) and Large families (more than 4 people aboard) seem to be more likely to die. But two, three or four people families are just the opposite!

To accentuate this, we will set a new discretized feature based on our analysis of the family member count.

fulldt$Family.Size[fulldt$Family.Size >= 5] <- 'Large Family'
fulldt$Family.Size[fulldt$Family.Size < 5 & fulldt$Family.Size >= 2] <- 'Small Family'
fulldt$Family.Size[fulldt$Family.Size == 1] <- 'No Family'

fulldt$Family.Size <- as.factor(fulldt$Family.Size)

Titles

Let’s extract the title from the unused Name feature, hoping that it might add some value to our predicting model.

fulldt$Title <- gsub('(.*, )|(\\..*)', '', fulldt$Name) # Extract all titles from the Name variable
table(fulldt$Title)
## 
##         Capt          Col          Don         Dona           Dr 
##            1            4            1            1            8 
##     Jonkheer         Lady        Major       Master         Miss 
##            1            1            2           61          260 
##         Mlle          Mme           Mr          Mrs           Ms 
##            2            1          757          197            2 
##          Rev          Sir the Countess 
##            8            1            1

Better to group the titles with low occurrence into bigger baskets to avoid overfitting.

fulldt$Title[fulldt$Title %in% c("Ms", "Mlle")] <- "Miss"

fulldt$Title[fulldt$Title %in% c("Capt", "Col", "Dona", "Major", "Dr", "Rev", "Lady", "Don", "Sir", "the Countess", "Jonkheer")] <- "Other"

fulldt$Title[fulldt$Title == 'L'] <- 'Mr'

fulldt$Title[fulldt$Title == 'Mme'] <- 'Mrs'

fulldt$Title <- factor(fulldt$Title)

summary(fulldt$Title)
## Master   Miss     Mr    Mrs  Other 
##     61    264    757    198     29

Okay, this feature looks very promising! It creates a big discrepancy because you are very likely to die if you are in the “Mr” and “Other” group, while the rest is just the opposite!

Ticket

Although the Ticket variable do not present any value by itself, it might be helpful as a group identifier! The new Family.Size feature showed that a certain size of family is more likely to live than other. That might also hold true to tickets.

Assuming that people who share the same tickets stick together as a group (that group might or not be a family), we can find out if people with the same ticket are more likely to live or die.

# Create a table that counts how many times each ticket appears
ticket.list <- data.table(value = fulldt$Ticket)
nr.of.appearances <- ticket.list[, list(Ticket.appearances =length(value)), 
                                by = list(Ticket = value)]

head(nr.of.appearances)
##              Ticket Ticket.appearances
## 1:        A/5 21171                  1
## 2:         PC 17599                  2
## 3: STON/O2. 3101282                  1
## 4:           113803                  2
## 5:           373450                  1
## 6:           330877                  1
# Join new number of appearances table with our fulldt table
fulldt <- full_join(x = fulldt, y = nr.of.appearances, by = "Ticket")

fulldt <- as.data.table(fulldt) # Join function turns the table into data.frame. Let's ensure that its a data.table again.

Let’s take a look on this new feature.

It seems to have a big correlation with the family size feature. Maybe we are not bringing anything to the table by adding this new feature. We will check in our final results.

As the family variable we will create a discretized Ticket feature.

fulldt$Ticket.appearances[fulldt$Ticket.appearances >= 5]   <- 'Big Group'
fulldt$Ticket.appearances[fulldt$Ticket.appearances < 5 & fulldt$Ticket.appearances>= 2]   <- 'Small Group'
fulldt$Ticket.appearances[fulldt$Ticket.appearances == 1]   <- 'Single'
fulldt$Ticket.appearances <- factor(fulldt$Ticket.appearances)

Modelling

We want to build a model with our new and old features. For this exercise we will keep Parch and SibSp out because they have been converted into a the new feature Family.Size.

# Setting up the train table
train.3 <- fulldt %>%
        filter(Origin == "train") %>%
        select(Title, Sex, Pclass, Fare, Ticket.appearances, Family.Size, Age)
train.3$Survived <- raw.train$Survived

# Setting up the test table
test.3 <- fulldt %>%
        filter(Origin == "test") %>%
        select(Title, Sex, Pclass, Fare, Ticket.appearances, Family.Size, Age)
set.seed(1234)
fit.3 <- randomForest(as.factor(Survived) ~ ., data = train.3, ntree = 1000, importance = T) 
fit.3
## 
## Call:
##  randomForest(formula = as.factor(Survived) ~ ., data = train.3,      ntree = 1000, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 15.94%
## Confusion matrix:
##     0   1 class.error
## 0 495  54  0.09836066
## 1  88 254  0.25730994

We already can see a lower OOB error. What about the importance?

varImpPlot(fit.3)

predict.3 <- predict(fit.3, test.3)
submit.3 <- data.table(PassengerId = test$PassengerId, Survived = predict.3)
write.csv(submit.3, file = "./submission3.csv", row.names = F)

Kaggle Accuracy: 0.78947.

Looks like some of our new features helped us after all! Title made a significant difference in the predictive power. Unfortunately the same can’t be said about the other new features.

Conclusion

In this ML project we have put to practice a random forest classification model and learned the importance of data exploration and feature engineering. We have achieved a final model with 79% accuracy, which is pretty good!