library(data.table)
library(ggplot2)
library(dplyr)
library(randomForest)
library(easyGgplot2)
library(knitr)
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
For more information about the titanic data set visit the data page.
The data has been split into two groups:
The training set will be used to train our machine learning model. Both train and test set come with social information (Name, Age, Number of Parents/Siblings aboard, etc) and economic information (Ticket Class, ticket fare, port of embarkation, etc), but only the train set comes with the Survived variable. The test set should be submitted to kaggle to see how well your model performs on unseen data.
# File Download
if (!file.exists("./Initial Data/test.csv")) {
download.file(url = "https://www.kaggle.com/c/titanic/download/test.csv", "./Initial Data/test.csv")
download.file(url = "https://www.kaggle.com/c/titanic/download/train.csv", "./Initial Data/train.csv")
}
# Data Import
raw.train <- fread("./Initial Data/train.csv", sep = ",", stringsAsFactors = F)
raw.test <- fread("./Initial Data/test.csv", sep = ",", stringsAsFactors = F)
Now that we have imported both tables, we want to take a look at all the information that we are handling. Let’s start by setting up a table containing both train and test datasets.
# Bind both train and test datasets, ensuring that we leave out the survival column in the train set
fulldt <- rbind(raw.train[,-2], raw.test)
# Set a variable to differenciate train and set data
fulldt$Origin <- c(rep("train", times = dim(raw.train)[1]), rep("test", times = dim(raw.test)[1]))
# Let's take a look at the data
str(fulldt)
## Classes 'data.table' and 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
## $ Origin : chr "train" "train" "train" "train" ...
## - attr(*, ".internal.selfref")=<externalptr>
It appears that several of these variables should be represented as factors and thus should be reclassified.
# Ensure that some variables are factors
cols <- c("Pclass", "Sex", "SibSp", "Parch", "Embarked", "Cabin")
setDT(fulldt)
for(j in cols){
set(fulldt, i=NULL, j=j, value=factor(fulldt[[j]]))
}
Now we will take a further look into the data.
summary(fulldt)
## PassengerId Pclass Name Sex Age
## Min. : 1 1:323 Length:1309 female:466 Min. : 0.17
## 1st Qu.: 328 2:277 Class :character male :843 1st Qu.:21.00
## Median : 655 3:709 Mode :character Median :28.00
## Mean : 655 Mean :29.88
## 3rd Qu.: 982 3rd Qu.:39.00
## Max. :1309 Max. :80.00
## NA's :263
## SibSp Parch Ticket Fare
## 0:891 0 :1002 Length:1309 Min. : 0.000
## 1:319 1 : 170 Class :character 1st Qu.: 7.896
## 2: 42 2 : 113 Mode :character Median : 14.454
## 3: 20 3 : 8 Mean : 33.295
## 4: 22 4 : 6 3rd Qu.: 31.275
## 5: 6 5 : 6 Max. :512.329
## 8: 9 (Other): 4 NA's :1
## Cabin Embarked Origin
## :1014 : 2 Length:1309
## C23 C25 C27 : 6 C:270 Class :character
## B57 B59 B63 B66: 5 Q:123 Mode :character
## G6 : 5 S:914
## B96 B98 : 4
## C22 C26 : 4
## (Other) : 271
As any raw dataset, it seems that a lot of variables have missing or empty values. Before any modelling is done, these cases needs to be addressed:
The summary of the data shows us that the Age variable has 263 NA values. How is this comparable with our data?
mean(is.na(fulldt$Age)) # Which percentage of Age is NA values?
## [1] 0.2009167
Around 20%, that’s relevant. Which means that we can not discard the NA rows.
As we want to focus on predicting the passengers’ survival in this exercise, for now we will assign the median age to all NA values. But we must keep in mind that we might improve our accuracy by better predicting a passengers’ age.
fulldt[is.na(Age), "Age"] <- round(median(fulldt$Age, na.rm = T), 0) # Plug Age median on NA values
Embarked has only two missing values, so we will replace them with the most common value.
fulldt[which(fulldt$Embarked == ""),"Embarked"] <- "S"
fulldt$Embarked <- factor(fulldt$Embarked) # Make sure that Embarked is a factor
fulldt %>%
filter(is.na(Fare)) %>%
select(Name, Pclass, Embarked, Age)
## Name Pclass Embarked Age
## 1 Storey, Mr. Thomas 3 S 60.5
It seems that we only have one NA value on the fare variable. We are not dealing with an factor variable, so we can’t use the same approach as before. In this case, we will try to find similar cases to Mr. Thomas’ and then take a mean of all of its fare values.
# Let's filter all those who are 3rd class passengers and embarked from Southampton
x <- fulldt %>%
filter(Pclass == "3", Embarked == "S") %>%
select(PassengerId, Pclass, Embarked, Fare, Age)
kable(head(x, 10), format = 'markdown')
| PassengerId | Pclass | Embarked | Fare | Age |
|---|---|---|---|---|
| 1 | 3 | S | 7.2500 | 22 |
| 3 | 3 | S | 7.9250 | 26 |
| 5 | 3 | S | 8.0500 | 35 |
| 8 | 3 | S | 21.0750 | 2 |
| 9 | 3 | S | 11.1333 | 27 |
| 11 | 3 | S | 16.7000 | 4 |
| 13 | 3 | S | 8.0500 | 20 |
| 14 | 3 | S | 31.2750 | 39 |
| 15 | 3 | S | 7.8542 | 14 |
| 19 | 3 | S | 18.0000 | 31 |
# Replace NA values by mean of 3rd class passengers embarked from Southampton
fulldt[is.na(Fare), "Fare"] <- round(mean(x[, "Fare"], na.rm = T),0)
Let’s set a table to verify how much of these variable are duplicate values. We have a hunch that variables like PassengerId, Name and Ticket might not help us, because factor variables with close to zero duplicate values do not help in predicting.
# Check for duplicates
apply(X = fulldt[,c("PassengerId","Name","Ticket")], MARGIN = 2, FUN = function(x) mean(duplicated(x)))
## PassengerId Name Ticket
## 0.000000000 0.001527884 0.290297937
It seems that what we suspected is true. PassengerId, Name and Ticket don’t bring any information to the table as they are, so we won’t use them as features. They are not to be discarded, though. We still can try to extract some information by feature engineering.
Lastly, it seems that most of Cabin values are empty, so we are going to discard it as an feature.
length(which(fulldt$Cabin == ""))/length(fulldt$Cabin) # How much of "Cabin" is empty?
## [1] 0.7746371
Now that we have dealt with the missing/NA Values in the dataset, we want to take a further look into how the predictor variables relate with our target variable.
But first we must prepare the data.
# Explit again into test and train set
test <- fulldt %>%
filter(Origin == "test")
train <- fulldt %>%
filter(Origin == "train")
# Assign Survived column to train
train <- as.data.table(train)
train$Survived <- raw.train[PassengerId %in% train$PassengerId, "Survived"]
train$Survived <- factor(train$Survived)
Now we are ready to start exploring the variables.
From those plots we can take a few points:
Okay, we have taken a look at the variables. Now we want to start building and improving models.
In this model we will use the features that we have at hand to build a random forest.
# Setting up the train table
train.1 <- fulldt %>%
filter(PassengerId <= 891) %>%
select(Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)
train.1$Survived <- raw.train$Survived
# Setting up the test table
test.1 <- fulldt %>%
filter(PassengerId > 891) %>%
select(Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)
Although every ML algorithm with high complexity can overfit, ensemble methods like random forest reduces the prediction variance to almost nothing, improving the accuracy of the result. This is why we won’t be partitioning the data to do cross-validation.
set.seed(1234)
fit.1 <- randomForest(as.factor(Survived) ~ ., data = train.1, ntree = 1000, importance = T)
fit.1
##
## Call:
## randomForest(formula = as.factor(Survived) ~ ., data = train.1, ntree = 1000, importance = T)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 16.84%
## Confusion matrix:
## 0 1 class.error
## 0 499 50 0.09107468
## 1 100 242 0.29239766
The unbiased estimate of the test set error is done by the Out of the Bag (OOB) error, which we can see is around 17% in this model. Also, we seem to be much better in predicting if a person has died rather than lived. We might also want to check if our number of trees chosen was sufficient.
plot(fit.1, ylim=c(0,0.36))
The black line shows the OOB error by the number of trees used. Notice how it is stable at the thousandth tree. Similarly, the green and red lines represent the classification error for the survival and non survival outcomes.
Now let’s take a look at the features.
varImpPlot(fit.1)
The variable importance is calculated by randomly permuting the values of variables in the oob cases and then checking comparing its result with the untouched data. This is important because give us an idea of how an feature might impact the overall result.
For instance, it seems that the port of embarkment and the number of siblings/spouses (SibSp) or parents/children (Parch) aboard did not have a great impact on the overall result.
# Write a submission table
predict.1 <- predict(fit.1, test.1)
submit.1 <- data.table(PassengerId = test$PassengerId, Survived = predict.1)
write.csv(submit.1, file = "./submission1.csv", row.names = F)
Kaggle Accuracy: 0.76555.
In the previous model, we have used all features available do predict the result. But if we take a look at the importance plot we will notice that Parch, SibSp and Embarked don’t seem to have a significant impact on the overall result. So, next, we want to try a model removing them as features.
# Setting up the train table
train.2 <- fulldt %>%
filter(PassengerId <= 891) %>%
select(Pclass, Sex, Age, Fare) # Removed Parch, SibSp and Embarked
train.2$Survived <- raw.train$Survived
# Setting up the test table
test.2 <- fulldt %>%
filter(PassengerId > 891) %>%
select(Pclass, Sex, Age, Fare) # Removed Parch, SibSp and Embarked
set.seed(1234)
fit.2 <- randomForest(as.factor(Survived) ~ ., data = train.2, ntree = 1000, importance = T)
fit.2
##
## Call:
## randomForest(formula = as.factor(Survived) ~ ., data = train.2, ntree = 1000, importance = T)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 16.05%
## Confusion matrix:
## 0 1 class.error
## 0 503 46 0.08378871
## 1 97 245 0.28362573
We can notice already a decrease in the OOB error estimate, which might show that the removed variables might have little explanatory power.
varImpPlot(fit.2)
It seems that by removing these features, we might have increased the others’ importance. I wonder how that might have impacted our results.
# Write a submission table
predict.2 <- predict(fit.2, test.2)
submit.2 <- data.table(PassengerId = test$PassengerId, Survived = predict.2)
write.csv(submit.2, file = "./submission2.csv", row.names = F)
Kaggle Accuracy: 0.77033.
Wow! Looks like removing variables with little predictive power has increased our accuracy.
Now, in order to increase our model’s accuracy, we have to do some feature engineering.
In feature engineering we transform the data available to create additional relevant features from the existing raw features, increasing the predictive power of the learning algorithm.
We will try to increase the importance of both SibSp and Parch features by summing them up. This will result in a “family size” feature that might have a stronger predictive power.
fulldt$Family.Size <- as.numeric(fulldt$SibSp) + as.numeric(fulldt$Parch) - 1 # Factor to Numeric convertion adds 1 to the original number
Let’s take a better look at our new feature.
After analyzing the plot we notice that Single (one person aboard) and Large families (more than 4 people aboard) seem to be more likely to die. But two, three or four people families are just the opposite!
To accentuate this, we will set a new discretized feature based on our analysis of the family member count.
fulldt$Family.Size[fulldt$Family.Size >= 5] <- 'Large Family'
fulldt$Family.Size[fulldt$Family.Size < 5 & fulldt$Family.Size >= 2] <- 'Small Family'
fulldt$Family.Size[fulldt$Family.Size == 1] <- 'No Family'
fulldt$Family.Size <- as.factor(fulldt$Family.Size)
Let’s extract the title from the unused Name feature, hoping that it might add some value to our predicting model.
fulldt$Title <- gsub('(.*, )|(\\..*)', '', fulldt$Name) # Extract all titles from the Name variable
table(fulldt$Title)
##
## Capt Col Don Dona Dr
## 1 4 1 1 8
## Jonkheer Lady Major Master Miss
## 1 1 2 61 260
## Mlle Mme Mr Mrs Ms
## 2 1 757 197 2
## Rev Sir the Countess
## 8 1 1
Better to group the titles with low occurrence into bigger baskets to avoid overfitting.
fulldt$Title[fulldt$Title %in% c("Ms", "Mlle")] <- "Miss"
fulldt$Title[fulldt$Title %in% c("Capt", "Col", "Dona", "Major", "Dr", "Rev", "Lady", "Don", "Sir", "the Countess", "Jonkheer")] <- "Other"
fulldt$Title[fulldt$Title == 'L'] <- 'Mr'
fulldt$Title[fulldt$Title == 'Mme'] <- 'Mrs'
fulldt$Title <- factor(fulldt$Title)
summary(fulldt$Title)
## Master Miss Mr Mrs Other
## 61 264 757 198 29
Okay, this feature looks very promising! It creates a big discrepancy because you are very likely to die if you are in the “Mr” and “Other” group, while the rest is just the opposite!
Although the Ticket variable do not present any value by itself, it might be helpful as a group identifier! The new Family.Size feature showed that a certain size of family is more likely to live than other. That might also hold true to tickets.
Assuming that people who share the same tickets stick together as a group (that group might or not be a family), we can find out if people with the same ticket are more likely to live or die.
# Create a table that counts how many times each ticket appears
ticket.list <- data.table(value = fulldt$Ticket)
nr.of.appearances <- ticket.list[, list(Ticket.appearances =length(value)),
by = list(Ticket = value)]
head(nr.of.appearances)
## Ticket Ticket.appearances
## 1: A/5 21171 1
## 2: PC 17599 2
## 3: STON/O2. 3101282 1
## 4: 113803 2
## 5: 373450 1
## 6: 330877 1
# Join new number of appearances table with our fulldt table
fulldt <- full_join(x = fulldt, y = nr.of.appearances, by = "Ticket")
fulldt <- as.data.table(fulldt) # Join function turns the table into data.frame. Let's ensure that its a data.table again.
Let’s take a look on this new feature.
It seems to have a big correlation with the family size feature. Maybe we are not bringing anything to the table by adding this new feature. We will check in our final results.
As the family variable we will create a discretized Ticket feature.
fulldt$Ticket.appearances[fulldt$Ticket.appearances >= 5] <- 'Big Group'
fulldt$Ticket.appearances[fulldt$Ticket.appearances < 5 & fulldt$Ticket.appearances>= 2] <- 'Small Group'
fulldt$Ticket.appearances[fulldt$Ticket.appearances == 1] <- 'Single'
fulldt$Ticket.appearances <- factor(fulldt$Ticket.appearances)
We want to build a model with our new and old features. For this exercise we will keep Parch and SibSp out because they have been converted into a the new feature Family.Size.
# Setting up the train table
train.3 <- fulldt %>%
filter(Origin == "train") %>%
select(Title, Sex, Pclass, Fare, Ticket.appearances, Family.Size, Age)
train.3$Survived <- raw.train$Survived
# Setting up the test table
test.3 <- fulldt %>%
filter(Origin == "test") %>%
select(Title, Sex, Pclass, Fare, Ticket.appearances, Family.Size, Age)
set.seed(1234)
fit.3 <- randomForest(as.factor(Survived) ~ ., data = train.3, ntree = 1000, importance = T)
fit.3
##
## Call:
## randomForest(formula = as.factor(Survived) ~ ., data = train.3, ntree = 1000, importance = T)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 15.94%
## Confusion matrix:
## 0 1 class.error
## 0 495 54 0.09836066
## 1 88 254 0.25730994
We already can see a lower OOB error. What about the importance?
varImpPlot(fit.3)
predict.3 <- predict(fit.3, test.3)
submit.3 <- data.table(PassengerId = test$PassengerId, Survived = predict.3)
write.csv(submit.3, file = "./submission3.csv", row.names = F)
Kaggle Accuracy: 0.78947.
Looks like some of our new features helped us after all! Title made a significant difference in the predictive power. Unfortunately the same can’t be said about the other new features.
In this ML project we have put to practice a random forest classification model and learned the importance of data exploration and feature engineering. We have achieved a final model with 79% accuracy, which is pretty good!