Random Forest: Titanic Disaster Survival

Set Working Directory

# You may set your own, mine is:
setwd("E:/Titanic_ML/")

This script trains a Random Forest model based on the data, saves a sample submission, and plots the relative importance of the variables in making predictions

Download my_rf_submission.csv from the output below and submit it through https://www.kaggle.com/c/titanic-gettingStarted/submissions/attach to enter this competition!

This analysis uses the Titanic survivor data provided by Kaggle at http://www.kaggle.com/c/titanic/data

I am using R for this analysis.

According to the description on Kaggle Titanic Competition, the variable descriptions are as follows:

    survival: Survival (0 = No; 1 = Yes)
    pclass: Passenger Class (proxy for socio-economic status 1 = 1st; 2 = 2nd; 3 = 3rd)
    name: Name
    sex: Sex
    age: Age
    sibsp: Number of Siblings/Spouses Aboard
    parch: Number of Parents/Children Aboard
    ticket: Ticket Number
    fare: Passenger Fare
    cabin: Cabin
    embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Part 1: Problem Description

The purpose of this ananlysis is to build a model that selects the variables that are able to predctive the passenger's survival with better 
prediction accuracy. 

For this analysis, since I have survival labels (whether survived or not) I'll apply a supervised learning algorithm to build the model.

Part 2: Model Building

First let us load the required libraries in R and also data in the working environment.

2.1: Load Required Libraries

library(ggplot2)
library(randomForest)
library(rpart)
library(rpart.plot)
library(pander)
library(caTools)
library(rattle)
library(RColorBrewer)
library(plyr)

2.2: Load training and test data

set.seed(1)
train <- read.csv("./input/train.csv", stringsAsFactors=FALSE)
test  <- read.csv("./input/test.csv",  stringsAsFactors=FALSE)

2.3: View the data

Training Data:

# Print few rows of training dataet to the console
m <- head(train)
knitr::kable(m, digits = 2, caption = "Training Data: First Few Rows of Data")

Training Data: First Few Rows of Data
PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	0	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.25		S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.28	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.92		S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.10	C123	S
5	0	3	Allen, Mr. William Henry	male	35	0	373450	8.05		S
6	0	3	Moran, Mr. James	male	NA	0	330877	8.46		Q

Testing Data:

# Print few rows of testing dataet to the console
n <- head(test)
knitr::kable(n, digits = 2, caption = "Testing Data: First Few Rows of Data")

Testing Data: First Few Rows of Data
PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked
892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.83	Q
893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.00	S
894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.69	Q
895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.66	S
896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.29	S
897	3	Svensson, Mr. Johan Cervin	male	14.0	0	0	7538	9.22	S

2.4: Also view structures

2.4.1: Training dataset

str(train)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

2.4.2: Testing dataset

str(test)

## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...

2.5: Converting variables as factors

From the above results, it is clear that the 'Survived' variable is loaded as integer by default.

Therefore I'll convert it to a categorical varaible (factor). 

I'll also convert other categorical variables to factors as well (which are not loaded as factors but actually are).

train2 <- train
train2$Survived <- as.factor(train2$Survived)
train2$Pclass <- as.factor(train2$Pclass)

2.6: Splitting training data

As previously noted, I'll split the training data into a set that will be used to train the model and a set that will be used to evaluate the model.

Test data set, as created from the above process, will contain 75% of randomly selected observations.

Training data set will contain the rest 25% obesvations (in original training set) which are exluded by newly created test data set.

set.seed(1446)
split <- sample.split(train2$Survived, SplitRatio = 0.75)
sub_training_data <- subset(train2, split == TRUE)
sub_testing_data <- subset(train2, split == FALSE)

Part 3: Visualizing Data

# Males & Females that Survived vs. Males & Females that passed away
table(train2$Sex, train2$Survived)

##         
##            0   1
##   female  81 233
##   male   468 109

# As row-wise proportions
prop.table(table(train$Sex, train$Survived), margin = 1)

##         
##                  0         1
##   female 0.2579618 0.7420382
##   male   0.8110919 0.1889081

I'll first consider a simple model based on gender (Sex variable) of the passengers vs. survival from the disaster.

k <- ggplot(train2, aes(Survived))
k + geom_bar(aes( fill = Sex), width=.85, colour="darkgreen") + scale_fill_brewer() +
  ylab("Survival Count (Genderwise)") +
  xlab("Survived: No = 0, Yes = 1") +
  ggtitle("Titanic Disaster: Gender Vs. Survival (Training Dataset")

The following plot reveals that a majority of the passengers who survived Ire females. 

Referring this idea, I can use Sex attribute for a basic model to predict whether a particular person survived or not.

Using R's "rpart" library, I will develop a simple decision tree (model):

Part 4: Build the decision tree model

formula <- Survived ~ Sex

# Build the decision tree
dtree <- rpart(formula, data=sub_training_data, method="class")

4.1: Performance on the Training Data

dtree_tr_predict <- predict(dtree, newdata=sub_training_data, type="class")
dtree_tr_predict.t <- table(sub_training_data$Survived, dtree_tr_predict)

# Model Accuracy
dtree_tr_accuracy <- (dtree_tr_predict.t[1, 1] + dtree_tr_predict.t[2, 2]) / sum(dtree_tr_predict.t)

# Print accuracy in Prediction
cat("Model Accuracy on Sub sample on training data: ", dtree_tr_accuracy)

## Model Accuracy on Sub sample on training data:  0.7919162

4.2: Performance on the Testing Data

dtree_te_predict <- predict(dtree, newdata=sub_testing_data, type="class")
dtree_te_predict.t <- table(sub_testing_data$Survived, dtree_te_predict)

# Model Accuracy
dtree_testing_accuracy <- (dtree_te_predict.t[1, 1] + dtree_te_predict.t[2, 2]) / sum(dtree_te_predict.t)

# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy)

## Model Accuracy in Prediction:  0.7713004

4.3: Display the decision tree

fancyRpartPlot(dtree)

Part 5: Revised Model and Analysis of Results

For the revised model, I'll consider "Age" and "Passenger Class" as additional variables.

Note that the "Passenger Class" variable is a proxy for socio-economic status with:
1 = 1st, 
2 = 2nd, and 
3 = 3rd class of passenger's catagory

Part 5.1: Model Rebuilding

formula2 <- Survived ~ Sex + Pclass + Age

# Build the decision tree
dtree2 <- rpart(formula2, data = sub_training_data, method="class")

5.1: Performance on the Training Data

dtree_tr_predict2 <- predict(dtree2, newdata = sub_training_data, type="class")
dtree_tr_predict.t2 <- table(sub_training_data$Survived, dtree_tr_predict2)

# Model Accuracy
dtree_tr_accuracy2 <- (dtree_tr_predict.t2[1, 1] + dtree_tr_predict.t2[2, 2]) / sum(dtree_tr_predict.t2)

# Print accuracy in Prediction
cat("Model Accuracy on Sub sample on training data: ", dtree_tr_accuracy2)

## Model Accuracy on Sub sample on training data:  0.8128743

5.2: Performance on the Testing Data

dtree_te_predict2 <- predict(dtree2, newdata = sub_testing_data, type="class")
dtree_te_predict.t2 <- table(sub_testing_data$Survived, dtree_te_predict2)

# Model Accuracy
dtree_testing_accuracy2 <- (dtree_te_predict.t2[1, 1] + dtree_te_predict.t2[2, 2]) / sum(dtree_te_predict.t2)

# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy2)

## Model Accuracy in Prediction:  0.7847534

5.3: Display the Decision Tree

fancyRpartPlot(dtree2)

Part 6: Building Model using randomForest package

Since Random Forest funciton in randomForest package throughs an error if there is any variable that has NAs in it and is used to develop decision tree using this function.

We, therefore, first find out which are rows in the training dataset that has the NAs.

If complete row has NAs for all the variables in the training dataset, then we will simple remove it. 

Or else we will replace the NAs with suitable imputation method.

6.1: Finding NAs and Iputing/Removing the NAs

# Function to count number of missing values
nmissing <- function(x) sum(is.na(x))

# Apply the above function to every column in a train dataset
L <- colwise(nmissing)(train)
knitr::kable(L, digits = 2, caption = "Training Data: NA's")

Training Data: NA’s
PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	0	0	0	0	177	0	0	0	0	0	0

As we can see that there are total of 176 rows in varaiable "Age" column that has NAs.

None other variable has NAs.

This makes our life simple, we can replace these NAs using imputation.

Question is which method will be suitable for replacing NAs for a continuous variable?

AaHa! PREDICTION.. that's what are we doing until now, and will do again.

I'll make a prediction of an individual passenger's "Age"" using the other variables and a decision tree model. 

But, this time I will use "anova", why? because I am trying to predict a continuous variable, as I said earlier.

So let's do is using R's "rpart" package. But why "rpart"? Becasue it's function take care of NAs in the data to build up prediciton model.

names(train) # use all of these

##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"

predicted_age <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Survived, data = train[!is.na(train$Age), ], method = "anova")
train1 <- train
test1 <- test

#Summary before imputaiton
summary(train1$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177

train1$Age[is.na(train1$Age)] <- predict(predicted_age, train[is.na(train$Age), ])
#Summary after imputaiton
summary(train1$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   22.00   27.84   29.57   37.00   80.00

k1 <- colwise(nmissing)(train1)
knitr::kable(k1, digits = 2, caption = "Training Data: NA's")

Training Data: NA’s
PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	0	0	0	0	0	0	0	0	0	0	0

Therefore, we can observe from the above summary results (before and after inputation) that change effects due to imputation are almost negligiable on the dataset.

We now can use randomForest package to build RF model on train1 data:

6.2: Developing RF Model

First convert variables into factors.

train1$Survived <- as.factor(train1$Survived)
#train1$Pclass <- as.factor(train1$Pclass)
train1$Sex <- as.factor(train1$Sex)
#train1$Parch <- as.factor(train1$Parch)
train1$Embarked <- as.factor(train1$Embarked)
#train1$SibSp <- as.factor(train1$SibSp)

#test1$Pclass <- as.factor(test$Pclass)
test1$Sex <- as.factor(test1$Sex)
#test1$Parch <- as.factor(test1$Parch)
test1$Embarked <- as.factor(test1$Embarked)
#test1$SibSp <- as.factor(test1$SibSp)

sapply(train1, class)

## PassengerId    Survived      Pclass        Name         Sex         Age 
##   "integer"    "factor"   "integer" "character"    "factor"   "numeric" 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##   "integer"   "integer" "character"   "numeric" "character"    "factor"

sapply(test1, class)

## PassengerId      Pclass        Name         Sex         Age       SibSp 
##   "integer"   "integer" "character"    "factor"   "numeric"   "integer" 
##       Parch      Ticket        Fare       Cabin    Embarked 
##   "integer" "character"   "numeric" "character"    "factor"

Apply randomForest to builde RF tree:

formula2 <- Survived ~ Pclass + Sex + Age
# Build the decision tree
my_rf_model <- randomForest(formula2, data = train1, importance=TRUE, ntree=2000)
#my_rf_model <- randomForest(formula2, data = train1, importance=TRUE, ntree=2000,proximity=TRUE)

Variable importance:

importance(my_rf_model)

##               0         1 MeanDecreaseAccuracy MeanDecreaseGini
## Pclass 42.37356  48.41536             52.11166         37.32574
## Sex    94.66900 116.35814            109.42921        102.67206
## Age    39.05505  30.48756             51.39344         23.96749

Plot Variable importance:

varImpPlot(my_rf_model, main = "RF: Variable Importance")

Make Prediciton:

# Make your prediction using the test set
my_RF_prediction <- predict(my_rf_model, test1)

Modeling Errors (Confusion Matrix):

options('digits'= 3)
conf_matrix <- my_rf_model$confusion
knitr::kable(conf_matrix, digits = 2, caption = "Prediciton Errors: ")

Prediciton Errors:
	0	1	class.error
0	493	56	0.10
1	112	230	0.33

Accuracy of the model:

options('digits'= 3)
my_rf_model$confusion[, 'class.error']

##     0     1 
## 0.102 0.327

# Look at OOB errors
my_rf_model

## 
## Call:
##  randomForest(formula = formula2, data = train1, importance = TRUE,      ntree = 2000) 
##                Type of random forest: classification
##                      Number of trees: 2000
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 18.9%
## Confusion matrix:
##     0   1 class.error
## 0 493  56       0.102
## 1 112 230       0.327

Out of bag (OOB) error is the overall classification error in the above model and as we can see, OOB error still is more than 15%.

This means that we still can reduce the OOB error in the model.

Part 7: Saving Result for Kaggle Submission

# Create a data frame with two columns: PassengerId & Survived. 
## Survived contains your predictions
my_solution_rf <- data.frame(PassengerId = test1$PassengerId, Survived = my_RF_prediction)


# Write your solution away to a csv file with the name my_solution.csv
write.csv(my_solution_rf, file = "my_solution_rf.csv" , row.names=FALSE)