Random Forest: Assignment

Set Working Directory

# You may set your own, mine is:
setwd("E:/Titanic_ML/")

This script trains a Random Forest model based on the data, saves a sample submission, and plots the relative importance of the variables in making predictions.

Download my_rf_submission.csv from the output below and submit it through https://www.kaggle.com/c/titanic-gettingStarted/submissions/attach to enter this competition!

This analysis uses the Titanic survivor data provided by Kaggle at http://www.kaggle.com/c/titanic/data

I am using R for this analysis.

According to the Kaggle Titanic competition description, the variables are as follows:

    survival: Survival (0 = No; 1 = Yes)
    pclass: Passenger Class (proxy for socio-economic status 1 = 1st; 2 = 2nd; 3 = 3rd)
    name: Name
    sex: Sex
    age: Age
    sibsp: Number of Siblings/Spouses Aboard
    parch: Number of Parents/Children Aboard
    ticket: Ticket Number
    fare: Passenger Fare
    cabin: Cabin
    embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Part 1: Problem Description

The purpose of this analysis is to build a model that selects the variables best able to predict a passenger's survival, and so to improve prediction accuracy.

For this analysis, since I have survival labels (whether survived or not) I'll apply a supervised learning algorithm to build the model.

Part 2: Model Building

First, let us load the required libraries and the data into the R working environment.

2.1: Load Required Libraries

library(ggplot2)
library(randomForest)
library(rpart)
library(rpart.plot)
library(pander)
library(caTools)
library(rattle)
library(RColorBrewer)

2.2: Load training and test data

set.seed(1)
train <- read.csv("./input/train.csv", stringsAsFactors=FALSE)
test  <- read.csv("./input/test.csv",  stringsAsFactors=FALSE)

2.3: View the data

Training Data:

# Print the first few rows of the training dataset
m <- head(train)
knitr::kable(m, digits = 2, caption = "Training Data: First Few Rows of Data")

Training Data: First Few Rows of Data
PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	0	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.25		S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.28	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.92		S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.10	C123	S
5	0	3	Allen, Mr. William Henry	male	35	0	373450	8.05		S
6	0	3	Moran, Mr. James	male	NA	0	330877	8.46		Q

Testing Data:

# Print the first few rows of the testing dataset
n <- head(test)
knitr::kable(n, digits = 2, caption = "Testing Data: First Few Rows of Data")

Testing Data: First Few Rows of Data
PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked
892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.83	Q
893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.00	S
894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.69	Q
895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.66	S
896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.29	S
897	3	Svensson, Mr. Johan Cervin	male	14.0	0	0	7538	9.22	S

2.4: Also view structures

2.4.1: Training dataset

str(train)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

2.4.2: Testing dataset

str(test)

## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...

2.5: Converting variables to factors

From the above results, it is clear that the 'Survived' variable is loaded as an integer by default.

Therefore I'll convert it to a categorical variable (factor).

I'll also convert 'Pclass' to a factor, since it too is loaded as an integer but is actually categorical.

train$Survived <- as.factor(train$Survived)
train$Pclass <- as.factor(train$Pclass)
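
The same conversion can be applied to several columns in one step. The sketch below uses a small hypothetical data frame (toy values, not the Kaggle data), and it is an assumption that Sex and Embarked should be treated as factors too; the code above converts only Survived and Pclass.

```r
# Toy data frame standing in for train (hypothetical values, not the Kaggle data)
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 3, 2),
  Sex      = c("male", "female", "female", "male"),
  Embarked = c("S", "C", "S", "Q"),
  stringsAsFactors = FALSE
)

# Columns assumed to be categorical
cat_cols <- c("Survived", "Pclass", "Sex", "Embarked")

# Convert each listed column to a factor in one step
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# Check the resulting column classes
sapply(toy, class)
```

If the same model is later used to predict on the test data, the test columns would need the same conversion so that the factor levels match.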

2.6: Splitting training data

I'll split the training data into a set that will be used to train the model and a set that will be used to evaluate the model.

The new training set (sub_training_data) will contain 75% of the observations, selected at random.

The new testing set (sub_testing_data) will contain the remaining 25% of the observations in the original training set, which are excluded from the new training set.

set.seed(1446)
split <- sample.split(train$Survived, SplitRatio = 0.75)
sub_training_data <- subset(train, split == TRUE)
sub_testing_data <- subset(train, split == FALSE)
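
A quick way to check a split like this is to compare the subset sizes and the class proportions in each subset. The sketch below does not call sample.split; it imitates a stratified 75/25 split in base R on hypothetical labels, so the helper name stratified_flag and the toy labels are assumptions made for illustration.

```r
set.seed(1446)

# Hypothetical labels: 60 non-survivors and 40 survivors
y <- factor(rep(c(0, 1), times = c(60, 40)))

# Flag a fixed share of each class as TRUE (training); the rest stay FALSE (testing)
stratified_flag <- function(labels, ratio) {
  flag <- logical(length(labels))
  for (lev in levels(labels)) {
    idx <- which(labels == lev)
    n_train <- round(ratio * length(idx))
    flag[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  flag
}

flag <- stratified_flag(y, 0.75)

# Sizes of the two subsets
table(flag)

# Class proportions in the training and testing subsets
prop.table(table(y[flag]))
prop.table(table(y[!flag]))
```

The same checks, table(split) and prop.table(table(sub_training_data$Survived)), can be run on the real split above.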

Part 3: Visualizing Data

I'll first consider a simple model based on gender (Sex variable) of the passengers vs. survival from the disaster.

k <- ggplot(train, aes(Survived))
k + geom_bar(aes( fill = Sex), width=.85, colour="darkgreen") + scale_fill_brewer() +
  ylab("Survival Count (Genderwise)") +
  xlab("Survived: No = 0, Yes = 1") +
  ggtitle("Titanic Disaster: Gender Vs. Survival (Training Dataset)")

The plot above reveals that a majority of the passengers who survived were female.

Based on this observation, I can use the Sex attribute for a basic model to predict whether a particular person survived or not.

Using R's "rpart" library, I will develop a simple decision tree (model):

Part 4: Build the decision tree model

formula <- Survived ~ Sex

# Build the decision tree
dtree <- rpart(formula, data=sub_training_data, method="class")
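
With a single two-level predictor, a classification tree can split at most once, so when it does split, its prediction is the majority class within each sex. The sketch below reproduces that rule with a cross-tabulation in base R; the toy data frame is hypothetical and the code does not call rpart.

```r
# Hypothetical toy data, not the Kaggle passengers
toy <- data.frame(
  Sex      = c("male", "male", "male", "female", "female", "female"),
  Survived = factor(c(0, 0, 1, 1, 1, 0))
)

# Cross-tabulate sex against survival
counts <- table(toy$Sex, toy$Survived)

# Majority class within each sex
majority <- colnames(counts)[apply(counts, 1, which.max)]
names(majority) <- rownames(counts)
majority
```

Running table(sub_training_data$Sex, sub_training_data$Survived) on the real data gives the counts that the Sex-only tree is built from.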

4.1: Performance on the Training Data

dtree_tr_predict <- predict(dtree, newdata=sub_training_data, type="class")
dtree_tr_predict.t <- table(sub_training_data$Survived, dtree_tr_predict)

# Model Accuracy
dtree_tr_accuracy <- (dtree_tr_predict.t[1, 1] + dtree_tr_predict.t[2, 2]) / sum(dtree_tr_predict.t)

# Print accuracy in Prediction
cat("Model Accuracy on Sub sample on training data: ", dtree_tr_accuracy)

## Model Accuracy on Sub sample on training data:  0.7919162
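
The accuracy above is the sum of the diagonal of the confusion matrix divided by the total count. A minimal sketch of the same calculation, using hypothetical label vectors rather than model output, shows two equivalent ways to compute it:

```r
# Hypothetical actual and predicted labels (not model output)
actual    <- factor(c(0, 0, 1, 1, 0, 1, 0, 1), levels = c(0, 1))
predicted <- factor(c(0, 1, 1, 1, 0, 0, 0, 1), levels = c(0, 1))

# Confusion matrix: rows are actual classes, columns are predictions
conf_mat <- table(actual, predicted)

# Accuracy as correct predictions over all predictions
acc_diag <- sum(diag(conf_mat)) / sum(conf_mat)

# Equivalent shortcut: share of positions where the labels match
acc_mean <- mean(actual == predicted)

cat("Accuracy (diagonal): ", acc_diag, "\n")
cat("Accuracy (mean):     ", acc_mean, "\n")
```

Using sum(diag(...)) avoids hard-coding the [1, 1] and [2, 2] cells and extends to more than two classes.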

4.2: Performance on the Testing Data

dtree_te_predict <- predict(dtree, newdata=sub_testing_data, type="class")
dtree_te_predict.t <- table(sub_testing_data$Survived, dtree_te_predict)

# Model Accuracy
dtree_testing_accuracy <- (dtree_te_predict.t[1, 1] + dtree_te_predict.t[2, 2]) / sum(dtree_te_predict.t)

# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy)

## Model Accuracy in Prediction:  0.7713004

4.3: Display the decision tree

fancyRpartPlot(dtree)

Part 5: Revised Model and Analysis of Results

For the revised model, I'll consider "Age" and "Passenger Class" as additional variables.

Note that the "Passenger Class" variable is a proxy for socio-economic status with:
1 = 1st,
2 = 2nd, and
3 = 3rd class of passenger category

Part 5.1: Model Rebuilding

formula2 <- Survived ~ Sex + Pclass + Age

# Build the decision tree
dtree2 <- rpart(formula2, data = sub_training_data, method="class")

5.1: Performance on the Training Data

dtree_tr_predict2 <- predict(dtree2, newdata = sub_training_data, type="class")
dtree_tr_predict.t2 <- table(sub_training_data$Survived, dtree_tr_predict2)

# Model Accuracy
dtree_tr_accuracy2 <- (dtree_tr_predict.t2[1, 1] + dtree_tr_predict.t2[2, 2]) / sum(dtree_tr_predict.t2)

# Print accuracy in Prediction
cat("Model Accuracy on Sub sample on training data: ", dtree_tr_accuracy2)

## Model Accuracy on Sub sample on training data:  0.8128743

5.2: Performance on the Testing Data

dtree_te_predict2 <- predict(dtree2, newdata = sub_testing_data, type="class")
dtree_te_predict.t2 <- table(sub_testing_data$Survived, dtree_te_predict2)

# Model Accuracy
dtree_testing_accuracy2 <- (dtree_te_predict.t2[1, 1] + dtree_te_predict.t2[2, 2]) / sum(dtree_te_predict.t2)

# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy2)

## Model Accuracy in Prediction:  0.7847534
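
To compare the two models side by side, the four accuracies reported above can be collected in a small table. The sketch below only copies those printed values; the data frame name results and the gap column are illustrative additions.

```r
# Accuracies copied from the printed output of the two models
results <- data.frame(
  model    = c("Sex", "Sex + Pclass + Age"),
  training = c(0.7919162, 0.8128743),
  testing  = c(0.7713004, 0.7847534)
)

# Difference between training and testing accuracy for each model
results$gap <- results$training - results$testing

results
```

Both accuracies improve with the added variables, and the training accuracy stays above the testing accuracy for each model.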

5.3: Display the Decision Tree

fancyRpartPlot(dtree2)