Set Working Directory

# You may set your own, mine is:
setwd("E:/Titanic_ML/")

This script trains a Random Forest model on the data, saves a sample submission, and plots the relative importance of the variables used in making predictions.

Download my_rf_submission.csv from the output below and submit it through https://www.kaggle.com/c/titanic-gettingStarted/submissions/attach to enter this competition!


This analysis uses the Titanic survivor data provided by Kaggle at http://www.kaggle.com/c/titanic/data

I am using R for this analysis.

According to the description on the Kaggle Titanic competition page, the variables are as follows:

    survival: Survival (0 = No; 1 = Yes)
    pclass: Passenger Class (proxy for socio-economic status 1 = 1st; 2 = 2nd; 3 = 3rd)
    name: Name
    sex: Sex
    age: Age
    sibsp: Number of Siblings/Spouses Aboard
    parch: Number of Parents/Children Aboard
    ticket: Ticket Number
    fare: Passenger Fare
    cabin: Cabin
    embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Part 1: Problem Description

The purpose of this analysis is to build a model that uses the variables best able to predict a passenger's survival, with good prediction accuracy.

For this analysis, since I have survival labels (whether or not each passenger survived), I'll apply a supervised learning algorithm to build the model.

Part 2: Model Building

First, let us load the required libraries in R and the data into the working environment.

2.1: Load Required Libraries

library(ggplot2)
library(randomForest)
library(rpart)
library(rpart.plot)
library(pander)
library(caTools)
library(rattle)
library(RColorBrewer)

2.2: Load training and test data

set.seed(1)
train <- read.csv("./input/train.csv", stringsAsFactors=FALSE)
test  <- read.csv("./input/test.csv",  stringsAsFactors=FALSE)

2.3: View the data

Training Data:
# Print a few rows of the training dataset to the console
m <- head(train)
knitr::kable(m, digits = 2, caption = "Training Data: First Few Rows of Data")
Training Data: First Few Rows of Data

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|------------:|---------:|-------:|:-----|:----|----:|------:|------:|:-------|-----:|:------|:---------|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.92 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.10 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.46 | | Q |
Testing Data:
# Print a few rows of the testing dataset to the console
n <- head(test)
knitr::kable(n, digits = 2, caption = "Testing Data: First Few Rows of Data")
Testing Data: First Few Rows of Data

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|------------:|-------:|:-----|:----|----:|------:|------:|:-------|-----:|:------|:---------|
| 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.83 | | Q |
| 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.00 | | S |
| 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.69 | | Q |
| 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.66 | | S |
| 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.29 | | S |
| 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.22 | | S |

2.4: View the data structures

2.4.1: Training dataset

str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

2.4.2: Testing dataset

str(test)
## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...

2.5: Converting variables to factors

From the above results, it is clear that the 'Survived' variable is loaded as an integer by default.

Therefore I'll convert it to a categorical variable (factor).

I'll also convert the other variables that are actually categorical but were not loaded as factors; a sketch covering the remaining ones follows the code below.
train$Survived <- as.factor(train$Survived)
train$Pclass <- as.factor(train$Pclass)
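
Sex and Embarked are categorical as well, but they are loaded as character vectors. A small, hedged sketch of converting them too (the original code above converts only Survived and Pclass):

train$Sex      <- as.factor(train$Sex)
train$Embarked <- as.factor(train$Embarked)   # "" marks the two passengers with a missing port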

2.6: Splitting training data

As previously noted, I'll split the training data into a set that will be used to train the model and a set that will be used to evaluate the model.

The training subset, as created by the process below, will contain 75% of randomly selected observations from the original training set.

The testing (validation) subset will contain the remaining 25% of observations, i.e. those excluded from the training subset.
set.seed(1446)
split <- sample.split(train$Survived, SplitRatio = 0.75)
sub_training_data <- subset(train, split == TRUE)
sub_testing_data <- subset(train, split == FALSE)
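
As a quick sanity check (a minimal sketch using base R, not part of the original script), we can confirm that sample.split kept the 75/25 proportions and the survival class balance:

# Fraction of observations in each subset
nrow(sub_training_data) / nrow(train)   # ~0.75
nrow(sub_testing_data)  / nrow(train)   # ~0.25

# Survival rate should be roughly equal in both subsets
prop.table(table(sub_training_data$Survived))
prop.table(table(sub_testing_data$Survived))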

Part 3: Visualizing Data

I'll first consider a simple model relating the passengers' gender (the Sex variable) to survival of the disaster.
k <- ggplot(train, aes(Survived))
k + geom_bar(aes( fill = Sex), width=.85, colour="darkgreen") + scale_fill_brewer() +
  ylab("Survival Count (Genderwise)") +
  xlab("Survived: No = 0, Yes = 1") +
  ggtitle("Titanic Disaster: Gender Vs. Survival (Training Dataset")

The plot reveals that a majority of the passengers who survived were female.

Based on this observation, I can use the Sex attribute in a basic model to predict whether a particular passenger survived.
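
The same pattern can be confirmed numerically with a simple cross-tabulation (a small sketch, not in the original script):

# Survival counts and row-wise survival rates by gender
table(Sex = train$Sex, Survived = train$Survived)
prop.table(table(Sex = train$Sex, Survived = train$Survived), margin = 1)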

Using R's "rpart" package, I will build a simple decision tree model:

Part 4: Build the Decision Tree Model

formula <- Survived ~ Sex

# Build the decision tree
dtree <- rpart(formula, data=sub_training_data, method="class")

4.1: Performance on the Training Data

dtree_tr_predict <- predict(dtree, newdata=sub_training_data, type="class")
dtree_tr_predict.t <- table(sub_training_data$Survived, dtree_tr_predict)

# Model Accuracy
dtree_tr_accuracy <- (dtree_tr_predict.t[1, 1] + dtree_tr_predict.t[2, 2]) / sum(dtree_tr_predict.t)

# Print accuracy on the training subsample
cat("Model accuracy on the training subsample: ", dtree_tr_accuracy)
## Model accuracy on the training subsample:  0.7919162

4.2: Performance on the Testing Data

dtree_te_predict <- predict(dtree, newdata=sub_testing_data, type="class")
dtree_te_predict.t <- table(sub_testing_data$Survived, dtree_te_predict)

# Model Accuracy
dtree_testing_accuracy <- (dtree_te_predict.t[1, 1] + dtree_te_predict.t[2, 2]) / sum(dtree_te_predict.t)

# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy)
## Model Accuracy in Prediction:  0.7713004
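
The accuracy above is computed from a confusion matrix that is never printed; displaying it makes the error types visible (a minimal addition, not in the original script):

# Rows are the actual labels, columns the predicted labels
print(dtree_te_predict.t)

# Share of actual survivors correctly identified (true positive rate)
dtree_te_predict.t[2, 2] / sum(dtree_te_predict.t[2, ])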

4.3: Display the decision tree

fancyRpartPlot(dtree)
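
fancyRpartPlot() comes from the rattle package, which can be troublesome to install on some systems. The rpart.plot package loaded earlier offers a plainer alternative, shown here as an optional fallback:

# Fallback plot if rattle is unavailable
rpart.plot::prp(dtree, type = 2, extra = 104, fallen.leaves = TRUE)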

Part 5: Revised Model and Analysis of Results

For the revised model, I'll consider "Age" and "Passenger Class" as additional variables.

Note that the "Passenger Class" (Pclass) variable is a proxy for socio-economic status:
1 = 1st class,
2 = 2nd class, and
3 = 3rd class.
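
Before rebuilding the model, a quick tabulation of survival rates by class suggests why Pclass should help (a small sketch using base R, not in the original script):

# Row-wise survival proportions per passenger class
prop.table(table(Pclass = train$Pclass, Survived = train$Survived), margin = 1)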

5.1: Model Rebuilding

formula2 <- Survived ~ Sex + Pclass + Age

# Build the decision tree
dtree2 <- rpart(formula2, data = sub_training_data, method="class")

5.2: Performance on the Training Data

dtree_tr_predict2 <- predict(dtree2, newdata = sub_training_data, type="class")
dtree_tr_predict.t2 <- table(sub_training_data$Survived, dtree_tr_predict2)

# Model Accuracy
dtree_tr_accuracy2 <- (dtree_tr_predict.t2[1, 1] + dtree_tr_predict.t2[2, 2]) / sum(dtree_tr_predict.t2)

# Print accuracy on the training subsample
cat("Model accuracy on the training subsample: ", dtree_tr_accuracy2)
## Model accuracy on the training subsample:  0.8128743

5.3: Performance on the Testing Data

dtree_te_predict2 <- predict(dtree2, newdata = sub_testing_data, type="class")
dtree_te_predict.t2 <- table(sub_testing_data$Survived, dtree_te_predict2)

# Model Accuracy
dtree_testing_accuracy2 <- (dtree_te_predict.t2[1, 1] + dtree_te_predict.t2[2, 2]) / sum(dtree_te_predict.t2)

# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy2)
## Model Accuracy in Prediction:  0.7847534

5.4: Display the Decision Tree

fancyRpartPlot(dtree2)
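
The introduction mentions training a Random Forest, plotting the relative importance of the variables, and saving my_rf_submission.csv; that final step is not shown above. The following is only a minimal sketch of how it could look, trained on the full training set with the same three predictors (the ntree value and the median-Age imputation are assumptions, not the original script's exact choices):

# Align column types between the training and test sets
train$Sex   <- as.factor(train$Sex)
test$Sex    <- as.factor(test$Sex)
test$Pclass <- as.factor(test$Pclass)

set.seed(1446)
rf_model <- randomForest(Survived ~ Sex + Pclass + Age, data = train,
                         ntree = 500, importance = TRUE,
                         na.action = na.roughfix)   # na.roughfix imputes the missing Age values

# Relative importance of the variables in making predictions
varImpPlot(rf_model)

# Impute missing Age in the test set with the training-set median before predicting
test_imputed <- test
test_imputed$Age[is.na(test_imputed$Age)] <- median(train$Age, na.rm = TRUE)

rf_predict <- predict(rf_model, newdata = test_imputed)

# Write the submission file referenced at the top of the script
submission <- data.frame(PassengerId = test$PassengerId, Survived = rf_predict)
write.csv(submission, file = "my_rf_submission.csv", row.names = FALSE)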