# You may set your own, mine is:
setwd("E:/Titanic_ML/")
This script trains a Random Forest model on the data, saves a sample submission, and plots the relative importance of the variables in making predictions.
Download my_rf_submission.csv from the output below and submit it through https://www.kaggle.com/c/titanic-gettingStarted/submissions/attach to enter this competition!
This analysis uses the Titanic survivor data provided by Kaggle at http://www.kaggle.com/c/titanic/data
I am using R for this analysis.
According to the Kaggle Titanic competition page, the variables are described as follows:
survival: Survival (0 = No; 1 = Yes)
pclass: Passenger Class (proxy for socio-economic status 1 = 1st; 2 = 2nd; 3 = 3rd)
name: Name
sex: Sex
age: Age
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
ticket: Ticket Number
fare: Passenger Fare
cabin: Cabin
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
The purpose of this analysis is to build a model that identifies the variables that best predict a passenger's survival,
and to improve prediction accuracy as variables are added.
Since the training data includes survival labels (whether each passenger survived or not), I'll apply a supervised learning algorithm to build the model.
First, let us load the required libraries and read the data into the working environment.
library(ggplot2)
library(randomForest)
library(rpart)
library(rpart.plot)
library(pander)
library(caTools)
library(rattle)
library(RColorBrewer)
set.seed(1)
train <- read.csv("./input/train.csv", stringsAsFactors=FALSE)
test <- read.csv("./input/test.csv", stringsAsFactors=FALSE)
Training Data:
# Print the first few rows of the training dataset to the console
m <- head(train)
knitr::kable(m, digits = 2, caption = "Training Data: First Few Rows of Data")
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.92 | | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.10 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.46 | | Q |
Testing Data:
# Print the first few rows of the testing dataset to the console
n <- head(test)
knitr::kable(n, digits = 2, caption = "Testing Data: First Few Rows of Data")
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.83 | | Q |
893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.00 | | S |
894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.69 | | Q |
895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.66 | | S |
896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.29 | | S |
897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.22 | | S |
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
str(test)
## 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr "" "" "" "" ...
## $ Embarked : chr "Q" "S" "Q" "S" ...
From the above results, it is clear that the 'Survived' variable is loaded as an integer by default.
Therefore I'll convert it to a categorical variable (factor).
I'll also convert 'Pclass' to a factor, since it is categorical as well but was loaded as an integer.
train$Survived <- as.factor(train$Survived)
train$Pclass <- as.factor(train$Pclass)
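When several columns need the same treatment, and the test data must later be scored by a model fitted on the training data, it can help to convert them in a loop with shared factor levels. The following is a minimal sketch on toy data, not the conversion used in this analysis; `toy_train`, `toy_test`, and `categorical_cols` are hypothetical names introduced only for illustration.

```r
# Toy stand-ins for the train and test data frames (hypothetical values).
toy_train <- data.frame(
  Survived = c(0L, 1L, 1L, 0L),
  Pclass   = c(3L, 1L, 3L, 2L),
  Sex      = c("male", "female", "female", "male"),
  stringsAsFactors = FALSE
)
toy_test <- data.frame(
  Pclass = c(3L, 2L),
  Sex    = c("male", "female"),
  stringsAsFactors = FALSE
)

# Predictors shared by both data frames that should be treated as categorical.
categorical_cols <- c("Pclass", "Sex")

for (col in categorical_cols) {
  # Take the levels from the union of both sets so the factors line up.
  shared_levels <- sort(unique(c(toy_train[[col]], toy_test[[col]])))
  toy_train[[col]] <- factor(toy_train[[col]], levels = shared_levels)
  toy_test[[col]]  <- factor(toy_test[[col]],  levels = shared_levels)
}

# The outcome exists only in the training data.
toy_train$Survived <- factor(toy_train$Survived, levels = c(0, 1))

str(toy_train)
str(toy_test)
```

Sharing levels means a class that happens to be absent from one data frame (here, 1st class in `toy_test`) still exists as a level, which avoids mismatched-level problems at prediction time.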
Next, I'll split the training data into a subset used to fit the models and a subset used to evaluate them.
With SplitRatio = 0.75, sample.split marks about 75% of the observations as TRUE while keeping the proportion of survivors roughly the same in both parts.
The training subset (sub_training_data) therefore contains about 75% of the original training observations, and the testing subset (sub_testing_data) contains the remaining 25%.
set.seed(1446)
split <- sample.split(train$Survived, SplitRatio = 0.75)
sub_training_data <- subset(train, split == TRUE)
sub_testing_data <- subset(train, split == FALSE)
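To make the idea of a stratified 75/25 split concrete, here is a small base R sketch that samples 75% of the rows within each class and then checks the resulting proportions. It illustrates the concept only and is not how caTools implements `sample.split`; the `labels` vector is hypothetical toy data.

```r
set.seed(1446)

# Hypothetical labels with a 60/40 class balance.
labels <- factor(c(rep(0, 60), rep(1, 40)))

# Stratified split: within each class, mark 75% of the rows for training.
in_train <- logical(length(labels))
for (lvl in levels(labels)) {
  idx <- which(labels == lvl)
  n_train <- round(0.75 * length(idx))
  in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
}

# Share of rows kept for training, and the class balance within each subset.
mean(in_train)
prop.table(table(labels[in_train]))
prop.table(table(labels[!in_train]))
```

The same three checks can be run on the real split by replacing `in_train` with `split` and `labels` with `train$Survived`.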
I'll first consider a simple model based on gender (Sex variable) of the passengers vs. survival from the disaster.
k <- ggplot(train, aes(Survived))
k + geom_bar(aes( fill = Sex), width=.85, colour="darkgreen") + scale_fill_brewer() +
ylab("Survival Count (Genderwise)") +
xlab("Survived: No = 0, Yes = 1") +
ggtitle("Titanic Disaster: Gender Vs. Survival (Training Dataset)")
The plot above reveals that a majority of the passengers who survived were female.
Based on this observation, I can use the Sex attribute as the basis for a simple model that predicts whether a particular person survived or not.
Using R's "rpart" library, I will develop a simple decision tree (model):
formula <- Survived ~ Sex
# Build the decision tree
dtree <- rpart(formula, data=sub_training_data, method="class")
dtree_tr_predict <- predict(dtree, newdata=sub_training_data, type="class")
dtree_tr_predict.t <- table(sub_training_data$Survived, dtree_tr_predict)
# Model Accuracy
dtree_tr_accuracy <- (dtree_tr_predict.t[1, 1] + dtree_tr_predict.t[2, 2]) / sum(dtree_tr_predict.t)
# Print accuracy in Prediction
cat("Model Accuracy on Sub sample on training data: ", dtree_tr_accuracy)
## Model Accuracy on Sub sample on training data: 0.7919162
dtree_te_predict <- predict(dtree, newdata=sub_testing_data, type="class")
dtree_te_predict.t <- table(sub_testing_data$Survived, dtree_te_predict)
# Model Accuracy
dtree_testing_accuracy <- (dtree_te_predict.t[1, 1] + dtree_te_predict.t[2, 2]) / sum(dtree_te_predict.t)
# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy)
## Model Accuracy in Prediction: 0.7713004
fancyRpartPlot(dtree)
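The accuracy calculation above adds the two diagonal cells of the confusion matrix and divides by the total. A small helper can express the same idea once and reuse it, and comparing against a majority-class baseline shows how much the model actually adds. This is a sketch on hypothetical labels; `accuracy_from_table` is a name introduced here, not part of rpart or any other package.

```r
# Accuracy from a confusion matrix: correct predictions sit on the diagonal.
accuracy_from_table <- function(confusion) {
  sum(diag(confusion)) / sum(confusion)
}

# Hypothetical actual and predicted labels, for illustration only.
actual    <- factor(c(0, 0, 0, 1, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(0, 0, 1, 1, 0, 0, 1, 0), levels = c(0, 1))

confusion <- table(actual, predicted)
model_accuracy <- accuracy_from_table(confusion)

# Baseline: always predict the most frequent class.
baseline_accuracy <- max(table(actual)) / length(actual)

cat("Model accuracy:   ", model_accuracy, "\n")
cat("Baseline accuracy:", baseline_accuracy, "\n")
```

Applied to the tables built above, `accuracy_from_table(dtree_te_predict.t)` should give the same value as the explicit `[1, 1] + [2, 2]` form, and the diagonal version also extends to outcomes with more than two classes.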
For the revised model, I'll consider "Age" and "Passenger Class" as additional variables.
Note that the "Passenger Class" variable is a proxy for socio-economic status, coded as:
1 = 1st class,
2 = 2nd class, and
3 = 3rd class
formula2 <- Survived ~ Sex + Pclass + Age
# Build the decision tree
dtree2 <- rpart(formula2, data = sub_training_data, method="class")
dtree_tr_predict2 <- predict(dtree2, newdata = sub_training_data, type="class")
dtree_tr_predict.t2 <- table(sub_training_data$Survived, dtree_tr_predict2)
# Model Accuracy
dtree_tr_accuracy2 <- (dtree_tr_predict.t2[1, 1] + dtree_tr_predict.t2[2, 2]) / sum(dtree_tr_predict.t2)
# Print accuracy in Prediction
cat("Model Accuracy on Sub sample on training data: ", dtree_tr_accuracy2)
## Model Accuracy on Sub sample on training data: 0.8128743
dtree_te_predict2 <- predict(dtree2, newdata = sub_testing_data, type="class")
dtree_te_predict.t2 <- table(sub_testing_data$Survived, dtree_te_predict2)
# Model Accuracy
dtree_testing_accuracy2 <- (dtree_te_predict.t2[1, 1] + dtree_te_predict.t2[2, 2]) / sum(dtree_te_predict.t2)
# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy2)
## Model Accuracy in Prediction: 0.7847534
fancyRpartPlot(dtree2)
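To compare the two trees side by side, the accuracies printed above can be collected in a small data frame. The sketch below only restates those reported numbers and computes two differences: the gap between training and held-out accuracy for each model, and the held-out gain of the revised model. The `results` data frame and its column names are introduced here for illustration.

```r
# Accuracies reported in the output above.
results <- data.frame(
  model = c("Sex", "Sex + Pclass + Age"),
  train_accuracy = c(0.7919162, 0.8128743),
  test_accuracy  = c(0.7713004, 0.7847534)
)

# Gap between in-sample and held-out accuracy for each model.
results$gap <- results$train_accuracy - results$test_accuracy

# Improvement of the revised model over the Sex-only model on held-out data.
test_gain <- diff(results$test_accuracy)

print(results)
cat("Held-out accuracy gain:", test_gain, "\n")
```

Adding Age and Pclass raises held-out accuracy by roughly 1.3 percentage points, while the gap between training and held-out accuracy widens slightly, so the improvement is real but modest for this particular split.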