Titanic: Machine Learning from Disaster

Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Introduction

This is my first Kaggle competition, and I chose to work with the Titanic dataset after spending some time deciding which competition I would like to participate in.

LOAD AND CHECK DATA

# Load packages
library('ggplot2') # visualization
library('ggthemes') # visualization
library('scales') # visualization
library('dplyr') # data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library('mice') # imputation
## Loading required package: Rcpp
## mice 2.25 2015-11-09
library('randomForest') # classification algorithm
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(xtable)
library(caTools)
library(rpart.plot)
## Loading required package: rpart
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(RColorBrewer)

Now that our packages are loaded, let's take a peek at our data.

train <- read.csv('../Titantic-Machine Learning from Disaster/train.csv', stringsAsFactors = F)
test  <- read.csv('../Titantic-Machine Learning from Disaster/test.csv', stringsAsFactors = F)

Let’s take a look at our **Train** data.

# Print a few rows of the training dataset to the console
m <- head(train)
knitr::kable(m, digits = 2, caption = "Training Data: First Few Rows of Data")
Training Data: First Few Rows of Data

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.92 |  | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.10 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 |  | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.46 |  | Q |
Let’s take a look at our **Test** data.
# Print a few rows of the testing dataset to the console
n <- head(test)
knitr::kable(n, digits = 2, caption = "Testing Data: First Few Rows of Data")
Testing Data: First Few Rows of Data

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|-------------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.83 |  | Q |
| 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.00 |  | S |
| 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.69 |  | Q |
| 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.66 |  | S |
| 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.29 |  | S |
| 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.22 |  | S |

Now we are going to view the structure of the Train table.

str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Now we are going to view the structure of the Test table.

str(test)
## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...

This gives us a sense of our variables, their class types, and the first few observations of each. The training set holds 891 observations of 12 variables and the test set holds 418 observations of 11 variables (it lacks the Survived column), so we are working with 1,309 passengers in total; a quick sketch of stacking the two sets follows the table below. To make the variables more readable:

| Variable Name | Description |
|---------------|-------------|
| Survived | Survived (1) or died (0) |
| Pclass | Passenger’s class |
| Name | Passenger’s name |
| Sex | Passenger’s sex |
| Age | Passenger’s age |
| SibSp | Number of siblings/spouses aboard |
| Parch | Number of parents/children aboard |
| Ticket | Ticket number |
| Fare | Fare |
| Cabin | Cabin |
| Embarked | Port of embarkation |
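
As an aside, the 1,309 figure comes from stacking the train and test sets; a minimal sketch using dplyr's bind_rows (not needed for the rest of this analysis):

# Stack train and test; bind_rows fills test's missing Survived column with NA
full <- dplyr::bind_rows(train, test)
dim(full)  # 1309 rows, 12 columns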

DATA VISUALIZATION

library(ggplot2)
library(lattice)
library(caret)
data <- read.csv("../Titantic-Machine Learning from Disaster/train.csv",header = T, stringsAsFactors = T)
# Convert Survived to a factor first, then label its levels
data$Survived <- as.factor(data$Survived)
levels(data$Survived) <- c("0", "1")
par(mfrow=c(4,2))
barplot(table(data$Survived),
        names.arg = c("Perished", "Survived"),
        main="Survived (passenger fate)", col="black")
barplot(table(data$Pclass), 
        names.arg = c("first", "second", "third"),
        main="Pclass (passenger traveling class)", col="firebrick")
barplot(table(data$Sex), main="Sex (gender)", col="darkviolet")
hist(data$Age, main="Age", xlab = NULL, col="brown")
barplot(table(data$SibSp), main="SibSp (siblings + spouse aboard)", 
        col="darkblue")
barplot(table(data$Parch), main="Parch (parents + kids aboard)", 
        col="gray50")
hist(data$Fare, main="Fare (fee paid for ticket[s])", xlab = NULL, 
     col="darkgreen")
barplot(table(data$Embarked), 
        names.arg = c("", "C", "Q", "S"),  # the unlabeled first bar is the two passengers with a blank Embarked value
        main="Embarked (port of embarkation)", col="sienna")

Converting variables to factors

From the results above, it is clear that the Survived variable is loaded as an integer by default.

Therefore I’ll convert it to a categorical variable (factor).

I’ll also convert other variables that are categorical in nature but were not loaded as factors (a small optional addition for Sex and Embarked follows the code below).

train$Survived <- as.factor(train$Survived)
train$Pclass <- as.factor(train$Pclass)
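
Sex and Embarked are categorical too; converting them as well is a small optional addition (rpart would coerce character predictors to factors on its own):

train$Sex <- as.factor(train$Sex)
train$Embarked <- as.factor(train$Embarked)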

Splitting training data

As previously noted, I’ll split the training data into a set that will be used to train the model and a set that will be used to evaluate it.

The new training set will contain 75% of randomly selected observations from the original training data.

The new testing set will contain the remaining 25% of observations, i.e. those excluded from the new training set.

set.seed(1446)
split <- sample.split(train$Survived, SplitRatio = 0.75)
sub_training_data <- subset(train, split == TRUE)
sub_testing_data <- subset(train, split == FALSE)
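
A quick sanity check that sample.split kept the 75/25 ratio and a similar class balance in both subsets:

# Share of observations routed to each subset
prop.table(table(split))
# Survival rate within each subset (should be nearly identical, since sample.split stratifies on the label)
prop.table(table(sub_training_data$Survived))
prop.table(table(sub_testing_data$Survived))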

Visualizing Data

Let’s explore the basic patterns in the dataset using GGally (a sketch of the call is shown below). You can observe that:

- 1st-class passengers are more likely to survive
- Female passengers are more likely to survive
- Children are more likely to survive
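
A minimal ggpairs sketch of those relationships, restricted to a few key columns (my choice of columns; the NAs in Age will trigger warnings but the plot still renders):

# Pairwise view of survival against class, sex, and age
ggpairs(sub_training_data[, c("Survived", "Pclass", "Sex", "Age")],
        mapping = aes(colour = Survived))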

Part 4: Build the decision tree model

formula <- Survived ~ Sex

# Build the decision tree
dtree <- rpart(formula, data=sub_training_data, method="class")

Performance on the Training Data

dtree_tr_predict <- predict(dtree, newdata=sub_training_data, type="class")
dtree_tr_predict.t <- table(sub_training_data$Survived, dtree_tr_predict)

# Model Accuracy
dtree_tr_accuracy <- (dtree_tr_predict.t[1, 1] + dtree_tr_predict.t[2, 2]) / sum(dtree_tr_predict.t)

# Print accuracy in Prediction
cat("Model Accuracy on Sub sample on training data: ", dtree_tr_accuracy)
## Model Accuracy on Sub sample on training data:  0.7919162

Performance on the Testing Data

dtree_te_predict <- predict(dtree, newdata=sub_testing_data, type="class")
dtree_te_predict.t <- table(sub_testing_data$Survived, dtree_te_predict)

# Model Accuracy
dtree_testing_accuracy <- (dtree_te_predict.t[1, 1] + dtree_te_predict.t[2, 2]) / sum(dtree_te_predict.t)

# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy)
## Model Accuracy in Prediction:  0.7713004
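
Beyond raw accuracy, caret (loaded in the visualization section) provides confusionMatrix with per-class metrics; a brief sketch on the held-out split, treating "1" (survived) as the positive class:

# Confusion matrix plus sensitivity/specificity for the simple tree
cm <- confusionMatrix(dtree_te_predict, sub_testing_data$Survived, positive = "1")
cm$table
cm$byClass[c("Sensitivity", "Specificity")]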

Display the decision tree

fancyRpartPlot(dtree)

Part 5: Revised Model and Analysis of Results

For the revised model, I’ll consider “Age” and “Passenger Class” as additional variables.

Note that the “Passenger Class” variable is a proxy for socio-economic status: 1 = 1st, 2 = 2nd, and 3 = 3rd class. Also note that Age contains missing values; rpart handles these by default via surrogate splits, so the tree can be built without imputing Age first.

formula2 <- Survived ~ Sex + Pclass + Age

# Build the decision tree
dtree2 <- rpart(formula2, data = sub_training_data, method="class")

Performance on the Training Data

dtree_tr_predict2 <- predict(dtree2, newdata = sub_training_data, type="class")
dtree_tr_predict.t2 <- table(sub_training_data$Survived, dtree_tr_predict2)

# Model Accuracy
dtree_tr_accuracy2 <- (dtree_tr_predict.t2[1, 1] + dtree_tr_predict.t2[2, 2]) / sum(dtree_tr_predict.t2)

# Print accuracy in Prediction
cat("Model Accuracy on Sub sample on training data: ", dtree_tr_accuracy2)
## Model Accuracy on Sub sample on training data:  0.8128743

Performance on the Testing Data

dtree_te_predict2 <- predict(dtree2, newdata = sub_testing_data, type="class")
dtree_te_predict.t2 <- table(sub_testing_data$Survived, dtree_te_predict2)

# Model Accuracy
dtree_testing_accuracy2 <- (dtree_te_predict.t2[1, 1] + dtree_te_predict.t2[2, 2]) / sum(dtree_te_predict.t2)

# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy2)
## Model Accuracy in Prediction:  0.7847534
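
Putting the two trees side by side on the held-out split (numbers from the runs above), adding Pclass and Age lifts accuracy from roughly 0.771 to 0.785:

# Held-out accuracy of both trees
cat("Sex only:           ", dtree_testing_accuracy, "\n")
cat("Sex + Pclass + Age: ", dtree_testing_accuracy2, "\n")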

Display the Decision Tree

fancyRpartPlot(dtree2)
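
Finally, to score the actual Kaggle test set with the revised tree, here is a minimal sketch; the output file name is my own choice, and test’s Pclass must be converted to a factor to match the training data:

# Match the factor type used when the tree was trained
test$Pclass <- as.factor(test$Pclass)
# rpart's surrogate splits handle the missing Ages in test at prediction time
prediction <- predict(dtree2, newdata = test, type = "class")
submission <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
write.csv(submission, file = "titanic_dtree_submission.csv", row.names = FALSE)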