Titanic: Machine Learning from Disaster

Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Introduction

This is my first Kaggle competition, and I chose to work with the Titanic dataset after spending some time deciding which competition I would like to participate in.

LOAD AND CHECK DATA

# Load packages
library('ggplot2') # visualization
library('ggthemes') # visualization
library('scales') # visualization
library('dplyr') # data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library('mice') # imputation
## Loading required package: Rcpp
## mice 2.25 2015-11-09
library('randomForest') # classification algorithm
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(xtable)
library(caTools)
library(rpart.plot)
## Loading required package: rpart
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(RColorBrewer)

Now that our packages are loaded, let's take a peek at our data.

train <- read.csv('../Titantic-Machine Learning from Disaster/train.csv', stringsAsFactors = F)
test  <- read.csv('../Titantic-Machine Learning from Disaster/test.csv', stringsAsFactors = F)

Let’s take a look at our **Train** data.

# Print a few rows of the training dataset to the console
m <- head(train)
knitr::kable(m, digits = 2, caption = "Training Data: First Few Rows of Data")
Training Data: First Few Rows of Data

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.92 |  | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.10 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 |  | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.46 |  | Q |
Let’s take a look at our **Test** data.
# Print a few rows of the testing dataset to the console
n <- head(test)
knitr::kable(n, digits = 2, caption = "Testing Data: First Few Rows of Data")
Testing Data: First Few Rows of Data

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|-------------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.83 |  | Q |
| 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.00 |  | S |
| 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.69 |  | Q |
| 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.66 |  | S |
| 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.29 |  | S |
| 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.22 |  | S |

Now we are going to view the structure of the Train table.

str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Now we are going to view the structure of the Test table.

str(test)
## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...

This gives us a sense of our variables, their class types, and the first few observations of each. The training set holds 891 observations of 12 variables and the test set holds 418 observations of 11 variables (it lacks the Survived column), so we are working with 1,309 passengers in total; a quick sketch of stacking the two sets follows the table below. To make the variables more readable:

| Variable Name | Description |
|---------------|-------------|
| Survived | Survived (1) or died (0) |
| Pclass | Passenger’s class |
| Name | Passenger’s name |
| Sex | Passenger’s sex |
| Age | Passenger’s age |
| SibSp | Number of siblings/spouses aboard |
| Parch | Number of parents/children aboard |
| Ticket | Ticket number |
| Fare | Fare |
| Cabin | Cabin |
| Embarked | Port of embarkation |
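
As an aside, the 1,309 figure comes from stacking the train and test sets; a minimal sketch using dplyr's bind_rows (not needed for the rest of this analysis):

# Stack train and test; bind_rows fills test's missing Survived column with NA
full <- dplyr::bind_rows(train, test)
dim(full)  # 1309 rows, 12 columns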

DATA VISUALIZATION

library(ggplot2)
library(lattice)
library(caret)
data <- read.csv("../Titantic-Machine Learning from Disaster/train.csv",header = T, stringsAsFactors = T)
# Convert Survived to a factor first, then label its levels
data$Survived <- as.factor(data$Survived)
levels(data$Survived) <- c("0", "1")
par(mfrow=c(4,2))
barplot(table(data$Survived),
        names.arg = c("Perished", "Survived"),
        main="Survived (passenger fate)", col="black")
barplot(table(data$Pclass), 
        names.arg = c("first", "second", "third"),
        main="Pclass (passenger traveling class)", col="firebrick")
barplot(table(data$Sex), main="Sex (gender)", col="darkviolet")
hist(data$Age, main="Age", xlab = NULL, col="brown")
barplot(table(data$SibSp), main="SibSp (siblings + spouse aboard)", 
        col="darkblue")
barplot(table(data$Parch), main="Parch (parents + kids aboard)", 
        col="gray50")
hist(data$Fare, main="Fare (fee paid for ticket[s])", xlab = NULL, 
     col="darkgreen")
barplot(table(data$Embarked), 
        names.arg = c("", "C", "Q", "S"),  # the unlabeled first bar is the two passengers with a blank Embarked value
        main="Embarked (port of embarkation)", col="sienna")

Converting variables to factors

From the results above, it is clear that the Survived variable is loaded as an integer by default.

Therefore I’ll convert it to a categorical variable (factor).

I’ll also convert other variables that are categorical in nature but were not loaded as factors (a small optional addition for Sex and Embarked follows the code below).

train$Survived <- as.factor(train$Survived)
train$Pclass <- as.factor(train$Pclass)
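
Sex and Embarked are categorical too; converting them as well is a small optional addition (rpart would coerce character predictors to factors on its own):

train$Sex <- as.factor(train$Sex)
train$Embarked <- as.factor(train$Embarked)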

Splitting training data

As previously noted, I’ll split the training data into a set that will be used to train the model and a set that will be used to evaluate it.

The new training set will contain 75% of randomly selected observations from the original training data.

The new testing set will contain the remaining 25% of observations, i.e. those excluded from the new training set.

set.seed(1446)
split <- sample.split(train$Survived, SplitRatio = 0.75)
sub_training_data <- subset(train, split == TRUE)
sub_testing_data <- subset(train, split == FALSE)
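
A quick sanity check that sample.split kept the 75/25 ratio and a similar class balance in both subsets:

# Share of observations routed to each subset
prop.table(table(split))
# Survival rate within each subset (should be nearly identical, since sample.split stratifies on the label)
prop.table(table(sub_training_data$Survived))
prop.table(table(sub_testing_data$Survived))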

Visualizing Data

Let’s explore the basic patterns in the dataset using GGally (a sketch of the call is shown below). You can observe that:

- 1st-class passengers are more likely to survive
- Female passengers are more likely to survive
- Children are more likely to survive
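
A minimal ggpairs sketch of those relationships, restricted to a few key columns (my choice of columns; the NAs in Age will trigger warnings but the plot still renders):

# Pairwise view of survival against class, sex, and age
ggpairs(sub_training_data[, c("Survived", "Pclass", "Sex", "Age")],
        mapping = aes(colour = Survived))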

Part 4: Build the decision tree model

formula <- Survived ~ Sex

# Build the decision tree
dtree <- rpart(formula, data=sub_training_data, method="class")

Performance on the Training Data

dtree_tr_predict <- predict(dtree, newdata=sub_training_data, type="class")
dtree_tr_predict.t <- table(sub_training_data$Survived, dtree_tr_predict)

# Model Accuracy
dtree_tr_accuracy <- (dtree_tr_predict.t[1, 1] + dtree_tr_predict.t[2, 2]) / sum(dtree_tr_predict.t)

# Print accuracy in Prediction
cat("Model Accuracy on Sub sample on training data: ", dtree_tr_accuracy)
## Model Accuracy on Sub sample on training data:  0.7919162

Performance on the Testing Data

dtree_te_predict <- predict(dtree, newdata=sub_testing_data, type="class")
dtree_te_predict.t <- table(sub_testing_data$Survived, dtree_te_predict)

# Model Accuracy
dtree_testing_accuracy <- (dtree_te_predict.t[1, 1] + dtree_te_predict.t[2, 2]) / sum(dtree_te_predict.t)

# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy)
## Model Accuracy in Prediction:  0.7713004
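
Beyond raw accuracy, caret (loaded in the visualization section) provides confusionMatrix with per-class metrics; a brief sketch on the held-out split, treating "1" (survived) as the positive class:

# Confusion matrix plus sensitivity/specificity for the simple tree
cm <- confusionMatrix(dtree_te_predict, sub_testing_data$Survived, positive = "1")
cm$table
cm$byClass[c("Sensitivity", "Specificity")]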

Display the decision tree

fancyRpartPlot(dtree)

Part 5: Revised Model and Analysis of Results

For the revised model, I’ll consider “Age” and “Passenger Class” as additional variables.

Note that the “Passenger Class” variable is a proxy for socio-economic status: 1 = 1st, 2 = 2nd, and 3 = 3rd class. Also note that Age contains missing values; rpart handles these by default via surrogate splits, so the tree can be built without imputing Age first.

formula2 <- Survived ~ Sex + Pclass + Age

# Build the decision tree
dtree2 <- rpart(formula2, data = sub_training_data, method="class")

Performance on the Training Data

dtree_tr_predict2 <- predict(dtree2, newdata = sub_training_data, type="class")
dtree_tr_predict.t2 <- table(sub_training_data$Survived, dtree_tr_predict2)

# Model Accuracy
dtree_tr_accuracy2 <- (dtree_tr_predict.t2[1, 1] + dtree_tr_predict.t2[2, 2]) / sum(dtree_tr_predict.t2)

# Print accuracy in Prediction
cat("Model Accuracy on Sub sample on training data: ", dtree_tr_accuracy2)
## Model Accuracy on Sub sample on training data:  0.8128743

Performance on the Testing Data

dtree_te_predict2 <- predict(dtree2, newdata = sub_testing_data, type="class")
dtree_te_predict.t2 <- table(sub_testing_data$Survived, dtree_te_predict2)

# Model Accuracy
dtree_testing_accuracy2 <- (dtree_te_predict.t2[1, 1] + dtree_te_predict.t2[2, 2]) / sum(dtree_te_predict.t2)

# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy2)
## Model Accuracy in Prediction:  0.7847534
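
Putting the two trees side by side on the held-out split (numbers from the runs above), adding Pclass and Age lifts accuracy from roughly 0.771 to 0.785:

# Held-out accuracy of both trees
cat("Sex only:           ", dtree_testing_accuracy, "\n")
cat("Sex + Pclass + Age: ", dtree_testing_accuracy2, "\n")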

Display the Decision Tree

fancyRpartPlot(dtree2)
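
Finally, to score the actual Kaggle test set with the revised tree, here is a minimal sketch; the output file name is my own choice, and test’s Pclass must be converted to a factor to match the training data:

# Match the factor type used when the tree was trained
test$Pclass <- as.factor(test$Pclass)
# rpart's surrogate splits handle the missing Ages in test at prediction time
prediction <- predict(dtree2, newdata = test, type = "class")
submission <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
write.csv(submission, file = "titanic_dtree_submission.csv", row.names = FALSE)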