The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
This is my first Kaggle competition, and I chose to work with the Titanic dataset after spending some time deciding which competition to participate in.
# Load packages
library('ggplot2') # visualization
library('ggthemes') # visualization
library('scales') # visualization
library('dplyr') # data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library('mice') # imputation
## Loading required package: Rcpp
## mice 2.25 2015-11-09
library('randomForest') # classification algorithm
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(xtable)
library(caTools)
library(rpart.plot)
## Loading required package: rpart
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(RColorBrewer)
Now that our packages are loaded, let’s take a peek at our data.
train <- read.csv('../Titantic-Machine Learning from Disaster/train.csv', stringsAsFactors = F)
test <- read.csv('../Titantic-Machine Learning from Disaster/test.csv', stringsAsFactors = F)
Let’s take a look at our Train data.
# Print the first few rows of the training dataset to the console
m <- head(train)
knitr::kable(m, digits = 2,align=c("l", "c", "c"), caption = "Training Data: First Few Rows of Data")
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.92 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.10 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.46 | | Q |

Let’s take a look at our **Test** data.
# Print the first few rows of the testing dataset to the console
n <- head(test)
knitr::kable(n, digits = 2, caption = "Testing Data: First Few Rows of Data")
| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.83 | | Q |
| 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.00 | | S |
| 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.69 | | Q |
| 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.66 | | S |
| 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.29 | | S |
| 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.22 | | S |
Now we are going to view the data structures in the Train table.
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Now we are going to view the data structures in the Test table.
str(test)
## 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr "" "" "" "" ...
## $ Embarked : chr "Q" "S" "Q" "S" ...
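This gives us a sense of our variables, their class types, and the first few observations of each. Across both files we’re working with 1309 observations (891 training and 418 test). The training set has 12 variables; the test set has the same ones except Survived. To make it more readable: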
| Variable Name | Description |
|---|---|
| Survived | Survived (1) or died (0) |
| Pclass | Passenger’s class |
| Name | Passenger’s name |
| Sex | Passenger’s sex |
| Age | Passenger’s age |
| SibSp | Number of siblings/spouses aboard |
| Parch | Number of parents/children aboard |
| Ticket | Ticket number |
| Fare | Fare |
| Cabin | Cabin |
| Embarked | Port of embarkation |
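
The `str()` output above already hints at missing data: `Age` contains `NA` and `Cabin` contains empty strings. A minimal sketch of how one might count both kinds of gaps per column follows. `count_missing` is a hypothetical helper, not part of the original analysis, and the small `demo` data frame simply re-types three columns of the six training rows printed earlier so that the block runs without `train.csv`.

```r
# Hypothetical helper: count NA values and empty strings in every column.
count_missing <- function(df) {
  vapply(df, function(col) {
    # Empty strings only make sense for character columns
    empty <- is.character(col) & !is.na(col) & col == ""
    sum(is.na(col) | empty)
  }, integer(1))
}

# Demo data: Age, Cabin and Embarked from the six rows shown by head(train)
demo <- data.frame(
  Age      = c(22, 38, 26, 35, 35, NA),
  Cabin    = c("", "C85", "", "C123", "", ""),
  Embarked = c("S", "C", "S", "S", "S", "Q"),
  stringsAsFactors = FALSE
)

missing_counts <- count_missing(demo)
print(missing_counts)
```

On the real data the same call would be `count_missing(train)`, assuming `train` was read with `stringsAsFactors = F` as above.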
library(ggplot2)
library(lattice)
library(caret)
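# Re-read the training data with strings as factors for the bar plots
data <- read.csv("../Titantic-Machine Learning from Disaster/train.csv", header = TRUE, stringsAsFactors = TRUE)
# Survived is read as an integer, so convert it to a factor with levels "0" and "1"
data$Survived <- factor(data$Survived, levels = c(0, 1))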
par(mfrow=c(4,2))
barplot(table(data$Survived),
names.arg = c("Perished", "Survived"),
main="Survived (passenger fate)", col="black")
barplot(table(data$Pclass),
names.arg = c("first", "second", "third"),
main="Pclass (passenger traveling class)", col="firebrick")
barplot(table(data$Sex), main="Sex (gender)", col="darkviolet")
hist(data$Age, main="Age", xlab = NULL, col="brown")
barplot(table(data$SibSp), main="SibSp (siblings + spouse aboard)",
col="darkblue")
barplot(table(data$Parch), main="Parch (parents + kids aboard)",
col="gray50")
hist(data$Fare, main="Fare (fee paid for ticket[s])", xlab = NULL,
col="darkgreen")
barplot(table(data$Embarked),
names.arg = c("","C", "Q", "S"),
main="Embarked (port of embarkation)", col="sienna")
Converting variables to factors
From the above results, it is clear that the ‘Survived’ variable is loaded as an integer by default.
Therefore I’ll convert it to a categorical variable (factor).
I’ll also convert the other variables that are categorical in nature but were not loaded as factors.
train$Survived <- as.factor(train$Survived)
train$Pclass <- as.factor(train$Pclass)
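The code above converts only `Survived` and `Pclass`. If more columns were to be treated as categorical, one way to convert several at once is sketched below. Including `Sex` and `Embarked` is an assumption made for illustration, not something the original code does, and `demo` re-types the six training rows printed earlier.

```r
# Demo data: four columns from the six rows shown by head(train)
demo <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Pclass   = c(3, 1, 3, 1, 3, 3),
  Sex      = c("male", "female", "female", "female", "male", "male"),
  Embarked = c("S", "C", "S", "S", "S", "Q"),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical (Sex and Embarked are an assumed extension)
categorical_cols <- c("Survived", "Pclass", "Sex", "Embarked")

# Convert all listed columns in one step
demo[categorical_cols] <- lapply(demo[categorical_cols], as.factor)
str(demo)
```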
Splitting training data
As previously noted, I’ll split the training data into a set that will be used to train the model and a set that will be used to evaluate the model.
The new training subset, created by the code below, will contain 75% of the observations, selected at random.
The new testing subset will contain the remaining 25% of the observations in the original training set, i.e. those excluded from the new training subset.
set.seed(1446)
split <- sample.split(train$Survived, SplitRatio = 0.75)
sub_training_data <- subset(train, split == TRUE)
sub_testing_data <- subset(train, split == FALSE)
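`sample.split()` from caTools splits on the outcome vector so that both subsets keep roughly the same share of each class. The idea can be sketched in base R as follows. `stratified_split` is a hypothetical stand-in written for illustration, not the caTools implementation, and the label vector is made up.

```r
# Hypothetical base-R stand-in for a stratified split: within each class,
# mark a fixed share of rows as TRUE (training) and the rest as FALSE.
stratified_split <- function(labels, ratio) {
  keep <- logical(length(labels))
  for (cls in unique(labels)) {
    idx <- which(labels == cls)
    n_train <- round(ratio * length(idx))
    keep[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  keep
}

set.seed(1446)
labels <- rep(c(0, 1), times = c(60, 40))  # made-up outcome vector
split_demo <- stratified_split(labels, ratio = 0.75)

# Share of rows sent to the training subset, overall and per class
overall_share <- mean(split_demo)
share_by_class <- tapply(split_demo, labels, mean)
print(overall_share)
print(share_by_class)
```

Because the sampling happens inside each class, the 75/25 proportion holds for survivors and non-survivors separately, not just overall.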
Visualizing Data
Let’s explore the basic patterns in the dataset using GGally. You can observe that:
- 1st class passengers are more likely to survive
- Female passengers are more likely to survive
- Children are more likely to survive
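These patterns can also be checked numerically as group-wise survival rates. The following is a minimal sketch, assuming `Survived` is coded as the numbers 0 and 1 (that is, before the factor conversion above). The `demo` rows are the six training rows printed earlier, so the rates here are only illustrative of the mechanics and say nothing reliable about the full data.

```r
# Demo data: three columns from the six rows shown by head(train)
demo <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Pclass   = c(3, 1, 3, 1, 3, 3),
  Sex      = c("male", "female", "female", "female", "male", "male")
)

# The mean of a 0/1 outcome within a group is that group's survival rate
rate_by_sex   <- tapply(demo$Survived, demo$Sex, mean)
rate_by_class <- tapply(demo$Survived, demo$Pclass, mean)
print(rate_by_sex)
print(rate_by_class)
```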
Part 4: Build the decision tree model
formula <- Survived ~ Sex
# Build the decision tree
dtree <- rpart(formula, data=sub_training_data, method="class")
Performance on the Training Data
dtree_tr_predict <- predict(dtree, newdata=sub_training_data, type="class")
dtree_tr_predict.t <- table(sub_training_data$Survived, dtree_tr_predict)
# Model Accuracy
dtree_tr_accuracy <- (dtree_tr_predict.t[1, 1] + dtree_tr_predict.t[2, 2]) / sum(dtree_tr_predict.t)
# Print accuracy in Prediction
cat("Model Accuracy on Sub sample on training data: ", dtree_tr_accuracy)
## Model Accuracy on Sub sample on training data: 0.7919162
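The accuracy above is the share of observations that fall on the diagonal of the confusion table. A compact way to write that is sketched below. `accuracy_from_table` is a hypothetical helper, and the two label vectors are made up. Giving both factors the same levels keeps the table square even if one class is never predicted, which is what indexing with `[2, 2]` relies on.

```r
# Hypothetical helper: correct predictions (the diagonal) over all predictions
accuracy_from_table <- function(confusion) {
  sum(diag(confusion)) / sum(confusion)
}

# Made-up actual and predicted labels sharing the same factor levels
class_levels <- c("0", "1")
actual    <- factor(c("0", "0", "1", "1", "1", "0"), levels = class_levels)
predicted <- factor(c("0", "1", "1", "1", "0", "0"), levels = class_levels)

confusion <- table(actual, predicted)
accuracy_demo <- accuracy_from_table(confusion)
print(confusion)
print(accuracy_demo)
```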
Performance on the Testing Data
dtree_te_predict <- predict(dtree, newdata=sub_testing_data, type="class")
dtree_te_predict.t <- table(sub_testing_data$Survived, dtree_te_predict)
# Model Accuracy
dtree_testing_accuracy <- (dtree_te_predict.t[1, 1] + dtree_te_predict.t[2, 2]) / sum(dtree_te_predict.t)
# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy)
## Model Accuracy in Prediction: 0.7713004
Display the decision tree
fancyRpartPlot(dtree)
Part 5: Revised Model and Analysis of Results
For the revised model, I’ll consider “Age” and “Passenger Class” as additional variables.
Note that the “Passenger Class” variable is a proxy for socio-economic status, with 1 = 1st, 2 = 2nd, and 3 = 3rd class.
formula2 <- Survived ~ Sex + Pclass + Age
# Build the decision tree
dtree2 <- rpart(formula2, data = sub_training_data, method="class")
Performance on the Training Data
dtree_tr_predict2 <- predict(dtree2, newdata = sub_training_data, type="class")
dtree_tr_predict.t2 <- table(sub_training_data$Survived, dtree_tr_predict2)
# Model Accuracy
dtree_tr_accuracy2 <- (dtree_tr_predict.t2[1, 1] + dtree_tr_predict.t2[2, 2]) / sum(dtree_tr_predict.t2)
# Print accuracy in Prediction
cat("Model Accuracy on Sub sample on training data: ", dtree_tr_accuracy2)
## Model Accuracy on Sub sample on training data: 0.8128743
Performance on the Testing Data
dtree_te_predict2 <- predict(dtree2, newdata = sub_testing_data, type="class")
dtree_te_predict.t2 <- table(sub_testing_data$Survived, dtree_te_predict2)
# Model Accuracy
dtree_testing_accuracy2 <- (dtree_te_predict.t2[1, 1] + dtree_te_predict.t2[2, 2]) / sum(dtree_te_predict.t2)
# Print accuracy
cat("Model Accuracy in Prediction: ", dtree_testing_accuracy2)
## Model Accuracy in Prediction: 0.7847534
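To put the two models side by side, the four accuracies printed above can be collected in one small table. The sketch below only re-types the reported numbers. The hold-out size of 223 rows is an inference (about 25% of 891, and consistent with the printed accuracies), not something the output states.

```r
# Accuracies copied from the console output above
results <- data.frame(
  model   = c("Sex", "Sex + Pclass + Age"),
  train   = c(0.7919162, 0.8128743),
  holdout = c(0.7713004, 0.7847534)
)

# Gap between training and hold-out accuracy; a larger gap can hint at overfitting
results$gap <- results$train - results$holdout

# Assumed hold-out size, used to translate accuracy into passenger counts
n_holdout <- 223
results$holdout_correct <- round(results$holdout * n_holdout)
print(results)
```

Under that assumption, the revised model classifies three more hold-out passengers correctly than the sex-only model, while its gap between training and hold-out accuracy grows from about 0.021 to about 0.028.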
Display the Decision Tree
fancyRpartPlot(dtree2)