Project Proposal - Predicting survival on the Titanic

Titanic Tragedy

1. Introduction

I intend to use the ‘Titanic practice dataset’ from the Kaggle competition (Titanic: Machine Learning from Disaster) to predict which types of passengers were most likely to survive the disaster. In this project proposal, I provide a quick overview of the problem, the dataset, and my current thinking on how I will approach it.

1.1 Competition Description (from Kaggle)

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

2. The Dataset

A quick review of the training and test datasets shows 12 variables and roughly 1,300 rows in total (the test set omits the ‘Survived’ field). Each row describes an individual passenger: age, sex, passenger class, family details, etc. The dependent variable is the ‘Survived’ field, a binary (1/0) outcome. The name and description of each variable are listed below. A detailed explanation of the data can be found in the Kaggle data dictionary.

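# Load the training data, keeping text columns as character vectors
train <- read.csv('train.csv', stringsAsFactors = FALSE)
# str() prints the structure itself and returns NULL, so it is not wrapped in kable()
str(train)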

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

2.1 Variable Name & Description

names <- c("Survived","Pclass","Name","Sex","Age","SibSp","Parch","Ticket","Fare","Cabin","Embarked")

description <- c("Survived (1) or died (0)","Passenger’s class","Passenger’s name","Passenger’s sex","Passenger’s age","Number of siblings & spouses aboard","Number of parents & children aboard",
"Ticket number","Fare","Cabin","Port of embarkation")

NamesDesc <- data.frame(names,description)
NamesDesc

##       names                         description
## 1  Survived            Survived (1) or died (0)
## 2    Pclass                   Passenger’s class
## 3      Name                    Passenger’s name
## 4       Sex                     Passenger’s sex
## 5       Age                     Passenger’s age
## 6     SibSp Number of siblings & spouses aboard
## 7     Parch Number of parents & children aboard
## 8    Ticket                       Ticket number
## 9      Fare                                Fare
## 10    Cabin                               Cabin
## 11 Embarked                 Port of embarkation

3. Approach

My proposed approach to solving this problem is as follows:

1. Wrangle Data and Cleanup

The first step is to parse the training dataset and build a clearer picture of the data. This may include:

- Identifying and mapping passenger titles from the passenger name
- Breaking down the Cabin values to identify cabin location
- Imputing missing values (see below)
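As a minimal sketch of the title and cabin parsing described here, the snippet below works on the first four rows shown in the str() output above. The column names Title and Deck are my own placeholders, and treating the first character of Cabin as the deck is an assumption that still needs to be checked against the full dataset.

```r
# First four passengers, copied from the str(train) output
toy <- data.frame(
  Name  = c("Braund, Mr. Owen Harris",
            "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
            "Heikkinen, Miss. Laina",
            "Futrelle, Mrs. Jacques Heath (Lily May Peel)"),
  Cabin = c("", "C85", "", "C123"),
  stringsAsFactors = FALSE
)

# Title: the text between the comma and the first period in Name
toy$Title <- trimws(sub("^[^,]*, *([^.]*)\\..*$", "\\1", toy$Name))

# Deck (assumed): first character of Cabin, NA when Cabin is blank
toy$Deck <- ifelse(toy$Cabin == "", NA_character_, substr(toy$Cabin, 1, 1))

print(toy[, c("Title", "Deck")])
```

On the full data, rare titles would probably need to be grouped into a handful of categories before modelling.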

1.1. Impute missing values

For variables with a large share of missing values, I will impute the missing data using both simple statistical approaches (mean, median, mode, etc.) and more complex imputation packages such as MICE or Amelia.
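The simple statistical approaches could look roughly like the sketch below, which uses only base R. The Age values are the first ten shown in the str() output; the Embarked vector has one NA appended purely for illustration, and stat_mode is a hypothetical helper of my own, not a built-in function. Package-based imputation (MICE, Amelia) is not shown here.

```r
# First ten Age values from the str(train) output (one is missing)
age <- c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14)

# Median imputation: replace NA with the median of the observed values
age_imputed <- ifelse(is.na(age), median(age, na.rm = TRUE), age)

# Hypothetical helper: most frequent value of a vector (table() drops NA)
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# First four Embarked values from str(train); the trailing NA is illustrative
embarked <- c("S", "C", "S", "S", NA)
embarked[is.na(embarked)] <- stat_mode(embarked)

print(age_imputed)
print(embarked)
```

Median imputation ignores relationships between variables, which is the main reason to compare it against a model-based package later.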

2. Identify key variables

Based on the provided details and the history of the incident, I will attempt to identify the most intuitive variables that affect survivability.

From an initial review of the dataset and the historical record, some key variables can be identified intuitively: passenger class (upper is better), sex (female over male), and age (children over adults).

Other variables are not as intuitive and will require analysis. For example:

By family size: SibSp + Parch
By location, i.e., Cabin values seem to contain the deck details
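A minimal sketch of the family-size idea, using the first ten rows from the str() output above. FamilySize is a placeholder name of my own, defined as SibSp + Parch as stated, so it excludes the passenger. Ten rows are far too few to draw any conclusion; the point is only to show the calculation.

```r
# First ten rows of the relevant columns, copied from the str(train) output
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0, 0, 0, 1, 1),
  SibSp    = c(1, 1, 0, 1, 0, 0, 0, 3, 0, 1),
  Parch    = c(0, 0, 0, 0, 0, 0, 0, 1, 2, 0)
)

# Derived feature: number of relatives aboard
toy$FamilySize <- toy$SibSp + toy$Parch

# Share of survivors within each family size
rate_by_size <- tapply(toy$Survived, toy$FamilySize, mean)
print(rate_by_size)
```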

3. Confirm with Data Analysis and identify other interdependent variables

Subject the selected variables to correlation analysis against the dependent variable (Survived), as well as any other yet-to-be-determined analysis, to determine the best choices. I will plot the interdependence using various types of graphs to present the results visually.
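One possible form of the correlation step is sketched below with base R's cor(), again on the first ten rows from the str() output. Because Survived is binary, each Pearson coefficient here is a point-biserial correlation. The choice of columns and of pairwise deletion for the missing Age value are assumptions, not settled decisions.

```r
# First ten rows of the numeric columns, copied from the str(train) output
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0, 0, 0, 1, 1),
  Pclass   = c(3, 1, 3, 1, 3, 3, 1, 3, 3, 2),
  Age      = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14),
  SibSp    = c(1, 1, 0, 1, 0, 0, 0, 3, 0, 1),
  Parch    = c(0, 0, 0, 0, 0, 0, 0, 1, 2, 0)
)

# Correlation of each candidate predictor with the dependent variable;
# pairwise deletion skips the missing Age value instead of failing
predictors <- c("Pclass", "Age", "SibSp", "Parch")
cors <- cor(toy[, predictors], toy$Survived, use = "pairwise.complete.obs")
print(round(cors, 2))

# Survival rate per passenger class, a natural input for a bar chart
rate_by_class <- tapply(toy$Survived, toy$Pclass, mean)
print(rate_by_class)
```

Passing rate_by_class to barplot() would give a first visual check of the class effect.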

4. Develop Model and compare against test dataset

Submit the initial selection of key variables to a classification algorithm (such as Random Forest) and test accuracy against the dependent variable. I expect that multiple adjustments and iterations will be needed before settling on an optimal version to run against the test dataset. The best outcome is to identify the top-ranking variables that impact survivability.
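The fit-and-score loop could be sketched as below. To keep the example free of extra packages, it uses logistic regression (base R glm) as a stand-in for the classifier, not the Random Forest named above. The data are the first ten rows from the str() output, and the 0.5 threshold and the choice of predictors are assumptions for illustration.

```r
# First ten rows of the relevant columns, copied from the str(train) output
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0, 0, 0, 1, 1),
  Pclass   = c(3, 1, 3, 1, 3, 3, 1, 3, 3, 2),
  SibSp    = c(1, 1, 0, 1, 0, 0, 0, 3, 0, 1),
  Parch    = c(0, 0, 0, 0, 0, 0, 0, 1, 2, 0)
)
toy$FamilySize <- toy$SibSp + toy$Parch

# Stand-in classifier: logistic regression instead of a random forest
fit <- glm(Survived ~ Pclass + FamilySize, data = toy, family = binomial)

# Predicted survival probability, thresholded at 0.5 (assumed cut-off)
pred <- as.integer(predict(fit, type = "response") > 0.5)

# Accuracy against the dependent variable, here on the training rows only
accuracy <- mean(pred == toy$Survived)
print(accuracy)
```

With the randomForest package installed, the glm() call could be swapped for randomForest::randomForest(factor(Survived) ~ Pclass + FamilySize, data = toy, importance = TRUE), and randomForest::importance() would then give the variable ranking. In either case, accuracy should be measured on held-out rows, not on the rows used for fitting.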

References

Kaggle Titanic Dataset