Titanic Tragedy
I intend to use the ‘Titanic practice dataset’ from the Kaggle competition (Titanic: Machine Learning from Disaster) to predict which types of passengers were most likely to survive the disaster. In this project proposal, I provide a quick overview of the problem, the dataset, and my current thinking on how I will approach the problem.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
A quick review of the training and test datasets shows 12 variables and roughly 1,300 rows combined; the training set alone has 891 rows. Each row provides details about an individual passenger, such as age, sex, passenger class, and family details. The dependent (target) variable is the ‘Survived’ field, a binary (1/0) outcome. The full name and description of each variable are listed below. A detailed explanation of the data can be found in the Kaggle data dictionary.
train <- read.csv('train.csv', stringsAsFactors = FALSE)
# str() prints the structure itself and returns NULL, so it is called directly
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
names <- c("Survived","Pclass","Name","Sex","Age","SibSp","Parch","Ticket","Fare","Cabin","Embarked")
description <- c("Survived (1) or died (0)","Passenger’s class","Passenger’s name","Passenger’s sex","Passenger’s age","Number of siblings & spouses aboard","Number of parents & children aboard",
"Ticket number","Fare","Cabin","Port of embarkation")
NamesDesc <- data.frame(names,description)
NamesDesc
## names description
## 1 Survived Survived (1) or died (0)
## 2 Pclass Passengers class
## 3 Name Passengers name
## 4 Sex Passengers sex
## 5 Age Passengers age
## 6 SibSp Number of siblings & spouses aboard
## 7 Parch Number of parents & children aboard
## 8 Ticket Ticket number
## 9 Fare Fare
## 10 Cabin Cabin
## 11 Embarked Port of embarkation
My proposed approach to solving this problem is as follows.
The first step is to parse the training dataset and build a clearer picture of the data. This may include:
- identifying and mapping passenger titles from the passenger name
- breaking down the Cabin values to identify cabin location
- imputing missing values (see below)
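The first two parsing steps could look roughly like the sketch below. It uses the four Name and Cabin values visible in the str() output rather than the full file, and the column names Title and Deck are illustrative choices, not part of the dataset. It also assumes that the first letter of a cabin code identifies the deck.

```r
# Toy rows copied from the str(train) output; the real input would be train.csv.
toy <- data.frame(
  Name  = c("Braund, Mr. Owen Harris",
            "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
            "Heikkinen, Miss. Laina",
            "Futrelle, Mrs. Jacques Heath (Lily May Peel)"),
  Cabin = c("", "C85", "", "C123"),
  stringsAsFactors = FALSE
)

# Title: the text between the comma after the surname and the first period.
toy$Title <- trimws(sub("^[^,]*, *([^.]*)\\..*$", "\\1", toy$Name))

# Deck (assumption: first letter of the cabin code); NA when no cabin is recorded.
toy$Deck <- ifelse(toy$Cabin == "", NA_character_, substr(toy$Cabin, 1, 1))

print(toy[, c("Title", "Deck")])
```

On the full dataset, rare titles would probably need to be grouped into a few broader categories before modelling.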
For variables with large amounts of missing data, I will impute the missing values using both simple statistical approaches (mean, median, mode, etc.) and more sophisticated imputation packages such as MICE or Amelia.
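A minimal sketch of the simple statistical option, using base R only. The Age values are the first ten shown in the str() output; the Embarked values are hypothetical, with an empty string standing for a missing entry. Package-based imputation (MICE or Amelia) would replace these steps with model-based estimates and is not shown here.

```r
# First ten Age values from str(train); one of them is NA.
age <- c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14)

# Hypothetical Embarked values; "" marks a missing entry.
embarked <- c("S", "C", "S", "", "Q", "S")

# Median imputation for a numeric variable.
age_imputed <- ifelse(is.na(age), median(age, na.rm = TRUE), age)

# Mode imputation for a categorical variable: the most frequent non-missing value.
mode_value <- names(which.max(table(embarked[embarked != ""])))
embarked_imputed <- ifelse(embarked == "", mode_value, embarked)
```

Median and mode imputation are quick but shrink the variance of the imputed variable, which is one reason to compare them against a model-based method.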
Based on the provided details and a review of the history of the incident, I will attempt to identify the most intuitive variables that affect survivability.
From an initial review of the dataset and the historical details of the incident, some of the key variables affecting survival can be identified intuitively, such as passenger class (upper is better), sex (female over male), and age (children vs. adults).
Other variables are not as intuitive and will require analysis. For example
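These intuitions can be checked by comparing survival rates across groups. The sketch below uses the first ten rows of Survived, Pclass, and Age from the str() output. The Sex values beyond the first four rows are filled in for illustration, and the age cutoff of 18 for "child" is an assumption.

```r
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0, 0, 0, 1, 1),
  Pclass   = c(3, 1, 3, 1, 3, 3, 1, 3, 3, 2),
  # First four values appear in str(train); the rest are illustrative.
  Sex      = c("male", "female", "female", "female", "male",
               "male", "male", "male", "female", "female"),
  Age      = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14),
  stringsAsFactors = FALSE
)

# Mean of a 0/1 outcome within a group is that group's survival rate.
rate_by_sex   <- tapply(toy$Survived, toy$Sex, mean)
rate_by_class <- tapply(toy$Survived, toy$Pclass, mean)

# Assumed cutoff: under 18 counts as a child; rows with unknown Age are dropped.
toy$IsChild   <- toy$Age < 18
rate_by_child <- tapply(toy$Survived, toy$IsChild, mean)
```

Ten rows are far too few to draw conclusions; the same calls on the full training set would give the group rates of interest.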
Subject the selected variables to correlation analysis against the dependent variable (Survived), as well as any other yet-to-be-determined analysis, to identify the best choices. I will plot the relationships using various types of graphs to illustrate them visually.
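One possible form of the correlation step is sketched below, using the first ten rows of the numeric columns from the str() output. Pearson correlation against a 0/1 outcome is the point-biserial correlation. Ranking by absolute value is an illustrative choice, not a fixed part of the plan.

```r
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0, 0, 0, 1, 1),
  Pclass   = c(3, 1, 3, 1, 3, 3, 1, 3, 3, 2),
  SibSp    = c(1, 1, 0, 1, 0, 0, 0, 3, 0, 1),
  Parch    = c(0, 0, 0, 0, 0, 0, 0, 1, 2, 0)
)

predictors <- c("Pclass", "SibSp", "Parch")

# Correlation of each candidate predictor with the Survived outcome.
cors <- sapply(toy[predictors], function(x) cor(x, toy$Survived))

# Rank predictors by strength of association, ignoring sign.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(ranked)
```

Categorical variables such as Sex or Embarked would need a different measure, for example a cross-tabulation with a chi-squared test.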
Feed the initial selection of key survivability variables into a classification algorithm (such as Random Forest) and test the accuracy of its predictions against the actual values of the dependent variable. I expect that multiple adjustments and iterations will be needed before settling on an optimal version to run against the test dataset. The best outcome is to identify the top-ranking variables that impact survivability.
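The fit-and-evaluate loop could be sketched as follows. Two things here are assumptions: the data is simulated (the survival probabilities are invented for illustration), and a base-R logistic regression stands in for Random Forest so the sketch runs without extra packages. A Random Forest fit would take the place of the glm() call.

```r
set.seed(42)

# Simulated passengers; the probabilities below are hypothetical.
n <- 400
sim <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = sample(1:3, n, replace = TRUE)
)
p <- ifelse(sim$Sex == "female", 0.75, 0.20) - 0.05 * (sim$Pclass - 2)
sim$Survived <- rbinom(n, 1, p)

# Hold out 25% of the rows to measure accuracy on data the model has not seen.
idx   <- sample(n, size = 0.75 * n)
train_part <- sim[idx, ]
valid_part <- sim[-idx, ]

# Stand-in classifier: logistic regression on the selected variables.
fit  <- glm(Survived ~ Sex + Pclass, data = train_part, family = binomial)
prob <- predict(fit, newdata = valid_part, type = "response")
pred <- as.integer(prob > 0.5)

# Accuracy: share of held-out passengers whose outcome was predicted correctly.
accuracy <- mean(pred == valid_part$Survived)

# Coefficient table gives a first view of which variables matter.
print(coef(summary(fit)))
```

Each iteration of the proposal would change the formula or the model, refit, and compare the held-out accuracy before anything is run against the test dataset.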