Executive Report - Titanic

Caio Miyashiro
22.09.2017

An overview of the analysis of the Titanic dataset.

Introduction

The Titanic dataset is a well-known Kaggle dataset containing information about the ship's passengers, plus a final variable indicating whether each passenger survived the tragedy.

Objectives

Evaluate the dataset and each feature's capacity to predict survival
Indicate whether more features can be extracted from the existing dataset
Build and evaluate models that can predict whether a passenger survived

Data Overview and procedures

names(titanicDataset)

 [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
 [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
[11] "Cabin"       "Embarked"

Data cleaning and preparation. Missing data and transformations
Exploratory data analysis. Outliers and correlation analysis
Modelling. Simple machine learning models and stacked ensembles
Evaluation, conclusion and next directions.
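
As a rough illustration of the missing-data step listed above, the sketch below counts missing values per column and fills Age with its median. It runs on a small toy data frame (`toy`, with made-up values) rather than on titanicDataset, and median imputation is only an assumed example of a transformation, not necessarily the one used in the full analysis.

```r
# Toy rows using a few titanicDataset column names (illustrative values only)
toy <- data.frame(
  Age      = c(22, NA, 26, 35, NA),
  Fare     = c(7.25, 71.28, 7.92, 53.10, 8.05),
  Cabin    = c("", "C85", "", "C123", ""),
  Embarked = c("S", "C", "S", "S", ""),
  stringsAsFactors = FALSE
)

# Treat empty strings in character columns as missing values
char_cols <- vapply(toy, is.character, logical(1))
toy[char_cols] <- lapply(toy[char_cols], function(col) {
  col[!is.na(col) & col == ""] <- NA
  col
})

# Number of missing values per column
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Assumed imputation choice: replace missing ages with the median age
age_imputed <- toy$Age
age_imputed[is.na(age_imputed)] <- median(toy$Age, na.rm = TRUE)
print(age_imputed)
```

The same two calls (`colSums(is.na(...))` and the median fill) should apply unchanged to the real data frame, provided empty strings are converted to NA first.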

Findings and directions

Exploratory data analysis identified Sex and Pclass as the most important features. Women travelling in first or second class had higher chances of survival
The extracted features (title and family size) proved to be useful, but they introduce multicollinearity (redundancy) with the Sex variable
The models achieved ~79% accuracy on the test set. Naive Bayes was identified as the best model for predicting whether a person had a chance of survival given their attributes
Complete data analysis can be found here!
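
The findings above mention survival by Sex and Pclass and two extracted features, title and family size. A minimal sketch of how such features could be derived is shown below. It uses a toy data frame (`toy`) with invented passengers; the regular expression for the title and the definition of family size as SibSp + Parch + 1 are assumptions about the approach, not the exact code of the full analysis.

```r
# Toy rows using titanicDataset column names (invented passengers, illustrative only)
toy <- data.frame(
  Name     = c("Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Ann",
               "Poe, Master. Tim", "Low, Mr. Sam", "Kay, Miss. Eve"),
  Sex      = c("male", "female", "female", "male", "male", "female"),
  Pclass   = c(3, 1, 2, 3, 1, 3),
  SibSp    = c(1, 1, 0, 1, 0, 0),
  Parch    = c(0, 0, 0, 2, 0, 0),
  Survived = c(0, 1, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Title: the text between the comma and the first period in Name
toy$Title <- sub("^.*, *([^.]+)\\..*$", "\\1", toy$Name)

# Family size: siblings/spouses + parents/children + the passenger
toy$FamilySize <- toy$SibSp + toy$Parch + 1

# Survival rate for each Sex and Pclass combination
rates <- aggregate(Survived ~ Sex + Pclass, data = toy, FUN = mean)
print(rates)

# Cross-tabulating Title against Sex shows the redundancy:
# each title occurs for only one sex
title_by_sex <- table(toy$Title, toy$Sex)
print(title_by_sex)
```

In this toy table every title maps to a single sex, which is the kind of overlap that makes Title partly redundant with the Sex variable.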