Executive Report - Titanic

Caio Miyashiro
22.09.2017

An overview of the analysis of the Titanic dataset.

Introduction

The Titanic dataset is a well-known Kaggle dataset containing information about the ship's passengers, plus a final variable (Survived) indicating whether each passenger survived the tragedy.

Objectives

  • Evaluate the dataset and each feature's capacity to predict survival
  • Indicate whether more features can be extracted from the existing dataset
  • Build and evaluate models that can predict whether a passenger survived

Data Overview and procedures

names(titanicDataset)
 [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
 [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
[11] "Cabin"       "Embarked"

The analysis followed four steps:

  1. Data cleaning and preparation: missing data and transformations
  2. Exploratory data analysis: outliers and correlation analysis
  3. Modelling: simple machine learning models and stacked ensembles
  4. Evaluation, conclusions and next directions
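
Steps 1 and 2 can be illustrated with a minimal base R sketch. The data frame `toy` is a made-up sample that reuses some of the column names listed above; it is not the real data. Median imputation of Age is an assumption made for illustration, since this report does not state which imputation method was used.

```r
# Toy stand-in for titanicDataset (made-up rows, real column names).
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0),
  Pclass   = c(3, 1, 2, 3, 1, 3),
  Sex      = c("male", "female", "female", "male", "female", "male"),
  Age      = c(22, 38, NA, 35, NA, 54),
  stringsAsFactors = FALSE
)

# Step 1: count missing values per column.
missing_counts <- colSums(is.na(toy))

# Assumed strategy: replace missing ages with the median of the observed ages.
toy$Age[is.na(toy$Age)] <- median(toy$Age, na.rm = TRUE)

# Step 2: survival rate for each combination of Sex and Pclass.
survival_by_group <- aggregate(Survived ~ Sex + Pclass, data = toy, FUN = mean)

print(missing_counts)
print(survival_by_group)
```

On the real dataset the same two calls, `colSums(is.na(...))` and `aggregate(...)`, would be applied to `titanicDataset` directly.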

Findings and directions

  1. Exploratory data analysis identified Sex and Pclass as the most important features: women travelling in first or second class had higher chances of survival.
  2. The extracted features (title and family size) proved to be useful, but they introduce multicollinearity (redundancy) with the Sex variable.
  3. The models achieved ~79% accuracy on the test set. Naive Bayes was identified as the best model for predicting whether a person survived given their attributes.
  4. The complete data analysis can be found here.
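
The two extracted features in finding 2 could be derived roughly as follows. This is a sketch under assumptions, not necessarily the method used in the full analysis: it assumes that Name follows the "Surname, Title. Given names" pattern and that family size is SibSp + Parch + 1 (the passenger included). The rows in `toy` are made up for illustration.

```r
# Made-up rows that follow the "Surname, Title. Given names" pattern.
toy <- data.frame(
  Name  = c("Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Ann"),
  Sex   = c("male", "female", "female"),
  SibSp = c(1, 1, 0),
  Parch = c(0, 0, 2),
  stringsAsFactors = FALSE
)

# Title: the text between the comma and the first period.
toy$Title <- sub("^.*, *([^.]+)\\..*$", "\\1", toy$Name)

# Family size: siblings/spouses + parents/children + the passenger.
toy$FamilySize <- toy$SibSp + toy$Parch + 1

# Cross-tabulating Title against Sex exposes the redundancy:
# most titles occur with only one sex.
print(table(toy$Title, toy$Sex))
```

Because a title such as Mr or Mrs almost fully determines Sex, a model that receives both variables sees largely the same information twice, which is the multicollinearity noted above.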