Developing Data Products: Titanic Machine Learning from Disaster

Georgios Mintzopoulos
23/09/2015

Summary

For the course Developing Data Products, we used the Titanic dataset provided by Kaggle. We use the provided train.csv file to build two models that predict who survived the disaster.

The application, written in Shiny for this project, works as follows:

  1. It loads the train.csv file with the training data as provided by Kaggle.
  2. An exploratory analysis of this dataset is performed.
  3. A second data set (training.csv), wrangled in a separate R script, is loaded.
  4. The wrangled data set is split into a training set and a test set.
    • The split is based on user input, with a default of 80% training and 20% testing.
  5. Two tree-based models are fit to predict the outcome (Survived), using the tuning parameters supplied by the user.
    • Model 1 is a simple classification tree (CART) using the rpart package.
    • Model 2 is a Random Forest.
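
The train/test split in step 4 can be sketched in a few lines of base R. The helper name split_data, its seed argument, and the toy data frame are assumptions made for illustration; the app itself may perform the split differently (for example with a helper from another package).

```r
# Hypothetical helper: randomly split a data frame into training and test sets.
split_data <- function(data, train_fraction = 0.8, seed = 1234) {
  stopifnot(train_fraction > 0, train_fraction < 1)
  set.seed(seed)  # fixed seed so the split is reproducible
  n_train <- floor(train_fraction * nrow(data))
  train_idx <- sample(seq_len(nrow(data)), size = n_train)
  list(
    train = data[train_idx, , drop = FALSE],
    test  = data[-train_idx, , drop = FALSE]
  )
}

# Toy data standing in for the wrangled data set (values are made up).
toy <- data.frame(Survived = rep(c(0, 1), 50), Fare = seq_len(100))

parts <- split_data(toy, train_fraction = 0.8)
nrow(parts$train)  # 80
nrow(parts$test)   # 20
```

In the app, train_fraction would come from the user's input, with 0.8 as the default.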

Data

The initial dataset has 891 observations and 12 variables: the outcome (Survived) plus 11 possible predictors. Its structure is:

str(train)
'data.frame':   891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
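
The factor columns and the NA entries in Cabin above depend on how the CSV is read. The sketch below shows one way to get that behaviour; the two-row excerpt is made up for illustration, and the exact read.csv arguments used by the app are an assumption. In recent R versions (4.0 and later) strings are no longer converted to factors by default, so stringsAsFactors = TRUE is set explicitly.

```r
# Made-up two-row excerpt in the same layout as a few train.csv columns.
csv_text <- "PassengerId,Survived,Sex,Cabin
1,0,male,
2,1,female,C85"

# For the real file this would be read.csv("train.csv", ...).
excerpt <- read.csv(
  text = csv_text,
  stringsAsFactors = TRUE,     # character columns become factors
  na.strings = c("NA", "")     # empty fields are read as NA
)

str(excerpt)
```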

Data Wrangling

  • Data wrangling was performed outside the Shiny app to meet the space limits of the shinyapps.io cloud.
  • The wrangling steps are imputation of missing values and standardization of variables; both are described in the Shiny app.
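
The exact wrangling steps are documented in the app rather than here, so the following is only a generic illustration of what imputing a missing value can look like. Median imputation, the column chosen, and the values are all assumptions; the actual script may use a different method.

```r
# Made-up values for illustration only.
df <- data.frame(
  Age  = c(22, 38, NA, 35),
  Fare = c(7.25, 71.28, 8.46, 53.10)
)

# Median imputation (an assumed method): replace missing ages
# with the median of the observed ages.
observed_median <- median(df$Age, na.rm = TRUE)
df$Age[is.na(df$Age)] <- observed_median

df$Age  # 22 38 35 35
```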

The wrangled dataset has 6 predictors and the following structure:

str(training)
'data.frame':   714 obs. of  7 variables:
 $ Survived: int  0 1 1 0 0 0 1 1 1 1 ...
 $ Pclass  : int  3 1 1 3 1 3 3 2 3 1 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 2 2 2 1 1 1 1 ...
 $ Age     : num  22 38 35 29 54 2 27 14 4 58 ...
 $ SibSp   : int  1 1 1 0 0 3 0 1 1 0 ...
 $ Parch   : int  0 0 0 0 0 1 2 0 1 0 ...
 $ Fare    : num  7.25 71.28 53.1 8.46 51.86 ...

Model Fit

The models we fit are:

  1. A simple CART tree. The user chooses the train/test split percentage and the cp parameter.
  2. A Random Forest. The user chooses the train/test split, the number of trees (ntree), and the mtry parameter.
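
As a rough sketch of the two fits, assuming the outcome is stored as a factor: the data below is simulated (not real Titanic records), and the parameter values are placeholders for what the user would select. The random forest part runs only if the randomForest package is installed.

```r
library(rpart)

# Simulated stand-in for the wrangled data; values are NOT real passenger records.
set.seed(42)
n <- 200
toy <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, 1, 70)),
  Fare   = round(runif(n, 5, 100), 2)
)
# Simulated outcome loosely tied to Sex so the models have a signal to learn.
p_survive <- ifelse(toy$Sex == "female", 0.75, 0.2)
toy$Survived <- factor(rbinom(n, 1, p_survive), levels = c(0, 1))

# Model 1: classification tree; cp is the complexity parameter the user tunes.
cart_fit <- rpart(Survived ~ ., data = toy, method = "class",
                  control = rpart.control(cp = 0.01))

# Model 2: random forest; ntree and mtry are the user-tuned parameters.
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf_fit <- randomForest::randomForest(Survived ~ ., data = toy,
                                       ntree = 100, mtry = 2)
}

# Class predictions from the tree (here on the same toy data, for brevity).
cart_pred <- predict(cart_fit, newdata = toy, type = "class")
```

In practice the models would be fit on the training split and the predictions made on the held-out test split.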

The fit results are shown as a confusion matrix.
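
A confusion matrix can be built in base R by cross-tabulating predicted against actual labels. The six labels below are hypothetical, and the app may use a dedicated function from another package instead of table().

```r
# Hypothetical actual and predicted labels for six test-set passengers.
actual    <- factor(c(0, 0, 1, 1, 1, 0), levels = c(0, 1))
predicted <- factor(c(0, 1, 1, 1, 0, 0), levels = c(0, 1))

# Rows are predictions, columns are the true outcomes.
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy  # 4 of 6 correct
```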