We want to build a random forest (RF) model that classifies passengers of the Titanic, the ship that sank in 1912, as having survived or not, based on a number of predictors (features). This is a Kaggle beginner challenge. Both the train and test datasets are obtained here.
rm(list = ls(all.names = TRUE))
library(dplyr) # For data wrangling
library(mice) # For imputation
library(VIM) # For imputation
library(caret) # For ML
library(randomForest) # For RF
Train <- read.csv("D:\\Data Science\\Hackathons\\Kaggle\\Titanic--Machine Learning from Disaster\\train.csv") %>% select(-c(PassengerId, Name, Ticket, Cabin))
Test <- read.csv("D:\\Data Science\\Hackathons\\Kaggle\\Titanic--Machine Learning from Disaster\\test.csv") %>% select(-c(Name, Ticket, Cabin))
set.seed(1111)
Train %>% sample_n(5) # View 5 random rows
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 1 0 3 male 39 0 0 24.1500 S
## 2 1 2 female 24 2 3 18.7500 S
## 3 0 3 female 18 1 0 17.8000 S
## 4 0 1 male NA 0 0 30.6958 C
## 5 1 1 female 19 1 0 91.0792 C
We should ensure that categorical variables such as Survived are stored as type factor. Note that read.csv() in R versions before 4.0 converts character columns to factors by default, which is why Sex and Embarked already appear as factors below.
str(Train)
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The results above show that Survived and Pclass are not of the appropriate class: both are stored as integers rather than factors. We correct this below.
Embarked_Names <- c(C = "Cherbourg", Q = "Queenstown", S = "Southampton")
Train <- within(Train, {
  Survived <- factor(Survived) # 0/1 integers to a two-level factor
  Pclass <- factor(Pclass,
                   levels = c(1, 2, 3),
                   labels = c("1st", "2nd", "3rd"))
  Embarked <- recode(Embarked, !!!Embarked_Names) %>% factor() # Expand port codes
})
Test <- within(Test, {
  Pclass <- factor(Pclass,
                   levels = c(1, 2, 3),
                   labels = c("1st", "2nd", "3rd"))
  Embarked <- recode(Embarked, !!!Embarked_Names) %>% factor()
})
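A quick sanity check that the conversions took effect:
sapply(Train, class) # Class of each column after conversion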
We need to ensure that our dataset has no missing values, whether in the dependent variable (DV) or any of the independent variables (IVs).
anyNA(Train)
## [1] TRUE
The result above shows that there are missing instances in some columns.
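Before examining pairwise missingness with mice, a simple base-R count per column:
colSums(is.na(Train)) # Number of NAs in each column
Note that this only counts NA values; the blank Embarked entries remain as an empty-string factor level rather than NA, so they are not flagged here.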
md.pairs(Train)
## $rr
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## Survived 891 891 891 714 891 891 891 891
## Pclass 891 891 891 714 891 891 891 891
## Sex 891 891 891 714 891 891 891 891
## Age 714 714 714 714 714 714 714 714
## SibSp 891 891 891 714 891 891 891 891
## Parch 891 891 891 714 891 891 891 891
## Fare 891 891 891 714 891 891 891 891
## Embarked 891 891 891 714 891 891 891 891
##
## $rm
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## Survived 0 0 0 177 0 0 0 0
## Pclass 0 0 0 177 0 0 0 0
## Sex 0 0 0 177 0 0 0 0
## Age 0 0 0 0 0 0 0 0
## SibSp 0 0 0 177 0 0 0 0
## Parch 0 0 0 177 0 0 0 0
## Fare 0 0 0 177 0 0 0 0
## Embarked 0 0 0 177 0 0 0 0
##
## $mr
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## Survived 0 0 0 0 0 0 0 0
## Pclass 0 0 0 0 0 0 0 0
## Sex 0 0 0 0 0 0 0 0
## Age 177 177 177 0 177 177 177 177
## SibSp 0 0 0 0 0 0 0 0
## Parch 0 0 0 0 0 0 0 0
## Fare 0 0 0 0 0 0 0 0
## Embarked 0 0 0 0 0 0 0 0
##
## $mm
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## Survived 0 0 0 0 0 0 0 0
## Pclass 0 0 0 0 0 0 0 0
## Sex 0 0 0 0 0 0 0 0
## Age 0 0 0 177 0 0 0 0
## SibSp 0 0 0 0 0 0 0 0
## Parch 0 0 0 0 0 0 0 0
## Fare 0 0 0 0 0 0 0 0
## Embarked 0 0 0 0 0 0 0 0
In the output above, $rr counts pairs of variables where both are observed, $rm pairs where the row variable is observed but the column variable is missing, $mr the reverse, and $mm pairs where both are missing. We can visualise the missing-data pattern as follows.
md.pattern(Train)
## Survived Pclass Sex SibSp Parch Fare Embarked Age
## 714 1 1 1 1 1 1 1 1 0
## 177 1 1 1 1 1 1 1 0 1
## 0 0 0 0 0 0 0 177 177
The pattern above reveals that only Age has missing values, with 177 instances. We can impute these as follows.
Impute_train <- mice(Train, m = 3, seed = 1111)
##
## iter imp variable
## 1 1 Age
## 1 2 Age
## 1 3 Age
## 2 1 Age
## 2 2 Age
## 2 3 Age
## 3 1 Age
## 3 2 Age
## 3 3 Age
## 4 1 Age
## 4 2 Age
## 4 3 Age
## 5 1 Age
## 5 2 Age
## 5 3 Age
Impute_train
## Class: mids
## Number of multiple imputations: 3
## Imputation methods:
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## "" "" "" "pmm" "" "" "" ""
## PredictorMatrix:
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## Survived 0 1 1 1 1 1 1 1
## Pclass 1 0 1 1 1 1 1 1
## Sex 1 1 0 1 1 1 1 1
## Age 1 1 1 0 1 1 1 1
## SibSp 1 1 1 1 0 1 1 1
## Parch 1 1 1 1 1 0 1 1
## Fare 1 1 1 1 1 1 0 1
## Embarked 1 1 1 1 1 1 1 0
The function mice() generates three random imputations (since m = 3) and iterates 5 times by default. The default method for imputing numeric variables is pmm, which stands for predictive mean matching; binary factors default to logreg (logistic regression) and unordered factors with more than two levels to polyreg (polytomous, i.e. multinomial, logistic regression).
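If we wanted a different method for a given column, we could override the method vector. A minimal sketch, purely illustrative (Impute_alt is a hypothetical name; norm is Bayesian linear regression):
meth <- make.method(Train) # mice's default method for each column
meth["Age"] <- "norm" # Swap pmm for Bayesian linear regression
Impute_alt <- mice(Train, m = 3, method = meth, seed = 1111, printFlag = FALSE)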
We can look at the first 5 imputed values using:
Impute_train$imp$Age[1:5, ]
## 1 2 3
## 6 46 47 27
## 18 31 43 18
## 20 39 24 2
## 27 25 36 39
## 29 35 25 2
We can obtain the completed dataset from the first imputation by:
Train <- complete(Impute_train, 1)
anyNA(Train)
## [1] FALSE
To visualise the imputed values, we can draw a strip plot:
stripplot(Impute_train, pch = 20, cex = 1.2)
In the plot above, 0 on the x-axis represents the original data, while 1, 2, and 3 are the 1st, 2nd, and 3rd imputations respectively.
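Another common diagnostic overlays the density of the imputed Age values on that of the observed values (in mice's default colouring, blue is observed, red is imputed):
densityplot(Impute_train, ~ Age)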
There are only three main hyperparameters involved in an RF, namely the number of trees ntree, the number of variables tried at each split mtry, and the minimum size of terminal nodes nodesize, which indirectly controls tree depth: the larger the nodesize, the shallower the tree and the fewer the leaves.
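Rather than guessing mtry, the randomForest package provides tuneRF(), which starts from a default mtry and inflates or deflates it by stepFactor for as long as the OOB error keeps improving. A minimal sketch, assuming Survived is still the first column of Train as in the str() output above:
set.seed(1111)
tuneRF(x = Train[, -1],    # Predictors: everything except Survived
       y = Train$Survived, # Response
       ntreeTry = 500,     # Trees grown for each mtry tried
       stepFactor = 1.5,   # Factor by which mtry changes at each step
       improve = 0.01)     # Minimum relative OOB improvement to continue
For now we fit the model with the defaults, i.e. ntree = 500 and, for classification, mtry = floor(sqrt(p)):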
RF_Model <- randomForest(Survived ~ .,
data = Train,
importance = TRUE)
RF_Model
##
## Call:
## randomForest(formula = Survived ~ ., data = Train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 16.72%
## Confusion matrix:
## 0 1 class.error
## 0 499 50 0.09107468
## 1 99 243 0.28947368
Since this is a classification problem, we measure model performance using accuracy rather than regression metrics such as MAE, RMSE, or MSE.
We can obtain accuracy by summing the diagonal elements of the confusion matrix above and dividing by the table total:
(499+243)/(499+243+50+99)
## [1] 0.8327722
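The same figure can be read off the fitted object: RF_Model$confusion stores these counts along with a class.error column, which we drop before summing:
conf <- RF_Model$confusion[, 1:2] # Keep the counts, drop class.error
sum(diag(conf)) / sum(conf)       # OOB accuracy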
To view variable importance:
RF_Model$importance
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## Pclass 0.03002979 0.103039035 0.05799803 34.10667
## Sex 0.12693107 0.209098819 0.15822907 103.52928
## Age 0.04770524 0.038393866 0.04406702 60.64767
## SibSp 0.03170089 0.004421977 0.02119980 17.41862
## Parch 0.01544175 0.008230806 0.01265676 12.21988
## Fare 0.04080752 0.069731258 0.05183229 62.69512
## Embarked 0.00844533 0.013936855 0.01058102 11.40967
Here MeanDecreaseAccuracy measures how much OOB accuracy drops when a variable's values are randomly permuted, while MeanDecreaseGini measures the total reduction in node impurity that splits on the variable achieve across all trees. We can also visualise variable importance:
varImpPlot(RF_Model,
pch = 20,
col = "green",
main = "Feature Importance")
Ensure the testing set does not have mismatching factor levels or variable names, otherwise predict() fails with the error Type of predictors in new data do not match that of the training data.
Impute_test <- mice(Test, m = 3, seed = 2222)
##
## iter imp variable
## 1 1 Age Fare
## 1 2 Age Fare
## 1 3 Age Fare
## 2 1 Age Fare
## 2 2 Age Fare
## 2 3 Age Fare
## 3 1 Age Fare
## 3 2 Age Fare
## 3 3 Age Fare
## 4 1 Age Fare
## 4 2 Age Fare
## 4 3 Age Fare
## 5 1 Age Fare
## 5 2 Age Fare
## 5 3 Age Fare
Test <- complete(Impute_test, 1)
common <- intersect(names(Train), names(Test))
for (p in common) {
  if (is.factor(Train[[p]])) {
    # Re-encode with the training levels; factor() matches values by label,
    # whereas assigning with levels<- renames levels by position and can
    # silently scramble the data when the two orderings differ
    Test[[p]] <- factor(Test[[p]], levels = levels(Train[[p]]))
  }
}
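A defensive check before predicting, which stops if any of the factors still disagree:
stopifnot(identical(levels(Test$Pclass), levels(Train$Pclass)),
          identical(levels(Test$Sex), levels(Train$Sex)),
          identical(levels(Test$Embarked), levels(Train$Embarked)))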
Pred <- predict(RF_Model, Test)
We can now save the results for submission.
Submit <- data.frame(PassengerId = Test$PassengerId, Survived = Pred)
write.csv(Submit,
row.names = FALSE,
file = "C:\\Users\\User\\Desktop\\Desktop\\Cory_02042020.csv")
The above submission gives me a score of 0.75119, which is a decent result for a first attempt.