Titanic competition from Kaggle. Part 3.

This part sets up trainControl and configures the tuning parameters through tuneLength and tuneGrid.

Part 1 is https://rpubs.com/Minxing2046/395349
Part 2 is https://rpubs.com/Minxing2046/395356

library(tidyverse)
library(DataExplorer)
library(lubridate)
library(pander)
library(data.table)
library(grid)
library(gridExtra)
library(mice)
library(caret)

1 Loading the data

The dataset was generated in Part 2:

https://rpubs.com/Minxing2046/395356

Missing values in the dataset were imputed with the mice package.

titanic.mice <- read.csv("titanic_mice_dummy.csv")

str(data.frame(titanic.mice))

## 'data.frame':    1309 obs. of  11 variables:
##  $ X          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass.1   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pclass.2   : int  0 0 1 0 0 0 0 1 0 0 ...
##  $ Pclass.3   : int  1 1 0 1 1 1 1 0 1 1 ...
##  $ Sex.female : int  0 1 0 0 1 0 1 0 1 0 ...
##  $ Sex.male   : int  1 0 1 1 0 1 0 1 0 1 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Survived   : int  NA NA NA NA NA NA NA NA NA NA ...

Drop the first column, which only holds the row numbers.

titanic.mice <- titanic.mice[,-1]

str(data.frame(titanic.mice))

## 'data.frame':    1309 obs. of  10 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass.1   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pclass.2   : int  0 0 1 0 0 0 0 1 0 0 ...
##  $ Pclass.3   : int  1 1 0 1 1 1 1 0 1 1 ...
##  $ Sex.female : int  0 1 0 0 1 0 1 0 1 0 ...
##  $ Sex.male   : int  1 0 1 1 0 1 0 1 0 1 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Survived   : int  NA NA NA NA NA NA NA NA NA NA ...

titanic.mice$Survived <- as.factor(titanic.mice$Survived)

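2 Training and test

Split the full dataset into training and test sets. Rows with a known Survived value form the training set; rows where Survived is NA form the test set.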

titanic.train <- titanic.mice %>% filter(!is.na(Survived))
titanic.test <- titanic.mice %>% filter(is.na(Survived))
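A quick way to confirm that the split is clean is to check that the two pieces add up to the full data set. The sketch below uses a small made-up data frame (toy is hypothetical, not the Titanic data) and base R subsetting instead of filter, so it runs on its own.

```r
# Toy stand-in for titanic.mice: Survived is NA for the test rows.
toy <- data.frame(
  PassengerId = 1:6,
  Age         = c(22, 38, 26, 35, 54, 2),
  Survived    = factor(c(0, 1, 1, NA, NA, 0))
)

# Same logic as the filter() calls, written with base R subsetting.
toy.train <- toy[!is.na(toy$Survived), ]
toy.test  <- toy[is.na(toy$Survived), ]

# Every row lands in exactly one of the two sets.
nrow(toy.train) + nrow(toy.test) == nrow(toy)

# The test set contains only rows with a missing outcome.
all(is.na(toy.test$Survived))
```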

head(titanic.train)

anyNA(titanic.train$Survived)

## [1] FALSE

head(titanic.test)

3 trainControl

The trainControl function controls how train() resamples the data and searches the tuning parameters.

search = "random" selects the tuning parameter values at random.

The tuning parameters depend on the method chosen in train().

For example, if method = "ranger" is selected in the train call, the tuning parameters are

mtry (Randomly Selected Predictors)
splitrule (Splitting Rule)
min.node.size (Minimal Node Size)
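The tuning parameters of a method do not have to be memorized. As a small sketch, caret's modelLookup function lists them for a given method name; the exact rows returned depend on the installed caret version.

```r
library(caret)

# List the tuning parameters that train() can tune for each method.
ranger.params <- modelLookup("ranger")
rf.params     <- modelLookup("rf")

# The parameter column holds the names expected in tuneGrid.
ranger.params[, c("parameter", "label")]
rf.params[, c("parameter", "label")]
```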

myControl <- trainControl(method ="repeatedcv", 
                          number = 10,
                          repeats = 3,
                          savePredictions = "final",
                          search = "random",
                          verboseIter = FALSE)
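With method = "repeatedcv", number = 10 and repeats = 3, every candidate parameter set is fitted on 10 × 3 = 30 resamples. The sketch below illustrates this with caret's createMultiFolds on a made-up outcome vector that has the same class counts as the training data (549 and 342). train builds its own folds internally, so this only shows the shape of the resampling, not the folds actually used.

```r
library(caret)

# Made-up outcome with the class counts of the training set.
y <- factor(rep(c("0", "1"), times = c(549, 342)))

set.seed(1)  # arbitrary seed, only to make the sketch repeatable
folds <- createMultiFolds(y, k = 10, times = 3)

# One vector of training row indices per fold and repeat: 10 * 3 = 30.
length(folds)

# Each training set holds roughly 9/10 of the 891 rows.
range(lengths(folds))
```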

In this example, the metric is not specified, so the train function uses Accuracy, its default for classification.

titanic.train[,-1] removes PassengerId so that it is not used as a predictor.

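3.1 tuneLength argument

tuneLength sets how many tuning parameter values train evaluates. With the default grid search, it is the number of values tried for each tuning parameter. With search = "random" in trainControl, it is the maximum number of randomly drawn parameter combinations.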

More details are here

http://www.rpubs.com/Mentors_Ubiqum/tunegrid_tunelength

set.seed makes the results reproducible.

For example, the randomly drawn mtry values in the run below are 6, 5, 1, 4 and 7; with the same seed, they won’t change the next time the model is run.

If set.seed is not called, the tuning parameters will change the next time the model is run.
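The effect of set.seed can be seen without fitting a model. In the sketch below, sample stands in for the random draw of mtry values; it is an illustration, not the code caret runs internally.

```r
# Draw five candidate values from 1..8 after setting a seed.
set.seed(2456)
first.draw <- sample(1:8, 5)

# Resetting the same seed reproduces the same draw.
set.seed(2456)
second.draw <- sample(1:8, 5)

identical(first.draw, second.draw)   # TRUE

# Without resetting the seed, the next draw continues the random stream
# and will usually differ from the first one.
third.draw <- sample(1:8, 5)
```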

set.seed(2456)
Model.mice <- train(Survived ~., data = titanic.train[,-1],
                method = "ranger",
                trControl = myControl,
                tuneLength = 5
                )
Model.mice

## Random Forest 
## 
## 891 samples
##   8 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 802, 802, 802, 802, 802, 802, ... 
## Resampling results across tuning parameters:
## 
##   min.node.size  mtry  splitrule   Accuracy   Kappa    
##    3             6     gini        0.8103567  0.5910529
##    5             5     extratrees  0.8182053  0.6042057
##    9             1     gini        0.8088711  0.5813119
##   12             4     extratrees  0.8174771  0.6000868
##   19             7     gini        0.8163284  0.6012792
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 5, splitrule =
##  extratrees and min.node.size = 5.

In this example, the values of the three tuning parameters (min.node.size, mtry and splitrule) are randomly selected, as requested by the search argument in the trainControl call.

Five parameter combinations were evaluated, as set by the tuneLength argument in the train call.

The results show that mtry = 5, splitrule = extratrees and min.node.size = 5 give the highest accuracy.
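The selection rule can be reproduced by hand. The sketch below retypes the five rows of the resampling table above into a data frame (cv.results is a name made up here) and picks the row with the largest Accuracy. On a fitted model, the same information is stored in Model.mice$results and Model.mice$bestTune.

```r
# Resampling results copied from the printed model output.
cv.results <- data.frame(
  min.node.size = c(3, 5, 9, 12, 19),
  mtry          = c(6, 5, 1, 4, 7),
  splitrule     = c("gini", "extratrees", "gini", "extratrees", "gini"),
  Accuracy      = c(0.8103567, 0.8182053, 0.8088711, 0.8174771, 0.8163284),
  Kappa         = c(0.5910529, 0.6042057, 0.5813119, 0.6000868, 0.6012792)
)

# train() keeps the combination with the largest Accuracy.
best <- cv.results[which.max(cv.results$Accuracy), ]
best
```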

plot(Model.mice)

3.2 tuneGrid

Alternatively, we can specify the values of the tuning parameters with the tuneGrid argument in the train call.

The training data set has 10 columns. The first column is PassengerId, which is not a predictor, and the last column is the outcome, so the number of predictors is 8.

In the example below, mtry is set to 1:8 in tuneGrid. method = "ranger" has three tuning parameters, so the method is changed from ranger to rf, which has only one tuning parameter, mtry.

myControl.tunegird <- trainControl(method ="repeatedcv", 
                          number = 10,
                          repeats = 3,
                          savePredictions = "final",
                          verboseIter = FALSE)

In this example, tuneGrid is set to myGrid, in which mtry ranges from 1 to 8 predictors.

set.seed(2600)

myGrid <- expand.grid(.mtry = c(1:8))

Model.mice.tunegrid <- train(Survived ~., data = titanic.train[,-1],
                method = "rf",
                trControl = myControl.tunegird,
                tuneGrid = myGrid
                )
Model.mice.tunegrid

## Random Forest 
## 
## 891 samples
##   8 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 803, 802, 802, 802, 802, 801, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   1     0.8091247  0.5816191
##   2     0.8196288  0.6063718
##   3     0.8267367  0.6202330
##   4     0.8248639  0.6157731
##   5     0.8222338  0.6121237
##   6     0.8151133  0.5995484
##   7     0.8109935  0.5894494
##   8     0.8072606  0.5821678
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.

When mtry = 3, the model has the highest accuracy; therefore, mtry = 3 is selected.
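If ranger were kept instead of rf, the grid would need one column per tuning parameter. A minimal sketch, with candidate values chosen here purely for illustration:

```r
# One column per ranger tuning parameter; expand.grid builds every combination.
rangerGrid <- expand.grid(
  mtry          = 1:8,
  splitrule     = c("gini", "extratrees"),
  min.node.size = c(1, 5, 10)
)

nrow(rangerGrid)   # 8 * 2 * 3 = 48 combinations
head(rangerGrid)
```

Such a grid would be passed as tuneGrid = rangerGrid together with method = "ranger". With 10-fold cross-validation repeated 3 times, that is 48 × 30 = 1440 model fits, so a full grid becomes expensive quickly.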

The rule of thumb for mtry in randomForest:

for classification, mtry is the square root of the number of predictor variables (rounded down).
for regression, it is the number of predictor variables divided by 3 (rounded down).

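In this example, there are 8 predictors. ncol(titanic.mice) also counts PassengerId and Survived, so 2 is subtracted. The square root of the number of predictor variables (rounded down) is

floor(sqrt(ncol(titanic.mice) - 2))

## [1] 2

This is close to, but not the same as, the mtry = 3 selected by cross-validation above.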

3.3 ROC metric

In the examples above, Accuracy is used because the metric argument is not specified in the train call.

A classification model can also be evaluated by the area under the ROC curve.

3.3.1 ROC trainControl

To use ROC as the metric in the train call, classProbs and summaryFunction must be set in trainControl.

myControl.ROC <- trainControl(method ="repeatedcv", 
                          number = 10,
                          repeats = 3,
                          savePredictions = "final",
                          summaryFunction = twoClassSummary,
                          classProbs = TRUE,
                          verboseIter = FALSE)
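twoClassSummary reports three numbers: ROC (the area under the ROC curve), Sens and Spec. The sketch below calls it directly on four made-up predictions to show the input it expects; the data frame toy is hypothetical, and the class labels X0 and X1 match the ones used later in this section. caret treats the first factor level as the event of interest, so with levels X0 and X1, Sens refers to X0.

```r
library(caret)

lev <- c("X0", "X1")

# obs = true class, pred = predicted class,
# X0 / X1 = predicted class probabilities (column names must match the levels).
toy <- data.frame(
  obs  = factor(c("X0", "X0", "X1", "X1"), levels = lev),
  pred = factor(c("X0", "X1", "X1", "X1"), levels = lev),
  X0   = c(0.9, 0.4, 0.2, 0.1),
  X1   = c(0.1, 0.6, 0.8, 0.9)
)

# Returns a named numeric vector with ROC, Sens and Spec.
toy.summary <- twoClassSummary(toy, lev = lev)
toy.summary
```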

In addition, the values in the Survived column can’t be 0/1; the factor levels must be valid R variable names. Otherwise, train stops with this error message:

Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).

table(titanic.train$Survived)

## 
##   0   1 
## 549 342

The code below converts 0/1 to X0/X1:

levels(titanic.train$Survived) <- make.names(levels(factor(titanic.train$Survived)))

Check again: 0/1 has been converted to X0/X1.

table(titanic.train$Survived)

## 
##  X0  X1 
## 549 342
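make.names is what turns 0 and 1 into X0 and X1. As a sketch of an alternative, the levels can be given descriptive labels instead; the labels Died and Survived are chosen here for illustration and are not used elsewhere in this series.

```r
# make.names prefixes names that start with a digit with "X".
make.names(c("0", "1"))   # "X0" "X1"

# Alternative: descriptive labels, which are also valid R names.
survived.raw <- c(0, 1, 1, 0, 0)
survived.fac <- factor(survived.raw, levels = c(0, 1),
                       labels = c("Died", "Survived"))
levels(survived.fac)      # "Died" "Survived"

# Names that are already valid are left unchanged by make.names.
identical(make.names(levels(survived.fac)), levels(survived.fac))
```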

set.seed(3001)
myGrid <- expand.grid(.mtry = c(1:8))

Model.mice.ROC <- train(Survived ~., data = titanic.train[,-1],
                method = "rf",
                trControl = myControl.ROC,
                metric = "ROC",
                tuneGrid = myGrid
                )
Model.mice.ROC

## Random Forest 
## 
## 891 samples
##   8 predictor
##   2 classes: 'X0', 'X1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 802, 802, 802, 802, 802, 802, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##   1     0.8562551  0.9016049  0.6580672
##   2     0.8585748  0.9028283  0.6835294
##   3     0.8601011  0.9143771  0.6815126
##   4     0.8601161  0.9113692  0.6766387
##   5     0.8588008  0.8962065  0.6872549
##   6     0.8528111  0.8864422  0.6900840
##   7     0.8468462  0.8834343  0.6822129
##   8     0.8426973  0.8804040  0.6792437
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.

plot(Model.mice.ROC)

Kaggle-Titanic-Caret-3

Ming Si

June 10 2018