Titanic competition from Kaggle. Part 3.
This part sets up trainControl and configures the tuning parameters, including tuneLength and tuneGrid.
Part 1 is https://rpubs.com/Minxing2046/395349
Part 2 is https://rpubs.com/Minxing2046/395356
library(tidyverse)
library(DataExplorer)
library(lubridate)
library(pander)
library(data.table)
library(grid)
library(gridExtra)
library(mice)
library(caret)
The dataset is generated from Part 2:
https://rpubs.com/Minxing2046/395356
The missing data in the dataset were imputed with the mice package.
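A hedged sketch of loading the imputed dataset (the file name titanic_mice.csv is an assumption; the actual chunk is not shown in this knit):
# Assumed: read the imputed dataset saved at the end of Part 2
# (the file name "titanic_mice.csv" is hypothetical)
titanic.mice <- read.csv("titanic_mice.csv")
str(titanic.mice)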
## 'data.frame': 1309 obs. of 11 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass.1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pclass.2 : int 0 0 1 0 0 0 0 1 0 0 ...
## $ Pclass.3 : int 1 1 0 1 1 1 1 0 1 1 ...
## $ Sex.female : int 0 1 0 0 1 0 1 0 1 0 ...
## $ Sex.male : int 1 0 1 1 0 1 0 1 0 1 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Survived : int NA NA NA NA NA NA NA NA NA NA ...
Drop the first column (X), which only holds row numbers.
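A hedged reconstruction of the missing chunk, assuming the row-number column X is the first column:
# Assumed: drop the first column (X), which only stores row numbers
titanic.mice <- titanic.mice[, -1]
str(titanic.mice)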
## 'data.frame': 1309 obs. of 10 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass.1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pclass.2 : int 0 0 1 0 0 0 0 1 0 0 ...
## $ Pclass.3 : int 1 1 0 1 1 1 1 0 1 1 ...
## $ Sex.female : int 0 1 0 0 1 0 1 0 1 0 ...
## $ Sex.male : int 1 0 1 1 0 1 0 1 0 1 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Survived : int NA NA NA NA NA NA NA NA NA NA ...
Split the full dataset into training and test sets.
titanic.train <- titanic.mice %>% filter(!is.na(Survived))
titanic.test <- titanic.mice %>% filter(is.na(Survived))
## [1] FALSE
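The FALSE above presumably comes from a check that the training split has no missing Survived values (the original chunk is not shown); a hedged guess:
# Assumed check: the training split should contain no missing outcomes
anyNA(titanic.train$Survived)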
The trainControl() function.
search = "random" tells train() to pick values for the tuning parameters at random.
The tuning parameters differ depending on which method is chosen in train().
For example, if method = "ranger" is selected in the train() call, the tuning parameters are mtry, splitrule, and min.node.size.
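caret's modelLookup() lists the tunable parameters that train() exposes for a given method, so this can be checked before fitting:
# List the tuning parameters caret exposes for method = "ranger"
modelLookup("ranger")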
myControl <- trainControl(method = "repeatedcv",
                          number = 10,
                          repeats = 3,
                          savePredictions = "final",
                          search = "random",
                          verboseIter = FALSE)
In this example, the metric is not specified, so train() defaults to Accuracy.
titanic.train[,-1] removes PassengerId so it is not used as a predictor.
tuneLength tells train() how many candidate values to try for the tuning parameters; with search = "random" it is the total number of random parameter combinations to evaluate.
More details are here
http://www.rpubs.com/Mentors_Ubiqum/tunegrid_tunelength
set.seed() makes the model reproduce the same results.
For example, the randomly drawn mtry values here are 6, 5, 1, 4 and 7, and they will not change the next time the model is run.
If set.seed() is not set, a different set of tuning parameters will be drawn each time the model is run.
set.seed(2456)
Model.mice <- train(Survived ~ ., data = titanic.train[,-1],
                    method = "ranger",
                    trControl = myControl,
                    tuneLength = 5)
Model.mice
## Random Forest
##
## 891 samples
## 8 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 802, 802, 802, 802, 802, 802, ...
## Resampling results across tuning parameters:
##
## min.node.size mtry splitrule Accuracy Kappa
## 3 6 gini 0.8103567 0.5910529
## 5 5 extratrees 0.8182053 0.6042057
## 9 1 gini 0.8088711 0.5813119
## 12 4 extratrees 0.8174771 0.6000868
## 19 7 gini 0.8163284 0.6012792
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 5, splitrule =
## extratrees and min.node.size = 5.
In this example, the value for each tuning parameter was chosen at random, as requested by search = "random" in the trainControl() call. The model was tuned 5 times, which is defined by the tuneLength argument in the train() call.
The results show that mtry = 5, splitrule = extratrees and min.node.size = 5 give the highest accuracy.
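The winning combination can also be read directly from the fitted train object: bestTune holds the selected row and results holds the full resampling table.
# Selected parameter combination and the full tuning results
Model.mice$bestTune
Model.mice$results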
Alternatively, we can specify the values of the tuning parameters ourselves with the tuneGrid argument in the train() call.
The training dataset has 10 columns: the first column is PassengerId, which is not a predictor, and the last column is the outcome, so there are 8 predictors.
In the example below, let's set mtry to 1:8 in tuneGrid. method = "ranger" has 3 tuning parameters, so the method is changed to "rf", which has only one tuning parameter, mtry.
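As with ranger above, modelLookup() confirms that rf exposes a single tuning parameter:
# rf has only one tuning parameter, mtry
modelLookup("rf")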
myControl.tunegrid <- trainControl(method = "repeatedcv",
                                   number = 10,
                                   repeats = 3,
                                   savePredictions = "final",
                                   verboseIter = FALSE)
In this example, tuneGrid is specified by myGrid, with mtry ranging over the 1 to 8 predictors.
set.seed(2600)
myGrid <- expand.grid(.mtry = c(1:8))
Model.mice.tunegrid <- train(Survived ~ ., data = titanic.train[,-1],
                             method = "rf",
                             trControl = myControl.tunegrid,
                             tuneGrid = myGrid)
Model.mice.tunegrid
## Random Forest
##
## 891 samples
## 8 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 803, 802, 802, 802, 802, 801, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.8091247 0.5816191
## 2 0.8196288 0.6063718
## 3 0.8267367 0.6202330
## 4 0.8248639 0.6157731
## 5 0.8222338 0.6121237
## 6 0.8151133 0.5995484
## 7 0.8109935 0.5894494
## 8 0.8072606 0.5821678
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
When mtry = 3, the model has the highest accuracy; therefore mtry = 3 is selected.
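The accuracy profile across the mtry grid can also be plotted straight from the train object:
# Plot resampled accuracy against mtry for the grid-tuned model
plot(Model.mice.tunegrid)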
For randomForest, the rule of thumb for mtry is:
for classification, the square root of the number of predictor variables (rounded down);
for regression, the number of predictor variables divided by 3 (rounded down).
In this example, with 8 predictors, the classification rule gives the value below.
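A hedged reconstruction of the chunk behind that output (the original code is not shown):
# Square root of the number of predictors (8), rounded down
floor(sqrt(8))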
## [1] 3
More on randomForest can be found here.
In the above examples, Accuracy is used because metric is not specified in the train() call.
Classification can also be evaluated by ROC.
More on how to choose a metric is here:
https://www.machinelearningplus.com/machine-learning/evaluation-metrics-classification-models-r/
To use ROC as the metric in the train() call, classProbs and summaryFunction must be defined in trainControl().
myControl.ROC <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3,
                              savePredictions = "final",
                              summaryFunction = twoClassSummary,
                              classProbs = TRUE,
                              verboseIter = FALSE)
In addition, the values in the Survived column can't be 0/1; they have to be valid R variable names (letters). Otherwise train() throws this error message:
Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
##
## 0 1
## 549 342
The code below converts 0/1 to X0/X1.
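A hedged sketch of the conversion (the original chunk is not shown in this knit); it recodes the 0/1 levels to the valid R names X0/X1:
# Assumed recoding (original chunk not shown): give the outcome levels
# valid R variable names so caret can compute class probabilities
titanic.train$Survived <- factor(titanic.train$Survived,
                                 levels = c(0, 1),
                                 labels = c("X0", "X1"))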
Check again: 0/1 has been converted into X0/X1.
##
## X0 X1
## 549 342
set.seed(3001)
myGrid <- expand.grid(.mtry = c(1:8))
Model.mice.ROC <- train(Survived ~ ., data = titanic.train[,-1],
                        method = "rf",
                        trControl = myControl.ROC,
                        metric = "ROC",
                        tuneGrid = myGrid)
Model.mice.ROC
## Random Forest
##
## 891 samples
## 8 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 802, 802, 802, 802, 802, 802, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.8562551 0.9016049 0.6580672
## 2 0.8585748 0.9028283 0.6835294
## 3 0.8601011 0.9143771 0.6815126
## 4 0.8601161 0.9113692 0.6766387
## 5 0.8588008 0.8962065 0.6872549
## 6 0.8528111 0.8864422 0.6900840
## 7 0.8468462 0.8834343 0.6822129
## 8 0.8426973 0.8804040 0.6792437
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.