A brief walk-through of the mlr functions needed for basic model tuning.
mlr supports more types of tasks: Classification, Regression, Survival, Clustering, Multilabel, Cost-sensitive, Imbalanced data, Functional data, and Spatial data. caret only seems to cover Classification, Regression, and Cost-sensitive problems.
mlr offers 71 metrics that are easy to choose from; changing metrics in caret is more complicated.
mlr supports more tuning methods, such as F-racing and model-based optimization, while caret supports only random and grid search.
mlr has better support for parallelization.
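For example, mlr's resampling, tuning, and benchmarking loops can be parallelized through the parallelMap package; a minimal sketch (the number of workers is only illustrative):
library(parallelMap)          # mlr picks up parallelMap automatically
parallelStartSocket(cpus = 2) # start 2 worker processes
# ... any resample()/tuneParams()/benchmark() call here runs its iterations in parallel ...
parallelStop()                # shut the workers down again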
Goal: load and clean the data.
data(BreastCancer, package = 'mlbench')
df <- BreastCancer
str(df)
## 'data.frame': 699 obs. of 11 variables:
## $ Id : chr "1000025" "1002945" "1015425" "1016277" ...
## $ Cl.thickness : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell.shape : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
## $ Marg.adhesion : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
## $ Epith.c.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare.nuclei : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
## $ Bl.cromatin : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
## $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
summary(df)
## Id Cl.thickness Cell.size Cell.shape
## Length:699 1 :145 1 :384 1 :353
## Class :character 5 :130 10 : 67 2 : 59
## Mode :character 3 :108 3 : 52 10 : 58
## 4 : 80 2 : 45 3 : 56
## 10 : 69 4 : 40 4 : 44
## 2 : 50 5 : 30 5 : 34
## (Other):117 (Other): 81 (Other): 95
## Marg.adhesion Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli
## 1 :407 2 :386 1 :402 2 :166 1 :443
## 2 : 58 3 : 72 10 :132 3 :165 10 : 61
## 3 : 58 4 : 48 2 : 30 1 :152 3 : 44
## 10 : 55 1 : 47 5 : 30 7 : 73 2 : 36
## 4 : 33 6 : 41 3 : 28 4 : 40 8 : 24
## 8 : 25 5 : 39 (Other): 61 5 : 34 6 : 22
## (Other): 63 (Other): 66 NA's : 16 (Other): 69 (Other): 69
## Mitoses Class
## 1 :579 benign :458
## 2 : 35 malignant:241
## 3 : 33
## 10 : 14
## 4 : 12
## 7 : 9
## (Other): 17
It looks like we have an ID column, which we should remove before modeling. We also have some NA values in the Bare.nuclei column; let's recode the NAs as a separate level.
df$Id = NULL
new_col <- as.numeric(df$Bare.nuclei)   # the levels "1"-"10" map to the codes 1-10
new_col[is.na(new_col)] <- 5.5          # give the NAs their own value between 5 and 6
new_col <- factor(new_col, levels = c("1", "2", "3", "4", "5", "5.5", "6", "7", "8", "9", "10"), ordered = TRUE)
table(new_col)
## new_col
## 1 2 3 4 5 5.5 6 7 8 9 10
## 402 30 28 19 30 16 4 8 21 9 132
df$Bare.nuclei <- new_col
df$Epith.c.size <- NULL
Other options:
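One alternative (a minimal sketch using mlr's impute() helper; imputeMode() fills the NAs with the most frequent level) would be to impute the missing values instead of giving them their own level:
imp <- impute(BreastCancer[, -1], cols = list(Bare.nuclei = imputeMode()))
df_alt <- imp$data  # data with the Bare.nuclei NAs replaced by the most frequent level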
Goal: create a classification task.
bc_task = makeClassifTask(data = df, target = 'Class')
bc_task
## Supervised task: df
## Type: classif
## Target: Class
## Observations: 699
## Features:
## numerics factors ordered functionals
## 0 3 5 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
## benign malignant
## 458 241
## Positive class: benign
bc_task = makeClassifTask(id = 'breast cancer classification',
data = df, target = 'Class',
positive = "malignant")
Notes: you can give the task an id and set the positive class explicitly, as in the second call above.
Other tasks: makeRegrTask() for regression, makeSurvTask() for survival analysis, makeClusterTask() for clustering, makeMultilabelTask() for multilabel classification, and makeCostSensTask() for cost-sensitive classification.
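For instance, a regression task could be created like this (a minimal sketch; the BostonHousing data is only used for illustration):
data(BostonHousing, package = 'mlbench')
bh_task = makeRegrTask(data = BostonHousing, target = 'medv')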
Goal: create a learner.
bc_learner = makeLearner(cl = "classif.ranger",         # class of learner
                         predict.type = "prob",         # prediction output type
                         par.vals = list(),             # set hyperparameters
                         predict.threshold = NULL,      # threshold for prediction
                         fix.factors.prediction = TRUE, # add missing factor levels in test data
                         id = 'bc ranger')              # learner id
bc_learner
## Learner bc ranger from package ranger
## Type: classif
## Name: Random Forests; Short name: ranger
## Class: classif.ranger
## Properties: twoclass,multiclass,prob,numerics,factors,ordered,featimp,weights,oobpreds
## Predict-Type: prob
## Hyperparameters: num.threads=1,verbose=FALSE,respect.unordered.factors=order
To see available learners for the particular task:
learners <- listLearners(bc_task, properties = 'prob', check.packages = FALSE)
head(learners[,c('class', 'package')])
## class package
## 1 classif.cforest party
## 2 classif.ctree party
## 3 classif.evtree evtree
## 4 classif.featureless mlr
## 5 classif.penalized penalized
## 6 classif.randomForest randomForest
Note that this function did not return all available options for me, so I'd recommend checking the list of integrated learners in the mlr documentation.
A handy way to examine the available hyperparameters is the getParamSet() function:
getParamSet(bc_learner)
## Type len Def
## num.trees integer - 500
## mtry integer - -
## min.node.size integer - -
## replace logical - TRUE
## sample.fraction numeric - -
## split.select.weights numericvector <NA> -
## always.split.variables untyped - -
## respect.unordered.factors discrete - ignore
## importance discrete - none
## write.forest logical - TRUE
## scale.permutation.importance logical - FALSE
## num.threads integer - -
## save.memory logical - FALSE
## verbose logical - TRUE
## seed integer - -
## splitrule discrete - gini
## num.random.splits integer - 1
## keep.inbag logical - FALSE
## Constr Req Tunable Trafo
## num.trees 1 to Inf - TRUE -
## mtry 1 to Inf - TRUE -
## min.node.size 1 to Inf - TRUE -
## replace - - TRUE -
## sample.fraction 0 to 1 - TRUE -
## split.select.weights 0 to 1 - TRUE -
## always.split.variables - - TRUE -
## respect.unordered.factors ignore,order,partition - TRUE -
## importance none,impurity,permutation - FALSE -
## write.forest - - FALSE -
## scale.permutation.importance - Y FALSE -
## num.threads 1 to Inf - FALSE -
## save.memory - - FALSE -
## verbose - - FALSE -
## seed -Inf to Inf - FALSE -
## splitrule gini,extratrees - TRUE -
## num.random.splits 1 to Inf Y TRUE -
## keep.inbag - - FALSE -
A list of all learner properties can be found in the mlr documentation.
Goal: train a model.
bc_model <- train(learner = bc_learner,
task = bc_task)
bc_model
## Model for learner.id=bc ranger; learner.class=classif.ranger
## Trained on: task.id = breast cancer classification; obs = 699; features = 8
## Hyperparameters: num.threads=1,verbose=FALSE,respect.unordered.factors=order
If the learner's default settings are fine, you can skip makeLearner() and pass the learner class name directly to train().
bc_model_default <- train(learner = 'classif.ranger',
task = bc_task)
bc_model_default
## Model for learner.id=classif.ranger; learner.class=classif.ranger
## Trained on: task.id = breast cancer classification; obs = 699; features = 8
## Hyperparameters: num.threads=1,verbose=FALSE,respect.unordered.factors=order
To train on only a subset of the data, use the subset argument:
n <- nrow(df)
in_train <- sample.int(n, round(0.8*n)) # 80% of the rows for training
in_test <- setdiff(1:n, in_train)       # the remaining 20% for testing
bc_model <- train(learner = bc_learner,
task = bc_task,
subset = in_train)
You can also assign weights to observations, if the learner supports it. Use cases include:
1. reducing the influence of outliers
2. increasing the influence of certain data, e.g. more recent observations
3. incorporating misclassification costs
4. accounting for class imbalance by giving both classes equal importance
target <- df$Class[in_train]
tab <- as.numeric(table(target)) # class counts in the training set
w <- 1/tab[target]               # inverse class-frequency weights
bc_model <- train(learner = bc_learner,
task = bc_task,
subset = in_train,
weights = w)
Goal: make predictions.
To make a prediction:
bc_pred <- predict(object = bc_model,
task = bc_task,
subset = in_test)
bc_pred
## Prediction: 140 observations
## predict.type: prob
## threshold: benign=0.50,malignant=0.50
## time: 0.03
## id truth prob.benign prob.malignant response
## 9 9 benign 0.88703492 1.129651e-01 benign
## 10 10 benign 0.99892322 1.076776e-03 benign
## 12 12 benign 1.00000000 0.000000e+00 benign
## 16 16 malignant 0.41872698 5.812730e-01 malignant
## 19 19 malignant 0.01901746 9.809825e-01 malignant
## 25 25 benign 0.99991111 8.888889e-05 benign
## ... (#rows: 140, #cols: 5)
To adjust the threshold:
bc_pred_2 <- setThreshold(pred = bc_pred, threshold = 0.8)
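To see what the stricter threshold changes, you can, for example, cross-tabulate the responses before and after (a minimal sketch):
table(before = getPredictionResponse(bc_pred),
      after  = getPredictionResponse(bc_pred_2))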
To visualize the prediction:
# plotLearnerPrediction(learner = bc_learner, task = bc_task,
# features = c('Cell.shape', 'Marg.adhesion'))
lrn = makeLearner("classif.rpart", id = "CART")
plotLearnerPrediction(lrn, task = iris.task)
Figure note: The symbol type shows the true class labels of the data points. Symbols with a white border indicate misclassified observations. The posterior probabilities (if the learner supports them) are represented by the background color, where higher saturation means larger probabilities.
Handy functions:
1. getPredictionTruth(bc_pred)
2. getPredictionResponse(bc_pred)
3. getPredictionSE() for standard errors of regression predictions
4. getPredictionProbabilities(bc_pred)
5. calculateConfusionMatrix() (see the sketch after this list)
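For example, the confusion matrix for the hold-out predictions (a minimal sketch; relative = TRUE adds the row- and column-wise rates):
calculateConfusionMatrix(bc_pred, relative = TRUE)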
Goal: evaluate model performance.
To list the available performance measures:
## list measures
listMeasures("classif", properties = "classif.multi")
## [1] "featperc" "mmce" "lsr"
## [4] "bac" "qsr" "timeboth"
## [7] "multiclass.aunp" "timetrain" "multiclass.aunu"
## [10] "ber" "timepredict" "multiclass.brier"
## [13] "ssr" "acc" "logloss"
## [16] "wkappa" "multiclass.au1p" "multiclass.au1u"
## [19] "kappa"
Task-specific measures:
listMeasures(bc_task)
## [1] "tnr" "tpr" "featperc"
## [4] "f1" "mmce" "mcc"
## [7] "brier.scaled" "lsr" "bac"
## [10] "fn" "fp" "fnr"
## [13] "qsr" "fpr" "npv"
## [16] "brier" "auc" "timeboth"
## [19] "multiclass.aunp" "timetrain" "multiclass.aunu"
## [22] "ber" "timepredict" "multiclass.brier"
## [25] "ssr" "ppv" "acc"
## [28] "logloss" "wkappa" "tn"
## [31] "tp" "multiclass.au1p" "multiclass.au1u"
## [34] "fdr" "kappa" "gpr"
## [37] "gmean"
To compute performance measures on a prediction:
performance(pred = bc_pred, measures = list(f1, fpr, auc, acc))
## f1 fpr auc acc
## 0.95744681 0.02150538 0.99748341 0.97142857
To plot performance vs threshold:
d <- generateThreshVsPerfData(bc_pred, measures = list(fpr, fnr))
plotThreshVsPerf(d)
Goal: evaluate performance with resampling.
To set up a resampling strategy:
# create the resampling strategy; note that cv5 is also pre-defined in mlr
cv5 <- makeResampleDesc(method = 'CV', iters = 5, stratify = TRUE)
bc_cv5 <- resample(learner = bc_learner,
task = bc_task,
resampling = cv5,
measures = auc,
models = T) # to keep the model
## Resampling: cross-validation
## Measures: auc
## [Resample] iter 1: 0.9910714
## [Resample] iter 2: 0.9879982
## [Resample] iter 3: 0.9936594
## [Resample] iter 4: 1.0000000
## [Resample] iter 5: 0.9877717
##
## Aggregated Result: auc.test.mean=0.9921002
##
To access the resampling results:
bc_cv5$aggr
## auc.test.mean
## 0.9921002
bc_cv5$measures.test
## iter auc
## 1 1 0.9910714
## 2 2 0.9879982
## 3 3 0.9936594
## 4 4 1.0000000
## 5 5 0.9877717
To create a resampling instance:
bc_cv5_in <- makeResampleInstance(desc = cv5, task = bc_task)
Resampling instances make it easy to reproduce model results and to compare models: we can run different models on the same resampling instance, i.e. on the same split of the data, which is fairer than comparing across different splits.
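A minimal sketch of reusing the instance, so that every learner sees exactly the same folds (the rpart learner is only a second example for comparison):
# both calls use the identical folds stored in bc_cv5_in
resample(learner = bc_learner, task = bc_task, resampling = bc_cv5_in, measures = auc)
resample(learner = makeLearner('classif.rpart', predict.type = 'prob'),
         task = bc_task, resampling = bc_cv5_in, measures = auc)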
Goal: tune hyperparameters.
Steps:
First, set the search space:
bc_params <- makeParamSet(
makeIntegerParam('num.trees', lower = 100, upper = 1000),
makeIntegerParam('mtry', lower = 1, upper = 9)
)
More options: makeNumericParam(), makeDiscreteParam(), makeLogicalParam(), and related constructors cover the other parameter types; a sketch follows below.
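A minimal sketch mixing parameter types for ranger (the bounds and values are only illustrative):
bc_params_2 <- makeParamSet(
  makeNumericParam('sample.fraction', lower = 0.5, upper = 1),
  makeDiscreteParam('splitrule', values = c('gini', 'extratrees')),
  makeLogicalParam('replace')
)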
Second, select the optimization algorithm:
bc_control = makeTuneControlRandom(maxit = 5L)
More options: makeTuneControlGrid() for grid search, makeTuneControlIrace() for iterated F-racing, and makeTuneControlMBO() for model-based optimization, among others; a sketch follows below.
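For example, a grid-search control instead (a minimal sketch; resolution is the number of values tried per parameter):
bc_control_grid = makeTuneControlGrid(resolution = 5L)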
Third, we select the resampling strategy and the performance measure for the tuning. In this case, let's reuse the cv5 resampling description we defined above.
bc_tune <- tuneParams(learner = bc_learner,
task = bc_task,
resampling = cv5,
par.set = bc_params,
control = bc_control,
measures = auc)
## [Tune] Started tuning learner bc ranger for parameter set:
## Type len Def Constr Req Tunable Trafo
## num.trees integer - - 100 to 1e+03 - TRUE -
## mtry integer - - 1 to 9 - TRUE -
## With control class: TuneControlRandom
## Imputation value: -0
## [Tune-x] 1: num.trees=101; mtry=3
## [Tune-y] 1: auc.test.mean=0.9891311; time: 0.0 min
## [Tune-x] 2: num.trees=560; mtry=8
## [Tune-y] 2: auc.test.mean=0.9872177; time: 0.0 min
## [Tune-x] 3: num.trees=741; mtry=4
## [Tune-y] 3: auc.test.mean=0.9892417; time: 0.0 min
## [Tune-x] 4: num.trees=869; mtry=2
## [Tune-y] 4: auc.test.mean=0.9905993; time: 0.0 min
## [Tune-x] 5: num.trees=586; mtry=3
## [Tune-y] 5: auc.test.mean=0.9898753; time: 0.0 min
## [Tune] Result: num.trees=869; mtry=2 : auc.test.mean=0.9905993
To access the tuning results:
# best-performing parameters
bc_tune$x
## $num.trees
## [1] 869
##
## $mtry
## [1] 2
# best performance
bc_tune$y
## auc.test.mean
## 0.9905993
To update the learner with the best parameters:
final_bc_learner <- setHyperPars(learner = bc_learner,
par.vals = bc_tune$x)
final_bc_model <- train(learner = final_bc_learner,
task = bc_task,
subset = in_train)
final_bc_predict <- predict(object = final_bc_model,
task = bc_task,
subset = in_test)
To visualize the tuning effect:
tune_data = generateHyperParsEffectData(tune.result = bc_tune)
plotHyperParsEffect(hyperpars.effect.data = tune_data,
                    x = "iteration", y = "auc.test.mean",
                    plot.type = "line")
Goal: benchmark multiple learners.
Steps:
First, build the individual learners to compare and put them in a list.
# rpart learner
bc_learner_rpart <- makeLearner(cl = 'classif.rpart',
predict.type = 'prob',
id = 'bc rpart')
# list learners together
learners <- list(bc_learner, bc_learner_rpart)
Second, choose the resampling strategy and run the benchmark. Note that we are comparing the learners with their default hyperparameters, before tuning. If you want each model to pull out its A-game for the benchmarking, make sure to tune the parameters beforehand (one option is sketched below).
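One way to do this within the benchmark itself is to wrap a learner in makeTuneWrapper(), so that tuning happens inside each resampling iteration; a minimal sketch reusing the search space and control from the tuning section:
# nested tuning: the wrapped learner tunes itself on every training fold of the benchmark
bc_learner_tuned <- makeTuneWrapper(learner = bc_learner,
                                    resampling = cv5,
                                    measures = auc,
                                    par.set = bc_params,
                                    control = bc_control)
# you would then pass list(bc_learner_tuned, bc_learner_rpart) to benchmark() below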
# choose the resampling strategy
bc_cv5_in <- makeResampleInstance(desc = cv5, task = bc_task)
bc_benchmark <- benchmark(learners = learners,
task = bc_task,
resamplings = bc_cv5_in,
measures = auc,
keep.pred = FALSE) # drop predictions to conserve memory
## Task: breast cancer classification, Learner: bc ranger
## Resampling: cross-validation
## Measures: auc
## [Resample] iter 1: 0.9984149
## [Resample] iter 2: 0.9769022
## [Resample] iter 3: 0.9900177
## [Resample] iter 4: 0.9979396
## [Resample] iter 5: 0.9894689
##
## Aggregated Result: auc.test.mean=0.9905486
##
## Task: breast cancer classification, Learner: bc rpart
## Resampling: cross-validation
## Measures: auc
## [Resample] iter 1: 0.9626359
## [Resample] iter 2: 0.9438406
## [Resample] iter 3: 0.9212511
## [Resample] iter 4: 0.9568452
## [Resample] iter 5: 0.9345238
##
## Aggregated Result: auc.test.mean=0.9438193
##
Third, understand the benchmark results.
# performance per iteration
getBMRPerformances(bc_benchmark)
## $`breast cancer classification`
## $`breast cancer classification`$`bc ranger`
## iter auc
## 1 1 0.9984149
## 2 2 0.9769022
## 3 3 0.9900177
## 4 4 0.9979396
## 5 5 0.9894689
##
## $`breast cancer classification`$`bc rpart`
## iter auc
## 1 1 0.9626359
## 2 2 0.9438406
## 3 3 0.9212511
## 4 4 0.9568452
## 5 5 0.9345238
# aggregate performance
getBMRAggrPerformances(bc_benchmark, as.df = TRUE) # as.df makes the results easier to read
## task.id learner.id auc.test.mean
## 1 breast cancer classification bc ranger 0.9905486
## 2 breast cancer classification bc rpart 0.9438193
Finally, visualize the benchmark results:
plotBMRBoxplots(bmr = bc_benchmark,
measure = auc)