---
title: <center> A hidden function in MLR </center>
subtitle: <center> plotLearnerPrediction() function </center>
author: <center> Oleg Baydakov </center>
date: <center> December 02, 2017 </center>
output: 
  html_document: 
    code_download: yes
    code_folding: show
    df_print: kable
    number_sections: yes
    theme: flatly
    toc: yes
    toc_float: yes
---

<center>![](hidden_function_mlr.png){ width=50%}</center>
<br>

# Introduction

The **plotLearnerPrediction()** function combines the following steps into a single procedure:

+ Training and cross-validation: The learner is fitted to the classification task restricted to the selected features, and its performance on these training data is computed. It is then evaluated by k-fold cross-validation (k is set by the ‘cv’ argument, 10 by default): in each iteration the model is trained on k-1 folds and tested on the remaining fold, and the results are averaged. The performance measure is the mean misclassification error (mmce) by default; the user can supply other measures. The learner’s name and the training and cross-validation results are reported in the title of the final graph.

+ Data space: A two-dimensional data space is built from the X and Y variables given in the ‘features’ argument. A scatter plot of the observations is then drawn with the geom_point() function in ggplot2.

+ Prediction boundaries: The background is drawn with the geom_tile() function in ggplot2, using either predicted probabilities (when available) or predicted classes. When probabilities are used, the predicted class is shown by the fill color and the probability is mapped to alpha.
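
The cross-validation step can be illustrated by hand. The chunk below is only a sketch, not mlr's internal code: it assumes plain (unstratified) k-fold cross-validation, fits an rpart tree directly, and the names `k`, `fold` and `mmce_per_fold` are made up for this illustration.

```{r}
# Hand-rolled k-fold cross-validation of the mean misclassification error (mmce).
# Illustration only; mlr does this internally.
library(rpart)

set.seed(1)
k = 10
# Randomly assign each row of iris to one of k folds
fold = sample(rep(seq_len(k), length.out = nrow(iris)))

mmce_per_fold = sapply(seq_len(k), function(i) {
  train = iris[fold != i, ]
  test  = iris[fold == i, ]
  fit   = rpart(Species ~ Sepal.Length + Sepal.Width, data = train)
  pred  = predict(fit, newdata = test, type = "class")
  mean(pred != test$Species)  # share of misclassified rows in the held-out fold
})

mean(mmce_per_fold)  # cross-validated mmce
```

Averaging the per-fold errors gives a single number of the kind that is printed next to "CV" in the plot title.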

The final output is a ggplot2 object and can therefore be customised with scale_fill_manual() or ggplot2’s theme settings.
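
To make the plotting steps concrete, here is a minimal hand-written sketch of the same idea: predict on a regular grid, draw the grid as tiles, then overlay the observations. It is an approximation under stated assumptions, not the code used inside mlr; the rpart model and the names `bg`, `prob` and `p` are chosen only for this example.

```{r}
# Sketch of a prediction-boundary plot built by hand (not mlr's implementation)
library(rpart)
library(ggplot2)

fit = rpart(Species ~ Petal.Length + Petal.Width, data = iris)

# Regular 100 x 100 grid spanning the range of both features
bg = expand.grid(
  Petal.Length = seq(min(iris$Petal.Length), max(iris$Petal.Length), length.out = 100),
  Petal.Width  = seq(min(iris$Petal.Width),  max(iris$Petal.Width),  length.out = 100)
)

# Class probabilities on the grid, reduced to predicted class and its probability
prob = predict(fit, newdata = bg, type = "prob")
bg$Species = factor(colnames(prob)[max.col(prob, ties.method = "first")],
                    levels = levels(iris$Species))
bg$prob = apply(prob, 1, max)

# Background tiles (fill = class, alpha = probability) with the data on top
p = ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_tile(data = bg, aes(fill = Species, alpha = prob)) +
  geom_point(aes(shape = Species), size = 2) +
  theme_bw()
p
```

Because `p` is an ordinary ggplot2 object, further layers such as scale_fill_manual() can be added to it with `+`.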

# Example: The IRIS classification task
The plot below shows the relationship between sepal length and sepal width, together with the two-dimensional density of each iris species in our dataset:
```{r}
library(ggplot2)

# 2D density contours per species (fill) with the raw observations on top
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  stat_density2d(geom = "polygon", aes(fill = Species, alpha = ..level..)) +
  geom_point(aes(shape = Species), color = "black", size = 2) +
  theme_bw() +
  scale_fill_manual(values = c("#ff0061", "#11a6fc", "#ffae00"))
```
The same plot can be drawn for petal length and petal width:
```{r}
# Same plot for the two petal measurements
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  stat_density2d(geom = "polygon", aes(fill = Species, alpha = ..level..)) +
  geom_point(aes(shape = Species), color = "black", size = 2) +
  theme_bw() +
  scale_fill_manual(values = c("#ff0061", "#11a6fc", "#ffae00"))
```

As usual, we first define a classification task in mlr and then create several learners:
```{r}
# Make a multiclass classification task in mlr
library(mlr)

taskiris = makeClassifTask(id = "iris", data = iris, target = "Species")

# Make 7 different learners (algorithms).
# predict.type = "prob" requests class probabilities instead of class labels.
learnerCART = makeLearner("classif.rpart", id = "CART", predict.type = "prob")

learnerRF = makeLearner("classif.randomForestSRC", id = "RF", predict.type = "prob")

learnerSVM = makeLearner("classif.svm", id = "SVM", predict.type = "prob")

learnerGBM = makeLearner("classif.gbm", id = "GBM", predict.type = "prob")

learnerGLMN = makeLearner("classif.glmnet", id = "Elasticnet", predict.type = "prob")

# KNN is left at the default predict.type (class labels only)
learnerKNN = makeLearner("classif.knn", id = "KNN")

learnerLDA = makeLearner("classif.lda", id = "LDA", predict.type = "prob")
```
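
The KNN learner above is the only one created without predict.type = "prob". A quick way to see which learners can return probabilities is to query their declared properties. The chunk below is a small sketch using mlr's getLearnerProperties() and hasLearnerProperties(); the variable name `supports_prob` is just for illustration, and only three of the learners are checked so that the underlying packages are likely to be installed.

```{r}
library(mlr)

# Properties declared by the KNN learner class
getLearnerProperties("classif.knn")

# Does each learner class declare support for probability predictions?
supports_prob = sapply(
  c("classif.rpart", "classif.lda", "classif.knn"),
  hasLearnerProperties, props = "prob"
)
supports_prob
```

For a learner without the "prob" property, plotLearnerPrediction() falls back to coloring the background by predicted class only.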

# The syntax of the plotLearnerPrediction() function

**`plotLearnerPrediction(learner, task, features = NULL, measures, cv = 10L, ..., gridsize, pointsize = 2, prob.alpha = TRUE, se.band = TRUE, err.col = "white", greyscale = FALSE)`**

Where:

+ learner: the learner object.

+ task: the task object.

+ features: up to 2 features can be given here. By default, the first 2 features of the task are used.

+ measures: the performance measure(s) to evaluate. By default, the default measure for the task is used.

+ cv: the number of cross-validation folds; the result is reported in the plot title. cv=0 means no CV. Default is 10.

+ gridsize: the grid resolution per axis for background predictions. Default is 100 for 2D.

+ pointsize: the point size passed to ggplot2’s geom_point for data points. Default is 2.

+ prob.alpha: a logical argument. Should the alpha value of the background be set to the probability of the predicted class? This allows visualization of the “confidence” of a prediction. If FALSE, only a constant color is displayed in the background for the predicted label. Default is TRUE.

+ se.band: for regression in 1D, should a band for the standard error estimate be shown? Default is TRUE.

+ err.col: for classification, the color of misclassified data points. Default is “white”.

+ greyscale: a logical argument. Should the plot be completely greyscale? Default is FALSE.
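
As a quick illustration of several of these arguments at once, the chunk below is a sketch that changes a few defaults. The object names `task`, `lda` and `p` are made up for this example; mmce and acc are measure objects exported by mlr.

```{r}
library(mlr)
library(ggplot2)

task = makeClassifTask(id = "iris", data = iris, target = "Species")
lda  = makeLearner("classif.lda", id = "LDA", predict.type = "prob")

p = plotLearnerPrediction(
  lda, task,
  features   = c("Petal.Length", "Petal.Width"),
  measures   = list(mmce, acc),  # report two measures instead of the default
  cv         = 5L,               # 5-fold cross-validation
  gridsize   = 50,               # coarser background grid
  prob.alpha = FALSE,            # constant background color per predicted class
  err.col    = "black"           # color used for misclassified points
)
p + theme_bw()
```

Lowering cv and gridsize mainly reduces run time; the other arguments only change what is drawn or reported.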

**CART algorithm**
```{r}
plotLearnerPrediction(learnerCART,taskiris,features=c("Sepal.Length","Sepal.Width"),cv=100L,gridsize=100)+scale_fill_manual(values=c("#ff0061","#11a6fc","#ffae00"))+theme_bw()
```
```{r}
plotLearnerPrediction(learnerCART,taskiris,features=c("Petal.Length","Petal.Width"),cv=100L,gridsize=100)+scale_fill_manual(values=c("#ff0061","#11a6fc","#ffae00"))+theme_bw()
```
**Support vector machine algorithm**
```{r}
plotLearnerPrediction(learnerSVM,taskiris,features=c("Sepal.Length","Sepal.Width"),cv=100L,gridsize=100)+scale_fill_manual(values=c("#ff0061","#11a6fc","#ffae00"))+theme_bw()
```
```{r}
plotLearnerPrediction(learnerSVM,taskiris,features=c("Petal.Length","Petal.Width"),cv=100L,gridsize=100)+scale_fill_manual(values=c("#ff0061","#11a6fc","#ffae00"))+theme_bw()
```
**Gradient boosting machine algorithm**
```{r}
plotLearnerPrediction(learnerGBM,taskiris,features=c("Sepal.Length","Sepal.Width"),cv=100L,gridsize=100)+scale_fill_manual(values=c("#ff0061","#11a6fc","#ffae00"))+theme_bw()
```

```{r}
plotLearnerPrediction(learnerGBM,taskiris,features=c("Petal.Length","Petal.Width"),cv=100L,gridsize=100)+scale_fill_manual(values=c("#ff0061","#11a6fc","#ffae00"))+theme_bw()
```
**Elastic net (logistic) algorithm**
```{r}
plotLearnerPrediction(learnerGLMN,taskiris,features=c("Sepal.Length","Sepal.Width"),cv=100L,gridsize=100)+scale_fill_manual(values=c("#ff0061","#11a6fc","#ffae00"))+theme_bw()
```
```{r}
plotLearnerPrediction(learnerGLMN,taskiris,features=c("Petal.Length","Petal.Width"),cv=100L,gridsize=100)+scale_fill_manual(values=c("#ff0061","#11a6fc","#ffae00"))+theme_bw()
```
**KNN algorithm**
```{r}
plotLearnerPrediction(learnerKNN,taskiris,features=c("Sepal.Length","Sepal.Width"),cv=100L,gridsize=100)+scale_fill_manual(values=c("#ff0061","#11a6fc","#ffae00"))+theme_bw()
```
```{r}
plotLearnerPrediction(learnerKNN,taskiris,features=c("Petal.Length","Petal.Width"),cv=100L,gridsize=100)+scale_fill_manual(values=c("#ff0061","#11a6fc","#ffae00"))+theme_bw()
```

**LDA algorithm**
```{r}
plotLearnerPrediction(learnerLDA,taskiris,features=c("Sepal.Length","Sepal.Width"),cv=100L,gridsize=100)+scale_fill_manual(values=c("#ff0061","#11a6fc","#ffae00"))+theme_bw()
```

```{r}
plotLearnerPrediction(learnerLDA,taskiris,features=c("Petal.Length","Petal.Width"),cv=100L,gridsize=100)+scale_fill_manual(values=c("#ff0061","#11a6fc","#ffae00"))+theme_bw()
```
**Random Forest algorithm**
```{r}
plotLearnerPrediction(learnerRF,taskiris,features=c("Sepal.Length","Sepal.Width"),cv=100L,gridsize=100)+scale_fill_manual(values=c("#ff0061","#11a6fc","#ffae00"))+theme_bw()
```
```{r}
plotLearnerPrediction(learnerRF,taskiris,features=c("Petal.Length","Petal.Width"),cv=100L,gridsize=100)+scale_fill_manual(values=c("#ff0061","#11a6fc","#ffae00"))+theme_bw()
```
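
The calls above differ only in the learner and the feature pair, so they could also be generated in a loop. The following is one possible sketch, not part of mlr: the names `learners`, `feature_pairs`, `species_cols` and `plots` are hypothetical, only two learners are included to keep it light, and cv is reduced to 10 folds to shorten the run time.

```{r}
library(mlr)
library(ggplot2)

task = makeClassifTask(id = "iris", data = iris, target = "Species")

# Any learner can be added to this list
learners = list(
  makeLearner("classif.rpart", id = "CART", predict.type = "prob"),
  makeLearner("classif.lda", id = "LDA", predict.type = "prob")
)
feature_pairs = list(
  sepal = c("Sepal.Length", "Sepal.Width"),
  petal = c("Petal.Length", "Petal.Width")
)
species_cols = c("#ff0061", "#11a6fc", "#ffae00")

# One plot per learner / feature-pair combination, stored by name
plots = list()
for (lrn in learners) {
  for (pair in names(feature_pairs)) {
    plots[[paste(lrn$id, pair, sep = "_")]] =
      plotLearnerPrediction(lrn, task, features = feature_pairs[[pair]], cv = 10L) +
      scale_fill_manual(values = species_cols) +
      theme_bw()
  }
}

plots$CART_petal
```

Keeping the plots in a named list makes it easy to print them one by one or to pass them to a layout function later.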

# Conclusion
**plotLearnerPrediction()** is a hidden tool in the mlr package. This useful function generates attractive plots for illustration purposes. These plots provide a lot of information, such as:

+ A visual impression of the model’s performance: its ability to classify instances into two or more classes, including correct classifications and errors.

+ A visual presentation of the underlying mechanism of the model, via the prediction boundaries.

+ The association between two features and their contribution to the model’s prediction.

+ The numerical result of cross-validation: the model’s averaged performance metrics.
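
If only the numerical cross-validation result is needed, it can be obtained without drawing a plot. The sketch below uses mlr's subsetTask() and crossval(); the names `task2d`, `lda` and `res` are chosen for this example, and the fold assignment is random, so the values depend on the seed.

```{r}
library(mlr)

task = makeClassifTask(id = "iris", data = iris, target = "Species")
# Keep only the two features shown in the plots
task2d = subsetTask(task, features = c("Petal.Length", "Petal.Width"))
lda = makeLearner("classif.lda", id = "LDA", predict.type = "prob")

set.seed(1)
res = crossval(lda, task2d, iters = 10L, measures = list(mmce, acc), show.info = FALSE)

res$aggr                 # measures averaged over the 10 folds
head(res$measures.test)  # per-fold values
```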

