We will use the Wine Quality data set from the UCI Machine Learning Repository. The “red wine” data comprise 1599 samples with 11 numerical attributes and one target attribute called quality.
redwine = read.table("winequality-red.csv", header = TRUE, sep = ';')
# redwine$quality = factor(redwine$quality)
names(redwine)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
The quality classes are not balanced, i.e., there are many more average wines than excellent or poor ones.
barplot(table(redwine$quality), col="darkred")
table(redwine$quality)
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
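The relative frequencies make the imbalance explicit; the two middle classes 5 and 6 together cover about 82 percent of all wines:
round(prop.table(table(redwine$quality)), 3)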
NOTE: There are several frameworks wrapping multiple machine learning packages in R, see for example ‘caret’ (Doc), ‘rminer’ (Tutorial), or ‘SuperLearner’ (Guide). We will focus on ‘mlr3’ (Book).
The package ‘mlr3’, “Machine Learning for R”, provides base and extra learners. See the list of learners coming with the additional package ‘mlr3learners’ at https://github.com/mlr-org/mlr3learners.
See also the mlr3examples repository with examples presented at the Rhein-Neckar meetups in late summer 2021.
### Loading the mlr3 package family
library("mlr3verse", quietly = TRUE)
## classification learners
# learner = lrn("classif.rpart") # 56.5 % in 6.2 sec
# learner = mlr3::lrn("classif.svm") # 62.6 % in 23.9 sec
learner = lrn("classif.ranger", num.trees = 50L) # 70.2 % in 8.7 sec
# learner = lrn("classif.xgboost") # 60.4 % in 3.7 sec
print(learner)
<LearnerClassifRanger:classif.ranger>
* Model: -
* Parameters: num.threads=1, num.trees=50
* Packages: mlr3, mlr3learners, ranger
* Predict Type: response
* Feature types: logical, integer, numeric, character, factor, ordered
* Properties: hotstart_backward, importance, multiclass, oob_error, twoclass, weights
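As an aside, the dictionary mlr_learners can be queried for the learners registered by the currently loaded mlr3 packages; a minimal sketch (the variable name learners_tab is arbitrary):
## all registered learners as a table; the 'key' column holds their names
learners_tab = as.data.table(mlr_learners)
head(learners_tab$key)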
We need to change the target column ‘quality’ into a factor to make mlr3 recognize that this is a classification task, not a regression task.
winedata = read.table("winequality-red.csv", header = TRUE, sep = ';')
winedata$quality = factor(winedata$quality)
Define a task, fixing an ID, the data, and the target attribute.
mytask <- TaskClassif$new(id = "redwine",
                          backend = winedata, target = "quality")
Now sample test and training data, train the learner, predict the test data, and estimate the accuracy.
n = nrow(winedata); m = round(0.1*n)
itest <- sample(1:n, m)
itrain <- setdiff(1:n, itest)
learner$train(mytask, row_ids = itrain)
predicts <- learner$predict(mytask, row_ids = itest)
predicts$score(msrs(c("classif.acc", "classif.ce")))
classif.acc classif.ce
0.7125 0.2875
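Note that sample() is not seeded above, so the score will vary from run to run. A reproducible variant of the split, sketched here with mlr3’s partition() helper (assuming a recent mlr3 version, where it returns the train and test row ids):
set.seed(42)                            # fix the random number generator
split = partition(mytask, ratio = 0.9)  # 90 % training, 10 % test rows
learner$train(mytask, row_ids = split$train)
learner$predict(mytask, row_ids = split$test)$score(msr("classif.acc"))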
We can inspect the model generated by the learner and print the confusion matrix.
# learner$model
confusion_matrix <- predicts$confusion
print(confusion_matrix)
truth
response 3 4 5 6 7 8
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 1 1 52 10 2 0
6 1 2 11 52 8 0
7 0 0 0 7 9 3
8 0 0 0 0 0 1
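The confusion matrix confirms the class imbalance problem: the rare classes 3, 4, and 8 are almost never predicted. Since the columns hold the true classes, the per-class recall can be read off directly:
## fraction of correctly classified wines per true quality class
round(diag(confusion_matrix) / colSums(confusion_matrix), 2)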
Now we train the model, predict new cases, and compute the mean of all accuracies with a 10x10-fold cross-validation. We define a resampling method and apply it to the task and learner defined before; the result object contains the scores.
resampling = rsmp("repeated_cv", folds = 10) # or "cv" for a single 10-fold CV
resres = resample(task = mytask, learner = learner, resampling = resampling)
acc = resres$score(msr("classif.acc"))$classif.acc
cat("Accuracy:", mean(acc), " (10-fold cross-validation)", '\n')
Accuracy: 0.7020578 (10-fold cross-validation)
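Averaging the scores by hand is not necessary; the resample result can aggregate them itself:
resres$aggregate(msr("classif.acc"))  # mean accuracy over all 100 iterations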
For timings and accuracies of different base learners see above.
### Extra learners: C50
For classification tasks, the ‘mlr3extralearners’ package provides, e.g., AdaBoostM1, BART, C50, or a Random Forest learner. An overview of all learners can be found here: Search Learners.
C50 is an implementation of Ross Quinlan’s C5.0 decision tree algorithm, the successor of C4.5. The following packages have to be loaded:
# remotes::install_github("mlr-org/mlr3extralearners")
library("mlr3verse", quietly = TRUE)
library("mlr3extralearners")
library(C50)
The data are read in as before:
winedata = read.table("winequality-red.csv", header = TRUE, sep = ';')
winedata$quality = factor(winedata$quality)
Learner and task:
learner = lrn("classif.C50")
mytask <- TaskClassif$new(id = "redwine",
                          backend = winedata, target = "quality")
print(learner)
<LearnerClassifC50:classif.C50>
* Model: -
* Parameters: list()
* Packages: mlr3, mlr3extralearners, C50
* Predict Type: response
* Feature types: numeric, factor, ordered
* Properties: missings, multiclass, twoclass, weights
Apply the resampling function with a 10x10-fold cross-validation:
resampling = rsmp("repeated_cv", folds = 10) # or "cv" for a single 10-fold CV
resres = resample(task = mytask, learner = learner, resampling = resampling)
acc = resres$score(msr("classif.acc"))$classif.acc
cat("Accuracy:", mean(acc), " (10x10-fold cross-validation)", '\n')
It prints out a lot of messages and finishes in about 10 seconds.
INFO [mlr3] Applying learner 'classif.C50' on task 'redwine' (iter 26/100)
....
INFO [mlr3] Applying learner 'classif.C50' on task 'redwine' (iter 10/100)
Accuracy: 0.6152017 (10x10-fold cross-validation)
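The INFO messages come from the ‘lgr’ logger used by mlr3 and can be silenced by raising the logging threshold before calling resample():
## show only warnings and errors from mlr3
lgr::get_logger("mlr3")$set_threshold("warn")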
### AdaBoostM1
‘mlr3’ provides AdaBoostM1 through the WEKA software and the ‘RWeka’ package. Because of problems with installing ‘rJava’, I tried the AdaBoostM1 function in the WEKA interface directly.
It turned out that this implementation is quite fast, even for a 10-fold cross-validation, but the accuracy of 0.425 is very low. So I looked for other implementations of AdaBoostM1 in R packages.
I was not able to get reasonable results with the ‘lightgbm’ extra learner, which returned accuracies as low as 0.01875!
### mlpack
‘mlpack’ is a “flexible machine learning library, written in C++, that aims to provide fast, extensible implementations of cutting-edge machine learning algorithms.”
See the ‘mlpack’ documentation.
library(mlpack)
## integer column for classification
winedata <- read.table("winequality-red.csv", header = TRUE, sep = ';')
n <- nrow(winedata); m <- ncol(winedata)
# Split the labels.
labels <- as.data.frame(winedata$quality)
dataset <- winedata[, 1:11] # selecting 1:m would leak the target and yield 80 % accuracy!
pracma::tic()
# Split the dataset using mlpack.
prepdata <- preprocess_split(input = dataset,
                             input_labels = labels,
                             test_ratio = 0.1,
                             seed = 0, verbose = FALSE)
# training and results
output <- adaboost(training = prepdata$training,
                   labels = prepdata$training_labels,
                   weak_learner = "decision_stump",
                   verbose = FALSE)
model <- output$output_model
# predict labels of test points
output = adaboost(input_model = model,
                  test = prepdata$test)
pracma::toc()
elapsed time is 0.021000 seconds
This is the time for one run. As it turned out, ‘mlpack’ kills the R process if called several times, and then the user has to restart R!
Now print the accuracy:
correct <- sum(output$predictions == prepdata$test_labels)
cat(correct, "out of", length(prepdata$test_labels), "test points correct:",
correct / length(prepdata$test_labels) * 100.0, "%\n")
91 out of 159 test points correct: 57.2327 %
The run-time of 0.02 sec is very good, but of no use if adaboost() kills the R process. This happens with other learners in ‘mlpack’, too, and makes the package unusable.
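An untested idea to still make use of the fast adaboost(): run each call in a fresh R subprocess, e.g. with the ‘callr’ package, so a crash cannot take down the main session. The sketch below trains, predicts, and returns plain vectors from inside the subprocess, because the mlpack model object is an external pointer and cannot be passed between processes:
res <- callr::r(function(dataset, labels) {
  pd  <- mlpack::preprocess_split(input = dataset, input_labels = labels,
                                  test_ratio = 0.1, seed = 0)
  fit <- mlpack::adaboost(training = pd$training,
                          labels = pd$training_labels,
                          weak_learner = "decision_stump")
  out <- mlpack::adaboost(input_model = fit$output_model, test = pd$test)
  list(pred = out$predictions, truth = pd$test_labels)
}, args = list(dataset, labels))
sum(res$pred == res$truth) / NROW(res$truth)  # test accuracy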
### adabag
The ‘adabag’ package implements “Freund and Schapire’s Adaboost.M1 algorithm and Breiman’s Bagging algorithm using classification trees as individual classifiers.” (2018)
library(adabag, quietly = TRUE)
mydata = read.table("winequality-red.csv", header = TRUE, sep = ';')
mydata$quality = as.factor(mydata$quality)
k = 10; n = nrow(mydata); m = round(n/k)
pracma::tic()
acc = numeric(k)
for (i in 1:k) {
  itest <- sample(1:n, m)
  itrain <- setdiff(1:n, itest)
  # training
  ab = boosting(quality ~ ., data = mydata[itrain, ],
                boos = FALSE, mfinal = 80)
  pr = predict(ab, mydata[itest, 1:11])
  acc[i] = sum(pr$class == mydata[itest, 12]) / m
}
pracma::toc() # ~2 minutes for these 10 runs; 100 runs (10x10 CV) would take ~20 minutes
cat("Accuracy:", mean(acc), '\n') # Accuracy: 0.60
elapsed time is 112.947000 seconds
Accuracy: 0.60375
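As a side note, the fitted boosting object also reports variable importances; for the model from the last iteration of the loop above:
## relative importance of the predictors in the last 'adabag' model
sort(ab$importance, decreasing = TRUE)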
NOTE: There is ‘rbooster’ (2021) for ready-made or custom classifiers, while the packages ‘sboost’ (2019) and ‘fastAdaboost’ (2016) only support binary classification tasks.
And there is ‘AdaBoostM1’ in WEKA, callable through ‘RWeka’, which is relatively fast but reaches only 42 % accuracy.
### rbooster
‘rbooster’ is an AdaBoost framework that allows using custom classifiers in boosting, e.g., ‘rpart’, ‘glm’, or ‘earth’ (2021). More boosting methods, e.g., Gradient Boosting or XGBoost, are planned to be added.
library(rbooster)
We will use a randomized 10-fold testing scheme, because this boosting of custom functions is quite slow.
winedata = read.table("winequality-red.csv", header = TRUE, sep = ';')
winedata$quality = factor(winedata$quality)
k = 10; n = nrow(winedata); m = round(n/k)
acc = numeric(k)
pracma::tic()
for (i in 1:k) {
  itest <- sample(1:n, m)
  itrain <- setdiff(1:n, itest)
  res <- booster(x_train = winedata[itrain, 1:11],
                 y_train = winedata[itrain, 12],
                 classifier = "rpart", method = "discrete",
                 x_test = winedata[itest, 1:11],
                 y_test = winedata[itest, 12],
                 weighted_bootstrap = FALSE,
                 max_iter = 80, lambda = 1,
                 print_detail = FALSE, print_plot = FALSE,
                 bag_frac = 0.8, p_weak = 4)
  acc[i] = 1 - tail(res$err_test, 1)  # test error after the last iteration
}
pracma::toc()
cat("Accuracy", mean(acc))
elapsed time is 18.700000 seconds
Accuracy 0.634375
Interestingly, booster can generate a plot that displays the progress during the learning process (a learning curve).
### Rborist
‘Rborist’ provides a scalable and parallelizable implementation of the Random Forest algorithm for classification and regression trees.
library(Rborist)
winedata = read.table("winequality-red.csv", header = TRUE, sep = ';')
k = 10; n = nrow(winedata); m = round(n/k)
X = winedata[, 1:11]
Y = as.factor(winedata[, 12])
We need to set up our own realization of a 10x10-fold cross-validation. The following function returns the indices for a 10-fold split.
split2test <- function(N, k = 10) {
  n = floor(N/k)                   # base size of each fold
  r = N - n*k                      # remainder to distribute
  L = rep(n, k)
  for (l in seq_len(r)) L[l] = L[l] + 1  # first r folds get one extra row
  L = cumsum(c(1, L))
  matrix(c(L[1:k], L[2:(k+1)] - 1), ncol = 2)  # start/end index per fold
}
We generate a matrix M of fold indices for our data with
( M = split2test(nrow(winedata)) )
[,1] [,2]
[1,] 1 160
[2,] 161 320
[3,] 321 480
[4,] 481 640
[5,] 641 800
[6,] 801 960
[7,] 961 1120
[8,] 1121 1280
[9,] 1281 1440
[10,] 1441 1599
and apply a 10x10-fold cross validation like this:
pracma::tic()
Acc = matrix(NA, k, k)
for (i in 1:k) {
  indices = sample(1:n, n)              # new random permutation per repetition
  for (j in 1:k) {
    itest  = indices[M[j, 1]:M[j, 2]]   # the j-th fold is the test set
    itrain = setdiff(1:n, itest)
    rs = Rborist(X[itrain, ], Y[itrain])   # training
    pr = predict(rs, X[itest, ])$yPred     # prediction
    Acc[i, j] = sum(pr == Y[itest]) / length(itest)  # accuracy
  }
}
pracma::toc() # elapsed time is ~100 seconds
cat("Accuracy:", mean(Acc), '\n')
elapsed time is 98.100000 seconds
Accuracy: 0.700625
Let’s display the accuracy matrix to see how big the variations are for an overall accuracy of 70 percent.
round(Acc, 2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.68 0.68 0.68 0.68 0.68 0.66 0.68 0.67 0.67 0.68
[2,] 0.70 0.69 0.70 0.69 0.69 0.69 0.69 0.69 0.68 0.68
[3,] 0.64 0.64 0.64 0.63 0.64 0.62 0.64 0.64 0.64 0.64
[4,] 0.75 0.74 0.74 0.73 0.72 0.76 0.74 0.75 0.74 0.74
[5,] 0.68 0.66 0.68 0.68 0.68 0.68 0.66 0.68 0.67 0.67
[6,] 0.72 0.69 0.71 0.72 0.72 0.72 0.71 0.69 0.71 0.69
[7,] 0.73 0.75 0.74 0.76 0.76 0.74 0.74 0.74 0.74 0.74
[8,] 0.66 0.68 0.64 0.66 0.65 0.64 0.65 0.64 0.66 0.65
[9,] 0.78 0.77 0.78 0.78 0.78 0.76 0.78 0.80 0.79 0.78
[10,] 0.71 0.70 0.70 0.71 0.71 0.71 0.71 0.70 0.69 0.71
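The spread can be quantified, e.g., with the mean accuracy per repetition and the overall standard deviation of all 100 scores:
round(rowMeans(Acc), 3)  # mean accuracy per outer repetition
round(sd(Acc), 4)        # standard deviation over all entries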
The accuracy is comparable to that of ‘ranger’, another implementation of the Random Forest algorithm, but the run time is much longer.
### tabnet
The package ‘tabnet’ fits models for classification and regression on tabular data, using the TabNet deep learning architecture (built on the ‘torch’ package, R’s interface to the PyTorch ‘libtorch’ backend). It is fully compatible with the ‘tidymodels’ ecosystem; for examples see the package vignettes.
library(tabnet)
# library(rsample) # will be loaded with 'tidymodels'
# library(recipes) # same
# library(yardstick, quietly = TRUE) # same
We collect the same data as before.
winedata = read.table("winequality-red.csv", header = TRUE, sep = ';')
winedata$quality = factor(winedata$quality)
k = 10; n = nrow(winedata); m = round(n/k)
itest = sample(1:n, m)
train = winedata[-itest, ]
test = winedata[itest, ]
Next we define a recipe (similar to the task in ‘mlr3’) and fit a deep model.
myrecipe = recipe(quality ~ ., data = train)
fit <- tabnet_fit(myrecipe, train, epochs = 30)
# suppressWarnings(autoplot(fit))
We can take a quick look at the model, with minimal information, via print(fit):
An `nn_module` containing 6,407 parameters.
── Modules ───────────────────────────────────────────────────────────────
• embedder: <embedding_generator> #0 parameters
• tabnet: <tabnet_no_embedding> #6,406 parameters
── Parameters ────────────────────────────────────────────────────────────
• .check: Float [1:1]
We apply the ‘yardstick::metric_set’ function to extract the accuracy,
metrics <- metric_set(accuracy) # , precision, recall
pred = cbind(test, predict(fit, test))
metrics(pred, quality, estimate = .pred_class)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy multiclass 0.55
or we calculate it directly with
acc = sum(pred$quality == pred$.pred_class) / m
cat("Accuracy:", acc, '\n')
Accuracy: 0.55
The time to fit the model with tabnet_fit is too long (~35 sec) to run a full cross-validation cycle.
### Summary
The following table summarizes the accuracies and timings of all learners tried above.
| Learner | Accuracy | Timing | Remark |
|---|---|---|---|
| classif.rpart | 56.5 % | 0.06 sec | |
| classif.xgboost | 60.5 % | 0.04 sec | |
| classif.C50 | 61.5 % | ~ 0.1 sec | |
| classif.svm | 62.5 % | 0.24 sec | |
| classif.lightgbm | - | - | no reasonable results |
| classif.ranger | 70.0 % | 0.09 sec | |
| mlpack::adaboost | 57.0 % | 0.02 sec | kills R process |
| adabag::boosting | 60.0 % | 1.20 sec | |
| RWeka::adaboost | 42.5 % | ~ 0.1 sec | external software |
| rbooster::rpart | 63.5 % | 2.35 sec | improves rpart |
| Rborist::Rborist | 71.0 % | 1.10 sec | |
| tabnet (torch) | 57.8 % | 36.5 sec | connecting to Torch |