We will use the Wine Quality data set from the UCI Machine Learning Repository. The “red wine” data comprise 1599 samples with 11 numerical attributes and one target attribute called quality.
redwine = read.table("winequality-red.csv", header = TRUE, sep = ';')
# redwine$quality = factor(redwine$quality)
names(redwine)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
The quality classes are not balanced, i.e., there are many more average wines than excellent or poor ones.
barplot(table(redwine$quality), col="darkred")
table(redwine$quality)
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
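The relative frequencies make the imbalance explicit; the two middle classes 5 and 6 together cover about 82 percent of all wines:
round(prop.table(table(redwine$quality)), 3)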
NOTE: There are several frameworks wrapping multiple machine learning packages in R, see for example ‘caret’ (Doc), ‘rminer’ (Tutorial), or ‘SuperLearner’ (Guide). We will focus on ‘mlr3’ (Book).
The package ‘mlr3’, “Machine Learning for R”, provides base and extra learners. See the list of learners coming with the additional package ‘mlr3learners’ at https://github.com/mlr-org/mlr3learners.
See also the mlr3examples repository with examples presented at the Rhein-Neckar meetups in late summer 2021.
### Loading the mlr3 package family
library("mlr3verse", quietly = TRUE)
## classification learners
# learner = lrn("classif.rpart") # 56.5 % in 6.2 sec
# learner = mlr3::lrn("classif.svm") # 62.6 % in 23.9 sec
learner = lrn("classif.ranger", num.trees = 50L) # 70.2 % in 8.7 sec
# learner = lrn("classif.xgboost") # 60.4 % in 3.7 sec
print(learner)
<LearnerClassifRanger:classif.ranger>
* Model: -
* Parameters: num.threads=1, num.trees=50
* Packages: mlr3, mlr3learners, ranger
* Predict Type: response
* Feature types: logical, integer, numeric, character, factor, ordered
* Properties: hotstart_backward, importance, multiclass, oob_error, twoclass, weights
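As an aside, the dictionary mlr_learners can be queried for the learners registered by the currently loaded mlr3 packages; a minimal sketch (the variable name learners_tab is arbitrary):
## all registered learners as a table; the 'key' column holds their names
learners_tab = as.data.table(mlr_learners)
head(learners_tab$key)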
We need to change the target column ‘quality’ into a factor to make mlr3 recognize that this is a classification task, not a regression task.
winedata = read.table("winequality-red.csv", header = TRUE, sep = ';')
winedata$quality = factor(winedata$quality)
Define a task, fixing an ID, the data, and the target attribute.
mytask <- TaskClassif$new(id = "redwine",
                          backend = winedata, target = "quality")
Now sample test and training data, train the learner, predict the test data, and estimate the accuracy.
n = nrow(winedata); m = round(0.1*n)
itest <- sample(1:n, m)
itrain <- setdiff(1:n, itest)
learner$train(mytask, row_ids = itrain)
predicts <- learner$predict(mytask, row_ids = itest)
predicts$score(msrs(c("classif.acc", "classif.ce")))
classif.acc classif.ce
0.7125 0.2875
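Note that sample() is not seeded above, so the score will vary from run to run. A reproducible variant of the split, sketched here with mlr3’s partition() helper (assuming a recent mlr3 version, where it returns the train and test row ids):
set.seed(42)                            # fix the random number generator
split = partition(mytask, ratio = 0.9)  # 90 % training, 10 % test rows
learner$train(mytask, row_ids = split$train)
learner$predict(mytask, row_ids = split$test)$score(msr("classif.acc"))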
We can inspect the model generated by the learner and print the confusion matrix.
# learner$model
confusion_matrix <- predicts$confusion
print(confusion_matrix)
truth
response 3 4 5 6 7 8
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 1 1 52 10 2 0
6 1 2 11 52 8 0
7 0 0 0 7 9 3
8 0 0 0 0 0 1
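The confusion matrix confirms the class imbalance problem: the rare classes 3, 4, and 8 are almost never predicted. Since the columns hold the true classes, the per-class recall can be read off directly:
## fraction of correctly classified wines per true quality class
round(diag(confusion_matrix) / colSums(confusion_matrix), 2)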
Now we train the model, predict new cases, and compute the mean of all accuracies with a 10x10-fold cross-validation. We define a resampling method and apply it to the task and learner defined before; the result object contains the scores.
resampling = rsmp("repeated_cv", folds = 10) # or "cv" for a single 10-fold CV
resres = resample(task = mytask, learner = learner, resampling = resampling)
acc = resres$score(msr("classif.acc"))$classif.acc
cat("Accuracy:", mean(acc), " (10-fold cross-validation)", '\n')
Accuracy: 0.7020578 (10-fold cross-validation)
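Averaging the scores by hand is not necessary; the resample result can aggregate them itself:
resres$aggregate(msr("classif.acc"))  # mean accuracy over all 100 iterations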
For timings and accuracies of different base learners see above.
### Extra learners: C50
For classification tasks, the ‘mlr3extralearners’ package provides, e.g., AdaBoostM1, BART, C50, or a Random Forest learner. An overview of all learners can be found here: Search Learners.
C50 is an implementation of Ross Quinlan’s C5.0 decision tree algorithm, the successor of C4.5. The following packages have to be loaded:
# remotes::install_github("mlr-org/mlr3extralearners")
library("mlr3verse", quietly = TRUE)
library("mlr3extralearners")
library(C50)
The data are read in as before:
winedata = read.table("winequality-red.csv", header = TRUE, sep = ';')
winedata$quality = factor(winedata$quality)
Learner and task:
learner = lrn("classif.C50")
mytask <- TaskClassif$new(id = "redwine",
                          backend = winedata, target = "quality")
print(learner)
<LearnerClassifC50:classif.C50>
* Model: -
* Parameters: list()
* Packages: mlr3, mlr3extralearners, C50
* Predict Type: response
* Feature types: numeric, factor, ordered
* Properties: missings, multiclass, twoclass, weights
Apply the resampling function with a 10x10-fold cross-validation:
resampling = rsmp("repeated_cv", folds = 10) # or "cv" for a single 10-fold CV
resres = resample(task = mytask, learner = learner, resampling = resampling)
acc = resres$score(msr("classif.acc"))$classif.acc
cat("Accuracy:", mean(acc), " (10x10-fold cross-validation)", '\n')
It prints out a lot of messages and finishes in about 10 seconds.
INFO [mlr3] Applying learner 'classif.C50' on task 'redwine' (iter 26/100)
....
INFO [mlr3] Applying learner 'classif.C50' on task 'redwine' (iter 10/100)
Accuracy: 0.6152017 (10x10-fold cross-validation)
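The INFO messages come from the ‘lgr’ logger used by mlr3 and can be silenced by raising the logging threshold before calling resample():
## show only warnings and errors from mlr3
lgr::get_logger("mlr3")$set_threshold("warn")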
### AdaBoostM1
‘mlr3’ provides AdaBoostM1 through the WEKA software and the ‘RWeka’ package. Because of problems with installing ‘rJava’, I tried the AdaBoostM1 function in the WEKA interface directly.
It turned out that this implementation is quite fast, even for a 10-fold cross-validation, but the accuracy of 0.425 is very low. So I looked for other implementations of AdaBoostM1 in R packages.
I was not able to get reasonable results with the ‘lightgbm’ extra learner, which returned accuracies as low as 0.01875!
### mlpack
‘mlpack’ is a “flexible machine learning library, written in C++, that aims to provide fast, extensible implementations of cutting-edge machine learning algorithms.”
See the ‘mlpack’ documentation.
library(mlpack)
## integer column for classification
winedata <- read.table("winequality-red.csv", header = TRUE, sep = ';')
n <- nrow(winedata); m <- ncol(winedata)
# Split the labels.
labels <- as.data.frame(winedata$quality)
dataset <- winedata[, 1:11] # selecting 1:m would leak the target and yield 80 % accuracy!
pracma::tic()
# Split the dataset using mlpack.
prepdata <- preprocess_split(input = dataset,
                             input_labels = labels,
                             test_ratio = 0.1,
                             seed = 0, verbose = FALSE)
# training and results
output <- adaboost(training = prepdata$training,
                   labels = prepdata$training_labels,
                   weak_learner = "decision_stump",
                   verbose = FALSE)
model <- output$output_model
# predict labels of test points
output = adaboost(input_model = model,
                  test = prepdata$test)
pracma::toc()
elapsed time is 0.021000 seconds
This is the time for one run. As it turned out, ‘mlpack’ kills the R process if called several times, and then the user has to restart R!
Now print the accuracy:
correct <- sum(output$predictions == prepdata$test_labels)
cat(correct, "out of", length(prepdata$test_labels), "test points correct:",
correct / length(prepdata$test_labels) * 100.0, "%\n")
91 out of 159 test points correct: 57.2327 %
The run-time of 0.02 sec is very good, but of no use if adaboost() kills the R process. This happens with other learners in ‘mlpack’, too, and makes the package unusable.
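An untested idea to still make use of the fast adaboost(): run each call in a fresh R subprocess, e.g. with the ‘callr’ package, so a crash cannot take down the main session. The sketch below trains, predicts, and returns plain vectors from inside the subprocess, because the mlpack model object is an external pointer and cannot be passed between processes:
res <- callr::r(function(dataset, labels) {
  pd  <- mlpack::preprocess_split(input = dataset, input_labels = labels,
                                  test_ratio = 0.1, seed = 0)
  fit <- mlpack::adaboost(training = pd$training,
                          labels = pd$training_labels,
                          weak_learner = "decision_stump")
  out <- mlpack::adaboost(input_model = fit$output_model, test = pd$test)
  list(pred = out$predictions, truth = pd$test_labels)
}, args = list(dataset, labels))
sum(res$pred == res$truth) / NROW(res$truth)  # test accuracy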
### adabag
The ‘adabag’ package implements “Freund and Schapire’s Adaboost.M1 algorithm and Breiman’s Bagging algorithm using classification trees as individual classifiers.” (2018)
library(adabag, quietly = TRUE)
mydata = read.table("winequality-red.csv", header = TRUE, sep = ';')
mydata$quality = as.factor(mydata$quality)
k = 10; n = nrow(mydata); m = round(n/k)
pracma::tic()
acc = numeric(k)
for (i in 1:k) {
  itest <- sample(1:n, m)
  itrain <- setdiff(1:n, itest)
  # training
  ab = boosting(quality ~ ., data = mydata[itrain, ],
                boos = FALSE, mfinal = 80)
  pr = predict(ab, mydata[itest, 1:11])
  acc[i] = sum(pr$class == mydata[itest, 12]) / m
}
pracma::toc() # ~2 minutes for these 10 runs; 100 runs (10x10 CV) would take ~20 minutes
cat("Accuracy:", mean(acc), '\n') # Accuracy: 0.60
elapsed time is 112.947000 seconds
Accuracy: 0.60375
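As a side note, the fitted boosting object also reports variable importances; for the model from the last iteration of the loop above:
## relative importance of the predictors in the last 'adabag' model
sort(ab$importance, decreasing = TRUE)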
NOTE: There is ‘rbooster’ (2021) for ready-made or custom classifiers, while the packages ‘sboost’ (2019) and ‘fastAdaboost’ (2016) only support binary classification tasks.
And there is ‘AdaBoostM1’ in WEKA, callable through ‘RWeka’, which is relatively fast but reaches only 42 % accuracy.
### rbooster
‘rbooster’ is an AdaBoost framework that allows using custom classifiers in boosting, e.g., ‘rpart’, ‘glm’, or ‘earth’ (2021). More boosting methods, e.g., Gradient Boosting or XGBoost, are planned to be added.
library(rbooster)
We will use a randomized 10-fold testing scheme, because this boosting of custom functions is quite slow.
winedata = read.table("winequality-red.csv", header = TRUE, sep = ';')
winedata$quality = factor(winedata$quality)
k = 10; n = nrow(winedata); m = round(n/k)
acc = numeric(k)
pracma::tic()
for (i in 1:k) {
  itest <- sample(1:n, m)
  itrain <- setdiff(1:n, itest)
  res <- booster(x_train = winedata[itrain, 1:11],
                 y_train = winedata[itrain, 12],
                 classifier = "rpart", method = "discrete",
                 x_test = winedata[itest, 1:11],
                 y_test = winedata[itest, 12],
                 weighted_bootstrap = FALSE,
                 max_iter = 80, lambda = 1,
                 print_detail = FALSE, print_plot = FALSE,
                 bag_frac = 0.8, p_weak = 4)
  acc[i] = 1 - tail(res$err_test, 1)  # test error after the last iteration
}
pracma::toc()
cat("Accuracy", mean(acc))
elapsed time is 18.700000 seconds
Accuracy 0.634375
Interestingly, booster can generate a plot that displays the progress during the learning process (a learning curve).
### Rborist
‘Rborist’ provides a scalable and parallelizable implementation of the Random Forest algorithm for classification and regression trees.
library(Rborist)
winedata = read.table("winequality-red.csv", header = TRUE, sep = ';')
k = 10; n = nrow(winedata); m = round(n/k)
X = winedata[, 1:11]
Y = as.factor(winedata[, 12])
We need to set up our own realization of a 10x10-fold cross-validation. The following function returns the indices for a 10-fold split.
split2test <- function(N, k = 10) {
  n = floor(N/k)                   # base size of each fold
  r = N - n*k                      # remainder to distribute
  L = rep(n, k)
  for (l in seq_len(r)) L[l] = L[l] + 1  # first r folds get one extra row
  L = cumsum(c(1, L))
  matrix(c(L[1:k], L[2:(k+1)] - 1), ncol = 2)  # start/end index per fold
}
We generate a matrix M of fold indices for our data with
( M = split2test(nrow(winedata)) )
[,1] [,2]
[1,] 1 160
[2,] 161 320
[3,] 321 480
[4,] 481 640
[5,] 641 800
[6,] 801 960
[7,] 961 1120
[8,] 1121 1280
[9,] 1281 1440
[10,] 1441 1599
and apply a 10x10-fold cross validation like this:
pracma::tic()
Acc = matrix(NA, k, k)
for (i in 1:k) {
  indices = sample(1:n, n)              # new random permutation per repetition
  for (j in 1:k) {
    itest  = indices[M[j, 1]:M[j, 2]]   # the j-th fold is the test set
    itrain = setdiff(1:n, itest)
    rs = Rborist(X[itrain, ], Y[itrain])   # training
    pr = predict(rs, X[itest, ])$yPred     # prediction
    Acc[i, j] = sum(pr == Y[itest]) / length(itest)  # accuracy
  }
}
pracma::toc() # elapsed time is ~100 seconds
cat("Accuracy:", mean(Acc), '\n')
elapsed time is 98.100000 seconds
Accuracy: 0.700625
Let’s display the accuracy matrix to see how big the variations are for an overall accuracy of 70 percent.
round(Acc, 2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.68 0.68 0.68 0.68 0.68 0.66 0.68 0.67 0.67 0.68
[2,] 0.70 0.69 0.70 0.69 0.69 0.69 0.69 0.69 0.68 0.68
[3,] 0.64 0.64 0.64 0.63 0.64 0.62 0.64 0.64 0.64 0.64
[4,] 0.75 0.74 0.74 0.73 0.72 0.76 0.74 0.75 0.74 0.74
[5,] 0.68 0.66 0.68 0.68 0.68 0.68 0.66 0.68 0.67 0.67
[6,] 0.72 0.69 0.71 0.72 0.72 0.72 0.71 0.69 0.71 0.69
[7,] 0.73 0.75 0.74 0.76 0.76 0.74 0.74 0.74 0.74 0.74
[8,] 0.66 0.68 0.64 0.66 0.65 0.64 0.65 0.64 0.66 0.65
[9,] 0.78 0.77 0.78 0.78 0.78 0.76 0.78 0.80 0.79 0.78
[10,] 0.71 0.70 0.70 0.71 0.71 0.71 0.71 0.70 0.69 0.71
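The spread can be quantified, e.g., with the mean accuracy per repetition and the overall standard deviation of all 100 scores:
round(rowMeans(Acc), 3)  # mean accuracy per outer repetition
round(sd(Acc), 4)        # standard deviation over all entries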
The accuracy is comparable to that of ‘ranger’, another implementation of the Random Forest algorithm, but the run time is much longer.
### tabnet
The package ‘tabnet’ fits models for classification and regression on tabular data, using the TabNet deep learning architecture (built on the ‘torch’ package, R’s interface to the PyTorch ‘libtorch’ backend). It is fully compatible with the ‘tidymodels’ ecosystem; for examples see the package vignettes.
library(tabnet)
# library(rsample) # will be loaded with 'tidymodels'
# library(recipes) # same
# library(yardstick, quietly = TRUE) # same
We collect the same data as before.
winedata = read.table("winequality-red.csv", header = TRUE, sep = ';')
winedata$quality = factor(winedata$quality)
k = 10; n = nrow(winedata); m = round(n/k)
itest = sample(1:n, m)
train = winedata[-itest, ]
test = winedata[itest, ]
Next we define a recipe (similar to the task in ‘mlr3’) and fit a deep model.
myrecipe = recipe(quality ~ ., data = train)
fit <- tabnet_fit(myrecipe, train, epochs = 30)
# suppressWarnings(autoplot(fit))
We can take a quick look at the model, with minimal information, via print(fit):
An `nn_module` containing 6,407 parameters.
── Modules ───────────────────────────────────────────────────────────────
• embedder: <embedding_generator> #0 parameters
• tabnet: <tabnet_no_embedding> #6,406 parameters
── Parameters ────────────────────────────────────────────────────────────
• .check: Float [1:1]
We apply the ‘yardstick::metric_set’ function to extract the accuracy,
metrics <- metric_set(accuracy) # , precision, recall
pred = cbind(test, predict(fit, test))
metrics(pred, quality, estimate = .pred_class)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy multiclass 0.55
or we calculate it directly with
acc = sum(pred$quality == pred$.pred_class) / m
cat("Accuracy:", acc, '\n')
Accuracy: 0.55
The time to fit the model with tabnet_fit is too long (~35 sec) to run a full cross-validation cycle.
### Summary
The following table summarizes the accuracies and timings of all learners tried above.
| Learner | Accuracy | Timing | Remark |
|---|---|---|---|
| classif.rpart | 56.5 % | 0.06 sec | |
| classif.xgboost | 60.5 % | 0.04 sec | |
| classif.C50 | 61.5 % | ~ 0.1 sec | |
| classif.svm | 62.5 % | 0.24 sec | |
| classif.lightgbm | - | - | no reasonable results |
| classif.ranger | 70.0 % | 0.09 sec | |
| mlpack::adaboost | 57.0 % | 0.02 sec | kills R process |
| adabag::boosting | 60.0 % | 1.20 sec | |
| RWeka::adaboost | 42.5 % | ~ 0.1 sec | external software |
| rbooster::rpart | 63.5 % | 2.35 sec | improves rpart |
| Rborist::Rborist | 71.0 % | 1.10 sec | |
| tabnet (torch) | 57.8 % | 36.5 sec | connecting to Torch |