Example R code solutions for the Data Science Module Computer Lab 11B, which uses the caret
R package (Kuhn et al. 2021) and Portuguese wine data obtained from the UCI Machine Learning Repository (2009) (originally collected by Cortez et al. (2009)), are presented below.
This computer lab is designed to run alongside the content in the Introduction to Machine Learning in R supplement. It might be helpful to have this material open as you look through these solutions.
# Specify required packages
ml_packages <- c("caret", "gbm", "kernlab", "magrittr", "randomForest", "rpart.plot")
# Install missing packages
install.packages(setdiff(ml_packages, rownames(installed.packages())))
# Load all packages
lapply(ml_packages, library, character.only = TRUE)
No answer required.
The R code below should have been run:
# Read in the data and convert quality to a factor
red_wine <- read.csv(file = "winequality_red.csv", header = TRUE)
red_wine$quality <- as.factor(red_wine$quality)

# Centre and scale the 11 feature variables (column 12 is quality)
centre_scale <- preProcess(red_wine[, -12],
                           method = c("center", "scale"))
red_wine_updated <- predict(centre_scale, red_wine)

# Create an 80/20 training/validation split
set.seed(1650)
wine_train_index <- createDataPartition(red_wine_updated$quality,
                                        p = 0.8,
                                        list = FALSE, times = 1)
red_wine_train <- red_wine_updated[wine_train_index, ]
red_wine_validate <- red_wine_updated[-wine_train_index, ]
Please note that for all the models in this section, we run the set.seed(1650) command prior to training the model, so that the results discussed here remain reproducible each time this document is generated. If you do not set a seed prior to training your models, your results may differ slightly.
set.seed(1650)
red_wine_dec_tree <- train(quality ~ .,
                           data = red_wine_train,
                           method = "rpart")
red_wine_dec_tree
## CART
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01221167 0.5737107 0.3033509
## 0.02374491 0.5657458 0.2697084
## 0.25237449 0.4719068 0.1116919
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01221167.
Example R code is provided below:
set.seed(1650)
red_wine_dec_tree_tuned <- train(quality ~ .,
                                 data = red_wine_train,
                                 method = "rpart",
                                 tuneGrid = expand.grid(cp = seq(0.001, 0.01, 0.001)))
red_wine_dec_tree_tuned
## CART
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.001 0.5646999 0.3079162
## 0.002 0.5708469 0.3163015
## 0.003 0.5707549 0.3145723
## 0.004 0.5719805 0.3150799
## 0.005 0.5772597 0.3207074
## 0.006 0.5763066 0.3148442
## 0.007 0.5778125 0.3147239
## 0.008 0.5758359 0.3104433
## 0.009 0.5719269 0.3024742
## 0.010 0.5706992 0.2998836
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.007.
We observe that by varying the cp
value, we have been able to achieve a slightly higher accuracy of 57.78%, for a cp
value of 0.007.
tr_control <- trainControl(method = "cv",
                           number = 10)

set.seed(1650)
red_wine_dec_tree_tuned_cv <- train(quality ~ .,
                                    data = red_wine_train,
                                    trControl = tr_control,
                                    method = "rpart",
                                    tuneGrid = expand.grid(cp = seq(0.001, 0.01, 0.001)))
red_wine_dec_tree_tuned_cv
## CART
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1153, 1154, 1153, 1153, 1154, 1155, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.001 0.5779864 0.3280322
## 0.002 0.5936118 0.3491612
## 0.003 0.5865987 0.3380346
## 0.004 0.5874105 0.3362951
## 0.005 0.5765459 0.3164578
## 0.006 0.5773392 0.3155442
## 0.007 0.5687819 0.2978743
## 0.008 0.5601876 0.2816416
## 0.009 0.5570626 0.2726912
## 0.010 0.5633128 0.2799621
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.002.
Note that 10 folds are specified here for the cv resampling method so that the computation time isn't too long.
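If computation time is less of a concern, repeated cross-validation is another resampling option available through trainControl. A minimal optional sketch (the choice of 3 repeats is illustrative, not part of the lab):
# Optional: 10-fold cross-validation repeated 3 times (slower, but averages
# performance over more resamples). Not used for the solutions below.
tr_control_repeated <- trainControl(method = "repeatedcv",
                                    number = 10,
                                    repeats = 3)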
The best accuracy achieved by each of our Decision Tree models is presented below:
- red_wine_dec_tree: 57.37% accuracy, cp = 0.01221167
- red_wine_dec_tree_tuned: 57.78% accuracy, cp = 0.007
- red_wine_dec_tree_tuned_cv: 59.36% accuracy, cp = 0.002

The tuned Decision Tree with the cv resampling method produced the best results. The top accuracy of 59.36% is not exceptional, but by adjusting our code we have been able to increase accuracy by roughly 2 percentage points, which is worthwhile.
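If you prefer not to read these values off the printed output, each train object also stores them directly; a small optional check:
# Tuning parameter value selected for the cross-validated tree
red_wine_dec_tree_tuned_cv$bestTune
# Best resampled accuracy for that model
max(red_wine_dec_tree_tuned_cv$results$Accuracy)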
ggplot(red_wine_dec_tree)
ggplot(red_wine_dec_tree_tuned)
ggplot(red_wine_dec_tree_tuned_cv)
We can see that the best results are achieved when the complexity parameter is small.
rpart.plot(red_wine_dec_tree$finalModel)
rpart.plot(red_wine_dec_tree_tuned$finalModel)
rpart.plot(red_wine_dec_tree_tuned_cv$finalModel)
Only the first two decision tree visualisations are informative - the red_wine_dec_tree_tuned_cv plot has too many branches to assess quickly and easily. We can see that the models are limited to predicting quality scores of 5, 6 or 7, which explains their poor performance.
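One optional way to confirm which quality scores the trees can actually predict is to tabulate their predictions on the training data (this check is not part of the lab solution):
# Tabulate the classes predicted by the default and cross-validated trees
table(predict(red_wine_dec_tree, newdata = red_wine_train))
table(predict(red_wine_dec_tree_tuned_cv, newdata = red_wine_train))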
set.seed(1650)
red_wine_rf <- train(quality ~ .,
                     data = red_wine_train,
                     method = "rf")
red_wine_rf
## Random Forest
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6700665 0.4633930
## 6 0.6627082 0.4551261
## 11 0.6582344 0.4497119
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
The best accuracy achieved by the red_wine_rf random forest model is 67.01%, for an mtry value of 2. This is already much better than our best decision tree model!
set.seed(1650)
red_wine_rf_tuned <- train(quality ~ .,
                           data = red_wine_train,
                           method = "rf",
                           tuneGrid = expand.grid(mtry = c(1, 3)))
red_wine_rf_tuned
## Random Forest
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.6646680 0.4513419
## 3 0.6669872 0.4600541
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
set.seed(1650)
red_wine_rf_tuned_cv <- train(quality ~ .,
                              data = red_wine_train,
                              trControl = tr_control,
                              method = "rf",
                              tuneGrid = expand.grid(mtry = c(1:3)))
red_wine_rf_tuned_cv
## Random Forest
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1153, 1154, 1153, 1153, 1154, 1155, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.7012984 0.5123231
## 2 0.6958356 0.5053334
## 3 0.7020552 0.5177138
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
The best accuracy achieved by each of our Random Forest models is presented below:
- red_wine_rf: 67.01% accuracy, mtry = 2
- red_wine_rf_tuned: 66.70% accuracy, mtry = 3
- red_wine_rf_tuned_cv: 70.21% accuracy, mtry = 3

The tuned Random Forest model with the cv resampling method produced the best results. The top accuracy of 70.21% is much better than any of our previous results.
ggplot(red_wine_rf)
dotPlot(varImp(red_wine_rf))
The feature variables considered most important were alcohol
, followed by volatile.acidity
and total.sulfur.dioxide
.
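If you want the exact importance scores rather than just the dot plot, varImp() can also be printed directly (optional):
# Importance scores are scaled so that the most important variable is 100
varImp(red_wine_rf)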
ggplot(red_wine_rf_tuned)
ggplot(red_wine_rf_tuned_cv)
dotPlot(varImp(red_wine_rf_tuned))
dotPlot(varImp(red_wine_rf_tuned_cv))
Here we can see that the optimal number of randomly selected predictors (feature variables) differed between models, depending on the adjustment made (changing the tuning grid or the resampling method).
The alcohol
variable remained the single most important feature variable to include in a model, while free.sulfur.dioxide
was the least important feature variable across all models.
set.seed(1650)
red_wine_boosted <- train(quality ~ .,
                          data = red_wine_train,
                          method = "gbm",
                          verbose = FALSE)
red_wine_boosted
## Stochastic Gradient Boosting
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.6049492 0.3554880
## 1 100 0.6103872 0.3703725
## 1 150 0.6081707 0.3688512
## 2 50 0.6133524 0.3748142
## 2 100 0.6178042 0.3853569
## 2 150 0.6160044 0.3838989
## 3 50 0.6195430 0.3865898
## 3 100 0.6225492 0.3934048
## 3 150 0.6250933 0.3986685
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
The best accuracy achieved by the red_wine_boosted
model is 62.51%, when interaction depth was 3 and the number of trees was 150.
set.seed(1650)
red_wine_boosted_tuned <- train(quality ~ .,
                                data = red_wine_train,
                                method = "gbm",
                                tuneGrid = expand.grid(interaction.depth = 3:6,
                                                       n.trees = seq(50, 200, 50),
                                                       shrinkage = 0.1,
                                                       n.minobsinnode = 10),
                                verbose = FALSE)
red_wine_boosted_tuned
## Stochastic Gradient Boosting
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 3 50 0.6198300 0.3862094
## 3 100 0.6213315 0.3914632
## 3 150 0.6283839 0.4033432
## 3 200 0.6290806 0.4059713
## 4 50 0.6223524 0.3915968
## 4 100 0.6266637 0.4001994
## 4 150 0.6325701 0.4106020
## 4 200 0.6331125 0.4120082
## 5 50 0.6262532 0.3985373
## 5 100 0.6320015 0.4094948
## 5 150 0.6372771 0.4186823
## 5 200 0.6411313 0.4249792
## 6 50 0.6313038 0.4050127
## 6 100 0.6391566 0.4204912
## 6 150 0.6472912 0.4339037
## 6 200 0.6475264 0.4345551
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 200, interaction.depth =
## 6, shrinkage = 0.1 and n.minobsinnode = 10.
set.seed(1650)
red_wine_boosted_tuned_cv <- train(quality ~ .,
                                   data = red_wine_train,
                                   method = "gbm",
                                   trControl = tr_control,
                                   tuneGrid = expand.grid(interaction.depth = 3:6,
                                                          n.trees = seq(50, 200, 50),
                                                          shrinkage = 0.1,
                                                          n.minobsinnode = 10),
                                   verbose = FALSE)
red_wine_boosted_tuned_cv
## Stochastic Gradient Boosting
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1153, 1154, 1153, 1153, 1154, 1155, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 3 50 0.6319423 0.4022170
## 3 100 0.6334745 0.4082981
## 3 150 0.6357999 0.4113483
## 3 200 0.6459566 0.4290713
## 4 50 0.6225426 0.3906675
## 4 100 0.6365446 0.4180816
## 4 150 0.6443999 0.4307405
## 4 200 0.6412747 0.4280012
## 5 50 0.6318631 0.4028686
## 5 100 0.6467136 0.4322859
## 5 150 0.6474641 0.4356158
## 5 200 0.6537265 0.4461991
## 6 50 0.6342495 0.4073552
## 6 100 0.6490023 0.4356075
## 6 150 0.6599035 0.4545666
## 6 200 0.6607093 0.4557511
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 200, interaction.depth =
## 6, shrinkage = 0.1 and n.minobsinnode = 10.
The best accuracy achieved by each of our Gradient Boosting Machine models is presented below:
- red_wine_boosted: 62.51% accuracy, n.trees = 150, interaction.depth = 3
- red_wine_boosted_tuned: 64.75% accuracy, n.trees = 200, interaction.depth = 6
- red_wine_boosted_tuned_cv: 66.07% accuracy, n.trees = 200, interaction.depth = 6

The tuned Gradient Boosting Machine model with the cv resampling method produced the best results, for n.trees = 200 and interaction.depth = 6. The top accuracy of 66.07% is much better than the 62.51% accuracy achieved by the model with the default settings.
plot(red_wine_boosted)
plot(red_wine_boosted_tuned)
plot(red_wine_boosted_tuned_cv)
These graphs make it easy to see the best combination of tree depth and iterations to use. Generally, the larger the maximum tree depth, the more accurate the model was.
Example R code for the different machine learning models is provided below:
set.seed(1650)
red_wine_lda <- train(quality ~ .,
                      data = red_wine_train,
                      method = "lda")
red_wine_lda
## Linear Discriminant Analysis
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5943597 0.3549375
The best accuracy achieved by this method is 59.44%.
set.seed(1650)
red_wine_svm <- train(quality ~ .,
                      data = red_wine_train,
                      method = "svmLinear")
red_wine_svm
## Support Vector Machines with Linear Kernel
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5902292 0.3242403
##
## Tuning parameter 'C' was held constant at a value of 1
The best accuracy achieved by this method is 59.02%.
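Although the lab does not require it, the cost parameter C of the linear SVM could be tuned via tuneGrid in the same way as the earlier models; a sketch with illustrative (not prescribed) values of C:
# Optional: try a small grid of cost values for the linear SVM
set.seed(1650)
red_wine_svm_tuned <- train(quality ~ .,
                            data = red_wine_train,
                            method = "svmLinear",
                            tuneGrid = expand.grid(C = c(0.25, 0.5, 1, 2)))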
set.seed(1650)
red_wine_knn <- train(quality ~ .,
                      data = red_wine_train,
                      method = "knn")
red_wine_knn
## k-Nearest Neighbors
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.5388836 0.2733503
## 7 0.5467209 0.2812282
## 9 0.5518291 0.2852813
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
The best accuracy achieved by this method is 55.18%, for \(k\) = 9.
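By default caret only tries k = 5, 7 and 9. If you wanted to explore further (optional, and not part of this solution), a wider grid of k values could be supplied:
# Optional: search over a wider range of neighbourhood sizes
set.seed(1650)
red_wine_knn_tuned <- train(quality ~ .,
                            data = red_wine_train,
                            method = "knn",
                            tuneGrid = expand.grid(k = seq(5, 25, 2)))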
No answer required.
Example code is provided below.
This is one way to check the predictive accuracies using the validation data.
For this check, we have chosen to use the models:
red_wine_dec_tree_tuned_cv
red_wine_rf_tuned_cv
red_wine_boosted_tuned_cv
red_wine_lda
red_wine_svm
red_wine_knn
# Number of observations in the validation set
validation_numbers <- dim(red_wine_validate)[1]

predict_dec_tree_tuned_cv <- predict(red_wine_dec_tree_tuned_cv,
                                     newdata = red_wine_validate)
predict_rf_tuned_cv <- predict(red_wine_rf_tuned_cv,
                               newdata = red_wine_validate)
predict_boosted_tuned_cv <- predict(red_wine_boosted_tuned_cv,
                                    newdata = red_wine_validate)
predict_lda <- predict(red_wine_lda,
                       newdata = red_wine_validate)
predict_svm <- predict(red_wine_svm,
                       newdata = red_wine_validate)
predict_knn <- predict(red_wine_knn,
                       newdata = red_wine_validate)
dec_tree_tuned_cv_accuracy <- round(100*sum(predict_dec_tree_tuned_cv == red_wine_validate$quality) / validation_numbers, 2)
rf_tuned_cv_accuracy <- round(100*sum(predict_rf_tuned_cv == red_wine_validate$quality) / validation_numbers, 2)
boosted_tuned_cv_accuracy <- round(100*sum(predict_boosted_tuned_cv == red_wine_validate$quality) / validation_numbers, 2)
lda_accuracy <- round(100*sum(predict_lda == red_wine_validate$quality) / validation_numbers, 2)
svm_accuracy <- round(100*sum(predict_svm == red_wine_validate$quality) / validation_numbers, 2)
knn_accuracy <- round(100*sum(predict_knn == red_wine_validate$quality) / validation_numbers, 2)
dec_tree_tuned_cv_accuracy
## [1] 57.73
rf_tuned_cv_accuracy
## [1] 66.88
boosted_tuned_cv_accuracy
## [1] 62.15
lda_accuracy
## [1] 57.1
svm_accuracy
## [1] 55.21
knn_accuracy
## [1] 55.52
From these results, we can see that the model that performed best when provided with the validation data was the Random Forest model with the tuned parameters and the cv resampling method. This model achieved an accuracy of 66.88% using the validation data. This is a few percentage points less than the 70.21% accuracy the model achieved during training, but is still quite good.
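Another option is caret's confusionMatrix() function, which reports the overall accuracy along with per-class statistics; an optional sketch for the best-performing model:
# Cross-tabulate predictions against the true quality scores
confusionMatrix(data = predict_rf_tuned_cv,
                reference = red_wine_validate$quality)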
results_boot <- resamples(list(decision_tree = red_wine_dec_tree,
                               decision_tree_tuned = red_wine_dec_tree_tuned,
                               random_forest = red_wine_rf,
                               random_forest_tuned = red_wine_rf_tuned,
                               gradient_boosted = red_wine_boosted,
                               gradient_boosted_tuned = red_wine_boosted_tuned,
                               linear_disc_analysis = red_wine_lda,
                               support_vector_machine = red_wine_svm,
                               k_nearest_neighbours = red_wine_knn))

results_cv <- resamples(list(decision_tree_tuned_cv = red_wine_dec_tree_tuned_cv,
                             random_forest_tuned_cv = red_wine_rf_tuned_cv,
                             gradient_boosted_tuned_cv = red_wine_boosted_tuned_cv))
summary(results_boot)
##
## Call:
## summary.resamples(object = results_boot)
##
## Models: decision_tree, decision_tree_tuned, random_forest, random_forest_tuned, gradient_boosted, gradient_boosted_tuned, linear_disc_analysis, support_vector_machine, k_nearest_neighbours
## Number of resamples: 25
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu.
## decision_tree 0.5353319 0.5614407 0.5720430 0.5737107 0.5883621
## decision_tree_tuned 0.5354167 0.5578947 0.5819328 0.5778125 0.5953878
## random_forest 0.6493776 0.6615721 0.6666667 0.6700665 0.6783370
## random_forest_tuned 0.6465517 0.6568421 0.6646091 0.6669872 0.6710240
## gradient_boosted 0.5913978 0.6196581 0.6247241 0.6250933 0.6361656
## gradient_boosted_tuned 0.6158537 0.6353712 0.6427015 0.6475264 0.6622517
## linear_disc_analysis 0.5505376 0.5871965 0.5982340 0.5943597 0.6092437
## support_vector_machine 0.5268817 0.5761589 0.5897959 0.5902292 0.6021505
## k_nearest_neighbours 0.5031983 0.5320088 0.5545852 0.5518291 0.5664488
## Max. NA's
## decision_tree 0.6247241 0
## decision_tree_tuned 0.6120690 0
## random_forest 0.6966527 0
## random_forest_tuned 0.7032258 0
## gradient_boosted 0.6572052 0
## gradient_boosted_tuned 0.6857143 0
## linear_disc_analysis 0.6193416 0
## support_vector_machine 0.6260163 0
## k_nearest_neighbours 0.6004274 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu.
## decision_tree 0.2520370 0.2891515 0.3081993 0.3033509 0.3189378
## decision_tree_tuned 0.2577697 0.2867525 0.3221000 0.3147239 0.3442703
## random_forest 0.4344375 0.4486099 0.4570372 0.4633930 0.4771286
## random_forest_tuned 0.4257512 0.4508698 0.4570846 0.4600541 0.4679275
## gradient_boosted 0.3485645 0.3793564 0.4000525 0.3986685 0.4133925
## gradient_boosted_tuned 0.3799589 0.4106077 0.4280093 0.4345551 0.4640305
## linear_disc_analysis 0.2903666 0.3414276 0.3594764 0.3549375 0.3750534
## support_vector_machine 0.2371308 0.2945116 0.3234587 0.3242403 0.3575354
## k_nearest_neighbours 0.2197641 0.2606944 0.2799916 0.2852813 0.3089489
## Max. NA's
## decision_tree 0.3814905 0
## decision_tree_tuned 0.3683173 0
## random_forest 0.5110820 0
## random_forest_tuned 0.5179211 0
## gradient_boosted 0.4489159 0
## gradient_boosted_tuned 0.4933856 0
## linear_disc_analysis 0.3906424 0
## support_vector_machine 0.3932027 0
## k_nearest_neighbours 0.3707461 0
summary(results_cv)
##
## Call:
## summary.resamples(object = results_cv)
##
## Models: decision_tree_tuned_cv, random_forest_tuned_cv, gradient_boosted_tuned_cv
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu.
## decision_tree_tuned_cv 0.5390625 0.5747638 0.5921506 0.5936118 0.6191406
## random_forest_tuned_cv 0.6562500 0.6803546 0.7004300 0.7020552 0.7304688
## gradient_boosted_tuned_cv 0.6171875 0.6491188 0.6588337 0.6607093 0.6783703
## Max. NA's
## decision_tree_tuned_cv 0.6434109 0
## random_forest_tuned_cv 0.7421875 0
## gradient_boosted_tuned_cv 0.6953125 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu.
## decision_tree_tuned_cv 0.2610568 0.3201660 0.3444430 0.3491612 0.3882586
## random_forest_tuned_cv 0.4379802 0.4828595 0.5202281 0.5177138 0.5578036
## gradient_boosted_tuned_cv 0.3799308 0.4373075 0.4571198 0.4557511 0.4821160
## Max. NA's
## decision_tree_tuned_cv 0.4319900 0
## random_forest_tuned_cv 0.5840063 0
## gradient_boosted_tuned_cv 0.5103002 0
dotplot(results_boot)
dotplot(results_cv)
We trained Decision Tree, Random Forest, Gradient Boosting Machine, Linear Discriminant Analysis, Support Vector Machine and k-Nearest Neighbour machine learning models on the Portuguese red wine data winequality_red.csv.
The Random Forest models had the best overall accuracy based on the training data, at 70.21% with tuned parameters and the cv resampling method. This was supported by the validation data test, for which the selected Random Forest model achieved an accuracy score of 66.88%.
Based on our results, we would recommend using the Random Forest machine learning model for this data. We do note however that the model can take some time to run.
It is worth noting here that there are other more advanced models which we haven’t tried that could lead to higher accuracy scores.
These notes have been prepared by Rupert Kuveke. Please note that some of the content in these notes has been developed from content in Thulin (2021). The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.