Example solutions for the Data Science Computer Lab 11B, which uses the caret
R package (Kuhn et al. 2021) and Portuguese wine data obtained from UCI Machine Learning Repository (2009) (originally collected by Cortez et al. (2009)), are presented below.
This computer lab is designed to run alongside the content in the Introduction to Machine Learning in R supplement. It might be helpful to have this material open as you look through these solutions.
Before we proceed, please make sure you have read all the content in the Introduction to Machine Learning in R supplement and completed Computer Lab 10B. It may also be helpful to:
No answer required.
No answer required.
The following code should have been run at this point:
# Specify required packages
ml_packages <- c("caret", "gbm", "kernlab", "magrittr", "randomForest", "rpart.plot")

# Install missing packages
install.packages(setdiff(ml_packages, rownames(installed.packages())))

# Load all packages
lapply(ml_packages, library, character.only = TRUE)

# Load data
red_wine <- read.csv(file = "winequality_red.csv", header = T)
red_wine$quality <- as.factor(red_wine$quality)

# Centre and scale the 11 feature variables (all columns except quality)
centre_scale <- preProcess(red_wine[, -12],
                           method = c("center", "scale"))
red_wine_updated <- predict(centre_scale, red_wine)

# Split the data into training (80%) and validation (20%) sets
set.seed(1650)
wine_train_index <- createDataPartition(red_wine_updated$quality,
                                        p = 0.8,
                                        list = FALSE, times = 1)
red_wine_train <- red_wine_updated[wine_train_index, ]
red_wine_validate <- red_wine_updated[-wine_train_index, ]

# Fit the baseline decision tree and random forest models
set.seed(1650)
red_wine_dec_tree <- train(quality ~ .,
                           data = red_wine_train,
                           method = "rpart")

set.seed(1650)
red_wine_rf <- train(quality ~ .,
                     data = red_wine_train,
                     method = "rf")
Example code is shown below:
set.seed(1650)
red_wine_dec_tree_tuned <- train(quality ~ .,
                                 data = red_wine_train,
                                 method = "rpart",
                                 tuneGrid = expand.grid(cp = seq(0.001, 0.01, 0.001))
)
red_wine_dec_tree_tuned
## CART
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.001 0.5646999 0.3079162
## 0.002 0.5708469 0.3163015
## 0.003 0.5707549 0.3145723
## 0.004 0.5719805 0.3150799
## 0.005 0.5772597 0.3207074
## 0.006 0.5763066 0.3148442
## 0.007 0.5778125 0.3147239
## 0.008 0.5758359 0.3104433
## 0.009 0.5719269 0.3024742
## 0.010 0.5706992 0.2998836
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.007.
The top accuracy of the red_wine_dec_tree_tuned
model is 57.78%, which is a very slight improvement over the original decision tree accuracy.
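If you would like to verify this comparison yourself, the resampled accuracy for each candidate tuning value is stored in the results element of a caret train object. A minimal sketch (assuming both the red_wine_dec_tree and red_wine_dec_tree_tuned models are still in your workspace) is shown below:

# Top bootstrap accuracy of the original (default-grid) decision tree
max(red_wine_dec_tree$results$Accuracy)

# Top bootstrap accuracy of the manually tuned decision tree
max(red_wine_dec_tree_tuned$results$Accuracy)

# The cp value that caret selected for the tuned tree
red_wine_dec_tree_tuned$bestTune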
Example code is shown below:
set.seed(1650)
red_wine_rf_tuned <- train(quality ~ .,
                           data = red_wine_train,
                           method = "rf",
                           tuneGrid = expand.grid(mtry = c(1:3)))
red_wine_rf_tuned
## Random Forest
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.6668683 0.4555277
## 2 0.6660082 0.4567234
## 3 0.6648004 0.4560781
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
No answer required.
Example code is shown below:
tr_control <- trainControl(method = "cv",
                           number = 25)

set.seed(1650)
red_wine_dec_tree_tuned_cv <- train(quality ~ .,
                                    data = red_wine_train,
                                    trControl = tr_control,
                                    method = "rpart",
                                    tuneGrid = expand.grid(cp = seq(0.001, 0.01, 0.001))
)
red_wine_dec_tree_tuned_cv
## CART
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Cross-Validated (25 fold)
## Summary of sample sizes: 1232, 1230, 1232, 1229, 1231, 1230, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.001 0.5733088 0.3214443
## 0.002 0.5904327 0.3428007
## 0.003 0.5892240 0.3396075
## 0.004 0.5909453 0.3411524
## 0.005 0.5797784 0.3165351
## 0.006 0.5797941 0.3141993
## 0.007 0.5727636 0.3024321
## 0.008 0.5790092 0.3119535
## 0.009 0.5807764 0.3138709
## 0.010 0.5861471 0.3177705
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.004.
set.seed(1650)
red_wine_rf_tuned_cv <- train(quality ~ .,
                              data = red_wine_train,
                              trControl = tr_control,
                              method = "rf",
                              tuneGrid = expand.grid(mtry = c(1:3))
)
red_wine_rf_tuned_cv
## Random Forest
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Cross-Validated (25 fold)
## Summary of sample sizes: 1232, 1230, 1232, 1229, 1231, 1230, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.7067215 0.5200728
## 2 0.7106424 0.5286154
## 3 0.7022088 0.5163067
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
- The red_wine_rf model had a top accuracy of 67.01%.
- The red_wine_rf_tuned model had a top accuracy of 66.69%.
- The red_wine_rf_tuned_cv model had a top accuracy of 71.06%.

So we can see that spending some time tuning and tweaking our ML model can lead to significant improvements in predictive accuracy. The red_wine_rf_tuned_cv model that achieved the accuracy of 71.06% did so for an mtry value of 2, and used the cv resampling method.
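If you would like to confirm these figures programmatically, one option (a minimal sketch, assuming the three random forest models are still in your workspace) is to extract the maximum resampled accuracy and the selected mtry value from each train object:

# Top resampled accuracy of each random forest model
max(red_wine_rf$results$Accuracy)
max(red_wine_rf_tuned$results$Accuracy)
max(red_wine_rf_tuned_cv$results$Accuracy)

# The mtry value caret selected for the best-performing model
red_wine_rf_tuned_cv$bestTune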
ggplot(red_wine_rf)
dotPlot(varImp(red_wine_rf))
ggplot(red_wine_rf_tuned)
dotPlot(varImp(red_wine_rf_tuned))
ggplot(red_wine_rf_tuned_cv)
dotPlot(varImp(red_wine_rf_tuned_cv))
Here we can see that the number of randomly selected predictors (feature variables) that gave the highest accuracy differed between the models, depending on the adjustments made (tuning the parameters or changing the resampling method).
We observe that alcohol
remains the most important feature variable across the different random forest models. However, the next-most-important feature variables change slightly from model to model. free.sulfur.dioxide
was the least important feature variable across all models.
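If you prefer to see the underlying numbers rather than the dot plots, the importance scores can be extracted from the varImp() object. A minimal sketch is shown below; note that for this random forest the scores appear in a single Overall column (scaled to 0-100 by default), but the column names can differ for other model types.

# Extract the numeric variable importance scores for the cross-validated
# random forest, sorted from most to least important
rf_importance <- varImp(red_wine_rf_tuned_cv)$importance
rf_importance[order(-rf_importance$Overall), , drop = FALSE]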
Example results are shown below.
For this check, we have chosen to use the models:
red_wine_dec_tree_tuned_cv
red_wine_rf_tuned_cv
# Load magrittr package for piping
library(magrittr)

# Count number of observations in validation data
validation_numbers <- dim(red_wine_validate)[1]

# Use the fitted model to predict quality values given the validation data
predict_red_wine_dec_tree <- predict(red_wine_dec_tree_tuned_cv,
                                     newdata = red_wine_validate)

# When run, the code below gives us the percentage of correct predictions
dec_tree_accuracy <- sum(predict_red_wine_dec_tree ==
                           red_wine_validate$quality) / validation_numbers * 100
dec_tree_accuracy %>% round(2)
## [1] 56.47
# Use the fitted model to predict quality values given the validation data
predict_red_wine_rf <- predict(red_wine_rf_tuned_cv,
                               newdata = red_wine_validate)

# When run, the code below gives us the percentage of correct predictions
rf_accuracy <- sum(predict_red_wine_rf ==
                     red_wine_validate$quality) / validation_numbers * 100
rf_accuracy %>% round(2)
## [1] 65.93
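An alternative way to carry out this check is with caret's confusionMatrix() function, which reports the same overall accuracy (as a proportion rather than a percentage) along with a full cross-tabulation of predicted versus actual quality scores. A minimal sketch for the random forest model is:

# Overall accuracy plus a class-by-class breakdown of the predictions
confusionMatrix(data = predict_red_wine_rf,
                reference = red_wine_validate$quality)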
Example code is shown below:
results_boot <- resamples(list(decision_tree = red_wine_dec_tree,
                               decision_tree_tuned = red_wine_dec_tree_tuned,
                               random_forest = red_wine_rf,
                               random_forest_tuned = red_wine_rf_tuned)
)
summary(results_boot)
##
## Call:
## summary.resamples(object = results_boot)
##
## Models: decision_tree, decision_tree_tuned, random_forest, random_forest_tuned
## Number of resamples: 25
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## decision_tree 0.5353319 0.5614407 0.5720430 0.5737107 0.5883621 0.6247241
## decision_tree_tuned 0.5354167 0.5578947 0.5819328 0.5778125 0.5953878 0.6120690
## random_forest 0.6493776 0.6615721 0.6666667 0.6700665 0.6783370 0.6966527
## random_forest_tuned 0.6410256 0.6559140 0.6673961 0.6668683 0.6724891 0.7008547
## NA's
## decision_tree 0
## decision_tree_tuned 0
## random_forest 0
## random_forest_tuned 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## decision_tree 0.2520370 0.2891515 0.3081993 0.3033509 0.3189378 0.3814905
## decision_tree_tuned 0.2577697 0.2867525 0.3221000 0.3147239 0.3442703 0.3683173
## random_forest 0.4344375 0.4486099 0.4570372 0.4633930 0.4771286 0.5110820
## random_forest_tuned 0.4155740 0.4360557 0.4559183 0.4555277 0.4681016 0.5210316
## NA's
## decision_tree 0
## decision_tree_tuned 0
## random_forest 0
## random_forest_tuned 0
results_cv <- resamples(list(decision_tree_tuned_cv = red_wine_dec_tree_tuned_cv,
                             random_forest_tuned_cv = red_wine_rf_tuned_cv))
summary(results_cv)
##
## Call:
## summary.resamples(object = results_cv)
##
## Models: decision_tree_tuned_cv, random_forest_tuned_cv
## Number of resamples: 25
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## decision_tree_tuned_cv 0.4339623 0.5576923 0.6078431 0.5909453 0.6470588 0.70
## random_forest_tuned_cv 0.5400000 0.6730769 0.7058824 0.7106424 0.7547170 0.86
## NA's
## decision_tree_tuned_cv 0
## random_forest_tuned_cv 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu.
## decision_tree_tuned_cv 0.08830275 0.2930591 0.3598383 0.3411524 0.4287492
## random_forest_tuned_cv 0.24342105 0.4580012 0.5179584 0.5286154 0.6113931
## Max. NA's
## decision_tree_tuned_cv 0.4932432 0
## random_forest_tuned_cv 0.7695853 0
Note: The column of interest in the output is actually the Mean
column (the average accuracy achieved by the model over all the resamples), not the Max.
column.
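If you only want the mean accuracies, they can be pulled directly out of the summary object, as in the sketch below (the exact structure of the summary.resamples object is an assumption based on current versions of caret):

# Mean resampled accuracy for each model (bootstrap and cross-validation)
summary(results_boot)$statistics$Accuracy[, "Mean"]
summary(results_cv)$statistics$Accuracy[, "Mean"]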
dotplot(results_boot)
dotplot(results_cv)
Note: Your results and conclusion may be different - the results discussed in these solutions files are for models trained following the set.seed(1650)
specification.
We trained Decision Tree and Random Forest machine learning models on the Portuguese red wine data winequality_red.csv.
The Random Forest models had the best overall accuracy based on the training data, at 71.06% with tuned parameters and the cv resampling method. This was supported by the validation data test, for which the selected Random Forest model achieved an accuracy score of 65.93%.
Based on our results, we would recommend using the Random Forest machine learning model for this data. We do note however that the model can take some time to run.
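One practical consequence of this training time is that it can be convenient to save a fitted model to disk and reload it later, rather than retraining it in every session. A minimal sketch using base R's saveRDS() and readRDS() is shown below (the file name is just an illustrative choice):

# Save the fitted random forest model to disk
saveRDS(red_wine_rf_tuned_cv, file = "red_wine_rf_tuned_cv.rds")

# Reload the saved model in a later session without retraining
red_wine_rf_tuned_cv <- readRDS("red_wine_rf_tuned_cv.rds")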
It is worth noting here that there are other more advanced models which we haven’t tried that could lead to higher accuracy scores.
Example code is shown below:
set.seed(1650)
red_wine_boosted <- train(quality ~ .,
                          data = red_wine_train,
                          method = "gbm",
                          verbose = F
)
red_wine_boosted
## Stochastic Gradient Boosting
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.6049492 0.3554880
## 1 100 0.6103872 0.3703725
## 1 150 0.6081707 0.3688512
## 2 50 0.6133524 0.3748142
## 2 100 0.6178042 0.3853569
## 2 150 0.6160044 0.3838989
## 3 50 0.6195430 0.3865898
## 3 100 0.6225492 0.3934048
## 3 150 0.6250933 0.3986685
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
The best accuracy achieved by your red_wine_boosted
gradient boosting machine model is 62.51%, for an interaction.depth
of 3 and n.trees
= 150.
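These values can also be confirmed directly from the train object, as in the short sketch below:

# Tuning parameter combination selected by caret
red_wine_boosted$bestTune

# Highest resampled accuracy across the default tuning grid
max(red_wine_boosted$results$Accuracy)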
No answer required.
set.seed(1650)
red_wine_boosted_tuned <- train(quality ~ .,
                                data = red_wine_train,
                                method = "gbm",
                                verbose = FALSE,
                                tuneGrid = expand.grid(interaction.depth = 3:6,
                                                       n.trees = seq(50, 200, 50),
                                                       shrinkage = 0.1,
                                                       n.minobsinnode = 10)
)
red_wine_boosted_tuned
## Stochastic Gradient Boosting
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 3 50 0.6198300 0.3862094
## 3 100 0.6213315 0.3914632
## 3 150 0.6283839 0.4033432
## 3 200 0.6290806 0.4059713
## 4 50 0.6223524 0.3915968
## 4 100 0.6266637 0.4001994
## 4 150 0.6325701 0.4106020
## 4 200 0.6331125 0.4120082
## 5 50 0.6262532 0.3985373
## 5 100 0.6320015 0.4094948
## 5 150 0.6372771 0.4186823
## 5 200 0.6411313 0.4249792
## 6 50 0.6313038 0.4050127
## 6 100 0.6391566 0.4204912
## 6 150 0.6472912 0.4339037
## 6 200 0.6475264 0.4345551
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 200, interaction.depth =
## 6, shrinkage = 0.1 and n.minobsinnode = 10.
Example code is shown below:
set.seed(1650)
red_wine_boosted_tuned_cv <- train(quality ~ .,
                                   data = red_wine_train,
                                   method = "gbm",
                                   verbose = FALSE,
                                   trControl = tr_control,
                                   tuneGrid = expand.grid(interaction.depth = 3:6,
                                                          n.trees = seq(50, 200, 50),
                                                          shrinkage = 0.1,
                                                          n.minobsinnode = 10)
)
red_wine_boosted_tuned_cv
## Stochastic Gradient Boosting
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Cross-Validated (25 fold)
## Summary of sample sizes: 1232, 1230, 1232, 1229, 1231, 1230, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 3 50 0.6172749 0.3824031
## 3 100 0.6483437 0.4346333
## 3 150 0.6476075 0.4341328
## 3 200 0.6517277 0.4428700
## 4 50 0.6287437 0.4008587
## 4 100 0.6427785 0.4247133
## 4 150 0.6483460 0.4371353
## 4 200 0.6473312 0.4366541
## 5 50 0.6280076 0.4004039
## 5 100 0.6483478 0.4333239
## 5 150 0.6513173 0.4394956
## 5 200 0.6590764 0.4525660
## 6 50 0.6380836 0.4150853
## 6 100 0.6598444 0.4530962
## 6 150 0.6660563 0.4633046
## 6 200 0.6776168 0.4828904
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 200, interaction.depth =
## 6, shrinkage = 0.1 and n.minobsinnode = 10.
- The red_wine_boosted model had a top accuracy of 62.51%.
- The red_wine_boosted_tuned model had a top accuracy of 64.75%.
- The red_wine_boosted_tuned_cv model had a top accuracy of 67.76%, for interaction.depth = 6 and n.trees = 200.

We observe that by tweaking the gbm model, we have increased the predictive accuracy by over 5%, which is a great result.
plot(red_wine_boosted_tuned_cv)
We observe that, in general, as the interaction depth increases, so too does the accuracy of the model.
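This pattern can also be checked numerically by inspecting the model's results data frame, sorted by accuracy. A minimal sketch is shown below:

# View the best-performing tuning parameter combinations, highest accuracy first
gbm_results <- red_wine_boosted_tuned_cv$results
head(gbm_results[order(-gbm_results$Accuracy),
                 c("interaction.depth", "n.trees", "Accuracy", "Kappa")])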
set.seed(1650)
red_wine_lda <- train(quality ~ .,
                      data = red_wine_train,
                      method = "lda")
red_wine_lda
## Linear Discriminant Analysis
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5943597 0.3549375
The best accuracy achieved by this method is 59.44% for the training data.
set.seed(1650)
red_wine_svm <- train(quality ~ .,
                      data = red_wine_train,
                      method = "svmLinear")
red_wine_svm
## Support Vector Machines with Linear Kernel
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5902292 0.3242403
##
## Tuning parameter 'C' was held constant at a value of 1
The best accuracy achieved by this method is 59.02%.
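If you wanted to experiment further with the support vector machine, the svmLinear method has a single tuning parameter, C (the cost of constraint violations), which can be tuned in the same way as the earlier models. The sketch below is one possible starting point; the model name and the grid values are illustrative choices rather than part of the lab solutions.

set.seed(1650)
red_wine_svm_tuned <- train(quality ~ .,
                            data = red_wine_train,
                            method = "svmLinear",
                            tuneGrid = expand.grid(C = c(0.25, 0.5, 1, 2)))
red_wine_svm_tuned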
set.seed(1650)
red_wine_knn <- train(quality ~ .,
                      data = red_wine_train,
                      method = "knn")
red_wine_knn
## k-Nearest Neighbors
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.5388836 0.2733503
## 7 0.5467209 0.2812282
## 9 0.5518291 0.2852813
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
The best accuracy achieved by this method is 55.18%, for k = 9.
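As with the earlier models, it is worth checking these additional models against the held-out validation data before drawing any firm conclusions. A minimal sketch (assuming red_wine_validate is still in your workspace) is shown below:

# Predict wine quality on the validation data with each model
lda_pred <- predict(red_wine_lda, newdata = red_wine_validate)
svm_pred <- predict(red_wine_svm, newdata = red_wine_validate)
knn_pred <- predict(red_wine_knn, newdata = red_wine_validate)

# Percentage of correct predictions for each model
mean(lda_pred == red_wine_validate$quality) * 100
mean(svm_pred == red_wine_validate$quality) * 100
mean(knn_pred == red_wine_validate$quality) * 100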
Hopefully this lab has enhanced your understanding of how to conduct supervised machine learning in RStudio. This is just the beginning - there are so many different models and methods out there!
These notes have been prepared by Rupert Kuveke. Please note that some of the content in these notes has been developed from content in Thulin (2021). The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.