Data Science Stream

Topic 11B: Machine Learning III


Welcome to the eleventh computer lab for the Data Science stream of STM1001. This will be our third and final lab focusing on machine learning.

In this computer lab we will fit a variety of machine learning models to the Portuguese Vinho Verde wine data (see footnote 1) that we began analysing in Computer Lab 10B. We will also cover how to tune machine learning parameters in order to achieve better results.

This computer lab is designed to run alongside the content in the Introduction to Machine Learning in R supplement.

By the end of this lab, you will have gained a greater understanding of machine learning, and should feel comfortable conducting the processes involved in training and improving a variety of important machine learning models in RStudio.


1 Machine Learning (ML) Preparations

🏡 Before we proceed, please make sure you have read all the content in the Introduction to Machine Learning in R supplement and completed Computer Lab 10B. It may also be helpful to:

  • Use the same working directory as the one you used when you completed Computer Lab 10B
  • Keep the supplement content and the Computer Lab 10B solutions open in separate tabs while you work through this lab material

1.1 Wine Data Review

🏡 Recall that the Portuguese Vinho Verde wine data consists of 11 feature variables relating to physicochemical aspects of the wine, namely:

  • fixed acidity
  • volatile acidity
  • citric acid
  • residual sugar
  • chlorides
  • free sulfur dioxide
  • total sulfur dioxide
  • density
  • pH
  • sulphates
  • alcohol

The only available outcome variable is quality, which is an integer score from 0 to 10 (with 0 denoting a terrible wine and 10 denoting an exceptional wine). Note that no wine actually receives an extreme rating of 0 or 10.

This data is split into two sets - one for red wine and one for white wine. As part of Computer Lab 10B, you should have already downloaded the red wine data set - we will continue to analyse this data set in this lab.

If you do not have this file saved on your device, please download winequality_red.csv from the LMS now.

1.2 ML Aim

💻 In Computer Lab 10B, we pre-processed our red wine data, split the data into training and validation data sets, and trained and validated a simple decision tree model.

The best accuracy achieved was 57.37% (using the training data) and 52.05% (using the validation data).

We next trained and validated a simple random forest model, and this outperformed the decision tree model, achieving an accuracy of 67.01% and 65.93% for the training and validation data respectively. This is a decent result, but we would like to do better.

Our aim now is to use our pre-processed training and validation data sets to train other machine learning models, in the hope that one or more of these new models will achieve a higher predictive accuracy than 67.01%.

We will also extend our skills using the train function, and try tweaking the tuning parameters for the different models in order to further improve our results, rather than simply using the default settings.

Note: The accuracies stated here may be slightly different to what you obtained, depending on your random seed.
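The effect of the random seed can be seen in a minimal base R sketch. The seed value 1650 matches the one used in the code later in this lab; the object names and the second seed value are illustrative assumptions only.

```r
# Resetting the same seed reproduces the same "random" draw
set.seed(1650)
draw_a <- sample(1:100, 5)

set.seed(1650)
draw_b <- sample(1:100, 5)

# A different seed will generally give a different draw
set.seed(2024)
draw_c <- sample(1:100, 5)

identical(draw_a, draw_b)  # TRUE, since the seed was reset to the same value
```

This is why set.seed(1650) appears before each call to train in this lab: it keeps the resampling, and hence the reported accuracies, reproducible.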

1.3 Pre-prepared ML Code to Run

💻 In the interests of time, and since you may not have your code from the previous computer lab to hand, please run the code in the code chunk below. This is all the relevant code to get us up and running for the subsequent steps in this lab.

Note: You must have the winequality_red.csv saved in your current working directory for the code below to work.

Note: If a red warning message about Rtools appears, don’t worry, it is safe to ignore the message.

# Specify required packages
ml_packages <- c("caret", "gbm", "kernlab", "magrittr", "randomForest", "rpart.plot")
# Install missing packages
install.packages(setdiff(ml_packages, rownames(installed.packages())))
# Load all packages
lapply(ml_packages, library, character.only = TRUE)
# Load the data and convert the outcome variable to a factor
red_wine <- read.csv(file = "winequality_red.csv", header = TRUE)
red_wine$quality <- as.factor(red_wine$quality)
# Centre and scale the 11 feature variables (column 12 is quality)
centre_scale <- preProcess(red_wine[, -12], 
                           method = c("center", "scale"))
red_wine_updated <- predict(centre_scale, red_wine)
# Split the data into training (80%) and validation (20%) sets
set.seed(1650)
wine_train_index <- createDataPartition(red_wine_updated$quality, 
                                        p = 0.8, 
                                        list = FALSE, times = 1) 
red_wine_train <- red_wine_updated[wine_train_index, ]
red_wine_validate <- red_wine_updated[-wine_train_index, ]

# Train the simple decision tree model from Computer Lab 10B
set.seed(1650)
red_wine_dec_tree <- train(quality ~ .,
                           data = red_wine_train,
                           method = "rpart")

# Train the simple random forest model from Computer Lab 10B
set.seed(1650)
red_wine_rf <- train(quality ~ .,
                     data = red_wine_train,
                     method = "rf")

2 Fine-Tuning ML Models

💻 So far, we have been using the default tuning parameters for our ML models. While these often do a great job, sometimes intelligently tweaking the tuning parameters can make all the difference.

Each machine learning model has a different set of tuning parameters, and there are dozens of different models, but don’t worry - we’ll just focus on a select few.

We can specify changes to tuning parameters using the optional argument tuneGrid within the train function. This can be a complicated argument, since, as noted above, each machine learning model has its own set of tuning parameters.
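To get a feel for what is passed to tuneGrid, the base R sketch below shows what expand.grid produces. The names param_a and param_b are hypothetical placeholders rather than real tuning parameters; in practice the column names must match the tuning parameter names of the chosen model.

```r
# expand.grid returns a data frame containing every combination
# of the supplied values (param_a and param_b are hypothetical names)
demo_grid <- expand.grid(param_a = 1:3,
                         param_b = c(0.1, 0.5))

demo_grid        # one row per combination
nrow(demo_grid)  # 3 values x 2 values = 6 rows
```

Each row of the grid is one candidate setting for train to evaluate, so larger grids generally mean longer training times.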

Note: The focus here is on having some fun and familiarising ourselves with the code required to fit different machine learning models - you are not expected to understand or explain all the mathematics behind these models.

2.1 Tuning a Decision Tree

💻 While our decision tree model performed poorly in Computer Lab 10B, it is worthwhile to know that there are some simple changes we could make to this model, to potentially improve its predictive accuracy. We will introduce these below.

The main tuning parameter for a decision tree model is cp - the complexity parameter, which has a default value of 0.01.

The complexity parameter penalises the decision tree if it has too many branches.

  • If the cp value is too low, your tree model may be overfitted, with an excessive number of branches.
  • Conversely, if the cp value is too high, your tree model may look more like a stick, i.e. be too simplistic to be of any use.
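The effect of cp on tree size can be sketched by calling rpart directly. This is an illustration only: it uses R's built-in iris data as a stand-in (an assumption, so that the code runs without the wine file), and the two cp values and the helper count_leaves are chosen purely for demonstration.

```r
library(rpart)

# Fit the same tree with a small and a large complexity parameter
tree_low_cp  <- rpart(Species ~ ., data = iris, method = "class",
                      control = rpart.control(cp = 0.001))
tree_high_cp <- rpart(Species ~ ., data = iris, method = "class",
                      control = rpart.control(cp = 0.6))

# Helper (hypothetical name): count the terminal nodes of a fitted tree
count_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

count_leaves(tree_low_cp)   # more leaves: a more complex tree
count_leaves(tree_high_cp)  # fewer leaves: a simpler, "stick-like" tree
```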

The code chunk below contains partially completed code, with the tuneGrid argument incorporated into the train function.

set.seed(1650)
red_wine_dec_tree_tuned <- train(... ,
                                 data = ... ,
                                 method = ... , 
                                 tuneGrid = expand.grid(cp = seq(0.001, 0.01, 0.001))
                                 )

Here:

  • We have specified that the cp values to test are \(0.001, 0.002, 0.003, \dots, 0.009, 0.01\).

Fill in the missing ... details in the code chunk above, and once you are happy with your code, try running it.

2.1.1

💻 Check your results for your modified model - as a result of adjusting the cp tuning parameter, has the top accuracy of the model increased from the original 57.37%?
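One way to read the top accuracy off a fitted train object is sketched below. So that the sketch runs on its own, it uses the built-in iris data and an arbitrary small cp grid (both assumptions for illustration); the same components can be inspected on your own models.

```r
library(caret)

# iris is a stand-in data set so that this sketch runs on its own
set.seed(1650)
demo_tree <- train(Species ~ .,
                   data = iris,
                   method = "rpart",
                   tuneGrid = expand.grid(cp = c(0.01, 0.05, 0.1)))

demo_tree$results                # one row per cp value tested
demo_tree$bestTune               # the cp value selected by train
max(demo_tree$results$Accuracy)  # top resampled accuracy, as a proportion
```

Note that the Accuracy column is reported as a proportion between 0 and 1, so multiply by 100 to compare it with the percentages quoted in this lab.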

🎧 Online students 💬 Enter your answer next to the question on the shared jamboard.

2.2 Tuning a Random Forest Model

💻 The main tuning parameter for a random forest model is the mtry argument, which represents the number of feature variables randomly sampled as candidates at each split of each tree. Generally, the default values for this are quite good, but it’s often worth trying out a few different values, just to check.

From your initial random forest model red_wine_rf, you may have found that an mtry value of 2 led to better results than mtry = 6 or mtry = 11. Therefore, let’s try some values around 2.

Use your code from 1.3 to fit a new random forest model that includes inside the train function the argument tuneGrid = expand.grid(mtry = c(1:3)).

Call this new model red_wine_rf_tuned.

Note: The training of this model might take a few minutes, depending on your device’s hardware.

2.3 Resampling Methods

💻 So far, we have used the default bootstrap resampling method boot when fitting all our machine learning models.

In fact, the train function accepts 14 different resampling options!

Don’t worry though, we will just consider one alternative to the boot method in this lab, namely the cross-validation (cv) method.

The cv method follows the process we are employing with our training and validation data, albeit on a smaller scale - i.e. the machine learning model will take a subset of the red_wine_train data, train the model, and then test it on the remaining portion of the red_wine_train data. We can also specify the number of folds, i.e. how many parts the training data is divided into, with each part taking a turn as the held-out test portion - 10 is often considered adequate, but for this lab we will specify a larger number of 25.
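The idea behind cross-validation can be sketched in base R as follows. This is only an illustration of how the rows are divided up, not what train does internally; the number of observations and folds are hypothetical values chosen to keep the output short.

```r
set.seed(1650)
n_obs   <- 20  # hypothetical number of training observations
n_folds <- 5   # hypothetical number of folds

# Randomly assign each observation to one of the folds
fold_id <- sample(rep(1:n_folds, length.out = n_obs))

for (k in 1:n_folds) {
  test_rows  <- which(fold_id == k)          # rows held out for testing
  train_rows <- setdiff(1:n_obs, test_rows)  # rows used to train the model
  cat("Fold", k, ": train on", length(train_rows),
      "rows, test on", length(test_rows), "rows\n")
}

# Each observation belongs to exactly one fold, so it is held out once
table(fold_id)
```

The accuracy reported for a cross-validated model is then the average of the accuracies obtained on the held-out portions.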

2.3.1

💻 We can specify the machine learning resampling method to use within the train function via the argument trControl.

Take a look at the code below, and see if you can create a new decision tree model and a new random forest model that use the cv resampling method, together with the ranges of cp and mtry values specified in 2.1 and 2.2 respectively.

Note: You will need to fill in the missing ... parts in the code.

tr_control <- trainControl(method = "cv",
                           number = 25)

set.seed(1650)
red_wine_dec_tree_tuned_cv <- train(quality ~ .,
                                    data = ...,
                                    trControl = tr_control,
                                    method = "rpart",
                                    tuneGrid = ...
                                    )

set.seed(1650)
red_wine_rf_tuned_cv <- train(quality ~ .,
                              data = ... ,
                              trControl = tr_control,
                              method = "rf",
                              tuneGrid = ...
                              )

Hint: If you are not sure how to proceed, check the code chunk below:

tr_control <- trainControl(method = "cv",
                           number = 25)

set.seed(1650)
red_wine_rf_tuned_cv <- train(quality ~ .,
                              data = red_wine_train,
                              trControl = tr_control,
                              method = "rf",
                              tuneGrid = expand.grid(mtry = c(1:3))
                              )

# You will need a different tuneGrid specification for the decision tree model

2.3.2

💻 Compare your results for your three random forest models red_wine_rf, red_wine_rf_tuned and red_wine_rf_tuned_cv.

Which of your three approaches results in the best performing random forest model?

Note the specific tuning parameter values and resampling method that led to the best results.

🎧 Online students 💬 Enter your answer next to the question on the shared jamboard.

2.3.3

💻 Recall that we can easily produce two helpful plots for random forest models, namely:

  • A plot of the model accuracy based on the number of feature variables randomly sampled at each split (mtry) in the random forest model
  • A plot of the importance of each feature variable in achieving an accurate model

Run the following R code to produce these plots now, for your initial random forest model:

ggplot(red_wine_rf)

dotPlot(varImp(red_wine_rf))

2.3.4

💻 Create the plots discussed in 2.3.3 for your other two random forest models, and comment on your results.

🎧 Online students 💬 Enter your answer next to the question on the shared jamboard.

3 Validating Results

💻 In addition to obtaining predictive accuracy estimates for each of our models, it is important to also check how the models perform when presented with our validation data. Recall that we carried out this check in earlier labs (see footnote 2).

An example application of this approach to the decision tree model results is shown below:

# Load the magrittr package for piping
library(magrittr)

# Count the number of observations in the validation data
validation_numbers <- dim(red_wine_validate)[1]

# Use the fitted model to predict quality values given the validation data
predict_red_wine_dec_tree <- predict(red_wine_dec_tree, 
                                     newdata = red_wine_validate)

# Calculate the percentage of correct predictions
dec_tree_accuracy <- sum(predict_red_wine_dec_tree == 
                         red_wine_validate$quality) / validation_numbers * 100

# Display the accuracy, rounded to two decimal places
dec_tree_accuracy %>% round(2)
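The accuracy calculation can be checked by hand on a tiny example. The sketch below uses made-up predicted and actual quality scores (hypothetical values, for illustration only), and also shows how a cross-tabulation reveals where the misclassifications occur.

```r
# Hypothetical actual and predicted quality scores, for illustration only
actual    <- factor(c(5, 6, 5, 7, 6), levels = 5:7)
predicted <- factor(c(5, 6, 6, 7, 5), levels = 5:7)

# Percentage of correct predictions (3 of the 5 predictions match)
accuracy_pct <- mean(predicted == actual) * 100
accuracy_pct

# Cross-tabulation of predicted versus actual scores
table(predicted, actual)
```

Taking the mean of the logical comparison is equivalent to dividing the number of correct predictions by the number of observations.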

3.1

💻 Select your best performing decision tree and random forest versions, then, following the process outlined above in 3, use your red_wine_validate data to assess the predictive accuracy of your selected models.

Which of your selected models has the better performance with the validation data?

4 Summarising Results

💻 Now that we have tried fitting and validating a selection of popular machine learning models, we might like to summarise our findings; this can make it easier to compare the results of the different methods.

4.1

💻 Machine learning models trained using the train function test combinations of tuning parameters to try to find the combination that leads to the most accurate predictions.

Create two lists to summarise the results of your different machine learning models, using the code below as a starting point.

Note: The ... parts of the code need to be filled in.

Note: We need to create separate lists for results obtained via the boot and cv resampling methods.

results_boot <- resamples(list(decision_tree = red_wine_dec_tree, 
                               decision_tree_tuned = red_wine_dec_tree_tuned,
                               random_forest = red_wine_rf, 
                               ...
                               )
                         )
summary(results_boot)

results_cv <- resamples(list(..., random_forest_tuned_cv = red_wine_rf_tuned_cv))

summary(results_cv)

Note: The column of interest in the output is actually the Mean column (the average accuracy achieved by the model over all the resamples), not the Max. column.
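If you would like to pull out just the Mean accuracy values, one possible approach is sketched below. It is a sketch under stated assumptions: the built-in iris data, a 5-fold cv setting, an arbitrary cp grid and the demo_ object names are all stand-ins so that the code runs on its own.

```r
library(caret)

# Shared resampling settings (5 folds keeps this sketch quick)
demo_control <- trainControl(method = "cv", number = 5)

# iris is a stand-in data set so that this sketch runs on its own
set.seed(1650)
demo_tree <- train(Species ~ ., data = iris,
                   method = "rpart", trControl = demo_control)

set.seed(1650)
demo_tree_tuned <- train(Species ~ ., data = iris,
                         method = "rpart", trControl = demo_control,
                         tuneGrid = expand.grid(cp = c(0.001, 0.01, 0.1)))

demo_results <- resamples(list(tree = demo_tree,
                               tree_tuned = demo_tree_tuned))

# Extract the Mean column of the Accuracy summary for each model
mean_accuracy <- summary(demo_results)$statistics$Accuracy[, "Mean"]
mean_accuracy
```

Resetting the seed before each call to train means both models are assessed on the same folds, which makes the comparison fairer.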

4.2

💻 Apply the dotplot function to your two results objects from 4.1 to plot the range of accuracy values for the different methods.

4.3

💻 To conclude, write a brief, simple summary of your findings, explaining your process and the results of your machine learning models. Overall, based on the results obtained and the training processes involved, do you have a preference for one of the machine learning models we have used in this computer lab?


🏡 Reconvene in main room to discuss results


🎧 Online students 💬 Volunteer to discuss your preferred machine learning model, out of those covered in these recent DS computer labs, or leave a comment in the shared jamboard.

5 Extension: Gradient Boosting Machine ML Models

💻 Another ensemble method that combines multiple decision trees is the gradient boosting machine model. There are several options for training a gradient boosting machine model in RStudio. We will use the stochastic gradient boosting method gbm.

Using the method specification gbm within the train function, train a gradient boosting machine model, and name your model red_wine_boosted.

Again, don’t worry if it takes a couple of minutes for your code to run - this is normal.

Note: You can also include the argument verbose = FALSE within the train function so no calculations are shown while the model is being fitted. Alternatively, if you would like to see the model fitting in action in the R console, you can leave this argument out.

5.1

💻 What is the best accuracy achieved by your red_wine_boosted gradient boosting machine model?

5.2

💻 For the gradient boosting machine model, the main tuning parameters are:

  • n.trees (the number of iterations i.e. the number of decision trees used) and
  • interaction.depth (the complexity of the trees).

The code chunk below contains partially completed code for a tuned gradient boosting machine model, with the tuneGrid argument incorporated into the train function.

set.seed(1650)
red_wine_boosted_tuned <- train(... ,
                                data = ... ,
                                method = ... , 
                                verbose = FALSE,
                                tuneGrid = expand.grid(interaction.depth = 3:6,
                                                       n.trees = seq(50, 200, 50),
                                                       shrinkage = 0.1,
                                                       n.minobsinnode = 10)
                               )

Here:

  • We have specified that the interaction depth can be between 3 and 6, rather than between 1 and 3 (usually, greater depth leads to better accuracy, although we have to be careful not to overfit).
  • We have also specified that the number of trees can be 50, 100, 150 or 200 (i.e. 200 trees is now our maximum, rather than 150).
  • Note that shrinkage and n.minobsinnode must also be included in the expand.grid call supplied to tuneGrid, otherwise the training will fail. The values used here are simply the default ones.
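To see how many parameter combinations this grid contains, the expand.grid call from the code chunk above can be run on its own in base R. The object name gbm_grid is an illustrative assumption.

```r
# The same grid as in the code chunk above, stored in its own object
gbm_grid <- expand.grid(interaction.depth = 3:6,
                        n.trees = seq(50, 200, 50),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)

nrow(gbm_grid)  # 4 depths x 4 tree counts x 1 x 1 = 16 combinations
head(gbm_grid)  # first few rows, one row per combination
```

Since shrinkage and n.minobsinnode each take a single value, they do not increase the number of combinations.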

5.3

💻 Fill in the missing ... details in the code chunk above in 5.2, and once you are happy with your code, try running it.

5.3.1

💻 Using the code in 2.3.1 above as a guide, train a new gradient boosting machine model that uses the cv resampling method and the tuning specifications used in 5.2.

Assign your output to a new object called red_wine_boosted_tuned_cv.

5.4

💻 Compare your results for your three gradient boosting machine models red_wine_boosted, red_wine_boosted_tuned and red_wine_boosted_tuned_cv.

Which of your three approaches results in the best performing gradient boosting machine model? Note down the specific tuning parameter values and resampling method that led to the best results.

🎧 Online students 💬 Enter your answer next to the question on the shared jamboard.

5.5

💻 To conclude our focus on gradient boosting machine models, use the plot function to visualise the results of the training process for your best performing gradient boosting machine model.

🎧 Online students 💬 Enter your answer next to the question on the shared jamboard.

6 Extension: Additional ML Models

💻 Our focus in this lab so far has been on tree-based machine learning models, as these are flexible and can perform well in a variety of contexts. However, there are plenty more models to choose from, and so we will briefly introduce a few new options below:

  • Linear Discriminant Analysis Model
  • Support Vector Machine Model
  • k-Nearest-Neighbour Model

For these models, we will use the default tuning parameters and resampling methods.

6.1 LDA

💻 Fit a linear discriminant analysis model to your red_wine_train data via the method specification lda.

What is the best accuracy achieved by this method for the training data?

6.2 SVM

💻 Fit a support vector machine model to your red_wine_train data via the method specification svmLinear (there are other options, but these are more complicated and take longer to execute).

What is the best accuracy achieved by this method for the training data?

Note: In order to fit an SVM model, we need to load the kernlab package. This should have been done already in 1.3, but if an error appears when fitting the model, just double-check this.
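A quick way to double-check the package, sketched in base R, is shown below. The object names are illustrative assumptions.

```r
# TRUE if the kernlab package is installed on this device
kernlab_installed <- requireNamespace("kernlab", quietly = TRUE)

# TRUE if kernlab is currently attached via library()
kernlab_attached <- "package:kernlab" %in% search()

kernlab_installed
kernlab_attached
```

If the first value is FALSE, the package needs to be installed; if only the second is FALSE, running library(kernlab) should be sufficient.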

6.3 kNN

💻 The final method we will try is the k-Nearest-Neighbours model, which is selected via the method specification knn. Fit a kNN model to your red_wine_train data.

What is the best accuracy achieved by this method for the training data?


Well done, that concludes our work in machine learning.

Hopefully this lab has enhanced your understanding of how to conduct supervised machine learning in RStudio. This is just the beginning - there are so many different models and methods out there!


References

Cortez, P., A. Cerdeira, F. Almeida, T. Matos, and J. Reis. 2009. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” Decision Support Systems 47 (4): 547–53.
Thulin, M. 2021. Modern Statistics with R: From Wrangling and Exploring Data to Inference and Predictive Modelling.
UCI Machine Learning Repository. 2009. “Wine Quality Data Set [.csv file].” https://archive.ics.uci.edu/ml/datasets/Wine+Quality.


These notes have been prepared by Rupert Kuveke. Please note that some of the content in these notes has been developed from content in Thulin (2021). The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.


  1. This data was obtained from the UCI Machine Learning Repository (2009) and originally collected by Cortez et al. (2009).↩︎

  2. This approach was also demonstrated in section 4.1 of the Introduction to Machine Learning in R supplement.↩︎

