Data Science Module

Topic 11B: Machine Learning II


Welcome to the eleventh computer lab for the Data Science module. This will be our second lab focusing on machine learning.

In this computer lab we will fit a variety of machine learning models to the Portuguese Vinho Verde wine data (see footnote 1) we began analysing in Computer Lab 10B. We will also cover how to tune machine learning parameters in order to achieve better results.

This computer lab is designed to run alongside the content in the Introduction to Machine Learning in R supplement. The material in this supplement provides all the background information on machine learning and machine learning terminology you will need to complete this lab.

By the end of this lab, you will have gained a greater understanding of machine learning in R, and should feel comfortable conducting the processes involved in training and improving a variety of important machine learning models.


1 Preparations

Before we proceed, please make sure you have read the content in the Introduction to Machine Learning in R supplement and completed Computer Lab 10B. It may also be helpful to:

  • Use the same working directory as the one you used when you completed Computer Lab 10B
  • Keep the supplement content and the Computer Lab 10B solutions open in separate tabs while you work through this lab material

1.1 Load Required Packages

Several of the R packages we require for our machine learning processes should be already installed on your PC, but we will also be using some new packages, so please make sure to run the R code in the Code chunk below. This code will install any missing packages, and then load all the packages needed for this lab:

# Specify required packages
ml_packages <- c("caret", "gbm", "kernlab", "magrittr", "randomForest", "rpart.plot")
# Install missing packages
install.packages(setdiff(ml_packages, rownames(installed.packages())))
# Load all packages
lapply(ml_packages, library, character.only = TRUE)

1.2 Wine Data

Recall that the Portuguese Vinho Verde wine data consists of 11 feature variables relating to physicochemical aspects of the wine, namely:

  • fixed acidity
  • volatile acidity
  • citric acid
  • residual sugar
  • chlorides
  • free sulfur dioxide
  • total sulfur dioxide
  • density
  • pH
  • sulphates
  • alcohol

The only available outcome variable is quality, which is an integer score from 0 to 10 (with 0 denoting a terrible wine and 10 denoting an exceptional wine). Note that no wine actually receives an extreme rating of 0 or 10.

This data is split into two sets - one for red wine and one for white wine. As part of Computer Lab 10B, you should have already downloaded the red wine data set - we will continue to analyse this data set in this lab. If you do not have this file saved on your device, please download winequality_red.csv from the LMS now.

1.3 Aim

In Computer Lab 10B, we pre-processed our red wine data, split the data into training and validation data sets, and trained and validated a simple decision tree model.

The best accuracy achieved was 57.37% (using the training data) and 52.05% (using the validation data).

Our aim now is to use our pre-processed training and validation data sets to train other machine learning models, in the hope that one or more of these new models will outperform the decision tree model.

We will also extend our skills using the train function, and try tweaking the tuning parameters for the different models in order to further improve our results, rather than simply using the default settings.

In the interests of time, and since you may not have your code from the previous computer lab to hand, please run the R code in the Code chunk below now. This Code chunk contains all the relevant R code to get us up and running for the subsequent steps.

Note: You must have the winequality_red.csv saved in your current working directory for the code below to work.

# Read in the red wine data and treat quality as a categorical outcome
red_wine <- read.csv(file = "winequality_red.csv", header = TRUE)
red_wine$quality <- as.factor(red_wine$quality)
# Centre and scale the 11 feature variables (column 12 is quality)
centre_scale <- preProcess(red_wine[, -12], 
                           method = c("center", "scale"))
red_wine_updated <- predict(centre_scale, red_wine)
# Split into training (80%) and validation (20%) sets, stratified by quality
set.seed(1650)
wine_train_index <- createDataPartition(red_wine_updated$quality, 
                                        p = 0.8, 
                                        list = FALSE, times = 1) 
red_wine_train <- red_wine_updated[wine_train_index, ]
red_wine_validate <- red_wine_updated[-wine_train_index, ]
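As a rough illustration of what the "center" and "scale" pre-processing steps do to a single feature, the sketch below standardises a small vector using base R only. The values are made up for illustration and are not taken from the wine data.

```r
# Toy feature values (hypothetical, not taken from the wine data)
alcohol <- c(9.4, 9.8, 10.5, 11.2, 12.8)

# Centre: subtract the mean; scale: divide by the standard deviation
alcohol_manual <- (alcohol - mean(alcohol)) / sd(alcohol)

# Base R's scale() performs the same two steps
alcohol_scaled <- as.numeric(scale(alcohol, center = TRUE, scale = TRUE))

# Both approaches agree, and the result has mean 0 and standard deviation 1
all.equal(alcohol_manual, alcohol_scaled)
round(c(mean = mean(alcohol_scaled), sd = sd(alcohol_scaled)), 10)
```

After this step, every feature is measured on a comparable scale, which matters for distance-based methods such as kNN.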

2 Machine Learning Models

The focus here is on having some fun and familiarising ourselves with the R code required to fit different machine learning models - you are not expected to understand or explain all the mathematics behind these models.

We will fit the following types of machine learning model:

  • Decision tree
  • Random Forest
  • Gradient boosting machine
  • LDA
  • SVM
  • k-Nearest-Neighbour

Each of these models can be fitted using the train function from the caret package (although some will also require the use of other packages).

2.1

Let’s take a look at the code we used for our decision tree model from the previous lab, to refresh our memory on the basic train code framework:

set.seed(1650)
red_wine_dec_tree <- train(quality ~ .,
                           data = red_wine_train,
                           method = "rpart")
red_wine_dec_tree

Recall that the three main arguments you will need to include in your train function are:

  • The relationship between the outcome variable and the feature variables (here quality ~ .)
  • The data set (here red_wine_train)
  • The method/algorithm to use (here "rpart")
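To see what the dot in a formula such as quality ~ . stands for, the base R sketch below expands the formula against a tiny, hypothetical data frame (toy_wine and its values are invented for illustration).

```r
# A tiny, hypothetical data frame with one outcome and two features
toy_wine <- data.frame(
  quality = factor(c(5, 6, 5, 7)),
  alcohol = c(9.4, 10.5, 9.8, 12.1),
  pH      = c(3.51, 3.20, 3.26, 3.16)
)

# The dot expands to every column in the data other than the outcome
feature_terms <- attr(terms(quality ~ ., data = toy_wine), "term.labels")
feature_terms
```

In other words, the formula tells R to use every remaining column as a feature variable, so there is no need to list them one by one.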

Run this code now, before proceeding to the next question.

2.2 Tuning Parameters

So far, we have been using the default tuning parameters for our decision tree model. While these often do a great job, sometimes intelligently tweaking the tuning parameters can make all the difference. Each machine learning model has a different set of tuning parameters, and there are dozens of different models, but don’t worry - we’ll just focus on a select few.

We can specify changes to tuning parameters using the optional argument tuneGrid within the train function. This can be a complicated argument, since, as noted above, each machine learning model has its own set of tuning parameters.

The main tuning parameter for a decision tree model is cp - the complexity parameter, which has a default value of 0.01. The complexity parameter penalises the decision tree if it has too many branches. If the cp value is too low, your tree model may be overfitted, with an excessive number of branches. Conversely, if the cp value is too high, your tree model may look more like a stick, i.e. be too simplistic to be of any use.

The code chunk below contains partially completed code, with the tuneGrid argument incorporated into the train function.

red_wine_dec_tree_tuned <- train(... ,
                                 data = ... ,
                                 method = ... , 
                                 tuneGrid = expand.grid(cp = seq(0.001, 0.01, 0.001))
                                 )

Here:

  • We have specified that the cp values to test are \(0.001, 0.002, 0.003, \dots, 0.009, 0.01\).
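It can help to look at the grid on its own before passing it to train. The short sketch below builds the same grid as above and inspects it, using base R only.

```r
# Build the grid of candidate cp values on its own
cp_grid <- expand.grid(cp = seq(0.001, 0.01, 0.001))

# The result is a data frame with one column per tuning parameter
# and one row per candidate value
str(cp_grid)
nrow(cp_grid)
```

Each row of this data frame is one setting that will be assessed during resampling.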

Fill in the missing ... details in the code chunk above, and once you are happy with your code, try running it.

2.2.1

Check your results for your modified model - as a result of adjusting the cp tuning parameter, has the top accuracy of the model increased?

2.3 Resampling Methods

So far, we have used the default bootstrap resampling method boot when fitting all our machine learning models.
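As a minimal sketch of the idea behind bootstrap resampling, the base R code below draws one bootstrap resample of row numbers from a hypothetical data set of 100 rows. The variable names are illustrative only, and this is a simplified picture rather than the exact internals of train.

```r
set.seed(1650)
n_obs <- 100   # hypothetical number of training rows

# Draw row numbers with replacement: some rows appear more than once
boot_rows <- sample(seq_len(n_obs), size = n_obs, replace = TRUE)

# Rows never drawn are "out-of-bag" and can be used to assess the model
oob_rows <- setdiff(seq_len(n_obs), boot_rows)

length(unique(boot_rows))  # distinct rows used for training
length(oob_rows)           # rows held back for assessment
```

Repeating this process many times, and averaging the accuracy on the held-back rows, gives the resampled accuracy estimate.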

In fact, the train function accepts 14 different resampling options! Don’t worry though, we will just consider one alternative to the boot method in this lab, namely the cross-validation (cv) method.

The cv method follows the process we are employing with our training and validation data, albeit on a smaller scale - i.e. the machine learning model will take a subset of the red_wine_train data, train the model, and then test it on the remaining portion of the red_wine_train data. This is repeated once per fold, with a different portion held out each time. We can also specify the number of folds to use - 10 is often considered adequate.
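The fold structure described above can be sketched in base R as follows. This assumes a hypothetical data set of 50 rows and assigns folds purely at random, which is a simplification of what the caret package does.

```r
set.seed(1650)
n_obs <- 50   # hypothetical number of training rows
k <- 10       # number of folds

# Randomly assign each row to one of k equally sized folds
fold_id <- sample(rep(1:k, length.out = n_obs))

# Each fold is held out once, while the other k - 1 folds train the model
for (fold in 1:k) {
  held_out_rows <- which(fold_id == fold)
  training_rows <- which(fold_id != fold)
  cat("Fold", fold, ": train on", length(training_rows),
      "rows, test on", length(held_out_rows), "rows\n")
}
```

Every row is used for testing exactly once, so no observation is both trained on and tested on within the same fold.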

2.3.1

We can specify the machine learning resampling method to use within the train function via the argument trControl.

Take a look at the code below, and see if you can create a new decision tree model named red_wine_dec_tree_tuned_cv that uses the range of cp values specified in 2.2 and the cv resampling method (you will have to fill in the ... missing parts).

tr_control <- trainControl(method = "cv",
                           number = 10)

red_wine_dec_tree_tuned_cv <- train(quality ~ .,
                                    data = ... ,
                                    trControl = tr_control,
                                    method = ...,
                                    tuneGrid = ...
                                    )

2.4 Decision Tree Models

Compare your results for your decision tree models red_wine_dec_tree, red_wine_dec_tree_tuned and red_wine_dec_tree_tuned_cv. Which of your three approaches results in the best performing decision tree model? Note the specific tuning parameter value and resampling method that led to the best results.

2.4.1

Use the ggplot function to plot the accuracy of your three decision tree models for different cp values.

2.4.2

It might also help to visualise the decision trees - use the rpart.plot function to plot the results.

Do you notice any problems that might be leading to the poor predictive accuracy result?

Hint: Check the code below for a head-start on how to plot your results.

rpart.plot(red_wine_dec_tree$finalModel)

2.5 Random Forest Models

The decision tree is one of the simplest machine learning models we can use. Other methods often produce better results, but they also tend to take longer to train. For the following models, don’t worry if it takes a couple of minutes for your code to run - this is normal.

Using the code in 2.1 above as a guide, train a machine learning model using the random forest method rf.

Assign your output to the object red_wine_rf.

Hint: The method = "rpart" argument will need to be changed.

2.5.1

What is the best accuracy achieved by the random forest method?

2.5.2

The main tuning parameter for a random forest model is the mtry argument, which represents the number of feature variables randomly sampled as candidates at each split of each tree. Generally, the default values for this are quite good, but it’s often worth trying out a few different values, just to check.
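To make the role of mtry concrete, the base R sketch below mimics a single split: a random subset of mtry features is drawn, and only those would be considered as split candidates. The feature names are assumed to follow the list in 1.2 with spaces replaced by dots, and the code is a simplified illustration, not the randomForest implementation.

```r
# Assumed feature names, based on the list in 1.2
feature_names <- c("fixed.acidity", "volatile.acidity", "citric.acid",
                   "residual.sugar", "chlorides", "free.sulfur.dioxide",
                   "total.sulfur.dioxide", "density", "pH",
                   "sulphates", "alcohol")

set.seed(1650)
mtry <- 2

# At each split, only a random subset of mtry features is considered
split_candidates <- sample(feature_names, size = mtry)
split_candidates
```

A small mtry makes the individual trees more different from one another, while a large mtry makes each tree behave more like an ordinary decision tree.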

From your initial random forest model red_wine_rf, you may have found that an mtry value of 2 led to better results than mtry = 6 or mtry = 10. Therefore, let’s try some values around 2.

Use your code from 2.5 to fit a new random forest model that includes inside the train function the argument tuneGrid = expand.grid(mtry = c(1,3)) (we can skip mtry=2 since we already have results for this value). Call this new model red_wine_rf_tuned.

Note: The training of this model might take a few minutes, depending on your device’s hardware.

2.5.3

Using the code in 2.3.1 above as a guide, train a new random forest model that uses the cv resampling method and the tuning specifications used in 2.5.2.

Name your output red_wine_rf_tuned_cv.

2.5.4

Compare your results for your three random forest models red_wine_rf, red_wine_rf_tuned and red_wine_rf_tuned_cv. Which of your three approaches results in the best performing random forest model? Note the specific tuning parameter values and resampling method that led to the best results.

2.5.5

For random forest models, we can easily produce two helpful plots:

  • A plot of the model accuracy based on the number of feature variables used for each decision tree in the random forest model\(^{\dagger}\)
  • A plot of the importance of each feature variable in achieving an accurate model

Run the following R code to produce these plots now, for your initial random forest model:

ggplot(red_wine_rf)

dotPlot(varImp(red_wine_rf))

Note for the variable importance graph that the best (most helpful) variable is always given an importance of 100 and the worst is given an importance of 0. This is not to say that the variable with importance ranking 0 is useless - it is just considered less useful than the other variables.
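The 0 to 100 scale described above can be reproduced with a simple rescaling. The sketch below uses made-up raw importance scores, so the numbers are purely illustrative.

```r
# Hypothetical raw importance scores (not real model output)
raw_importance <- c(alcohol = 42.0, sulphates = 30.5, pH = 12.3, chlorides = 8.0)

# Rescale so the smallest score maps to 0 and the largest to 100
scaled_importance <- (raw_importance - min(raw_importance)) /
  (max(raw_importance) - min(raw_importance)) * 100

round(scaled_importance, 1)
```

Note that the feature scoring 0 here still had a positive raw score, which is why a scaled importance of 0 does not mean the variable is useless.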

What feature variables are considered most important?

\(^{\dagger}\) Recall from section 4.0.1 of the Introduction to Machine Learning in R supplement that random forest models are a type of ensemble method, and combine multiple decision trees together.

2.5.6

Create the plots discussed in 2.5.5 for your other two random forest models, and comment on your results.

2.6 Gradient Boosting Machine Models

Another ensemble method that combines multiple decision trees is the gradient boosting machine model. There are several options for training a gradient boosting machine model in R. We will use the stochastic gradient boosting method gbm.

Using the gbm argument within the train function, train a gradient boosting machine model, and name your model red_wine_boosted.

Again, don’t worry if it takes a couple of minutes for your code to run - this is normal.

Note: You can also include the argument verbose = FALSE within the train function so no calculations are shown while the model is being fitted. Alternatively, if you would like to see the model fitting in action in the R console, you can leave this argument out.

2.6.1

What is the best accuracy achieved by the gradient boosting machine method?

2.6.2

For the gradient boosting machine model, the main tuning parameters are:

  • n.trees (the number of iterations i.e. the number of decision trees used) and
  • interaction.depth (the complexity of the trees).

The code chunk below contains partially completed code for a tuned gradient boosting machine model, with the tuneGrid argument incorporated into the train function.

red_wine_boosted_tuned <- train(... ,
                                data = ... ,
                                method = ... , 
                                verbose = FALSE,
                                tuneGrid = expand.grid(interaction.depth = 3:6,
                                                       n.trees = seq(50, 200, 50),
                                                       shrinkage = 0.1,
                                                       n.minobsinnode = 10)
                               )

Here:

  • We have specified that the interaction depth can be between 3 and 6, rather than between 1 and 3 (usually, greater depth leads to better accuracy, although we have to be careful not to overfit).
  • We have also specified that the number of trees can be 50, 100, 150 or 200 (i.e. 200 trees is now our maximum, rather than 150).
  • Note that the arguments shrinkage and n.minobsinnode must be included within the expand.grid call supplied to tuneGrid, otherwise the training will fail. The values for these arguments are simply the default ones.
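To see how many parameter combinations this grid asks the model to assess, the grid can be built and inspected on its own, using base R only:

```r
# Build the tuning grid on its own
gbm_grid <- expand.grid(interaction.depth = 3:6,
                        n.trees = seq(50, 200, 50),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)

# 4 depths x 4 tree counts x 1 shrinkage x 1 node size
nrow(gbm_grid)
head(gbm_grid)
```

Every row is a separate combination to be assessed during resampling, which is why larger grids take noticeably longer to train.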

Fill in the missing ... details in the code chunk above, and once you are happy with your code, try running it.

2.6.3

Using the code in 2.3.1 above as a guide, train a new gradient boosting machine model that uses the cv resampling method and the tuning specifications used in 2.6.2.

Name your output red_wine_boosted_tuned_cv.

2.6.4

Compare your results for your three gradient boosting machine models red_wine_boosted, red_wine_boosted_tuned and red_wine_boosted_tuned_cv. Which of your three approaches results in the best performing gradient boosting machine model? Note the specific tuning parameter values and resampling method that led to the best results.

2.6.5

To conclude our focus on gradient boosting machine models, use the plot function to visualise the results of the training process, for your best performing gradient boosting machine model.

2.7 Additional Models

Our focus in this lab has been on tree-based machine learning models, as these are flexible and can perform well in a variety of contexts. However, there are plenty more models to choose from, and so we will briefly introduce a few new options below:

  • Linear Discriminant Analysis Model
  • Support Vector Machine Model
  • k-Nearest-Neighbour Model

For these models, we will use the default tuning parameters and resampling methods.

2.7.1 LDA

Fit a linear discriminant analysis model to your red_wine_train data via the method specification lda.

What is the best accuracy achieved by this method?

2.7.2 SVM

Fit a support vector machine model to your red_wine_train data via the method specification svmLinear (there are other options, but these are more complicated and take longer to execute).

What is the best accuracy achieved by this method?

Note: In order to fit an SVM model, we need to load the kernlab package. This should have been done already in 1.1, but if an error appears when fitting the model, just double-check this.

2.7.3 kNN

The final method we will try is the k-Nearest-Neighbours model, which is selected via the method specification knn. Fit a kNN model to your red_wine_train data.

What is the best accuracy achieved by this method?

3 Validating Results

Great job! We’ve now fitted 6 different types of machine learning models.

In addition to obtaining predictive accuracy estimates for each of our models, it is important to also check how the models perform when presented with our validation data. If the accuracy of the models remains similar, then we can be more confident in our models’ reported performances.

Recall that we carried out this check in Computer Lab 10B (see footnote 2).

An example application of this approach to the decision tree model results is shown below:

# Load magrittr package for piping
library(magrittr)

# Count the number of observations in the validation data
validation_numbers <- nrow(red_wine_validate)

# Use the fitted model to predict quality values given the validation data
predict_red_wine_dec_tree <- predict(red_wine_dec_tree, 
                                     newdata = red_wine_validate)
# Percentage of correct predictions, rounded to 2 decimal places.
# The parentheses matter: %>% binds more tightly than / and *, so without
# them only validation_numbers would be rounded.
dec_tree_accuracy <- (sum(predict_red_wine_dec_tree == red_wine_validate$quality) / 
                      validation_numbers * 100) %>% round(2)
dec_tree_accuracy
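A toy version of the accuracy calculation above may make the logic easier to follow. The sketch below uses two short, made-up vectors of quality scores in place of real predictions, and also shows a simple cross-tabulation of predicted against actual values.

```r
# Hypothetical actual and predicted quality scores (not real model output)
actual    <- factor(c(5, 5, 6, 6, 7), levels = 5:7)
predicted <- factor(c(5, 6, 6, 6, 5), levels = 5:7)

# Proportion of matching entries, expressed as a rounded percentage
accuracy_pct <- round(mean(predicted == actual) * 100, 2)
accuracy_pct

# Cross-tabulation: correct predictions sit on the diagonal
table(Predicted = predicted, Actual = actual)
```

The cross-tabulation shows which quality scores are being confused with one another, which a single accuracy figure cannot reveal.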

3.1

For each of the different types of models you used in 2, select the model that had the best performance. For example, if your red_wine_dec_tree_tuned was the best performing model out of the three decision tree models, select that model. This should lead you to have 6 selected models overall - one for each model type.

Following the process outlined above in 3, use your red_wine_validate data to assess the predictive accuracy of your 6 selected models.

Which of these models has the best performance with the validation data?

4 Summarising Results

Now that we have tried fitting and validating a selection of popular machine learning models, we might like to summarise our findings; this can make it easier to compare the results of the different methods.

4.1

Machine learning models trained using the train function test combinations of tuning parameters to try to find a combination leading to an accurate prediction. By default, 25 resamples are conducted on the training data, using different tuning parameter combinations - you might have noticed the line Resampling: Bootstrapped (25 reps) in the model output.

However, if we have specified the cv resampling method, only 10 resamples will have been taken.

Using the incomplete R code in the Code chunk below, create two lists to summarise the results of your different machine learning models:

Note: We need two lists, one for bootstrap resample results, and one for cross-validation resample results, since the number of resamples used for the two methods differs.

results_boot <- resamples(list(decision_tree = red_wine_dec_tree, 
                               decision_tree_tuned = red_wine_dec_tree_tuned,
                               random_forest = red_wine_rf, 
                               ...
                               )
                          )

results_cv <- resamples(list(decision_tree_tuned_cv = red_wine_dec_tree_tuned_cv,
                             random_forest_tuned_cv = red_wine_rf_tuned_cv,
                             ...
                             )
                        )

summary(results_boot)
...

Note: The column of interest here is actually the Mean column (the average accuracy achieved by the model over all the resamples), not the Max. column.

4.2

Apply the dotplot function to the results_boot and results_cv objects to plot the range of accuracy values for the different methods.

4.3

To conclude, write a brief, simple summary of your findings, explaining your process and the results of your machine learning models. Overall, based on the results obtained and the training processes involved, do you have a preference for one of the machine learning models we have used in this computer lab?


Well done, that concludes our work in machine learning.

Hopefully this lab has enhanced your understanding of how to use R for machine learning. This is just the beginning - there are so many different models and methods out there!


References

Cortez, P., A. Cerdeira, F. Almeida, T. Matos, and J. Reis. 2009. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” Decision Support Systems 47 (4): 547–53.
Thulin, M. 2021. Modern Statistics with R: From Wrangling and Exploring Data to Inference and Predictive Modelling.
UCI Machine Learning Repository. 2009. “Wine Quality Data Set[.csv File].” 2009. https://archive.ics.uci.edu/ml/datasets/Wine+Quality.


These notes have been prepared by Rupert Kuveke. Please note that some of the content in these notes has been developed from content in Thulin (2021). The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.


  1. This data was obtained from the UCI Machine Learning Repository (2009) and originally collected by Cortez et al. (2009).

  2. This approach was also demonstrated in section 4.1 of the Introduction to Machine Learning in R supplement.

---
title: "STM1001: Computer Lab 11B"
output:
  bookdown::html_document2: 
    toc: true
    toc_float: true
    code_download: true
    theme: readable
    code_folding: show
bibliography: STM1001_DS_CL_references.bib 
link-citations: yes
---

<style>
#TOC {
  background: url("https://www.latrobe.edu.au/_media/la-trobe-api/v5/img/logo.svg");
  background-size: contain;
  padding-top: 80px !important;
  background-repeat: no-repeat;
}
</style>

### Data Science Module {-}

### Topic 11B: Machine Learning II {-}

<br>

Welcome to the eleventh computer lab for the Data Science module. This will be our second lab focusing on machine learning.

In this computer lab we will fit a variety of machine learning models to the Portuguese *Vinho Verde* wine data^[This data was obtained from the @UCIWine and originally collected by @wine.] we began analysing in [Computer Lab 10B](https://rpubs.com/LTU_STM1001/DSMCL10_S). We will also cover how to tune machine learning parameters in order to achieve better results.

This computer lab is designed to run alongside the content in the [Introduction to Machine Learning in R supplement](https://bookdown.org/rehk/stm1001_dsm_t1_introduction_to_machine_learning_in_r/). The material in this supplement provides all the background information on machine learning and machine learning terminology you will need to complete this lab. 

By the end of this lab, you will have gained a greater understanding of machine learning in R, and should feel comfortable conducting the processes involved in training and improving a variety of important machine learning models.

<br>

# Preparations {#prep}

Before we proceed, please make sure you have read the content in the [Introduction to Machine Learning in R supplement](https://bookdown.org/rehk/stm1001_dsm_t1_introduction_to_machine_learning_in_r/) and completed [Computer Lab 10B](https://rpubs.com/LTU_STM1001/DSMCL10_S). It may also be helpful to:

* Use the same working directory as the one you used when you completed Computer Lab 10B
* Keep the supplement content and the [Computer Lab 10B solutions](https://rpubs.com/LTU_STM1001/DSMCL10Sol_S) open in separate tabs while you work through this lab material

## Load Required Packages {#load}

Several of the R packages we require for our machine learning processes should be already installed on your PC, but we will also be using some new packages, so please make sure to run the R code in the `Code` chunk below. This code will install any missing packages, and then load all the packages needed for this lab:

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
# Specify required packages
ml_packages <- c("caret", "gbm", "kernlab", "magrittr", "randomForest", "rpart.plot")
# Install missing packages
install.packages(setdiff(ml_packages, rownames(installed.packages())))
# Load all packages
lapply(ml_packages, library, character.only = TRUE)

```

```{r class.source = "fold-show", eval = T, include = F}
# Load packages
ml_packages <- c("caret", "gbm", "kernlab", "magrittr", "randomForest", "rpart", "rpart.plot")

install.packages(setdiff(ml_packages, rownames(installed.packages())))
lapply(ml_packages, library, character.only = TRUE)
```

## Wine Data

Recall that the Portuguese *Vinho Verde* wine data consists of 11 feature variables relating to physicochemical aspects of the wine, namely:

* fixed acidity
* volatile acidity
* citric acid
* residual sugar
* chlorides
* free sulfur dioxide
* total sulfur dioxide
* density
* pH
* sulphates
* alcohol

The only available outcome variable is `quality`, which is an integer score from 0 to 10 (with 0 denoting a terrible wine and 10 denoting an exceptional wine). Note that no wine actually receives an extreme rating of 0 or 10.

This data is split into two sets - one for red wine and one for white wine. 
As part of [Computer Lab 10B](https://rpubs.com/LTU_STM1001/DSMCL10_S), you should have already downloaded the red wine data set - we will continue to analyse this data set in this lab. If you do not have this file saved on your device, please download  `winequality_red.csv` from the LMS now.

## Aim

In [Computer Lab 10B](https://rpubs.com/LTU_STM1001/DSMCL10_S), we pre-processed our red wine data, split the data into training and validation data sets, and trained and validated a simple decision tree model. 

The best accuracy achieved was 57.37% (using the training data) and 52.05% (using the validation data).

Our aim now is to use our pre-processed training and validation data sets to train other machine learning models, in the hope that one or more of these new models will outperform the decision tree model.

We will also extend our skills using the `train` function, and try tweaking the tuning parameters for the different models in order to further improve our results, rather than simply using the default settings.

In the interests of time, and since you may not have your code from the previous computer lab to hand, please run the R code in the `Code` chunk below now. This `Code` chunk contains all the relevant R code to get us up and running for the subsequent steps.

*Note: You must have the `winequality_red.csv` saved in your current working directory for the code below to work.*

```{r class.source = "fold-hide", eval = F, echo = T, warning = F, message = F}
# Read in the red wine data and treat quality as a categorical outcome
red_wine <- read.csv(file = "winequality_red.csv", header = TRUE)
red_wine$quality <- as.factor(red_wine$quality)
# Centre and scale the 11 feature variables (column 12 is quality)
centre_scale <- preProcess(red_wine[, -12], 
                           method = c("center", "scale"))
red_wine_updated <- predict(centre_scale, red_wine)
# Split into training (80%) and validation (20%) sets, stratified by quality
set.seed(1650)
wine_train_index <- createDataPartition(red_wine_updated$quality, 
                                        p = 0.8, 
                                        list = FALSE, times = 1) 
red_wine_train <- red_wine_updated[wine_train_index, ]
red_wine_validate <- red_wine_updated[-wine_train_index, ]
```

# Machine Learning Models {#ml}

The focus here is on having some fun and familiarising ourselves with the R code required to fit different machine learning models - you are not expected to understand or explain all the mathematics behind these models.

We will fit the following types of machine learning model:

* Decision tree
* Random Forest
* Gradient boosting machine
* LDA
* SVM
* k-Nearest-Neighbour

Each of these models can be fitted using the `train` function from the `caret` package (although some will also require the use of other packages). 

## {#dectree}

Let's take a look at the code we used for our decision tree model from the previous lab, to refresh our memory on the basic `train` code framework:

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
set.seed(1650)
red_wine_dec_tree <- train(quality ~ .,
                           data = red_wine_train,
                           method = "rpart")
red_wine_dec_tree
```

Recall that the three main arguments you will need to include in your `train` function are:

* The relationship between the outcome variable and the feature variables (here `quality ~ .`)
* The data set (here `red_wine_train`)
* The method/algorithm to use (here `"rpart"`)

Run this code now, before proceeding to the next question.

## Tuning Parameters {#tp}

So far, we have been using the default tuning parameters for our decision tree model. While these often do a great job, sometimes intelligently tweaking the tuning parameters can make all the difference. Each machine learning model has a different set of tuning parameters, and there are dozens of different models, but don't worry - we'll just focus on a select few.

We can specify changes to tuning parameters using the optional argument `tuneGrid` within the `train` function. This can be a complicated argument, since, as noted above, each machine learning model has its own set of tuning parameters.

The main tuning parameter for a decision tree model is `cp` - the complexity parameter, which has a default value of 0.01. The complexity parameter penalises the decision tree if it has too many branches. If the `cp` value is too low, your tree model may be overfitted, with an excessive number of branches. Conversely, if the `cp` value is too high, your tree model may look more like a stick, i.e. be too simplistic to be of any use.

The code chunk below contains partially completed code, with the `tuneGrid` argument incorporated into the `train` function.

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
red_wine_dec_tree_tuned <- train(... ,
                                 data = ... ,
                                 method = ... , 
                                 tuneGrid = expand.grid(cp = seq(0.001, 0.01, 0.001))
                                 )
```

Here:

* We have specified that the `cp` values to test are $0.001, 0.002, 0.003, \dots, 0.009, 0.01$.

Fill in the missing `...` details in the code chunk above, and once you are happy with your code, try running it.

###

Check your results for your modified model - as a result of adjusting the `cp` tuning parameter, has the top accuracy of the model increased?

## Resampling Methods {#resample}

So far, we have used the default bootstrap resampling method `boot` when fitting all our machine learning models.

In fact, the `train` function accepts 14 different resampling options! Don't worry though, we will just consider one alternative to the `boot` method in this lab, namely the cross-validation (`cv`) method.

The `cv` method follows the process we are employing with our training and validation data, albeit on a smaller scale - i.e. the `red_wine_train` data is split into a number of roughly equal subsets (folds), and the model is trained on all but one fold and then tested on the remaining fold, with each fold taking a turn as the test portion. We can also specify the number of folds to use - 10 is often considered adequate.
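
To make the idea concrete, the sketch below uses base R only to mimic how 10-fold cross-validation partitions a data set. The 100 observations and the object names (`n_obs`, `fold_id`, and so on) are hypothetical, and `train` handles all of this internally, so you do not need this code to complete the lab:

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
set.seed(1001)
n_obs <- 100
n_folds <- 10

# Randomly assign each observation to one of 10 folds of equal size
fold_id <- sample(rep(seq_len(n_folds), length.out = n_obs))

# For fold 1: rows held out for testing vs rows used for training
test_rows  <- which(fold_id == 1)
train_rows <- which(fold_id != 1)
length(test_rows)
length(train_rows)

# Every observation is held out exactly once across the 10 folds
table(fold_id)
```

The accuracy reported for a `cv` model is the average of the accuracies obtained on the held-out folds.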

### {#trControl}

We can specify the machine learning resampling method to use within the `train` function via the argument `trControl`.

Take a look at the code below, and see if you can create a new decision tree model named `red_wine_dec_tree_tuned_cv` that uses the range of `cp` values specified in \@ref(tp) and the `cv` resampling method (you will have to fill in the `...` missing parts).

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
tr_control <- trainControl(method = "cv",
                           number = 10)

red_wine_dec_tree_tuned_cv <- train(quality ~ .,
                                    data = ... ,
                                    trControl = tr_control,
                                    method = ...,
                                    tuneGrid = ...
                                    )
```

## Decision Tree Models

Compare your results for your decision tree models `red_wine_dec_tree`, `red_wine_dec_tree_tuned` and `red_wine_dec_tree_tuned_cv`.
Which of your three approaches results in the best performing decision tree model?
Note the specific tuning parameter value and resampling method that led to the best results.

###

Use the `ggplot` function to plot the accuracy of your three decision tree models for different `cp` values.

###

It might also help to visualise the decision trees - use the `rpart.plot` function to plot the results. 

Do you notice any problems that might be leading to the poor predictive accuracy result? 

*Hint: Check the code below for a head-start on how to plot your results.*

```{r class.source = "fold-hide", eval = F, echo = T, warning = F, message = F}
rpart.plot(red_wine_dec_tree$finalModel)
```

## Random Forest Models {#rf}

The decision tree is one of the simplest machine learning models we can use. Other methods often produce better results, but they also tend to take longer to train.
For the following models, don't worry if it takes a couple of minutes for your code to run - this is normal.

Using the code in \@ref(dectree) above as a guide, train a machine learning model using the random forest method `rf`. 

Assign your output to the object `red_wine_rf`.

*Hint: The `method = "rpart"` specification will need to be changed.*

###

What is the best accuracy achieved by the random forest method?

### {#rftuned}

The main tuning parameter for a random forest model is `mtry`, which represents the number of feature variables randomly sampled as candidates at each split of a tree.
Generally, the default values for this are quite good, but it's often worth trying out a few different values, just to check. 

From your initial random forest model `red_wine_rf`, you may have found that an `mtry` value of 2 led to better results than `mtry = 6` or `mtry = 10`. Therefore, let's try some values around 2.

Use your code from \@ref(rf) to fit a new random forest model that includes inside the `train` function the argument
`tuneGrid = expand.grid(mtry = c(1, 3))` (we can skip `mtry = 2` since we already have results for this value). Call this new model `red_wine_rf_tuned`.

*Note: The training of this model might take a few minutes, depending on your device's hardware.*

### 

Using the code in \@ref(trControl) above as a guide, train a new random forest model that uses the `cv` resampling method and the tuning specifications used in \@ref(rftuned).

Name your output `red_wine_rf_tuned_cv`.

###

Compare your results for your three random forest models `red_wine_rf`, `red_wine_rf_tuned` and `red_wine_rf_tuned_cv`.
Which of your three approaches results in the best performing random forest model? 
Note the specific tuning parameter values and resampling method that led to the best results.

### {#rfplots}

For random forest models, we can easily produce two helpful plots:

* A plot of the model accuracy against the number of feature variables considered at each split (`mtry`) of the decision trees in the random forest model$^{\dagger}$
* A plot of the importance of each feature variable in achieving an accurate model

Run the following R code to produce these plots now, for your initial random forest model:

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
ggplot(red_wine_rf)

dotPlot(varImp(red_wine_rf))
```

Note for the variable importance graph that the best (most helpful) variable is always given an importance of 100 and the worst is given an importance of 0. This is not to say that the variable with importance ranking 0 is useless - it is just considered less useful than the other variables.
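
The 0 to 100 scale can be reproduced by hand. As a rough sketch, assuming the scaling is a simple min-max rescaling of the raw importance scores, and using made-up raw scores (the numbers and the name `raw_importance` are for illustration only, not results from the wine data):

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
# Hypothetical raw importance scores for four feature variables
raw_importance <- c(alcohol = 85.2, sulphates = 60.1, pH = 31.4, density = 42.7)

# Rescale so the lowest score becomes 0 and the highest becomes 100
scaled_importance <- 100 * (raw_importance - min(raw_importance)) /
  (max(raw_importance) - min(raw_importance))

round(scaled_importance, 1)
```

In this toy example the variable with the lowest raw score is mapped to 0 even though its raw score is well above zero, which is why a scaled importance of 0 should not be read as "useless".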

What feature variables are considered most important?

$^{\dagger}$ *Recall from section 4.0.1 of the [Introduction to Machine Learning in R supplement](https://bookdown.org/rehk/stm1001_dsm_t1_introduction_to_machine_learning_in_r/machine-learning-model-classes.html#tree-models) that random forest models are a type of ensemble method, and combine multiple decision trees together.*

###

Create the plots discussed in \@ref(rfplots) for your other two random forest models, and comment on your results.

## Gradient Boosting Machine Models

Another ensemble method that combines multiple decision trees is the gradient boosting machine model.
There are several options for training a gradient boosting machine model in R. 
We will use the stochastic gradient boosting method `gbm`. 

Using the method specification `gbm` within the `train` function, train a gradient boosting machine model, and name your model `red_wine_boosted`.

Again, don't worry if it takes a couple of minutes for your code to run - this is normal.

*Note: You can also include the argument `verbose = F` within the `train` function so no calculations are shown while the model is being fitted. Alternatively, if you would like to see the model fitting in action in the R console, you can leave this argument out.*

###

What is the best accuracy achieved by the gradient boosting machine method?

### {#gbmtuned}

For the gradient boosting machine model, the main tuning parameters are:

* `n.trees` (the number of iterations i.e. the number of decision trees used) and 
* `interaction.depth` (the complexity of the trees).

The code chunk below contains partially completed code for a tuned gradient boosting machine model, with the `tuneGrid` argument incorporated into the `train` function.

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
red_wine_boosted_tuned <- train(... ,
                                data = ... ,
                                method = ... , 
                                verbose = FALSE,
                                tuneGrid = expand.grid(interaction.depth = 3:6,
                                                       n.trees = seq(50, 200, 50),
                                                       shrinkage = 0.1,
                                                       n.minobsinnode = 10)
                               )
```

Here:

* We have specified that the interaction depth can be between 3 and 6, rather than between 1 and 3 (usually, greater depth leads to better accuracy, although we have to be careful not to overfit). 
* We have also specified that the number of trees can be 50, 100, 150 or 200 (i.e. 200 trees is now our maximum, rather than 150).
* Note that the tuning parameters `shrinkage` and `n.minobsinnode` must also be included in the `expand.grid` call supplied to `tuneGrid`, otherwise the training will fail.
The values for these arguments are simply the default ones.
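
It can be helpful to check how many parameter combinations a grid contains before training, since every combination is fitted on every resample. The sketch below rebuilds the grid from the code chunk above on its own (the name `gbm_grid` is just an illustrative choice):

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
# Every combination of the listed values becomes one row of the grid
gbm_grid <- expand.grid(interaction.depth = 3:6,
                        n.trees = seq(50, 200, 50),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)

# 4 depths x 4 tree counts x 1 shrinkage value x 1 node size
nrow(gbm_grid)
head(gbm_grid)
```

Adding further values for `shrinkage` or `n.minobsinnode` would multiply the number of rows, and hence the training time, accordingly.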

Fill in the missing `...` details in the code chunk above, and once you are happy with your code, try running it.

### 

Using the code in \@ref(trControl) above as a guide, train a new gradient boosting machine model that uses the `cv` resampling method and the tuning specifications used in \@ref(gbmtuned).

Name your output `red_wine_boosted_tuned_cv`.

###

Compare your results for your three gradient boosting machine models `red_wine_boosted`, `red_wine_boosted_tuned` and `red_wine_boosted_tuned_cv`.
Which of your three approaches results in the best performing gradient boosting machine model? 
Note the specific tuning parameter values and resampling method that led to the best results.

###

To conclude our focus on gradient boosting machine models, use the `plot` function to visualise the results of the training process for your best performing gradient boosting machine model.

## Additional Models

Our focus in this lab has been on tree-based machine learning models, as these are flexible and can perform well in a variety of contexts. However, there are plenty more models to choose from, and so we will briefly introduce a few new options below:

* Linear Discriminant Analysis Model
* Support Vector Machine Model
* k-Nearest-Neighbour Model

For these models, we will use the default tuning parameters and resampling methods.

### LDA

Fit a linear discriminant analysis model to your `red_wine_train` data via the method specification `lda`.

What is the best accuracy achieved by this method?

### SVM

Fit a support vector machine model to your `red_wine_train` data via the method specification `svmLinear` (there are other options, but these are more complicated and take longer to execute). 

What is the best accuracy achieved by this method?
 
*Note: In order to fit an SVM model, we need to load the `kernlab` package. This should have been done already in \@ref(load), but if an error appears when fitting the model, just double-check this.*

### kNN

The final method we will try is the k-Nearest-Neighbours model, which is selected via the method specification `knn`. Fit a kNN model to your `red_wine_train` data.

What is the best accuracy achieved by this method?

# Validating Results {#val}

Great job! We've now fitted 6 different types of machine learning models. 

In addition to obtaining predictive accuracy estimates for each of our models, it is important to also check how the models perform when presented with our validation data.
If the accuracy of the models remains similar, then we can be more confident in our models' reported performances.

Recall that we carried out this check in [Computer Lab 10B](https://rpubs.com/LTU_STM1001/DSMCL10_S)^[This approach was also demonstrated in [section 4.1 of the Introduction to Machine Learning in R supplement](https://bookdown.org/rehk/stm1001_dsm_t1_introduction_to_machine_learning_in_r/machine-learning-model-classes.html#example---gradient-boosting-machine-model).].

An example application of this approach to the decision tree model results is shown below:

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
# Load magrittr package for piping
library(magrittr)

# Count the number of observations in the validation data
validation_numbers <- nrow(red_wine_validate)

# Use the fitted model to predict quality values given the validation data
predict_red_wine_dec_tree <- predict(red_wine_dec_tree, 
                                     newdata = red_wine_validate)

# When run, the code below gives us the percentage of correct predictions.
# The parentheses matter: %>% binds more tightly than / and *, so without
# them only validation_numbers would be rounded
dec_tree_accuracy <- (sum(predict_red_wine_dec_tree == red_wine_validate$quality) / 
                      validation_numbers) %>% round(2) * 100
dec_tree_accuracy
```
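
A note on operator precedence when mixing the pipe with arithmetic: in R, `%>%` is evaluated before `/` and `*`, so a trailing `%>% round(2)` only applies to the value immediately to its left unless parentheses are used. The sketch below uses hypothetical counts (`n_correct` and `n_total`) rather than the wine data:

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
library(magrittr)

n_correct <- 2
n_total <- 3

# Without parentheses, only n_total is rounded, so the result is unrounded
unrounded <- n_correct / n_total %>% round(2) * 100

# With parentheses, the proportion is rounded before converting to a percentage
rounded <- (n_correct / n_total) %>% round(2) * 100

# An equivalent base R version without the pipe
rounded_base <- round(n_correct / n_total, 2) * 100

c(unrounded, rounded, rounded_base)
```

When adapting the validation code to your other models, it is worth checking that the accuracy you obtain has actually been rounded as intended.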

##

For each of the different types of models you used in \@ref(ml), select the model that had the best performance. For example, if your `red_wine_dec_tree_tuned` was the best performing model out of the three decision tree models, select that model. This should lead you to have 6 selected models overall - one for each model type.

Following the process outlined above in \@ref(val), use your `red_wine_validate` data to assess the predictive accuracy of your 6 selected models.

Which of these models has the best performance with the validation data?

# Summarising Results {#sum}

Now that we have tried fitting and validating a selection of popular machine learning models, we might like to summarise our findings; this can make it easier to compare the results of the different methods.

##

Machine learning models trained using the `train` function test combinations of tuning parameters to try to find a combination leading to an accurate prediction. By default, each combination is evaluated on 25 bootstrap *resamples* of the training data - you might have noticed the line `Resampling: Bootstrapped (25 reps)` in the model output.

However, if we have specified the `cv` resampling method, only 10 *resamples* will have been taken.

Using the incomplete R code in the `Code` chunk below, create two lists to summarise the results of your different machine learning models:

**Note: We need two lists, one for bootstrap resample results, and one for cross-validation resample results, since the number of resamples used for the two methods differs.**

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
results_boot <- resamples(list(decision_tree = red_wine_dec_tree, 
                               decision_tree_tuned = red_wine_dec_tree_tuned,
                               random_forest = red_wine_rf, 
                               ...
                               )
                          )

results_cv <- resamples(list(decision_tree_tuned_cv = red_wine_dec_tree_tuned_cv,
                             random_forest_tuned_cv = red_wine_rf_tuned_cv,
                             ...
                             )
                        )

summary(results_boot)
...
```

*Note: The column of interest here is actually the `Mean` column (the average accuracy achieved by the model over all the resamples), not the `Max.` column.*
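
If it helps to see `resamples` run from start to finish before working with the wine models, the sketch below applies it to two small models trained on R's built-in `iris` data. Everything here is a stand-in chosen for illustration: the data set, the use of 5 folds, and the object names (`toy_control`, `toy_tree`, `toy_knn`, `toy_results`).

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
library(caret)

# Both models must use the same number of resamples to be compared
toy_control <- trainControl(method = "cv", number = 5)

# Resetting the seed before each call gives both models the same folds
set.seed(1001)
toy_tree <- train(Species ~ ., data = iris,
                  method = "rpart", trControl = toy_control)
set.seed(1001)
toy_knn <- train(Species ~ ., data = iris,
                 method = "knn", trControl = toy_control)

# Collect the per-resample results and summarise them
toy_results <- resamples(list(tree = toy_tree, knn = toy_knn))
summary(toy_results)
```

The summary reports one row per model for each metric, with the `Mean` column giving the average over the 5 folds.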

##

Apply the `dotplot` function to the `results_boot` and `results_cv` objects to plot the range of accuracy values for the different methods.

##

To conclude, write a brief, simple summary of your findings, explaining your process and the results of your machine learning models.
Overall, based on the results obtained and the training processes involved, do you have a preference for one of the machine learning models we have used in this computer lab?

<br>

#### Well done, that concludes our work in machine learning. #### {-}

Hopefully this lab has enhanced your understanding of how to use R for machine learning. This is just the beginning - there are so many different models and methods out there!

<br>

# References {- #Ref}
<div id="refs"></div>

<br>

<font color = "grey">
These notes have been prepared by Rupert Kuveke. Please note that some of the content in these notes has been developed from content in @ModStat. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/CC" target="_blank"> BY-NC-ND. </a>
</font>