Data Science Module

Topic 10B: Machine Learning I


Welcome to the tenth computer lab for the Data Science module. In this computer lab we will fit our first machine learning model to real data.

This computer lab is designed to run alongside the content in the Introduction to Machine Learning in R supplement. The material in this supplement provides all the background information on machine learning and machine learning terminology you will need to complete this lab.

By the end of this lab, you should be comfortable preparing data for supervised machine learning tasks, and have a better understanding of how to assess the performance of a machine learning model. Let’s get started!


1 Preparations

Before we proceed, please make sure you have read the content in the Introduction to Machine Learning in R supplement - if you haven’t, these lab questions will be unnecessarily difficult to understand and complete. It may also be helpful to keep this content open in a separate tab while you work through the lab material.

1.1 Load Required Packages

In order to conduct our machine learning processes in R in this lab and the subsequent lab, we will need to install and load several R packages, chief among which is the caret package (Kuhn et al. 2021).

Run the R code below to install and load the R packages required for this lab:

# Install packages
# Package names must be combined into a single character vector with c()
install.packages(c("caret", "magrittr", "rpart.plot"))

# Load packages
library(caret)
library(rpart.plot)

1.2 Wine Data

For our machine learning work, we will assess data on Portuguese Vinho Verde wine, obtained from the UCI Machine Learning Repository (2009) and originally collected by Cortez et al. (2009). This is a real data set that has been referenced in dozens of academic research articles.

This data consists of 11 feature variables relating to physicochemical aspects of the wine, namely:

  • fixed acidity
  • volatile acidity
  • citric acid
  • residual sugar
  • chlorides
  • free sulfur dioxide
  • total sulfur dioxide
  • density
  • pH
  • sulphates
  • alcohol

For privacy reasons, other feature and outcome variables like grape_type, brand and sale price are not available (but would be very interesting to consider!).

The only available outcome variable is quality, which is an integer score from 0 to 10 (with 0 denoting a terrible wine and 10 denoting an exceptional wine).

This data is split into two sets - one for red wine and one for white wine. We will assess only the red wine data set.

1.3

You can download the file winequality_red.csv from LMS - please do so now.

Once you have downloaded this data set, load the red wine data into R and save it in the object red_wine.

Hint: You can check the Code chunk below for the code to do this. This code assumes your data is saved in your local working directory.

red_wine <- read.csv(file = "winequality_red.csv", header = TRUE)

1.4 Aim

Our aim is to train a machine learning model which can accurately predict the quality of a red wine, based on feature variable inputs.

As this is a classification problem, we need to ensure that the quality variable values are treated as factors, rather than as numbers (otherwise our model might end up predicting quality scores of e.g. 7.25).

Therefore, make sure to run the following R code before proceeding further:

red_wine$quality <- as.factor(red_wine$quality)
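
As a small illustration of why this conversion matters, the sketch below uses a made-up vector of scores (toy_quality is hypothetical and not part of the wine data) to show how R treats the same values before and after as.factor:

```r
# Hypothetical quality scores, invented purely for illustration
toy_quality <- c(5, 6, 5, 7, 6, 5)

# As numbers, R will happily compute non-integer summaries such as a mean
class(toy_quality)
mean(toy_quality)

# As a factor, each distinct score becomes a category (level)
toy_factor <- as.factor(toy_quality)
class(toy_factor)
levels(toy_factor)

# Counts per category, which is what a classification model works with
table(toy_factor)
```

Once the values are stored as a factor, a classification model can only ever predict one of the listed levels.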

2 Data Visualisation

Before we start analysing the data, it is prudent to take a quick look at it.

2.1

To begin, use the head, summary and dim R functions to obtain details about the composition of the data. Do you notice any interesting details?

2.2

Use the plot function to produce a bar plot of the quality variable values - what do you notice?

Based on the observed values for this variable, can you think of any potential problems we might encounter when trying to predict certain quality values?

Note: It’s ok if you’re not sure about this yet - after all, we have only just started learning about machine learning.

2.3

We can use the featurePlot function from the caret package to produce box plots for all the feature variables in our data set. Run the code below to do this:

featurePlot(x = red_wine[, -12], 
            y = red_wine$quality, 
            plot = "box")

2.4

You will notice that it’s hard to see most of the box plots, because two feature variables, free.sulfur.dioxide and total.sulfur.dioxide, have much larger values than the other feature variables. While this isn’t necessarily a problem (we can just produce box plots for each of the feature variables in turn), there is a method to address this, which we will discuss in the next section.

For the moment however, try to recreate these box plots, without including the feature variables free.sulfur.dioxide and total.sulfur.dioxide.

Hint: Note how in the featurePlot code above, we have used x = red_wine[, -12] to specify that the 12th column (quality) should not be included in the plotted x variables - you can use this as a guide on how to ignore free.sulfur.dioxide and total.sulfur.dioxide.
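
The negative indexing idea in this hint can be sketched on a small made-up data frame. The object toy and its column names are hypothetical, so you will still need to work out the correct column positions for the wine data yourself:

```r
# Hypothetical data frame with four columns, invented for illustration
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Drop a single column by position (here the 4th column)
toy[, -4]

# Drop several columns at once by wrapping their positions in c()
toy_subset <- toy[, -c(2, 4)]
names(toy_subset)

# The position of a column can be looked up by name with which()
which(names(toy) == "b")
```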

3 Pre-Processing

Before we begin fitting a machine learning model, we should conduct some pre-processing checks, as outlined in Section 3.3 of the Introduction to Machine Learning in R supplement.

3.1

First, note that all our feature variables are numeric - therefore there is no need to create any dummy variables.

3.2

We can use the function nearZeroVar from the caret package to obtain details on the freqRatio and percentUnique values for each of the variables in our red_wine data set.

Recall that the nearZeroVar function can include additional arguments, freqCut and uniqueCut, that specify cut-off values for the freqRatio and percentUnique results respectively.
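
To see how these arguments behave, here is a minimal sketch on a made-up data frame (toy and its columns are hypothetical). It uses the default cut-off values of caret's nearZeroVar function (freqCut = 95/5 and uniqueCut = 10) rather than the values requested in the next question:

```r
library(caret)

# Hypothetical data: one column that is almost constant, one that varies freely
toy <- data.frame(mostly_constant = c(rep(0, 98), 1, 2),
                  varied = seq_len(100))

# saveMetrics = TRUE returns a table with freqRatio, percentUnique, zeroVar and nzv
toy_metrics <- nearZeroVar(toy, freqCut = 95/5, uniqueCut = 10, saveMetrics = TRUE)
toy_metrics

# saveMetrics = FALSE returns only the positions of the flagged columns
toy_flagged <- nearZeroVar(toy, freqCut = 95/5, uniqueCut = 10, saveMetrics = FALSE)
toy_flagged
```

A column is flagged only when its freqRatio is above freqCut and its percentUnique is at or below uniqueCut, so here only the almost-constant column should be flagged.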

Use the nearZeroVar function to assess the feature variables in the red_wine data set, and specify a cut-off value of 2 for the freqRatio values and a cut-off value of 5 for the percentUnique values. Run this function twice, once with saveMetrics = TRUE and once with saveMetrics = FALSE.

Hint: If you are not sure how to proceed, check section 3.3.2 of the Introduction to Machine Learning in R supplement.

3.3

Based on the nearZeroVar function results, check for potentially problematic variables. Which feature variable has the highest freqRatio value, and which feature variable has the lowest percentUnique value? What do you conclude?

3.4

Next, we should check for correlated feature variables. Generally, some correlation is normal and expected, but often it is beneficial to remove highly correlated feature variables from our data.

Compute a correlation matrix for the red_wine feature variables, check for extreme correlations close to 1 in magnitude, and use the summary function to assess the spread of correlation values.

Hint: You can follow the steps in section 3.3.2.1 of the Introduction to Machine Learning in R supplement for this question.
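
The general pattern can be sketched on made-up data. This is only an illustration and may differ from the steps in the supplement. The data frame toy is hypothetical, with y deliberately constructed from x so that one pair of variables is highly correlated:

```r
set.seed(1)  # arbitrary seed so the toy data is reproducible

# Hypothetical features: y is built from x, z is unrelated noise
toy <- data.frame(x = rnorm(50))
toy$y <- 2 * toy$x + rnorm(50, sd = 0.1)
toy$z <- rnorm(50)

# Pairwise correlations between all columns
toy_cor <- cor(toy)
round(toy_cor, 2)

# The matrix is symmetric with 1s on the diagonal, so summarise
# only the values above the diagonal to avoid double counting
summary(toy_cor[upper.tri(toy_cor)])
```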

3.5

What are the largest negative and positive correlation values? Do these seem problematic?

3.6

Based on your calculations, do you think we need to use the findCorrelation function to identify highly correlated feature variables to remove from our data set? Why or why not?

Note: For the remainder of this lab, we will assume no feature variables needed to be removed.

3.7

One aspect of pre-processing that we did not discuss in the supplementary material is the concept of modifying our data values, to ensure smoother comparisons between different feature variables.

Similarly to the normalisation processes conducted in earlier labs, we can scale and center our wine data, using the caret package function preProcess.

Run the R code below, to scale and center our red_wine data:

centre_scale <- preProcess(red_wine[, -12], 
                           method = c("center", "scale"))
red_wine_updated <- predict(centre_scale, red_wine)

3.8

If you now compare the original data to the updated data, using head(red_wine) and head(red_wine_updated) respectively, you should see that the feature variable values are now scaled and centred.
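
To make the idea of centring and scaling concrete, the sketch below applies the same calculation by hand to a made-up vector (toy_values is hypothetical): subtract the mean, then divide by the standard deviation.

```r
# Hypothetical measurements, invented for illustration
toy_values <- c(2, 4, 6, 8, 10)

# Centre by subtracting the mean, then scale by dividing by the standard deviation
toy_scaled <- (toy_values - mean(toy_values)) / sd(toy_values)
toy_scaled

# After the transformation the mean is 0 and the standard deviation is 1
mean(toy_scaled)
sd(toy_scaled)

# Base R's scale() applies the same calculation
all.equal(toy_scaled, as.numeric(scale(toy_values)))
```

Because every transformed feature ends up with mean 0 and standard deviation 1, features measured on very different scales become directly comparable.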

3.9

Try running the R code below. The box plots should now be easier to assess.

featurePlot(x = red_wine_updated[, -12], 
            y = red_wine_updated$quality, 
            plot = "box",
            auto.key = list(columns = 6))

Do any of the features seem to have much association with wine quality? Do you notice a trend or pattern in the box plots for the different quality values within any particular features?

3.10

Even though the box plots produced in 3.9 are clearer than those produced in section 2, it is still quite difficult to distinguish between the quality ratings.

Replace the plot = "box" argument in the code above with plot = "pairs" to produce scatter plots instead of box plots for our pre-processed data.

Do you notice any clear differences between the different quality ratings?

3.11 Training and Validation Data

Our final pre-training step is to split our data into training and validation sets. Use the createDataPartition function from the caret package to split the red_wine_updated data 80/20.

Please note that the data partitioning into training or validation categories is random to an extent, so if you do not run the set.seed(1650) commands shown in the Code chunks below, your results from this point onwards may differ slightly from those presented in the subsequent question solutions, since your training and validation data sets will most likely contain slightly different sets of observations.
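
The effect of set.seed can be seen in a small sketch using base R's sample function (the object names here are hypothetical and are not used elsewhere in the lab):

```r
# Without resetting the seed, two random draws will usually differ
first_draw  <- sample(1:100, 5)
second_draw <- sample(1:100, 5)

# Setting the same seed immediately before each draw makes them identical
set.seed(1650)
seeded_draw_1 <- sample(1:100, 5)
set.seed(1650)
seeded_draw_2 <- sample(1:100, 5)

identical(seeded_draw_1, seeded_draw_2)
```

This is why the seed is set immediately before each random step in this lab.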

The code below is partially completed, just fill in the ... missing parts:

set.seed(1650)
wine_train_index <- createDataPartition(... , 
                                        p = ... , 
                                        list = FALSE, times = 1) 

Hint: Remember that the argument p denotes the split. If you are stuck, you can check the Code chunk below:

set.seed(1650)
wine_train_index <- createDataPartition(red_wine_updated$quality, 
                                        p = .8, # here p designates the split - 80/20
                                        list = FALSE, times = 1) 

3.12

Next, assign the red_wine_updated data into the training and validation sets, and name these red_wine_train and red_wine_validate respectively. Check the code below for a head start:

red_wine_validate <- red_wine_updated[-wine_train_index, ]

Hint: If you are stuck, you can check the Code chunk below:

# Note here we are using the values in the wine_train_index
# (whereas for the validation set, we select the values not in the wine_train_index)
red_wine_train <- red_wine_updated[wine_train_index, ]
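
After a split like this, it can be worth confirming that every row has ended up in exactly one of the two sets. A minimal sketch of such a check, on a made-up data frame with a made-up index (toy and toy_train_index are hypothetical):

```r
# Hypothetical data set of 10 rows and a made-up set of training row numbers
toy <- data.frame(id = 1:10, value = (1:10)^2)
toy_train_index <- c(1, 2, 4, 5, 7, 8, 9, 10)

toy_train    <- toy[toy_train_index, ]   # rows listed in the index
toy_validate <- toy[-toy_train_index, ]  # every row not listed in the index

# Every row should land in exactly one of the two sets
nrow(toy_train) + nrow(toy_validate) == nrow(toy)
length(intersect(toy_train$id, toy_validate$id)) == 0

# Proportion of rows used for training
nrow(toy_train) / nrow(toy)
```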

4 Fitting a Decision Tree Machine Learning Model

Now that the preparation phase is complete, we are ready to fit our first machine learning model using the training portion of our pre-processed red wine data.

The focus in this lab will be to introduce you to the train function from the caret package. We can fit a variety of machine learning models using this function (although some will also require other packages). We will start with a simple model, the Decision Tree, and then next week we will expand to looking at Random Forests, k-Nearest Neighbours, and other types of algorithms.

Using the train function, the basic code framework to fit each model is as follows:

object <- train(... ~ ., # specify relationship between outcome and feature variables
                data = ... , # specify training data
                method = "specify method here")

Regardless of what algorithm you use, there will be three main arguments you will need to include in your train function:

  • The relationship between the outcome variable and the feature variables
  • The data set
  • The method/algorithm to use

Let’s cover these in more detail.

In the first argument we specify the relationship between the outcome variable and the feature variables.

For example, if our outcome variable was called outcome, and we had two feature variables, feature1 and feature2, the first part of our code could look like this:

object <- train(outcome ~ feature1 + feature2, 
                ...)

In general however, we will have more than two feature variables to include (sometimes dozens more!). Therefore, we can use the shortcut outcome ~. to specify that all variables in the data set, apart from outcome, should be included as feature variables in the model.

As a result, when training a supervised learning machine learning model using the train function, typically all you will need to do when specifying your first argument is identify the name of your outcome variable, and include this name in place of outcome in outcome ~..
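
If you would like to see what the dot shortcut expands to, base R's terms function can show this without fitting a model. The sketch below uses a made-up data frame (toy and its column names are hypothetical):

```r
# Hypothetical data frame with one outcome and three features
toy <- data.frame(outcome  = factor(c("a", "b", "a", "b")),
                  feature1 = c(1.2, 3.4, 2.2, 4.1),
                  feature2 = c(10, 20, 15, 25),
                  feature3 = c(0.1, 0.4, 0.2, 0.5))

# terms() expands the dot using the columns of the supplied data
expanded <- terms(outcome ~ ., data = toy)

# The feature variables that the formula will use
attr(expanded, "term.labels")
```

Every column other than outcome is listed, which is the same set of feature variables a formula of this form passes to a model.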

For the data argument, you will need to specify your pre-processed data set, and for the method argument, you will need to specify the machine learning method you would like to use - each has a different name.

Some models will include additional arguments, usually specified within the argument tuneGrid, and we will explain these where relevant.

Let’s begin.

4.1 Decision Tree

One of the simplest machine learning models we can use is a decision tree.

Using the information in 4, and the partially complete code in the Code chunk below, fit a decision tree to your pre-processed red_wine_train training data.

set.seed(1650) 
red_wine_decision_tree <- train(... ~ .,
                            data = ...,
                            method = "rpart")

Note that the decision tree method name is rpart (which is unintuitive).

Once you are happy with your code, run it, and then run the object red_wine_decision_tree to see the output. Your output should look like the output in the Code chunk below:

CART 

1282 samples
  11 predictor
   6 classes: '3', '4', '5', '6', '7', '8' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ... 
Resampling results across tuning parameters:

  cp          Accuracy   Kappa    
  0.01221167  0.5737107  0.3033509
  0.02374491  0.5657458  0.2697084
  0.25237449  0.4719068  0.1116919

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.01221167.

We are mainly interested here in the Accuracy values, for different tuning parameter values (we can ignore the Kappa values). As we can see, the best accuracy achieved was only 57.37%. With six quality classes this is better than guessing at random, but it is still a disappointing result, since the model misclassifies more than 40% of the wines.
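
One common way to put an accuracy figure in context is to compare it with the accuracy of always predicting the most frequent class. The sketch below shows the calculation on made-up labels (toy_labels is hypothetical and is not the wine data):

```r
# Hypothetical class labels, invented for illustration
toy_labels <- factor(c("5", "5", "5", "6", "6", "7", "5", "6", "5", "6"))

# How often does each class occur?
toy_counts <- table(toy_labels)
toy_counts

# Accuracy of a "model" that always predicts the most common class
baseline_accuracy <- max(toy_counts) / length(toy_labels)
baseline_accuracy
```

A model is only useful if its accuracy is clearly higher than this kind of baseline.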

Note: Don’t worry if your results look slightly different - perhaps you did not run all the set.seed(1650) commands?

4.1.1

Run the following R code to visualise this decision tree model:

rpart.plot(red_wine_decision_tree$finalModel)

Note: The rpart.plot package required here should have been installed and loaded in section 1.1.

4.1.2

As you should be able to see from the decision tree diagram, the model did not have enough data on wines of low or high quality, and therefore is only able to make predictions between wines of quality 5 or 6 - this explains the disappointing overall predictive accuracy of the model.

This is the problem we were alluding to in 2.2. The example in the supplement material that used the penguins data was an exception - usually the data is not so nicely partitioned.

5 Validating Results

While we have a predictive accuracy estimate for our decision tree model, it is important to remember that this has been computed using the training data.

We would also like to check how the model performs when presented with new data - i.e. our validation data!

When conducting machine learning, there is a risk of overfitting our models to our training data. This can result in the models having excellent accuracy when assessing the training data, but having subpar performance when presented with new data.

This is why we have put aside some of our data as validation data in 3.11, so that we can perform cross-validation. If the accuracy of the model remains similar when presented with the validation data, then we can be more confident in our model’s reported performance.

5.1

There are several ways to perform cross-validation. One of the simplest is demonstrated in section 4.1 of the Introduction to Machine Learning in R supplement.

An example application of this approach to the decision tree model results is shown below:

# Load magrittr package for piping
library(magrittr)

# count number of observations in validation data
validation_numbers <- dim(red_wine_validate)[1]

# Use the fitted model to predict quality values given the validation data
predict_red_wine_decision_tree <- predict(red_wine_decision_tree, 
                                          newdata = red_wine_validate)

# Calculate the percentage of correct predictions.
# The division is wrapped in brackets (and the / ends the first line) so that
# R computes the full proportion before piping it into round()
dec_tree_accuracy <- (sum(predict_red_wine_decision_tree == red_wine_validate$quality) /
                        validation_numbers) %>% round(2) * 100

dec_tree_accuracy

Run this code now.
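
An accuracy percentage can also be obtained by taking the mean of a logical comparison, and a cross-tabulation shows which classes the model confuses. The sketch below illustrates both on made-up predictions (toy_truth and toy_predicted are hypothetical and are not output from the wine model):

```r
# Hypothetical true and predicted classes, invented for illustration
toy_truth     <- factor(c("5", "5", "6", "6", "7", "5", "6", "7"),
                        levels = c("5", "6", "7"))
toy_predicted <- factor(c("5", "6", "6", "6", "6", "5", "5", "6"),
                        levels = c("5", "6", "7"))

# Proportion of predictions that match the true class
toy_accuracy <- mean(toy_predicted == toy_truth)
round(toy_accuracy * 100, 2)

# Cross-tabulate predictions against the truth to see which classes are confused
toy_table <- table(predicted = toy_predicted, actual = toy_truth)
toy_table
```

In this toy example, class 7 is never predicted, which is the kind of pattern a single accuracy figure hides.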

5.2

Discuss the results of the cross-validation. Do you think the decision tree model’s reported accuracy in 4.1 is reliable?


Great work, that’s everything for today!

While we have just scratched the surface of machine learning, hopefully this lab has provided you with a better understanding of how to use R for machine learning - as we can see, it’s actually not that complicated to train a machine learning model in R (although getting a highly accurate one is often another story…).

Next week, we will continue our analysis of the red wine data, and see if we can improve our results by using different models and by adjusting training parameters.


References

Cortez, P., A. Cerdeira, F. Almeida, T. Matos, and J. Reis. 2009. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” Decision Support Systems 47 (4): 547–53.
Kuhn, M., J. Wing, S. Weston, A. Williams, C. Keefer, A. Engelhardt, T. Cooper, et al. 2021. caret: Classification and Regression Training. https://cran.r-project.org/web/packages/caret/index.html.
Thulin, M. 2021. Modern Statistics with R: From Wrangling and Exploring Data to Inference and Predictive Modelling.
UCI Machine Learning Repository. 2009. “Wine Quality Data Set [.csv File].” 2009. https://archive.ics.uci.edu/ml/datasets/Wine+Quality.


These notes have been prepared by Rupert Kuveke. Please note that some of the content in these notes has been developed from content in Thulin (2021). The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.

