Preparations
🏡 Before we proceed, please make sure you have completed the first ML Data Science Computer Lab and read most of the content in the Introduction to Machine Learning in R supplement - if you haven’t, please spend some time now completing this material, as otherwise these lab questions will be unnecessarily difficult to understand and complete.
It may also be helpful to keep this content open in a separate tab while you work through the lab material.
Load Required Packages
🏡 Run the R code below to load the R packages required for this lab:
library(caret)
library(rpart.plot)
Note: You should already have the caret, magrittr and rpart.plot packages installed. If for whatever reason you do not, please install them using the code below:
# Install packages
install.packages(c("caret", "magrittr", "rpart.plot"))
# Load packages
library(caret)
library(rpart.plot)
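If you would rather not reinstall packages you already have, a minimal sketch along the following lines may help. It only reports which of the required packages are missing. The helper name missing_packages is made up for this illustration, and the install line is left commented out so that nothing is installed by accident.

```r
# Hypothetical helper: return the names of packages that cannot be loaded
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this lab
to_install <- missing_packages(c("caret", "magrittr", "rpart.plot"))
print(to_install)

# Uncomment the line below to install anything that is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result (character(0)) suggests that all three packages are already available.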
Wine Data
🏡 In this lab, we will assess a new data set - data on Portuguese Vinho Verde wine, obtained from the UCI Machine Learning Repository (2009) and originally collected by Cortez et al. (2009).
This is a real data set that has been referenced in dozens of academic research articles.
This data consists of 11 feature variables relating to physicochemical aspects of the wine, namely:
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
For privacy reasons, other feature and outcome variables like grape_type, brand and sale price are not available (but would be very interesting to consider!).
The only available outcome variable is quality, which is an integer score from 0 to 10 (with 0 denoting a terrible wine and 10 denoting an exceptional wine).
This data is split into two sets - one for red wine and one for white wine. We will assess only the red wine data set, which is stored in the file winequality_red.csv.
🏡 You can download the file winequality_red.csv from LMS - please do so now.
Once you have downloaded this data set, load the winequality_red.csv data into RStudio and save it in the object red_wine.
Hint: You can check the code chunk below for guidance. This code assumes your data is saved in your local working directory.
red_wine <- read.csv(file = "winequality_red.csv", header = TRUE)
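The sketch below illustrates one detail of read.csv that is relevant later in this lab: by default, column headers containing spaces are converted to syntactic R names, with dots in place of spaces. The three-row file is invented purely for this example and written to a temporary location; whether the headers in your own copy of the file contain spaces is an assumption you can check with names(red_wine).

```r
# Hypothetical three-row extract, written to a temporary file for illustration
csv_lines <- c(
  '"free sulfur dioxide","total sulfur dioxide","quality"',
  "11,34,5",
  "25,67,5",
  "15,54,6"
)
tmp_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp_file)

# Read the file in the same way as the lab code above
toy_wine <- read.csv(file = tmp_file, header = TRUE)

# Spaces in the headers have been replaced by dots
names(toy_wine)

# quality is read in as an integer column, not as a factor
str(toy_wine)
```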
Aim
🏡 Our aim is to use the winequality_red.csv data to train a machine learning model which can accurately predict the quality of a red wine, based on the feature variable inputs.
💻 In ML classification problems, it is important that our outcome variable is treated as a factor, rather than as a continuous numeric variable.
In previous content, this has not been an issue, because the outcome there was the penguin species, which is naturally categorical.
However, with the red_wine data, the quality scores are stored as integers, so R will treat them as numbers by default. We therefore need to convert the quality variable to a factor (otherwise our model might end up predicting quality scores of e.g. 7.25).
Make sure to run the code below before proceeding further:
red_wine$quality <- as.factor(red_wine$quality)
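To see what this conversion does, here is a small sketch using an invented vector of quality scores (not the real data). It also shows a common pitfall: calling as.numeric directly on a factor returns the internal level codes rather than the original scores.

```r
# Hypothetical quality scores, for illustration only
quality_scores <- c(5, 5, 6, 7, 5, 6)

# Convert to a factor: each distinct score becomes a class label
quality_factor <- as.factor(quality_scores)
levels(quality_factor)
table(quality_factor)

# Pitfall: this returns the level codes (1, 2, 3, ...), not the scores
as.numeric(quality_factor)

# Going via character recovers the original scores
as.numeric(as.character(quality_factor))
```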
Data Visualisation
💻 Before we start analysing the data, it is prudent to take a quick look at it.
💻 To begin, use the head, summary and dim R functions to obtain details about the composition of the data. Do you notice any interesting details?
🎧 Online students
💬 Enter your answer next to the question on the shared jamboard.
💻 Use the plot function to produce a bar plot of the quality variable values - what do you notice?
Based on the observed values for this variable, can you think of any potential problems we might encounter when trying to predict certain quality values?
Note: Don’t worry if you’re not sure about this yet - we are still just starting to learn about machine learning.
🎧 Online students
💬 Enter your answer next to the question on the shared jamboard.
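Alongside the bar plot, it can help to look at the class counts as numbers. The sketch below uses simulated quality scores (the proportions are invented, not taken from the wine data) to show how table and prop.table summarise how evenly the classes are represented.

```r
# Simulated quality scores with deliberately uneven class sizes (assumed values)
set.seed(1650)
toy_quality <- factor(sample(c(4, 5, 6, 7), size = 200, replace = TRUE,
                             prob = c(0.05, 0.45, 0.40, 0.10)))

# Number of observations in each class
quality_counts <- table(toy_quality)
print(quality_counts)

# The same information as percentages of the whole data set
quality_share <- round(100 * prop.table(quality_counts), 1)
print(quality_share)
```

For your own data, you could replace toy_quality with red_wine$quality.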
💻 Recall that we can use the featurePlot function from the caret package to visualise our data.
Rather than creating scatter plots, we can specify plot = "box" to produce box plots for all the feature variables in our data set.
Run the code below to do this:
featurePlot(x = red_wine[, -12],
y = red_wine$quality,
plot = "box")
Note: The numbers along the bottom of the plots refer to the quality scores.
💻 You will notice that it’s hard to see most of the box plots, because two feature variables, free.sulfur.dioxide and total.sulfur.dioxide, have much larger values than the other feature variables. While this isn’t necessarily a problem (we can just produce box plots for each of the feature variables in turn), there is a method to address this, which we will discuss in the next section.
For the moment however, try to recreate these box plots, without including the feature variables free.sulfur.dioxide and total.sulfur.dioxide.
Hint: Note how in the featurePlot code above, we have used x = red_wine[, -12] to specify that the 12th column (quality) should not be included in the plotted x variables - you can use this as a guide on how to ignore free.sulfur.dioxide and total.sulfur.dioxide. Check the code below if you are stuck:
featurePlot(x = red_wine[, -c(6,7,12)],
y = red_wine$quality,
plot = "box")
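Dropping columns by position, as above, relies on the column order never changing. As an alternative, the sketch below drops columns by name using base R. The small data frame is invented for this example, with column names assumed to match those in red_wine.

```r
# Hypothetical miniature version of the wine data
toy_wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2),
  free.sulfur.dioxide = c(11, 25, 15, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60),
  pH = c(3.51, 3.20, 3.26, 3.16),
  quality = factor(c(5, 5, 6, 6))
)

# Names of the columns we do not want among the plotted features
drop_cols <- c("free.sulfur.dioxide", "total.sulfur.dioxide", "quality")

# Keep every column whose name is not in drop_cols
keep_cols <- setdiff(names(toy_wine), drop_cols)
toy_features <- toy_wine[, keep_cols, drop = FALSE]
names(toy_features)
```

The resulting data frame could then be passed as the x argument of featurePlot.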
Pre-Processing
💻 Before we begin fitting a machine learning model, we should conduct some pre-processing checks. Having completed the first ML Data Science Computer Lab, most of the following steps should be familiar to you.
As we work through this section though, you may like to refer to the content presented in Section 3.2 of the Introduction to Machine Learning in R supplement.
💻 Since all our feature variables are numeric, there is no need to create any dummy variables for our data set.
Highly Influential Samples
💻 Recall that we can use the function nearZeroVar from the caret package to obtain details on the freqRatio and percentUnique values for each of the variables in our red_wine data set.
Further, recall that the nearZeroVar function can include additional arguments, freqCut and uniqueCut, that specify cut-off values for the freqRatio and percentUnique results respectively.
Use the nearZeroVar function to assess the feature variables in the red_wine data set, and specify:
- A cut-off value of 2 for the freqRatio values
- A cut-off value of 5 for the percentUnique values.
Run the nearZeroVar function twice, once with saveMetrics = T and once with saveMetrics = F.
Hint: If you are not sure how to proceed, check sections 3.2.2 to 3.2.4 of the Introduction to Machine Learning in R supplement.
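To make the two metrics more concrete, the sketch below computes them by hand in base R for an invented data set. It assumes the usual definitions: freqRatio is the count of the most common value divided by the count of the second most common value, and percentUnique is the number of distinct values as a percentage of the number of observations. The function nzv_metrics is a hypothetical helper written for this example, not part of caret, and it makes no attempt to handle missing values.

```r
# Hypothetical helper: compute freqRatio and percentUnique for one variable
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the runner-up's count
  # (returning 0 when there is only one distinct value is a choice made here)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  # Distinct values as a percentage of all observations
  percent_unique <- 100 * length(counts) / length(x)
  c(freqRatio = freq_ratio, percentUnique = percent_unique)
}

# Invented data: column a is dominated by one value, column b is all distinct
toy_df <- data.frame(
  a = c(rep(0.5, 6), rep(0.6, 3), 0.7),
  b = 1:10
)

# One row per variable, one column per metric
toy_metrics <- t(sapply(toy_df, nzv_metrics))
print(toy_metrics)
```

A variable is usually only flagged when its freqRatio is above the freqCut value and its percentUnique is below the uniqueCut value at the same time.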
💻 Based on the nearZeroVar function results, check for potentially problematic variables.
Which feature variable has the highest freqRatio value, and which feature variable has the lowest percentUnique value?
What do you conclude? Are there any problematic feature variables?
🎧 Online students
💬 Enter your answer next to the question on the shared jamboard.
💻 Next, we should check for correlated feature variables. Remember that it is often beneficial to remove highly correlated feature variables from our data.
Use the cor function to compute a correlation matrix for the red_wine feature variables. Assess the spread of correlation values, and check for extreme correlations close to 1 in magnitude.
Hint: You can follow the steps in section 3.2.5 of the Introduction to Machine Learning in R supplement for this question.
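As a rough illustration of the idea, the sketch below builds a correlation matrix for three simulated variables (not the wine data) and pulls out the off-diagonal values, since the diagonal is always 1 and would otherwise hide the largest genuine correlation. The variable names are invented for this example.

```r
# Simulated features: b is constructed to be almost a copy of a
set.seed(1)
a <- rnorm(50)
toy_features <- data.frame(
  a = a,
  b = a + rnorm(50, sd = 0.1),
  c = rnorm(50)
)

# Correlation matrix of all feature pairs
cor_matrix <- cor(toy_features)
print(round(cor_matrix, 2))

# Keep each pair once by taking only the upper triangle (diagonal excluded)
off_diag <- cor_matrix[upper.tri(cor_matrix)]

# Spread of the pairwise correlations, and the most extreme values
summary(off_diag)
range(off_diag)
```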
💻 What are the largest negative and positive correlation values? Do these seem problematic?
💻 Based on your calculations, do you think we need to use the findCorrelation function to identify highly correlated feature variables to remove from our data set? Why or why not?
Note: For the remainder of this lab, we will assume no feature variables needed to be removed.
🎧 Online students
💬 Enter your answer next to the question on the shared jamboard.
💻 One aspect of pre-processing that we did not discuss in the supplementary material is the concept of modifying our data values, to ensure smoother comparisons between different feature variables.
Similarly to the normalisation processes conducted in earlier core computer labs, we can center our wine data (subtract each feature variable's mean) and scale it (divide by each feature variable's standard deviation), using the caret package function preProcess.
Run the code below to center and scale our red_wine data, and then assign our updated data to the new object red_wine_updated:
centre_scale <- preProcess(red_wine[, -12],
method = c("center", "scale"))
red_wine_updated <- predict(centre_scale, red_wine)
💻 Compare the original data to the updated data, using head(red_wine) and head(red_wine_updated) respectively. You should observe that the feature variable values are now scaled and centered.
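If you would like to see what centering and scaling does to a single variable, the sketch below standardises an invented vector by hand and compares the result with the base R scale function. This is a base R illustration of the same idea, not the preProcess code itself.

```r
# Hypothetical alcohol values, for illustration only
toy_alcohol <- c(9.4, 9.8, 10.5, 11.2, 12.0)

# Center (subtract the mean), then scale (divide by the standard deviation)
by_hand <- (toy_alcohol - mean(toy_alcohol)) / sd(toy_alcohol)

# The base R scale function performs the same calculation
via_scale <- as.numeric(scale(toy_alcohol, center = TRUE, scale = TRUE))
all.equal(by_hand, via_scale)

# After standardising, the mean is (numerically) 0 and the standard deviation is 1
mean(by_hand)
sd(by_hand)
```

Because every feature variable ends up on this common scale, box plots of different features can share one axis without the larger-valued features dominating.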
💻 If we create box plots of the updated data red_wine_updated, we observe that they are much easier to assess than the original box plots from 2.3:
featurePlot(x = red_wine_updated[, -12],
y = red_wine_updated$quality,
plot = "box",
auto.key = list(columns = 6))

Do any of the features seem to have much association with wine quality? Do you notice a trend or pattern in the box plots for the different quality values within any particular features?
🎧 Online students
💬 Enter your answer next to the question on the shared jamboard.
💻 Even though the box plots produced in 3.9 are clearer than those produced in section 2, it is still quite difficult to distinguish between the quality ratings.
Replace the plot = "box" argument in the code from 3.9 with plot = "pairs" to produce scatter plots instead of box plots for our pre-processed data.
Do you notice any clear differences between the different quality ratings?
🎧 Online students
💬 Enter your answer next to the question on the shared jamboard.
Training and Validation Data
💻 Our final preparatory step is to split our data into training and validation sets.
Use the createDataPartition function from the caret package to split the red_wine_updated data 80/20.
Note: The data partitioning into training or validation categories is random to an extent, so if you do not run the set.seed(1650) commands shown in the code chunks below, your results from this point onwards may differ slightly from those presented in the subsequent question solutions, since your training and validation data sets will most likely contain slightly different sets of observations.
The code below is partially completed, just fill in the missing ... parts:
set.seed(1650)
wine_train_index <- createDataPartition(... ,
p = ... ,
list = FALSE, times = 1)
Hint: Remember that the argument p denotes the split. If you are stuck, you can check the code chunk below:
set.seed(1650)
wine_train_index <- createDataPartition(red_wine_updated$quality,
p = .8, # here p designates the split - 80/20
list = FALSE, times = 1)
💻 Next, assign the red_wine_updated data into the training and validation sets, and name these red_wine_train and red_wine_validate respectively. Check the code below for a head start:
red_wine_validate <- red_wine_updated[-wine_train_index, ]
Hint: If you are stuck, you can check the code chunk below:
# Note here we are using the values in the wine_train_index
# (whereas for the validation set, we select the values not in the wine_train_index)
red_wine_train <- red_wine_updated[wine_train_index, ]
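To illustrate the idea behind a class-balanced 80/20 split, the sketch below performs a simple stratified split in base R on simulated quality labels. It samples roughly 80% of the rows within each class, so that the training and validation sets have similar class proportions. This is an approximation written for this example and is not the exact algorithm used by createDataPartition.

```r
# Simulated quality labels with uneven class sizes (assumed values)
set.seed(1650)
toy_quality <- factor(sample(c(5, 6, 7), size = 100, replace = TRUE,
                             prob = c(0.45, 0.40, 0.15)))

# Row numbers grouped by class
rows_by_class <- split(seq_along(toy_quality), toy_quality)

# Within each class, sample about 80% of the rows for training
train_index <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = ceiling(0.8 * length(rows)))]
}), use.names = FALSE)
train_index <- sort(train_index)

# Every row not chosen for training goes to validation
validate_index <- setdiff(seq_along(toy_quality), train_index)

# Class proportions should be similar in the two sets
round(prop.table(table(toy_quality[train_index])), 2)
round(prop.table(table(toy_quality[validate_index])), 2)
```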
Fitting a Decision Tree Machine Learning Model
💻 Now that the preparation phase is complete, we are ready to fit machine learning models using our red_wine_updated data.
Recall that the basic code framework to fit a model using the train function is as follows:
object <- train(... ~ ., # specify relationship between outcome and feature variables
data = ... , # specify training data
method = "specify method here")
Decision Tree
💻 One of the simplest machine learning models we can use is a decision tree.
Using the information in 4, and the partially complete code in the code chunk below, fit a decision tree to your pre-processed red_wine_train training data.
set.seed(1650)
red_wine_decision_tree <- train(... ~ .,
data = ...,
method = "rpart")
Once you are happy with your code, run it, and then run the object red_wine_decision_tree to see the output. Your output should look like the output in the code chunk below:
CART
1282 samples
11 predictor
6 classes: '3', '4', '5', '6', '7', '8'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.01221167 0.5737107 0.3033509
0.02374491 0.5657458 0.2697084
0.25237449 0.4719068 0.1116919
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.01221167.
We are mainly interested here in the Accuracy values, for different tuning parameter values (we can ignore the Kappa values).
Unfortunately, the best accuracy achieved was only 57.37%, which is a fairly modest result: a model that always predicted the most common quality score would already be correct for a sizeable share of the wines.
Note: Don’t worry if your results look slightly different - perhaps you did not run all the set.seed(1650) commands?
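Although the Kappa column can be ignored for this lab, it may help to see how accuracy and Kappa relate to each other. The sketch below computes both by hand from invented predictions, using the standard definition of Cohen's Kappa as the accuracy achieved beyond what would be expected by chance agreement. It also computes the accuracy of always predicting the most common class, which is a useful baseline for judging any accuracy figure.

```r
# Invented actual and predicted quality classes, for illustration only
quality_levels <- c("5", "6", "7")
actual <- factor(c("5", "5", "5", "6", "6", "6", "6", "7", "5", "6"),
                 levels = quality_levels)
predicted <- factor(c("5", "5", "6", "6", "6", "5", "6", "6", "5", "6"),
                    levels = quality_levels)

# Confusion table: rows are predictions, columns are the true classes
confusion <- table(predicted, actual)
n <- sum(confusion)

# Accuracy: proportion of predictions on the diagonal
observed_accuracy <- sum(diag(confusion)) / n

# Agreement expected by chance, from the row and column totals
expected_accuracy <- sum(rowSums(confusion) * colSums(confusion)) / n^2

# Cohen's Kappa: accuracy beyond chance, rescaled so that 1 is perfect
kappa_value <- (observed_accuracy - expected_accuracy) / (1 - expected_accuracy)

# Baseline: accuracy of always predicting the most common class
baseline_accuracy <- max(table(actual)) / n

round(c(accuracy = observed_accuracy, kappa = kappa_value,
        baseline = baseline_accuracy), 3)
```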
💻 Run the following code to visualise this decision tree model:
rpart.plot(red_wine_decision_tree$finalModel)
Note: The rpart.plot package required here should have been installed and loaded in 1.
💻 As you should be able to see from the decision tree diagram, the model did not have enough data on wines of low or high quality, and therefore is only able to make predictions between wines of quality 5 or 6 - this explains the disappointing overall predictive accuracy of the model.
This is the problem we were alluding to in 2.2. Our previous work using the penguins data was an exception - usually the data is not so nicely partitioned.
Fitting a Random Forest Machine Learning Model
💻 The decision tree is one of the simplest machine learning models we can use. Other methods often produce better results, but they also tend to take longer to train. Let’s see if we can obtain a better predictive accuracy using a tree-based ensemble method.\(^{\dagger}\)
\(^{\dagger}\) Refer to section 4.2.1.2 of the Introduction to Machine Learning in R supplement for details.
Training the Random Forest Model
💻 Training a random forest model in RStudio is quite simple, once we understand the basics of the train function.
Replace the method = "rpart" part of your decision tree code from 4.1 with method = "rf".
This will tell the train function to train a machine learning model using the random forest method (rf).
Assign your output to the object red_wine_rf.
Note: Don’t worry if it takes a couple of minutes for your code to run - this is normal for more complex models.
💻 Assess the red_wine_rf object. What is the best accuracy achieved by the random forest method?
💻 Compare your results for your decision tree model and your random forest model. Which model achieves a higher predictive accuracy?
🎧 Online students
💬 Enter your answer next to the question on the shared jamboard.
Random Forest Plots
💻 For random forest models, we can easily produce two helpful plots:
- A plot of the model accuracy based on the number of feature variables used for each decision tree in the random forest model
- A plot of the importance of each feature variable in achieving an accurate model
Note for the variable importance graph that the best (most helpful) variable is always given an importance of 100 and the worst is given an importance of 0. This is not to say that the variable with importance ranking 0 is useless - it is just considered less useful than the other variables.
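The 0 to 100 scale described in the note can be reproduced with a simple min-max rescaling, as sketched below. The raw importance scores are invented for this example, and the formula is an illustration of the scaling idea rather than a copy of the varImp source code.

```r
# Hypothetical raw importance scores for four feature variables
raw_importance <- c(alcohol = 42.1, sulphates = 30.5, pH = 12.3, density = 18.9)

# Min-max rescaling: the smallest score maps to 0 and the largest to 100
importance_range <- max(raw_importance) - min(raw_importance)
scaled_importance <- 100 * (raw_importance - min(raw_importance)) / importance_range

# Most important variable first
sort(round(scaled_importance, 1), decreasing = TRUE)
```

Note that pH receives a scaled importance of 0 here even though its raw score is well above zero, which is why a value of 0 should be read as "least important of these variables" rather than "useless".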
💻 Run the code below to produce plots for your red_wine_rf random forest model:
ggplot(red_wine_rf)
dotPlot(varImp(red_wine_rf))
What feature variables are considered most important?
🎧 Online students
💬 Enter your answer next to the question on the shared jamboard.
Validating Results
💻 While we have predictive accuracy estimates for our decision tree and random forest models, it is important to remember that these have been computed using the training data.
We would also like to check how the models perform when presented with new data - i.e. our validation data!
Recall that when conducting machine learning, there is a risk of overfitting our models to our training data. This can result in the models having excellent accuracy when assessing the training data, but having subpar performance when presented with new data.
This is why we have put aside some of our data as validation data in 3.11, so that we can check our models against data they have not seen during training (a simple hold-out form of cross-validation).
If the accuracy of the model remains similar when presented with the validation data, then we can be more confident in our model’s reported performance.
💻 There are several ways to perform cross-validation. Run the code below to conduct a simple cross-validation check of the accuracy of the decision tree model obtained in 4.1.
# Load magrittr package for piping
library(magrittr)
# count number of observations in validation data
validation_numbers <- dim(red_wine_validate)[1]
# Use the fitted model to predict quality values given the validation data
predict_red_wine_decision_tree <- predict(red_wine_decision_tree,
                                          newdata = red_wine_validate)
# When run, the code below gives us the percentage of correct predictions
dec_tree_accuracy <- sum(predict_red_wine_decision_tree ==
red_wine_validate$quality) / validation_numbers * 100
dec_tree_accuracy %>% round(2)
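A single accuracy percentage does not show which quality scores the model gets wrong. As a possible extension, the sketch below uses invented predictions (not output from the fitted models) to compute the overall accuracy with mean, and then the percentage of correct predictions within each true quality class.

```r
# Invented actual and predicted quality classes, for illustration only
actual <- factor(c(5, 5, 5, 6, 6, 6, 7, 7), levels = c(5, 6, 7))
predicted <- factor(c(5, 5, 6, 6, 6, 5, 6, 6), levels = c(5, 6, 7))

# Overall accuracy: the mean of TRUE/FALSE matches, as a percentage
overall_accuracy <- mean(predicted == actual) * 100

# Confusion table: rows are predictions, columns are the true classes
confusion <- table(predicted = predicted, actual = actual)
print(confusion)

# Percentage of each true class that was predicted correctly
per_class_accuracy <- diag(confusion) / colSums(confusion) * 100
round(per_class_accuracy, 1)
```

In this invented example no wine of quality 7 is predicted correctly, which the overall accuracy alone would not reveal.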
💻 Using the code in 6.1 as a guide, perform a simple cross-validation check of the accuracy of the random forest model you obtained in 5.1.
Discuss the results of the cross-validation. Do you think your models’ reported accuracies in 4.1 and 5.2 are reliable?
🎧 Online students
💬 Volunteer to share your screen and explain your answers to this question.
Great work, that’s everything for today!
Hopefully you are beginning to feel more comfortable conducting supervised machine learning in RStudio - as we can see, it’s actually not that complicated to train a machine learning model in RStudio (although getting a highly accurate one is often another story…).
In the next data science computer lab, we will continue our analysis of the red wine data, and see if we can improve our results by using different models and by adjusting training parameters.
