Data Science Stream
Topic 9B: Machine Learning I
Welcome to the ninth computer lab for the Data Science stream of STM1001.
In this computer lab we will practice fitting our first machine learning model.
This computer lab is designed to run alongside the content in the Introduction to Machine Learning in R supplement. The material in this supplement provides all the background information on machine learning and machine learning terminology you will need to complete this lab.
The amount of material in this lab is smaller than usual, to ensure that you have plenty of time to read over the different sections of the Introduction to Machine Learning in R supplement as we go.
By the end of this lab, you should have developed a solid foundational understanding of how to conduct simple supervised machine learning tasks in RStudio. You should be starting to feel comfortable preparing data for supervised machine learning tasks, and know how to assess the performance of a machine learning model. In future weeks we will develop your ML skills further.
Let’s get started!
Preparations
🏡 Before we proceed, please make sure you have read at least sections 1 and 2 of the Introduction to Machine Learning in R supplement - if you haven’t, these lab questions will be unnecessarily difficult to understand and complete. It will be helpful to keep this content open in a separate tab while you work through this lab.
Load Required Packages
💻 To carry out machine learning in RStudio in this lab and in subsequent labs, we will need to install and load several R packages, chief among them the caret package (Kuhn et al. 2021).
Run the code below to install and load the R packages required for this lab:
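# Install packages (only needed once per computer)
# palmerpenguins and tibble are included because they are loaded later in this lab
install.packages(c("caret", "magrittr", "palmerpenguins", "rpart.plot", "tibble"))
# Load packages
library(caret)
library(rpart.plot)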
Predicting Penguin Species
💻 From Section 3 onwards, the Introduction to Machine Learning in R supplement introduces various ML pre-processing techniques via an example task involving the familiar penguins data set from the palmerpenguins R package (Horst, Hill, and Gorman 2020).
To ease us into learning about machine learning, we will focus on extending this example in this lab. Specifically, we will aim to train a simple machine learning model which can accurately predict the species of a penguin living in the Palmer Archipelago, using a number of feature variables.
Penguin Data
💻 For our machine learning work, we will use the following feature variables from the penguins data set:
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
Our outcome variable will remain species, with the options Adelie, Chinstrap and Gentoo.
💻 Run the code below to load the penguins data in RStudio, select the chosen variables, and assign them to the object ml_penguins:
library(palmerpenguins)
ml_penguins <- na.omit(penguins[, -8]) # note we ignore the year variable here
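To see what the two steps in this line do, here is a minimal sketch on a small made-up data frame (the object toy and its values are hypothetical, not taken from penguins). A negative column index drops that column, and na.omit then removes every row that still contains a missing value.

```r
# Hypothetical toy data: three rows, one missing body mass
toy <- data.frame(
  species     = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, NA, 3800),
  year        = c(2007, 2008, 2009)
)

# Drop the third column (year), then remove any rows containing NA
toy_clean <- na.omit(toy[, -3])

names(toy_clean)  # "species" "body_mass_g"
nrow(toy_clean)   # 2
```

The same pattern is applied to penguins above, where the column being dropped is the eighth one.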
Aim
🏡 Our aim is to train a machine learning model which can accurately predict the species of penguin, based on inputs from the feature variables specified in 2.1 above.
What class of machine learning problem is this task?
🎧 Online students
💬 Enter your answer next to the question on the shared jamboard.
Initial Data Visualisation
🏡 We can use the featurePlot function from the caret package to produce scatter plots of the observed values for all the feature variables in our data set.
Run the code below to do this:
featurePlot(x = ml_penguins[, -1], y = ml_penguins$species,
plot = "pairs", auto.key = list(columns = 3))
🏡 Based on the scatter plots produced in 2.4, can you think of any potential problems we might encounter when trying to predict certain species?
Note: It’s ok if you’re not sure about this yet - after all, we have only just started learning about machine learning. For more details about this scatter plot matrix, refer to Section 3.1.2 of the Introduction to Machine Learning in R supplement.
🎧 Online students
💬 Enter your answer next to the question on the shared jamboard.
Pre-Processing the Penguin data
💻 Before we begin fitting a machine learning model, we should conduct some pre-processing checks, as outlined in Section 3.2 of the Introduction to Machine Learning in R supplement.
Dummy Variables
💻 First, note that two of our feature variables, namely island and sex, are categorical. Therefore we will need to create some dummy variables - remember, all the feature variables we use in our ML model need to be numeric in format.
Following the code provided in Section 3.2.1 of the Introduction to Machine Learning in R supplement, reclassify the island and sex feature variables as dummy variables.
Name your updated object ml_penguins_updated.
Hint: If you are not sure how to proceed, check the code chunk below:
# Load a package to help with the restructure of the data
library(tibble)
# Use the dummyVars function to create a full set of dummy variables for the ml_penguins data
dummy_penguins <- dummyVars(species ~ ., data = ml_penguins)
# Use the predict function to update our ml_penguins feature variables with
# both island and sex dummy variables
ml_penguins_updated <- as_tibble(predict(dummy_penguins, newdata = ml_penguins))
# Prepend the outcome variable to our updated data set, otherwise it will be lost
ml_penguins_updated <- cbind(species = ml_penguins$species, ml_penguins_updated)
💻 Note that we now have three dummy variables for island, as there are three islands, and two dummy variables for sex.
Use the head command to check the first few rows of data in ml_penguins_updated, and make sure you understand the new notation before proceeding.
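If the dummy variable notation is unclear, the minimal sketch below may help. It uses the base R function model.matrix on a small made-up data frame (toy and its values are hypothetical), rather than caret's dummyVars, so the column names will differ slightly from those in ml_penguins_updated. The idea is the same: each category gets its own 0/1 column.

```r
# Hypothetical toy data with two categorical feature variables
toy <- data.frame(
  island = factor(c("Biscoe", "Dream", "Torgersen", "Dream")),
  sex    = factor(c("female", "male", "male", "female"))
)

# "- 1" removes the intercept so that every category gets its own column
island_dummies <- model.matrix(~ island - 1, data = toy)
sex_dummies    <- model.matrix(~ sex - 1, data = toy)

# Three island columns plus two sex columns
toy_dummies <- cbind(island_dummies, sex_dummies)
toy_dummies
```

Each row contains exactly one 1 among the island columns and exactly one 1 among the sex columns, marking the category that sample belongs to.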
Highly Influential Samples
💻 Our next step is to check for feature variables with little or no variation, since in such variables a handful of samples can exert excessive influence on the fit of our ML model.
We can use the function nearZeroVar from the caret package to obtain details on the freqRatio and percentUnique values for each of the variables in our data set.
Use the nearZeroVar function to assess the feature variables in ml_penguins_updated. Run this function twice, once with saveMetrics = TRUE and once with saveMetrics = FALSE.
Hint: If you are not sure how to proceed, check section 3.2.3 of the Introduction to Machine Learning in R supplement.
💻 Recall that the nearZeroVar function can include additional arguments, freqCut and uniqueCut, that specify cut-off values for the freqRatio and percentUnique results respectively.
Re-run your code from 3.3, and this time specify a cut-off value of 2 for the freqRatio values and a cut-off value of 5 for the percentUnique values.
Note: For details on specifying cut-off values, check section 3.2.4 of the Introduction to Machine Learning in R supplement.
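To make these two measures more concrete, the sketch below computes them by hand in base R for one made-up variable (x is hypothetical). It assumes the usual definitions: freqRatio is the count of the most common value divided by the count of the second most common value, and percentUnique is the number of distinct values as a percentage of the number of samples. It also assumes a variable is flagged only when freqRatio is above its cut-off and percentUnique is below its cut-off. This is an illustration of the idea, not caret's own code, and caret's handling of values exactly on a cut-off may differ.

```r
# Hypothetical 0/1 dummy variable: eight 0s and two 1s
x <- c(rep(0, 8), rep(1, 2))

# Counts of each distinct value, most common first
counts <- sort(table(x), decreasing = TRUE)

# Most common count divided by second most common count
freq_ratio <- as.numeric(counts[1]) / as.numeric(counts[2])

# Distinct values as a percentage of all samples
percent_unique <- 100 * length(unique(x)) / length(x)

# Cut-off values used in this lab
freq_cut   <- 2
unique_cut <- 5

# Flag the variable only if both conditions hold (assumed rule, see above)
flagged <- freq_ratio > freq_cut & percent_unique < unique_cut

c(freqRatio = freq_ratio, percentUnique = percent_unique)  # 4 and 20
flagged                                                    # FALSE
```

Here the freqRatio of 4 exceeds the cut-off of 2, but the percentUnique of 20 is not below 5, so under the assumed rule this toy variable would not be flagged.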
💻 Based on the nearZeroVar function results from 3.3.1, check for potentially problematic variables.
Which feature variable has the highest freqRatio value, and which feature variable has the lowest percentUnique value?
Are there any feature variables which you would recommend removing? Why or why not?
🎧 Online students
💬 Enter your answer next to the question on the shared jamboard.
💻 Next, we should check for correlated feature variables.
Generally, some correlation between feature variables is to be expected, but often it is beneficial to remove highly correlated feature variables from our data.
Run the code below to compute a correlation matrix for the ml_penguins_updated feature variables, and to check for extreme correlations close to 1 in magnitude.
base_cor <- cor(ml_penguins_updated[, 5:8])
extreme_cor <- sum(abs(base_cor[upper.tri(base_cor)]) > .999)
extreme_cor
Note: We do not consider dummy variables here, since they originate from categorical variables.
Hint: Refer to section 3.2.5 of the Introduction to Machine Learning in R supplement for additional details.
💻 Print the base_cor object to assess the spread of correlation values.
What are the largest negative and positive correlation values? Do these seem problematic?
🎧 Online students
💬 Enter your answer next to the question on the shared jamboard.
💻 Suppose our correlation limit for highly correlated feature variables is a relatively strict value of 0.7.
If we run the code findCorrelation(base_cor, cutoff = .7), we obtain the output 3.
At first, that can seem somewhat unhelpful. What exactly does the 3 mean here?
Well, it is telling us that, within the subset of feature variables assessed in the base_cor object, the third variable is involved in a pairwise correlation whose magnitude exceeds our specified cut-off value of 0.7, and is the variable recommended for removal.
If we check our base_cor object, we see that the feature variable in column 3 is flipper_length_mm, which has a high correlation of 0.873 with body_mass_g.
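If you would like to see which pair of variables is responsible for a result like this, one option is to search the correlation matrix directly. The sketch below does this in base R for a made-up correlation matrix (toy_cor, its variable names and its values are all hypothetical). It lists every pair in the upper triangle whose correlation exceeds the cut-off in magnitude.

```r
# Hypothetical correlation matrix for four made-up variables
vars <- c("var_a", "var_b", "var_c", "var_d")
toy_cor <- matrix(c( 1.0, -0.2,  0.6,  0.5,
                    -0.2,  1.0, -0.5, -0.4,
                     0.6, -0.5,  1.0,  0.9,
                     0.5, -0.4,  0.9,  1.0),
                  nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Row and column positions in the upper triangle that exceed the cut-off
high <- which(abs(toy_cor) > 0.7 & upper.tri(toy_cor), arr.ind = TRUE)

# Report each such pair by name, together with its correlation
high_pairs <- data.frame(var1 = vars[high[, "row"]],
                         var2 = vars[high[, "col"]],
                         r    = toy_cor[high])
high_pairs
```

In this toy matrix only one pair (var_c and var_d, with a correlation of 0.9) exceeds the cut-off of 0.7. The same approach could be applied to base_cor.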
💻 If we would now like to remove the feature variable flipper_length_mm from our data set, we need to be careful. Column 3 in the base_cor matrix does not correspond to column 3 in our ml_penguins_updated object - we only assessed columns 5 to 8 of ml_penguins_updated when computing correlations.
Therefore we actually want to remove column 7 from our ml_penguins_updated object. Run the code below to do this now:
ml_penguins_filtered <- ml_penguins_updated[, - 7] # flipper_length_mm has been removed
If we compute a new correlation matrix for the non-dummy feature variables in our filtered data set, we see that the highest magnitude correlation value is now 0.589451 - much better!
Note: When removing a variable from a data set, it is always a good idea to check your new object, e.g. head(ml_penguins_filtered), to verify you have removed the intended variable.
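The translation from a position in the subset to a position in the full data set can also be done in code rather than by counting. The short sketch below shows the idea using plain numbers (the object names are hypothetical).

```r
# Columns of the full data set that were used to compute the correlations
cols_checked <- 5:8

# Position reported within that subset (the output 3 discussed above)
flagged_in_subset <- 3

# Corresponding column position in the full data set
flagged_in_full <- cols_checked[flagged_in_subset]
flagged_in_full  # 7
```

Similarly, colnames(base_cor)[3] would return the name of the flagged variable, which can be a safer way to identify it than relying on column positions.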
Training and Validation Data
💻 Before we train our ML model, our final step is to split our data into training and validation sets.
Use the createDataPartition function from the caret package to split the ml_penguins_filtered data 80/20.
Note: The partitioning of data into training and validation sets is random, so if you do not run the set.seed(1650) commands shown in the code chunks below, your results from this point onwards may differ slightly from those presented in the subsequent question solutions, since your training and validation data sets will most likely contain slightly different sets of observations.
The code below is partially completed, just fill in the ... missing parts:
set.seed(1650)
train_index <- createDataPartition(... ,
p = ... ,
list = FALSE, times = 1)
Hint: Remember that the argument p denotes the split. If you are stuck, you can check section 4.1 of the Introduction to Machine Learning in R supplement, and/or the code chunk below:
set.seed(1650)
train_index <- createDataPartition(ml_penguins_filtered$species,
p = .8, # here p designates the split - 80/20
list = FALSE, times = 1)
💻 Next, assign the ml_penguins_filtered data into the training and validation sets, and name these penguin_train and penguin_validate respectively. Check the code below for a head start:
penguin_validate <- ml_penguins_filtered[-train_index, ]
Hint: If you are stuck, you can check the code chunk below:
# Note here we are using the values in the train_index object
# (whereas for the validation set, we select the values not in the train_index object)
penguin_train <- ml_penguins_filtered[train_index, ]
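When given a categorical outcome, createDataPartition samples within each class, so that the training and validation sets have a similar mix of classes to the full data. The sketch below imitates this idea in base R on a made-up outcome (toy_species is hypothetical). It is only an approximation for illustration and is not how caret implements the function.

```r
set.seed(1650)

# Hypothetical outcome with unbalanced classes (100 samples in total)
toy_species <- factor(rep(c("A", "B", "C"), times = c(50, 20, 30)))

# Row positions grouped by class
rows_by_class <- split(seq_along(toy_species), toy_species)

# Randomly keep 80% of the rows within each class
toy_train_index <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}), use.names = FALSE)

# Positive index for training, negative index for validation
toy_train    <- toy_species[toy_train_index]
toy_validate <- toy_species[-toy_train_index]

table(toy_train)     # A: 40, B: 16, C: 24
table(toy_validate)  # A: 10, B: 4,  C: 6
```

Which rows end up in each set depends on the seed, but the class counts do not, because the 80/20 split is applied separately to each class.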
Fitting a Decision Tree Machine Learning Model
💻 Now that the preparation phase is finally complete, we are ready to fit our first machine learning model.
The focus in this lab will be to introduce you to the train function from the caret package. We can fit a variety of machine learning models using this function (although some will also require other packages). We will start with a simple model, the Decision Tree, and then extend to other models in future labs.
Using the train function, the basic code framework to fit each model is as follows:
object <- train(... ~ ., # specify relationship between outcome and feature variables
data = ... , # specify training data
method = "specify method here")
Regardless of what algorithm you use, there will be three main arguments you will need to include in your train function:
- The relationship between the outcome variable and the feature variables
- The data set
- The method/algorithm to use
Let’s cover these in more detail.
1: In the first argument we specify the relationship between the outcome variable and the feature variables.
For example, if our outcome variable was called outcome, and we had two feature variables, feature1 and feature2, the first part of our code could look like this:
object <- train(outcome ~ feature1 + feature2,
...)
In general however, we will have more than two feature variables to include (sometimes dozens more!). Therefore, we can use the shortcut outcome ~ . to specify that all variables in the data set, apart from outcome, should be included as feature variables in the model.
As a result, when training a supervised machine learning model using the train function, typically all you will need to do when specifying your first argument is identify the name of your outcome variable, and include this name in place of outcome in outcome ~ .
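To check what the shortcut expands to for a given data set, we can ask R directly. The minimal sketch below uses the base R function terms on a small made-up data frame (toy and its column names are hypothetical).

```r
# Hypothetical toy data: one outcome and three feature variables
toy <- data.frame(
  outcome  = factor(c("a", "b", "a")),
  feature1 = c(1, 2, 3),
  feature2 = c(4, 5, 6),
  feature3 = c(7, 8, 9)
)

# Expand the "." using the columns of toy
expanded <- terms(outcome ~ ., data = toy)

# The feature variables that the formula will use
attr(expanded, "term.labels")  # "feature1" "feature2" "feature3"
```

Every column other than outcome appears in the list, which is why it is important to remove any unwanted variables from the data set before fitting a model with this shortcut.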
2: For the data argument, you will need to specify your pre-processed data set.
3: For the method argument, you will need to specify the machine learning method you would like to use - each has a different name.
Some models will include additional arguments, usually specified within the argument tuneGrid, and we will explain these where relevant.
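As a rough illustration of how tuneGrid might be used, the sketch below fits a decision tree to R's built-in iris data, trying three candidate values of the complexity parameter cp. It assumes the caret and rpart packages are installed. The iris data, the object names and the candidate cp values are chosen purely for illustration and are not part of this lab's penguin analysis.

```r
library(caret)

set.seed(1650)

# Candidate values for the tuning parameter (arbitrary, for illustration only)
toy_grid <- data.frame(cp = c(0.01, 0.05, 0.10))

# Fit one decision tree per candidate value and compare their accuracy
toy_tree <- train(Species ~ .,
                  data = iris,
                  method = "rpart",
                  tuneGrid = toy_grid)

# Estimated accuracy for each candidate value, and the value selected
toy_tree$results[, c("cp", "Accuracy")]
toy_tree$bestTune
```

If tuneGrid is left out, train chooses a small set of candidate values itself.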
Let’s begin.
Decision Tree
💻 One of the simplest machine learning models we can use is a decision tree.
Using the information in 5, and the partially complete code in the code chunk below, fit a decision tree to your pre-processed penguin_train training data.
set.seed(1650)
penguin_decision_tree <- train(... ~ .,
data = ...,
method = "rpart")
Note: The decision tree method name is rpart (which is unintuitive).
Once you are happy with your code, run it, and then run the object penguin_decision_tree to see the output. Your output should look like the output in the code chunk below:
CART
268 samples
8 predictor
3 classes: 'Adelie', 'Chinstrap', 'Gentoo'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 268, 268, 268, 268, 268, 268, ...
Resampling results across tuning parameters:
cp          Accuracy   Kappa
0.01324503  0.9436108  0.9117821
0.35761589  0.8089728  0.6885822
0.56291391  0.5649392  0.2597508
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.01324503.
💻 We are mainly interested here in the Accuracy values, for different tuning parameter values (we can ignore the Kappa values).
As we can see, the best accuracy achieved was 94.36%, which is very impressive.
Note: Don’t worry if your results look slightly different - perhaps you did not run all the set.seed(1650) commands?
💻 Use the rpart.plot function (as shown below) to visualise the penguin_decision_tree decision tree model.
rpart.plot(penguin_decision_tree$finalModel)
Recall that the values under the penguins’ species names in the coloured boxes (nodes) show the percentages of Adelie, Chinstrap and Gentoo penguins respectively, that have been categorised as belonging to that node of the decision tree.
Note: The rpart.plot package required here should have been installed and loaded in 1.
🎧 Online students
💬 Volunteer to share your screen and explain your answers to this question.
Validating Results
💻 While we have a predictive accuracy estimate for our decision tree model, it is important to remember that this has been computed using the training data.
We would also like to check how the model performs when presented with new data - i.e. our validation data!
When conducting machine learning, there is a risk of overfitting our models to our training data. This can result in the models having excellent accuracy when assessing the training data, but having subpar performance when presented with new data.
This is why we put aside some of our data as validation data in 4: so that we can check the model against data it has not seen, which is a simple hold-out form of cross-validation.
If the accuracy of the model remains similar when presented with the validation data, then we can be more confident in our model’s reported performance.
💻 There are several ways to perform cross-validation. One of the simplest is demonstrated in section 4.3 of the Introduction to Machine Learning in R supplement.
An example application of this approach to the penguin_decision_tree model results is shown below. Inspect and then run this code.
# Load magrittr package for piping
library(magrittr)
# count number of observations in validation data
validation_numbers <- nrow(penguin_validate)
# Use the fitted model to predict species values given the validation data
predict_penguin_decision_tree <- predict(penguin_decision_tree,
                                         newdata = penguin_validate)
# When run, the code below gives us the percentage of correct predictions
dec_tree_accuracy <- sum(predict_penguin_decision_tree ==
penguin_validate$species) / validation_numbers * 100
dec_tree_accuracy %>% round(2)
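A single accuracy percentage does not tell us which species are being confused with one another. One simple way to find out is to cross-tabulate the predicted and observed species. The sketch below shows the idea in base R using made-up predictions (the toy_ objects and their values are hypothetical, not results from the penguin model).

```r
species_levels <- c("Adelie", "Chinstrap", "Gentoo")

# Hypothetical observed and predicted species for six penguins
toy_observed  <- factor(c("Adelie", "Adelie", "Chinstrap",
                          "Gentoo", "Gentoo", "Chinstrap"),
                        levels = species_levels)
toy_predicted <- factor(c("Adelie", "Chinstrap", "Chinstrap",
                          "Gentoo", "Gentoo", "Adelie"),
                        levels = species_levels)

# Rows are predictions, columns are the true species
toy_confusion <- table(Predicted = toy_predicted, Observed = toy_observed)
toy_confusion

# Correct predictions lie on the diagonal of the table
toy_accuracy <- sum(diag(toy_confusion)) / sum(toy_confusion) * 100
round(toy_accuracy, 2)  # 66.67
```

For the penguin model, the same table could be produced with table(Predicted = predict_penguin_decision_tree, Observed = penguin_validate$species). The caret package also provides a confusionMatrix function that reports a similar table along with additional statistics.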
💻 Discuss the results of the cross-validation with your class. Do you think the decision tree ML model we have trained is a good model?
🎧 Online students
💬 Enter your answer next to the question on the shared jamboard.
Great work, that’s everything for today. Don’t worry if you did not complete everything in the designated lab time, there is a lot to learn!
Hopefully this lab has provided you with a better understanding of the fundamentals of machine learning - as we can see, it’s actually not that complicated to train a machine learning model in RStudio.
Next week, we will continue learning about machine learning, and focus on a new data set.
References
Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.
Kuhn, M., J. Wing, S. Weston, A. Williams, C. Keefer, A. Engelhardt, T. Cooper, et al. 2021. caret: Classification and Regression Training. https://cran.r-project.org/web/packages/caret/index.html.
Thulin, M. 2021. Modern Statistics with R: From Wrangling and Exploring Data to Inference and Predictive Modelling.
These notes have been prepared by Rupert Kuveke. Please note that some of the content in these notes has been developed from content in Thulin (2021). The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License (CC BY-NC-ND).
