class: middle
background-image: url(data:image/png;base64,#LTU_logo.jpg)
background-position: top left
background-size: 30%

# STM1001 Lecture
# Introduction to Machine Learning
## Data Science stream
### La Trobe University

---

# Welcome!

### In this lecture we will cover an Introduction to Machine Learning (ML), focusing on how to apply ML within R.

--

* By the end of this lecture you will:

--

  * understand the fundamentals of machine learning theory

--

  * understand the fundamentals of using the caret package for supervised learning and predictive modelling in R

--

We will practice machine learning in Computer Labs 9B - 11B, so by the end of those labs, you should have a solid understanding of how to train and assess machine learning models in R.

---

# Supplementary Material Note

* All of the material we cover in this lecture is available in the LMS in the supplementary material .teal_style[Introduction to Machine Learning in R] ([available here](https://bookdown.org/rehk/stm1001_dsm_introduction_to_machine_learning_in_r/)).

--

* Once you have attended this lecture, and gone through the material in the .teal_style[Introduction to Machine Learning in R] supplement in your own time, you will be ready to start Computer Lab 9B.

---

# Overview

Over the following slides, we will cover:

--

* What is Machine Learning?

--

* Machine Learning Terminology

--

* How does Machine Learning work?

--

* Types of Machine Learning Models

--

* Machine Learning in R

---

class: center, middle

# 1. What is Machine Learning?

As defined by Grant and Wischik (2020, p.x), “*machine learning is the study of computer systems that use systematic mathematical procedures to find patterns in large datasets and that apply those patterns to make predictions about new situations.*”

---

# 1. What is Machine Learning?

Before we proceed further, we should make a clear distinction between ML and artificial intelligence (AI), as the two are often conflated.

--

### Machine Learning

ML is a mechanical, inductive process, and the ‘learning’ that takes place is strictly via the parameters specified by the human(s) writing the code - the machine itself does not exercise any independent thought or ‘intelligence’, nor does it question what it is learning, or why.

--

### Artificial Intelligence

In contrast to ML, an Artificial Intelligence can, once created, act independently (hopefully within the parameters defined during its creation).

---

# 2. Machine Learning Terminology

ML is about more than just applying advanced mathematical and statistical algorithms to large data sets. A great deal of thought and care also needs to go into the **choice and pre-processing of the data** to be used by the selected machine learning algorithm.

--

ML is also subjective, to a certain extent - as we will see, after running a machine learning algorithm there may be several competing results from which to choose, and your final choice may differ depending on the context.

--

Let's introduce some key ML terminology via an example.

---

# Exam Score Example

Suppose we would like to predict future students' exam scores for STM1001 early in semester, using data we have collected from the current cohort.

--

* We may collect data on topics such as:

--

  * (a) Amount of time spent studying each day

--

  * (b) Amount of sleep each day

--

  * (c) Diet type

--

  * (d) Amount of exercise per week

--

* At the end of the semester, we would also record the students' exam scores.

---

# Variables

We can think of these topics as **variables**.
--

Variables can signify different ‘characteristics’, ‘attributes’ or ‘features’ of the phenomenon of interest, and can be continuous, discrete, quantitative or qualitative.

--

When conducting machine learning, we typically deal with two types of variables: .teal_style[feature variables] and .teal_style[outcome variables].

--

In simple terms, we use feature variables to model or predict our outcome variable(s).

---

# Feature Variables

* Feature variables are used to fit an ML model, with the aim of modelling or predicting the outcome variable.

--

* They are not the reason for conducting the ML process, but are vital to its success - if our feature variables are poorly defined or chosen, then our resultant model may not be very accurate.

--

* In our example, the feature variables are topics (a)-(d). When surveyed on these topics, each student will provide a different set of responses (e.g. 1 hour study per day, 8 hours sleep per day, vegetarian diet, exercise once per week).

---

# Outcome Variable

* The outcome variable is the variable in which we are predominantly interested, and it motivates the ML process.

--

* In our example, the final exam score is the outcome variable (this could be a percentage, or a letter grade).

---

# Problem Classes

Different machine learning models and methods can be applied to different types of tasks or problems.

--

* Broadly speaking, there are two main categories of problem class:

--

  * Classification problems

--

  * Regression problems

---

# Problem Classes

* If our exam scores were letter grades, this would be an example of a multi-class classification problem (since there are multiple classes for the outcome variable, e.g. A, B, C, etc.).

--

* If we simplified the exam scores to pass/fail, our problem would become a binary classification problem.

--

* If we were predicting students' numeric exam mark (out of 100), this would change our problem to being a regression problem (since we now have output values that are numeric and continuous).

---

# Supervised and Unsupervised Learning

There are two main types of ML:

--

* Supervised Learning

--

* Unsupervised Learning

--

In .teal_style[supervised learning], we train our ML model to accurately classify or predict outcomes, using labelled data - i.e. data for which we have details on both the feature variables and the outcome variable.

--

In .teal_style[unsupervised learning], we assess data with no clear outcome variable in mind. Unsupervised learning is often used to uncover hidden patterns in data.

--

Problem classes for unsupervised learning include clustering problems - these use clustering techniques such as `\(k\)`-means clustering, which we explored in Computer Lab 7B.

---

# 3. How does Machine Learning work?

As we noted earlier, the mathematical and statistical algorithms we use are not actually the most important aspect of machine learning.

--

“*In fact, the clever part of machine learning is in the .teal_style[training phase]*” (Grant and Wischik 2020, p.35), where we provide the data to be used to train our learning algorithm.

--

We can think of this ‘*training*’ process as being similar to the way in which humans develop experience through exposure to stimuli.

---

# Training Phase

Suppose it is the end of semester, the STM1001 students sit their exam, and we record their results.

--

Collectively, the feature variables' values and the outcome variable's values we have collected will form our .teal_style[training phase data set].
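--

For instance, the first few rows of such a training phase data set might look like this - a purely hypothetical sketch, where the variable names and values are invented for illustration:

```r
# Hypothetical exam score training data (all names and values invented)
exam_train <- data.frame(
  study_hours_per_day = c(1.0, 2.5, 0.5),                      # feature (a)
  sleep_hours_per_day = c(8, 6, 7),                            # feature (b)
  diet_type           = c("vegetarian", "omnivore", "vegan"),  # feature (c)
  exercise_per_week   = c(1, 3, 0),                            # feature (d)
  exam_score          = c(62, 81, 55)                          # outcome variable
)
```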
--

* Note that we don't try to predict the exam scores for the current students - can you think why that is?

--

Although our ultimate goal is to predict exam scores for future students, we initially require some observed values of the outcome variable, so that we can check whether or not our ML model will produce accurate predictions when presented with new data.

---

# 3. How does Machine Learning work?

In simple terms, we can break the machine learning process into 5 steps:

--

1. Decide aim and appropriate type of ML to use

--

2. Collect and pre-process data

--

3. Split data into training and validation sets

--

4. Train model

--

5. Validate model

---

# Pre-Processing

Before we begin training our ML model, it is important to carefully pre-process our data.

--

This can involve (amongst other things) converting variables into more appropriate formats, checking for samples and variables that could have excessive influence on the model, and checking for variables that are highly correlated with other variables.

--

Any samples or variables identified as being potentially problematic can be removed as part of the pre-processing stage - this can help improve the predictive accuracy of our trained ML model.

---

# Training and Validation Data

Once our pre-processing is complete, we typically split our data in two. The larger portion will be our training data, with the remainder to be used later as our validation data.

--

* Training data is used to train our ML model

--

  * The training data helps our model 'learn' the relationship between the feature and outcome variables.

--

* Validation data is used to check the predictive accuracy of our trained ML model

--

  * The validation data helps check that our model can provide accurate predictions, given fresh data.

--

Generally speaking, the more (good quality) data used to train the ML model, the more accurate the resulting model.

---

# 4. Types of Machine Learning Models

There are **hundreds** of different machine learning models we can use. Don't be alarmed though - we will focus on a select handful of popular classes.

--

* The main classes of machine learning models are:

--

  * Decision Trees

--

  * Model Trees

--

  * Random Forests

--

  * Boosted Trees

--

  * Linear Discriminant Analysis

--

  * Support Vector Machines

--

  * Nearest Neighbour Classifiers

---

# Tree Models

In STM1001 we will focus predominantly on **tree models**, which encompass several popular classes of ML model.

--

We won't go into all the mathematics involved in these models - rather, our focus will be on conducting the ML pre-processing, training and validation in R.

---

# Decision Trees

The simplest type of tree model is a .teal_style[Decision Tree] model. The easiest way to describe a decision tree model is probably to show one:

--

<img src="data:image/png;base64,#penguin_decision_tree.jpg" width="600px" style="display: block; margin: auto;" />

---

# Ensemble Tree Models

Rather than using a single decision tree, we can use *ensemble* methods like .teal_style[Random Forests] and .teal_style[Boosted Trees].

--

These ensemble methods combine multiple decision trees, in order to achieve better results than those obtained from any single decision tree.

<img src="data:image/png;base64,#tree.jpg" width="450px" style="display: block; margin: auto;" />

*"tree" by Robert Couse-Baker is licensed under [CC BY 2.0](https://creativecommons.org/licenses/by/2.0/)*
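---

# Decision Trees in R

Before we turn to ML in R more generally, here is a minimal sketch of how a single decision tree, like the one shown two slides ago, could be fitted in R. The rpart package is an assumed choice for this sketch - the Computer Lab code may differ:

```r
# Minimal sketch: fit a classification tree to the penguins data.
# The rpart and palmerpenguins packages are assumed to be installed.
library(rpart)
library(palmerpenguins)

penguins_cc <- na.omit(penguins)  # drop rows with missing values

# Model species as a function of three feature variables
tree_fit <- rpart(species ~ flipper_length_mm + body_mass_g + sex,
                  data = penguins_cc, method = "class")

print(tree_fit)  # text summary of the fitted splits
```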
---

# 5. Machine Learning in R

We can use a variety of R packages to perform supervised and unsupervised ML tasks.

There are many ML R packages available - we will focus on the **caret** package (short for **C**lassification **A**nd **RE**gression **T**raining), developed by Kuhn et al. (2021).

---

# Predicting Penguin Species

To conclude this lecture, let us consider another example.

--

We will use the familiar .teal_style[penguins] data set (Horst et al. 2020).

--

Suppose we would like to use ML to predict the species of penguins in the Palmer archipelago, based on some of their other characteristics - namely

* their flipper length,

--

* body mass, and

--

* sex

--

(for this example we will ignore the other recorded variables in the penguins data set).

---

class: center, middle

This is a .teal_style[multi-class classification problem], with **feature variables**:

--

* flipper\_length\_mm

--

* body\_mass\_g

--

* sex

--

and the **outcome variable** species.

--

Given we already have recorded species observations for all the penguins, our ML task can be categorised as a .teal_style[supervised learning task].

---

class: center, middle

To begin, we download and install the caret package, and then load the caret and palmerpenguins packages.

--

```r
install.packages("caret")
library(caret)
library(palmerpenguins)
```

--

Next, we create a new data set containing only our chosen feature and outcome variables (and also remove missing values):

--

```r
# Keep species (column 1) and flipper_length_mm, body_mass_g and sex
# (columns 5-7), and drop any rows containing missing values
ml_penguins <- na.omit(penguins[, c(1, 5:7)])
```

---

# Dummy Variables

One assumption made by the caret package is that all the feature variable data are numeric.

--

Since the penguin feature variable sex is categorical rather than numeric, we will have to convert it to a numeric variable before we train our ML model.

--

A .teal_style[dummy variable] is a variable that only takes values of either 0 or 1, to indicate the absence or presence of a factor of interest, respectively. So if we have the dummy variable `sex = female`, then this would equal 1 for all female penguins, and 0 for all male penguins.

---

The full R code process for converting categorical variables to dummy variables is shown in the supplementary material.
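--

As a preview, here is a minimal sketch of one way to perform this conversion, using caret's `dummyVars` function - the supplement's exact code may differ in its details:

```r
# Sketch: expand the categorical variable sex into dummy variables.
# dummyVars() creates the conversion specification; predict() applies it.
dummy_spec <- dummyVars(species ~ ., data = ml_penguins)

# Numeric features pass through unchanged; sex becomes sex.female and
# sex.male. We re-attach the (still categorical) outcome variable species.
ml_penguins_updated <- data.frame(
  species = ml_penguins$species,
  predict(dummy_spec, newdata = ml_penguins)
)
```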
Following this process, our data looks like this:

--

<div style="border: 1px;overflow-x: scroll; width:115%; "><table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> species </th> <th style="text-align:center;"> flipper_length_mm </th> <th style="text-align:center;"> body_mass_g </th> <th style="text-align:center;"> sex.female </th> <th style="text-align:center;"> sex.male </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Adelie </td> <td style="text-align:center;"> 181 </td> <td style="text-align:center;"> 3750 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> Adelie </td> <td style="text-align:center;"> 195 </td> <td style="text-align:center;"> 3325 </td> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 0 </td> </tr>
<tr> <td style="text-align:left;"> Adelie </td> <td style="text-align:center;"> 185 </td> <td style="text-align:center;"> 3000 </td> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 0 </td> </tr>
<tr> <td style="text-align:left;"> Adelie </td> <td style="text-align:center;"> 191 </td> <td style="text-align:center;"> 3700 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> Gentoo </td> <td style="text-align:center;"> 218 </td> <td style="text-align:center;"> 5700 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> Gentoo </td> <td style="text-align:center;"> 220 </td> <td style="text-align:center;"> 5150 </td> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 0 </td> </tr>
<tr> <td style="text-align:left;"> Gentoo </td> <td style="text-align:center;"> 217 </td> <td style="text-align:center;"> 4900 </td> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 0 </td> </tr>
<tr> <td style="text-align:left;"> Gentoo </td> <td style="text-align:center;"> 210 </td> <td style="text-align:center;"> 4700 </td> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 0 </td> </tr>
<tr> <td style="text-align:left;"> Chinstrap </td> <td style="text-align:center;"> 195 </td> <td style="text-align:center;"> 3600 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> Chinstrap </td> <td style="text-align:center;"> 196 </td> <td style="text-align:center;"> 3675 </td> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 0 </td> </tr>
<tr> <td style="text-align:left;"> Chinstrap </td> <td style="text-align:center;"> 187 </td> <td style="text-align:center;"> 3350 </td> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 0 </td> </tr>
<tr> <td style="text-align:left;"> Chinstrap </td> <td style="text-align:center;"> 202 </td> <td style="text-align:center;"> 3400 </td> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 0 </td> </tr>
</tbody>
</table></div>

--

Note that the outcome variable can remain categorical.

---

# Identifying Samples Exerting Excessive Influence

Before we train our ML model, we should also do some checks to ensure the quality of our training data is high.
--

For instance, we should check to ensure that:

--

* Our data is balanced, with a large number of unique values for each feature variable

--

* There are no samples that might have an excessive influence on the model

--

* We do not have any highly correlated feature variables

---

If any feature variables have zero or near-zero variance, this can cause problems when we split our data into training and validation sets.

--

If we use the `nearZeroVar` function as follows:

```r
nearZeroVar(ml_penguins_updated, saveMetrics = TRUE)
```

--

we obtain this output:

```
##                   freqRatio percentUnique zeroVar   nzv
## species            1.226891     0.9009009   FALSE FALSE
## flipper_length_mm  1.235294    16.2162162   FALSE FALSE
## body_mass_g        1.200000    27.9279279   FALSE FALSE
## sex.female         1.018182     0.6006006   FALSE FALSE
## sex.male           1.018182     0.6006006   FALSE FALSE
```

--

None of the variables have zero or near-zero variance (see the `zeroVar` and `nzv` columns).

---

# freqRatio Interpretation

In the previously shown output, the first two columns were .teal_style[`freqRatio`] and .teal_style[`percentUnique`].

--

The `freqRatio` column gives the frequency of the most prevalent value recorded for that variable, divided by the frequency of the second most prevalent value.

--

* `freqRatio` values close to 1 are good - this means we don't have an unbalanced data set where one value is being recorded significantly more frequently than other values.

---

# percentUnique Interpretation

The `percentUnique` column shows the number of unique values recorded for each variable, divided by the total number of samples, and expressed as a percentage.

--

If we only have a few unique values (i.e. the feature variable has near-zero variance) then the `percentUnique` value will be small. Therefore, higher values are considered better.

--

As our data set increases in size, this percentage will naturally decrease.

--

* Dummy variables often have low `percentUnique` values, and this is fine - can you think why?

---

class: center, middle

# Cut-off Specifications

If we have certain pre-determined requirements for the `freqRatio` and `percentUnique` values, we can specify .teal_style[cut-off values].

--

Full details on how to do this are included in the supplementary material.

---

# Checking for Correlated Feature Variables

We should also check that our feature variables are not too highly correlated. For this check, we ignore dummy variables, since they originate from categorical, not numeric, data.

--

We can compute a correlation matrix, and summarise the correlation values, using the R code:

```r
# Correlation matrix for the numeric feature variables (columns 2 and 3)
base_cor <- cor(ml_penguins_updated[, 2:3])

# Summarise the correlations in the upper triangle of the matrix
summary(base_cor[upper.tri(base_cor)])
```

--

Just as for the `freqRatio` and `percentUnique` results, we can specify an arbitrary cut-off for the correlation, for example:

```r
# Flag feature variables with a pairwise correlation above 0.9
findCorrelation(base_cor, cutoff = .9)
```

---

# Data Splitting

The final step before we train our model is to split our processed data into training and validation data sets.

--

We can use the `createDataPartition` function to intelligently split the data into these two sets. One benefit of using this function is that if our outcome variable is a factor (like `species`), the random sampling employed by the `createDataPartition` function will occur within each class.

--

In other words, if we have a data set comprising roughly:

* 50% Adelie penguin data,

* 20% Chinstrap data and

* 30% Gentoo data,

--

the `createDataPartition` sampling will preserve this overall class distribution of 50/20/30.
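--

For example, here is a quick sketch (with an arbitrary seed) of how we could confirm that the class proportions are preserved - we will perform the actual split on the next slide:

```r
set.seed(123)  # arbitrary seed, so this sketch is reproducible
idx <- createDataPartition(ml_penguins_updated$species, p = .8, list = FALSE)

# Class proportions in the full data vs the 80% training portion
round(prop.table(table(ml_penguins_updated$species)), 2)
round(prop.table(table(ml_penguins_updated$species[idx[, 1]])), 2)
```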
---

# Final Preparations

Here we split the data 80/20 into training and validation sets, via the argument `p = .8`.

```r
train_index <- createDataPartition(ml_penguins_updated$species,
                                   p = .8, # here p designates the split - 80/20
                                   list = FALSE,
                                   times = 1) # times specifies how many splits to perform
```

--

The `train_index` object now contains the partition details, but we still need to assign our pre-processed data to specific training and validation objects in R:

```r
ml_penguin_train <- ml_penguins_updated[train_index, ]     # training data
ml_penguin_validate <- ml_penguins_updated[-train_index, ] # validation data
```

--

We are now ready to train our ML model!

---

# Decision Tree ML Results

Then, using a decision tree model, we would obtain output similar to this:

<img src="data:image/png;base64,#penguin_decision_tree.jpg" width="500px" style="display: block; margin: auto;" />

--

You will be able to produce this decision tree as part of Computer Lab 9B! We can use the validation data to check the model's predictive accuracy.

---

# End

That concludes our Introduction to Machine Learning lecture.

--

What to do next:

* Before Computer Lab 9B, please read over the supplementary material - there were some details we were not able to cover due to time constraints.

* If you have any questions, we can resolve them in the computer labs.

---

background-image: url(data:image/png;base64,#computerlab.jpg)
background-position: bottom
background-size: 75%
class: center

# See you in the computer labs!

---

# References

* Delua, J. 2021. “Supervised vs. Unsupervised Learning: What's the Difference?” IBM. [https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning](https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning).

* Grant, Thomas D, and Damon J Wischik. 2020. *On the Path to AI: Law's Prophecies and the Conceptual Foundations of the Machine Learning Age*. Cham: Springer International Publishing AG.

* Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. *Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data*. [https://doi.org/10.5281/zenodo.3960218](https://doi.org/10.5281/zenodo.3960218).

* Kuhn, M. 2019. *The caret Package*. [https://topepo.github.io/caret/index.html](https://topepo.github.io/caret/index.html).

---

# References

* Kuhn, M., J. Wing, S. Weston, A. Williams, C. Keefer, A. Engelhardt, T. Cooper, et al. 2021. *caret: Classification and Regression Training*. [https://cran.r-project.org/web/packages/caret/index.html](https://cran.r-project.org/web/packages/caret/index.html).

* Thulin, M. 2021. *Modern Statistics with R: From Wrangling and Exploring Data to Inference and Predictive Modelling*.

* Zhou, Zhi-Hua. 2021. *Machine Learning*.

---

class: middle

<font color = "grey">
These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License <a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a>
</font>