Data Science Stream

Topic 9B: Machine Learning I


Example R code solutions for the Data Science Computer Lab 9 are presented below.


1 Preparations

1.1 Load Required Packages

# Specify required packages
ml_packages <- c("caret", "magrittr", "rpart.plot")
# Install missing packages
install.packages(setdiff(ml_packages, rownames(installed.packages())))
# Load all packages
lapply(ml_packages, library, character.only = TRUE)

2 Predicting Penguin Species

2.1 Penguin Data

No answer required.

2.2

library(palmerpenguins)

ml_penguins <- na.omit(penguins[, -8]) # note we ignore the year variable here

2.3 Aim

Since we have multiple feature variables, and a known outcome variable, we are dealing with a supervised learning multi-class classification problem.

2.4 Data Visualisation

featurePlot(x = ml_penguins[, -1], y = ml_penguins$species, 
            plot = "pairs", auto.key = list(columns = 3))

2.4.1

We observe in several of the scatter plots that the observations for the Adelie and Chinstrap penguins overlap - this could potentially mean that it will be difficult to distinguish between them when classifying penguins.

3 Pre-Processing the Penguin data

3.1 Dummy Variables

# Load a package to help with the restructure of the data
library(tibble) 

# Use the dummayVars function to create a full set of dummy variables for the ml_penguins data
dummy_penguins <- dummyVars(species ~ ., data = ml_penguins)

# Use the predict function to update our ml_penguins feature variables with 
# both island and sex dummy variables
ml_penguins_updated <- as_tibble(predict(dummy_penguins, newdata = ml_penguins))

# Prepend the outcome variable to our updated data set, otherwise it will be lost
ml_penguins_updated <- cbind(species = ml_penguins$species, ml_penguins_updated)

3.2

head(ml_penguins_updated)
##   species island.Biscoe island.Dream island.Torgersen bill_length_mm
## 1  Adelie             0            0                1           39.1
## 2  Adelie             0            0                1           39.5
## 3  Adelie             0            0                1           40.3
## 4  Adelie             0            0                1           36.7
## 5  Adelie             0            0                1           39.3
## 6  Adelie             0            0                1           38.9
##   bill_depth_mm flipper_length_mm body_mass_g sex.female sex.male
## 1          18.7               181        3750          0        1
## 2          17.4               186        3800          1        0
## 3          18.0               195        3250          1        0
## 4          19.3               193        3450          1        0
## 5          20.6               190        3650          0        1
## 6          17.8               181        3625          1        0

3.3 Highly Influential Samples

nearZeroVar(ml_penguins_updated, saveMetrics = F)
## integer(0)
nearZeroVar(ml_penguins_updated, saveMetrics = T)
##                   freqRatio percentUnique zeroVar   nzv
## species            1.226891     0.9009009   FALSE FALSE
## island.Biscoe      1.042945     0.6006006   FALSE FALSE
## island.Dream       1.707317     0.6006006   FALSE FALSE
## island.Torgersen   6.085106     0.6006006   FALSE FALSE
## bill_length_mm     1.166667    48.9489489   FALSE FALSE
## bill_depth_mm      1.200000    23.7237237   FALSE FALSE
## flipper_length_mm  1.235294    16.2162162   FALSE FALSE
## body_mass_g        1.200000    27.9279279   FALSE FALSE
## sex.female         1.018182     0.6006006   FALSE FALSE
## sex.male           1.018182     0.6006006   FALSE FALSE

3.3.1

nearZeroVar(ml_penguins_updated, saveMetrics = T, freqCut = 2, uniqueCut = 5)
##                   freqRatio percentUnique zeroVar   nzv
## species            1.226891     0.9009009   FALSE FALSE
## island.Biscoe      1.042945     0.6006006   FALSE FALSE
## island.Dream       1.707317     0.6006006   FALSE FALSE
## island.Torgersen   6.085106     0.6006006   FALSE  TRUE
## bill_length_mm     1.166667    48.9489489   FALSE FALSE
## bill_depth_mm      1.200000    23.7237237   FALSE FALSE
## flipper_length_mm  1.235294    16.2162162   FALSE FALSE
## body_mass_g        1.200000    27.9279279   FALSE FALSE
## sex.female         1.018182     0.6006006   FALSE FALSE
## sex.male           1.018182     0.6006006   FALSE FALSE

3.3.2

There do not appear to be any feature variables which need to be removed due to their freqCut and uniqueCut values. While the variables for the different islands and sexes have low percentUnique values, and in the case of the island.Torgersen variable a high freqRatio value, this is not cause for concern as these are dummy variables.

3.3.3

base_cor <-  cor(ml_penguins_updated[, 5:8])
extreme_cor <- sum(abs(base_cor[upper.tri(base_cor)]) > .999)
extreme_cor
## [1] 0

The result of 0 here tells us that no feature variables have extremely high correlation with other feature variables.

3.3.4

base_cor 
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm         1.0000000    -0.2286256         0.6530956   0.5894511
## bill_depth_mm         -0.2286256     1.0000000        -0.5777917  -0.4720157
## flipper_length_mm      0.6530956    -0.5777917         1.0000000   0.8729789
## body_mass_g            0.5894511    -0.4720157         0.8729789   1.0000000

The largest negative correlation value is -0.5777917, between flipper_length_mm and bill_depth_mm. The largest positive correlation value is 0.8729789, between flipper_length_mm and body_mass_g.

The correlation value of could potentially be problematic.

3.3.5

No answer required.

3.3.6

ml_penguins_filtered <- ml_penguins_updated[, - 7] # flipper_length_mm has been removed

4 Training and Validation Data

set.seed(1650)
train_index <- createDataPartition(ml_penguins_filtered$species,
                                   p = .8, # here p designates the split - 80/20
                                   list = FALSE, times = 1) 

4.1

penguin_train <- ml_penguins_filtered[train_index, ]
penguin_validate <- ml_penguins_filtered[-train_index, ]

5 Fitting a Decision Tree Machine Learning Model

5.1 Decision Tree

set.seed(1650) 
penguin_decision_tree <- train(species ~ .,
                               data = penguin_train,
                               method = "rpart")
penguin_decision_tree
## CART 
## 
## 268 samples
##   8 predictor
##   3 classes: 'Adelie', 'Chinstrap', 'Gentoo' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 268, 268, 268, 268, 268, 268, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.01324503  0.9436108  0.9117821
##   0.35761589  0.8089728  0.6885822
##   0.56291391  0.5649392  0.2597508
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01324503.

5.1.1

No answer required.

5.1.2

rpart.plot(penguin_decision_tree$finalModel)

6 Validating Results

6.1

# Load magrittr package for piping
library(magrittr)

# count number of observations in validation data
validation_numbers <- nrow(penguin_validate)

# Use the fitted model to predict quality values given the validation data
predict_penguin_decision_tree <- predict(penguin_decision_tree, 
                                          newdata =penguin_validate)
# When run, the code below gives us the percentage of correct predictions
dec_tree_accuracy <- sum(predict_penguin_decision_tree == 
                         penguin_validate$species) / validation_numbers * 100

dec_tree_accuracy %>% round(2) 
## [1] 92.31

6.2

The decision tree ML model we have trained appears to do an excellent job of predicting the penguin species, for both the training and validation data sets (94.36% accuracy and 92.31% accuracy respectively).


Great work, that’s everything for today. Don’t worry if you did not complete everything in the designated lab time, there is a lot to learn!


References

Thulin, M. 2021. Modern Statistics with R: From Wrangling and Exploring Data to Inference and Predictive Modelling.


These notes have been prepared by Rupert Kuveke. Please note that some of the content in these notes has been developed from content in Thulin (2021). The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.

---
title: "STM1001: Computer Lab 9B Solutions"
output:
  bookdown::html_document2: 
    toc: true
    toc_float: true
    code_download: true
    theme: readable
    code_folding: show
bibliography: STM1001_DS_CL_references.bib 
link-citations: yes
---

<style>
#TOC {
  background: url("https://www.latrobe.edu.au/_media/la-trobe-api/v5/img/logo.svg");
  background-size: contain;
  padding-top: 80px !important;
  background-repeat: no-repeat;
}
</style>

### Data Science Stream {-}

### Topic 9B: Machine Learning I {-}

<br>

Example R code solutions for the [Data Science Computer Lab 9](https://rpubs.com/LTU_STM1001/DSMCL9) are presented below.

<br>

# Preparations {#prep}

## Load Required Packages {#load}

```{r class.source = "fold-show", eval = F, include = T, warning = F, message = F}
# Specify required packages
ml_packages <- c("caret", "magrittr", "rpart.plot")
# Install missing packages
install.packages(setdiff(ml_packages, rownames(installed.packages())))
# Load all packages
lapply(ml_packages, library, character.only = TRUE)
```

```{r class.source = "fold-show", eval = T, echo = F, include = F, warning = F, message = F}
# Specify required packages
ml_packages <- c("caret", "magrittr", "rpart.plot")
# Install missing packages
install.packages(setdiff(ml_packages, rownames(installed.packages())))
# Load all packages
suppressPackageStartupMessages(lapply(ml_packages, library, character.only = TRUE))
```

# Predicting Penguin Species

## Penguin Data {#pnegin}

No answer required.

##

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
library(palmerpenguins)

ml_penguins <- na.omit(penguins[, -8]) # note we ignore the year variable here
```

## Aim

Since we have multiple feature variables, and a known outcome variable, we are dealing with a supervised learning multi-class classification problem.

## Data Visualisation  {#problem}

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F, fig.dim = c(12, 12), fig.align='center'}
featurePlot(x = ml_penguins[, -1], y = ml_penguins$species, 
            plot = "pairs", auto.key = list(columns = 3))
```

###

We observe in several of the scatter plots that the observations for the Adelie and Chinstrap penguins overlap - this could potentially mean that it will be difficult to distinguish between them when classifying penguins.

# Pre-Processing the Penguin data {#prepro}

## Dummy Variables {#dummy}


```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
# Load a package to help with the restructure of the data
library(tibble) 

# Use the dummayVars function to create a full set of dummy variables for the ml_penguins data
dummy_penguins <- dummyVars(species ~ ., data = ml_penguins)

# Use the predict function to update our ml_penguins feature variables with 
# both island and sex dummy variables
ml_penguins_updated <- as_tibble(predict(dummy_penguins, newdata = ml_penguins))

# Prepend the outcome variable to our updated data set, otherwise it will be lost
ml_penguins_updated <- cbind(species = ml_penguins$species, ml_penguins_updated)
```

##

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
head(ml_penguins_updated)
```

## Highly Influential Samples {#excessiveinfluence}

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
nearZeroVar(ml_penguins_updated, saveMetrics = F)
```

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
nearZeroVar(ml_penguins_updated, saveMetrics = T)
```

### {#cutoff}

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
nearZeroVar(ml_penguins_updated, saveMetrics = T, freqCut = 2, uniqueCut = 5)
```

###

There do not appear to be any feature variables which need to be removed due to their `freqCut` and `uniqueCut` values. While the variables for the different islands and sexes have low `percentUnique` values, and in the case of the `island.Torgersen` variable a high `freqRatio` value,
this is not cause for concern as these are dummy variables.

###

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
base_cor <-  cor(ml_penguins_updated[, 5:8])
extreme_cor <- sum(abs(base_cor[upper.tri(base_cor)]) > .999)
extreme_cor
```

The result of 0 here tells us that no feature variables have extremely high correlation with other feature variables.

### 

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
base_cor 
```

The largest negative correlation value is -0.5777917, between `flipper_length_mm` and `bill_depth_mm`.
The largest positive correlation value is 0.8729789, between `flipper_length_mm` and `body_mass_g`.

The correlation value of could potentially be problematic.


### {#corlimit}

No answer required.

###

```{r class.source = "fold-show", eval = T, echo = T, warning = F}
ml_penguins_filtered <- ml_penguins_updated[, - 7] # flipper_length_mm has been removed
```

# Training and Validation Data {#train}

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
set.seed(1650)
train_index <- createDataPartition(ml_penguins_filtered$species,
                                   p = .8, # here p designates the split - 80/20
                                   list = FALSE, times = 1) 
```

##

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
penguin_train <- ml_penguins_filtered[train_index, ]
penguin_validate <- ml_penguins_filtered[-train_index, ]
```

# Fitting a Decision Tree Machine Learning Model {#fit}

## Decision Tree {#dectree}

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F, cache = T}
set.seed(1650) 
penguin_decision_tree <- train(species ~ .,
                               data = penguin_train,
                               method = "rpart")
penguin_decision_tree
```

###

No answer required.

###

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F, fig.dim = c(8,8), fig.align='center'}
rpart.plot(penguin_decision_tree$finalModel)
```

# Validating Results {#val}

##

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
# Load magrittr package for piping
library(magrittr)

# count number of observations in validation data
validation_numbers <- nrow(penguin_validate)

# Use the fitted model to predict quality values given the validation data
predict_penguin_decision_tree <- predict(penguin_decision_tree, 
                                          newdata =penguin_validate)
# When run, the code below gives us the percentage of correct predictions
dec_tree_accuracy <- sum(predict_penguin_decision_tree == 
                         penguin_validate$species) / validation_numbers * 100

dec_tree_accuracy %>% round(2) 
```

##

The decision tree ML model we have trained appears to do an excellent job of predicting the penguin `species`, for both the training and validation data sets (`r round(100*max(penguin_decision_tree$results$Accuracy), 2)`% accuracy and `r dec_tree_accuracy %>% round(2)`% accuracy respectively).

<br>

#### Great work, that's everything for today. Don't worry if you did not complete everything in the designated lab time, there is a lot to learn! #### {-}

<br>

# References {- #Ref}
<div id="refs"></div>

<br>

<font color = "grey">
These notes have been prepared by Rupert Kuveke. Please note that some of the content in these notes has been developed from content in @ModStat. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/CC" target="_blank"> BY-NC-ND. </a>
</font>