Data Science Stream

Topic 10B: Machine Learning II


Example R code solutions for the Data Science Module Computer Lab 10B, which uses the caret R package (Kuhn et al. 2021) and Portuguese wine data obtained from UCI Machine Learning Repository (2009) (originally collected by Cortez et al. (2009)), are presented below.

This computer lab is designed to run alongside the content in the Introduction to Machine Learning in R supplement. It might be helpful to have this material open as you look through these solutions.


1 Preparations

1.1 Load Required Packages

library(caret)
library(rpart.plot)

1.2 Wine Data

No answer required.

1.3

Example code for loading the red wine data is shown below:

red_wine <- read.csv(file = "winequality_red.csv", header = T)

1.4 Aim

No answer required.

1.5

red_wine$quality <- as.factor(red_wine$quality)

2 Data Visualisation

2.1

The head function provides an overview of the different variables. The summary function provides information on the spread of observed values for each variable, and the dim function tells us that we have observations for 1599 wine samples.

head(red_wine)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
summary(red_wine)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol      quality
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   3: 10  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   4: 53  
##  Median :3.310   Median :0.6200   Median :10.20   5:681  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   6:638  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   7:199  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   8: 18
dim(red_wine)
## [1] 1599   12

2.2

We notice that the majority of quality scores are either 5 or 6. Very few scores of 3 or 8 are recorded - this might make it difficult to accurately predict such scores.

plot(red_wine$quality)

2.3

As we can see from the featurePlot box plots, the free.sulfur.dioxide and total.sulfur.dioxide variables take much larger values than the other feature variables, which makes the box plots relatively uninformative.

featurePlot(x = red_wine[, -12], 
            y = red_wine$quality, 
            plot = "box")

2.4

Example code is provided below:

featurePlot(x = red_wine[, -c(6, 7, 12)], 
            y = red_wine$quality, 
            plot = "box")

Many of the box plots are still not informative - we will try to address this in the next section.

3 Pre-Processing

3.1

No answer required.

3.2 Highly Influential Samples

nearZeroVar(red_wine, saveMetrics = T, freqCut = 2, uniqueCut = 5)
##                      freqRatio percentUnique zeroVar   nzv
## fixed.acidity         1.175439     6.0037523   FALSE FALSE
## volatile.acidity      1.021739     8.9430894   FALSE FALSE
## citric.acid           1.941176     5.0031270   FALSE FALSE
## residual.sugar        1.190840     5.6910569   FALSE FALSE
## chlorides             1.200000     9.5684803   FALSE FALSE
## free.sulfur.dioxide   1.326923     3.7523452   FALSE FALSE
## total.sulfur.dioxide  1.194444     9.0056285   FALSE FALSE
## density               1.028571    27.2670419   FALSE FALSE
## pH                    1.017857     5.5659787   FALSE FALSE
## sulphates             1.014706     6.0037523   FALSE FALSE
## alcohol               1.349515     4.0650407   FALSE FALSE
## quality               1.067398     0.3752345   FALSE FALSE
nearZeroVar(red_wine, saveMetrics = F, freqCut = 2, uniqueCut = 5)
## integer(0)

3.3

None of the feature variables are flagged as exceeding our specified cut-off values.

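The citric.acid feature variable has the highest freqRatio value, at 1.94 (just under our cut-off of 2), and its percentUnique value of 5.003 sits only just above our cut-off of 5. Some other variables, such as free.sulfur.dioxide (3.75) and alcohol (4.07), have lower percentUnique values, but their freqRatio values are well below 2, so they are not flagged either.

Neither of the citric.acid values is too alarming - it should be ok to leave the citric.acid feature variable in our data set for the time being.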

We conclude that there do not appear to be any problematic variables.

3.4

base_cor <- cor(red_wine[, -12])
base_cor
##                      fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity           1.00000000     -0.256130895  0.67170343    0.114776724
## volatile.acidity       -0.25613089      1.000000000 -0.55249568    0.001917882
## citric.acid             0.67170343     -0.552495685  1.00000000    0.143577162
## residual.sugar          0.11477672      0.001917882  0.14357716    1.000000000
## chlorides               0.09370519      0.061297772  0.20382291    0.055609535
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813    0.187048995
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302    0.203027882
## density                 0.66804729      0.022026232  0.36494718    0.355283371
## pH                     -0.68297819      0.234937294 -0.54190414   -0.085652422
## sulphates               0.18300566     -0.260986685  0.31277004    0.005527121
## alcohol                -0.06166827     -0.202288027  0.10990325    0.042075437
##                         chlorides free.sulfur.dioxide total.sulfur.dioxide
## fixed.acidity         0.093705186        -0.153794193          -0.11318144
## volatile.acidity      0.061297772        -0.010503827           0.07647000
## citric.acid           0.203822914        -0.060978129           0.03553302
## residual.sugar        0.055609535         0.187048995           0.20302788
## chlorides             1.000000000         0.005562147           0.04740047
## free.sulfur.dioxide   0.005562147         1.000000000           0.66766645
## total.sulfur.dioxide  0.047400468         0.667666450           1.00000000
## density               0.200632327        -0.021945831           0.07126948
## pH                   -0.265026131         0.070377499          -0.06649456
## sulphates             0.371260481         0.051657572           0.04294684
## alcohol              -0.221140545        -0.069408354          -0.20565394
##                          density          pH    sulphates     alcohol
## fixed.acidity         0.66804729 -0.68297819  0.183005664 -0.06166827
## volatile.acidity      0.02202623  0.23493729 -0.260986685 -0.20228803
## citric.acid           0.36494718 -0.54190414  0.312770044  0.10990325
## residual.sugar        0.35528337 -0.08565242  0.005527121  0.04207544
## chlorides             0.20063233 -0.26502613  0.371260481 -0.22114054
## free.sulfur.dioxide  -0.02194583  0.07037750  0.051657572 -0.06940835
## total.sulfur.dioxide  0.07126948 -0.06649456  0.042946836 -0.20565394
## density               1.00000000 -0.34169933  0.148506412 -0.49617977
## pH                   -0.34169933  1.00000000 -0.196647602  0.20563251
## sulphates             0.14850641 -0.19664760  1.000000000  0.09359475
## alcohol              -0.49617977  0.20563251  0.093594750  1.00000000
extreme_cor <- sum(abs(base_cor[upper.tri(base_cor)]) > .999)
extreme_cor
## [1] 0
summary(base_cor[upper.tri(base_cor)])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.68298 -0.09942  0.04295  0.02285  0.16576  0.67170

3.5

The largest negative correlation value is -0.68297819, between pH and fixed.acidity. The largest positive correlation value is 0.67170343, between citric.acid and fixed.acidity.

These correlation values are not too large in magnitude, so do not seem too problematic.

3.6

Whether to use the findCorrelation function to identify highly correlated feature variables to remove from our data set depends on what we term an acceptable level of correlation. If we are happy with correlations of magnitude 0.69 and under, then we do not need to run this function. If we would like to set the maximum acceptable correlation magnitude to e.g. 0.67, then the findCorrelation function would identify the fixed.acidity feature variable for removal.

Note: For the remainder of this lab, we will assume no feature variables needed to be removed.

3.7

centre_scale <- preProcess(red_wine[, -12], 
                           method = c("center", "scale"))
red_wine_updated <- predict(centre_scale, red_wine)

3.8

If we compare the original data to the updated data, we see that the feature variable values are now scaled and centred.

head(red_wine)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
head(red_wine_updated)
##   fixed.acidity volatile.acidity citric.acid residual.sugar   chlorides
## 1    -0.5281944        0.9615758   -1.391037    -0.45307667 -0.24363047
## 2    -0.2984541        1.9668271   -1.391037     0.04340257  0.22380518
## 3    -0.2984541        1.2966596   -1.185699    -0.16937425  0.09632273
## 4     1.6543385       -1.3840105    1.483689    -0.45307667 -0.26487754
## 5    -0.5281944        0.9615758   -1.391037    -0.45307667 -0.24363047
## 6    -0.5281944        0.7381867   -1.391037    -0.52400227 -0.26487754
##   free.sulfur.dioxide total.sulfur.dioxide    density         pH   sulphates
## 1         -0.46604672           -0.3790141 0.55809987  1.2882399 -0.57902538
## 2          0.87236532            0.6241680 0.02825193 -0.7197081  0.12891007
## 3         -0.08364328            0.2289750 0.13422152 -0.3310730 -0.04807379
## 4          0.10755844            0.4113718 0.66406945 -0.9787982 -0.46103614
## 5         -0.46604672           -0.3790141 0.55809987  1.2882399 -0.57902538
## 6         -0.27484500           -0.1966174 0.55809987  1.2882399 -0.57902538
##      alcohol quality
## 1 -0.9599458       5
## 2 -0.5845942       5
## 3 -0.5845942       5
## 4 -0.5845942       6
## 5 -0.9599458       5
## 6 -0.9599458       5

3.9

The featurePlot box plots are now more informative, compared to those created in 2.3. For all variables, we are now able to see more clearly how the feature variables’ values differ across the different quality scores.

We observe that for several feature variables (sulphates, chlorides, volatile.acidity and residual.sugar) there are more extreme observations for quality scores of 5 and 6 than for the other scores.

featurePlot(x = red_wine_updated[, -12], 
            y = red_wine_updated$quality, 
            plot = "box")

3.10

featurePlot(x = red_wine_updated[, -12], 
            y = red_wine_updated$quality, 
            plot = "pairs",
            auto.key = list(columns = 6))

We observe that for most pairs the data is too closely clumped to clearly distinguish between the quality ratings. Perhaps our machine learning model can help.

3.11 Training and Validation Data

Example code is provided below.

Please note that the data partitioning into training or validation categories is random to an extent, so your results from this point onwards may differ slightly from those presented in the subsequent question solutions, since your training and validation data sets will most likely contain slightly different sets of observations.

set.seed(1650)
wine_train_index <- createDataPartition(red_wine_updated$quality, 
                                        p = .8, # here p designates the split - 80/20
                                        list = FALSE, times = 1) 

3.12

red_wine_train <- red_wine_updated[wine_train_index, ]
red_wine_validate <- red_wine_updated[-wine_train_index, ]

4 Fitting a Decision Tree Machine Learning Model

Please note that we run the set.seed(1650) command prior to training the decision tree model, so that the results discussed here are accurate regardless of the number of times this document is generated. If you do not set a seed prior to training your models, your results may appear slightly different.

4.1 Decision Tree

set.seed(1650)
red_wine_decision_tree <- train(quality ~ .,
                                data = red_wine_train,
                                method = "rpart")
red_wine_decision_tree
## CART 
## 
## 1282 samples
##   11 predictor
##    6 classes: '3', '4', '5', '6', '7', '8' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.01221167  0.5737107  0.3033509
##   0.02374491  0.5657458  0.2697084
##   0.25237449  0.4719068  0.1116919
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01221167.

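As we can see, the best accuracy achieved was only 57.37%. For context, always predicting the most common quality score (5) would be correct for about 42.6% of the samples (681 of 1599), so the decision tree offers only a modest improvement over this naive baseline.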

4.1.1

rpart.plot(red_wine_decision_tree$finalModel)

4.1.2

No answer required.

5 Fitting a Random Forest Machine Learning Model

Please note that we run the set.seed(1650) command prior to training the random forest model, so that the results discussed here are accurate regardless of the number of times this document is generated. If you do not set a seed prior to training your models, your results may appear slightly different.

5.1 Training the Random Forest Model

set.seed(1650)
red_wine_rf <- train(quality ~ .,
                     data = red_wine_train,
                     method = "rf")

5.2

red_wine_rf
## Random Forest 
## 
## 1282 samples
##   11 predictor
##    6 classes: '3', '4', '5', '6', '7', '8' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.6700665  0.4633930
##    6    0.6627082  0.4551261
##   11    0.6582344  0.4497119
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

The best accuracy achieved by the random forest model is 67.01%.

5.3

The random forest model has a much higher predictive accuracy (67.01%) than the decision tree model (57.37%) for this specific data set.

5.4 Random Forest Plots

No answer required.

5.5

ggplot(red_wine_rf)

dotPlot(varImp(red_wine_rf))

We observe that the feature variables considered most important are alcohol (not surprisingly), followed by volatile.acidity and total.sulfur.dioxide.

6 Validating Results

No answer required.

6.1

The code below checks the accuracy of the decision tree model on the held-out validation data.

# Load magrittr package for piping
library(magrittr)

# count number of observations in validation data
validation_numbers <- dim(red_wine_validate)[1]

# Use the fitted model to predict quality values given the validation data
predict_red_wine_decision_tree <- predict(red_wine_decision_tree, 
                                          newdata = red_wine_validate)
# When run, the code below gives us the percentage of correct predictions
dec_tree_accuracy <- sum(predict_red_wine_decision_tree == 
                           red_wine_validate$quality) / validation_numbers * 100

dec_tree_accuracy %>% round(2)
## [1] 52.05

6.2

The code below checks the accuracy of the random forest model on the held-out validation data.

# Load magrittr package for piping
library(magrittr)

# count number of observations in validation data
validation_numbers <- dim(red_wine_validate)[1]

# Use the fitted model to predict quality values given the validation data
predict_red_wine_rf <- predict(red_wine_rf, 
                               newdata = red_wine_validate)
# When run, the code below gives us the percentage of correct predictions
rf_accuracy <- sum(predict_red_wine_rf == 
                     red_wine_validate$quality) / validation_numbers * 100

rf_accuracy %>% round(2)
## [1] 65.93

We observe that the accuracy of the decision tree model using the validation data is only 52.05%, which is noticeably lower than the 57.37% accuracy estimated by resampling the training data.

The accuracy of the random forest model using the validation data is 65.93%, which is decent and quite close to the resampled accuracy of the random forest estimated from the training data (67.01%).

It would appear that our decision tree model may not perform as well as we anticipated when presented with new data. This highlights the importance of validating your machine learning models on data that was not used for training. The random forest model produces better results for both the training and validation data sets, and so there is no competition - we would choose to use the random forest model over the decision tree model here.


Great work, that’s everything for today!


References

Cortez, P., A. Cerdeira, F. Almeida, T. Matos, and J. Reis. 2009. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” Decision Support Systems 47 (4): 547–53.
Kuhn, M., J. Wing, S. Weston, A. Williams, C. Keefer, A. Engelhardt, T. Cooper, et al. 2021. caret: Classification and Regression Training. https://cran.r-project.org/web/packages/caret/index.html.
Thulin, M. 2021. Modern Statistics with R: From Wrangling and Exploring Data to Inference and Predictive Modelling.
UCI Machine Learning Repository. 2009. “Wine Quality Data Set[.csv File].” 2009. https://archive.ics.uci.edu/ml/datasets/Wine+Quality.


These notes have been prepared by Rupert Kuveke. Please note that some of the content in these notes has been developed from content in Thulin (2021). The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.

---
title: "STM1001: Computer Lab 10B Solutions"
output:
  bookdown::html_document2: 
    toc: true
    toc_float: true
    code_download: true
    theme: readable
    code_folding: show
bibliography: STM1001_DS_CL_references.bib 
link-citations: yes
---

<style>
#TOC {
  background: url("https://www.latrobe.edu.au/_media/la-trobe-api/v5/img/logo.svg");
  background-size: contain;
  padding-top: 80px !important;
  background-repeat: no-repeat;
}
</style>

### Data Science Stream {-}

### Topic 10B: Machine Learning II {-}

<br>

Example R code solutions for the [Data Science Module Computer Lab 10B](https://rpubs.com/LTU_STM1001/DSMCL10), which uses the `caret` R package [@caret] and Portuguese wine data obtained from @UCIWine (originally collected by @wine), are presented below.

This computer lab is designed to run alongside the content in the [Introduction to Machine Learning in R supplement](https://bookdown.org/rehk/stm1001_dsm_introduction_to_machine_learning_in_r/). It might be helpful to have this material open as you look through these solutions.


<br>

# Preparations {#prep}

## Load Required Packages {#load}

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
library(caret)
library(rpart.plot)
```

```{r class.source = "fold-show", eval = T, include = F}
# Specify required packages
ml_packages <- c("caret", "magrittr", "rpart.plot")
# Install missing packages
install.packages(setdiff(ml_packages, rownames(installed.packages())))
# Load all packages
lapply(ml_packages, library, character.only = TRUE)
```

## Wine Data

No answer required.

##

Example code for loading the red wine data is shown below:

```{r class.source = "fold-hide", eval = F, echo = T, warning = F, message = F}
red_wine <- read.csv(file = "winequality_red.csv", header = T)
```

```{r class.source = "fold-hide", eval = T, include = F}
red_wine <- read.csv(file = "data/winequality_red.csv", header = T)
```

## Aim

No answer required.

## 

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
red_wine$quality <- as.factor(red_wine$quality)
```


# Data Visualisation {#dataviz}

##

The `head` function provides an overview of the different variables. The `summary` function provides information on the spread of observed values for each variable, and the `dim` function tells us that we have observations for 1599 wine samples.

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
head(red_wine)
summary(red_wine)
dim(red_wine)
```

## {#problem}

We notice that the majority of quality scores are either 5 or 6. Very few scores of 3 or 8 are recorded - this might make it difficult to accurately predict such scores.

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
plot(red_wine$quality)
```
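
To put numbers on this imbalance, the sketch below recomputes the class proportions from the counts printed by `summary(red_wine)` above. The `class_counts` vector is typed in by hand purely for illustration; with the data loaded, `table(red_wine$quality)` should give the same counts.

```r
# Quality score counts copied from the summary() output above
# (hand-typed here so that the sketch runs without the data file)
class_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Percentage of samples receiving each quality score
class_percent <- 100 * class_counts / sum(class_counts)
round(class_percent, 1)

# Combined share of the two most common scores (5 and 6)
share_5_6 <- sum(class_percent[c("5", "6")])
round(share_5_6, 1)
```

Scores of 5 and 6 together account for roughly four in every five samples, which is worth keeping in mind when we interpret model accuracy later on.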

## {#featureplot}

As we can see from the `featurePlot` box plots, the `free.sulfur.dioxide` and `total.sulfur.dioxide` variables take much larger values than the other feature variables, which makes the box plots relatively uninformative.

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F, fig.dim = c(8, 8)}
featurePlot(x = red_wine[, -12], 
            y = red_wine$quality, 
            plot = "box")
```

## {#featureplotsimplified}

Example code is provided below:

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F, fig.dim = c(8, 8)}
featurePlot(x = red_wine[, -c(6, 7, 12)], 
            y = red_wine$quality, 
            plot = "box")
```

Many of the box plots are still not informative - we will try to address this in the next section.

# Pre-Processing {#prepro}

##

No answer required.

## Highly Influential Samples

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
nearZeroVar(red_wine, saveMetrics = T, freqCut = 2, uniqueCut = 5)
nearZeroVar(red_wine, saveMetrics = F, freqCut = 2, uniqueCut = 5)
```

##

None of the feature variables are flagged as exceeding our specified cut-off values.

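The `citric.acid` feature variable has the highest `freqRatio` value, at 1.94 (just under our cut-off of 2), and its `percentUnique` value of 5.003 sits only just above our cut-off of 5. Some other variables, such as `free.sulfur.dioxide` (3.75) and `alcohol` (4.07), have lower `percentUnique` values, but their `freqRatio` values are well below 2, so they are not flagged either.

Neither of the `citric.acid` values is too alarming - it should be ok to leave the `citric.acid` feature variable in our data set for the time being.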

We conclude that there do not appear to be any problematic variables.
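
The flagging rule can be made explicit with a small sketch. As we understand it, `nearZeroVar` flags a variable only when its `freqRatio` is above `freqCut` and its `percentUnique` is at or below `uniqueCut` (variables with a single unique value are handled separately). The data frame below is hand-typed from three rows of the output above, and the column names `high_ratio`, `few_unique` and `flagged` are our own labels rather than part of `caret`.

```r
# Three rows copied from the nearZeroVar(..., saveMetrics = T) output above
metrics <- data.frame(
  variable      = c("citric.acid", "free.sulfur.dioxide", "alcohol"),
  freqRatio     = c(1.941176, 1.326923, 1.349515),
  percentUnique = c(5.0031270, 3.7523452, 4.0650407)
)

# Cut-off values used in this lab
freq_cut   <- 2
unique_cut <- 5

# Check each condition separately, then combine them
metrics$high_ratio <- metrics$freqRatio > freq_cut
metrics$few_unique <- metrics$percentUnique <= unique_cut
metrics$flagged    <- metrics$high_ratio & metrics$few_unique

metrics
```

None of the three variables satisfies both conditions at once, which is consistent with the `integer(0)` result returned when `saveMetrics = F`.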

##

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
base_cor <- cor(red_wine[, -12])
base_cor

extreme_cor <- sum(abs(base_cor[upper.tri(base_cor)]) > .999)
extreme_cor

summary(base_cor[upper.tri(base_cor)])
```

## 

The largest negative correlation value is -0.68297819, between `pH` and `fixed.acidity`.
The largest positive correlation value is 0.67170343, between `citric.acid` and `fixed.acidity`.

These correlation values are not too large in magnitude, so do not seem too problematic.
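
Rather than scanning the printed matrix by eye, the strongest pair can be located programmatically. The sketch below is one possible approach, using a hand-typed 4 x 4 subset of the correlation matrix above (named `sub_cor` here for illustration); with the full data, the same steps could be applied to `base_cor`.

```r
# Subset of the correlation matrix above, typed in by hand for illustration
vars <- c("fixed.acidity", "citric.acid", "density", "pH")
sub_cor <- matrix(
  c( 1.00000000,  0.67170343,  0.66804729, -0.68297819,
     0.67170343,  1.00000000,  0.36494718, -0.54190414,
     0.66804729,  0.36494718,  1.00000000, -0.34169933,
    -0.68297819, -0.54190414, -0.34169933,  1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep only the upper triangle so each pair appears once and the
# diagonal of 1s is ignored
upper <- sub_cor
upper[!upper.tri(upper)] <- NA

# Row and column position of the largest correlation in magnitude
strongest <- which(abs(upper) == max(abs(upper), na.rm = TRUE), arr.ind = TRUE)
strongest_pair <- c(rownames(sub_cor)[strongest[1, "row"]],
                    colnames(sub_cor)[strongest[1, "col"]])
strongest_pair
```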

## 

Whether to use the `findCorrelation` function to identify highly correlated feature variables to remove from our data set depends on what we term an acceptable level of correlation. If we are happy with correlations of magnitude 0.69 and under, then we do not need to run this function.
If we would like to set the maximum acceptable correlation magnitude to e.g. 0.67, then the `findCorrelation` function would identify the `fixed.acidity` feature variable for removal.

*Note: For the remainder of this lab, we will assume no feature variables needed to be removed.*
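
To see how sensitive this decision is to the chosen cut-off, the sketch below counts how many variable pairs exceed a given threshold. The `strongest_pairs` vector holds the four largest-magnitude correlations from the matrix above, typed in by hand, and `pairs_above` is a hypothetical helper written for this illustration; it does not reproduce how `findCorrelation` chooses which variable to drop.

```r
# The four largest-magnitude correlations from the matrix above
strongest_pairs <- c(
  "fixed.acidity ~ pH"                         = -0.68297819,
  "fixed.acidity ~ citric.acid"                =  0.67170343,
  "fixed.acidity ~ density"                    =  0.66804729,
  "free.sulfur.dioxide ~ total.sulfur.dioxide" =  0.66766645
)

# Hypothetical helper: names of the pairs whose correlation magnitude
# exceeds the chosen cut-off
pairs_above <- function(cutoff) {
  names(strongest_pairs)[abs(strongest_pairs) > cutoff]
}

pairs_above(0.69)  # no pairs exceed this cut-off
pairs_above(0.67)  # both pairs returned here involve fixed.acidity
```

Lowering the cut-off from 0.69 to 0.67 brings two pairs into consideration, and `fixed.acidity` appears in both of them.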

##

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
centre_scale <- preProcess(red_wine[, -12], 
                           method = c("center", "scale"))
red_wine_updated <- predict(centre_scale, red_wine)
```

##

If we compare the original data to the updated data, we see that the feature variable values are now scaled and centred.

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
head(red_wine)
head(red_wine_updated)
```
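
The arithmetic behind centring and scaling is simple: subtract the mean of each variable, then divide by its standard deviation. The sketch below applies this to the first six `fixed.acidity` values only, so the results will not match `red_wine_updated`, where the mean and standard deviation are computed from all 1599 samples.

```r
# First six fixed.acidity values from head(red_wine)
x <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4)

# Centre (subtract the mean) and scale (divide by the standard deviation)
z <- (x - mean(x)) / sd(x)
round(z, 3)

# The transformed values have mean 0 and standard deviation 1
mean(z)
sd(z)

# Base R's scale() performs the same calculation
all.equal(z, as.numeric(scale(x)))
```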

## {#boxplotnew}

The `featurePlot` box plots are now more informative, compared to those created in \@ref(featureplot). For all variables, we are now able to see more clearly how the feature variables' values differ across the different `quality` scores.

We observe that for several feature variables (`sulphates`, `chlorides`, `volatile.acidity` and `residual.sugar`) there are more extreme observations for `quality` scores of 5 and 6 than for the other scores.

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F, fig.dim = c(8,8), cache = T}
featurePlot(x = red_wine_updated[, -12], 
            y = red_wine_updated$quality, 
            plot = "box")
```

##

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F, fig.dim = c(10,10), cache = T}
featurePlot(x = red_wine_updated[, -12], 
            y = red_wine_updated$quality, 
            plot = "pairs",
            auto.key = list(columns = 6))
```

We observe that for most pairs the data is too closely clumped to clearly distinguish between the quality ratings. Perhaps our machine learning model can help.

## Training and Validation Data {#train}

Example code is provided below. 

**Please note that the data partitioning into training or validation categories is random to an extent, so your results from this point onwards may differ slightly from those presented in the subsequent question solutions, since your training and validation data sets will most likely contain slightly different sets of observations.**

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
set.seed(1650)
wine_train_index <- createDataPartition(red_wine_updated$quality, 
                                        p = .8, # here p designates the split - 80/20
                                        list = FALSE, times = 1) 
```
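
Because `quality` is a factor, `createDataPartition` samples within each quality score so that the training data roughly preserves the class proportions. The base R sketch below is a simplified illustration of that idea, not `caret`'s exact implementation. The class counts are hand-typed from the earlier `summary` output, and rounding each class's training size up with `ceiling` is an assumption on our part.

```r
set.seed(1650)

# Rebuild a quality factor with the same class counts as the wine data
class_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- factor(rep(names(class_counts), times = class_counts))

p <- 0.8  # proportion of each class assigned to the training data

# Sample 80% of the row indices within each quality score
idx_by_class <- split(seq_along(quality), quality)
train_idx <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(length(idx) * p))]
}), use.names = FALSE))

# Everything not sampled for training is kept for validation
validate_idx <- setdiff(seq_along(quality), train_idx)

# Class proportions in the training data versus the full data
round(prop.table(table(quality[train_idx])), 3)
round(prop.table(table(quality)), 3)
```

Sampling within each class means that even the rare scores of 3 and 8 are represented in both the training and validation data.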

##

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
red_wine_train <- red_wine_updated[wine_train_index, ]
red_wine_validate <- red_wine_updated[-wine_train_index, ]
```

# Fitting a Decision Tree Machine Learning Model {#fit}

Please note that we run the `set.seed(1650)` command prior to training the decision tree model, so that the results discussed here are accurate regardless of the number of times this document is generated. If you do not set a seed prior to training your models, your results may appear slightly different.


## Decision Tree {#dectree}

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F, cache = T}
set.seed(1650)
red_wine_decision_tree <- train(quality ~ .,
                                data = red_wine_train,
                                method = "rpart")
red_wine_decision_tree
```

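As we can see, the best accuracy achieved was only 57.37%. For context, always predicting the most common quality score (5) would be correct for about 42.6% of the samples (681 of 1599), so the decision tree offers only a modest improvement over this naive baseline.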
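
One way to judge an accuracy figure like this is against simple baselines. The sketch below computes two of them from the class counts shown earlier (hand-typed here for illustration): guessing each of the six scores with equal probability, and always predicting the most common score.

```r
# Quality score counts from the summary() output earlier in the lab
class_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Expected accuracy (%) when guessing uniformly among the six scores
uniform_guess <- 100 / length(class_counts)

# Accuracy (%) when always predicting the most common score
majority_baseline <- 100 * max(class_counts) / sum(class_counts)

round(c(uniform = uniform_guess, majority = majority_baseline), 2)
```

The majority-class baseline is usually the more demanding comparison when the classes are as imbalanced as they are here.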

###

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F, fig.dim = c(8,6)}
rpart.plot(red_wine_decision_tree$finalModel)
```

###

No answer required.

# Fitting a Random Forest Machine Learning Model {#randomforest}

Please note that we run the `set.seed(1650)` command prior to training the random forest model, so that the results discussed here are accurate regardless of the number of times this document is generated. If you do not set a seed prior to training your models, your results may appear slightly different.

## Training the Random Forest Model {#rf}

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F, cache = T}
set.seed(1650)
red_wine_rf <- train(quality ~ .,
                     data = red_wine_train,
                     method = "rf")
```

## {#rfaccuracy}

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
red_wine_rf
```

The best accuracy achieved by the random forest model is `r max(round(red_wine_rf$results$Accuracy *100, 2))`%.
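
The model output also reports a Kappa value alongside accuracy. As a rough illustration, Cohen's kappa compares the observed agreement between predictions and true values with the agreement expected by chance alone. The two-class confusion matrix below is invented for this example and is not taken from the wine data.

```r
# Invented confusion matrix (filled column by column):
# rows are predicted classes, columns are actual classes
conf <- matrix(c(40, 10, 5, 45), nrow = 2,
               dimnames = list(Predicted = c("low", "high"),
                               Actual    = c("low", "high")))

n <- sum(conf)

# Observed agreement: proportion of predictions on the diagonal
observed <- sum(diag(conf)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(conf) * colSums(conf)) / n^2

# Cohen's kappa
kappa <- (observed - expected) / (1 - expected)
c(accuracy = observed, kappa = kappa)
```

A kappa of 0 corresponds to chance-level agreement and a kappa of 1 to perfect agreement, which makes it a useful companion to accuracy when classes are imbalanced.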

##

The random forest model has a much higher predictive accuracy (`r max(round(red_wine_rf$results$Accuracy *100, 2))`%) than the decision tree model (`r max(round(red_wine_decision_tree$results$Accuracy *100, 2))`%) for this specific data set.

## Random Forest Plots {#rfplots}

No answer required.

##

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
ggplot(red_wine_rf)

dotPlot(varImp(red_wine_rf))
```

We observe that the feature variables considered most important are `alcohol` (not surprisingly), followed by `volatile.acidity` and `total.sulfur.dioxide`.

# Validating Results {#val}

No answer required.

## {#validationcheck}

The code below checks the accuracy of the decision tree model on the held-out validation data.

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
# Load magrittr package for piping
library(magrittr)

# count number of observations in validation data
validation_numbers <- dim(red_wine_validate)[1]

# Use the fitted model to predict quality values given the validation data
predict_red_wine_decision_tree <- predict(red_wine_decision_tree, 
                                          newdata = red_wine_validate)
# When run, the code below gives us the percentage of correct predictions
dec_tree_accuracy <- sum(predict_red_wine_decision_tree == 
                           red_wine_validate$quality) / validation_numbers * 100

dec_tree_accuracy %>% round(2)
```

##

The code below checks the accuracy of the random forest model on the held-out validation data.

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
# Load magrittr package for piping
library(magrittr)

# count number of observations in validation data
validation_numbers <- dim(red_wine_validate)[1]

# Use the fitted model to predict quality values given the validation data
predict_red_wine_rf <- predict(red_wine_rf, 
                               newdata = red_wine_validate)
# When run, the code below gives us the percentage of correct predictions
rf_accuracy <- sum(predict_red_wine_rf == 
                     red_wine_validate$quality) / validation_numbers * 100

rf_accuracy %>% round(2)
```

We observe that the accuracy of the decision tree model using the validation data is only `r dec_tree_accuracy %>% round(2)`%, which is noticeably lower than the `r max(round(red_wine_decision_tree$results$Accuracy *100, 2))`% accuracy estimated by resampling the training data.

The accuracy of the random forest model using the validation data is `r rf_accuracy %>% round(2)`%, which is decent and quite close to the resampled accuracy of the random forest estimated from the training data (`r max(round(red_wine_rf$results$Accuracy *100, 2))`%).

It would appear that our decision tree model may not perform as well as we anticipated when presented with new data. This highlights the importance of validating your machine learning models on data that was not used for training. The random forest model produces better results for both the training and validation data sets, and so there is no competition - we would choose to use the random forest model over the decision tree model here.
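
A single accuracy figure does not show which quality scores are being confused with each other. The sketch below cross-tabulates predictions against true values using base R. The `truth` and `pred` vectors are invented for illustration; in the lab, they would be replaced by `red_wine_validate$quality` and the corresponding vector of model predictions.

```r
# Invented true and predicted quality scores, sharing the same factor levels
truth <- factor(c(5, 5, 6, 6, 7, 5, 6, 7), levels = 3:8)
pred  <- factor(c(5, 6, 6, 6, 6, 5, 5, 7), levels = 3:8)

# Percentage of correct predictions
accuracy <- mean(pred == truth) * 100
accuracy

# Confusion table: rows are predictions, columns are true scores
conf_tab <- table(Predicted = pred, Actual = truth)
conf_tab

# Correct predictions sit on the diagonal of the table
sum(diag(conf_tab))
```

Off-diagonal entries show where the model goes wrong, for example whether wines scored 7 tend to be predicted as 6.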

<br>

#### Great work, that's everything for today! #### {-}

<br>

# References {- #Ref}
<div id="refs"></div>

<br>

<font color = "grey">
These notes have been prepared by Rupert Kuveke. Please note that some of the content in these notes has been developed from content in @ModStat. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/CC" target="_blank"> BY-NC-ND. </a>
</font>