Example R code solutions for the Data Science Module Computer Lab 10B, which uses the caret
R package (Kuhn et al. 2021) and Portuguese wine data obtained from UCI Machine Learning Repository (2009) (originally collected by Cortez et al. (2009)), are presented below.
This computer lab is designed to run alongside the content in the Introduction to Machine Learning in R supplement. It might be helpful to have this material open as you look through these solutions.
# Install packages
install.packages("caret", "magrittr", "rpart.plot")
# Load packages
library(caret)
library(rpart.plot)
No answer required.
Example R code for loading the red wine data is shown below:
red_wine <- read.csv(file = "winequality_red.csv", header = TRUE)
red_wine$quality <- as.factor(red_wine$quality)
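As a quick optional check, we can confirm that quality is now a factor with the six observed score levels:
# Confirm that quality is now a factor with six levels
levels(red_wine$quality)
## [1] "3" "4" "5" "6" "7" "8"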
The head function provides an overview of the different variables. The summary function provides information on the spread of observed values for each variable, and the dim function tells us that we have observations for 1599 wine samples.
head(red_wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
summary(red_wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 3: 10
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 4: 53
## Median :3.310 Median :0.6200 Median :10.20 5:681
## Mean :3.311 Mean :0.6581 Mean :10.42 6:638
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 7:199
## Max. :4.010 Max. :2.0000 Max. :14.90 8: 18
dim(red_wine)
## [1] 1599 12
We notice that the majority of quality scores are either 5 or 6. Very few scores of 3 or 8 are recorded (only 10 and 18 wines respectively), which might make it difficult to accurately predict such scores.
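A quick way to see this class imbalance is to tabulate the quality scores (the counts match the summary output above):
table(red_wine$quality)
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18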
plot(red_wine$quality)
As we can see from the featurePlot box plots, the free.sulfur.dioxide and total.sulfur.dioxide feature variables have much larger values than the other feature variables, which leads to the box plots being relatively uninformative.
featurePlot(x = red_wine[, -12],
            y = red_wine$quality,
            plot = "box")
Example R code is provided below:
featurePlot(x = red_wine[, -c(6, 7, 12)],
            y = red_wine$quality,
            plot = "box")
Many of the box plots are still not informative - we will try to address this in the next section.
No answer required.
nearZeroVar(red_wine, saveMetrics = T, freqCut = 2, uniqueCut = 5)
## freqRatio percentUnique zeroVar nzv
## fixed.acidity 1.175439 6.0037523 FALSE FALSE
## volatile.acidity 1.021739 8.9430894 FALSE FALSE
## citric.acid 1.941176 5.0031270 FALSE FALSE
## residual.sugar 1.190840 5.6910569 FALSE FALSE
## chlorides 1.200000 9.5684803 FALSE FALSE
## free.sulfur.dioxide 1.326923 3.7523452 FALSE FALSE
## total.sulfur.dioxide 1.194444 9.0056285 FALSE FALSE
## density 1.028571 27.2670419 FALSE FALSE
## pH 1.017857 5.5659787 FALSE FALSE
## sulphates 1.014706 6.0037523 FALSE FALSE
## alcohol 1.349515 4.0650407 FALSE FALSE
## quality 1.067398 0.3752345 FALSE FALSE
nearZeroVar(red_wine, saveMetrics = F, freqCut = 2, uniqueCut = 5)
## integer(0)
None of the feature variables are flagged as exceeding our specified cut-off values.
The citric.acid feature variable has the highest freqRatio value, at 1.94, and also has the lowest percentUnique value (5.003) of the feature variables (just missing our specified cut-off). Neither of these values is too alarming, so it should be ok to leave the citric.acid feature variable in our data set for the time being.
We conclude that there do not appear to be any problematic variables.
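For reference, these two diagnostics are straightforward to reproduce by hand: freqRatio is the frequency of the most common value divided by that of the second most common value, and percentUnique is the number of distinct values as a percentage of the number of observations. A quick sketch for citric.acid (output omitted here):
# freqRatio: most common citric.acid value's frequency over the second most common's
cit_tab <- sort(table(red_wine$citric.acid), decreasing = TRUE)
cit_tab[1] / cit_tab[2]  # ~1.94
# percentUnique: distinct values as a percentage of all 1599 observations
length(unique(red_wine$citric.acid)) / nrow(red_wine) * 100  # ~5.003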
base_cor <- cor(red_wine[, -12])
base_cor
## fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity 1.00000000 -0.256130895 0.67170343 0.114776724
## volatile.acidity -0.25613089 1.000000000 -0.55249568 0.001917882
## citric.acid 0.67170343 -0.552495685 1.00000000 0.143577162
## residual.sugar 0.11477672 0.001917882 0.14357716 1.000000000
## chlorides 0.09370519 0.061297772 0.20382291 0.055609535
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813 0.187048995
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302 0.203027882
## density 0.66804729 0.022026232 0.36494718 0.355283371
## pH -0.68297819 0.234937294 -0.54190414 -0.085652422
## sulphates 0.18300566 -0.260986685 0.31277004 0.005527121
## alcohol -0.06166827 -0.202288027 0.10990325 0.042075437
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## fixed.acidity 0.093705186 -0.153794193 -0.11318144
## volatile.acidity 0.061297772 -0.010503827 0.07647000
## citric.acid 0.203822914 -0.060978129 0.03553302
## residual.sugar 0.055609535 0.187048995 0.20302788
## chlorides 1.000000000 0.005562147 0.04740047
## free.sulfur.dioxide 0.005562147 1.000000000 0.66766645
## total.sulfur.dioxide 0.047400468 0.667666450 1.00000000
## density 0.200632327 -0.021945831 0.07126948
## pH -0.265026131 0.070377499 -0.06649456
## sulphates 0.371260481 0.051657572 0.04294684
## alcohol -0.221140545 -0.069408354 -0.20565394
## density pH sulphates alcohol
## fixed.acidity 0.66804729 -0.68297819 0.183005664 -0.06166827
## volatile.acidity 0.02202623 0.23493729 -0.260986685 -0.20228803
## citric.acid 0.36494718 -0.54190414 0.312770044 0.10990325
## residual.sugar 0.35528337 -0.08565242 0.005527121 0.04207544
## chlorides 0.20063233 -0.26502613 0.371260481 -0.22114054
## free.sulfur.dioxide -0.02194583 0.07037750 0.051657572 -0.06940835
## total.sulfur.dioxide 0.07126948 -0.06649456 0.042946836 -0.20565394
## density 1.00000000 -0.34169933 0.148506412 -0.49617977
## pH -0.34169933 1.00000000 -0.196647602 0.20563251
## sulphates 0.14850641 -0.19664760 1.000000000 0.09359475
## alcohol -0.49617977 0.20563251 0.093594750 1.00000000
extreme_cor <- sum(abs(base_cor[upper.tri(base_cor)]) > .999)
extreme_cor
## [1] 0
summary(base_cor[upper.tri(base_cor)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.68298 -0.09942 0.04295 0.02285 0.16576 0.67170
The largest negative correlation value is -0.68297819, between pH and fixed.acidity. The largest positive correlation value is 0.67170343, between citric.acid and fixed.acidity.
These correlation values are not too large in magnitude, so do not seem too problematic.
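These extremes can also be located programmatically rather than by scanning the matrix by eye; a quick sketch (output omitted here):
# Mask the diagonal (all 1s), then locate the most extreme correlations.
# Each call returns two rows because the correlation matrix is symmetric.
cor_vals <- base_cor
diag(cor_vals) <- NA
which(cor_vals == min(cor_vals, na.rm = TRUE), arr.ind = TRUE)  # pH vs fixed.acidity
which(cor_vals == max(cor_vals, na.rm = TRUE), arr.ind = TRUE)  # citric.acid vs fixed.acidity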
Whether we use the findCorrelation function to identify highly correlated feature variables to remove from our data set depends on what we deem an acceptable level of correlation. If we are happy with correlations of magnitude 0.69 and under, then we do not need to run this function. If we would like to set the maximum acceptable correlation magnitude to e.g. 0.67, then the findCorrelation function would identify the fixed.acidity feature variable for removal.
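For example (an optional sketch; the names = TRUE argument, available in recent versions of caret, returns column names rather than indices):
# Flag feature variables with pairwise correlation magnitude above 0.67
findCorrelation(base_cor, cutoff = 0.67, names = TRUE)
## [1] "fixed.acidity"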
Note: For the remainder of this lab, we will assume no feature variables needed to be removed.
centre_scale <- preProcess(red_wine[, -12],
                           method = c("center", "scale"))
red_wine_updated <- predict(centre_scale, red_wine)
If we compare the original data to the updated data, we see that the feature variable values are now scaled and centred.
head(red_wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
head(red_wine_updated)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 -0.5281944 0.9615758 -1.391037 -0.45307667 -0.24363047
## 2 -0.2984541 1.9668271 -1.391037 0.04340257 0.22380518
## 3 -0.2984541 1.2966596 -1.185699 -0.16937425 0.09632273
## 4 1.6543385 -1.3840105 1.483689 -0.45307667 -0.26487754
## 5 -0.5281944 0.9615758 -1.391037 -0.45307667 -0.24363047
## 6 -0.5281944 0.7381867 -1.391037 -0.52400227 -0.26487754
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1 -0.46604672 -0.3790141 0.55809987 1.2882399 -0.57902538
## 2 0.87236532 0.6241680 0.02825193 -0.7197081 0.12891007
## 3 -0.08364328 0.2289750 0.13422152 -0.3310730 -0.04807379
## 4 0.10755844 0.4113718 0.66406945 -0.9787982 -0.46103614
## 5 -0.46604672 -0.3790141 0.55809987 1.2882399 -0.57902538
## 6 -0.27484500 -0.1966174 0.55809987 1.2882399 -0.57902538
## alcohol quality
## 1 -0.9599458 5
## 2 -0.5845942 5
## 3 -0.5845942 5
## 4 -0.5845942 6
## 5 -0.9599458 5
## 6 -0.9599458 5
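As an optional sanity check, each transformed feature variable should now have a mean of approximately 0 (up to floating point error) and a standard deviation of 1 (output omitted here):
# Check the means and standard deviations of the scaled feature variables
round(colMeans(red_wine_updated[, -12]), 3)
apply(red_wine_updated[, -12], 2, sd)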
The featurePlot box plots are now more informative, compared to those created in 2.3. For all variables, we are now able to see more clearly how the feature variables’ values differ across the different quality scores.
We observe that for several feature variables (sulphates, chlorides, volatile.acidity and residual.sugar) there are more extreme observations for quality scores of 5 and 6 than for the other scores.
featurePlot(x = red_wine_updated[, -12],
            y = red_wine_updated$quality,
            plot = "box")
featurePlot(x = red_wine_updated[, -12],
            y = red_wine_updated$quality,
            plot = "pairs",
            auto.key = list(columns = 6))
We observe that for most pairs the data is too closely clumped to clearly distinguish between the quality ratings. Perhaps our machine learning model can help.
Example R code is provided below.
Please note that the partitioning of the data into training and validation sets involves random sampling, so your results from this point onwards may differ slightly from those presented in the subsequent question solutions, since your training and validation data sets will most likely contain slightly different sets of observations.
set.seed(1650)
wine_train_index <- createDataPartition(red_wine_updated$quality,
                                        p = 0.8,
                                        list = FALSE, times = 1)
red_wine_train <- red_wine_updated[wine_train_index, ]
red_wine_validate <- red_wine_updated[-wine_train_index, ]
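With the seed set as above, we can confirm that the split is roughly 80/20: 1282 training observations (matching the model summary below) and 317 validation observations.
# Check the sizes of the training and validation sets
nrow(red_wine_train)
## [1] 1282
nrow(red_wine_validate)
## [1] 317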
Please note that we run the set.seed(1650) command prior to training the decision tree model, so that the results discussed here are reproducible regardless of the number of times this document is generated. If you do not set a seed prior to training your models, your results may appear slightly different.
set.seed(1650)
red_wine_decision_tree <- train(quality ~ .,
                                data = red_wine_train,
                                method = "rpart")
red_wine_decision_tree
## CART
##
## 1282 samples
## 11 predictor
## 6 classes: '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1282, 1282, 1282, 1282, 1282, 1282, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01221167 0.5737107 0.3033509
## 0.02374491 0.5657458 0.2697084
## 0.25237449 0.4719068 0.1116919
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01221167.
As we can see, the best accuracy achieved was only 57.37%. For context, simply predicting the most common quality score (5) for every wine would be correct roughly 43% of the time (681 of 1599 wines), so this is only a modest improvement over a naive baseline.
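The complexity parameter chosen for the final model can also be extracted directly from the fitted train object:
# The cp value caret selected for the final model
red_wine_decision_tree$bestTune
##           cp
## 1 0.01221167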
rpart.plot(red_wine_decision_tree$finalModel)
No answer required.
# Load magrittr package for piping
library(magrittr)
# Count the number of observations in the validation data
validation_numbers <- dim(red_wine_validate)[1]
# Use the fitted model to predict quality values given the validation data
predict_red_wine_decision_tree <- predict(red_wine_decision_tree,
                                          newdata = red_wine_validate)
# The code below gives us the percentage of correct predictions, rounded
# to 2 decimal places. Note the parentheses: without them, the %>% pipe
# binds more tightly than / and *, and the rounding is never applied to
# the final percentage.
dec_tree_accuracy <- (sum(predict_red_wine_decision_tree == red_wine_validate$quality) /
  validation_numbers * 100) %>% round(2)
dec_tree_accuracy
## [1] 52.05
We observe that the accuracy of the model on the validation data is only 52.05%, which is noticeably lower than the 57.37% accuracy achieved on the training data. It would appear that our model may not perform as well as we anticipated when presented with new data. This highlights the importance of validating your machine learning models on data that was not used for training.
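As an optional extra, caret's confusionMatrix function computes this validation accuracy along with a full cross-tabulation of predicted and observed scores, which shows exactly where the model's predictions go wrong (output omitted here):
# Cross-tabulate predictions against the observed quality scores
confusionMatrix(data = predict_red_wine_decision_tree,
                reference = red_wine_validate$quality)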
These notes have been prepared by Rupert Kuveke. Please note that some of the content in these notes has been developed from content in Thulin (2021). The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.