library(tidyverse)
library(GGally)
library(MLmetrics)
library(caret)
Assignment 4
Introduction
For this assignment, I will work with the white wine data set from the Wine Quality data set from the UC Irvine Machine Learning Repository. The goal for this data set is to predict wine quality based on a variety of physicochemical tests. From a business perspective, we can assume that we are the owner of the winery. If we are able to accurately predict wine quality based on the input features, we may have a better understanding of how to tailor our production methods to create a wine of a desired quality. Additionally, we can use this information in pricing our wines which may come in handy for estimating future revenues.
Exploratory Data Analysis
First, I import the data set:
wine_data <- read_delim('winequality-white.csv',
                        delim = ';',
                        col_names = T)

wine_data <- as.data.frame(wine_data)
Next, I examine the structure of the data frame.
str(wine_data)
'data.frame': 4898 obs. of 12 variables:
$ fixed acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
$ volatile acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
$ citric acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
$ residual sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
$ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
$ free sulfur dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
$ total sulfur dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
$ density : num 1.001 0.994 0.995 0.996 0.996 ...
$ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
$ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
$ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
$ quality : num 6 6 6 6 6 6 6 6 6 6 ...
The target variable is quality, which ranges from 0 to 10, based on the description of the data set. Below, I check the values that are present:
unique(wine_data$quality)
[1] 6 5 7 8 4 3 9
As seen above, the values 0, 1, 2, and 10 are not present in the data set, which suggests wines of those qualities are rare. This presents an issue: quality can in principle take any value from 0 to 10, but the data set does not cover the full range of possible values.
table(wine_data$quality)
3 4 5 6 7 8 9
20 163 1457 2198 880 175 5
prop.table(table(wine_data$quality))
3 4 5 6 7 8
0.004083299 0.033278889 0.297468354 0.448754594 0.179665169 0.035728869
9
0.001020825
barplot(prop.table(table(wine_data$quality)),
main = 'Class proportions for target variable',
xlab = 'Quality',
ylab = 'Proportion of observations')
I do not know enough about the data to synthetically create observations of quality 0, 1, 2, or 10. Additionally, the extremely low frequency of quality 3 and 9 will present a problem for cross-validation: these classes may not be included in a given fold, or the model may never predict them, which produces NAs for metrics like F1, specificity, and accuracy. Thus, I will combine class 3 into class 4 and class 9 into class 8.
wine_data <- wine_data |>
  mutate(quality = replace(quality, quality == 3, 4)) |>
  mutate(quality = replace(quality, quality == 9, 8))
Next, I convert the target to a factor with the simplifying assumption that the only possible values are 4 to 8. Because wines of the other qualities did not appear in the data set, the business risk of misclassifying an unseen observation with one of those qualities is low. For example, even if we classified a quality-10 wine as a different quality, we are likely not making many bottles of that quality, so the revenue lost from underpricing the bottle would be small. That said, a conversation with management may lead to a different modelling approach.
I convert quality to a factor and treat this as a multiclass classification problem.
wine_data$quality <- factor(wine_data$quality, levels = seq(4, 8))
Next, I check whether there are any missing values in the data:
colSums(is.na(wine_data))
fixed acidity volatile acidity citric acid
0 0 0
residual sugar chlorides free sulfur dioxide
0 0 0
total sulfur dioxide density pH
0 0 0
sulphates alcohol quality
0 0 0
Next, I will remove spaces from the feature names:
colnames(wine_data) <- str_replace_all(colnames(wine_data),pattern = " ", replacement = '_')
I visualize the correlations between the features
corrplot::corrplot(cor(wine_data[,-12]), method = 'number', type = 'lower')
Correlations are low for the most part, though the following pairs have fairly high correlation:
density and residual sugar
density and alcohol
Feature selection may be necessary for certain algorithms to avoid multicollinearity.
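As a quick illustration of how such correlated predictors could be flagged programmatically, the sketch below uses caret's findCorrelation with an illustrative cutoff of 0.75; the same idea is applied later before fitting the neural network, and nothing here changes the data used in the models.

# Sketch: flag predictors whose absolute pairwise correlation exceeds 0.75.
# The 0.75 cutoff is an illustrative choice, not a tuned value.
high_corr_idx <- findCorrelation(cor(wine_data[, -12]), cutoff = 0.75)
colnames(wine_data)[high_corr_idx]   # candidate columns to drop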
Next, I visualize the distribution of the features:
wine_data |>
  select(-quality) |>
  pivot_longer(cols = everything(),
               names_to = 'variable',
               values_to = 'value') |>
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(vars(variable), scales = 'free')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
psych::describe(wine_data[,-12])
vars n mean sd median trimmed mad min max
fixed_acidity 1 4898 6.85 0.84 6.80 6.82 0.74 3.80 14.20
volatile_acidity 2 4898 0.28 0.10 0.26 0.27 0.09 0.08 1.10
citric_acid 3 4898 0.33 0.12 0.32 0.33 0.09 0.00 1.66
residual_sugar 4 4898 6.39 5.07 5.20 5.80 5.34 0.60 65.80
chlorides 5 4898 0.05 0.02 0.04 0.04 0.01 0.01 0.35
free_sulfur_dioxide 6 4898 35.31 17.01 34.00 34.36 16.31 2.00 289.00
total_sulfur_dioxide 7 4898 138.36 42.50 134.00 136.96 43.00 9.00 440.00
density 8 4898 0.99 0.00 0.99 0.99 0.00 0.99 1.04
pH 9 4898 3.19 0.15 3.18 3.18 0.15 2.72 3.82
sulphates 10 4898 0.49 0.11 0.47 0.48 0.10 0.22 1.08
alcohol 11 4898 10.51 1.23 10.40 10.43 1.48 8.00 14.20
range skew kurtosis se
fixed_acidity 10.40 0.65 2.17 0.01
volatile_acidity 1.02 1.58 5.08 0.00
citric_acid 1.66 1.28 6.16 0.00
residual_sugar 65.20 1.08 3.46 0.07
chlorides 0.34 5.02 37.51 0.00
free_sulfur_dioxide 287.00 1.41 11.45 0.24
total_sulfur_dioxide 431.00 0.39 0.57 0.61
density 0.05 0.98 9.78 0.00
pH 1.10 0.46 0.53 0.00
sulphates 0.86 0.98 1.59 0.00
alcohol 6.20 0.49 -0.70 0.02
Some of the features have positive skew and excess kurtosis, though the mean and median for most of the features are fairly close. For interpretability purposes, I will train models without doing transformations on the features (besides centering and scaling), though if the models perform poorly, it could be worthwhile to try a log transformation of the features before training the models.
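If a transformation does turn out to be needed, a minimal sketch is shown below. It log1p-transforms a few of the more right-skewed, non-negative predictors; the column choices are illustrative only and this transformed copy is not used in the models that follow.

# Sketch only (not used below): log1p-transform some right-skewed predictors.
wine_data_log <- wine_data |>
  mutate(across(c(residual_sugar, chlorides, free_sulfur_dioxide), log1p))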
Next, I check the distribution of the target variable
prop.table(table(wine_data$quality))
4 5 6 7 8
0.03736219 0.29746835 0.44875459 0.17966517 0.03674969
barplot(prop.table(table(wine_data$quality)),
main = 'Class proportions for target variable',
xlab = 'Quality',
ylab = 'Proportion of observations')
The target variable is imbalanced, with relatively few observations of quality 4 and 8.
Algorithm Selection and Evaluation
The task at hand is a multi-class classification problem with an imbalanced data set.
I will train and test four models to compare performance: gradient-boosted trees with xgboost (the xgbTree method), support vector machines with linear and radial kernels, and a multilayer perceptron neural network.
Additionally, because the data set is imbalanced, I will use the SMOTE algorithm during cross-validation to create synthetically balanced training sets within each fold. All of the features are continuous, so there is no need for dummy variables.
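To make the effect of SMOTE concrete, the sketch below applies it once to the full data set and compares class counts before and after. This is purely illustrative; in the actual pipeline caret applies SMOTE inside each cross-validation fold, and the function name and arguments here assume the themis package's smote() helper (themis is loaded later for caret's sampling as well).

# Illustration only: oversample the minority quality classes with SMOTE.
# over_ratio = 1 brings each minority class up to the majority class size.
library(themis)
table(wine_data$quality)                  # original class counts
smote_example <- themis::smote(wine_data, var = "quality", k = 5, over_ratio = 1)
table(smote_example$quality)              # class counts after SMOTE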
The data set is imbalanced, so looking at accuracy alone would be misleading. Thus, I will use mean balanced accuracy as the evaluation metric for training: balanced accuracy for a class is the average of sensitivity (the true positive rate) and specificity (the true negative rate), and the mean is taken across classes. Additionally, comparing the models' performance on the test set using balanced accuracy, mean F1, sensitivity, specificity, and precision may provide further insight, especially when looking at class-by-class performance.
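For clarity, the sketch below shows how mean balanced accuracy can be computed from a caret confusion matrix. The labels are made up for illustration; the same pattern of averaging the byClass columns with na.rm = TRUE is used when evaluating each model on the test set.

# Toy example: per-class balanced accuracy is (sensitivity + specificity) / 2,
# and the mean is taken across classes. Labels below are hypothetical.
pred <- factor(c(4, 5, 6, 6, 7, 8), levels = 4:8)
ref  <- factor(c(4, 5, 5, 6, 7, 7), levels = 4:8)
cm <- confusionMatrix(pred, ref)
per_class <- (cm$byClass[, "Sensitivity"] + cm$byClass[, "Specificity"]) / 2
mean(per_class, na.rm = TRUE)   # same as mean(cm$byClass[, "Balanced Accuracy"], na.rm = TRUE)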
Data Preparation
Train-Test Split
First, I create the train-test split, using an 80%/20% ratio.
set.seed(123)
sample_set <- sample(nrow(wine_data), round(nrow(wine_data) * 0.8), replace = FALSE)

train_set <- wine_data[sample_set, ]
test_set <- wine_data[-sample_set, ]
Below are the proportions of the target variable in the train and test set
Train:
round(prop.table(table(select(train_set, quality), exclude = NULL)), 4) * 100
quality
4 5 6 7 8
3.88 29.68 44.79 18.07 3.57
Test:
round(prop.table(table(select(test_set, quality), exclude = NULL)), 4) * 100
quality
4 5 6 7 8
3.16 30.00 45.20 17.55 4.08
Performance Tracking Dataframes
Below I create data frames to track performance:
performance_metrics_training <- data.frame(Model = character(),
                                           Accuracy = numeric(),
                                           Mean_Balanced_Accuracy = numeric(),
                                           No_Information_Rate = numeric(),
                                           Mean_F1 = numeric(),
                                           Sensitivity = numeric(),
                                           Specificity = numeric(),
                                           Precision = numeric())

performance_metrics_test <- data.frame(Model = character(),
                                       Accuracy = numeric(),
                                       Mean_Balanced_Accuracy = numeric(),
                                       No_Information_Rate = numeric(),
                                       Mean_F1 = numeric(),
                                       Sensitivity = numeric(),
                                       Specificity = numeric(),
                                       Precision = numeric())
xgbTree
First, I train a gradient-boosted tree model with xgboost.
library(themis)
Loading required package: recipes
Attaching package: 'recipes'
The following object is masked from 'package:stringr':
fixed
The following object is masked from 'package:stats':
step
I will use parallel computing to improve run time
library(foreach)
library(doParallel)
modelLookup('xgbTree')
model parameter label forReg forClass
1 xgbTree nrounds # Boosting Iterations TRUE TRUE
2 xgbTree max_depth Max Tree Depth TRUE TRUE
3 xgbTree eta Shrinkage TRUE TRUE
4 xgbTree gamma Minimum Loss Reduction TRUE TRUE
5 xgbTree colsample_bytree Subsample Ratio of Columns TRUE TRUE
6 xgbTree min_child_weight Minimum Sum of Instance Weight TRUE TRUE
7 xgbTree subsample Subsample Percentage TRUE TRUE
probModel
1 TRUE
2 TRUE
3 TRUE
4 TRUE
5 TRUE
6 TRUE
7 TRUE
There are seven tunable parameters for the xgbTree model. Tuning all of them could take a significant amount of time and computing power, so I will tune two: nrounds, the number of boosting iterations, and max_depth, the maximum tree depth. I will use default values for the other parameters to manage run time.
tune_grid_xgb <- expand.grid(
  nrounds = c(100, 200, 300),
  max_depth = c(3, 6, 9, 12),
  eta = 0.3,
  gamma = 0.01,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)
Below I set the train control parameters. I will do 5-fold cross-validation repeated 5 times, and I will use the SMOTE technique to create artificially balanced training sets within each fold.
ctrl <- trainControl(
  method = 'repeatedcv',
  number = 5,
  repeats = 5,
  summaryFunction = multiClassSummary,
  sampling = 'smote',
  classProbs = F
)
cluster1 <- makeCluster(detectCores() - 2)
registerDoParallel(cluster1)
set.seed(456)
xgbTree <- train(
  quality ~ .,
  data = train_set,
  preProcess = c('center', 'scale'),
  metric = 'Mean_Balanced_Accuracy',
  method = 'xgbTree',
  trControl = ctrl,
  tuneGrid = tune_grid_xgb
)
stopCluster(cluster1)
xgbTree
eXtreme Gradient Boosting
3918 samples
11 predictor
5 classes: '4', '5', '6', '7', '8'
Pre-processing: centered (11), scaled (11)
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 3135, 3134, 3133, 3136, 3134, 3134, ...
Addtional sampling using SMOTE prior to pre-processing
Resampling results across tuning parameters:
max_depth nrounds Accuracy Kappa Mean_F1 Mean_Sensitivity
3 100 0.5460434 0.3534250 0.4702838 0.5158949
3 200 0.5882083 0.3979457 0.5139890 0.5311809
3 300 0.6035220 0.4138226 0.5297622 0.5343977
6 100 0.6295575 0.4499886 0.5477090 0.5461920
6 200 0.6343524 0.4541545 0.5514231 0.5444607
6 300 0.6340459 0.4540595 0.5512820 0.5451630
9 100 0.6398655 0.4638203 0.5537106 0.5517001
9 200 0.6374146 0.4604411 0.5517901 0.5504608
9 300 0.6366482 0.4593506 0.5498213 0.5487264
12 100 0.6365443 0.4610502 0.5477002 0.5503784
12 200 0.6366480 0.4604401 0.5486310 0.5502158
12 300 0.6354231 0.4588203 0.5472706 0.5492307
Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision
0.8713705 0.4496733 0.8690382 0.4496733
0.8789158 0.5025965 0.8779797 0.5025965
0.8813866 0.5284779 0.8813905 0.5284779
0.8887219 0.5523054 0.8891568 0.5523054
0.8892217 0.5622326 0.8902830 0.5622326
0.8892055 0.5609152 0.8901894 0.5609152
0.8913866 0.5592199 0.8922196 0.5592199
0.8906791 0.5561252 0.8914296 0.5561252
0.8904995 0.5540436 0.8912409 0.5540436
0.8911785 0.5483484 0.8915643 0.5483484
0.8909383 0.5505806 0.8915086 0.5505806
0.8906278 0.5487203 0.8911511 0.5487203
Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
0.5158949 0.1092087 0.6936327
0.5311809 0.1176417 0.7050483
0.5343977 0.1207044 0.7078921
0.5461920 0.1259115 0.7174570
0.5444607 0.1268705 0.7168412
0.5451630 0.1268092 0.7171843
0.5517001 0.1279731 0.7215433
0.5504608 0.1274829 0.7205699
0.5487264 0.1273296 0.7196129
0.5503784 0.1273089 0.7207784
0.5502158 0.1273296 0.7205771
0.5492307 0.1270846 0.7199293
Tuning parameter 'eta' was held constant at a value of 0.3
Tuning parameter 'gamma' was held constant at a value of 0.01
Tuning parameter 'colsample_bytree' was held constant at a value of 1
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 1
Mean_Balanced_Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were nrounds = 100, max_depth = 9, eta
= 0.3, gamma = 0.01, colsample_bytree = 1, min_child_weight = 1 and
subsample = 1.
Recording training set performance:
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('xgbTree',
    round(getTrainPerf(xgbTree)[1], 4),
    round(getTrainPerf(xgbTree)[11], 4),
    round(max(prop.table(table(train_set$quality))), 4),
    round(getTrainPerf(xgbTree)[3], 4),
    round(getTrainPerf(xgbTree)[4], 4),
    round(getTrainPerf(xgbTree)[5], 4),
    round(getTrainPerf(xgbTree)[8], 4))
Next, I try predicting the test set:
xgbTree_predictions <- predict(xgbTree, test_set, type = 'raw')
Below are the performance statistics:
xgbTreeCM <- confusionMatrix(xgbTree_predictions, test_set$quality)
xgbTreeCM
Confusion Matrix and Statistics
Reference
Prediction 4 5 6 7 8
4 9 19 6 0 0
5 10 194 58 7 0
6 11 74 337 58 10
7 1 5 40 98 15
8 0 2 2 9 15
Overall Statistics
Accuracy : 0.6663
95% CI : (0.6358, 0.6958)
No Information Rate : 0.452
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4951
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity 0.290323 0.6599 0.7607 0.5698 0.37500
Specificity 0.973656 0.8907 0.7151 0.9245 0.98617
Pos Pred Value 0.264706 0.7212 0.6878 0.6164 0.53571
Neg Pred Value 0.976744 0.8594 0.7837 0.9099 0.97374
Prevalence 0.031633 0.3000 0.4520 0.1755 0.04082
Detection Rate 0.009184 0.1980 0.3439 0.1000 0.01531
Detection Prevalence 0.034694 0.2745 0.5000 0.1622 0.02857
Balanced Accuracy 0.631990 0.7753 0.7379 0.7471 0.68059
Recording test set performance:
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('xgbTree',
    round(xgbTreeCM$overall[1], 4),
    round(mean(xgbTreeCM$byClass[, "Balanced Accuracy"], na.rm = TRUE), 4),
    round(max(prop.table(table(test_set$quality))), 4),
    round(mean(xgbTreeCM$byClass[, "F1"], na.rm = TRUE), 4),
    round(mean(xgbTreeCM$byClass[, "Sensitivity"], na.rm = TRUE), 4),
    round(mean(xgbTreeCM$byClass[, "Specificity"], na.rm = TRUE), 4),
    round(mean(xgbTreeCM$byClass[, "Pos Pred Value"], na.rm = TRUE), 4))
SVM
Linear
Next, I will train SVM models, first with a linear kernel and then with a radial kernel.
cluster2 <- makeCluster(detectCores() - 2)
registerDoParallel(cluster2)
set.seed(789)
svm_lin <- train(
  quality ~ .,
  data = train_set,
  metric = 'Mean_Balanced_Accuracy',
  method = 'svmLinear',
  preProcess = c('center', 'scale'),
  trControl = ctrl,
  tuneGrid = expand.grid(.C = c(0.1, 1, 10, 100))
)
stopCluster(cluster2)
svm_lin
Support Vector Machines with Linear Kernel
3918 samples
11 predictor
5 classes: '4', '5', '6', '7', '8'
Pre-processing: centered (11), scaled (11)
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 3135, 3133, 3134, 3135, 3135, 3135, ...
Addtional sampling using SMOTE prior to pre-processing
Resampling results across tuning parameters:
C Accuracy Kappa Mean_F1 Mean_Sensitivity Mean_Specificity
0.1 0.3476856 0.1720657 0.3037014 0.4170020 0.8370933
1.0 0.3482446 0.1724033 0.3040978 0.4159401 0.8373207
10.0 0.3466602 0.1712505 0.3029620 0.4156929 0.8370180
100.0 0.3481936 0.1732629 0.3045583 0.4193555 0.8374660
Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision Mean_Recall
0.3256822 0.8360688 0.3256822 0.4170020
0.3267691 0.8360737 0.3267691 0.4159401
0.3256305 0.8359359 0.3256305 0.4156929
0.3272256 0.8363491 0.3272256 0.4193555
Mean_Detection_Rate Mean_Balanced_Accuracy
0.06953712 0.6270477
0.06964892 0.6266304
0.06933205 0.6263555
0.06963872 0.6284108
Mean_Balanced_Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was C = 100.
Recording training set performance:
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('svm_lin',
    round(getTrainPerf(svm_lin)[1], 4),
    round(getTrainPerf(svm_lin)[11], 4),
    round(max(prop.table(table(train_set$quality))), 4),
    round(getTrainPerf(svm_lin)[3], 4),
    round(getTrainPerf(svm_lin)[4], 4),
    round(getTrainPerf(svm_lin)[5], 4),
    round(getTrainPerf(svm_lin)[8], 4))
Next, I predict the test set
svmLin_predictions <- predict(svm_lin, test_set, type = 'raw')
Below are the performance metrics:
svm_linCM <- confusionMatrix(svmLin_predictions, test_set$quality)
svm_linCM
Confusion Matrix and Statistics
Reference
Prediction 4 5 6 7 8
4 17 75 42 9 0
5 8 149 134 20 3
6 2 38 105 29 2
7 1 17 90 54 8
8 3 15 72 60 27
Overall Statistics
Accuracy : 0.3592
95% CI : (0.3291, 0.3901)
No Information Rate : 0.452
P-Value [Acc > NIR] : 1
Kappa : 0.1787
Mcnemar's Test P-Value : <2e-16
Statistics by Class:
Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity 0.54839 0.5068 0.2370 0.3140 0.67500
Specificity 0.86723 0.7595 0.8678 0.8564 0.84043
Pos Pred Value 0.11888 0.4745 0.5966 0.3176 0.15254
Neg Pred Value 0.98327 0.7823 0.5796 0.8543 0.98381
Prevalence 0.03163 0.3000 0.4520 0.1755 0.04082
Detection Rate 0.01735 0.1520 0.1071 0.0551 0.02755
Detection Prevalence 0.14592 0.3204 0.1796 0.1735 0.18061
Balanced Accuracy 0.70781 0.6331 0.5524 0.5852 0.75771
Recording test set performance:
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('svm_lin',
    round(svm_linCM$overall[1], 4),
    round(mean(svm_linCM$byClass[, "Balanced Accuracy"], na.rm = TRUE), 4),
    round(max(prop.table(table(test_set$quality))), 4),
    round(mean(svm_linCM$byClass[, "F1"], na.rm = TRUE), 4),
    round(mean(svm_linCM$byClass[, "Sensitivity"], na.rm = TRUE), 4),
    round(mean(svm_linCM$byClass[, "Specificity"], na.rm = TRUE), 4),
    round(mean(svm_linCM$byClass[, "Pos Pred Value"], na.rm = TRUE), 4))
Radial
Next, I fit the SVM with a radial kernel.
cluster3 <- makeCluster(detectCores() - 2)
registerDoParallel(cluster3)
set.seed(1011)
svm_rad <- train(
  quality ~ .,
  data = train_set,
  metric = 'Mean_Balanced_Accuracy',
  method = 'svmRadial',
  preProcess = c('center', 'scale'),
  trControl = ctrl,
  tuneGrid = expand.grid(.sigma = c(0.08, 0.10, 0.12, 0.14, 0.16),
                         .C = c(0.1, 1, 10, 100))
)
stopCluster(cluster3)
svm_rad
Support Vector Machines with Radial Basis Function Kernel
3918 samples
11 predictor
5 classes: '4', '5', '6', '7', '8'
Pre-processing: centered (11), scaled (11)
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 3135, 3133, 3134, 3136, 3134, 3134, ...
Addtional sampling using SMOTE prior to pre-processing
Resampling results across tuning parameters:
sigma C Accuracy Kappa Mean_F1 Mean_Sensitivity
0.08 0.1 0.4114344 0.2234105 0.3537104 0.4517640
0.08 1.0 0.4595784 0.2730224 0.3936105 0.4749593
0.08 10.0 0.5023543 0.3126533 0.4264011 0.4851240
0.08 100.0 0.5533030 0.3627748 0.4665501 0.4971781
0.10 0.1 0.4177669 0.2288141 0.3589034 0.4543680
0.10 1.0 0.4669802 0.2795204 0.3977700 0.4731818
0.10 10.0 0.5210385 0.3320918 0.4429570 0.4949940
0.10 100.0 0.5676941 0.3784069 0.4808702 0.5060747
0.12 0.1 0.4259870 0.2366942 0.3647275 0.4565722
0.12 1.0 0.4763761 0.2879692 0.4045453 0.4751925
0.12 10.0 0.5365095 0.3488076 0.4554304 0.5011411
0.12 100.0 0.5774935 0.3887005 0.4910379 0.5096058
0.14 0.1 0.4322640 0.2421452 0.3698843 0.4588175
0.14 1.0 0.4862743 0.2977017 0.4125949 0.4799947
0.14 10.0 0.5460517 0.3588230 0.4605076 0.5003664
0.14 100.0 0.5805028 0.3908717 0.4938216 0.5095642
0.16 0.1 0.4392085 0.2485639 0.3740577 0.4593188
0.16 1.0 0.4951546 0.3071976 0.4195712 0.4832846
0.16 10.0 0.5557510 0.3672635 0.4668386 0.4991272
0.16 100.0 0.5866818 0.3973105 0.4997673 0.5128032
Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision
0.8474994 0.3549575 0.8453332 0.3549575
0.8573258 0.3849821 0.8544649 0.3849821
0.8644394 0.4101334 0.8621169 0.4101334
0.8733854 0.4504084 0.8716356 0.4504084
0.8484050 0.3576076 0.8461988 0.3576076
0.8586029 0.3882527 0.8557243 0.3882527
0.8680845 0.4245917 0.8657850 0.4245917
0.8762287 0.4663586 0.8746591 0.4663586
0.8500557 0.3620661 0.8476815 0.3620661
0.8601544 0.3929025 0.8573072 0.3929025
0.8711660 0.4364696 0.8689867 0.4364696
0.8779635 0.4796898 0.8765989 0.4796898
0.8510056 0.3655828 0.8485395 0.3655828
0.8619481 0.3990621 0.8591832 0.3990621
0.8731047 0.4425554 0.8708704 0.4425554
0.8782505 0.4842151 0.8770822 0.4842151
0.8523036 0.3686943 0.8497458 0.3686943
0.8637963 0.4049953 0.8611072 0.4049953
0.8745299 0.4500078 0.8725317 0.4500078
0.8792966 0.4925219 0.8783407 0.4925219
Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
0.4517640 0.08228688 0.6496317
0.4749593 0.09191567 0.6661425
0.4851240 0.10047086 0.6747817
0.4971781 0.11066060 0.6852817
0.4543680 0.08355339 0.6513865
0.4731818 0.09339605 0.6658924
0.4949940 0.10420770 0.6815393
0.5060747 0.11353882 0.6911517
0.4565722 0.08519739 0.6533139
0.4751925 0.09527522 0.6676735
0.5011411 0.10730189 0.6861535
0.5096058 0.11549870 0.6937847
0.4588175 0.08645281 0.6549116
0.4799947 0.09725485 0.6709714
0.5003664 0.10921035 0.6867355
0.5095642 0.11610056 0.6939073
0.4593188 0.08784170 0.6558112
0.4832846 0.09903092 0.6735404
0.4991272 0.11115019 0.6868285
0.5128032 0.11733636 0.6960499
Mean_Balanced_Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were sigma = 0.16 and C = 100.
Recording training set performance:
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('svm_rad',
    round(getTrainPerf(svm_rad)[1], 4),
    round(getTrainPerf(svm_rad)[11], 4),
    round(max(prop.table(table(train_set$quality))), 4),
    round(getTrainPerf(svm_rad)[3], 4),
    round(getTrainPerf(svm_rad)[4], 4),
    round(getTrainPerf(svm_rad)[5], 4),
    round(getTrainPerf(svm_rad)[8], 4))
Next, I predict the test set
svmRad_predictions <- predict(svm_rad, test_set, type = 'raw')
Below are the performance metrics
svm_radCM <- confusionMatrix(svmRad_predictions, test_set$quality)
svm_radCM
Confusion Matrix and Statistics
Reference
Prediction 4 5 6 7 8
4 9 15 14 0 0
5 12 197 89 11 3
6 8 68 279 52 9
7 2 13 50 96 14
8 0 1 11 13 14
Overall Statistics
Accuracy : 0.6071
95% CI : (0.5758, 0.6379)
No Information Rate : 0.452
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4209
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity 0.290323 0.6701 0.6298 0.55814 0.35000
Specificity 0.969442 0.8324 0.7449 0.90223 0.97340
Pos Pred Value 0.236842 0.6314 0.6707 0.54857 0.35897
Neg Pred Value 0.976645 0.8548 0.7092 0.90559 0.97237
Prevalence 0.031633 0.3000 0.4520 0.17551 0.04082
Detection Rate 0.009184 0.2010 0.2847 0.09796 0.01429
Detection Prevalence 0.038776 0.3184 0.4245 0.17857 0.03980
Balanced Accuracy 0.629882 0.7512 0.6873 0.73018 0.66170
Recording test set performance:
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('svm_rad',
    round(svm_radCM$overall[1], 4),
    round(mean(svm_radCM$byClass[, "Balanced Accuracy"], na.rm = TRUE), 4),
    round(max(prop.table(table(test_set$quality))), 4),
    round(mean(svm_radCM$byClass[, "F1"], na.rm = TRUE), 4),
    round(mean(svm_radCM$byClass[, "Sensitivity"], na.rm = TRUE), 4),
    round(mean(svm_radCM$byClass[, "Specificity"], na.rm = TRUE), 4),
    round(mean(svm_radCM$byClass[, "Pos Pred Value"], na.rm = TRUE), 4))
Neural Network
Next, I fit a multilayer perceptron neural network.
For the neural network, I check which features are highly correlated and remove them to avoid multicollinearity issues:
tooHigh <- findCorrelation(cor(train_set[, -12]), cutoff = 0.75)
trainNNet <- train_set[, -tooHigh]
testNNet <- test_set[, -tooHigh]
cluster4 <- makeCluster(detectCores() - 2)
registerDoParallel(cluster4)
set.seed(5000)
nn_mlp <- train(
  quality ~ .,
  data = trainNNet,
  metric = 'Mean_Balanced_Accuracy',
  method = 'mlpWeightDecay',
  preProcess = c('center', 'scale'),
  trControl = ctrl,
  tuneGrid = expand.grid(.size = c(1, 2, 3, 4),
                         .decay = c(0.00, 0.01, 0.1))
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
stopCluster(cluster4)
nn_mlp
Multi-Layer Perceptron
3918 samples
10 predictor
5 classes: '4', '5', '6', '7', '8'
Pre-processing: centered (10), scaled (10)
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 3135, 3135, 3133, 3135, 3134, 3135, ...
Addtional sampling using SMOTE prior to pre-processing
Resampling results across tuning parameters:
size decay Accuracy Kappa Mean_F1 Mean_Sensitivity
1 0.00 0.1789790 0.05851172 0.1526592 0.3385801
1 0.01 0.2276945 0.00000000 NaN 0.2000000
1 0.10 0.1731827 0.00000000 NaN 0.2000000
2 0.00 0.3378315 0.15710392 0.2535107 0.4052123
2 0.01 0.1845955 0.00000000 NaN 0.2000000
2 0.10 0.1819732 0.00000000 NaN 0.2000000
3 0.00 0.3773336 0.18316556 0.3034585 0.4263705
3 0.01 0.1775491 0.00000000 NaN 0.2000000
3 0.10 0.2545405 0.00000000 NaN 0.2000000
4 0.00 0.3667079 0.18871705 0.3152960 0.4340435
4 0.01 0.1894467 0.00000000 NaN 0.2000000
4 0.10 0.1729334 0.00000000 NaN 0.2000000
Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision
0.8123659 0.2472842 0.8159094 0.2472842
0.8000000 NaN NaN NaN
0.8000000 NaN NaN NaN
0.8331571 0.3041440 0.8379841 0.3041440
0.8000000 NaN NaN NaN
0.8000000 NaN NaN NaN
0.8392445 0.3381684 0.8386515 0.3381684
0.8000000 NaN NaN NaN
0.8000000 NaN NaN NaN
0.8400116 0.3356239 0.8403945 0.3356239
0.8000000 NaN NaN NaN
0.8000000 NaN NaN NaN
Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
0.3385801 0.03579580 0.5754730
0.2000000 0.04553891 0.5000000
0.2000000 0.03463654 0.5000000
0.4052123 0.06756630 0.6191847
0.2000000 0.03691911 0.5000000
0.2000000 0.03639463 0.5000000
0.4263705 0.07546672 0.6328075
0.2000000 0.03550982 0.5000000
0.2000000 0.05090811 0.5000000
0.4340435 0.07334158 0.6370275
0.2000000 0.03788934 0.5000000
0.2000000 0.03458668 0.5000000
Mean_Balanced_Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were size = 4 and decay = 0.
Recording training set performance:
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('nn_mlp',
    round(getTrainPerf(nn_mlp)[1], 4),
    round(getTrainPerf(nn_mlp)[11], 4),
    round(max(prop.table(table(trainNNet$quality))), 4),
    round(getTrainPerf(nn_mlp)[3], 4),
    round(getTrainPerf(nn_mlp)[4], 4),
    round(getTrainPerf(nn_mlp)[5], 4),
    round(getTrainPerf(nn_mlp)[8], 4))
library(NeuralNetTools)
plotnet(nn_mlp$finalModel,
circle_col = list(input = "skyblue", hidden = "orange", output = "tomato"),
alpha = 0.8,
cex_val = 0.75)
Next, I predict the test set:
nn_mlp_predictions <- predict(nn_mlp, testNNet, type = 'raw')
Below are the performance metrics:
nn_mlpCM <- confusionMatrix(nn_mlp_predictions, testNNet$quality)
nn_mlpCM
Confusion Matrix and Statistics
Reference
Prediction 4 5 6 7 8
4 11 43 18 3 0
5 12 164 123 19 0
6 2 51 136 30 4
7 2 29 106 61 9
8 4 7 60 59 27
Overall Statistics
Accuracy : 0.4071
95% CI : (0.3762, 0.4387)
No Information Rate : 0.452
P-Value [Acc > NIR] : 0.9979
Kappa : 0.2135
Mcnemar's Test P-Value : <2e-16
Statistics by Class:
Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity 0.35484 0.5578 0.3070 0.35465 0.67500
Specificity 0.93256 0.7755 0.8380 0.81931 0.86170
Pos Pred Value 0.14667 0.5157 0.6099 0.29469 0.17197
Neg Pred Value 0.97790 0.8036 0.5945 0.85640 0.98420
Prevalence 0.03163 0.3000 0.4520 0.17551 0.04082
Detection Rate 0.01122 0.1673 0.1388 0.06224 0.02755
Detection Prevalence 0.07653 0.3245 0.2276 0.21122 0.16020
Balanced Accuracy 0.64370 0.6667 0.5725 0.58698 0.76835
Recording test set performance:
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('nn_mlp',
    round(nn_mlpCM$overall[1], 4),
    round(mean(nn_mlpCM$byClass[, "Balanced Accuracy"], na.rm = TRUE), 4),
    round(max(prop.table(table(testNNet$quality))), 4),
    round(mean(nn_mlpCM$byClass[, "F1"], na.rm = TRUE), 4),
    round(mean(nn_mlpCM$byClass[, "Sensitivity"], na.rm = TRUE), 4),
    round(mean(nn_mlpCM$byClass[, "Specificity"], na.rm = TRUE), 4),
    round(mean(nn_mlpCM$byClass[, "Pos Pred Value"], na.rm = TRUE), 4))
Conclusion
Below is the training set performance
performance_metrics_training
Model Accuracy Mean_Balanced_Accuracy No_Information_Rate Mean_F1
1 xgbTree 0.6399 0.7215 0.4479 0.5537
2 svm_lin 0.3482 0.6284 0.4479 0.3046
3 svm_rad 0.5867 0.6960 0.4479 0.4998
4 nn_mlp 0.3667 0.6370 0.4479 0.3153
Sensitivity Specificity Precision
1 0.5517 0.8914 0.5592
2 0.4194 0.8375 0.3272
3 0.5128 0.8793 0.4925
4 0.4340 0.8400 0.3356
And the test set performance
performance_metrics_test
Model Accuracy Mean_Balanced_Accuracy No_Information_Rate Mean_F1
1 xgbTree 0.6663 0.7146 0.452 0.5444
2 svm_lin 0.3592 0.6473 0.452 0.3179
3 svm_rad 0.6071 0.6921 0.452 0.4937
4 nn_mlp 0.4071 0.6476 0.452 0.3496
Sensitivity Specificity Precision
1 0.5311 0.898 0.5651
2 0.4562 0.8383 0.332
3 0.4997 0.8845 0.4893
4 0.4499 0.8454 0.3478
The xgbTree model has the best performance (accuracy higher than the no information rate, highest mean balanced accuracy) on both the training and test sets, though svm_rad was close. Surprisingly, the neural net had the worst performance.
While xgbTree had decent balanced accuracy, it still had fairly low average precision and average sensitivity across classes. Looking at the class-by-class performance, the model did reasonably well on classes 5, 6, and 7 (in terms of sensitivity and precision) but struggled on classes 4 and 8. This can present a business issue: while we can predict the middle classes (5, 6, 7) with some confidence (and use this to price our wines), we struggle to predict the outlier classes (4 and 8), so we might overcharge a customer for an inferior wine (the worst outcome) or underprice a high-quality wine (tolerable, as the customer is pleasantly surprised). Lastly, xgbTree's performance on the test set was similar to its performance on the training set, which suggests a decent balance between bias and variance (or low variance but possibly high bias, as performance is decent but could be better).
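The per-class figures behind that observation can be pulled directly from the confusion matrix object; the column selection below is just an illustrative slice of caret's byClass output.

# Per-class sensitivity and precision for the xgbTree test-set predictions.
xgbTreeCM$byClass[, c("Sensitivity", "Precision")]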
Further improvements could involve using outlier detection algorithms to identify the highest and lowest quality wines, expanded hyper-parameter tuning for the xgbTree model (see the sketch below), a more complex neural network, and obtaining more data on the outlier wines.
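As an example of what expanded hyper-parameter tuning could look like, the grid below also varies the parameters that were held constant above. The specific values are illustrative rather than recommendations, and a grid this size would be considerably more expensive to evaluate with repeated cross-validation.

# Sketch of an expanded xgbTree tuning grid (values are illustrative only);
# it would be passed to caret::train via tuneGrid as before.
tune_grid_xgb_expanded <- expand.grid(
  nrounds = c(100, 200, 300),
  max_depth = c(3, 6, 9),
  eta = c(0.05, 0.1, 0.3),
  gamma = c(0, 0.01, 0.1),
  colsample_bytree = c(0.6, 0.8, 1),
  min_child_weight = c(1, 5),
  subsample = c(0.75, 1)
)
nrow(tune_grid_xgb_expanded)   # number of candidate combinations to evaluate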