Assignment 4

Author

Semyon Toybis

Introduction

For this assignment, I will work with the white wine data set from the Wine Quality data set from the UC Irvine Machine Learning Repository. The goal for this data set is to predict wine quality based on a variety of physicochemical tests. From a business perspective, we can assume that we are the owner of the winery. If we are able to accurately predict wine quality based on the input features, we may have a better understanding of how to tailor our production methods to create a wine of a desired quality. Additionally, we can use this information in pricing our wines which may come in handy for estimating future revenues.

Exploratory Data Analysis

library(tidyverse)
library(GGally)
library(MLmetrics)
library(caret)

First, I import the data set:

wine_data <- read_delim('winequality-white.csv',
                         delim = ';',
                         col_names = T)
wine_data <- as.data.frame(wine_data)

Next, I examine the structure of the data frame.

str(wine_data)
'data.frame':   4898 obs. of  12 variables:
 $ fixed acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
 $ volatile acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
 $ citric acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
 $ residual sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
 $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
 $ free sulfur dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
 $ total sulfur dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
 $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
 $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
 $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
 $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
 $ quality             : num  6 6 6 6 6 6 6 6 6 6 ...

The target variable is quality, which ranges from 0 to 10, based on the description of the data set. Below, I check the values that are present:

unique(wine_data$quality)
[1] 6 5 7 8 4 3 9

As seen above, the values 0, 1, 2, and 10 are not present in the data set, which suggests wines of these qualities are rare. This is an issue: quality can in principle range from 0 to 10, but the data set does not cover the full range of possible values.

table(wine_data$quality)

   3    4    5    6    7    8    9 
  20  163 1457 2198  880  175    5 
prop.table(table(wine_data$quality))

          3           4           5           6           7           8 
0.004083299 0.033278889 0.297468354 0.448754594 0.179665169 0.035728869 
          9 
0.001020825 
barplot(prop.table(table(wine_data$quality)), 
        main = 'Class proportions for target variable',
        xlab = 'Quality',
        ylab = 'Proportion of observations')

I do not know enough about the data to synthetically create observations of quality 0, 1, 2, or 10. Additionally, the extremely low frequency of quality 3 and 9 will present a problem for cross-validation, as these classes may not be included in a given fold or the model may never predict them, which produces NAs for metrics like F1, specificity, and accuracy. Thus, I will combine class 3 into class 4 and class 9 into class 8.

wine_data <- wine_data |> mutate(quality=replace(quality, quality == 3, 4)) |>
  mutate(quality=replace(quality, quality == 9, 8))

Next, I convert the target to a factor under the simplifying assumption that the only possible values are 4 to 8. Because the omitted qualities barely appear (or do not appear at all) in the data set, the business risk from misclassifying an unseen observation of one of those qualities is low. For example, even if we classified a quality-10 wine as a different quality, we are likely not making many bottles of that quality, so the revenue lost from underpricing it would be small. That said, a conversation with management may lead to a different modelling approach.

Below, I convert quality to a factor and treat this as a multiclass classification problem.

wine_data$quality <- factor(wine_data$quality,levels = seq(4,8))

Next, I check whether there are any missing values in the data:

colSums(is.na(wine_data))
       fixed acidity     volatile acidity          citric acid 
                   0                    0                    0 
      residual sugar            chlorides  free sulfur dioxide 
                   0                    0                    0 
total sulfur dioxide              density                   pH 
                   0                    0                    0 
           sulphates              alcohol              quality 
                   0                    0                    0 

Next, I will remove spaces from the feature names:

colnames(wine_data) <- str_replace_all(colnames(wine_data),pattern = " ", replacement = '_')

Next, I visualize the correlations between the features:

corrplot::corrplot(cor(wine_data[,-12]), method = 'number', type = 'lower')

Correlations are low for the most part, though the below pairs have fairly high correlation:

  • density and residual sugar

  • density and alcohol

Feature selection may be necessary for certain algorithms to avoid multicollinearity.
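As a quick check, caret's findCorrelation() can flag which feature would be dropped at a conventional cutoff; this is a sketch only (the 0.75 cutoff is an assumption), and the same approach is applied later before fitting the neural network.

# Sketch: flag the feature(s) findCorrelation() would drop at a 0.75 cutoff.
# Column 12 (quality) is excluded because it is the target.
high_corr_idx <- findCorrelation(cor(wine_data[, -12]), cutoff = 0.75)
colnames(wine_data)[high_corr_idx]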

Next, I visualize the distribution of the features:

wine_data |> select(-quality) |> pivot_longer(cols = everything(),
                                              names_to = 'variable',
                                              values_to = 'value') |>
  ggplot(aes(x=value)) + geom_histogram() + facet_wrap(vars(variable),
                                                       scales = 'free')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

psych::describe(wine_data[,-12])
                     vars    n   mean    sd median trimmed   mad  min    max
fixed_acidity           1 4898   6.85  0.84   6.80    6.82  0.74 3.80  14.20
volatile_acidity        2 4898   0.28  0.10   0.26    0.27  0.09 0.08   1.10
citric_acid             3 4898   0.33  0.12   0.32    0.33  0.09 0.00   1.66
residual_sugar          4 4898   6.39  5.07   5.20    5.80  5.34 0.60  65.80
chlorides               5 4898   0.05  0.02   0.04    0.04  0.01 0.01   0.35
free_sulfur_dioxide     6 4898  35.31 17.01  34.00   34.36 16.31 2.00 289.00
total_sulfur_dioxide    7 4898 138.36 42.50 134.00  136.96 43.00 9.00 440.00
density                 8 4898   0.99  0.00   0.99    0.99  0.00 0.99   1.04
pH                      9 4898   3.19  0.15   3.18    3.18  0.15 2.72   3.82
sulphates              10 4898   0.49  0.11   0.47    0.48  0.10 0.22   1.08
alcohol                11 4898  10.51  1.23  10.40   10.43  1.48 8.00  14.20
                      range skew kurtosis   se
fixed_acidity         10.40 0.65     2.17 0.01
volatile_acidity       1.02 1.58     5.08 0.00
citric_acid            1.66 1.28     6.16 0.00
residual_sugar        65.20 1.08     3.46 0.07
chlorides              0.34 5.02    37.51 0.00
free_sulfur_dioxide  287.00 1.41    11.45 0.24
total_sulfur_dioxide 431.00 0.39     0.57 0.61
density                0.05 0.98     9.78 0.00
pH                     1.10 0.46     0.53 0.00
sulphates              0.86 0.98     1.59 0.00
alcohol                6.20 0.49    -0.70 0.02

Some of the features have positive skew and excess kurtosis, though the mean and median are fairly close for most of them. For interpretability, I will train the models without transforming the features (beyond centering and scaling); if the models perform poorly, it could be worthwhile to try a log transformation of the skewed features before training.
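If a transformation does become necessary, a minimal sketch (not run here) of a log transform on the most right-skewed features could look like the following; the column choice is illustrative, based on the skew statistics above, and log1p is used because citric_acid contains zeros. Adding 'YeoJohnson' to caret's preProcess methods would be an alternative.

# Sketch only: log1p-transform the most right-skewed features before training.
# The column selection is illustrative, taken from the skew values above.
skewed_cols <- c('chlorides', 'volatile_acidity', 'free_sulfur_dioxide',
                 'citric_acid', 'residual_sugar')
wine_data_log <- wine_data |>
  mutate(across(all_of(skewed_cols), log1p))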

Next, I check the distribution of the target variable

prop.table(table(wine_data$quality))

         4          5          6          7          8 
0.03736219 0.29746835 0.44875459 0.17966517 0.03674969 
barplot(prop.table(table(wine_data$quality)), 
        main = 'Class proportions for target variable',
        xlab = 'Quality',
        ylab = 'Proportion of observations')

The target variable is imbalanced, with low instances for quality 4 and 8.

Evaluation and Algorithm Selection

The task at hand is a multi-class classification problem with an imbalanced data set.

I will train and test four different models to compare performance: gradient-boosted trees with xgboost, support vector machines with a linear kernel and with a radial kernel, and a multilayer perceptron neural network.

Additionally, because the data set is imbalanced, I will use the SMOTE algorithm during cross-validation to create synthetically balanced training sets within each fold. All of the features are continuous, so there is no need for dummy variables.
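To illustrate what the SMOTE step does, the sketch below (assuming themis::smote() with its data-frame interface and default settings) balances the full data set and compares class counts. In the actual workflow, caret applies SMOTE inside each cross-validation fold only, so this standalone call is illustrative rather than part of the modelling pipeline.

library(themis)

# Illustration only: compare class counts before and after SMOTE.
# caret performs the equivalent step within each resampling fold.
table(wine_data$quality)
smote_demo <- smote(wine_data, var = 'quality')
table(smote_demo$quality)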

The data set is imbalanced, so looking at accuracy alone would be misleading. Thus, I will use mean balanced accuracy as the evaluation metric for training: for each class, balanced accuracy is the average of sensitivity (the true positive rate) and specificity (the true negative rate), and the reported figure is the mean across classes. Additionally, comparing the models' performance on the test set using balanced accuracy, mean F1, sensitivity, specificity, and precision may provide further insight, especially when examined class by class.
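For clarity, the sketch below shows how these summary metrics can be recovered from a caret confusionMatrix() object; cm is a placeholder for any of the confusion matrices computed later in this report.

# Sketch: derive mean balanced accuracy and mean F1 from a confusionMatrix() object.
mean_balanced_accuracy <- function(cm) {
  per_class <- (cm$byClass[, 'Sensitivity'] + cm$byClass[, 'Specificity']) / 2
  mean(per_class, na.rm = TRUE)  # same as mean(cm$byClass[, 'Balanced Accuracy'])
}
mean_f1 <- function(cm) mean(cm$byClass[, 'F1'], na.rm = TRUE)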

Data Preparation

Train - Test split

First, I create train-test split, using an 80%/20% ratio.

set.seed(123)

sample_set <- sample(nrow(wine_data), round(nrow(wine_data)*0.8), replace = FALSE)

train_set <- wine_data[sample_set,]
test_set <- wine_data[-sample_set,]

Below are the proportions of the target variable in the train and test set

Train:

round(prop.table(table(select(train_set, quality), exclude = NULL)), 4) * 100
quality
    4     5     6     7     8 
 3.88 29.68 44.79 18.07  3.57 

Test:

round(prop.table(table(select(test_set, quality), exclude = NULL)), 4) * 100
quality
    4     5     6     7     8 
 3.16 30.00 45.20 17.55  4.08 

Performance Tracking Dataframes

Below I create data frames to track performance:

performance_metrics_training <- data.frame(Model= character(),
                                     Accuracy = numeric(),
                                     Mean_Balanced_Accuracy = numeric(),
                                     No_Information_Rate = numeric(),
                                     Mean_F1 = numeric(),
                                     Sensitivity = numeric(),
                                     Specificity = numeric(),
                                     Precision = numeric())
performance_metrics_test <- data.frame(Model= character(),
                                     Accuracy = numeric(),
                                     Mean_Balanced_Accuracy = numeric(),
                                     No_Information_Rate = numeric(),
                                     Mean_F1 = numeric(),
                                     Sensitivity = numeric(),
                                     Specificity = numeric(),
                                     Precision = numeric())

xgbTree

First, I train a gradient-boosted tree model with xgboost.

library(themis)
Loading required package: recipes

Attaching package: 'recipes'
The following object is masked from 'package:stringr':

    fixed
The following object is masked from 'package:stats':

    step

I will use parallel computing to improve run time.

library(foreach)
library(doParallel)
modelLookup('xgbTree')
    model        parameter                          label forReg forClass
1 xgbTree          nrounds          # Boosting Iterations   TRUE     TRUE
2 xgbTree        max_depth                 Max Tree Depth   TRUE     TRUE
3 xgbTree              eta                      Shrinkage   TRUE     TRUE
4 xgbTree            gamma         Minimum Loss Reduction   TRUE     TRUE
5 xgbTree colsample_bytree     Subsample Ratio of Columns   TRUE     TRUE
6 xgbTree min_child_weight Minimum Sum of Instance Weight   TRUE     TRUE
7 xgbTree        subsample           Subsample Percentage   TRUE     TRUE
  probModel
1      TRUE
2      TRUE
3      TRUE
4      TRUE
5      TRUE
6      TRUE
7      TRUE

There are seven tunable parameters for the xgbTree model. Tuning all of them would take a significant amount of time and computing power, so I will tune only two: nrounds, the number of boosting iterations, and max_depth, the maximum tree depth. I will use default values for the other parameters in order to manage run time.

tune_grid_xgb <- expand.grid(
  nrounds = c(100, 200, 300),       
  max_depth = c(3, 6, 9, 12),         
  eta = 0.3,            
  gamma = 0.01,              
  colsample_bytree = 1,
  min_child_weight = 1,   
  subsample = 1        
)

Below, I set the train control parameters. I will use 5-fold cross-validation repeated 5 times, and I will apply the SMOTE technique to create artificially balanced training sets within each fold.

ctrl <- trainControl(
  method = 'repeatedcv',
  number = 5,
  repeats = 5,
  summaryFunction = multiClassSummary,
  sampling = 'smote',
  classProbs = F
)
cluster1 <- makeCluster(detectCores()-2)
registerDoParallel(cluster1)

set.seed(456)

xgbTree <- train(
  quality ~ .,
  data = train_set,
  preProcess = c('center','scale'),
  metric = 'Mean_Balanced_Accuracy',
  method = 'xgbTree',
  trControl = ctrl,
  tuneGrid = tune_grid_xgb
)

stopCluster(cluster1)
xgbTree
eXtreme Gradient Boosting 

3918 samples
  11 predictor
   5 classes: '4', '5', '6', '7', '8' 

Pre-processing: centered (11), scaled (11) 
Resampling: Cross-Validated (5 fold, repeated 5 times) 
Summary of sample sizes: 3135, 3134, 3133, 3136, 3134, 3134, ... 
Addtional sampling using SMOTE prior to pre-processing

Resampling results across tuning parameters:

  max_depth  nrounds  Accuracy   Kappa      Mean_F1    Mean_Sensitivity
   3         100      0.5460434  0.3534250  0.4702838  0.5158949       
   3         200      0.5882083  0.3979457  0.5139890  0.5311809       
   3         300      0.6035220  0.4138226  0.5297622  0.5343977       
   6         100      0.6295575  0.4499886  0.5477090  0.5461920       
   6         200      0.6343524  0.4541545  0.5514231  0.5444607       
   6         300      0.6340459  0.4540595  0.5512820  0.5451630       
   9         100      0.6398655  0.4638203  0.5537106  0.5517001       
   9         200      0.6374146  0.4604411  0.5517901  0.5504608       
   9         300      0.6366482  0.4593506  0.5498213  0.5487264       
  12         100      0.6365443  0.4610502  0.5477002  0.5503784       
  12         200      0.6366480  0.4604401  0.5486310  0.5502158       
  12         300      0.6354231  0.4588203  0.5472706  0.5492307       
  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value  Mean_Precision
  0.8713705         0.4496733            0.8690382            0.4496733     
  0.8789158         0.5025965            0.8779797            0.5025965     
  0.8813866         0.5284779            0.8813905            0.5284779     
  0.8887219         0.5523054            0.8891568            0.5523054     
  0.8892217         0.5622326            0.8902830            0.5622326     
  0.8892055         0.5609152            0.8901894            0.5609152     
  0.8913866         0.5592199            0.8922196            0.5592199     
  0.8906791         0.5561252            0.8914296            0.5561252     
  0.8904995         0.5540436            0.8912409            0.5540436     
  0.8911785         0.5483484            0.8915643            0.5483484     
  0.8909383         0.5505806            0.8915086            0.5505806     
  0.8906278         0.5487203            0.8911511            0.5487203     
  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
  0.5158949    0.1092087            0.6936327             
  0.5311809    0.1176417            0.7050483             
  0.5343977    0.1207044            0.7078921             
  0.5461920    0.1259115            0.7174570             
  0.5444607    0.1268705            0.7168412             
  0.5451630    0.1268092            0.7171843             
  0.5517001    0.1279731            0.7215433             
  0.5504608    0.1274829            0.7205699             
  0.5487264    0.1273296            0.7196129             
  0.5503784    0.1273089            0.7207784             
  0.5502158    0.1273296            0.7205771             
  0.5492307    0.1270846            0.7199293             

Tuning parameter 'eta' was held constant at a value of 0.3

Tuning parameter 'gamma' was held constant at a value of 0.01

Tuning parameter 'colsample_bytree' was held constant at a value of 1

Tuning parameter 'min_child_weight' was held constant at a value of 1

Tuning parameter 'subsample' was held constant at a value of 1
Mean_Balanced_Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were nrounds = 100, max_depth = 9, eta
 = 0.3, gamma = 0.01, colsample_bytree = 1, min_child_weight = 1 and
 subsample = 1.

Recording training set performance:

performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('xgbTree',                                                                round(getTrainPerf(xgbTree)[1],4),
                                                                round(getTrainPerf(xgbTree)[11],4),
                                                                round(max(prop.table(table(train_set$quality))),4),
                                                               round(getTrainPerf(xgbTree)[3],4),
                                                               round(getTrainPerf(xgbTree)[4],4),
                                                               round(getTrainPerf(xgbTree)[5],4),
                                                               round(getTrainPerf(xgbTree)[8],4))

Next, I try predicting the test set:

xgbTree_predictions <- predict(xgbTree, test_set, type = 'raw')

Below are the performance statistics:

xgbTreeCM <- confusionMatrix(xgbTree_predictions, test_set$quality)
xgbTreeCM
Confusion Matrix and Statistics

          Reference
Prediction   4   5   6   7   8
         4   9  19   6   0   0
         5  10 194  58   7   0
         6  11  74 337  58  10
         7   1   5  40  98  15
         8   0   2   2   9  15

Overall Statistics
                                          
               Accuracy : 0.6663          
                 95% CI : (0.6358, 0.6958)
    No Information Rate : 0.452           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4951          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity          0.290323   0.6599   0.7607   0.5698  0.37500
Specificity          0.973656   0.8907   0.7151   0.9245  0.98617
Pos Pred Value       0.264706   0.7212   0.6878   0.6164  0.53571
Neg Pred Value       0.976744   0.8594   0.7837   0.9099  0.97374
Prevalence           0.031633   0.3000   0.4520   0.1755  0.04082
Detection Rate       0.009184   0.1980   0.3439   0.1000  0.01531
Detection Prevalence 0.034694   0.2745   0.5000   0.1622  0.02857
Balanced Accuracy    0.631990   0.7753   0.7379   0.7471  0.68059

Recording test set performance:

performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('xgbTree',                                                                round(xgbTreeCM$overall[1],4),
                                                                round(mean(xgbTreeCM$byClass[ , "Balanced Accuracy"], na.rm = TRUE),4),
                                                                round(max(prop.table(table(test_set$quality))),4),
                                                               round(mean(xgbTreeCM$byClass[ , "F1"], na.rm = TRUE),4),
                                                               round(mean(xgbTreeCM$byClass[ , "Sensitivity"], na.rm = TRUE),4),
                                                               round(mean(xgbTreeCM$byClass[ , "Specificity"], na.rm = TRUE),4),
                                                               round(mean(xgbTreeCM$byClass[ , "Pos Pred Value"], na.rm = TRUE),4))

SVM

Linear

Next, I will train SVM models, first with a linear kernel and then with a radial kernel.

cluster2 <- makeCluster(detectCores()-2)
registerDoParallel(cluster2)
set.seed(789)

svm_lin <- train(
  quality ~ .,
  data = train_set,
  metric = 'Mean_Balanced_Accuracy',
  method = 'svmLinear',
  preProcess = c('center','scale'),
  trControl = ctrl,
  tuneGrid = expand.grid(.C=c(0.1, 1, 10, 100))
)

stopCluster(cluster2)
svm_lin
Support Vector Machines with Linear Kernel 

3918 samples
  11 predictor
   5 classes: '4', '5', '6', '7', '8' 

Pre-processing: centered (11), scaled (11) 
Resampling: Cross-Validated (5 fold, repeated 5 times) 
Summary of sample sizes: 3135, 3133, 3134, 3135, 3135, 3135, ... 
Addtional sampling using SMOTE prior to pre-processing

Resampling results across tuning parameters:

  C      Accuracy   Kappa      Mean_F1    Mean_Sensitivity  Mean_Specificity
    0.1  0.3476856  0.1720657  0.3037014  0.4170020         0.8370933       
    1.0  0.3482446  0.1724033  0.3040978  0.4159401         0.8373207       
   10.0  0.3466602  0.1712505  0.3029620  0.4156929         0.8370180       
  100.0  0.3481936  0.1732629  0.3045583  0.4193555         0.8374660       
  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value  Mean_Precision  Mean_Recall
  0.3256822            0.8360688            0.3256822       0.4170020  
  0.3267691            0.8360737            0.3267691       0.4159401  
  0.3256305            0.8359359            0.3256305       0.4156929  
  0.3272256            0.8363491            0.3272256       0.4193555  
  Mean_Detection_Rate  Mean_Balanced_Accuracy
  0.06953712           0.6270477             
  0.06964892           0.6266304             
  0.06933205           0.6263555             
  0.06963872           0.6284108             

Mean_Balanced_Accuracy was used to select the optimal model using the
 largest value.
The final value used for the model was C = 100.

Recording training set performance:

performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('svm_lin',                                                                round(getTrainPerf(svm_lin)[1],4),
                                                                round(getTrainPerf(svm_lin)[11],4),
                                                                round(max(prop.table(table(train_set$quality))),4),
                                                               round(getTrainPerf(svm_lin)[3],4),
                                                               round(getTrainPerf(svm_lin)[4],4),
                                                               round(getTrainPerf(svm_lin)[5],4),
                                                               round(getTrainPerf(svm_lin)[8],4))

Next, I predict the test set:

svmLin_predictions <- predict(svm_lin, test_set, type = 'raw')

Below are the performance metrics:

svm_linCM <- confusionMatrix(svmLin_predictions, test_set$quality)
svm_linCM
Confusion Matrix and Statistics

          Reference
Prediction   4   5   6   7   8
         4  17  75  42   9   0
         5   8 149 134  20   3
         6   2  38 105  29   2
         7   1  17  90  54   8
         8   3  15  72  60  27

Overall Statistics
                                          
               Accuracy : 0.3592          
                 95% CI : (0.3291, 0.3901)
    No Information Rate : 0.452           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1787          
                                          
 Mcnemar's Test P-Value : <2e-16          

Statistics by Class:

                     Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity           0.54839   0.5068   0.2370   0.3140  0.67500
Specificity           0.86723   0.7595   0.8678   0.8564  0.84043
Pos Pred Value        0.11888   0.4745   0.5966   0.3176  0.15254
Neg Pred Value        0.98327   0.7823   0.5796   0.8543  0.98381
Prevalence            0.03163   0.3000   0.4520   0.1755  0.04082
Detection Rate        0.01735   0.1520   0.1071   0.0551  0.02755
Detection Prevalence  0.14592   0.3204   0.1796   0.1735  0.18061
Balanced Accuracy     0.70781   0.6331   0.5524   0.5852  0.75771

Recording test set performance:

performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('svm_lin',                                                                round(svm_linCM$overall[1],4),
                                                                round(mean(svm_linCM$byClass[ , "Balanced Accuracy"], na.rm = TRUE),4),
                                                                round(max(prop.table(table(test_set$quality))),4),
                                                               round(mean(svm_linCM$byClass[ , "F1"], na.rm = TRUE),4),
                                                               round(mean(svm_linCM$byClass[ , "Sensitivity"], na.rm = TRUE),4),
                                                               round(mean(svm_linCM$byClass[ , "Specificity"], na.rm = TRUE),4),
                                                               round(mean(svm_linCM$byClass[ , "Pos Pred Value"], na.rm = TRUE),4))

Radial

Next, I fit the SVM model with a radial kernel.

cluster3 <- makeCluster(detectCores()-2)
registerDoParallel(cluster3)
set.seed(1011)

svm_rad <- train(
  quality ~ .,
  data = train_set,
  metric = 'Mean_Balanced_Accuracy',
  method = 'svmRadial',
  preProcess = c('center','scale'),
  trControl = ctrl,
  tuneGrid = expand.grid(.sigma= c(0.08, 0.10, 0.12, 0.14, 0.16),
                         .C=c(0.1, 1, 10, 100))
)

stopCluster(cluster3)
svm_rad
Support Vector Machines with Radial Basis Function Kernel 

3918 samples
  11 predictor
   5 classes: '4', '5', '6', '7', '8' 

Pre-processing: centered (11), scaled (11) 
Resampling: Cross-Validated (5 fold, repeated 5 times) 
Summary of sample sizes: 3135, 3133, 3134, 3136, 3134, 3134, ... 
Addtional sampling using SMOTE prior to pre-processing

Resampling results across tuning parameters:

  sigma  C      Accuracy   Kappa      Mean_F1    Mean_Sensitivity
  0.08     0.1  0.4114344  0.2234105  0.3537104  0.4517640       
  0.08     1.0  0.4595784  0.2730224  0.3936105  0.4749593       
  0.08    10.0  0.5023543  0.3126533  0.4264011  0.4851240       
  0.08   100.0  0.5533030  0.3627748  0.4665501  0.4971781       
  0.10     0.1  0.4177669  0.2288141  0.3589034  0.4543680       
  0.10     1.0  0.4669802  0.2795204  0.3977700  0.4731818       
  0.10    10.0  0.5210385  0.3320918  0.4429570  0.4949940       
  0.10   100.0  0.5676941  0.3784069  0.4808702  0.5060747       
  0.12     0.1  0.4259870  0.2366942  0.3647275  0.4565722       
  0.12     1.0  0.4763761  0.2879692  0.4045453  0.4751925       
  0.12    10.0  0.5365095  0.3488076  0.4554304  0.5011411       
  0.12   100.0  0.5774935  0.3887005  0.4910379  0.5096058       
  0.14     0.1  0.4322640  0.2421452  0.3698843  0.4588175       
  0.14     1.0  0.4862743  0.2977017  0.4125949  0.4799947       
  0.14    10.0  0.5460517  0.3588230  0.4605076  0.5003664       
  0.14   100.0  0.5805028  0.3908717  0.4938216  0.5095642       
  0.16     0.1  0.4392085  0.2485639  0.3740577  0.4593188       
  0.16     1.0  0.4951546  0.3071976  0.4195712  0.4832846       
  0.16    10.0  0.5557510  0.3672635  0.4668386  0.4991272       
  0.16   100.0  0.5866818  0.3973105  0.4997673  0.5128032       
  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value  Mean_Precision
  0.8474994         0.3549575            0.8453332            0.3549575     
  0.8573258         0.3849821            0.8544649            0.3849821     
  0.8644394         0.4101334            0.8621169            0.4101334     
  0.8733854         0.4504084            0.8716356            0.4504084     
  0.8484050         0.3576076            0.8461988            0.3576076     
  0.8586029         0.3882527            0.8557243            0.3882527     
  0.8680845         0.4245917            0.8657850            0.4245917     
  0.8762287         0.4663586            0.8746591            0.4663586     
  0.8500557         0.3620661            0.8476815            0.3620661     
  0.8601544         0.3929025            0.8573072            0.3929025     
  0.8711660         0.4364696            0.8689867            0.4364696     
  0.8779635         0.4796898            0.8765989            0.4796898     
  0.8510056         0.3655828            0.8485395            0.3655828     
  0.8619481         0.3990621            0.8591832            0.3990621     
  0.8731047         0.4425554            0.8708704            0.4425554     
  0.8782505         0.4842151            0.8770822            0.4842151     
  0.8523036         0.3686943            0.8497458            0.3686943     
  0.8637963         0.4049953            0.8611072            0.4049953     
  0.8745299         0.4500078            0.8725317            0.4500078     
  0.8792966         0.4925219            0.8783407            0.4925219     
  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
  0.4517640    0.08228688           0.6496317             
  0.4749593    0.09191567           0.6661425             
  0.4851240    0.10047086           0.6747817             
  0.4971781    0.11066060           0.6852817             
  0.4543680    0.08355339           0.6513865             
  0.4731818    0.09339605           0.6658924             
  0.4949940    0.10420770           0.6815393             
  0.5060747    0.11353882           0.6911517             
  0.4565722    0.08519739           0.6533139             
  0.4751925    0.09527522           0.6676735             
  0.5011411    0.10730189           0.6861535             
  0.5096058    0.11549870           0.6937847             
  0.4588175    0.08645281           0.6549116             
  0.4799947    0.09725485           0.6709714             
  0.5003664    0.10921035           0.6867355             
  0.5095642    0.11610056           0.6939073             
  0.4593188    0.08784170           0.6558112             
  0.4832846    0.09903092           0.6735404             
  0.4991272    0.11115019           0.6868285             
  0.5128032    0.11733636           0.6960499             

Mean_Balanced_Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were sigma = 0.16 and C = 100.

Recording training set performance:

performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('svm_rad',                                                                round(getTrainPerf(svm_rad)[1],4),
                                                                round(getTrainPerf(svm_rad)[11],4),
                                                                round(max(prop.table(table(train_set$quality))),4),
                                                               round(getTrainPerf(svm_rad)[3],4),
                                                               round(getTrainPerf(svm_rad)[4],4),
                                                               round(getTrainPerf(svm_rad)[5],4),
                                                               round(getTrainPerf(svm_rad)[8],4))

Next, I predict the test set:

svmRad_predictions <- predict(svm_rad, test_set, type = 'raw')

Below are the performance metrics:

svm_radCM <- confusionMatrix(svmRad_predictions, test_set$quality)
svm_radCM
Confusion Matrix and Statistics

          Reference
Prediction   4   5   6   7   8
         4   9  15  14   0   0
         5  12 197  89  11   3
         6   8  68 279  52   9
         7   2  13  50  96  14
         8   0   1  11  13  14

Overall Statistics
                                          
               Accuracy : 0.6071          
                 95% CI : (0.5758, 0.6379)
    No Information Rate : 0.452           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4209          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity          0.290323   0.6701   0.6298  0.55814  0.35000
Specificity          0.969442   0.8324   0.7449  0.90223  0.97340
Pos Pred Value       0.236842   0.6314   0.6707  0.54857  0.35897
Neg Pred Value       0.976645   0.8548   0.7092  0.90559  0.97237
Prevalence           0.031633   0.3000   0.4520  0.17551  0.04082
Detection Rate       0.009184   0.2010   0.2847  0.09796  0.01429
Detection Prevalence 0.038776   0.3184   0.4245  0.17857  0.03980
Balanced Accuracy    0.629882   0.7512   0.6873  0.73018  0.66170

Recording test set performance:

performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('svm_rad',                                                                round(svm_radCM$overall[1],4),
                                                                round(mean(svm_radCM$byClass[ , "Balanced Accuracy"], na.rm = TRUE),4),
                                                                round(max(prop.table(table(test_set$quality))),4),
                                                               round(mean(svm_radCM$byClass[ , "F1"], na.rm = TRUE),4),
                                                               round(mean(svm_radCM$byClass[ , "Sensitivity"], na.rm = TRUE),4),
                                                               round(mean(svm_radCM$byClass[ , "Specificity"], na.rm = TRUE),4),
                                                               round(mean(svm_radCM$byClass[ , "Pos Pred Value"], na.rm = TRUE),4))

Neural Network

Next, I fit a multilayer perceptron neural network.

For the neural network, I check which features are highly correlated and remove them to avoid multicollinearity issues:

tooHigh <- findCorrelation(cor(train_set[,-12]), cutoff = 0.75)
trainNNet <- train_set[,-tooHigh]
testNNet <- test_set[,-tooHigh]
cluster4 <- makeCluster(detectCores()-2)
registerDoParallel(cluster4)
set.seed(5000)

nn_mlp <- train(
  quality ~ .,
  data = trainNNet,
  metric = 'Mean_Balanced_Accuracy',
  method = 'mlpWeightDecay',
  preProcess = c('center','scale'),
  trControl = ctrl,
  tuneGrid = expand.grid(.size= c(1,2,3,4),
                         .decay=c(0.00, 0.01, 0.1))
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
stopCluster(cluster4)
nn_mlp
Multi-Layer Perceptron 

3918 samples
  10 predictor
   5 classes: '4', '5', '6', '7', '8' 

Pre-processing: centered (10), scaled (10) 
Resampling: Cross-Validated (5 fold, repeated 5 times) 
Summary of sample sizes: 3135, 3135, 3133, 3135, 3134, 3135, ... 
Addtional sampling using SMOTE prior to pre-processing

Resampling results across tuning parameters:

  size  decay  Accuracy   Kappa       Mean_F1    Mean_Sensitivity
  1     0.00   0.1789790  0.05851172  0.1526592  0.3385801       
  1     0.01   0.2276945  0.00000000        NaN  0.2000000       
  1     0.10   0.1731827  0.00000000        NaN  0.2000000       
  2     0.00   0.3378315  0.15710392  0.2535107  0.4052123       
  2     0.01   0.1845955  0.00000000        NaN  0.2000000       
  2     0.10   0.1819732  0.00000000        NaN  0.2000000       
  3     0.00   0.3773336  0.18316556  0.3034585  0.4263705       
  3     0.01   0.1775491  0.00000000        NaN  0.2000000       
  3     0.10   0.2545405  0.00000000        NaN  0.2000000       
  4     0.00   0.3667079  0.18871705  0.3152960  0.4340435       
  4     0.01   0.1894467  0.00000000        NaN  0.2000000       
  4     0.10   0.1729334  0.00000000        NaN  0.2000000       
  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value  Mean_Precision
  0.8123659         0.2472842            0.8159094            0.2472842     
  0.8000000               NaN                  NaN                  NaN     
  0.8000000               NaN                  NaN                  NaN     
  0.8331571         0.3041440            0.8379841            0.3041440     
  0.8000000               NaN                  NaN                  NaN     
  0.8000000               NaN                  NaN                  NaN     
  0.8392445         0.3381684            0.8386515            0.3381684     
  0.8000000               NaN                  NaN                  NaN     
  0.8000000               NaN                  NaN                  NaN     
  0.8400116         0.3356239            0.8403945            0.3356239     
  0.8000000               NaN                  NaN                  NaN     
  0.8000000               NaN                  NaN                  NaN     
  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
  0.3385801    0.03579580           0.5754730             
  0.2000000    0.04553891           0.5000000             
  0.2000000    0.03463654           0.5000000             
  0.4052123    0.06756630           0.6191847             
  0.2000000    0.03691911           0.5000000             
  0.2000000    0.03639463           0.5000000             
  0.4263705    0.07546672           0.6328075             
  0.2000000    0.03550982           0.5000000             
  0.2000000    0.05090811           0.5000000             
  0.4340435    0.07334158           0.6370275             
  0.2000000    0.03788934           0.5000000             
  0.2000000    0.03458668           0.5000000             

Mean_Balanced_Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were size = 4 and decay = 0.

Recording training set performance:

performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('nn_mlp',                                                                round(getTrainPerf(nn_mlp)[1],4),
                                                                round(getTrainPerf(nn_mlp)[11],4),
                                                                round(max(prop.table(table(trainNNet$quality))),4),
                                                               round(getTrainPerf(nn_mlp)[3],4),
                                                               round(getTrainPerf(nn_mlp)[4],4),
                                                               round(getTrainPerf(nn_mlp)[5],4),
                                                               round(getTrainPerf(nn_mlp)[8],4))
library(NeuralNetTools)
plotnet(nn_mlp$finalModel,
        circle_col  = list(input = "skyblue", hidden = "orange", output = "tomato"),
        alpha       = 0.8,              
        cex_val     = 0.75)             

Next, I predict the test set:

nn_mlp_predictions <- predict(nn_mlp, testNNet, type = 'raw')

Below are the performance metrics:

nn_mlpCM <- confusionMatrix(nn_mlp_predictions, testNNet$quality)
nn_mlpCM
Confusion Matrix and Statistics

          Reference
Prediction   4   5   6   7   8
         4  11  43  18   3   0
         5  12 164 123  19   0
         6   2  51 136  30   4
         7   2  29 106  61   9
         8   4   7  60  59  27

Overall Statistics
                                          
               Accuracy : 0.4071          
                 95% CI : (0.3762, 0.4387)
    No Information Rate : 0.452           
    P-Value [Acc > NIR] : 0.9979          
                                          
                  Kappa : 0.2135          
                                          
 Mcnemar's Test P-Value : <2e-16          

Statistics by Class:

                     Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity           0.35484   0.5578   0.3070  0.35465  0.67500
Specificity           0.93256   0.7755   0.8380  0.81931  0.86170
Pos Pred Value        0.14667   0.5157   0.6099  0.29469  0.17197
Neg Pred Value        0.97790   0.8036   0.5945  0.85640  0.98420
Prevalence            0.03163   0.3000   0.4520  0.17551  0.04082
Detection Rate        0.01122   0.1673   0.1388  0.06224  0.02755
Detection Prevalence  0.07653   0.3245   0.2276  0.21122  0.16020
Balanced Accuracy     0.64370   0.6667   0.5725  0.58698  0.76835

Recording test set performance:

performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('nn_mlp',                                                                round(nn_mlpCM$overall[1],4),
                                                                round(mean(nn_mlpCM$byClass[ , "Balanced Accuracy"], na.rm = TRUE),4),
                                                                round(max(prop.table(table(testNNet$quality))),4),
                                                               round(mean(nn_mlpCM$byClass[ , "F1"], na.rm = TRUE),4),
                                                               round(mean(nn_mlpCM$byClass[ , "Sensitivity"], na.rm = TRUE),4),
                                                               round(mean(nn_mlpCM$byClass[ , "Specificity"], na.rm = TRUE),4),
                                                               round(mean(nn_mlpCM$byClass[ , "Pos Pred Value"], na.rm = TRUE),4))

Conclusion

Below is the training set performance:

performance_metrics_training
    Model Accuracy Mean_Balanced_Accuracy No_Information_Rate Mean_F1
1 xgbTree   0.6399                 0.7215              0.4479  0.5537
2 svm_lin   0.3482                 0.6284              0.4479  0.3046
3 svm_rad   0.5867                 0.6960              0.4479  0.4998
4  nn_mlp   0.3667                 0.6370              0.4479  0.3153
  Sensitivity Specificity Precision
1      0.5517      0.8914    0.5592
2      0.4194      0.8375    0.3272
3      0.5128      0.8793    0.4925
4      0.4340      0.8400    0.3356

And the test set performance:

performance_metrics_test
    Model Accuracy Mean_Balanced_Accuracy No_Information_Rate Mean_F1
1 xgbTree   0.6663                 0.7146               0.452  0.5444
2 svm_lin   0.3592                 0.6473               0.452  0.3179
3 svm_rad   0.6071                 0.6921               0.452  0.4937
4  nn_mlp   0.4071                 0.6476               0.452  0.3496
  Sensitivity Specificity Precision
1      0.5311       0.898    0.5651
2      0.4562      0.8383     0.332
3      0.4997      0.8845    0.4893
4      0.4499      0.8454    0.3478

The xgbTree model has the best performance on both the training and test sets (accuracy above the no-information rate and the highest mean balanced accuracy), though svm_rad was close. Surprisingly, the neural network performed poorly, only marginally better than the linear SVM.
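As an additional check (not run here), caret's resamples() can compare the cross-validated distributions of the four models directly. Strictly paired comparisons would require the models to share the same fold indices (for example, via a common seed or the index argument of trainControl), which is not the case here, so the following is only a sketch.

# Sketch: compare cross-validated Mean_Balanced_Accuracy across the four models.
model_resamples <- resamples(list(xgbTree = xgbTree,
                                  svm_lin = svm_lin,
                                  svm_rad = svm_rad,
                                  nn_mlp  = nn_mlp))
summary(model_resamples, metric = 'Mean_Balanced_Accuracy')
bwplot(model_resamples, metric = 'Mean_Balanced_Accuracy')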

While xgbTree had decent balanced accuracy, its average precision and average sensitivity across classes were still fairly low. Looking at the class-by-class performance, the model did reasonably well on classes 5, 6, and 7 (in both sensitivity and precision) but struggled on classes 4 and 8. This presents a business issue: we can predict the middle classes (5, 6, 7) with some confidence (and use this to price our wines), but we struggle to predict the outlier classes (4 and 8) and thus might overcharge a customer for an inferior wine (the worst outcome) or underprice a high-quality wine (tolerable, as the customer is pleasantly surprised). Lastly, xgbTree's performance on the test set was similar to its performance on the training set, which suggests a decent balance between bias and variance (or low variance but possibly high bias, since performance is decent but could be better).

Further improvements could include using outlier-detection methods to better identify the highest- and lowest-quality wines, expanded hyperparameter tuning for the xgbTree model (sketched below), a more complex neural network, and obtaining more data on the outlier wines.
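For example, a follow-up run might widen the xgbTree search space. The grid below is a sketch with illustrative values rather than recommendations, and it would increase training time considerably.

# Sketch: an expanded xgbTree tuning grid for a follow-up run.
tune_grid_xgb_expanded <- expand.grid(
  nrounds = c(100, 200, 300),
  max_depth = c(3, 6, 9),
  eta = c(0.05, 0.1, 0.3),
  gamma = 0.01,
  colsample_bytree = c(0.7, 1),
  min_child_weight = c(1, 5),
  subsample = c(0.7, 1)
)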