Introduction

This is a machine learning project focused on the Wine Quality Dataset from the UCI Machine Learning Repository. After spending a lot of time playing around with this dataset over the past few weeks, I decided to make a little project out of it and publish the results on RPubs.

I will train and tune 3 models — k-nearest neighbours, randomForest, and support vector machine.

The UCI webpage for this dataset links to an academic study of it. I will use the results published in that study as a benchmark to compare my results to.

More information on the dataset, including a link to the academic paper, can be found on the UCI webpage.

1.1 — Exploring the Data

First, I will load the required libraries and import the data into R directly from the UCI website. Red and white wine each have their own dataset and will be analyzed separately. A full analysis of the white wine data will be done first. Red wine will have its own analysis in section 2.

# these first 2 lines are for setting up parallel processing and can be omitted
library(doParallel)
registerDoParallel(cores = detectCores() - 1)

set.seed(10)
library(caret)
library(corrplot)
library(kknn)
library(randomForest)
library(kernlab)

white.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
white.raw <- read.csv(white.url, header = TRUE, sep = ";")
white <- white.raw
str(white)
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The above output tells us that there are 4898 samples and 12 variables. The eleven predictor variables are numeric, and the response variable, quality, is an integer.

table(white$quality)
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

It’s clear from the above table that there is a severe class imbalance. Of the 4898 samples, only 20 have a quality of 3 and only 5 have a quality of 9. There are not enough samples of those classes to split the data into usable training and test sets and perform cross-validation.

However, the academic paper did not make any changes to the classes (by merging classes, for example) to deal with the class imbalances. So for now, I will not make any changes to the classes so that my results remain comparable to theirs. But in the next part of the series, I’ll merge some of the classes in order to deal with the class imbalances.
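
For illustration, here is one way that merging could be done with R’s cut function. The three-group split below is just a sketch with breakpoints I chose for this example, not something taken from the study.

# a sketch of possible class merging (breakpoints are my own illustrative choice):
# quality 3-5 becomes "low", 6 becomes "medium", 7-9 becomes "high"
quality.grouped <- cut(white$quality, breaks = c(2, 5, 6, 9),
                       labels = c("low", "medium", "high"))
table(quality.grouped)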

To visualize the data, plots for each predictor variable will be displayed. The line on each plot shows the linear regression of quality, the response variable, as a function of the plot’s predictor variable.

par(mfrow = c(4,3))
for (i in c(1:11)) {
    plot(white[, i], jitter(white[, "quality"]), xlab = names(white)[i],
         ylab = "quality", col = "firebrick", cex = 0.8, cex.lab = 1.3)
    abline(lm(white[, "quality"] ~ white[ ,i]), lty = 2, lwd = 2)
}
par(mfrow = c(1, 1))

The first thing that stands out in the plots is the presence of outliers for most of the predictor variables. The UCI wine dataset was cleaned prior to its posting, so I don’t think they are errors. However, the residual.sugar outlier is interesting. In the EU, a wine with more than 45 g/l of sugar is considered a sweet wine. The outlier has a residual.sugar level of 65.8, while the next highest value in the dataset is 31.6. Because the sample represents a different wine category than all the others and its sugar level is more than double the next highest, I’m going to remove it from the dataset. Additionally, the sugar outlier comes from the same sample as the density outlier, so removing it cleans up the density distribution as well.

max.sug <- which(white$residual.sugar == max(white$residual.sugar))
white <- white[-max.sug, ]

Free.sulfur.dioxide has an outlying sample more than twice the next largest value. But that sample has a quality of 3, the lowest in the dataset, and its high free.sulfur.dioxide may be linked to that poor quality rating. So, I’m not going to remove it.

Outliers can have other effects that may lead to them being filtered out, such as distorting the mean and SD used in standardization. The k-nearest neighbours algorithm requires the predictors to be transformed so that they all have a common range, and I will use standardization to do this. Fortunately, the dataset has a large number of samples, so the remaining outliers will not affect the mean and SD enough to warrant removal.

The regression lines for citric.acid, free.sulfur.dioxide, and sulphates suggest very weak relationships with quality. It may be a good idea to remove those predictor variables to reduce the dimensionality of the data. Feature selection will be applied after the data is split into training and test sets.

par(mfrow = c(1,1))
cor.white <- cor(white)
corrplot(cor.white, method = 'number')

Weak relationships between quality and citric.acid, free.sulfur.dioxide, and sulphates are seen in the above correlation plot as well.

Density has a 0.83 correlation with residual.sugar and a -0.80 correlation with alcohol. This would be a greater concern if we were planning to train regression and/or linear models. But for now, after learning about the data, I’ve decided that non-linear classification models will be more appropriate than regression. My assumption is that the relationship between the response and predictor variables is more complex than what a regression and/or linear model can capture.

1.2 — Model Building

First, I will convert the response variable, quality, to a factor. Then I will split the data into training and test sets.

white$quality <- as.factor(white$quality)
inTrain <- createDataPartition(white$quality, p = 2/3, list = F)
train.white <- white[inTrain,]
test.white <- white[-inTrain,]

Three types of models will be used — k-nearest neighbours, support vector machine, and random forest.

Training will be done with the help of the caret package’s train function. The cross-validation method will be 5-fold, repeated 5 times.

Caret

Caret’s train function simplifies model tuning. Through the tuneGrid argument, a grid of the hyperparameter values we want to try can be passed into the train function. The expand.grid function simplifies the creation of the grid by combining the selected hyperparameter values into every possible combination.
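
As a quick illustration with a toy grid (these values are just for show, not the ones used for tuning):

# expand.grid returns a data frame with one row per combination
expand.grid(k = c(3, 5), kernel = c("rectangular", "cos"))
##   k      kernel
## 1 3 rectangular
## 2 5 rectangular
## 3 3         cos
## 4 5         cos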

Feature selection

Recall that citric.acid, free.sulfur.dioxide, and sulphates had the weakest correlations with quality. However, I have decided that linear and/or regression methods are not the best choice for this data, so non-linear feature selection methods will be tried.

I tried a few feature selection methods to see what they would return, and most of them retained all the predictors or excluded at most one. So, I’ve decided not to use feature selection while training and tuning the models.
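
For reference, one of the methods I tried was along the lines of caret’s recursive feature elimination with random-forest ranking functions. The sketch below shows the general idea; the exact settings I used may have differed.

# sketch: recursive feature elimination with random-forest ranking (caret's rfe)
rfe.ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
rfe.white <- rfe(train.white[, 1:11], train.white$quality,
                 sizes = c(4, 6, 8, 10), rfeControl = rfe.ctrl)
predictors(rfe.white)  # the predictors retained by the best subset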

Preprocessing

K-nearest neighbours uses distance to classify the response variable, so it is necessary to standardize the predictor variables. This prevents predictors with larger ranges from being over-emphasized by the algorithm. The preProcess argument in the train function will be used to center and scale the predictors, i.e. standardize them.
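
To make the transformation concrete, here is roughly what happens under the hood, applied outside of train as a standalone sketch (train performs the equivalent within each cross-validation fold):

# standardize the predictors manually with caret's preProcess
pp <- preProcess(train.white[, 1:11], method = c("center", "scale"))
train.scaled <- predict(pp, train.white[, 1:11])
round(apply(train.scaled, 2, sd), 2)  # every predictor now has sd = 1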

k-nearest neighbours

For k-nearest neighbours, 5 kmax values, 2 distance values, and 3 kernels will be used (30 combinations). For the distance value, 1 is the Manhattan distance and 2 is the Euclidean distance.
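
As a toy example of how the two distance measures differ (plain base R, nothing specific to kknn):

# Manhattan vs. Euclidean distance between the points (0, 0) and (3, 4)
pts <- rbind(c(0, 0), c(3, 4))
dist(pts, method = "manhattan")  # |3| + |4| = 7
dist(pts, method = "euclidean")  # sqrt(3^2 + 4^2) = 5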

t.ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
kknn.grid <- expand.grid(kmax = c(3, 5, 7, 9, 11), distance = c(1, 2),
                         kernel = c("rectangular", "gaussian", "cos"))
kknn.train <- train(quality ~ ., data = train.white, method = "kknn",
                    trControl = t.ctrl, tuneGrid = kknn.grid,
                    preProcess = c("center", "scale"))
plot(kknn.train)

kknn.train$bestTune
##    kmax distance kernel
## 15    7        1    cos

The cos kernel using a distance of 1 outperforms the alternatives. The best value for k is 7.

randomForest

For the random forest, only the mtry hyperparameter is available for tuning. Mtry is the number of variables randomly sampled as candidates at each split. Mtry values of 1 through 11 will be passed into the train function’s tuneGrid argument.

rf.grid <- expand.grid(mtry = 1:11)
rf.train <- train(quality ~ ., data = train.white, method = "rf",
                  trControl = t.ctrl, tuneGrid = rf.grid,
                  preProcess = c("center", "scale"))
plot(rf.train)

rf.train$bestTune
##   mtry
## 1    1

An mtry of 1 means each split considers just one randomly chosen predictor. I ran the train function again using an ntree of 1000 to see if the result of 1 would stick, and it did. So, I’m going to keep mtry = 1 as the best value.
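
That sanity check looked something like the following; train forwards extra arguments such as ntree to the underlying randomForest call (a sketch, since I did not keep that run’s output):

# re-run with 1000 trees to check that mtry = 1 still wins
rf.check <- train(quality ~ ., data = train.white, method = "rf",
                  trControl = t.ctrl, tuneGrid = rf.grid,
                  preProcess = c("center", "scale"), ntree = 1000)
rf.check$bestTune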

support vector machine

The Radial Basis Function, svmRadial, will be used as the kernel for the SVM model. C is the cost of misclassification and sigma is the inverse kernel width, a smoothing hyperparameter.

Initially, sigma values between 0.1 and 1 were tried. But, looking at the plot of that trial, accuracy was increasing consistently as sigma approached 1. So I tried sigmas in the range of 0.25 to 2 and found a peak in accuracy there. I’ll use that range to train the model for white wine.

Larger cost values than the ones included below were also tried, but they did not have much effect, so I reduced the number of cost values for my final tuning attempt.

svm.grid <- expand.grid(C = 2^(1:3), sigma = seq(0.25, 2, length = 8))
svm.train <- train(quality ~ ., data = train.white, method = "svmRadial",
                   trControl = t.ctrl, tuneGrid = svm.grid,
                   preProcess = c("center", "scale"))
plot(svm.train)

svm.train$bestTune
##   sigma C
## 5  1.25 2

Accuracy peaks at sigma = 1.25 with a cost value of 2.

1.3 — Model Selection

The models were trained using the accuracy metric. Kappa is important as well, and we will take it into account when evaluating model performance. As a benchmark, I will use the results published in the academic study linked from the UCI website.

kknn.predict <- predict(kknn.train, test.white)
confusionMatrix(kknn.predict, test.white$quality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   1   0   0   0   0   0
##          4   1  11  13   6   0   0   0
##          5   3  25 309 134  16   1   1
##          6   2  16 152 486 112  21   0
##          7   0   1  10  99 158  14   0
##          8   0   0   1   7   7  22   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6053          
##                  95% CI : (0.5811, 0.6291)
##     No Information Rate : 0.4494          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4023          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                       Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8  Class: 9
## Sensitivity          0.0000000 0.203704   0.6371   0.6639  0.53925  0.37931 0.0000000
## Specificity          0.9993839 0.987302   0.8427   0.6622  0.90719  0.99045 1.0000000
## Pos Pred Value       0.0000000 0.354839   0.6319   0.6160  0.56028  0.59459       NaN
## Neg Pred Value       0.9963145 0.973091   0.8456   0.7071  0.89978  0.97739 0.9993861
## Prevalence           0.0036832 0.033149   0.2977   0.4494  0.17986  0.03560 0.0006139
## Detection Rate       0.0000000 0.006753   0.1897   0.2983  0.09699  0.01351 0.0000000
## Detection Prevalence 0.0006139 0.019030   0.3002   0.4843  0.17311  0.02271 0.0000000
## Balanced Accuracy    0.4996919 0.595503   0.7399   0.6631  0.72322  0.68488 0.5000000
rf.predict <- predict(rf.train, test.white)
confusionMatrix(rf.predict, test.white$quality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   9   1   1   0   0   0
##          5   2  25 322  70   6   0   1
##          6   4  20 161 624 145  23   0
##          7   0   0   1  37 141  17   0
##          8   0   0   0   0   1  18   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6839          
##                  95% CI : (0.6607, 0.7064)
##     No Information Rate : 0.4494          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4985          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8  Class: 9
## Sensitivity          0.000000 0.166667   0.6639   0.8525  0.48123  0.31034 0.0000000
## Specificity          1.000000 0.998730   0.9091   0.6065  0.95883  0.99936 1.0000000
## Pos Pred Value            NaN 0.818182   0.7559   0.6387  0.71939  0.94737       NaN
## Neg Pred Value       0.996317 0.972188   0.8645   0.8344  0.89393  0.97516 0.9993861
## Prevalence           0.003683 0.033149   0.2977   0.4494  0.17986  0.03560 0.0006139
## Detection Rate       0.000000 0.005525   0.1977   0.3831  0.08656  0.01105 0.0000000
## Detection Prevalence 0.000000 0.006753   0.2615   0.5998  0.12032  0.01166 0.0000000
## Balanced Accuracy    0.500000 0.582698   0.7865   0.7295  0.72003  0.65485 0.5000000
svm.predict <- predict(svm.train, test.white)
confusionMatrix(svm.predict, test.white$quality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   5   1   1   0   0   0
##          5   1  13 250  66   5   1   0
##          6   5  36 232 622 146  32   1
##          7   0   0   2  42 140   4   0
##          8   0   0   0   1   2  21   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6372          
##                  95% CI : (0.6133, 0.6606)
##     No Information Rate : 0.4494          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4157          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8  Class: 9
## Sensitivity          0.000000 0.092593   0.5155   0.8497  0.47782  0.36207 0.0000000
## Specificity          1.000000 0.998730   0.9248   0.4961  0.96407  0.99809 1.0000000
## Pos Pred Value            NaN 0.714286   0.7440   0.5791  0.74468  0.87500       NaN
## Neg Pred Value       0.996317 0.969790   0.8183   0.8018  0.89382  0.97695 0.9993861
## Prevalence           0.003683 0.033149   0.2977   0.4494  0.17986  0.03560 0.0006139
## Detection Rate       0.000000 0.003069   0.1535   0.3818  0.08594  0.01289 0.0000000
## Detection Prevalence 0.000000 0.004297   0.2063   0.6593  0.11541  0.01473 0.0000000
## Balanced Accuracy    0.500000 0.545661   0.7201   0.6729  0.72094  0.68008 0.5000000

The benchmark for white wine was established using an SVM model with a Gaussian kernel. Benchmark accuracy (%) is 64.6±0.4 and benchmark Kappa (%) is 43.9±0.4, achieved using 20 runs of 5-fold cross-validation (100 experiments). The benchmark’s confidence intervals are much smaller than mine because of those 100 experiments; I used only 5 runs of 5-fold cross-validation (25 experiments).

Only one model performed better than the benchmark accuracy: the Random Forest, which returned an accuracy of 68.4±2.3 and a Kappa of 49.9. K-nearest neighbours performed statistically worse, and the SVM model was neither significantly better nor worse at the 95% confidence level.
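
For reference, the accuracy, Kappa, and confidence bounds used in this comparison can be read straight out of the confusionMatrix object, e.g. for the random forest:

# pull the overall statistics from caret's confusionMatrix output
cm.rf <- confusionMatrix(rf.predict, test.white$quality)
round(cm.rf$overall[c("Accuracy", "Kappa", "AccuracyLower", "AccuracyUpper")], 4)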

All of the models did a poor job at identifying white wines of the 2 lowest and 2 highest classes.


Part 2 — Red Wine

Introduction

This part will analyze the red wine dataset from the UCI Machine Learning Repository.

Three types of machine learning models will be used — k-nearest neighbours, randomForest, and support vector machine. The caret package will be used to tune the models.

2.1 — Exploring the Data

First, I will reset the seed and import the red wine data directly from the UCI website.

set.seed(10)
red.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
red.raw <- read.csv(red.url, header = TRUE, sep = ";")
red <- red.raw
str(red)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Similar to white wine, there are 11 predictor variables and one response variable, quality. The red wine dataset has 1599 samples.

table(red$quality)
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The red wine data suffers from class imbalance as well. Of the 1599 samples, only 10 have a quality of 3 and only 18 have a quality of 8. There are not enough samples of those classes to split the data into usable training and test sets and perform cross-validation.

To get a good overall look at the data, plots for each predictor variable will be displayed. The line on each plot shows the linear regression of quality, the response variable, as a function of the plot’s predictor variable.

red$quality <- as.integer(red$quality)
par(mfrow = c(4,3))
for (i in c(1:11)) {
    plot(red[, i], jitter(red[, "quality"]), xlab = names(red)[i],
         ylab = "quality", col = "firebrick", cex = 0.8, cex.lab = 1.3)
    abline(lm(red[, "quality"] ~ red[ ,i]), lty = 2, lwd = 2)
}
par(mfrow = c(1, 1))

A few things stand out. First, volatile.acidity, sulphates, and alcohol appear to have the strongest relationships with quality, while free.sulfur.dioxide and residual.sugar appear to have the weakest.

Second, some of the predictor variables appear to have a few outliers. However, I will keep them in the dataset based on the same reasoning used in the white wine analysis.

Next, a plot of the correlations between each variable will be created.

par(mfrow = c(1,1))
cor.red <- cor(red)
corrplot(cor.red, method = 'number')

Weak relationships between quality and residual.sugar, free.sulfur.dioxide, and pH can be seen in the correlation plot. A few predictors are correlated with each other at 0.67, but correlations of that size are not a concern here.

2.2 — Model Building

First, I will convert the quality response variable to a factor, then split the data into training and test sets.

red$quality <- as.factor(red$quality)
inTrain <- createDataPartition(red$quality, p = 2/3, list = F)
train.red <- red[inTrain,]
test.red <- red[-inTrain,]

As with the white wine data, the Caret package’s train function will be used to train the models, all the predictor variables will be used, and preprocessing will be done within the cross-validation loop. The predictor variables will be standardized and the cross-validation method will be repeated cross-validation repeated 5 times using 5 folds.

k-nearest neighbours

For k-nearest neighbours, 5 kmax values, 2 distance values, and 3 kernels will be used. In total, that is 30 possible combinations.

t.ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
kknn.grid <- expand.grid(kmax = c(3, 5, 7, 9, 11), distance = c(1, 2),
                         kernel = c("rectangular", "cos", "gaussian"))
kknn.train <- train(quality ~ ., data = train.red, method = "kknn",
                    trControl = t.ctrl, tuneGrid = kknn.grid,
                    preProcess = c("center", "scale"))
plot(kknn.train)

kknn.train$bestTune
##    kmax distance kernel
## 20    9        1    cos

The cos kernel with a distance of 1 outperformed the other combinations. For the red wine data, a kmax of 9 returned the highest accuracy.

randomForest

For the random forest, mtry values of 1 through 11 will be passed into the train function’s tuneGrid argument.

t.ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
rf.grid <- expand.grid(mtry = 1:11)
rf.train <- train(quality ~ ., data = train.red, method = "rf",
                  trControl = t.ctrl, tuneGrid = rf.grid, 
                  preProcess = c("center", "scale"))
plot(rf.train)

rf.train$bestTune
##   mtry
## 2    2

An mtry of 2 returned the highest accuracy. For white wine, mtry was 1.

support vector machine

The Radial Basis Function (svmRadial) will be used as the kernel.

I will use 10 sigma values between 0.1 and 1. For cost, three values will be used: 2, 4, and 8.

svm.grid <- expand.grid(C = 2^(1:3), sigma = seq(0.1, 1, length = 10))
svm.train <- train(quality ~ ., data = train.red, method = "svmRadial",
                   trControl = t.ctrl, tuneGrid = svm.grid,
                   preProcess = c("center", "scale"))
plot(svm.train)

svm.train$bestTune
##   sigma C
## 6   0.6 2

A cost of 2 and a sigma of 0.6 returned the highest accuracy.

2.3 — Model Selection

As with the white wine data, accuracy was used as the metric to choose the best models during the tuning phase. As a benchmark, I will use the results published in the academic study linked from the UCI website.

kknn.predict <- predict(kknn.train, test.red)
confusionMatrix(kknn.predict, test.red$quality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8
##          3   0   0   0   0   0   0
##          4   0   1   1   1   0   0
##          5   1   8 170  53   4   1
##          6   2   8  50 126  25   1
##          7   0   0   5  32  37   4
##          8   0   0   1   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.629           
##                  95% CI : (0.5863, 0.6702)
##     No Information Rate : 0.4275          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4124          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000 0.058824   0.7489   0.5943  0.56061 0.000000
## Specificity           1.00000 0.996109   0.7796   0.7304  0.91183 0.998095
## Pos Pred Value            NaN 0.333333   0.7173   0.5943  0.47436 0.000000
## Neg Pred Value        0.99435 0.969697   0.8061   0.7304  0.93598 0.988679
## Prevalence            0.00565 0.032015   0.4275   0.3992  0.12429 0.011299
## Detection Rate        0.00000 0.001883   0.3202   0.2373  0.06968 0.000000
## Detection Prevalence  0.00000 0.005650   0.4463   0.3992  0.14689 0.001883
## Balanced Accuracy     0.50000 0.527466   0.7643   0.6624  0.73622 0.499048
rf.predict <- predict(rf.train, test.red)
confusionMatrix(rf.predict, test.red$quality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8
##          3   0   1   0   0   0   0
##          4   0   0   0   0   0   0
##          5   2  13 183  52   3   0
##          6   1   3  42 143  24   3
##          7   0   0   2  17  39   3
##          8   0   0   0   0   0   0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6874         
##                  95% CI : (0.646, 0.7266)
##     No Information Rate : 0.4275         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4955         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000  0.00000   0.8062   0.6745  0.59091   0.0000
## Specificity          0.998106  1.00000   0.7697   0.7712  0.95269   1.0000
## Pos Pred Value       0.000000      NaN   0.7233   0.6620  0.63934      NaN
## Neg Pred Value       0.994340  0.96798   0.8417   0.7810  0.94255   0.9887
## Prevalence           0.005650  0.03202   0.4275   0.3992  0.12429   0.0113
## Detection Rate       0.000000  0.00000   0.3446   0.2693  0.07345   0.0000
## Detection Prevalence 0.001883  0.00000   0.4765   0.4068  0.11488   0.0000
## Balanced Accuracy    0.499053  0.50000   0.7880   0.7228  0.77180   0.5000
svm.predict <- predict(svm.train, test.red)
confusionMatrix(svm.predict, test.red$quality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8
##          3   0   1   0   0   0   0
##          4   0   0   0   0   0   0
##          5   2  12 172  63   7   2
##          6   1   4  51 133  21   1
##          7   0   0   4  16  38   3
##          8   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.646           
##                  95% CI : (0.6036, 0.6867)
##     No Information Rate : 0.4275          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4284          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000  0.00000   0.7577   0.6274  0.57576   0.0000
## Specificity          0.998106  1.00000   0.7171   0.7555  0.95054   1.0000
## Pos Pred Value       0.000000      NaN   0.6667   0.6303  0.62295      NaN
## Neg Pred Value       0.994340  0.96798   0.7985   0.7531  0.94043   0.9887
## Prevalence           0.005650  0.03202   0.4275   0.3992  0.12429   0.0113
## Detection Rate       0.000000  0.00000   0.3239   0.2505  0.07156   0.0000
## Detection Prevalence 0.001883  0.00000   0.4859   0.3974  0.11488   0.0000
## Balanced Accuracy    0.499053  0.50000   0.7374   0.6914  0.76315   0.5000

Benchmark accuracy (%) is 62.4±0.4 and benchmark Kappa (%) is 38.7±0.7. As with the white wine data, the Random Forest model was the only one that performed better than the benchmark, with an accuracy of 68.7±4.0 and a Kappa of 49.6. The k-nearest neighbours and SVM model accuracies were neither statistically better nor worse than the benchmark.
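
Caret also provides a resamples function for comparing the three models side by side on their cross-validation results (a sketch; note this summarizes resampling performance, not test-set performance):

# compare the models' cross-validation distributions
cv.results <- resamples(list(kknn = kknn.train, rf = rf.train, svm = svm.train))
summary(cv.results)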

All 3 models did a poor job at identifying the highest and lowest quality wines.


Conclusion

A model that is only accurate at identifying average quality wines is of limited use. With this dataset, it’s hard to say whether a model can be found that accurately identifies the low and high quality wines; only more work with this dataset can answer that. The benchmark model has lower overall accuracy than I was able to achieve, but its white wine accuracy is more balanced across the classes. However, that model is of limited overall use as well.