Wine Quality

Introduction and Analysis Outlines

What’s something we all have in common? We enjoy food and drink! After the essentials (water, tea, and coffee), wine is one of the most consumed beverages around the world! Do you enjoy wine? What if you could predict how delicious a wine might be before having to pay for it and taste it? My question of study: based on the physical makeup of wine, can we predict its quality? To answer this question, I decided to use datasets about wine quality from the source: https://archive.ics.uci.edu/dataset/186/wine+quality. There are two datasets: red wine dataset and white wine dataset. There are 12 variables related to the contents and characteristics of wine in each dataset.

In addition to the initial data exploration, I plan to do [possible] principal component analysis, and then linear discriminant analysis, quadratic discriminant analysis, logistic regression, random forest, and decision tree. Among all of these classification method, we will pick out which method will give the best prediction on wine quality.

Data Preparation

Load libraries

library(MASS)
library(klaR)
library(boot)
library(candisc)
library(dplyr)
library(knitr)
library(ggplot2)
library(tidyverse)
library(papeR)
library(class)
library(e1071)
library(ISLR)
library(tree)
library(formattable)
library(gt)
library(dplyr)   
library(ggplot2)
library(corrplot) 
library(caret)  
library(skimr)  
library(gt) 
library(rpart) 
library(rpart.plot)  
library(randomForest)

Load datasets into R

red_df <- read.csv("./datasets/winequality-red.csv", sep = ";")
white_df <- read.csv("./datasets/winequality-white.csv", sep = ";")

The quality column for wine quality is scored between 0-10. I’m going to create an extra column, call it good_quality, that assign 1 for the “quality” wines which have quality score 7 or above and 0 otherwise.

red_df$good_quality <- ifelse(red_df$quality >= 7, 1, 0)

white_df$good_quality <- ifelse(white_df$quality >= 7, 1, 0)

Before getting to analysis, I am going to merge red wine dataset and white wine dataset into a single dataset call wine.

First, let’s create a vector of wine’s name so we know which dataset are the rows from.

red_df_name <- rep("red",nrow(red_df))
white_df_name <- rep("white",nrow(white_df))

Now use cbindto add the vector as new columns in each dataframe.

red_wine <- cbind(red_df, red_df_name)
white_wine <- cbind(white_df, white_df_name)

Looking at the updated dataframes, we see that the new column’s name is either red_df_name or white_df_name. Let’s change to color so that the color column have the same name in both datasets.

names(red_wine)[ncol(red_wine)] <- "color"
names(white_wine)[ncol(white_wine)] <- "color"

Each dataframe has the same structure and now we can merge the two dataframes into one.

wine <- rbind(red_wine,white_wine)
#remove quality column as we dont need it anymore 
wine_new <- subset(wine, select = -quality)

Data Exploration

Now, the data set has thirteen variables that can be grouped into categories of wine characteristics:

Acid: fixed_acidity, volatile_acidity, citric_acid
Sugar: residual_sugar
Salt: chlorides
Chemistry: free_sulfur_dioxide, total_sulfur_dioxide, pH, sulphates
Physical: density, color (red or white)
Alcohol Content: alcohol
Rating: quality (scale from 3 - 9), good_quality (1 - ‘a good wine’, 0 - otherwise)

Color is a categorical variable, quality is a quantitative discrete variable, good_quality is a binary varible, and the others are quantitative continuous variables. There are no missing values in the data set. The table below shows the numerical summaries for all of the quantitative variables. Of the 6497 observations, 1277 (19.66 %) were labeled ‘a good wine’. Of the 6497 wines, 1599 were red wines and 4898 were white wines. Based on the histograms of the continuous quantitative variables, there do not appear to be any outliers in the data.

counts_quality <- table(wine_new$good_quality)

percentages_quality <- prop.table(table(wine_new$good_quality)) * 100

knitr::kable(cbind(counts_quality, percentages_quality))

	counts_quality	percentages_quality
0	5220	80.34477
1	1277	19.65523

kable(summarize(wine_new[1:11], type = 'numeric'))

	N	Mean	SD	Min	Q1	Median	Q3	Max
fixed.acidity	6497	7.22	1.30	3.80	6.40	7.00	7.70	15.90
volatile.acidity	6497	0.34	0.16	0.08	0.23	0.29	0.40	1.58
citric.acid	6497	0.32	0.15	0.00	0.25	0.31	0.39	1.66
residual.sugar	6497	5.44	4.76	0.60	1.80	3.00	8.10	65.80
chlorides	6497	0.06	0.04	0.01	0.04	0.05	0.06	0.61
free.sulfur.dioxide	6497	30.53	17.75	1.00	17.00	29.00	41.00	289.00
total.sulfur.dioxide	6497	115.74	56.52	6.00	77.00	118.00	156.00	440.00
density	6497	0.99	0.00	0.99	0.99	0.99	1.00	1.04
pH	6497	3.22	0.16	2.72	3.11	3.21	3.32	4.01
sulphates	6497	0.53	0.15	0.22	0.43	0.51	0.60	2.00
alcohol	6497	10.49	1.19	8.00	9.50	10.30	11.30	14.90

Histograms of Quantitative Continuous Variables

#Histogram
par(mfrow = c(4,3), mar = c(2,2,2,2))

hist(wine_new$fixed.acidity, main = 'Fixed Acidity', col = "lightblue")

hist(wine_new$volatile.acidity, main = 'Volatile Acidity', col = "lightblue")

hist(wine_new$citric.acid, main = 'Citric Acid' , col = "lightblue")

hist(wine_new$residual.sugar, main = 'Residual Sugar', col = "lightblue")

hist(wine_new$chlorides, main = 'Chlorides', col = "lightblue")

hist(wine_new$free.sulfur.dioxide, main = 'Free Sulfur Dioxide', col = "lightblue")

hist(wine_new$total.sulfur.dioxide, main = 'Total Sulfur Dioxide', col = "lightblue")

hist(wine_new$density, main = 'Density', col = "lightblue")

hist(wine_new$pH, main = 'pH', col = "lightblue")

hist(wine_new$sulphates, main = 'Sulphates', col = "lightblue")

hist(wine_new$alcohol, main = 'Alcohol', col = "lightblue")

All the variables have unimodal distributions. Some are bell-shaped while some have a slight right skew.

Correlation Matrix

The plot below shows the correlation matrix of the quantitative continuous variables in the dataset.

corr_matrix <- cor(wine_new[,1:11])
# Create a correlation heatmap with values displayed
corrplot(corr_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, 
         addCoef.col = "black", number.cex = 0.7)

Only three pairs of variables have correlations above \(|0.5|\) and the rest of the pairs have relatively weak correlations. Because of this, it is not expected that a dimension reduction technique such as Principal Component Analysis to effectively reduce the dimensions to be used in the classification model.

Classification

Before we start off our classification techniques, we will create 70|30 split of our dataset to obtain our training/testing data.

The training data was used to fit each model for the particular classification technique. Each fitted model was then used to predict good_quality wine classifications on the testing data. We are using the confusion matrix to evaluate the performance of classification models.

# convert good_quality to factor
wine_new$good_quality <- as.factor(wine_new$good_quality)
# Simple partition into train and test set 
set.seed(754) 
trainIndex <- createDataPartition(wine_new$good_quality, p = .70, list = FALSE, times = 1)

wine_train <- wine_new[trainIndex, ]  
wine_test <- wine_new[-trainIndex, ]

Logistic Regression Models

We’ll start by creating a logistic regression model to predict good_quality.

# Fit model to training data
logis_model <-  glm(good_quality~.,data = wine_train, family = "binomial")
# Make predictions on train set
predict_train_logis <- predict(logis_model, newdata = wine_train, type="response")
## Convert fitted model values to fitted classes. Use 0.5 as the
#  threshold for classifying a case as a 1.
class_train_logis <- as.factor(ifelse(predict_train_logis>0.5,1,0))

cm_train_logis <- confusionMatrix(class_train_logis, wine_train$good, positive="1")
cm_train_logis

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3478  647
##          1  176  247
##                                           
##                Accuracy : 0.819           
##                  95% CI : (0.8075, 0.8301)
##     No Information Rate : 0.8034          
##     P-Value [Acc > NIR] : 0.003979        
##                                           
##                   Kappa : 0.2848          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.27629         
##             Specificity : 0.95183         
##          Pos Pred Value : 0.58392         
##          Neg Pred Value : 0.84315         
##              Prevalence : 0.19657         
##          Detection Rate : 0.05431         
##    Detection Prevalence : 0.09301         
##       Balanced Accuracy : 0.61406         
##                                           
##        'Positive' Class : 1               
##

Reading the cofusion matrix outcome:

True Negatives (TN): 3478 instances were correctly predicted as class 0 (negative class).

False Positives (FP): 647 instances were incorrectly predicted as class 1 (positive class) when they were actually class 0.

False Negatives (FN): 176 instances were incorrectly predicted as class 0 when they were actually class 1.

True Positives (TP): 247 instances were correctly predicted as class 1.

And the statistics: for accuracy, the model correctly predicted 81.9% of the instances (both TP and TN out of the total number of instances). The model has a sensitivity of 27.629%, meaning it correctly identifies 27.629% of the actual positives and a specificity of 95.183%, meaning it correctly identifies 95.183% of the actual negatives. From these statistics, we can see that the model is much better at identifying negative instances than positive ones.

Let’s predict on the test data

# Make predictions on test set
predict_test_logis <- predict(logis_model, newdata = wine_test, type="response")
## Convert fitted model values to fitted classes. Use 0.5 as the
#  threshold for classifying a case as a 1.
class_test_logis <- as.factor(ifelse(predict_test_logis>0.5,1,0))

confusionMatrix(class_test_logis, wine_test$good, positive="1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1475  276
##          1   91  107
##                                           
##                Accuracy : 0.8117          
##                  95% CI : (0.7936, 0.8288)
##     No Information Rate : 0.8035          
##     P-Value [Acc > NIR] : 0.1888          
##                                           
##                   Kappa : 0.2706          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.2794          
##             Specificity : 0.9419          
##          Pos Pred Value : 0.5404          
##          Neg Pred Value : 0.8424          
##              Prevalence : 0.1965          
##          Detection Rate : 0.0549          
##    Detection Prevalence : 0.1016          
##       Balanced Accuracy : 0.6106          
##                                           
##        'Positive' Class : 1               
##

The accuracy on the test set is very close to the train set ( 81.9% vs 81.17% respectively), indicating the model predict very well on the test data. Thus there is no potential of overfitting here.

Sensitivity (True Positive rate): the test data has the sensitivity of 27.94%, meaning it accurately identifies 27.94% of the portion of the good quality wine. It is slightly higher than sensitivity on the training data (27.629%), but the difference is little.

Specificity (True Negative rate): the test data has the specificity of 94.19%, meaning it accurately identifies 94.19% the portion of the lower quality wine. It is slightly lower than specificity on the training data (95.183%).

Overall, the results on test data are consistent with the model’s performance in the train data. However, the sensitivity is still low, suggesting there’s a large percentage of high quality wine is missing.

Descision Tree Model

Now we create a simple decision tree model to predict good_quality. Again, use all the variables.

# Decision tree model
tree_model <- rpart(good_quality ~ ., data=wine_train)

# Make prediction on train set
predict_train_tree <- predict(tree_model, type="class")

# confusion matrix on train
 confusionMatrix(predict_train_tree, wine_train$good_quality, positive="1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3522  594
##          1  132  300
##                                           
##                Accuracy : 0.8404          
##                  95% CI : (0.8294, 0.8509)
##     No Information Rate : 0.8034          
##     P-Value [Acc > NIR] : 7.373e-11       
##                                           
##                   Kappa : 0.3721          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.33557         
##             Specificity : 0.96388         
##          Pos Pred Value : 0.69444         
##          Neg Pred Value : 0.85569         
##              Prevalence : 0.19657         
##          Detection Rate : 0.06596         
##    Detection Prevalence : 0.09499         
##       Balanced Accuracy : 0.64972         
##                                           
##        'Positive' Class : 1               
##

The tree model has better Accuracy (84.04%), Sensitivity (33.56%) and Specificity (96.39%) compare to logistic model.

Let’s predict on test set

# Make prediction on test set
predict_test_tree <- predict(tree_model, newdata = wine_test, type="class")

# confusion matrix on train
 confusionMatrix(predict_test_tree, wine_test$good_quality, positive="1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1497  262
##          1   69  121
##                                           
##                Accuracy : 0.8302          
##                  95% CI : (0.8127, 0.8466)
##     No Information Rate : 0.8035          
##     P-Value [Acc > NIR] : 0.001428        
##                                           
##                   Kappa : 0.3358          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.31593         
##             Specificity : 0.95594         
##          Pos Pred Value : 0.63684         
##          Neg Pred Value : 0.85105         
##              Prevalence : 0.19651         
##          Detection Rate : 0.06208         
##    Detection Prevalence : 0.09749         
##       Balanced Accuracy : 0.63593         
##                                           
##        'Positive' Class : 1               
##

The result is very close to the training data.

Random Forest

#fit model
forest_model <- randomForest(good_quality~., data = wine_train, ntree = 100, mtry = sqrt(3), importance = TRUE)
forest_model

## 
## Call:
##  randomForest(formula = good_quality ~ ., data = wine_train, ntree = 100,      mtry = sqrt(3), importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 11.72%
## Confusion matrix:
##      0   1 class.error
## 0 3535 119  0.03256705
## 1  414 480  0.46308725

# create confusion matrix for train set
predict_forest_train <- predict(forest_model, type = "class")

#confusion matrix on train
confusionMatrix(predict_forest_train, wine_train$good_quality, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3535  414
##          1  119  480
##                                          
##                Accuracy : 0.8828         
##                  95% CI : (0.8731, 0.892)
##     No Information Rate : 0.8034         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5761         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.5369         
##             Specificity : 0.9674         
##          Pos Pred Value : 0.8013         
##          Neg Pred Value : 0.8952         
##              Prevalence : 0.1966         
##          Detection Rate : 0.1055         
##    Detection Prevalence : 0.1317         
##       Balanced Accuracy : 0.7522         
##                                          
##        'Positive' Class : 1              
##

Predict on test set

# create confusion matrix for test set
predict_forest_test <- predict(forest_model, newdata = wine_test, type = "class")

#confusion matrix on train
confusionMatrix(predict_forest_test, wine_test$good_quality, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1503  173
##          1   63  210
##                                           
##                Accuracy : 0.8789          
##                  95% CI : (0.8636, 0.8931)
##     No Information Rate : 0.8035          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5699          
##                                           
##  Mcnemar's Test P-Value : 1.291e-12       
##                                           
##             Sensitivity : 0.5483          
##             Specificity : 0.9598          
##          Pos Pred Value : 0.7692          
##          Neg Pred Value : 0.8968          
##              Prevalence : 0.1965          
##          Detection Rate : 0.1077          
##    Detection Prevalence : 0.1401          
##       Balanced Accuracy : 0.7540          
##                                           
##        'Positive' Class : 1               
##

Accuracy: 87.89%, Sensitivity: 54.31%, and Specificity: 96.10%. Similar to the confusion matrix from training set. The random fores model has better Accuracy, Sensitivity and Specificity than the previous two models.

Linear Discriminant Analysis (LDA)

The lda function through the MASS package was used to fit an LDA model on the training data.

# Fit model
lda_model <- lda(good_quality~., wine_train)
# create confusion matrix for train set
predict_lda_train <- predict(lda_model, type = "class")

#confusion matrix on train
confusionMatrix(data = predict_lda_train$class, reference = wine_train$good_quality)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3433  597
##          1  221  297
##                                           
##                Accuracy : 0.8201          
##                  95% CI : (0.8087, 0.8312)
##     No Information Rate : 0.8034          
##     P-Value [Acc > NIR] : 0.002224        
##                                           
##                   Kappa : 0.323           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9395          
##             Specificity : 0.3322          
##          Pos Pred Value : 0.8519          
##          Neg Pred Value : 0.5734          
##              Prevalence : 0.8034          
##          Detection Rate : 0.7548          
##    Detection Prevalence : 0.8861          
##       Balanced Accuracy : 0.6359          
##                                           
##        'Positive' Class : 0               
##

Let’s predict on test data

# create confusion matrix for test set
predict_lda_test <- predict(lda_model, newdata = wine_test, type = "class")

#confusion matrix on train
confusionMatrix(data = predict_lda_test$class, reference = wine_test$good_quality)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1463  263
##          1  103  120
##                                           
##                Accuracy : 0.8122          
##                  95% CI : (0.7942, 0.8293)
##     No Information Rate : 0.8035          
##     P-Value [Acc > NIR] : 0.1736          
##                                           
##                   Kappa : 0.2939          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9342          
##             Specificity : 0.3133          
##          Pos Pred Value : 0.8476          
##          Neg Pred Value : 0.5381          
##              Prevalence : 0.8035          
##          Detection Rate : 0.7506          
##    Detection Prevalence : 0.8856          
##       Balanced Accuracy : 0.6238          
##                                           
##        'Positive' Class : 0               
##

Quadratic Discriminant Analysis (QDA)

A classification model was again fit using the training data, this time with QDA, using the qda function through the MASS package.

quad_model = qda(good_quality~.,wine_train)
# create confusion matrix for test set
predict_quad_test <- predict(quad_model, newdata = wine_test, type = "class")

#confusion matrix on train
confusionMatrix(data = predict_quad_test$class, reference = wine_test$good_quality)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1199  115
##          1  367  268
##                                           
##                Accuracy : 0.7527          
##                  95% CI : (0.7329, 0.7717)
##     No Information Rate : 0.8035          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3727          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7656          
##             Specificity : 0.6997          
##          Pos Pred Value : 0.9125          
##          Neg Pred Value : 0.4220          
##              Prevalence : 0.8035          
##          Detection Rate : 0.6152          
##    Detection Prevalence : 0.6742          
##       Balanced Accuracy : 0.7327          
##                                           
##        'Positive' Class : 0               
##

Which is the “best” model?

After running the different types of classifications, it’s important to be able to compare these metrics and choose the model that performs best based on the specific requirements of individual’s problem. It’s important to consider the trade-offs between different metrics and choose the model that best aligns with our goals (e.g., minimizing false positives, maximizing recall, etc.).

Summary of the performance of all five model:

Accuracy: is the proportion of correctly classified instances (both true positives and true negatives) out of the total instances. It is calculated as \((TP + TN) / (TP + TN + FP + FN)\).
Sensitivity (True Positive Rate) measures the percentage of true positives that are correctly identified by the model. It is calculated as \(TP/(TP+FN)\).
Specificity (True Negative Rate) measures the percentage of true negatives that are correctly identified by the model (Specificity=\(TN/(TN+FP)\)).

Mode	Accuracy (%)	Sensitivity (%)	Specitivity (%)
Logistics	81.17	27.94	94.19
Descision Tree	83.02	31.59	95.59
Random Forest	87.89	54.31	96.10
LDA	81.22	93.42	31.33
QDA	75.27	76.56	69.97

From the table above, Random Forest has the highest accuracy of 87.89%, indicating that it correctly classifies the instances most often. LDA has the highest sensitivity of 93.42%, indicating that it correctly identifies a high proportion of positive instances. This is important if correctly identifying positive instances is a priority. Random Forest has the highest specificity of 96.10%, indicating that it correctly identifies a high proportion of negative instances. This is important if correctly identifying negative instances is a priority.

Based on the metrics we produced, Random Forest appears to have the highest accuracy and specificity, while LDA has the highest sensitivity. Decision Tree also shows competitive performance across all metrics. Ultimately, the choice of the “best” model depends on the specific goals of individual’s interest and the trade-offs we are willing to make between accuracy, sensitivity, and specificity.

Conclusion and future work

An analysis of wine quality on the data set of 6,497 observations of wine was presented using five different classification methods. With interest being only in distinguishing between high and poor qualities, the instances in the data set of 11 quality levels were grouped into two classes.

For future work, one can consider the different wine colors separately. More data can also be collected since such a data set tends to be very subjective. Other classification algorithms such as bayesian classification and neutral network can also be applied to this datset.

MIS 5470, Final Project

Mimi Tran, mttran3@oakland.edu

4/11/2024