Red Wine Classification Model

1. Context

Red wine is a type of wine made from dark-colored grape varieties. The actual color of the wine can range from intense violet, typical of young wines, through to brick red for mature wines and brown for older red wines. (Source: Wikipedia)

The aim of this report is to build a model that can automatically determine red wine quality, and thus help red wine producers make a better product. In this report we are going to build three models: Naive Bayes, Decision Tree, and Random Forest. At the end of the report we will choose the best model for predicting red wine quality, which can then be used on new products in the future.

2. Data Preparation

2.1 Import Library

library(dplyr)      # data wrangling
library(ggplot2)    # visualization
library(caret)      # data splitting, down-sampling, model training
library(e1071)      # naive Bayes
library(tidymodels) # modelling framework
library(partykit)   # decision tree (ctree)
library(animation)  # k-fold cross-validation animation

2.2 Read Data

wine <- read.csv("winequality-red.csv")
wine
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
##  [ reached 'max' / getOption("max.print") -- omitted 1593 rows ]
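The dataset contains 1,599 observations (6 shown above plus 1,593 omitted) of 12 variables. For a quick structural overview of column types and dimensions, base R's str() can be used (output omitted here):

str(wine)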

Column Description:

fixed.acidity Overall wine acidity

volatile.acidity Acetic acid in wine

citric.acid Citric Acid in Wine

residual.sugar Amount of sugar in product

chlorides Amount of salt in product

free.sulfur.dioxide Free form of SO2; prevents microbial growth and the oxidation of wine

total.sulfur.dioxide Amount of free and bound forms of SO2

density Density of the wine, which depends on the alcohol percentage and sugar content

pH Product acidic scale (most wines are between 3-4 on pH scale)

sulphates Antimicrobial and Antioxidant

alcohol Alcohol percentage of Wine

quality Product’s score

2.3 Data Wrangling

As described above, the aim of this report is to build a model that can predict red wine quality. First we are going to use a threshold of 6.5 on the quality score as the separator between good wine and bad wine.
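Before binarizing, it can help to glance at the distribution of the raw quality scores (a quick check; output not shown here):

table(wine$quality)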

wine <- wine %>% 
  mutate(quality = as.factor(ifelse(quality > 6.5, "Good", "Bad")))
wine
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1     Bad
## 2     Bad
## 3     Bad
## 4     Bad
## 5     Bad
## 6     Bad
##  [ reached 'max' / getOption("max.print") -- omitted 1593 rows ]

2.4 Missing Values

Don’t forget to check for missing values.

colSums(is.na(wine))
##        fixed.acidity     volatile.acidity          citric.acid 
##                    0                    0                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    0                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              alcohol              quality 
##                    0                    0                    0

Since there are no missing values we can carry on!
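Had any column shown missing values, one simple (if blunt) option would be to drop the affected rows before modelling (a sketch; not needed here, since every column is complete):

wine <- na.omit(wine) # drops every row that contains at least one NA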

2.5 Exploratory Data Analysis

First, let’s check the proportion of our target variable:

prop.table(table(wine$quality))
## 
##       Bad      Good 
## 0.8642902 0.1357098

Based on the table above we can see that the proportion between levels in our target variable is unbalanced: roughly 86% of the wines are labelled Bad and only 14% Good. Since the imbalance is this severe, we are going to down-sample the majority class to balance the proportions.
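Since ggplot2 is already loaded, a bar chart makes the imbalance easy to see (a minimal sketch; output not shown):

ggplot(wine, aes(x = quality)) +
  geom_bar() +
  labs(title = "Class Balance of Red Wine Quality", x = "Quality Label", y = "Count")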

Data Splitting

RNGkind(sample.kind = "Rounding") # reproduce sampling from R versions before 3.6
set.seed(100)

index <- sample(nrow(wine), nrow(wine)*0.8) # use 80% of the rows for training

train <- wine[index,]
test <- wine[-index,]

table(train$quality)
## 
##  Bad Good 
## 1106  173
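Note that a simple random split like the one above does not guarantee the class ratio is preserved in both sets. An alternative (not used in this report) is a stratified split with caret's createDataPartition, sketched here with the actual assignments commented out so they do not overwrite the split above:

# stratified 80/20 split; keeps the Bad/Good ratio similar in train and test
idx <- createDataPartition(wine$quality, p = 0.8, list = FALSE)
# train <- wine[idx, ]
# test  <- wine[-idx, ]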

Down-sampling

train_down <- downSample(x = train[, -12],  # predictors only; column 12 is the target
                         y = train$quality,
                         yname = "quality")
table(train_down$quality)
## 
##  Bad Good 
##  173  173

3. Modelling

3.1 Modelling: Naive Bayes

# Model Creation
model_naive <- naiveBayes(x = train_down[, -12],  # drop the target column from the predictors
                          y = train_down$quality, laplace = 1)

# Model Fitting
naive_pred <- predict(model_naive, test, type = "class") 
naive_prob <- predict(model_naive, test, type = "raw") 

#Performance Evaluation (Confusion Matrix)
confusionMatrix(data = naive_pred, reference = test$quality, positive = "Good")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bad Good
##       Bad  274    0
##       Good   2   44
##                                           
##                Accuracy : 0.9938          
##                  95% CI : (0.9776, 0.9992)
##     No Information Rate : 0.8625          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9741          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9928          
##          Pos Pred Value : 0.9565          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.1375          
##          Detection Rate : 0.1375          
##    Detection Prevalence : 0.1437          
##       Balanced Accuracy : 0.9964          
##                                           
##        'Positive' Class : Good            
## 

From the result above we can see that the Naive Bayes model has an accuracy of over 99%. In theory a Naive Bayes model should not come this close to 100% accuracy on real data. Why is a near-perfect Naive Bayes score a bad sign rather than a good one? Because it usually points to over-fitting, or to target leakage (the label somehow making its way into the predictors), rather than to genuine predictive power, so this result should be treated with suspicion.
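Given that suspicion, a quick sanity check against target leakage is to confirm that the label column is not among the predictors handed to naiveBayes() (a one-line check; it should print FALSE):

"quality" %in% names(train_down[, -12]) # TRUE would mean the model is shown the answer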

3.2 Modelling: Decision Tree

Next, let’s try to predict our data using a Decision Tree.

# Model Fitting
model_dt <- ctree(quality ~ ., train)
plot(model_dt, type = "simple")

A decision tree, as its name suggests, has several parts:

  • The first, topmost box is the Root, where we begin classifying the product, in this case red wine.

  • The root splits into branches, following certain rules. Each branch ends in a box; boxes that still have connections below them are called Nodes.

  • Nodes that do not split any further, i.e. those at the ends of the tree, are called Leaves, like the leaves on a tree.

A decision tree is a good classification tool for new employees, since they only need to follow the nodes to reach a classification result. From the decision tree above, we can conclude that what really makes a good wine is the alcohol percentage, together with the sulphates in those high-alcohol wines. According to this model, only red wines with an alcohol percentage above 11.5% and sulphates above 0.68 are considered good.
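If the tree were to grow too deep to read comfortably, partykit's ctree_control() lets us demand stronger statistical evidence before each split, yielding a shallower tree (a sketch; 0.99 is an illustrative value, not tuned here):

# raise the significance threshold for splitting to simplify the tree
model_dt_simple <- ctree(quality ~ ., train,
                         control = ctree_control(mincriterion = 0.99))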

# Predict on Test Data
pred_dt <- predict(model_dt, test[, -12])

# Model Evaluation
confusionMatrix(data = pred_dt, reference = test$quality, positive = "Good")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bad Good
##       Bad  271   28
##       Good   5   16
##                                           
##                Accuracy : 0.8969          
##                  95% CI : (0.8582, 0.9279)
##     No Information Rate : 0.8625          
##     P-Value [Acc > NIR] : 0.0402752       
##                                           
##                   Kappa : 0.4428          
##                                           
##  Mcnemar's Test P-Value : 0.0001283       
##                                           
##             Sensitivity : 0.36364         
##             Specificity : 0.98188         
##          Pos Pred Value : 0.76190         
##          Neg Pred Value : 0.90635         
##              Prevalence : 0.13750         
##          Detection Rate : 0.05000         
##    Detection Prevalence : 0.06563         
##       Balanced Accuracy : 0.67276         
##                                           
##        'Positive' Class : Good            
## 

From the model evaluation above we can see that the Decision Tree model has an accuracy of 89.69%. This is lower than the Naive Bayes model’s, but far more plausible.

3.3 Modelling: Random Forest

Next, let’s build a model using the Random Forest method. First, the snippet below animates how k-fold cross-validation, which we will use for tuning, works.

# animation was loaded in section 2.1; this demo visualizes k-fold cross-validation
ani.options(interval = 1, nmax = 15)
cv.ani(main = "Demonstration of the k-fold Cross Validation", 
  bty = "l")

set.seed(100)
ctrl <- trainControl(method="repeatedcv", number=4, repeats=4) # k-fold cross validation

# Summary using Train Data
rf <- train(quality ~ ., data = train, method="rf", trControl = ctrl)
rf
## Random Forest 
## 
## 1279 samples
##   11 predictor
##    2 classes: 'Bad', 'Good' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 4 times) 
## Summary of sample sizes: 959, 960, 960, 958, 959, 959, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8997219  0.4686727
##    6    0.8960036  0.4870670
##   11    0.8922902  0.4811185
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

From the model summary above we can see that the optimal number of variables considered at each split (mtry) is 2. Next, let’s inspect the importance of each variable used in our model with varImp()

varImp(rf)
## rf variable importance
## 
##                      Overall
## alcohol              100.000
## volatile.acidity      55.961
## sulphates             53.751
## density               38.351
## citric.acid           33.722
## total.sulfur.dioxide  25.220
## fixed.acidity         22.607
## chlorides             18.460
## residual.sugar        12.968
## pH                     5.117
## free.sulfur.dioxide    0.000
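By default caret tries only a handful of mtry values. For finer control we could hand train() an explicit tuning grid (a sketch; the candidate values are illustrative, and refitting takes time):

grid <- expand.grid(mtry = c(2, 3, 4, 5)) # candidate numbers of variables per split
rf_tuned <- train(quality ~ ., data = train, method = "rf",
                  trControl = ctrl, tuneGrid = grid)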

Next, let’s test our random forest model using our test data.

# Model Fitting
rf_pred <- predict(rf, test, type = "raw")
rf_prob <- predict(rf, test, type = "prob")

# Model Evaluation
confusionMatrix(data = rf_pred, reference = test$quality, positive = "Good")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bad Good
##       Bad  271   21
##       Good   5   23
##                                           
##                Accuracy : 0.9188          
##                  95% CI : (0.8832, 0.9462)
##     No Information Rate : 0.8625          
##     P-Value [Acc > NIR] : 0.001292        
##                                           
##                   Kappa : 0.5956          
##                                           
##  Mcnemar's Test P-Value : 0.003264        
##                                           
##             Sensitivity : 0.52273         
##             Specificity : 0.98188         
##          Pos Pred Value : 0.82143         
##          Neg Pred Value : 0.92808         
##              Prevalence : 0.13750         
##          Detection Rate : 0.07187         
##    Detection Prevalence : 0.08750         
##       Balanced Accuracy : 0.75231         
##                                           
##        'Positive' Class : Good            
## 

From the model evaluation above we can see that the model acquired using the Random Forest method has an accuracy of 91.88%.
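Note that rf_prob, computed above, is not actually used by confusionMatrix(). One thing the class probabilities enable is moving the decision threshold away from the default 0.5 to trade specificity for sensitivity (a sketch; the 0.3 cutoff is illustrative, not tuned):

# flag a wine as "Good" once its predicted probability of Good exceeds 0.3
rf_pred_30 <- factor(ifelse(rf_prob[, "Good"] > 0.3, "Good", "Bad"),
                     levels = levels(test$quality))
confusionMatrix(data = rf_pred_30, reference = test$quality, positive = "Good")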

4. Conclusion

If we exclude the Naive Bayes model, we can safely conclude that the Random Forest model has an edge over the Decision Tree model: it achieves better accuracy and sensitivity, with equal specificity.
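For reference, the headline metrics reported by confusionMatrix() for each model:

Model           Accuracy   Sensitivity   Specificity
Naive Bayes     0.9938     1.0000        0.9928
Decision Tree   0.8969     0.3636        0.9819
Random Forest   0.9188     0.5227        0.9819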

Therefore, with an overall accuracy of 91.88%, it is safe to say that the Random Forest model is the best one to pick, given the likely over-fitting or leakage behind the Naive Bayes model’s near-perfect score.