Case

Wine is an alcoholic drink made from fermented grapes, famous for its intoxicating effects and delicate flavour. Different varieties of grapes and strains of yeast produce different styles of wine. The biochemical processes that occur during fermentation produce alcohol along with other substances that enrich the flavour of the wine and determine its quality.



It is important for wine producers to classify wine quality accurately. Unfortunately, the traditional method (an organoleptic test performed by a panel of tasters) is labor-intensive and time-consuming.

In this article, I will develop classification models to determine wine quality using Logistic Regression and k-NN on the Wine Quality dataset. The dataset was obtained from the UCI Machine Learning Repository and contains the physicochemical characteristics of red and white variants of the Portuguese “Vinho Verde” wine1. The purpose of such a model is to help wine producers classify wine more efficiently and improve their productivity.

Solutions

1. Load Library

library(ggplot2) # to plot
library(inspectdf) # to inspect proportion of categorical variable
library(GGally) # to inspect correlation between variables
library(caret) # to perform cross-validation
library(class) # package for knn
library(car) # to inspect multicollinearity

2. Load Data

wine <- read.csv("wine.csv")
wine

Column description:

  • type : wine variants (red/white)
  • fixed acidity: most acids involved with wine
  • volatile acidity: amount of acetic acid in wine
  • citric acid : found in small quantities
  • residual sugar: amount of sugar remaining after wine fermentation/production
  • chlorides: amount of salt in the wine
  • free sulfur dioxide: free form of SO2; prevents microbial growth and the oxidation of wine
  • total sulfur dioxide: amount of free and bound forms of SO2
  • density: the density of the wine, which depends on the percent alcohol and sugar content
  • pH: describes how acidic or basic a wine is on a scale 0-14 (very acidic: 0, very basic: 14); most wines are between 3-4 on the pH scale
  • sulphates: an antimicrobial and antioxidant
  • alcohol: the percent alcohol content of the wine
  • quality : target variable (based on sensory data, score between 0 and 10)

3. Data Wrangling

I wanted to build a model for white wine only. Therefore, I subsetted the white wine data and removed the type column to simplify the analysis.

# subsetting data
white <- wine[wine$type=="white",]
white <- white[,-1]

# check for missing value
anyNA(white)
## [1] TRUE
colSums(is.na(white))
##        fixed.acidity     volatile.acidity          citric.acid 
##                    8                    7                    2 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    2                    2                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    7 
##            sulphates              alcohol              quality 
##                    2                    0                    0
prop.table(table(is.na(white)))*100 # missing value = 0.05%, therefore we can remove them
## 
##       FALSE        TRUE 
## 99.94895876  0.05104124
white <- na.omit(white) # removing missing value

Below is the proportion of quality as our target variable:

round(prop.table(table(white$quality)), 3)*100
## 
##    3    4    5    6    7    8    9 
##  0.4  3.3 29.7 44.9 18.0  3.6  0.1

In this dataset, white wines can be grouped as poor (0-4), normal (5-6), or excellent (7-10), and most of the observations fall into the normal group. Because I was trying to identify wines of excellent quality, I transformed quality into a factor with 2 levels: poor-normal (0) and excellent (1).

white$quality <- as.factor(ifelse(white$quality>6, 1, 0))
summary(white$quality)
##    0    1 
## 3816 1054

After that, I inspected the proportion of our target variable using inspect_cat. This function lets us inspect the proportion of each level in every categorical variable in our dataset.

inspect_cat(white, show_plot = T)

A balanced class distribution is important in model building because it lets the model learn the characteristics of each level evenly. This helps prevent an overfitted or underfitted model.

Unfortunately, the levels in quality are not evenly distributed and are dominated by poor-normal (78.36%). Because the imbalance is quite large, an upsampling/downsampling step is needed. I chose to downSample the dataset because the number of observations for excellent is already high (1054).

set.seed(47)
white.d <- white
white.d <- downSample(x = white.d[,-12], y = white.d$quality, list = F, yname = "quality")
inspect_cat(white.d, show_plot = T)

Next, I inspected the data distribution of the predictor variables.

# inspect data distribution as a whole
summary(white.d)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.900   Min.   :0.0800   Min.   :0.0000   Min.   : 0.800  
##  1st Qu.: 6.300   1st Qu.:0.2000   1st Qu.:0.2800   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 4.500  
##  Mean   : 6.824   Mean   :0.2711   Mean   :0.3357   Mean   : 5.912  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3800   3rd Qu.: 8.800  
##  Max.   :10.700   Max.   :0.9650   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   :  3.00      Min.   : 18.0       
##  1st Qu.:0.03400   1st Qu.: 25.00      1st Qu.:105.0       
##  Median :0.04000   Median : 34.00      Median :129.0       
##  Mean   :0.04317   Mean   : 35.43      Mean   :133.6       
##  3rd Qu.:0.04800   3rd Qu.: 45.00      3rd Qu.:158.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates        alcohol      quality 
##  Min.   :0.9871   Min.   :2.740   Min.   :0.220   Min.   : 8.00   0:1054  
##  1st Qu.:0.9911   1st Qu.:3.080   1st Qu.:0.410   1st Qu.: 9.60   1:1054  
##  Median :0.9928   Median :3.180   Median :0.480   Median :10.80           
##  Mean   :0.9934   Mean   :3.191   Mean   :0.495   Mean   :10.84           
##  3rd Qu.:0.9954   3rd Qu.:3.290   3rd Qu.:0.560   3rd Qu.:12.00           
##  Max.   :1.0390   Max.   :3.820   Max.   :1.080   Max.   :14.20

From the summary, it can be seen that the variables have very different ranges. Therefore, a normalization step was performed so that all variables share a uniform 0-1 range.

# normalize function
normalize <- function(x){
  return ( 
    (x - min(x))/(max(x) - min(x)) 
  )}

# normalize our data
white.n <- white.d
white.n[,-12] <- sapply(white.n[,-12], normalize)
summary(white.n)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.3529   1st Qu.:0.1356   1st Qu.:0.1687   1st Qu.:0.01385  
##  Median :0.4265   Median :0.2034   Median :0.1928   Median :0.05692  
##  Mean   :0.4300   Mean   :0.2160   Mean   :0.2022   Mean   :0.07865  
##  3rd Qu.:0.5000   3rd Qu.:0.2712   3rd Qu.:0.2289   3rd Qu.:0.12308  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00000   Min.   :0.00000     Min.   :0.0000      
##  1st Qu.:0.06587   1st Qu.:0.07692     1st Qu.:0.2062      
##  Median :0.08383   Median :0.10839     Median :0.2630      
##  Mean   :0.09331   Mean   :0.11338     Mean   :0.2739      
##  3rd Qu.:0.10778   3rd Qu.:0.14685     3rd Qu.:0.3318      
##  Max.   :1.00000   Max.   :1.00000     Max.   :1.0000      
##     density              pH           sulphates         alcohol      
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.07692   1st Qu.:0.3148   1st Qu.:0.2209   1st Qu.:0.2581  
##  Median :0.10970   Median :0.4074   Median :0.3023   Median :0.4516  
##  Mean   :0.12165   Mean   :0.4180   Mean   :0.3197   Mean   :0.4583  
##  3rd Qu.:0.15992   3rd Qu.:0.5093   3rd Qu.:0.3953   3rd Qu.:0.6452  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  quality 
##  0:1054  
##  1:1054  
##          
##          
##          
## 

I also inspected the distribution of the numeric variables against wine quality. If any variable has a clearly distinct distribution between the two quality classes, it may be better not to use it for model building, because such a variable can lead to a (quasi-)perfect separation condition.

# function to boxplot numerical variable vs. categorical variable
cat_num <- function(data, x, y){
    ggplot(data, aes_string(x = y, y = x)) +
    geom_boxplot(aes_string(col = y), outlier.shape = 21, outlier.fill = NULL, show.legend = F) +
    theme_classic() + scale_color_brewer(palette = "Set1")}

# box plot
cat_num(white.n, "fixed.acidity", "quality")
cat_num(white.n, "volatile.acidity", "quality")
cat_num(white.n, "citric.acid", "quality")
cat_num(white.n, "residual.sugar", "quality")
cat_num(white.n, "chlorides", "quality")
cat_num(white.n, "free.sulfur.dioxide", "quality")
cat_num(white.n, "total.sulfur.dioxide", "quality")
cat_num(white.n, "density", "quality")
cat_num(white.n, "pH", "quality")
cat_num(white.n, "sulphates", "quality")
cat_num(white.n, "alcohol", "quality")

From the boxplots, alcohol and density have distinct distributions between the poor-normal and excellent classes. Furthermore, the chlorides variable has many outliers that may influence the performance of the model. Therefore, I did not use alcohol, density, and chlorides for logistic regression model building.

4. Cross-Validation

When building a model, it is good practice to use a cross-validation scheme to train and test the model on the data that we have. This lets us estimate the accuracy, sensitivity, precision, and specificity (model evaluation criteria) of our model before we apply it to real cases in the industry.

These are the main steps of cross-validation scheme:

  • Split our dataset into train and test sets
  • Develop the initial model using our train set
  • Evaluate model on test set, returning to the previous step if necessary for model tuning
  • Pick a final model based on an evaluation criteria

First, I split the dataset into a train set and a test set.

set.seed(47)
train <- sample(nrow(white.n), nrow(white.n)*0.8) # sample row indices for 80% of the white.n data
wine.train <- white.n[train,] # consists of 80% of the white.n data
wine.test <- white.n[-train,] # consists of the remaining 20% of the data

wine.train.label <- white.n[train,12] # taking `quality` label for train set
wine.test.label <- white.n[-train,12] # taking `quality` label for test set

5. Logistic Regression Model

I built a logistic regression model with automatic feature selection using backward elimination, and picked the model with the lowest AIC. AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher its quality.

Before picking the variables for backward elimination, I also inspected the correlation between the predictors. It is important that the predictors are not strongly correlated with one another, because the absence of multicollinearity is one of the assumptions of logistic regression.

ggcorr(wine.train[,-c(5,8,11)], hjust=1, layout.exp = 1, label = T, label_size = 2.9)

Based on the correlation matrix, free.sulfur.dioxide and total.sulfur.dioxide are strongly correlated. Therefore, I picked only one of them for model building.

log <- glm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
    free.sulfur.dioxide + pH + sulphates, wine.train, family = "binomial")
step(log,direction = "backward")
## Start:  AIC=2279.13
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     free.sulfur.dioxide + pH + sulphates
## 
##                       Df Deviance    AIC
## - sulphates            1   2263.2 2277.2
## <none>                     2263.1 2279.1
## - free.sulfur.dioxide  1   2265.2 2279.2
## - fixed.acidity        1   2265.8 2279.8
## - citric.acid          1   2269.7 2283.7
## - residual.sugar       1   2271.6 2285.6
## - pH                   1   2273.8 2287.8
## - volatile.acidity     1   2279.3 2293.3
## 
## Step:  AIC=2277.23
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     free.sulfur.dioxide + pH
## 
##                       Df Deviance    AIC
## <none>                     2263.2 2277.2
## - free.sulfur.dioxide  1   2265.4 2277.4
## - fixed.acidity        1   2266.0 2278.0
## - citric.acid          1   2269.8 2281.8
## - residual.sugar       1   2271.6 2283.6
## - pH                   1   2273.9 2285.9
## - volatile.acidity     1   2279.3 2291.3
## 
## Call:  glm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + free.sulfur.dioxide + pH, family = "binomial", 
##     data = wine.train)
## 
## Coefficients:
##         (Intercept)        fixed.acidity     volatile.acidity  
##              0.9817              -0.7946              -1.9414  
##         citric.acid       residual.sugar  free.sulfur.dioxide  
##             -2.1285              -2.0972              -1.4278  
##                  pH  
##              1.3232  
## 
## Degrees of Freedom: 1685 Total (i.e. Null);  1679 Residual
## Null Deviance:       2337 
## Residual Deviance: 2263  AIC: 2277
step <- glm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
    residual.sugar + free.sulfur.dioxide + pH, family = "binomial", 
    data = wine.train)
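
As a quick sanity check on the AIC values reported by step() above, here is a minimal sketch (using the step model we just refitted) that recovers AIC from its definition, AIC = 2k - 2*logLik, where k is the number of estimated parameters:

k <- length(coef(step))               # number of estimated parameters (intercept + 6 predictors)
AIC(step)                             # AIC as computed by R (~2277, matching the step() output above)
2 * k - 2 * as.numeric(logLik(step))  # the same value from the definition AIC = 2k - 2*logLik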

We can interpret our model based on its coefficients. A coefficient in logistic regression represents a log of odds ratio. We can obtain the odds ratio for each variable by using exp(coef()).

data.frame(coefficient = round(coef(step),4),
           odds_ratio = round(exp(coef(step)),4))

In multivariate logistic regression, each exponentiated coefficient (exp(coef), shown in the table as odds_ratio) is the multiplicative change in odds for a one-unit increase in the corresponding predictor variable, holding the other variables constant.

For example, from the logistic regression model we can say that, holding all other variables constant, a one-unit increase in pH multiplies the odds of a white wine being classified as excellent by 3.76. The same interpretation applies to all other variables.

From the table, we can also see that a positive coefficient gives an odds_ratio greater than 1, while a negative coefficient gives an odds_ratio less than 1. In other words:

  • a positive coefficient describes a positive relationship between a predictor variable and the odds of our target variable
  • a negative coefficient describes a negative relationship between a predictor variable and the odds of our target variable

In summary, an increase in pH increases the odds of a white wine being classified as excellent quality, while increases in the other variables decrease the odds.
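
As a quick numeric check of this interpretation (a sketch using the fitted step model), the odds ratio for pH can be recovered directly from its coefficient. Note that because the predictors were normalized to a 0-1 range, a "one-unit" increase in pH here spans the whole observed pH range.

exp(coef(step)["pH"]) # ~3.76: a one-unit increase in (normalized) pH multiplies the odds of excellent quality by ~3.76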

6. k-NN Model

The k-Nearest Neighbors (k-NN) model uses information about an example’s k nearest neighbors to classify unlabeled examples. The model requires a training dataset made up of examples that have already been classified into categories. Then, for each unlabeled example in the test dataset, k-NN identifies the k records in the training data that are the “nearest” in similarity, and the unlabeled example is assigned the majority class of those k nearest neighbors.
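
To make the majority-vote idea concrete, below is a minimal by-hand sketch (not part of the modelling workflow) that classifies a single test row with an arbitrary k of 5, assuming the wine.train, wine.test, and wine.train.label objects created in the cross-validation step:

x <- unlist(wine.test[1, -12])                                      # predictors of one unlabeled example
d <- apply(wine.train[, -12], 1, function(r) sqrt(sum((r - x)^2)))  # Euclidean distance to every training row
nearest <- order(d)[1:5]                                            # the 5 closest training rows
table(wine.train.label[nearest])                                    # the majority class becomes the prediction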

To find a reasonable k, I used a common rule of thumb: the square root of the number of training examples.

round(sqrt(nrow(wine.train))) # finding optimum `k`
## [1] 41
wine_pred <- knn(train = wine.train[,-12],
                 test=wine.test[,-12],
                 cl=wine.train.label,
                 k=41) # k-NN prediction
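
Since the square-root rule is only a heuristic, it can be worth comparing a few values of k around it. Below is a small sketch of such a check; the exact accuracies will depend on the seed and the train/test split:

# compare a few k values around sqrt(n) on the test set
for (k in c(21, 31, 41, 51, 61)) {
  pred_k <- knn(train = wine.train[,-12], test = wine.test[,-12],
                cl = wine.train.label, k = k)
  cat("k =", k, " accuracy =", round(mean(pred_k == wine.test.label), 3), "\n")
}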

7. Model Evaluation

Prior to the model evaluation, I predicted the values of our test set using each model. I then evaluated the models using confusionMatrix, which compares the predicted values against the actual values from the dataset. In addition, I also checked the logistic regression model for multicollinearity.

vif(step) # check for multicollinearity
##       fixed.acidity    volatile.acidity         citric.acid 
##            1.286141            1.060149            1.110854 
##      residual.sugar free.sulfur.dioxide                  pH 
##            1.153259            1.106729            1.286704
wine.test$pred <- predict(step, wine.test, type = "response") # predict the value
hist(wine.test$pred, breaks=30) # inspect the distribution of predicted value
abline(v = 0.53, col="blue") # adding abline where the distribution separate

wine.test$pred.label <- ifelse(wine.test$pred>0.53, 1, 0)
confusionMatrix(as.factor(wine.test$pred.label), wine.test.label, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 147  97
##          1  71 107
##                                           
##                Accuracy : 0.6019          
##                  95% CI : (0.5534, 0.6489)
##     No Information Rate : 0.5166          
##     P-Value [Acc > NIR] : 0.0002586       
##                                           
##                   Kappa : 0.1996          
##                                           
##  Mcnemar's Test P-Value : 0.0537567       
##                                           
##             Sensitivity : 0.5245          
##             Specificity : 0.6743          
##          Pos Pred Value : 0.6011          
##          Neg Pred Value : 0.6025          
##              Prevalence : 0.4834          
##          Detection Rate : 0.2536          
##    Detection Prevalence : 0.4218          
##       Balanced Accuracy : 0.5994          
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(as.factor(wine_pred), as.factor(wine.test.label), "1") # k-NN model evaluation
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 146  28
##          1  72 176
##                                                
##                Accuracy : 0.763                
##                  95% CI : (0.7195, 0.8028)     
##     No Information Rate : 0.5166               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5288               
##                                                
##  Mcnemar's Test P-Value : 0.00001708           
##                                                
##             Sensitivity : 0.8627               
##             Specificity : 0.6697               
##          Pos Pred Value : 0.7097               
##          Neg Pred Value : 0.8391               
##              Prevalence : 0.4834               
##          Detection Rate : 0.4171               
##    Detection Prevalence : 0.5877               
##       Balanced Accuracy : 0.7662               
##                                                
##        'Positive' Class : 1                    
## 

The accuracy and sensitivity of the logistic regression model are 0.6019 and 0.5245, respectively. Meanwhile, the accuracy and sensitivity of the k-NN model are 0.763 and 0.8627.

In wine classification, it is important for a model to have high accuracy so that it correctly predicts both excellent and poor-normal wines and avoids misclassification. A model with high sensitivity is also preferable, because sensitivity describes the ability of a model to correctly identify the positive class out of all truly positive observations.
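
To make these definitions concrete, here is a minimal sketch that recomputes the k-NN metrics by hand from the confusion matrix counts printed above (positive class = 1):

TP <- 176; FN <- 28; TN <- 146; FP <- 72         # counts taken from the k-NN confusion matrix
c(accuracy    = (TP + TN) / (TP + TN + FP + FN), # ~0.763
  sensitivity = TP / (TP + FN),                  # ~0.863: true positives out of all real positives
  specificity = TN / (TN + FP))                  # ~0.670: true negatives out of all real negatives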

From the evaluation above, the k-NN model gave higher scores in both accuracy and sensitivity, making it the more appropriate model for predicting wine quality. In addition, logistic regression may not be the best method for this data because several variables had to be left out to satisfy the model’s assumptions.

Although k-NN was picked as the best model in this study, it only achieved an accuracy of 0.76 and a sensitivity of 0.86. It is always possible to obtain higher accuracy by using other classification models and/or additional data wrangling steps. It is even better to build models using several different classification methods and compare them to find the best one for our data.

Thank you for reading, happy learning, and keep improving!


  1. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.