Wine is an alcoholic drink made from fermented grapes, famous for its intoxicating effects and delicate flavour. Different grape varieties and yeast strains produce different styles of wine. The biochemical processes that occur during fermentation produce alcohol and other substances that enrich the flavour of the wine and determine its quality.
It is important for wine producers to classify wine quality accurately. Unfortunately, the traditional method (an organoleptic test by a panel of tasters) is quite labor-intensive and time-consuming.
Throughout this article, I will try to develop a classification model to determine wine quality with Logistic Regression and k-NN using the Wine Quality dataset. This dataset was obtained from the UCI Machine Learning Repository and contains the physicochemical characteristics of red and white variants of the Portuguese “Vinho Verde” wine [1]. The purpose of such a model is to help wine producers classify wine more efficiently and improve their productivity.
library(ggplot2) # to plot
library(inspectdf) # to inspect proportion of categorical variable
library(GGally) # to inspect correlation between variables
library(caret) # to perform cross-validation
library(class) # package for knn
library(car) # to inspect multicollinearity
wine <- read.csv("wine.csv")
wine
Column description:

type: wine variant (red/white)
fixed acidity: most acids involved with wine
volatile acidity: amount of acetic acid in wine
citric acid: found in small quantities
residual sugar: amount of sugar remaining after fermentation/production
chlorides: amount of salt in the wine
free sulfur dioxide: free form of SO2; prevents microbial growth and the oxidation of wine
total sulfur dioxide: amount of free and bound forms of SO2
density: the density of the wine (close to that of water), depending on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale of 0-14 (very acidic: 0, very basic: 14); most wines are between 3 and 4 on the pH scale
sulphates: an antimicrobial and antioxidant additive
alcohol: the percent alcohol content of the wine
quality: target variable (based on sensory data, score between 0 and 10)

I wanted to build a model to predict white wine quality only. Therefore, I subsetted the white wine data and removed type for easier data analysis.
# subsetting data
white <- wine[wine$type=="white",]
white <- white[,-1]
# check for missing value
anyNA(white)
## [1] TRUE
colSums(is.na(white))
## fixed.acidity volatile.acidity citric.acid
## 8 7 2
## residual.sugar chlorides free.sulfur.dioxide
## 2 2 0
## total.sulfur.dioxide density pH
## 0 0 7
## sulphates alcohol quality
## 2 0 0
prop.table(table(is.na(white)))*100 # missing value = 0.05%, therefore we can remove them
##
## FALSE TRUE
## 99.94895876 0.05104124
white <- na.omit(white) # removing missing value
Below is the proportion of quality as our target variable:
round(prop.table(table(white$quality)), 3)*100
##
## 3 4 5 6 7 8 9
## 0.4 3.3 29.7 44.9 18.0 3.6 0.1
In this dataset, white wines are classified as poor (0-4), normal (5-6), or excellent (7-10), and most of the observations are labelled normal. Because I was trying to identify wines of excellent quality, I transformed quality into a factor with 2 levels: poor-normal (0) and excellent (1).
white$quality <- as.factor(ifelse(white$quality>6, 1, 0))
summary(white$quality)
## 0 1
## 3816 1054
After that, I inspected the proportion of our target variable using inspect_cat. This function lets us inspect the proportion of the different levels in each categorical variable in our dataset.
inspect_cat(white, show_plot = T)
Well-distributed levels are important in model building because they let our model learn the characteristics of each level evenly. This helps prevent an overfitted or underfitted model.
Unfortunately, the levels in quality are not evenly distributed; they are dominated by poor-normal (78.36%). Because the difference is quite large, an upsampling/downsampling step is needed. I chose to downsample the dataset (with caret's downSample()) because the number of observations for excellent is already sufficiently large (1054).
set.seed(47)
white.d <- white
white.d <- downSample(x = white.d[,-12], y = white.d$quality, list = F, yname = "quality")
inspect_cat(white.d, show_plot = T)
Next, I inspected the data distribution from predictor variables.
# inspect data distribution as a whole
summary(white.d)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.900 Min. :0.0800 Min. :0.0000 Min. : 0.800
## 1st Qu.: 6.300 1st Qu.:0.2000 1st Qu.:0.2800 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 4.500
## Mean : 6.824 Mean :0.2711 Mean :0.3357 Mean : 5.912
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3800 3rd Qu.: 8.800
## Max. :10.700 Max. :0.9650 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 3.00 Min. : 18.0
## 1st Qu.:0.03400 1st Qu.: 25.00 1st Qu.:105.0
## Median :0.04000 Median : 34.00 Median :129.0
## Mean :0.04317 Mean : 35.43 Mean :133.6
## 3rd Qu.:0.04800 3rd Qu.: 45.00 3rd Qu.:158.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol quality
## Min. :0.9871 Min. :2.740 Min. :0.220 Min. : 8.00 0:1054
## 1st Qu.:0.9911 1st Qu.:3.080 1st Qu.:0.410 1st Qu.: 9.60 1:1054
## Median :0.9928 Median :3.180 Median :0.480 Median :10.80
## Mean :0.9934 Mean :3.191 Mean :0.495 Mean :10.84
## 3rd Qu.:0.9954 3rd Qu.:3.290 3rd Qu.:0.560 3rd Qu.:12.00
## Max. :1.0390 Max. :3.820 Max. :1.080 Max. :14.20
From the summary, it can be seen that the variables have different ranges. Therefore, a normalization step was performed so that all variables share a uniform 0-1 range.
# normalize function
normalize <- function(x){
return (
(x - min(x))/(max(x) - min(x))
)}
# normalize our data
white.n <- white.d
white.n[,-12] <- sapply(white.n[,-12], normalize)
summary(white.n)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.3529 1st Qu.:0.1356 1st Qu.:0.1687 1st Qu.:0.01385
## Median :0.4265 Median :0.2034 Median :0.1928 Median :0.05692
## Mean :0.4300 Mean :0.2160 Mean :0.2022 Mean :0.07865
## 3rd Qu.:0.5000 3rd Qu.:0.2712 3rd Qu.:0.2289 3rd Qu.:0.12308
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.06587 1st Qu.:0.07692 1st Qu.:0.2062
## Median :0.08383 Median :0.10839 Median :0.2630
## Mean :0.09331 Mean :0.11338 Mean :0.2739
## 3rd Qu.:0.10778 3rd Qu.:0.14685 3rd Qu.:0.3318
## Max. :1.00000 Max. :1.00000 Max. :1.0000
## density pH sulphates alcohol
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.07692 1st Qu.:0.3148 1st Qu.:0.2209 1st Qu.:0.2581
## Median :0.10970 Median :0.4074 Median :0.3023 Median :0.4516
## Mean :0.12165 Mean :0.4180 Mean :0.3197 Mean :0.4583
## 3rd Qu.:0.15992 3rd Qu.:0.5093 3rd Qu.:0.3953 3rd Qu.:0.6452
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## quality
## 0:1054
## 1:1054
##
##
##
##
I also inspected the distribution of the numeric variables against wine quality. If any variable has a clearly distinct distribution between the two quality classes, I may prefer not to use it for model building, because such a variable may result in a perfect separation condition.
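As a side note, below is a minimal sketch of what perfect separation looks like, using made-up toy data (not the wine dataset): when a predictor splits the classes completely, glm() cannot estimate a finite coefficient and typically warns that fitted probabilities of 0 or 1 occurred.
# toy illustration of perfect separation (hypothetical data, not the wine set)
toy <- data.frame(x = c(1, 2, 3, 10, 11, 12),
                  y = factor(c(0, 0, 0, 1, 1, 1)))
glm(y ~ x, data = toy, family = "binomial") # coefficient of x blows up; R warns about fitted probabilities of 0 or 1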
# function to boxplot numerical variable vs. categorical variable
cat_num <- function(data, x, y){
ggplot(data, aes_string(x = y, y = x)) +
geom_boxplot(aes_string(col = y), outlier.shape = 21, outlier.fill = NULL, show.legend = F) +
theme_classic() + scale_color_brewer(palette = "Set1")}
# box plot
cat_num(white.n, "fixed.acidity", "quality")
cat_num(white.n, "volatile.acidity", "quality")
cat_num(white.n, "citric.acid", "quality")
cat_num(white.n, "residual.sugar", "quality")
cat_num(white.n, "chlorides", "quality")
cat_num(white.n, "free.sulfur.dioxide", "quality")
cat_num(white.n, "total.sulfur.dioxide", "quality")
cat_num(white.n, "density", "quality")
cat_num(white.n, "pH", "quality")
cat_num(white.n, "sulphates", "quality")
cat_num(white.n, "alcohol", "quality")
From the boxplots, alcohol and density have distinct distributions between the poor-normal and excellent classes. Furthermore, the chlorides variable has many outliers that may influence the performance of the model. Therefore, I did not use alcohol, density, and chlorides for building the logistic regression model.
When making a model, it is better to perform a cross-validation scheme so we train and test our model using the data that we have. This allows us to estimate the accuracy, sensitivity, precision, and specificity (model evaluation criteria) of our model before we apply it to real cases in the industry.
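For reference, these evaluation criteria are all computed from the confusion matrix. Below is a minimal sketch (generic textbook definitions, with a hypothetical helper name) of how they relate to true/false positives and negatives:
# hypothetical helper: evaluation criteria from the counts of a 2x2 confusion matrix
eval_metrics <- function(tp, tn, fp, fn){
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn), # proportion of all correct predictions
    sensitivity = tp / (tp + fn),                  # correctly predicted positives / actual positives
    specificity = tn / (tn + fp),                  # correctly predicted negatives / actual negatives
    precision   = tp / (tp + fp))                  # correctly predicted positives / predicted positives
}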
These are the main steps of the cross-validation scheme:
First, I split the dataset into a train set and a test set.
set.seed(47)
train <- sample(nrow(white.n), nrow(white.n)*0.8) # indices for sampling 80% of the white.n data
wine.train <- white.n[train,] # 80% of the white.n data
wine.test <- white.n[-train,] # the remaining 20% of the white.n data
wine.train.label <- white.n[train,12] # taking `quality` label for train set
wine.test.label <- white.n[-train,12] # taking `quality` label for test set
I built a logistic regression model with automatic feature selection using backward elimination and picked the model with the lowest AIC. AIC estimates the relative amount of information lost by a given model; the less information a model loses, the higher the quality of that model.
When picking the variables for backward elimination, I also inspected the correlation between the variables. It is important that the predictors are not strongly correlated with one another, because the absence of multicollinearity is one of the assumptions of logistic regression.
ggcorr(wine.train[,-c(5,8,11)], hjust=1, layout.exp = 1, label = T, label_size = 2.9)
Based on the correlation matrix, free.sulfur.dioxide and total.sulfur.dioxide are strongly correlated. Therefore, I kept only one of them for model building.
log <- glm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
free.sulfur.dioxide + pH + sulphates, wine.train, family = "binomial")
step(log,direction = "backward")
## Start: AIC=2279.13
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## free.sulfur.dioxide + pH + sulphates
##
## Df Deviance AIC
## - sulphates 1 2263.2 2277.2
## <none> 2263.1 2279.1
## - free.sulfur.dioxide 1 2265.2 2279.2
## - fixed.acidity 1 2265.8 2279.8
## - citric.acid 1 2269.7 2283.7
## - residual.sugar 1 2271.6 2285.6
## - pH 1 2273.8 2287.8
## - volatile.acidity 1 2279.3 2293.3
##
## Step: AIC=2277.23
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## free.sulfur.dioxide + pH
##
## Df Deviance AIC
## <none> 2263.2 2277.2
## - free.sulfur.dioxide 1 2265.4 2277.4
## - fixed.acidity 1 2266.0 2278.0
## - citric.acid 1 2269.8 2281.8
## - residual.sugar 1 2271.6 2283.6
## - pH 1 2273.9 2285.9
## - volatile.acidity 1 2279.3 2291.3
##
## Call: glm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + free.sulfur.dioxide + pH, family = "binomial",
## data = wine.train)
##
## Coefficients:
## (Intercept) fixed.acidity volatile.acidity
## 0.9817 -0.7946 -1.9414
## citric.acid residual.sugar free.sulfur.dioxide
## -2.1285 -2.0972 -1.4278
## pH
## 1.3232
##
## Degrees of Freedom: 1685 Total (i.e. Null); 1679 Residual
## Null Deviance: 2337
## Residual Deviance: 2263 AIC: 2277
step <- glm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
residual.sugar + free.sulfur.dioxide + pH, family = "binomial",
data = wine.train)
We can interpret our model based on its coefficients. A coefficient in logistic regression stands for a log of odds ratio; we can get the odds ratio for each variable by using exp(coef()).
data.frame(coefficient = round(coef(step),4),
odds_ratio = round(exp(coef(step)),4))
In multivariate logistic regression, each exponentiated coefficient (exp(coef), shown as odds_ratio in the table) is the multiplicative change in the odds for a one-unit increase in the corresponding predictor variable, holding the other variables constant.
For example, from the logistic regression model we can say: holding all other variables constant, a one-unit increase in pH multiplies the odds of a white wine being classified as excellent quality by about 3.76 (an increase). The same interpretation applies to all other variables.
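As a quick sanity check of that figure, we can exponentiate the pH coefficient reported by step() above (1.3232):
exp(coef(step)["pH"]) # exp(1.3232) is roughly 3.76, matching the odds_ratio in the table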
From the table, we can also see that a positive coefficient gives an odds_ratio > 1, while a negative coefficient gives an odds_ratio < 1. These results show that:
a positive coefficient describes a positive relationship between a predictor variable and the odds of our target variable.
a negative coefficient describes a negative relationship between a predictor variable and the odds of our target variable.
We can summarize that an increase in pH will increase the odds of a white wine being classified as excellent quality, while increases in the other predictors will decrease those odds.
The k-Nearest Neighbour (k-NN) model uses information about an example's k nearest neighbours to classify unlabeled examples. The model requires, as training data, a dataset of examples that have already been classified into categories. For each unlabeled observation in the test dataset, k-NN identifies the k records in the training data that are the "nearest" in similarity, and the unlabeled observation is assigned the majority class of those k nearest neighbours.
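As a small sketch of what "nearest in similarity" means here (Euclidean distance on the normalized predictors; an illustration only, not part of the modelling pipeline):
# Euclidean distance from the first test observation to every training observation
test_pt <- unlist(wine.test[1, -12])             # one test row, 11 normalized predictors
train_m <- as.matrix(wine.train[, -12])          # all training predictors
d <- sqrt(rowSums(sweep(train_m, 2, test_pt)^2)) # distance to each training row
head(sort(d), 5)                                 # the 5 closest training neighbours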
In finding the optimum k, I used a common strategy to calculate the square root of the number of training examples.
round(sqrt(nrow(wine.train))) # finding optimum `k`
## [1] 41
wine_pred <- knn(train = wine.train[,-12],
test=wine.test[,-12],
cl=wine.train.label,
k=41) # k-NN prediction
Prior to the model evaluation, I predicted the values of the test set using each model. Then, I evaluated the models using confusionMatrix, which compares the predicted values against the actual values from the dataset. In addition, I also checked the multicollinearity of the logistic regression model.
vif(step) # check for multicollinearity
## fixed.acidity volatile.acidity citric.acid
## 1.286141 1.060149 1.110854
## residual.sugar free.sulfur.dioxide pH
## 1.153259 1.106729 1.286704
wine.test$pred <- predict(step, wine.test, type = "response") # predict the value
hist(wine.test$pred, breaks=30) # inspect the distribution of predicted value
abline(v = 0.53, col="blue") # adding abline where the distribution separate
wine.test$pred.label <- ifelse(wine.test$pred>0.53, 1, 0)
confusionMatrix(as.factor(wine.test$pred.label), wine.test.label, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 147 97
## 1 71 107
##
## Accuracy : 0.6019
## 95% CI : (0.5534, 0.6489)
## No Information Rate : 0.5166
## P-Value [Acc > NIR] : 0.0002586
##
## Kappa : 0.1996
##
## Mcnemar's Test P-Value : 0.0537567
##
## Sensitivity : 0.5245
## Specificity : 0.6743
## Pos Pred Value : 0.6011
## Neg Pred Value : 0.6025
## Prevalence : 0.4834
## Detection Rate : 0.2536
## Detection Prevalence : 0.4218
## Balanced Accuracy : 0.5994
##
## 'Positive' Class : 1
##
confusionMatrix(as.factor(wine_pred), as.factor(wine.test.label), "1") # k-NN model evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 146 28
## 1 72 176
##
## Accuracy : 0.763
## 95% CI : (0.7195, 0.8028)
## No Information Rate : 0.5166
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5288
##
## Mcnemar's Test P-Value : 0.00001708
##
## Sensitivity : 0.8627
## Specificity : 0.6697
## Pos Pred Value : 0.7097
## Neg Pred Value : 0.8391
## Prevalence : 0.4834
## Detection Rate : 0.4171
## Detection Prevalence : 0.5877
## Balanced Accuracy : 0.7662
##
## 'Positive' Class : 1
##
The accuracy and sensitivity of the logistic regression model are 0.6019 and 0.5245, respectively. Meanwhile, the accuracy and sensitivity of the k-NN model are 0.763 and 0.8627.
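For a side-by-side view, the headline metrics reported by the two confusionMatrix outputs above can be collected into a small table:
# headline metrics copied from the two confusionMatrix outputs above
data.frame(model = c("Logistic Regression", "k-NN"),
           accuracy = c(0.6019, 0.7630),
           sensitivity = c(0.5245, 0.8627),
           specificity = c(0.6743, 0.6697))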
In wine classification, it is important for a model to have high accuracy, i.e., to correctly predict both excellent and poor-normal quality wines and avoid misclassification. A model with high sensitivity is also preferable, because sensitivity describes the ability of the model to correctly identify the positive class out of all truly positive cases.
From the evaluation above, the k-NN model gave higher scores for both accuracy and sensitivity, making it the more appropriate model for predicting wine quality. In addition, logistic regression may not be the best method for this data because several variables ended up unused in order to satisfy the model's assumptions.
Although k-NN was picked as the better model in this study, it only achieved an accuracy of 0.76 and a sensitivity of 0.86. It is always possible to obtain higher accuracy by using other classification models and/or additional data wrangling steps. It would be even better to build models with several different classification methods and compare them to find the best one for our data.
Thank you for reading, happy learning, and keep improving!
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.