In this article we will try to do regression to analyze the quality of wine based on this kaggle dataset. These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones).

Background

Objective

We’ll try to build linear regression model and analyze which variables are significant in predicting the quality of wine and how well the variable describe the quality. We will also evaluate the model with statistical approach to determine the quality of our model.

libraries

You can load the package into your workspace using the library() function

library(dplyr)
library(data.table)
library(ggplot2)
library(rsample)
library(ggpubr)
library(GGally)
library(caret)
library(lmtest)
library(car)

Let’s begin

Data Import

red.df <- read.csv("winequality-red.csv", sep = ";")
white.df <- read.csv("winequality-white.csv", sep = ";")

glimpse(red.df)

## Observations: 1,599
## Variables: 12
## $ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, ...
## $ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.60...
## $ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00...
## $ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6...
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.06...
## $ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17,...
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 1...
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978...
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39...
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47...
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, ...
## $ quality              <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5...

glimpse(white.df)

## Observations: 4,898
## Variables: 12
## $ fixed.acidity        <dbl> 7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0, 6.3, 8...
## $ volatile.acidity     <dbl> 0.27, 0.30, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27...
## $ citric.acid          <dbl> 0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16, 0.36...
## $ residual.sugar       <dbl> 20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.00, 20....
## $ chlorides            <dbl> 0.045, 0.049, 0.050, 0.058, 0.058, 0.050, 0.04...
## $ free.sulfur.dioxide  <dbl> 45, 14, 30, 47, 47, 30, 30, 45, 14, 28, 11, 17...
## $ total.sulfur.dioxide <dbl> 170, 132, 97, 186, 186, 97, 136, 170, 132, 129...
## $ density              <dbl> 1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951...
## $ pH                   <dbl> 3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00...
## $ sulphates            <dbl> 0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.47, 0.45...
## $ alcohol              <dbl> 8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5,...
## $ quality              <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 7, 5, 7...

The dataset contain white and red win, both have same exact column. In this case i’ll only analyze the white ones. the only reason why i pick white wine is because white dataset has more data (rows) than red wine.

Here is some description about the data: - fixed.acidity = This column is most acids involved with wine or fixed or nonvolatile. (Advice: Do not evaporate readily)
- volatile acidity = The amount of acetic acid in wine which at this column is too high of levels can lead to an unpleasant. This means is wine contains the high vinegar taste.
- citric acid = When found in small quantities, citric acid can add ‘freshness’ and ‘flavor’ to wines.
- residual sugar = The amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 gram/liter. Wines with greater than 45 grams/liter are considered sweet.
- chlorides = This column gives the amount of salt in the wine.
- free sulfur dioxide = The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
- total sulfur dioxide= Amount of free and bound forms of S02. This amount in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in flavour of smell and taste of wine.
- density = This column gives the density. The density of water is close to that of water depending on the percent of sugar and alcohol content.
- pH = This column gives the amount of acid or wine basic. Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic). Most of wines are between 3-4 on the pH scale.
- sulphates = This column gives the amount of sulphates of contains in wine. A wine additive which can contribute to sulphur dioxide gas (S02) levels, which acts as an antimicrobial and antioxidant.
- alcoholThis = column gives the amount the percent content of alcohol in the wine
- quality = This column gives the quality of the wine. Output variable; based on sensory data, scores between 0 and 10. This means, 0 is poor quality and 10 is high quality.

Explanatory Data Analysis

Correlation check

We will use ggcorr() function from GGally packages. ggcorr is a function for making a correlation matrix plot.

ggcorr(white.df, label=T, layout.exp=2, hjust= 1)

white wine data have low correlation to each other except residual sugar to density and alcohol to density.

Dealing with outlier

# lets see if our target variable has outlier
# we need to scale the data first because all data have different numeric scale
boxplot(scale(white.df))

from the boxplots above we can see that citric.acid, residual.sugar, free.sulfur.dioxide, and density have extreme value of data or what we usually said as Outlier. let’s see which rows contain outlier.

# first we draw boxplot from that specific column, then after we know in which range the outlier happens, subset the data to see the outlier's row
# apply this to every suspected column
boxplot(white.df$citric.acid)

white.df[which(white.df$citric.acid > 1.2),]

boxplot(white.df$residual.sugar)

white.df[which(white.df$residual.sugar > 40),]

boxplot(white.df$free.sulfur.dioxide)

white.df[which(white.df$free.sulfur.dioxide > 200),]

boxplot(white.df$density)

white.df[which(white.df$density > 1.02),]

remove row with outlier

white.df <- white.df[-c(746, 3153, 2782, 4746),]

Modeling

Splitting

# split to target & train data

split.df.1 <- initial_split(white.df, prop = 0.8, strata = "quality")
train.1 <- training(split.df.1)
test.1 <- testing(split.df.1)

Linear Regression

# build a simple linear regression model with all existing variables
mod.1 <- lm(quality~., train.1)
summary(mod.1)

## 
## Call:
## lm(formula = quality ~ ., data = train.1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5840 -0.4964 -0.0427  0.4627  3.1160 
## 
## Coefficients:
##                          Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)           190.8337011   25.0687730   7.612   0.0000000000000335 ***
## fixed.acidity           0.0974334    0.0251489   3.874             0.000109 ***
## volatile.acidity       -2.0205297    0.1273771 -15.863 < 0.0000000000000002 ***
## citric.acid            -0.0501521    0.1090475  -0.460             0.645606    
## residual.sugar          0.0958231    0.0094907  10.097 < 0.0000000000000002 ***
## chlorides              -0.0173835    0.6013370  -0.029             0.976939    
## free.sulfur.dioxide     0.0044269    0.0009696   4.566   0.0000051285822214 ***
## total.sulfur.dioxide   -0.0001228    0.0004272  -0.288             0.773736    
## density              -191.2163134   25.4027849  -7.527   0.0000000000000639 ***
## pH                      0.7533606    0.1246299   6.045   0.0000000016362551 ***
## sulphates               0.7002043    0.1144924   6.116   0.0000000010555660 ***
## alcohol                 0.1459350    0.0317373   4.598   0.0000043960991445 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7466 on 3906 degrees of freedom
## Multiple R-squared:  0.2897, Adjusted R-squared:  0.2877 
## F-statistic: 144.8 on 11 and 3906 DF,  p-value: < 0.00000000000000022

The summary above presents a lot of metrics but for now, we only take a look at Pr(>|t|). Variable with Pr(>|t|) > 0.05 mean that variables doesn’t have any significant value to our model, thus we’re looking for variables with Pr(>|t|) below 0.05. It’save to remove variable that doesn’t meet our criteria. We can identify better model by looking at R-squared. Let’s build second model with selected variables only. To do that, we can use stepwise.

Stepwise regression is a set of methods to fit regression models where the choice of predictive variables is carried out by an automatic procedure. Stepwise will train model with every posible combination of variables with the lowest AIC. AIC are not represented in summary() but i can say that AIC demonstrated how we can estimate the amount of “information loss” from one model to another.

mod.2 <- step(mod.1, direction = "backward", trace = 0)
summary(mod.2)

## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar + 
##     free.sulfur.dioxide + density + pH + sulphates + alcohol, 
##     data = train.1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5714 -0.4987 -0.0417  0.4611  3.1158 
## 
## Coefficients:
##                         Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)          193.9841189   23.6672190   8.196 0.000000000000000333 ***
## fixed.acidity          0.0973566    0.0244673   3.979 0.000070447117108978 ***
## volatile.acidity      -2.0175338    0.1225167 -16.467 < 0.0000000000000002 ***
## residual.sugar         0.0967974    0.0090464  10.700 < 0.0000000000000002 ***
## free.sulfur.dioxide    0.0042333    0.0007792   5.433 0.000000058866013339 ***
## density             -194.4198414   23.9658187  -8.112 0.000000000000000658 ***
## pH                     0.7628269    0.1217508   6.265 0.000000000411951034 ***
## sulphates              0.6963800    0.1141404   6.101 0.000000001155909773 ***
## alcohol                0.1432226    0.0312481   4.583 0.000004717716784313 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7463 on 3909 degrees of freedom
## Multiple R-squared:  0.2896, Adjusted R-squared:  0.2881 
## F-statistic: 199.2 on 8 and 3909 DF,  p-value: < 0.00000000000000022

Now we have the best selected variables according to stepwise. We’ll build new model from the stepwise for later use.

mod.2x <- lm(quality ~ fixed.acidity + volatile.acidity + residual.sugar + free.sulfur.dioxide + density + pH + sulphates + alcohol, data = train.1)

You need to keep in mind, R-squared doesn’t indicate the quality of the model, instead R-squared value tells us how well our model describes the data. It measures the extent to which the variance in our dependent variable can be explained by the independent variables. Apparently, both model only shows low R-squared values, our model only represent 27% variance of data. both model only have slight differences in R-squared (0.2788 > 0.2785), thus it save for us to remove several variable because it still represent variance of our data as much as before.

However, let’s evaluate our model to see the differences between both model

Evaluation

RMSE and MAE

# prediction model with all vaiabel
pred.1 <- predict(mod.1, newdata = test.1)

# prediction model using stepwise
pred.2 <- predict(mod.2x, newdata = test.1)

# calculate MAE and RMSE 
all.var <- data.frame(MAE = MAE(pred.1, test.1$quality), RMSE = RMSE(pred.1, test.1$quality))
with.step <- data.frame(MAE = MAE(pred.2, test.1$quality), RMSE = RMSE(pred.2, test.1$quality))
rbind(all.var, with.step)

Top row represent MAE and RMSE from first model (with all variables) and the bottom represent second model (with stepwise). Our model with stepwise are slightly better than the first one. But we’re dealing with linear regression here, we need more statistical apporach to determine the quality of our model.

Residual plot

A residual plot is a plot that plot the residual values on the y-axis and the fitted values on the x-axis. A residual plot that has points randomly scattered around the x-axis is the one we want. It doesn’t mean that the model is perfect, but it does mean that the regression model you fit appropriately describes the variability in our dependent variable.

plot(train.1$quality, residuals(mod.2x),
     xlab = "quality", ylab = "residuals")
abline(h=0, col="maroon", lwd=2)

We can see a pattern from our residual plot, means our model doesn’t fit appropriately. our model doesn’t meet linearity assumption.

Heteroscedasticity

Heteroscedasticity is a condition where the variability of a variable is unequal across its range of value. In a linear regression model, if the variance of its error is showing unequal variation across the target variable range, it shows that heteroscedasticity is present and the implication to that is related to the previous statement of a non-random pattern in residual.

bptest(mod.2x)

## 
##  studentized Breusch-Pagan test
## 
## data:  mod.2x
## BP = 49.976, df = 8, p-value = 0.00000004131

The test has a p-value below 0.05, therefore we know that the residual of our model is not constant (heteroscedasticity is present)

Normality

The normality assumption means that the residuals from the linear regression model should be normally distributed because we expect to get residuals near the zero value. We will use Shapiro-Wilk test to our residual

shapiro.test(mod.2x$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  mod.2x$residuals
## W = 0.98928, p-value < 0.00000000000000022

The test has a p-value below 0.05, therefore we know that our residuals are not normally distributed

Multicolinearity

we’re using Variance Inflation Factor (VIF) to check multicollinearity among our variables in model. A common rule of thumbs is that a VIF number greater than 10 may indicate high collinearity and worth further inspection,

vif(mod.2x)

##       fixed.acidity    volatile.acidity      residual.sugar free.sulfur.dioxide 
##            3.050271            1.048269           14.498110            1.158508 
##             density                  pH           sulphates             alcohol 
##           34.571740            2.375009            1.165004           10.383438

From the test we know that density, residual.sugar, and alcohol has high colinearity.

conclusion

Sadly our model doesn’t meet any sign of good criteria model based on statistical approach like Normality, Heteroscedasticity, residual plot, and Multicolinearity. The R-squared is also only represent 27% variance of data even after we remove the outlier. Predictive ability of the model measured by RMSE is 0.7355, and MAE is 0.565. We’ve learn that this data are not suitable for linear regresion, i also have assumption why this problem occurs:

Our data doesn’t have any linear relationship between the predictor variables and target exist. The author of dataset also said “Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.)” in the kaggle dataset.
The target variable is actually an ordinal classification task, higher number of quality means better wine quality. However regression can actually solve this problem since we assume the target as continuous numeric.
The quality of wine is not determined by the existing variables in data. Our R-squared only reach 27% variance of data, maybe the other 73% have a greater influence.

Thank you

Wine Quality

jojoecp

4/9/2020