In this article we will try to do regression to analyze the quality of wine based on this kaggle dataset. These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones).
We’ll try to build linear regression model and analyze which variables are significant in predicting the quality of wine and how well the variable describe the quality. We will also evaluate the model with statistical approach to determine the quality of our model.
You can load the package into your workspace using the library()
function
red.df <- read.csv("winequality-red.csv", sep = ";")
white.df <- read.csv("winequality-white.csv", sep = ";")
glimpse(red.df)
## Observations: 1,599
## Variables: 12
## $ fixed.acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, ...
## $ volatile.acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.60...
## $ citric.acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00...
## $ residual.sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6...
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.06...
## $ free.sulfur.dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17,...
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 1...
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978...
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39...
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47...
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, ...
## $ quality <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5...
## Observations: 4,898
## Variables: 12
## $ fixed.acidity <dbl> 7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0, 6.3, 8...
## $ volatile.acidity <dbl> 0.27, 0.30, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27...
## $ citric.acid <dbl> 0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16, 0.36...
## $ residual.sugar <dbl> 20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.00, 20....
## $ chlorides <dbl> 0.045, 0.049, 0.050, 0.058, 0.058, 0.050, 0.04...
## $ free.sulfur.dioxide <dbl> 45, 14, 30, 47, 47, 30, 30, 45, 14, 28, 11, 17...
## $ total.sulfur.dioxide <dbl> 170, 132, 97, 186, 186, 97, 136, 170, 132, 129...
## $ density <dbl> 1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951...
## $ pH <dbl> 3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00...
## $ sulphates <dbl> 0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.47, 0.45...
## $ alcohol <dbl> 8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5,...
## $ quality <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 7, 5, 7...
The dataset contain white and red win, both have same exact column. In this case i’ll only analyze the white ones. the only reason why i pick white wine is because white dataset has more data (rows) than red wine.
Here is some description about the data: - fixed.acidity
= This column is most acids involved with wine or fixed or nonvolatile. (Advice: Do not evaporate readily)
- volatile acidity
= The amount of acetic acid in wine which at this column is too high of levels can lead to an unpleasant. This means is wine contains the high vinegar taste.
- citric acid
= When found in small quantities, citric acid can add ‘freshness’ and ‘flavor’ to wines.
- residual sugar
= The amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 gram/liter. Wines with greater than 45 grams/liter are considered sweet.
- chlorides
= This column gives the amount of salt in the wine.
- free sulfur dioxide
= The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
- total sulfur dioxide
= Amount of free and bound forms of S02. This amount in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in flavour of smell and taste of wine.
- density
= This column gives the density. The density of water is close to that of water depending on the percent of sugar and alcohol content.
- pH
= This column gives the amount of acid or wine basic. Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic). Most of wines are between 3-4 on the pH scale.
- sulphates
= This column gives the amount of sulphates of contains in wine. A wine additive which can contribute to sulphur dioxide gas (S02) levels, which acts as an antimicrobial and antioxidant.
- alcoholThis
= column gives the amount the percent content of alcohol in the wine
- quality
= This column gives the quality of the wine. Output variable; based on sensory data, scores between 0 and 10. This means, 0 is poor quality and 10 is high quality.
We will use ggcorr()
function from GGally
packages. ggcorr is a function for making a correlation matrix plot.
white wine data have low correlation to each other except
residual sugar
to density
and alcohol
to density
.
# lets see if our target variable has outlier
# we need to scale the data first because all data have different numeric scale
boxplot(scale(white.df))
from the boxplots above we can see that citric.acid
, residual.sugar
, free.sulfur.dioxide
, and density
have extreme value of data or what we usually said as Outlier. let’s see which rows contain outlier.
# first we draw boxplot from that specific column, then after we know in which range the outlier happens, subset the data to see the outlier's row
# apply this to every suspected column
boxplot(white.df$citric.acid)
remove row with outlier
# build a simple linear regression model with all existing variables
mod.1 <- lm(quality~., train.1)
summary(mod.1)
##
## Call:
## lm(formula = quality ~ ., data = train.1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5840 -0.4964 -0.0427 0.4627 3.1160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 190.8337011 25.0687730 7.612 0.0000000000000335 ***
## fixed.acidity 0.0974334 0.0251489 3.874 0.000109 ***
## volatile.acidity -2.0205297 0.1273771 -15.863 < 0.0000000000000002 ***
## citric.acid -0.0501521 0.1090475 -0.460 0.645606
## residual.sugar 0.0958231 0.0094907 10.097 < 0.0000000000000002 ***
## chlorides -0.0173835 0.6013370 -0.029 0.976939
## free.sulfur.dioxide 0.0044269 0.0009696 4.566 0.0000051285822214 ***
## total.sulfur.dioxide -0.0001228 0.0004272 -0.288 0.773736
## density -191.2163134 25.4027849 -7.527 0.0000000000000639 ***
## pH 0.7533606 0.1246299 6.045 0.0000000016362551 ***
## sulphates 0.7002043 0.1144924 6.116 0.0000000010555660 ***
## alcohol 0.1459350 0.0317373 4.598 0.0000043960991445 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7466 on 3906 degrees of freedom
## Multiple R-squared: 0.2897, Adjusted R-squared: 0.2877
## F-statistic: 144.8 on 11 and 3906 DF, p-value: < 0.00000000000000022
The summary above presents a lot of metrics but for now, we only take a look at Pr(>|t|)
. Variable with Pr(>|t|)
> 0.05 mean that variables doesn’t have any significant value to our model, thus we’re looking for variables with Pr(>|t|)
below 0.05. It’save to remove variable that doesn’t meet our criteria. We can identify better model by looking at R-squared
. Let’s build second model with selected variables only. To do that, we can use stepwise
.
Stepwise
regression is a set of methods to fit regression models where the choice of predictive variables is carried out by an automatic procedure. Stepwise
will train model with every posible combination of variables with the lowest AIC
. AIC
are not represented in summary()
but i can say that AIC
demonstrated how we can estimate the amount of “information loss” from one model to another.
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar +
## free.sulfur.dioxide + density + pH + sulphates + alcohol,
## data = train.1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5714 -0.4987 -0.0417 0.4611 3.1158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 193.9841189 23.6672190 8.196 0.000000000000000333 ***
## fixed.acidity 0.0973566 0.0244673 3.979 0.000070447117108978 ***
## volatile.acidity -2.0175338 0.1225167 -16.467 < 0.0000000000000002 ***
## residual.sugar 0.0967974 0.0090464 10.700 < 0.0000000000000002 ***
## free.sulfur.dioxide 0.0042333 0.0007792 5.433 0.000000058866013339 ***
## density -194.4198414 23.9658187 -8.112 0.000000000000000658 ***
## pH 0.7628269 0.1217508 6.265 0.000000000411951034 ***
## sulphates 0.6963800 0.1141404 6.101 0.000000001155909773 ***
## alcohol 0.1432226 0.0312481 4.583 0.000004717716784313 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7463 on 3909 degrees of freedom
## Multiple R-squared: 0.2896, Adjusted R-squared: 0.2881
## F-statistic: 199.2 on 8 and 3909 DF, p-value: < 0.00000000000000022
Now we have the best selected variables according to stepwise
. We’ll build new model from the stepwise
for later use.
You need to keep in mind, R-squared doesn’t indicate the quality of the model, instead R-squared value tells us how well our model describes the data. It measures the extent to which the variance in our dependent variable can be explained by the independent variables. Apparently, both model only shows low R-squared values, our model only represent 27% variance of data. both model only have slight differences in R-squared (0.2788 > 0.2785), thus it save for us to remove several variable because it still represent variance of our data as much as before.
However, let’s evaluate our model to see the differences between both model
# prediction model with all vaiabel
pred.1 <- predict(mod.1, newdata = test.1)
# prediction model using stepwise
pred.2 <- predict(mod.2x, newdata = test.1)
# calculate MAE and RMSE
all.var <- data.frame(MAE = MAE(pred.1, test.1$quality), RMSE = RMSE(pred.1, test.1$quality))
with.step <- data.frame(MAE = MAE(pred.2, test.1$quality), RMSE = RMSE(pred.2, test.1$quality))
rbind(all.var, with.step)
Top row represent MAE and RMSE from first model (with all variables) and the bottom represent second model (with stepwise). Our model with stepwise are slightly better than the first one. But we’re dealing with linear regression here, we need more statistical apporach to determine the quality of our model.
A residual plot is a plot that plot the residual values on the y-axis and the fitted values on the x-axis. A residual plot that has points randomly scattered around the x-axis is the one we want. It doesn’t mean that the model is perfect, but it does mean that the regression model you fit appropriately describes the variability in our dependent variable.
plot(train.1$quality, residuals(mod.2x),
xlab = "quality", ylab = "residuals")
abline(h=0, col="maroon", lwd=2)
We can see a pattern from our residual plot, means our model doesn’t fit appropriately. our model doesn’t meet linearity assumption.
Heteroscedasticity is a condition where the variability of a variable is unequal across its range of value. In a linear regression model, if the variance of its error is showing unequal variation across the target variable range, it shows that heteroscedasticity is present and the implication to that is related to the previous statement of a non-random pattern in residual.
##
## studentized Breusch-Pagan test
##
## data: mod.2x
## BP = 49.976, df = 8, p-value = 0.00000004131
The test has a p-value below 0.05, therefore we know that the residual of our model is not constant (heteroscedasticity is present)
The normality assumption means that the residuals from the linear regression model should be normally distributed because we expect to get residuals near the zero value. We will use Shapiro-Wilk test to our residual
##
## Shapiro-Wilk normality test
##
## data: mod.2x$residuals
## W = 0.98928, p-value < 0.00000000000000022
The test has a p-value below 0.05, therefore we know that our residuals are not normally distributed
we’re using Variance Inflation Factor (VIF) to check multicollinearity among our variables in model. A common rule of thumbs is that a VIF number greater than 10 may indicate high collinearity and worth further inspection,
## fixed.acidity volatile.acidity residual.sugar free.sulfur.dioxide
## 3.050271 1.048269 14.498110 1.158508
## density pH sulphates alcohol
## 34.571740 2.375009 1.165004 10.383438
From the test we know that density
, residual.sugar
, and alcohol
has high colinearity.
Sadly our model doesn’t meet any sign of good criteria
model based on statistical approach like Normality
, Heteroscedasticity
, residual plot
, and Multicolinearity
. The R-squared is also only represent 27% variance of data even after we remove the outlier. Predictive ability of the model measured by RMSE
is 0.7355, and MAE
is 0.565. We’ve learn that this data are not suitable for linear regresion, i also have assumption why this problem occurs:
Thank you