Red Wine Quality

Intro

Red wine is a type of wine made from dark-colored grape varieties. The color of the wine can range from intense violet, typical of young wines, through to brick red for mature wines and brown for older red wines. The juice from most purple grapes is greenish-white, the red color coming from anthocyan pigments present in the skin of the grape. Much of the red wine production process involves extraction of color and flavor components from the grape skin. (Source: Wikipedia)

Four Indicators of Wine Quality

  1. Complexity

Higher quality wines are more complex in their flavor profile. They often have numerous layers that release flavors over time. Lower quality wines lack this complexity, having just one or two main notes that may or may not linger.

With high-quality wines, these flavors may appear on the palate one after the other, giving you time to savor each one before the next appears.

  2. Balance

Wines that have good balance will be of higher quality than ones where one component stands out above the rest.

The five components – acidity, tannins, sugar/sweetness, alcohol and fruit – need to be balanced. For wines that need several years of aging to reach maturity, this gives them the time they need to reach optimal balance.

Higher quality wines don’t necessarily need moderation in each component – indeed, some red wines have higher acidity while others have a higher alcohol content. What makes the difference is that the other components balance things out.

  3. Typicity

Another indicator of wine quality comes from typicity, or how much the wine looks and tastes the way it should.

For example, red Burgundy should have a certain appearance and taste, and it’s this combination that wine connoisseurs look for with each new vintage. An Australian Shiraz will also have a certain typicity, as will a Barolo, a Rioja or a Napa Valley Cabernet Sauvignon, among others.

  4. Intensity and Finish

The final indicators of both white and red wine quality are the intensity and finish. High-quality wines will express intense flavors and a lingering finish, with flavors lasting after you’ve swallowed the wine. Flavors that disappear immediately can indicate that your wine is of moderate quality at best. The better the wine, the longer the flavor finish will last on your palate. (Source: https://www.jjbuckley.com/wine-knowledge)

Producing high-quality wine is very important for a winery. Let's see what we can learn about quality from the data set provided!

Data Preparation

I’m using the Red Wine Quality data set from kaggle.com: https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009

Let’s take a look at the data.

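library(dplyr)   # glimpse(), filter(), select(), %>%
library(GGally)  # ggcorr()
library(caret)   # RMSE(); assumed source, based on the pred/obs arguments used later

wine <- read.csv("winequality-red.csv")
rmarkdown::paged_table(wine)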

Column descriptions:

  • fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  • volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
  • citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  • residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter and
  • chlorides: the amount of salt in the wine
  • free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion;
  • total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2
  • density: the density of wine is close to that of water depending on the percent alcohol and sugar content
  • pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4
  • sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial

dim(wine)
## [1] 1599   12

This data set contains 1599 rows and 12 columns.

Next, check for missing values, since they would affect the rest of the analysis.

anyNA(wine)
## [1] FALSE
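
anyNA() only gives a yes/no answer for the whole table. If it had returned TRUE, a per-column count would show where the gaps are. The sketch below uses a small made-up data frame (`toy`, with one NA injected on purpose) rather than the real wine data, so the values are purely illustrative.

```r
# Toy stand-in for the wine data (hypothetical values, one NA added on purpose)
toy <- data.frame(
  alcohol = c(9.4, 9.8, NA, 10.0),
  pH      = c(3.51, 3.20, 3.26, 3.16),
  quality = c(5L, 5L, 6L, 7L)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that have no missing value in any column
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

The same two calls would work unchanged on `wine`.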

There are no NA values. Next, check the data types using glimpse() from the dplyr library.

glimpse(wine)
## Rows: 1,599
## Columns: 12
## $ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ quality              <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…

Since I’m going to use linear regression, and this method needs numeric predictors, I’ll keep the columns in their original data types.

Exploratory Data Analysis

For a quick overview, let’s check the correlation between the target and its predictors.

ggcorr(wine, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

The figure above shows that there is no strong correlation between the target, quality, and the other variables. In addition, some predictors are strongly correlated with each other, such as fixed acidity with citric acid and density, and free sulfur dioxide with total sulfur dioxide. This indicates that Naive Bayes would not be a good fit, since it assumes the predictors are independent of one another. But let’s work with the data as it is.

The next step is checking for outliers.
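
To read the target correlations as numbers rather than from the plot, one option is to pull the quality column out of the correlation matrix and sort it by absolute value. This is a minimal sketch in base R on simulated data: the data frame `toy` and the coefficients used to generate it are assumptions for illustration, not values from the wine data.

```r
# Simulated stand-in for the wine data (hypothetical relationships)
set.seed(1)
n <- 200
alcohol          <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
pH               <- rnorm(n, mean = 3.3,  sd = 0.15)
quality <- 2 + 0.35 * alcohol - 1.0 * volatile.acidity + rnorm(n, sd = 0.6)
toy <- data.frame(alcohol, volatile.acidity, pH, quality)

# Pearson correlation of every column with the target
cors <- cor(toy)[, "quality"]

# Drop the target's correlation with itself, then sort by strength
cors   <- cors[names(cors) != "quality"]
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 2))
```

Replacing `toy` with `wine` would give the same ranking that the ggcorr() figure shows in its quality row.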

plot1 <- boxplot(wine, las = 2)

There are outliers in total sulfur dioxide and free sulfur dioxide, so I decided to remove them.

wine_clean <- wine %>%
  filter(total.sulfur.dioxide < 160, free.sulfur.dioxide < 60)
plot2 <- boxplot(wine_clean, las = 2)
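
The cutoffs of 160 and 60 above are fixed numbers. As an alternative, the points that boxplot() draws beyond the whiskers can be listed with boxplot.stats(), which applies the same 1.5 x IQR rule. The sketch below is not the method used in this analysis; it runs on a short made-up vector (`total_so2`) in which 150 is planted as an obvious outlier.

```r
# Toy stand-in for one column (hypothetical values; 150 is the planted outlier)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 150)

# boxplot.stats() uses the same rule as boxplot(): points further than
# 1.5 times the hinge spread from the box are reported in $out
stats         <- boxplot.stats(total_so2, coef = 1.5)
flagged       <- stats$out
upper_whisker <- stats$stats[5]   # largest value that is not an outlier

# Keep only the values inside the whiskers
kept <- total_so2[!total_so2 %in% flagged]

print(flagged)
print(upper_whisker)
```

On the real data this rule would likely flag more rows than the fixed cutoffs, so it is a trade-off between a reproducible rule and keeping more observations.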

Modeling

Step and Train-Test Split

Use step() with backward elimination to decide which predictors the model should keep:

wine_lm <- lm(quality ~ ., data = wine_clean)
stats::step(wine_lm, direction = "backward")
## Start:  AIC=-1369.88
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS     AIC
## - density               1     0.096 663.37 -1371.7
## - residual.sugar        1     0.140 663.42 -1371.5
## - fixed.acidity         1     0.252 663.53 -1371.3
## - citric.acid           1     0.659 663.94 -1370.3
## <none>                              663.28 -1369.9
## - pH                    1     2.021 665.30 -1367.0
## - free.sulfur.dioxide   1     2.244 665.52 -1366.5
## - chlorides             1     8.349 671.63 -1352.0
## - total.sulfur.dioxide  1    10.509 673.79 -1346.9
## - sulphates             1    27.433 690.71 -1307.4
## - volatile.acidity      1    32.737 696.01 -1295.2
## - alcohol               1    45.955 709.23 -1265.2
## 
## Step:  AIC=-1371.65
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS     AIC
## - residual.sugar        1     0.056 663.43 -1373.5
## - fixed.acidity         1     0.175 663.55 -1373.2
## - citric.acid           1     0.664 664.04 -1372.1
## <none>                              663.37 -1371.7
## - free.sulfur.dioxide   1     2.368 665.74 -1368.0
## - pH                    1     3.773 667.15 -1364.6
## - chlorides             1     8.582 671.96 -1353.2
## - total.sulfur.dioxide  1    10.876 674.25 -1347.8
## - sulphates             1    28.311 691.68 -1307.1
## - volatile.acidity      1    33.641 697.01 -1294.9
## - alcohol               1   113.004 776.38 -1123.2
## 
## Step:  AIC=-1373.52
## quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates + 
##     alcohol
## 
##                        Df Sum of Sq    RSS     AIC
## - fixed.acidity         1     0.193 663.62 -1375.0
## - citric.acid           1     0.640 664.07 -1374.0
## <none>                              663.43 -1373.5
## - free.sulfur.dioxide   1     2.425 665.85 -1369.7
## - pH                    1     3.751 667.18 -1366.5
## - chlorides             1     8.534 671.96 -1355.2
## - total.sulfur.dioxide  1    10.822 674.25 -1349.8
## - sulphates             1    28.256 691.69 -1309.1
## - volatile.acidity      1    33.620 697.05 -1296.8
## - alcohol               1   114.345 777.77 -1122.4
## 
## Step:  AIC=-1375.05
## quality ~ volatile.acidity + citric.acid + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS     AIC
## - citric.acid           1     0.447 664.07 -1376.0
## <none>                              663.62 -1375.0
## - free.sulfur.dioxide   1     2.538 666.16 -1371.0
## - pH                    1     6.575 670.20 -1361.4
## - chlorides             1     9.828 673.45 -1353.7
## - total.sulfur.dioxide  1    12.284 675.91 -1347.8
## - sulphates             1    28.648 692.27 -1309.8
## - volatile.acidity      1    34.472 698.09 -1296.4
## - alcohol               1   114.697 778.32 -1123.2
## 
## Step:  AIC=-1375.98
## quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS     AIC
## <none>                              664.07 -1376.0
## - free.sulfur.dioxide   1     2.889 666.96 -1371.1
## - pH                    1     6.494 670.56 -1362.5
## - chlorides             1    10.772 674.84 -1352.4
## - total.sulfur.dioxide  1    13.341 677.41 -1346.3
## - sulphates             1    28.278 692.35 -1311.6
## - volatile.acidity      1    40.598 704.67 -1283.5
## - alcohol               1   116.400 780.47 -1120.9
## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol, data = wine_clean)
## 
## Coefficients:
##          (Intercept)      volatile.acidity             chlorides  
##             4.423598             -0.995693             -2.023960  
##  free.sulfur.dioxide  total.sulfur.dioxide                    pH  
##             0.005781             -0.004081             -0.463363  
##            sulphates               alcohol  
##             0.907900              0.282801
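
The AIC values in the step() output come from extractAIC(), which for a linear model computes n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. At each step the predictor whose removal gives the lowest AIC is dropped, and the search stops once `<none>` is the best row. The sketch below checks that formula on simulated data; the data frame `toy` and its variables are hypothetical.

```r
# Simulated data: y depends on x1 only, x2 is pure noise (hypothetical example)
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)
toy <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = toy)

# AIC as used by step(): n * log(RSS / n) + 2 * (number of coefficients)
rss        <- sum(residuals(fit)^2)
k          <- length(coef(fit))          # intercept + 2 slopes = 3
aic_manual <- n * log(rss / n) + 2 * k

# extractAIC() returns c(edf, AIC); the second element is what step() prints
aic_step <- extractAIC(fit)[2]

print(c(manual = aic_manual, extractAIC = aic_step))
```

Note that this differs from AIC(fit) by an additive constant, which does not matter when comparing models fitted to the same data.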

Split wine_clean into training data and test data.

set.seed(123)
samplesize <- round(0.7 * nrow(wine_clean), 0)
index <- sample(seq_len(nrow(wine_clean)), size = samplesize)

data_train <- wine_clean[index, ]
data_test <- wine_clean[-index, ]
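
A quick sanity check after a split like this is to confirm that the two parts add up to the full data and share no rows. The sketch below repeats the same splitting steps on a ten-row made-up data frame (`toy`, with a hypothetical `id` column added to track rows).

```r
# Toy stand-in with an id column to track rows (hypothetical values)
toy <- data.frame(id = 1:10, quality = c(5, 5, 6, 7, 5, 6, 5, 7, 6, 5))

# Same 70/30 split logic as above
set.seed(123)
samplesize <- round(0.7 * nrow(toy), 0)
index      <- sample(seq_len(nrow(toy)), size = samplesize)

train <- toy[index, ]
test  <- toy[-index, ]

# The parts should cover every row exactly once
n_train <- nrow(train)
n_test  <- nrow(test)
overlap <- intersect(train$id, test$id)

print(c(train = n_train, test = n_test, overlap = length(overlap)))
```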

Linear Regression

set.seed(123)
wine_lm <- lm(quality ~ ., data = data_train)

summary(wine_lm)
## 
## Call:
## lm(formula = quality ~ ., data = data_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.60965 -0.37720 -0.05214  0.45858  2.08224 
## 
## Coefficients:
##                         Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)          -12.4059392  25.9046875  -0.479             0.632100    
## fixed.acidity         -0.0213629   0.0318500  -0.671             0.502530    
## volatile.acidity      -1.0506193   0.1446628  -7.263    0.000000000000717 ***
## citric.acid           -0.2170904   0.1774698  -1.223             0.221496    
## residual.sugar         0.0009035   0.0176844   0.051             0.959265    
## chlorides             -1.6972223   0.4993322  -3.399             0.000701 ***
## free.sulfur.dioxide    0.0053219   0.0026366   2.018             0.043784 *  
## total.sulfur.dioxide  -0.0040987   0.0009200  -4.455    0.000009244225416 ***
## density               17.6336545  26.4439204   0.667             0.505019    
## pH                    -0.6787211   0.2313804  -2.933             0.003423 ** 
## sulphates              0.7981439   0.1403178   5.688    0.000000016447379 ***
## alcohol                0.3086259   0.0318526   9.689 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6513 on 1102 degrees of freedom
## Multiple R-squared:  0.344,  Adjusted R-squared:  0.3375 
## F-statistic: 52.54 on 11 and 1102 DF,  p-value: < 0.00000000000000022

I use an adjusted R-squared of 0.75 as the threshold for a good model. My model reaches only 0.3375, which suggests that a model built on all predictors explains too little of the variation in quality to predict unseen data well.

Let’s look at the results using only the predictors selected by step():

lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates + alcohol, data = wine_clean)
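
The adjusted R-squared in the summary can be reproduced by hand from the other numbers it prints, using 1 - (1 - R^2) * (n - 1) / (n - p - 1). The sketch below plugs in the multiple R-squared (0.344), the residual degrees of freedom (1102) and the 11 predictors from the output above. Because the printed R-squared is rounded, the result only matches the reported value approximately.

```r
# Values read from summary(wine_lm) above
r2       <- 0.344    # Multiple R-squared (rounded in the printout)
df_resid <- 1102     # residual degrees of freedom
p        <- 11       # number of predictors (excluding the intercept)

# Number of training rows: residual df + predictors + intercept
n <- df_resid + p + 1

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(n)
print(round(adj_r2, 4))
```

Because n is large compared with p, the penalty is small, which is why the adjusted value sits so close to the plain R-squared.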

wine2 <- wine_clean %>%
  select(quality, volatile.acidity, chlorides, free.sulfur.dioxide,
         total.sulfur.dioxide, pH, sulphates, alcohol)
data_train2 <- wine2[index, ]
data_test2 <- wine2[-index, ]

set.seed(123)
wine_lm2 <- lm(quality ~ ., data = data_train2)

summary(wine_lm2)
## 
## Call:
## lm(formula = quality ~ ., data = data_train2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68102 -0.37517 -0.05496  0.47441  2.07185 
## 
## Coefficients:
##                        Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)           4.3468685  0.4849981   8.963 < 0.0000000000000002 ***
## volatile.acidity     -0.9343988  0.1203625  -7.763   0.0000000000000188 ***
## chlorides            -1.7345229  0.4717566  -3.677             0.000248 ***
## free.sulfur.dioxide   0.0057432  0.0025714   2.233             0.025718 *  
## total.sulfur.dioxide -0.0041446  0.0008775  -4.723   0.0000026183267565 ***
## pH                   -0.4517942  0.1401958  -3.223             0.001307 ** 
## sulphates             0.7940990  0.1354946   5.861   0.0000000060751075 ***
## alcohol               0.2873338  0.0201604  14.252 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6511 on 1106 degrees of freedom
## Multiple R-squared:  0.3421, Adjusted R-squared:  0.338 
## F-statistic: 82.16 on 7 and 1106 DF,  p-value: < 0.00000000000000022

The adjusted R-squared of the new model is barely different: 0.338 versus 0.3375, a gain of about 0.0005.

Evaluation

Model Performance

To compare the models’ performance, I calculate the Root Mean Squared Error (RMSE).

This is the result for the first model:

wine_pred <- predict(wine_lm, newdata = data_test %>% select(-quality))

#RMSE of train dataset
RMSE(pred = wine_lm$fitted.values, obs = data_train$quality)
## [1] 0.6477953
#RMSE of test dataset
RMSE(pred = wine_pred, obs = data_test$quality)
## [1] 0.6441531
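
RMSE is the square root of the average squared difference between predictions and observed values, so it is expressed in the same units as quality. A minimal base R version is sketched below with a hypothetical helper `rmse()` and four made-up predictions. It is meant to show what the RMSE() calls compute, not to replace them.

```r
# Hypothetical helper: root mean squared error
rmse <- function(pred, obs) {
  sqrt(mean((pred - obs)^2))
}

# Made-up observed quality scores and predictions
obs  <- c(5, 6, 5, 7)
pred <- c(5.2, 5.6, 5.1, 6.3)

# Errors are 0.2, -0.4, 0.1, -0.7, so the mean squared error is 0.175
toy_rmse <- rmse(pred, obs)
print(toy_rmse)
```

An RMSE of about 0.64, as in the results above, means predictions are typically off by roughly two thirds of a quality point.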

This is the result for the second model:

wine_pred2 <- predict(wine_lm2, newdata = data_test2 %>% select(-quality))

#RMSE of train dataset
RMSE(pred = wine_lm2$fitted.values, obs = data_train2$quality)
## [1] 0.6487416
#RMSE of test dataset
RMSE(pred = wine_pred2, obs = data_test2$quality)
## [1] 0.6409882

Conclusion

A smaller RMSE indicates a better model. Comparing the two models above, the second model is slightly better than the first on the test data (0.6410 versus 0.6442), although its training RMSE is marginally higher (0.6487 versus 0.6478).

But as analysts we need to keep in mind that the GGally figure shows no strong correlation between quality and any predictor, and that the adjusted R-squared is low for both models. It is worth considering another ML method, such as logistic regression.

