Red Wine Quality
Intro
Red wine is a type of wine made from dark-colored grape varieties. The
color of the wine can range from intense violet, typical of young wines,
through to brick red for mature wines and brown for older red wines. The
juice from most purple grapes is greenish-white, the red color coming
from anthocyan pigments present in the skin of the grape. Much of the
red wine production process involves extraction of color and flavor
components from the grape skin. (Source: Wikipedia)
Four Indicators of Wine Quality
- Complexity
Higher quality wines are more complex in their flavor profile. They often have numerous layers that release flavors over time. Lower quality wines lack this complexity, having just one or two main notes that may or may not linger.
With high-quality wines, these flavors may appear on the palate one after the other, giving you time to savor each one before the next appears.
- Balance
Wines that have good balance will be of higher quality than ones where one component stands out above the rest.
The five components – acidity, tannins, sugar/sweetness, alcohol and fruit – need to be balanced. For wines that need several years of aging to reach maturity, this gives them the time they need to reach optimal balance.
Higher quality wines don’t necessarily need moderation in each component – indeed, some red wines have higher acidity while others have a higher alcohol content. What makes the difference is that the other components balance things out.
- Typicity
Another indicator of wine quality comes from typicity, or how much the wine looks and tastes the way it should.
For example, red Burgundy should have a certain appearance and taste, and it’s this combination that wine connoisseurs look for with each new vintage. An Australian Shiraz will also have a certain typicity, as will a Barolo, a Rioja or a Napa Valley Cabernet Sauvignon, among others.
- Intensity and Finish
The final indicators of both white and red wine quality are the intensity and finish. High-quality wines will express intense flavors and a lingering finish, with flavors lasting after you’ve swallowed the wine. Flavors that disappear immediately can indicate that your wine is of moderate quality at best. The better the wine, the longer the flavor finish will last on your palate. Source: https://www.jjbuckley.com/wine-knowledge
Producing high-quality wine is very important for a winery. Let's see how well quality can be predicted from the data set provided!
Data Preparation
I’m using the Red Wine Quality data set from Kaggle: https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009
Let’s take a look at the data.
wine <- read.csv("winequality-red.csv")
rmarkdown::paged_table(wine)
Columns Description:
- fixed acidity: most acids involved with wine are fixed or nonvolatile (they do not evaporate readily)
- volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
- citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
- residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter
- chlorides: the amount of salt in the wine
- free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion
- total sulfur dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine
- density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
- pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4
- sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial
dim(wine)
## [1] 1599 12
This data set contains 1599 rows and 12 columns.
Next, check for missing values, since they would affect the rest of the work.
anyNA(wine)
## [1] FALSE
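anyNA() only gives a single yes/no answer for the whole table. If it ever returned TRUE, a per-column count would show where the gaps are. The sketch below illustrates that on a small made-up data frame (the `toy` values are invented and only borrow column names from the wine data).

```r
# Toy data frame standing in for the wine data (values are made up).
toy <- data.frame(
  alcohol = c(9.4, NA, 9.8, 10.0),
  pH      = c(3.51, 3.20, NA, NA),
  quality = c(5L, 5L, 6L, 7L)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only the rows that have no missing value in any column.
complete_rows <- toy[complete.cases(toy), ]
print(nrow(complete_rows))
```

On the real data the same two calls would be applied to `wine` instead of `toy`.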
There are no NA values. Next, check the data types using glimpse() from
the dplyr library.
library(dplyr)
glimpse(wine)
## Rows: 1,599
## Columns: 12
## $ fixed.acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile.acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric.acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual.sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free.sulfur.dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ quality <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
Since I’m going to use linear regression, and this method needs numeric predictors, I’ll keep the columns in their original data types.
Exploratory Data Analysis
For a quick overview, let’s check the correlation between the target and its predictors.
library(GGally)
ggcorr(wine, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
The figure above shows that there is no strong correlation between the
target, quality, and the other variables. In addition, some predictors
are strongly correlated with each other, such as fixed acidity with
citric acid and density, and free sulfur dioxide with total sulfur
dioxide. This suggests that Naive Bayes would not be a suitable method,
since it assumes the predictors are independent of each other. But let’s
work with the data as it is.
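The same reading can be done numerically with cor() instead of from the plot. The sketch below is only an illustration: it builds a synthetic data frame (the relationships and coefficients are invented, not taken from the wine data), ranks predictors by their correlation with the target, and lists predictor pairs whose absolute correlation exceeds 0.7, which is an arbitrary cut-off chosen for the example.

```r
set.seed(42)
n <- 200

# Synthetic stand-in for the wine data; all relationships here are invented.
alcohol              <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity     <- rnorm(n, mean = 0.53, sd = 0.18)
free.sulfur.dioxide  <- runif(n, min = 1, max = 60)
total.sulfur.dioxide <- 2.5 * free.sulfur.dioxide + rnorm(n, sd = 10)
quality <- 5.6 + 0.3 * (alcohol - 10.4) -
  1.0 * (volatile.acidity - 0.53) + rnorm(n, sd = 0.65)

toy <- data.frame(alcohol, volatile.acidity, free.sulfur.dioxide,
                  total.sulfur.dioxide, quality)

cor_matrix <- cor(toy)
predictors <- setdiff(colnames(cor_matrix), "quality")

# Correlation of each predictor with the target, strongest first.
target_cor <- cor_matrix[predictors, "quality"]
target_cor <- target_cor[order(abs(target_cor), decreasing = TRUE)]
print(round(target_cor, 2))

# Predictor pairs with |r| above the (arbitrary) 0.7 threshold.
pred_cor <- cor_matrix[predictors, predictors]
high <- which(abs(pred_cor) > 0.7 & upper.tri(pred_cor), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(pred_cor)[high[, "row"]],
  var2 = colnames(pred_cor)[high[, "col"]]
)
print(high_pairs)
```

Applied to `wine`, the second table would be expected to flag pairs like the ones named above, though the exact list depends on the threshold chosen.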
The next step is checking for outliers.
plot1 <- boxplot(wine, las = 2)
There are outliers in total sulfur dioxide and free sulfur dioxide, so I decided to remove them.
wine_clean <- wine %>%
  filter(total.sulfur.dioxide < 160, free.sulfur.dioxide < 60)
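The cut-offs in the filter above (160 and 60) appear to be read off the boxplot by eye. A common alternative is to flag values that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below shows that rule on made-up numbers; `is_outlier` is a hypothetical helper, not something used in this analysis, and boxplot() itself uses hinges, which can differ slightly from quantile().

```r
# Hypothetical helper: TRUE where a value lies outside the 1.5 * IQR fences.
is_outlier <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

# Made-up values standing in for total.sulfur.dioxide.
so2 <- c(34, 67, 54, 60, 40, 59, 21, 18, 102, 65, 278, 289)

# Values flagged by the rule.
flagged <- so2[is_outlier(so2)]
print(flagged)

# Values that would be kept.
kept <- so2[!is_outlier(so2)]
print(length(kept))
```

A rule like this flags points in every skewed column, so whether to drop them is still a judgment call.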
plot2 <- boxplot(wine_clean, las = 2)
Modeling
Step and Train-Test Split
Use step() to choose a model with a useful subset of predictors.
wine_lm <- lm(quality ~ ., data = wine_clean)
stats::step(wine_lm, direction = "backward")
## Start: AIC=-1369.88
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## - density 1 0.096 663.37 -1371.7
## - residual.sugar 1 0.140 663.42 -1371.5
## - fixed.acidity 1 0.252 663.53 -1371.3
## - citric.acid 1 0.659 663.94 -1370.3
## <none> 663.28 -1369.9
## - pH 1 2.021 665.30 -1367.0
## - free.sulfur.dioxide 1 2.244 665.52 -1366.5
## - chlorides 1 8.349 671.63 -1352.0
## - total.sulfur.dioxide 1 10.509 673.79 -1346.9
## - sulphates 1 27.433 690.71 -1307.4
## - volatile.acidity 1 32.737 696.01 -1295.2
## - alcohol 1 45.955 709.23 -1265.2
##
## Step: AIC=-1371.65
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## - residual.sugar 1 0.056 663.43 -1373.5
## - fixed.acidity 1 0.175 663.55 -1373.2
## - citric.acid 1 0.664 664.04 -1372.1
## <none> 663.37 -1371.7
## - free.sulfur.dioxide 1 2.368 665.74 -1368.0
## - pH 1 3.773 667.15 -1364.6
## - chlorides 1 8.582 671.96 -1353.2
## - total.sulfur.dioxide 1 10.876 674.25 -1347.8
## - sulphates 1 28.311 691.68 -1307.1
## - volatile.acidity 1 33.641 697.01 -1294.9
## - alcohol 1 113.004 776.38 -1123.2
##
## Step: AIC=-1373.52
## quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates +
## alcohol
##
## Df Sum of Sq RSS AIC
## - fixed.acidity 1 0.193 663.62 -1375.0
## - citric.acid 1 0.640 664.07 -1374.0
## <none> 663.43 -1373.5
## - free.sulfur.dioxide 1 2.425 665.85 -1369.7
## - pH 1 3.751 667.18 -1366.5
## - chlorides 1 8.534 671.96 -1355.2
## - total.sulfur.dioxide 1 10.822 674.25 -1349.8
## - sulphates 1 28.256 691.69 -1309.1
## - volatile.acidity 1 33.620 697.05 -1296.8
## - alcohol 1 114.345 777.77 -1122.4
##
## Step: AIC=-1375.05
## quality ~ volatile.acidity + citric.acid + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## - citric.acid 1 0.447 664.07 -1376.0
## <none> 663.62 -1375.0
## - free.sulfur.dioxide 1 2.538 666.16 -1371.0
## - pH 1 6.575 670.20 -1361.4
## - chlorides 1 9.828 673.45 -1353.7
## - total.sulfur.dioxide 1 12.284 675.91 -1347.8
## - sulphates 1 28.648 692.27 -1309.8
## - volatile.acidity 1 34.472 698.09 -1296.4
## - alcohol 1 114.697 778.32 -1123.2
##
## Step: AIC=-1375.98
## quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## <none> 664.07 -1376.0
## - free.sulfur.dioxide 1 2.889 666.96 -1371.1
## - pH 1 6.494 670.56 -1362.5
## - chlorides 1 10.772 674.84 -1352.4
## - total.sulfur.dioxide 1 13.341 677.41 -1346.3
## - sulphates 1 28.278 692.35 -1311.6
## - volatile.acidity 1 40.598 704.67 -1283.5
## - alcohol 1 116.400 780.47 -1120.9
##
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol, data = wine_clean)
##
## Coefficients:
## (Intercept) volatile.acidity chlorides
## 4.423598 -0.995693 -2.023960
## free.sulfur.dioxide total.sulfur.dioxide pH
## 0.005781 -0.004081 -0.463363
## sulphates alcohol
## 0.907900 0.282801
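At each step above, the predictor whose removal gives the lowest AIC is dropped, and the search stops once `<none>` is at the top of the table. For an lm fit with default settings, the AIC that step() prints should be the value from extractAIC(), that is, n * log(RSS / n) + 2 * edf, where edf counts the estimated coefficients. The sketch below checks this by hand on the built-in mtcars data, used here only as a stand-in for the wine data.

```r
# Any small linear model will do; mtcars ships with R.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
# Number of estimated coefficients, intercept included.
edf <- n - df.residual(fit)

# AIC as used by step() for lm fits (up to an additive constant
# compared with stats::AIC()).
aic_manual <- n * log(rss / n) + 2 * edf
aic_step   <- extractAIC(fit)[2]

print(c(manual = aic_manual, extractAIC = aic_step))
```

Because only differences in AIC matter for the comparison, the negative values in the output above are not a problem.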
Split wine_clean into training data and test data.
set.seed(123)
samplesize <- round(0.7 * nrow(wine_clean), 0)
index <- sample(seq_len(nrow(wine_clean)), size = samplesize)
data_train <- wine_clean[index, ]
data_test <- wine_clean[-index, ]
Linear Regression
set.seed(123)
wine_lm <- lm(quality ~ ., data = data_train)
summary(wine_lm)
##
## Call:
## lm(formula = quality ~ ., data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60965 -0.37720 -0.05214 0.45858 2.08224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.4059392 25.9046875 -0.479 0.632100
## fixed.acidity -0.0213629 0.0318500 -0.671 0.502530
## volatile.acidity -1.0506193 0.1446628 -7.263 0.000000000000717 ***
## citric.acid -0.2170904 0.1774698 -1.223 0.221496
## residual.sugar 0.0009035 0.0176844 0.051 0.959265
## chlorides -1.6972223 0.4993322 -3.399 0.000701 ***
## free.sulfur.dioxide 0.0053219 0.0026366 2.018 0.043784 *
## total.sulfur.dioxide -0.0040987 0.0009200 -4.455 0.000009244225416 ***
## density 17.6336545 26.4439204 0.667 0.505019
## pH -0.6787211 0.2313804 -2.933 0.003423 **
## sulphates 0.7981439 0.1403178 5.688 0.000000016447379 ***
## alcohol 0.3086259 0.0318526 9.689 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6513 on 1102 degrees of freedom
## Multiple R-squared: 0.344, Adjusted R-squared: 0.3375
## F-statistic: 52.54 on 11 and 1102 DF, p-value: < 0.00000000000000022
An adjusted R-squared of 0.75 is used here as the threshold for a good model. My model scores 0.3375, which suggests that the model built with all predictors is not suitable for predicting unseen data.
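For reference, adjusted R-squared penalizes R-squared for the number of predictors: 1 - (1 - R²)(n - 1) / (n - p - 1). The sketch below defines a small helper (`adj_r2` is a hypothetical name), checks it against summary() on the built-in mtcars data, and then plugs in the rounded figures from the summary above (R² = 0.344, n = 1114 rows, p = 11 predictors).

```r
# Hypothetical helper: adjusted R-squared from R-squared, sample size n
# and number of predictors p (intercept not counted).
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Sanity check against summary() on a stand-in model.
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)
manual <- adj_r2(s$r.squared, n = nrow(mtcars), p = 2)
print(c(manual = manual, summary = s$adj.r.squared))

# Rounded figures from the wine model summary: 1102 residual degrees of
# freedom plus 12 coefficients gives n = 1114.
wine_adj <- adj_r2(0.344, n = 1114, p = 11)
print(wine_adj)
```

The last value should land close to the 0.3375 reported in the summary; small differences come from using the rounded R² of 0.344.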
Now let’s look at the results using only the predictors selected by the step method:
lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates + alcohol, data = wine_clean)
wine2 <- wine_clean %>%
select(quality,volatile.acidity, chlorides , free.sulfur.dioxide ,
total.sulfur.dioxide , pH ,sulphates , alcohol )
data_train2 <- wine2[index, ]
data_test2 <- wine2[-index, ]
set.seed(123)
wine_lm2 <- lm(quality ~ ., data = data_train2)
summary(wine_lm2)
##
## Call:
## lm(formula = quality ~ ., data = data_train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68102 -0.37517 -0.05496 0.47441 2.07185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.3468685 0.4849981 8.963 < 0.0000000000000002 ***
## volatile.acidity -0.9343988 0.1203625 -7.763 0.0000000000000188 ***
## chlorides -1.7345229 0.4717566 -3.677 0.000248 ***
## free.sulfur.dioxide 0.0057432 0.0025714 2.233 0.025718 *
## total.sulfur.dioxide -0.0041446 0.0008775 -4.723 0.0000026183267565 ***
## pH -0.4517942 0.1401958 -3.223 0.001307 **
## sulphates 0.7940990 0.1354946 5.861 0.0000000060751075 ***
## alcohol 0.2873338 0.0201604 14.252 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6511 on 1106 degrees of freedom
## Multiple R-squared: 0.3421, Adjusted R-squared: 0.338
## F-statistic: 82.16 on 7 and 1106 DF, p-value: < 0.00000000000000022
The adjusted R-squared of the new model is not much different: it rises by only about 0.0005 (from 0.3375 to 0.338).
Evaluation
Model Performance
To compare the models’ performance, I calculate the Root Mean Squared Error (RMSE).
This is for the first model:
# RMSE() is assumed to come from the caret package
library(caret)
wine_pred <- predict(wine_lm, newdata = data_test %>% select(-quality))
# RMSE on the train dataset
RMSE(pred = wine_lm$fitted.values, obs = data_train$quality)
## [1] 0.6477953
# RMSE on the test dataset
RMSE(pred = wine_pred, obs = data_test$quality)
## [1] 0.6441531
This is for the second model:
wine_pred2 <- predict(wine_lm2, newdata = data_test2 %>% select(-quality))
# RMSE on the train dataset
RMSE(pred = wine_lm2$fitted.values, obs = data_train2$quality)
## [1] 0.6487416
# RMSE on the test dataset
RMSE(pred = wine_pred2, obs = data_test2$quality)
## [1] 0.6409882
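RMSE is the square root of the mean squared difference between predictions and observations, so it is in the same units as quality. A base R version is sketched below for illustration; `rmse` is a hypothetical helper and the numbers are made up, but it should agree with the RMSE() calls above when given the same inputs.

```r
# Hypothetical helper: root mean squared error in base R.
rmse <- function(pred, obs) {
  sqrt(mean((pred - obs)^2))
}

# Made-up observed quality scores and model predictions.
obs  <- c(5, 6, 5, 7)
pred <- c(5.4, 5.6, 5.2, 6.0)

# Errors are 0.4, -0.4, 0.2 and -1.0, so the mean squared error is 0.34.
toy_rmse <- rmse(pred, obs)
print(toy_rmse)
```

An RMSE of about 0.64 on a quality scale that mostly spans 3 to 8 means a typical prediction is off by a bit more than half a point.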
Conclusion
A smaller RMSE indicates a better model. Comparing the two models above, the second model is slightly better than the first on the test data (0.6410 vs. 0.6442).
However, as analysts we need to keep in mind that both the GGally figure and the adjusted R-squared show that the predictors are not strongly related to quality, and we should consider using another ML method, such as logistic regression.