Wine tasters rated 1599 wines for quality; eleven quantitative chemical properties were also measured for each wine, in an effort to link wine quality to observable physical factors.
These are the 11 physical factors, taken from the dataset description:
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
The output variable (based on sensory data):
12 - quality (score between 0 and 10)
First, let’s look at some summary statistics about the quality ratings in our sample. Wine quality is assessed on a 0 to 10 point scale, from terrible to superb.
Let’s start with a histogram:
The ratings are roughly normally distributed, with values from 3 to 8, with most of the wines scoring a 5 or 6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Quality ratings vary from 3 to 8, with a mean of 5.6, a median of 6, and an interquartile range of 5 to 6. At least half of the wines therefore score a 5 or a 6, and comparatively few score higher or lower.
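As a minimal sketch of how such a summary can be produced, the snippet below uses a simulated `quality` vector as a stand-in for the real ratings; the score probabilities are invented purely for illustration and are not taken from the dataset.

```r
# Simulated stand-in for the quality column (NOT the real ratings);
# the probabilities below are illustrative assumptions only.
set.seed(1)
quality <- sample(3:8, size = 1599, replace = TRUE,
                  prob = c(0.01, 0.03, 0.43, 0.40, 0.12, 0.01))

# Minimum, quartiles, median, mean and maximum in one call.
quality_summary <- summary(quality)
print(quality_summary)

# Number of wines at each score, and the share scoring a 5 or a 6.
score_counts <- table(quality)
share_5_or_6 <- mean(quality %in% c(5, 6))
print(score_counts)
print(round(share_5_or_6, 2))
```

With the real data, `quality` would simply be the quality column of the data frame (named `wineData` in the regression output).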
Let’s look at one of the physical factors, one we’re all familiar with: alcohol concentration.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
We can see the wines tend to fall in the 9-11% alcohol by volume range, with some a bit higher.
For your casual viewing, here are histograms for the other physical factors:
The main feature of interest is the relationship between the physical factors and the quality ratings. If we can determine this relationship, it can inform future winemaking and help produce better wines.
The dataset is clean and has no missing values, so every observation can be used without adjusting the raw data.
I did not create any new variables.
0.4 g/dm^3. The fixed acidity and sulfur dioxide histograms appear to be somewhat right-skewed.
The dataset was clean, and not missing any data, making such adjustments unnecessary.
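A check for missing values might look like the sketch below. The small `wine` data frame is hypothetical, and it deliberately contains one `NA` so that the check has something to report; the actual dataset has none.

```r
# Hypothetical miniature data frame standing in for wineData.
# One NA is inserted on purpose to show what the check reports.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, NA),
  pH      = c(3.51, 3.20, 3.26, 3.16),
  quality = c(5, 5, 6, 6)
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(wine))
print(missing_per_column)

# Number of rows with no missing value in any column.
complete_rows <- sum(complete.cases(wine))
print(complete_rows)
```

On a clean dataset every entry of `missing_per_column` is zero and `complete_rows` equals the number of rows.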
Based on the physical factor names alone, we can expect some correlation among the factors: for instance, between total and free sulfur dioxide, and among the different indicators of acidity.
## fixed.acidity volatile.acidity citric.acid quality
## fixed.acidity 1.00 -0.26 0.67 0.12
## volatile.acidity -0.26 1.00 -0.55 -0.39
## citric.acid 0.67 -0.55 1.00 0.23
## quality 0.12 -0.39 0.23 1.00
## free.sulfur.dioxide total.sulfur.dioxide quality
## free.sulfur.dioxide 1.00 0.67 -0.05
## total.sulfur.dioxide 0.67 1.00 -0.19
## quality -0.05 -0.19 1.00
## quality
## alcohol 0.47616632
## volatile.acidity -0.39055778
## sulphates 0.25139708
## citric.acid 0.22637251
## total.sulfur.dioxide -0.18510029
## density -0.17491923
## chlorides -0.12890656
## fixed.acidity 0.12405165
## pH -0.05773139
## free.sulfur.dioxide -0.05065606
## residual.sugar 0.01373164
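A ranking like the one above can be computed along the following lines. This is a sketch on simulated data, not the wine measurements: the three predictors and the coefficients used to generate `quality` are assumptions chosen only to make the example run.

```r
set.seed(42)
n <- 1599
# Simulated stand-in data (NOT the wine measurements).
alcohol          <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.5, sd = 0.18)
residual.sugar   <- rnorm(n, mean = 2.5, sd = 1.4)
quality <- 2 + 0.35 * alcohol - 1.2 * volatile.acidity + rnorm(n, sd = 0.7)
wine <- data.frame(alcohol, volatile.acidity, residual.sugar, quality)

predictors <- setdiff(names(wine), "quality")

# Pearson correlation of each predictor with quality.
r_with_quality <- sapply(wine[predictors], cor, y = wine$quality)

# Sort by absolute magnitude, largest first, keeping the sign.
ranked <- r_with_quality[order(abs(r_with_quality), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting on `abs()` rather than the raw value keeps strong negative correlations, such as volatile acidity here, near the top of the list.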
Factors that measured similar qualities, such as the acidity measures and the sulfur dioxide variants, often correlated with one another: fixed acidity with citric acid (r = 0.67), volatile acidity with citric acid (r = -0.55), and free with total sulfur dioxide (r = 0.67).
Alcohol was the leading indicator that a wine would be rated of good quality (r = 0.48). Why? Do tasters prefer higher alcohol contents, or do higher quality grapes get fermented for longer? The strength of this indicator, and the ability of vintners to influence it through various production processes, offers a ripe avenue for exploration of future quality improvements.
For our multivariate analysis, we will perform a series of linear regressions, adding factors in the order of the absolute magnitude of the correlation between the physical factors and the outcome, quality.
##
## Call:
## lm(formula = quality ~ alcohol, data = wineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.87497 0.17471 10.73 <2e-16 ***
## alcohol 0.36084 0.01668 21.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity, data = wineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.59342 -0.40416 -0.07426 0.46539 2.25809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.09547 0.18450 16.78 <2e-16 ***
## alcohol 0.31381 0.01601 19.60 <2e-16 ***
## volatile.acidity -1.38364 0.09527 -14.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6678 on 1596 degrees of freedom
## Multiple R-squared: 0.317, Adjusted R-squared: 0.3161
## F-statistic: 370.4 on 2 and 1596 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = wineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7186 -0.3820 -0.0641 0.4746 2.1807
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.61083 0.19569 13.342 < 2e-16 ***
## alcohol 0.30922 0.01580 19.566 < 2e-16 ***
## volatile.acidity -1.22140 0.09701 -12.591 < 2e-16 ***
## sulphates 0.67903 0.10080 6.737 2.26e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6587 on 1595 degrees of freedom
## Multiple R-squared: 0.3359, Adjusted R-squared: 0.3346
## F-statistic: 268.9 on 3 and 1595 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide, data = wineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.72716 -0.38486 -0.06503 0.44980 2.13257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8258128 0.2006892 14.081 < 2e-16 ***
## alcohol 0.2953105 0.0160331 18.419 < 2e-16 ***
## volatile.acidity -1.1985632 0.0966011 -12.407 < 2e-16 ***
## sulphates 0.7121396 0.1005146 7.085 2.08e-12 ***
## total.sulfur.dioxide -0.0022354 0.0005108 -4.376 1.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.655 on 1594 degrees of freedom
## Multiple R-squared: 0.3438, Adjusted R-squared: 0.3421
## F-statistic: 208.8 on 4 and 1594 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides, data = wineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.67443 -0.38254 -0.06368 0.44893 2.07310
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0048920 0.2037663 14.747 < 2e-16 ***
## alcohol 0.2770979 0.0164836 16.811 < 2e-16 ***
## volatile.acidity -1.1419024 0.0969400 -11.779 < 2e-16 ***
## sulphates 0.9148320 0.1102702 8.296 2.26e-16 ***
## total.sulfur.dioxide -0.0023096 0.0005082 -4.544 5.92e-06 ***
## chlorides -1.7047871 0.3916886 -4.352 1.43e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6514 on 1593 degrees of freedom
## Multiple R-squared: 0.3515, Adjusted R-squared: 0.3495
## F-statistic: 172.7 on 5 and 1593 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + pH, data = wineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60575 -0.35883 -0.04806 0.46079 1.95643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2957316 0.3995603 10.751 < 2e-16 ***
## alcohol 0.2906738 0.0168108 17.291 < 2e-16 ***
## volatile.acidity -1.0381945 0.1004270 -10.338 < 2e-16 ***
## sulphates 0.8886802 0.1100419 8.076 1.31e-15 ***
## total.sulfur.dioxide -0.0023721 0.0005064 -4.684 3.05e-06 ***
## chlorides -2.0022839 0.3980757 -5.030 5.46e-07 ***
## pH -0.4351830 0.1160368 -3.750 0.000183 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6487 on 1592 degrees of freedom
## Multiple R-squared: 0.3572, Adjusted R-squared: 0.3548
## F-statistic: 147.4 on 6 and 1592 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + pH + free.sulfur.dioxide,
## data = wineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68918 -0.36757 -0.04653 0.46081 2.02954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4300987 0.4029168 10.995 < 2e-16 ***
## alcohol 0.2893028 0.0167958 17.225 < 2e-16 ***
## volatile.acidity -1.0127527 0.1008429 -10.043 < 2e-16 ***
## sulphates 0.8826651 0.1099084 8.031 1.86e-15 ***
## total.sulfur.dioxide -0.0034822 0.0006868 -5.070 4.43e-07 ***
## chlorides -2.0178138 0.3975417 -5.076 4.31e-07 ***
## pH -0.4826614 0.1175581 -4.106 4.23e-05 ***
## free.sulfur.dioxide 0.0050774 0.0021255 2.389 0.017 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6477 on 1591 degrees of freedom
## Multiple R-squared: 0.3595, Adjusted R-squared: 0.3567
## F-statistic: 127.6 on 7 and 1591 DF, p-value: < 2.2e-16
After adding predictors in descending order of their squared Pearson correlation with quality, and removing any whose p-value exceeded .05, we’re left with a seven-factor model comprising alcohol, volatile acidity, sulphates, total sulfur dioxide, chlorides, pH and free sulfur dioxide, listed in the order they were added.
There were observable diminishing returns in adding predictors. Alcohol alone explained about 23% of the variance in quality (R-squared of 0.227) and adding volatile acidity raised that to about 32%, but the remaining five predictors together lifted it only to about 36% (R-squared of 0.360).
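The selection procedure can be sketched as a short loop. This is one possible reading of it, run on simulated data rather than the wine measurements: each candidate is tried in order of absolute correlation and kept only if its own p-value is at most .05. The variable names and generating coefficients are assumptions for illustration.

```r
set.seed(7)
n <- 1599
# Simulated stand-in data (NOT the wine measurements).
wine <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.5, sd = 0.18),
  residual.sugar   = rnorm(n, mean = 2.5, sd = 1.4)
)
wine$quality <- 2 + 0.35 * wine$alcohol - 1.2 * wine$volatile.acidity +
  rnorm(n, sd = 0.7)

# Candidate predictors ordered by |r| with quality, largest first.
candidates <- setdiff(names(wine), "quality")
r <- sapply(wine[candidates], cor, y = wine$quality)
candidates <- candidates[order(abs(r), decreasing = TRUE)]

kept <- character(0)
for (term in candidates) {
  trial <- c(kept, term)
  fit <- lm(reformulate(trial, response = "quality"), data = wine)
  p_value <- summary(fit)$coefficients[term, "Pr(>|t|)"]
  # Keep the newly added term only if it is significant at the 5% level.
  if (p_value <= 0.05) kept <- trial
}

final_fit <- lm(reformulate(kept, response = "quality"), data = wine)
print(kept)
print(round(summary(final_fit)$adj.r.squared, 3))
```

Note that this sketch tests only the term just added; it does not revisit earlier terms whose p-values may shift as later predictors enter the model.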
Surprisingly, the two sulfur dioxide measures both made it into the model, despite measuring similar qualities, or so the name would suggest.
Additionally, alcohol content was the greatest factor in explaining quality ratings. Why don’t winemakers raise the alcohol volume of their wares across the board? Maybe it costs more, or some varietals do not attain as high an alcohol level. Or perhaps some customers simply prefer lower alcohol content.
Above all, tastes vary; winemakers may not grade quality in the same way that the tasters for this dataset did. A wine may have niche appeal, giving it a low rating, but the winemaker cares about his core customer, be that himself, his family or his coterie of devotees.
The model serves as a rough guide to which qualities please the median professional wine taster. Whether that aligns with popular tastes is another matter, and is of vital importance to the commercial vintner.
Alcohol by volume proved to be the best single predictor of high quality ratings. We can see a marked step between wines rated 3 to 5 and those rated higher, with more alcohol by volume tending to accompany a higher rating.
This plot puts scatter plots for all the other physical factors on one panel, with the factor on the x-axis and the quality rating on the y-axis. Closer investigation of these plots individually is often a first step in exploratory data analysis, to determine the nature and strength of a relationship between potential predictors and the outcome.
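To make the size of the effect concrete, the coefficients reported for the single-predictor model (`quality ~ alcohol`) can be applied to a few alcohol levels. The alcohol values chosen here are arbitrary illustrations within the observed range.

```r
# Coefficients copied from the quality ~ alcohol regression output.
intercept <- 1.87497
slope     <- 0.36084

# Illustrative alcohol levels, in % by volume.
abv <- c(9, 10, 11, 12)

# Predicted quality rating at each level.
predicted_quality <- intercept + slope * abv
print(data.frame(abv, predicted_quality))
```

Each additional percentage point of alcohol raises the predicted rating by about 0.36, so a wine at 12% is predicted to score roughly one point higher than a wine at 9%.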
Here we plot the actual quality values against what we predict them to be, based on our regression analysis. We can see that there is considerable clustering around the points (c, c), where c is an integer and 3 <= c <= 8. But there are still many points that are not quite as predicted, which is consistent with the model explaining only about 36% of the variance in the quality ratings.
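The agreement between predicted and actual ratings can also be put into numbers. The sketch below does so on simulated data (not the wine measurements), using a single assumed predictor and integer ratings clipped to the 3 to 8 range.

```r
set.seed(3)
n <- 1599
# Simulated stand-in data (NOT the wine measurements).
alcohol <- rnorm(n, mean = 10.4, sd = 1)
# Integer ratings, clipped to the observed 3..8 range.
quality <- pmin(8, pmax(3, round(2 + 0.35 * alcohol + rnorm(n, sd = 0.7))))
wine <- data.frame(alcohol, quality)

fit <- lm(quality ~ alcohol, data = wine)
predicted <- fitted(fit)

# Share of wines whose rounded prediction equals the actual rating.
exact_match <- mean(round(predicted) == wine$quality)

# Root-mean-square error, in rating points.
rmse <- sqrt(mean((wine$quality - predicted)^2))

print(round(c(exact_match = exact_match, rmse = rmse), 3))
```

Because the ratings are integers while the predictions are continuous, the exact-match share and the RMSE describe different aspects of the fit and are best read together.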
Wine is shrouded in mystique, intrigue, and doubt. Many would say the quality of a wine cannot be merely reduced to physical factors, that it is more than the sum of its parts. In light of such sentiment, it was fun to examine this dataset and see if indeed there is a way to predict wine quality, as measured by professional tasters.
I was surprised to find that higher alcohol content was such a factor. Without knowledge of the wine industry, there is no clear explanation. My hunch is that higher alcohol content takes longer to achieve, which drives up the cost. Since the cost is higher, one would tend to use higher quality grapes to get a better return. Perhaps an additional variable of the fermentation length would help confirm or reject this hypothesis.
I found it easy to compare the relationship of quality to the various physical factors.
For future work, I would divide the dataset into training and test sets, to ensure that the resulting regression model was robust, and not the result of overfitting.
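A minimal sketch of such a split is shown below, again on simulated data rather than the wine measurements. The 80/20 proportion, the two predictors and the generating coefficients are all assumptions for illustration.

```r
set.seed(11)
n <- 1599
# Simulated stand-in data (NOT the wine measurements).
wine <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.5, sd = 0.18)
)
wine$quality <- 2 + 0.35 * wine$alcohol - 1.2 * wine$volatile.acidity +
  rnorm(n, sd = 0.7)

# Hold out a random 20% of the rows as a test set.
test_idx <- sample(n, size = round(0.2 * n))
train <- wine[-test_idx, ]
test  <- wine[test_idx, ]

# Fit on the training rows only.
fit <- lm(quality ~ alcohol + volatile.acidity, data = train)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
train_rmse <- rmse(train$quality, fitted(fit))
test_rmse  <- rmse(test$quality, predict(fit, newdata = test))

print(round(c(train = train_rmse, test = test_rmse), 3))
```

A test error much larger than the training error would point to overfitting; similar values suggest the model generalizes to wines it has not seen.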