In Vino, Veritas by Asher Meyers

Red Wine Quality: An Introduction

What makes wine taste good?

Wine tasters rated 1599 red wines for quality, and eleven quantitative chemical properties were measured for each wine, in an effort to link wine quality to observable physical factors.

These are the 11 physical factors, taken from the dataset description:

1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

The output variable (based on sensory data):

12 - quality (score between 0 and 10)

Univariate Plots Section

First, let’s look at some summary statistics about the quality ratings in our sample. Wine quality is assessed on a 0 to 10 point scale, from terrible to superb.

Distribution of Quality Values

Let’s start with a histogram:

The ratings are roughly normally distributed, ranging from 3 to 8, with most of the wines scoring a 5 or 6.

Quality Summary Statistics

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Quality ratings vary from 3 to 8, with a mean of 5.6 and a median of 6. The first and third quartiles are 5 and 6, so at least half of the wines scored a 5 or 6, and relatively few scored better than a 6 or worse than a 5.

Univariate Plots for Physical Factors

Let’s look at one of the physical factors, one we’re all familiar with: alcohol concentration.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Half of the wines fall between 9.5% and 11.1% alcohol by volume, with a tail reaching up to 14.9%.

For your casual viewing, here are histograms for the other physical factors:

Univariate Analysis

What are the main feature(s) of interest in your dataset?

The relationship between the physical factors and the quality ratings. If we can determine this relationship, it can inform future winemaking, to make better quality wines.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The dataset is clean and has no missing values, so every observation can be used without adjusting the raw data.

Did you create any new variables from existing variables in the dataset?

I did not.

Of the features you investigated, were there any unusual distributions?

The citric acid factor seems to have a bimodal distribution, with peaks at 0 and at approximately 0.4 g/dm^3. The fixed acidity and sulfur dioxide histograms appear to be somewhat right-skewed.

Did you perform any operations on the data to tidy, adjust, or change the
form of the data? If so, why did you do this?

The dataset was clean, and not missing any data, making such adjustments unnecessary.

Bivariate Plots Section

Based on the physical factor names alone, we can expect some correlation among the factors: for instance, between total and free sulfur dioxide, and among the different indicators of acidity.

##                  fixed.acidity volatile.acidity citric.acid quality
## fixed.acidity             1.00            -0.26        0.67    0.12
## volatile.acidity         -0.26             1.00       -0.55   -0.39
## citric.acid               0.67            -0.55        1.00    0.23
## quality                   0.12            -0.39        0.23    1.00

##                      free.sulfur.dioxide total.sulfur.dioxide quality
## free.sulfur.dioxide                 1.00                 0.67   -0.05
## total.sulfur.dioxide                0.67                 1.00   -0.19
## quality                            -0.05                -0.19    1.00

##                          quality
## alcohol               0.47616632
## volatile.acidity     -0.39055778
## sulphates             0.25139708
## citric.acid           0.22637251
## total.sulfur.dioxide -0.18510029
## density              -0.17491923
## chlorides            -0.12890656
## fixed.acidity         0.12405165
## pH                   -0.05773139
## free.sulfur.dioxide  -0.05065606
## residual.sugar        0.01373164
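The ranking above can be reproduced along the following lines. This is a minimal sketch on a small synthetic data frame, because the code that loads the data is not shown in this report. The name `wineData` is taken from the regression output, but the values are invented and the helper `rankByCorrelation` is a hypothetical name used only for illustration.

```r
# Synthetic stand-in for the wine data (values are invented for illustration).
set.seed(42)
n <- 200
alcohol <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.5, sd = 0.18)
pH <- rnorm(n, mean = 3.3, sd = 0.15)
quality <- round(5.6 + 0.4 * (alcohol - 10.4) -
                   1.2 * (volatile.acidity - 0.5) + rnorm(n, sd = 0.6))
wineData <- data.frame(alcohol, volatile.acidity, pH, quality)

# Hypothetical helper: correlate every predictor with the outcome and
# sort the results by absolute magnitude, largest first.
rankByCorrelation <- function(data, outcome = "quality") {
  predictors <- setdiff(names(data), outcome)
  r <- sapply(predictors, function(p) cor(data[[p]], data[[outcome]]))
  r[order(abs(r), decreasing = TRUE)]
}

ranked <- rankByCorrelation(wineData)
print(round(ranked, 2))
```

Sorting by absolute value rather than by signed value keeps strong negative predictors, such as volatile acidity, near the top of the list.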

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

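Factors that measured similar properties often correlated with one another: fixed acidity with citric acid (r = 0.67), volatile acidity with citric acid (r = -0.55), and free with total sulfur dioxide (r = 0.67). Fixed and volatile acidity were only weakly related (r = -0.26).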

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)? What was the strongest relationship
you found?

Alcohol was the leading indicator that a wine would be rated of good quality (r = 0.48). Why? Do tasters prefer higher alcohol contents, or do higher quality grapes get fermented for longer? The strength of this indicator, and the ability of vintners to influence it through various production processes, offers a ripe avenue for exploration of future quality improvements.

Multivariate Analysis

For our multivariate analysis, we will perform a series of linear regressions, adding factors in the order of the absolute magnitude of the correlation between the physical factors and the outcome, quality.

## 
## Call:
## lm(formula = quality ~ alcohol, data = wineData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8442 -0.4112 -0.1690  0.5166  2.5888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.87497    0.17471   10.73   <2e-16 ***
## alcohol      0.36084    0.01668   21.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared:  0.2267, Adjusted R-squared:  0.2263 
## F-statistic: 468.3 on 1 and 1597 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity, data = wineData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.59342 -0.40416 -0.07426  0.46539  2.25809 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.09547    0.18450   16.78   <2e-16 ***
## alcohol           0.31381    0.01601   19.60   <2e-16 ***
## volatile.acidity -1.38364    0.09527  -14.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6678 on 1596 degrees of freedom
## Multiple R-squared:  0.317,  Adjusted R-squared:  0.3161 
## F-statistic: 370.4 on 2 and 1596 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = wineData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7186 -0.3820 -0.0641  0.4746  2.1807 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.61083    0.19569  13.342  < 2e-16 ***
## alcohol           0.30922    0.01580  19.566  < 2e-16 ***
## volatile.acidity -1.22140    0.09701 -12.591  < 2e-16 ***
## sulphates         0.67903    0.10080   6.737 2.26e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6587 on 1595 degrees of freedom
## Multiple R-squared:  0.3359, Adjusted R-squared:  0.3346 
## F-statistic: 268.9 on 3 and 1595 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     total.sulfur.dioxide, data = wineData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.72716 -0.38486 -0.06503  0.44980  2.13257 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.8258128  0.2006892  14.081  < 2e-16 ***
## alcohol               0.2953105  0.0160331  18.419  < 2e-16 ***
## volatile.acidity     -1.1985632  0.0966011 -12.407  < 2e-16 ***
## sulphates             0.7121396  0.1005146   7.085 2.08e-12 ***
## total.sulfur.dioxide -0.0022354  0.0005108  -4.376 1.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.655 on 1594 degrees of freedom
## Multiple R-squared:  0.3438, Adjusted R-squared:  0.3421 
## F-statistic: 208.8 on 4 and 1594 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     total.sulfur.dioxide + chlorides, data = wineData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.67443 -0.38254 -0.06368  0.44893  2.07310 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.0048920  0.2037663  14.747  < 2e-16 ***
## alcohol               0.2770979  0.0164836  16.811  < 2e-16 ***
## volatile.acidity     -1.1419024  0.0969400 -11.779  < 2e-16 ***
## sulphates             0.9148320  0.1102702   8.296 2.26e-16 ***
## total.sulfur.dioxide -0.0023096  0.0005082  -4.544 5.92e-06 ***
## chlorides            -1.7047871  0.3916886  -4.352 1.43e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6514 on 1593 degrees of freedom
## Multiple R-squared:  0.3515, Adjusted R-squared:  0.3495 
## F-statistic: 172.7 on 5 and 1593 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     total.sulfur.dioxide + chlorides + pH, data = wineData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.60575 -0.35883 -0.04806  0.46079  1.95643 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.2957316  0.3995603  10.751  < 2e-16 ***
## alcohol               0.2906738  0.0168108  17.291  < 2e-16 ***
## volatile.acidity     -1.0381945  0.1004270 -10.338  < 2e-16 ***
## sulphates             0.8886802  0.1100419   8.076 1.31e-15 ***
## total.sulfur.dioxide -0.0023721  0.0005064  -4.684 3.05e-06 ***
## chlorides            -2.0022839  0.3980757  -5.030 5.46e-07 ***
## pH                   -0.4351830  0.1160368  -3.750 0.000183 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6487 on 1592 degrees of freedom
## Multiple R-squared:  0.3572, Adjusted R-squared:  0.3548 
## F-statistic: 147.4 on 6 and 1592 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     total.sulfur.dioxide + chlorides + pH + free.sulfur.dioxide, 
##     data = wineData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68918 -0.36757 -0.04653  0.46081  2.02954 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.4300987  0.4029168  10.995  < 2e-16 ***
## alcohol               0.2893028  0.0167958  17.225  < 2e-16 ***
## volatile.acidity     -1.0127527  0.1008429 -10.043  < 2e-16 ***
## sulphates             0.8826651  0.1099084   8.031 1.86e-15 ***
## total.sulfur.dioxide -0.0034822  0.0006868  -5.070 4.43e-07 ***
## chlorides            -2.0178138  0.3975417  -5.076 4.31e-07 ***
## pH                   -0.4826614  0.1175581  -4.106 4.23e-05 ***
## free.sulfur.dioxide   0.0050774  0.0021255   2.389    0.017 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6477 on 1591 degrees of freedom
## Multiple R-squared:  0.3595, Adjusted R-squared:  0.3567 
## F-statistic: 127.6 on 7 and 1591 DF,  p-value: < 2.2e-16

After adding predictors in descending order of the absolute value of their Pearson correlation with quality, and removing any whose p-value exceeded 0.05, we’re left with a seven-factor model comprising alcohol, volatile acidity, sulphates, total sulfur dioxide, chlorides, pH and free sulfur dioxide, listed in the order they were added. This model has an adjusted R-squared of 0.3567.
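The selection procedure can be sketched as a short loop. This is only an illustration under stated assumptions: the data are synthetic, the column `noise` is an invented predictor that is unrelated to quality by construction, and the rule of checking only the p-value of the newly added predictor is a guess at the procedure, not a record of the code actually run.

```r
# Synthetic stand-in data (invented values); `noise` is unrelated to quality.
set.seed(1)
n <- 300
wineData <- data.frame(
  alcohol = rnorm(n, 10.4, 1),
  volatile.acidity = rnorm(n, 0.5, 0.18),
  noise = rnorm(n)
)
wineData$quality <- 5.6 + 0.4 * (wineData$alcohol - 10.4) -
  1.2 * (wineData$volatile.acidity - 0.5) + rnorm(n, sd = 0.6)

# Candidate predictors, ordered by absolute correlation with quality.
candidates <- setdiff(names(wineData), "quality")
r <- sapply(candidates, function(p) cor(wineData[[p]], wineData$quality))
candidates <- candidates[order(abs(r), decreasing = TRUE)]

# Add one predictor at a time; keep it only if its own p-value is <= 0.05.
kept <- character(0)
for (p in candidates) {
  trial <- c(kept, p)
  fit <- lm(reformulate(trial, response = "quality"), data = wineData)
  pValue <- summary(fit)$coefficients[p, "Pr(>|t|)"]
  if (pValue <= 0.05) kept <- trial
}
print(kept)
```

Because each predictor is tested at the 5% level, an unrelated predictor will still slip through about one time in twenty, which is one reason to validate the final model on held-out data.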

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

There were clear diminishing returns in adding predictors beyond the first, alcohol concentration, which alone gave an R-squared of 0.2267. Adding volatile acidity raised it to 0.317, and each later predictor added less than 0.02, for a final R-squared of 0.3595.
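The diminishing returns can be checked directly from the Multiple R-squared values reported in the seven regression summaries above. The short sketch below copies those values by hand and takes successive differences; the vector names `rSquared` and `gain` are introduced here for illustration.

```r
# Multiple R-squared of each successive model, copied from the summaries above.
# Each entry is named after the predictor added at that step.
rSquared <- c(
  alcohol = 0.2267,
  volatile.acidity = 0.3170,
  sulphates = 0.3359,
  total.sulfur.dioxide = 0.3438,
  chlorides = 0.3515,
  pH = 0.3572,
  free.sulfur.dioxide = 0.3595
)

# Gain in R-squared contributed by each added predictor.
gain <- c(rSquared[1], diff(rSquared))
print(round(gain, 4))
```

After alcohol's 0.2267, the gains are 0.0903, 0.0189, 0.0079, 0.0077, 0.0057 and 0.0023, so the last five predictors together add about 0.04.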

Were there any interesting or surprising interactions between features?

Surprisingly, both sulfur dioxide measures made it into the model, even though their names suggest they measure similar things and they are correlated with each other (r = 0.67).

Additionally, alcohol content was the greatest factor in explaining quality ratings. Why don’t winemakers raise the alcohol volume of their wares across the board? Maybe it costs more, or some varietals do not attain as high an alcohol level. Or perhaps some customers simply prefer lower alcohol content.

Above all, tastes vary; winemakers may not grade quality in the same way that the tasters for this dataset did. A wine may have niche appeal, earning it a low rating, but the winemaker cares about his core customer, be that himself, his family or his coterie of devotees.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

The model serves as a rough guide to which properties are desirable for pleasing the median professional wine taster. Its main limitation is that it explains only about 36% of the variance in quality ratings. Whether the tasters’ preferences align with popular tastes is another matter, and one of vital importance to the commercial vintner.

Final Plots and Summary

Plot One

Description One

Alcohol by volume proved to be the best single predictor of high quality ratings. We can see a marked step up between quality ratings 3-5 and the higher ratings, and that more alcohol by volume tends to go with a higher rating.
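A comparison of this kind can be sketched by grouping alcohol content by quality rating. The data below are synthetic and invented for illustration, with an upward trend built in, so the sketch shows only the mechanics and not the real result.

```r
# Synthetic stand-in data (invented values) with a built-in upward trend.
set.seed(3)
n <- 300
quality <- sample(3:8, n, replace = TRUE)
alcohol <- 9.5 + 0.35 * (quality - 5) + rnorm(n, sd = 0.8)
wineData <- data.frame(alcohol, quality)

# Median alcohol content within each quality rating.
medianAlcohol <- tapply(wineData$alcohol, wineData$quality, median)
print(round(medianAlcohol, 2))

# The same grouping as box plot statistics; drop plot = FALSE to draw it.
boxStats <- boxplot(alcohol ~ quality, data = wineData, plot = FALSE)
```

The third row of `boxStats$stats` holds the group medians, matching the `tapply` result.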

Plot Two

Description Two

This plot puts on one panel scatter plots of all the other physical factors, each on the x-axis, against the quality rating on the y-axis. Closer investigation of these plots individually is often a first step in exploratory data analysis, to determine the nature and strength of a relationship between potential predictors and the outcome.

Plot Three

Description Three

Here we plot the actual quality values against the values predicted by our regression analysis. There is considerable clustering around the points (c, c), where c is an integer and 3 <= c <= 8, which is where a perfect prediction would fall. But many points lie away from them, meaning our model often fails to predict the quality rating accurately.
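One way to put numbers on this comparison is to round the fitted values to the nearest whole rating and count how often they match. The sketch below does this on synthetic data with a single predictor; the data and the summary names `exactShare`, `withinOneShare` and `rmse` are assumptions made for illustration, not results from the wine dataset.

```r
# Synthetic stand-in data (invented values), ratings clipped to the 3-8 range.
set.seed(7)
n <- 300
alcohol <- rnorm(n, 10.4, 1)
quality <- pmin(8, pmax(3, round(5.6 + 0.4 * (alcohol - 10.4) +
                                   rnorm(n, sd = 0.7))))
wineData <- data.frame(alcohol, quality)

fit <- lm(quality ~ alcohol, data = wineData)
predicted <- fitted(fit)

# Round the continuous predictions to the nearest whole rating and compare.
rounded <- round(predicted)
exactShare <- mean(rounded == wineData$quality)
withinOneShare <- mean(abs(rounded - wineData$quality) <= 1)

# Root mean squared error of the unrounded predictions.
rmse <- sqrt(mean((predicted - wineData$quality)^2))

# Cross-tabulate actual ratings against rounded predictions.
print(table(actual = wineData$quality, predicted = rounded))
```

The diagonal of the table corresponds to the points (c, c) in the plot; everything off the diagonal is a miss.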

Reflection

Wine is shrouded in mystique, intrigue, and doubt. Many would say the quality of a wine cannot be reduced merely to physical factors, that it is more than the sum of its parts. In light of such sentiment, it was fun to examine this dataset and see if indeed there is a way to predict wine quality, as measured by professional tasters.

I was surprised to find that higher alcohol content was such a factor. Without knowledge of the wine industry, I have no clear explanation. My hunch is that higher alcohol content takes longer to achieve, which drives up the cost. Since the cost is higher, one would tend to use higher quality grapes to get a better return. Perhaps an additional variable for fermentation length would help confirm or reject this hypothesis.

I found it easy to compare the relationship of quality to the various physical factors.

For future work, I would divide the dataset into training and test sets, to ensure that the resulting regression model was robust, and not the result of overfitting.
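A minimal sketch of such a split follows, again on synthetic data. The 80/20 proportion, the two-predictor model and the helper `rmse` are assumptions chosen for illustration.

```r
# Synthetic stand-in data (invented values).
set.seed(123)
n <- 400
wineData <- data.frame(
  alcohol = rnorm(n, 10.4, 1),
  volatile.acidity = rnorm(n, 0.5, 0.18)
)
wineData$quality <- 5.6 + 0.4 * (wineData$alcohol - 10.4) -
  1.2 * (wineData$volatile.acidity - 0.5) + rnorm(n, sd = 0.6)

# Randomly assign 80% of the rows to training and the rest to testing.
trainRows <- sample(seq_len(n), size = floor(0.8 * n))
trainSet <- wineData[trainRows, ]
testSet <- wineData[-trainRows, ]

# Fit on the training rows only.
fit <- lm(quality ~ alcohol + volatile.acidity, data = trainSet)

# Hypothetical helper: root mean squared error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

trainRmse <- rmse(trainSet$quality, fitted(fit))
testRmse <- rmse(testSet$quality, predict(fit, newdata = testSet))
print(c(train = trainRmse, test = testRmse))
```

A test error much larger than the training error would point to overfitting; similar values suggest the model generalizes to wines it has not seen.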