Table of Contents

  1. Introduction
  2. Exploration
    1. General Statistics
    2. Quality Distributions
    3. Related Variables (Intuitive Relation)
    4. Other Related Variables (Unintuitive Relation)
    5. The Outcome Variable – Quality
  3. Conclusion
    1. Final Plots
    2. Summary

1. Introduction

In the following we will explore two data sets. Both document the quality of wine alongside its physicochemical properties, one for red wines and one for white wines. The wines are red and white variants of the Portuguese Vinho Verde. Documentation of the entire study can be found here. Documentation of the data set and its variables can be found here. To look for patterns in wine overall, I combined the two data sets and added a categorical variable called type, which denotes whether a particular observation is a white or a red wine. This should not impede any analysis of the individual types of wine, but it should streamline the analysis of how quality relates to the physicochemical properties of the wine as a whole.

2. Exploration

2.1 General Statistics

Note: After the loading of the datasets, an additional variable, type, was added to indicate the type of wine; Red Wine for red wine and White Wine for white wine. This was to help identify the type of wine when both datasets were combined.

  • Dimensions
Type        Observations  Variables
Red Wine    1599          13
White Wine  4898          13

Note: There are considerably more white wine observations than red wine observations. This may or may not make a difference when analyzing the two data sets as one. I have therefore created two combinations: one is a straight combination of the two data sets; the other combines the full red wine data set with a random sample of the white wine data set, drawn so that the number of white wine observations equals the number of red wine observations. We will conduct analysis on both, and each title denotes which combination is used: Unequal or Equalized.
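
The Equalized combination described above can be sketched as follows. This is a minimal Python illustration, not the code used to produce this report: the helper name `downsample`, the fixed seed, and the placeholder rows are all assumptions, and only the group sizes (1599 and 4898) come from the Dimensions table.

```python
import random

def downsample(rows, n, seed=0):
    """Return n rows drawn at random without replacement (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(rows, n)

# Placeholder rows standing in for the real observations; only the counts are real.
red = [{"type": "Red Wine", "row": i} for i in range(1599)]
white = [{"type": "White Wine", "row": i} for i in range(4898)]

unequal = red + white                            # straight combination
equalized = red + downsample(white, len(red))    # white wines cut down to the red count

print(len(unequal), len(equalized))
```

Fixing the seed keeps the Equalized sample stable between runs, which matters here because the rarest quality scores have only a handful of white wine observations and can easily change from one draw to the next.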

  • Variables (both datasets have identical covariates)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "type"

Note: the X variable is an index variable and has been removed, as it would interfere with the combination of the two data sets and is not needed.

Description of the Variables

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at levels that are too high can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of \(SO_2\) exists in equilibrium between molecular \(SO_2\) (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of \(SO_2\); in low concentrations, \(SO_2\) is mostly undetectable in wine, but at free \(SO_2\) concentrations over 50 ppm, \(SO_2\) becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (\(SO_2\)) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

  • Summarizations for Red Wines
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality          type          
##  Min.   :3.000   Length:1599       
##  1st Qu.:5.000   Class :character  
##  Median :6.000   Mode  :character  
##  Mean   :5.636                     
##  3rd Qu.:6.000                     
##  Max.   :8.000
  • Summarizations for White Wines
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality          type          
##  Min.   :3.000   Length:4898       
##  1st Qu.:5.000   Class :character  
##  Median :6.000   Mode  :character  
##  Mean   :5.878                     
##  3rd Qu.:6.000                     
##  Max.   :9.000

When looking at the summaries of the physicochemical properties, you can notice some major outliers in some of the covariates, namely residual.sugar, chlorides, sulphates, and total.sulfur.dioxide. The maximum values of these variables are far above their third quartiles, while the mean and median stay within a relatively small range. If we were to consider this data set for predictive modeling, it may be beneficial to remove these outliers. Removing these observations could also change the correlations among the covariates and between the covariates and quality, possibly strengthening them.
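
One rough way to check the outlier claim uses only the numbers printed in the Red Wine summary above. The Python sketch below applies the common 1.5 × IQR rule (Tukey's upper fence). That rule is an assumption chosen for illustration, not the method used in this report, and the function name `upper_fence` is hypothetical.

```python
# (1st quartile, 3rd quartile, maximum) for Red Wine, copied from the summary output.
red_summary = {
    "residual.sugar": (1.9, 2.6, 15.5),
    "chlorides": (0.07, 0.09, 0.611),
    "sulphates": (0.55, 0.73, 2.0),
    "total.sulfur.dioxide": (22.0, 62.0, 289.0),
}

def upper_fence(q1, q3, k=1.5):
    """Upper Tukey fence: Q3 + k * IQR (k = 1.5 is a convention, assumed here)."""
    return q3 + k * (q3 - q1)

flagged = {}
for name, (q1, q3, maximum) in red_summary.items():
    fence = upper_fence(q1, q3)
    flagged[name] = maximum > fence  # True when the maximum lies beyond the fence
    print(f"{name}: fence={fence:.3f}, max={maximum}, beyond fence={flagged[name]}")
```

For all four variables the maximum lies well beyond the fence, which is consistent with the observation that the maxima sit far from the bulk of the data.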


2.2 Distributions between Red and White Wines

Unequal Observations

Red Wine’s Table of Quality
##   Quality Frequency
## 1       3        10
## 2       4        53
## 3       5       681
## 4       6       638
## 5       7       199
## 6       8        18
White Wine’s Table of Quality
##   Quality Frequency
## 1       3        20
## 2       4       163
## 3       5      1457
## 4       6      2198
## 5       7       880
## 6       8       175
## 7       9         5

Equalized Observations

Red Wine’s Table of Quality
##   Quality Frequency
## 1       3        10
## 2       4        53
## 3       5       681
## 4       6       638
## 5       7       199
## 6       8        18
White Wine’s Table of Quality
##   Quality Frequency
## 1       3         2
## 2       4        50
## 3       5       441
## 4       6       743
## 5       7       296
## 6       8        65
## 7       9         2

Note: The distributions of the outcome variable, quality, appear roughly normal: in every table the counts peak at a quality of 5 or 6 and fall away on both sides.
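
As a quick consistency check, the Unequal frequency tables above can be turned back into a count, a mean, and the share of wines scored 5 or 6. The Python sketch below is only an illustration; the function name `describe` is hypothetical, and the counts are copied from the tables.

```python
# Quality -> frequency, copied from the Unequal Observations tables above.
red_quality = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}
white_quality = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

def describe(freq):
    """Return (number of wines, mean quality, share of wines scored 5 or 6)."""
    n = sum(freq.values())
    mean = sum(score * count for score, count in freq.items()) / n
    middle_share = (freq.get(5, 0) + freq.get(6, 0)) / n
    return n, mean, middle_share

for label, freq in (("Red Wine", red_quality), ("White Wine", white_quality)):
    n, mean, middle_share = describe(freq)
    print(f"{label}: n={n}, mean={mean:.3f}, share scored 5 or 6={middle_share:.1%}")
```

The counts and means recovered this way match the Length and Mean rows of the summaries in the General Statistics section, and the large middle share shows how strongly the scores cluster around 5 and 6.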

Distributions of the Prediction Variables

You may notice that all values range between 0 and 1. I have normalized every variable so that 0 is its minimum value and 1 is its maximum value.

My Reasoning for 0-1 Normalization

  • To concisely represent all the variables and their distributions without having variables on larger scales visually warp the variables on smaller scales.
  • We are concerned with distribution, and for that purpose the actual values matter less than visualizing how those values are spread within their range.
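
The 0-1 normalization described above is ordinary min-max scaling, \((x - min) / (max - min)\). The Python sketch below applies it to the five-number summary of residual.sugar for Red Wine, taken from the summary output. The function name `min_max` is hypothetical, and this is an illustration rather than the code behind the plots.

```python
def min_max(values):
    """Rescale values so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable has no spread; map everything to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Min, 1st quartile, median, 3rd quartile, max of residual.sugar for Red Wine.
red_sugar = [0.9, 1.9, 2.2, 2.6, 15.5]
scaled = min_max(red_sugar)

for label, value in zip(["min", "Q1", "median", "Q3", "max"], scaled):
    print(f"{label}: {value:.3f}")
```

After scaling, the quartiles and the median all land close to 0, so a single extreme maximum is enough to squeeze the whole box toward the bottom of the plot.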

Things to observe

  • We immediately note that certain properties have very narrow or very wide distributions depending on the wine type.
  • Aside from a few properties, the majority of the boxes sit below the midpoint of the normalized range. This could indicate that we have some extreme outliers.

2.5 Quality – The Outcome Variable

Applying a regression line to the individual plots gives us an idea of how the physicochemical properties may or may not affect the perceived quality of the wine. Here are some observations that are uniform across both wine types, White Wine and Red Wine.

  • As expected, as the amount of volatile.acidity (acetic acid; vinegar flavor) in wine increases, the perceived quality decreases.
  • The perceived quality of both wines declines as the amount of chlorides, or salt, increases.
    • This type of salt sensitivity in White Wine could be because White Wine typically lacks the tannins present in Red Wine. Tannins, as well as the color, are derived from the skin of the grape. For example, Pinot Noir is used both in Champagne, which is white, and in Burgundian wines, which are red. The key difference in color is that the skin is not present in Champagne. More information about tannins can be found here.
  • As density increases, perceived quality declines. It is unclear, from this examination, how much density actually affects quality because…
  • The directly related alcohol variable has the opposite effect on quality to a very similar degree. Since we know that density is directly related to alcohol, in that alcohol changes density, it would be more prudent to say that alcohol affects both density and quality. In other words, density is a result of alcohol, and its correlation with quality is not evidence of causality.
  • As sulphates increase, so does quality. sulphates are used as an antimicrobial, so it would make sense that the cleaner your wine is, the better it tastes.
  • total.sulfur.dioxide can have a pungent and repelling aroma, so it makes sense that in both wines an increased amount results in a lower quality level.
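
The regression lines referred to above are straight lines fitted by least squares, and the sign of the slope is what the observations describe. The Python sketch below shows the calculation on a handful of made-up points. The values in `toy_alcohol` and `toy_quality` are invented for illustration and are not taken from the wine data, and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented toy values, NOT observations from the data set.
toy_alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
toy_quality = [5, 5, 6, 6, 7]

slope, intercept = fit_line(toy_alcohol, toy_quality)
direction = "increases" if slope > 0 else "decreases"
print(f"slope={slope:.2f}, intercept={intercept:.2f}: quality {direction} with alcohol")
```

A positive slope corresponds to statements such as "as sulphates increase, so does quality", and a negative slope to statements such as "as density increases, perceived quality declines". The slope describes association only, which is why the density and alcohol bullets caution against reading it as causality.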

We have looked at physicochemical properties that affect the different types of wine uniformly. Now let us examine some attributes that have distinct effects on the different types of wine and what we may be able to conclude from them.

Fixed Acidity’s effect on the different types of Wine

We notice that increased fixed.acidity has strikingly different effects on the two types of wine.

  • As fixed.acidity increases in Red Wine, the perceived quality seems to increase as well; the opposite seems to happen with White Wine.
  • As stated here, Red Wine tends to contain more tannins, which add bitterness and astringency to wine. This combination of acidity and tannins could be well perceived by wine tasters.
  • The lack of tannins in White Wine could make the acidity in White Wine more apparent and therefore undesired at high levels.
  • This could explain why wine makers rarely seem to produce White Wine with fixed.acidity beyond \(12 g/dm^3\).

Note: The same effect can be observed between citric.acid and quality, and since the two variables are so closely related, the same arguments can be used to explain the pattern.

Residual Sugar’s effect on the types of Wine

residual.sugar levels have opposite effects on perceived quality. We could infer…

  • The grey areas represent the bulk of the data (25% quantile - 75% quantile). By the time residual.sugar levels get to \(15 g/dm^3\) in Red Wine, the grey area becomes increasingly large and further away from the regression line. This could indicate that although the regression line shows the trend of the data, it may not be the most accurate indicator of the relationship between quality and residual.sugar. For now we will err on the side of assuming that the trend is accurate enough.
  • Red Wine’s tendency to have more tannins, and therefore more bitterness and astringency, may mean it benefits from more residual.sugar.
  • In White Wine, by contrast, residual.sugar may be more apparent and therefore less desired at higher levels.

pH’s relationship to the different Wine Types

As expected, pH has the inverse effect on quality when compared to acidity, since a lower pH means a more acidic wine. This reiterates that tasters prefer more acid in Red Wines but not in White Wines.

3. Conclusion

Final Plots

As far as the outcome variable, quality, is concerned, there were some distinct relationships I noticed.

Perception of Acidity

It is a commonly held belief that acid is needed in wine, both to pair with food and to enjoy. Being in the service industry myself, I have heard this many times. White Wine is most often the one noted for its perceived acidity. It was interesting to see not only that Red Wine typically has higher fixed.acidity (a mean of 8.32 versus 6.855 in the summaries above), but also that White Wine is negatively perceived when acid increases.

The Want for more Alcohol

Another widely accepted idea is that alcohol diminishes your ability to taste, and therefore higher-alcohol wines are not ideal for pairing with food. It is interesting to me, however, that when tasting wine, alcohol is well received. It could simply be the fact that increased alcohol content lowers the density of wine, as we have seen, or that people just like higher-proof drinks and our palates are sensitive enough to notice.

The Typicity of Wine

In wine tasting, you will often hear about the Typicity of wine. Typicity is defined as “the degree to which a wine reflects its varietal origins, and thus demonstrates the signature characteristics of the grape from which it was produced, i.e., how much a Merlot wine ‘tastes like a Merlot’.” Looking at the box plot, you can see certain properties whose values are concentrated in a very narrow range depending on the type of wine. For example, residual.sugar, for Red Wine, is highly concentrated in a small area. If we have an abundance of data, like ours, and such a narrow distribution of a key predictor, it could suggest Red Wine’s typicity.

Summary

When trying to extrapolate patterns in the physicochemical properties of wine, it is better to look at Red and White wines separately. Even though they share a lot of similar relationships, it is clear that in some cases, what works for one does not work for the other.

I also noticed some redundancy among the chosen variables. If predictive modeling were of interest, I would suggest trying dimensionality reduction techniques: for example, exploring the possibility of combining fixed.acidity, volatile.acidity, citric.acid, and pH into a new variable that preserves model integrity while speeding up any random forest or logistic regression modeling.

All in all, I really enjoyed this assignment and would appreciate some more up-to-date data sets on this subject matter. I would also suggest using a more widely familiar varietal, such as Pinot Noir, or perhaps a wide list of varietals, where the predictors would be the physicochemical properties and the outcome variable would be the type of wine, e.g. Pinot Noir, Cabernet Sauvignon, or Merlot.