In the following we will explore two data sets. Both document the quality of wines and their associated physicochemical properties, one for red wines and one for white wines. The wines are red and white variants of the Portuguese Vinho Verde. Documentation of the entire study can be found here. Documentation of the data set and its variables can be found here. To look for patterns in the overall designation of wine, I combined the two data sets and added a categorical variable called type, which denotes whether a particular observation is a white or a red wine. This should not impede any analysis of the individual types of wine, but it should streamline the analysis of the relationship between quality and the physicochemical properties of the wine as a whole.
Note: After loading the data sets, an additional variable, type, was added to indicate the type of wine: Red Wine for red wine and White Wine for white wine. This helps identify the type of wine once both data sets are combined.
| Type | Observations | Variables |
|---|---|---|
| Red Wine | 1599 | 13 |
| White Wine | 4898 | 13 |
Note: There are considerably more white wine observations than red wine observations. This may or may not make a difference when analyzing the two data sets as one. I have therefore created two combinations: one is a straight combination of the two data sets; the other combines the red wine data set with a random sample of the white wine data set, drawn so that the number of white wine observations equals the number of red wine observations. We will conduct analysis on both, so the type of data set will be denoted in the title as Equalized or Unequal.
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [13] "type"
Note: the X variable is an index variable and will be removed, as it would interfere with the combination of the two data sets and is unneeded.
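The loading and combination steps described above could look roughly like the following in R. This is a minimal sketch and not the code used for this report: the data frames red and white are small synthetic stand-ins for the two loaded files, and the names make_wines, wines_unequal, wines_equalized, and the seed are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed so the random sample is reproducible

# Synthetic stand-ins for the two loaded data sets (assumption: the real
# files carry an index column X plus the measured variables).
make_wines <- function(n) {
  data.frame(
    X       = seq_len(n),
    alcohol = runif(n, min = 8, max = 15),
    quality = sample(3:9, n, replace = TRUE)
  )
}
red   <- make_wines(1599)
white <- make_wines(4898)

# Drop the index column and label each data set with its wine type.
red$X   <- NULL
white$X <- NULL
red$type   <- "Red Wine"
white$type <- "White Wine"

# "Unequal": straight combination of both data sets.
wines_unequal <- rbind(red, white)

# "Equalized": all red wines plus an equally sized random sample of white wines.
white_sample    <- white[sample(nrow(white), nrow(red)), ]
wines_equalized <- rbind(red, white_sample)

table(wines_equalized$type)
```

Sampling without replacement keeps every white wine observation in the Equalized data set unique, at the cost of discarding roughly two thirds of the white wines.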
Description of the Variables
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of \(SO_2\) exists in equilibrium between molecular \(SO_2\) (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of \(SO_2\); in low concentrations, \(SO_2\) is mostly undetectable in wine, but at free \(SO_2\) concentrations over 50 ppm, \(SO_2\) becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (\(SO_2\)) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality type
## Min. :3.000 Length:1599
## 1st Qu.:5.000 Class :character
## Median :6.000 Mode :character
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality type
## Min. :3.000 Length:4898
## 1st Qu.:5.000 Class :character
## Median :6.000 Mode :character
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
When looking at the distributions of the physicochemical properties, you can notice some major outliers in several of the covariates, namely residual.sugar, chlorides, sulphates, and total.sulfur.dioxide. The maximum values of these variables lie far above their third quartiles, while the mean and median stay within a relatively small range. If we were to consider this data set for predictive modeling, it may be beneficial to remove these outliers. Removing these observations could also increase the correlations among the covariates as well as with the quality variable.
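One common way to flag such values is the 1.5 × IQR rule used for box-plot whiskers. The report does not say which rule, if any, would be applied, so the helper flag_outliers and the example values below are illustrative assumptions and not part of the original analysis.

```r
# Flag values lying more than k * IQR outside the first or third quartile.
# flag_outliers is a hypothetical helper, not a base R function.
flag_outliers <- function(x, k = 1.5) {
  q     <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Made-up values loosely resembling residual.sugar for red wine.
sugar   <- c(1.9, 2.0, 2.2, 2.3, 2.6, 15.5)
flagged <- flag_outliers(sugar)

sugar[flagged]   # the values that would be treated as outliers
sugar[!flagged]  # the values that would be kept
```

Whether flagged observations should actually be dropped is a modeling decision; an extreme value may still be a legitimate wine.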
## Quality Frequency
## 1 3 10
## 2 4 53
## 3 5 681
## 4 6 638
## 5 7 199
## 6 8 18
## Quality Frequency
## 1 3 20
## 2 4 163
## 3 5 1457
## 4 6 2198
## 5 7 880
## 6 8 175
## 7 9 5
## Quality Frequency
## 1 3 10
## 2 4 53
## 3 5 681
## 4 6 638
## 5 7 199
## 6 8 18
## Quality Frequency
## 1 3 2
## 2 4 50
## 3 5 441
## 4 6 743
## 5 7 296
## 6 8 65
## 7 9 2
Note: The distributions of the outcome variable, quality, appear to be approximately normal, with most wines scoring 5 or 6.
You may notice that the values range between 0 and 1. I have normalized all the variables so that 0 is the minimum value and 1 is the maximum value.
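The 0-1 normalization described above is commonly implemented as min-max scaling. The sketch below assumes that is what was done here; normalize01 is a hypothetical helper name, and the alcohol values are made up for illustration.

```r
# Min-max scaling: maps the smallest value to 0 and the largest to 1.
# Note: a constant column would produce NaN because its range is zero.
normalize01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

alcohol <- c(8.4, 9.5, 10.2, 11.1, 14.9)  # illustrative values only
scaled  <- normalize01(alcohol)
range(scaled)

# Applied to every numeric column of a data frame df, this could be written as:
# num_cols <- sapply(df, is.numeric)
# df[num_cols] <- lapply(df[num_cols], normalize01)
```

Because the scaling is linear, it changes the units of a regression slope but not its sign or the correlation between variables.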
My Reasoning for 0-1 Normalization
Things to observe
Applying a regression line to the individual plots gives us an idea of how the physicochemical properties may or may not affect the perceived quality of the wine. Here are some observations that are uniform across both wine types, White Wine and Red Wine.
- As volatile.acidity in wine increases (acetic acid; vinegar flavor), the perceived quality decreases.
- The quality of both wines declines as the amount of chlorides, or salt, increases.
- [...] White Wine could be because White Wine typically lacks the tannins present in Red Wine. Tannins, as well as the color, are derived from the skin of the grape. For example, Pinot Noir is used both in Champagne, which is white, and in Burgundian wines, which are red. The key difference in color is that the skin is not present in Champagne. More information about tannins can be found here.
- As density increases, perceived quality declines. It is unclear, from this examination, how much density actually affects quality because…
- The alcohol variable has the opposite effect on quality to a very similar degree. Since we know that density is directly related to alcohol, in that alcohol causes a change in density, it would be more prudent to say that alcohol affects both density and quality; in other words, density is a result of alcohol, and its correlation with quality is not evidence of causality.
- As sulphates increase, so does quality. Sulphates are used as an antimicrobial, so it would make sense that the cleaner your wine is, the better it tastes.
- total.sulfur.dioxide can have a pungent and repelling aroma, and it makes sense that in both wines an increased amount results in a lower quality level.

We have looked at physicochemical properties that affect the different types of wine uniformly. Now let us examine some attributes that have distinct effects on the different types of wine and what we may be able to conclude from them.
We notice that increased fixed.acidity has strikingly different effects on the two types of wine.
- As fixed.acidity increases in Red Wine, the perceived quality also seems to increase, and the opposite seems to happen with White Wine.
- Red Wine tends to contain more tannins, which add bitterness and astringency to the wine. This combination of acidity and tannins could be well perceived by wine tasters.
- The lack of tannins in White Wine could make the perceived acidity in White Wine more apparent and therefore undesired at high levels.
- [...] White Wine with fixed.acidity beyond \(12 g/dm^3\)

Note: This same effect can be observed between citric.acid and quality, but since these variables are so closely related, the same arguments can be used to justify the patterns.
residual.sugar levels have opposite effects on perceived quality in the two types of wine. We could infer…
- As residual.sugar levels get to \(15 g/dm^3\) in Red Wine, the grey area becomes increasingly large and extends further away from the regression line. This could indicate that, although the regression line shows the trend of the data, it may not be the most accurate indicator of the relationship between quality and residual.sugar. For now we will err on the side that the trend is accurate enough.
- Red Wine’s tendency to have more tannins, and therefore bitterness and astringency, may benefit from more residual.sugar.
- In White Wine, residual.sugar may be more apparent and therefore less desired at higher levels.

As expected, pH has the inverse effect on quality when compared to acidity. This again reiterates that tasters prefer more acid in Red Wines but not in White Wines.
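The per-type regression lines discussed in this section can also be compared numerically by fitting one simple linear model per wine type and reading off the slope. The sketch below uses synthetic data, not the wine data: the opposite slopes are built in purely to mirror the fixed.acidity pattern described above, and the object names wines and slopes are assumptions.

```r
set.seed(1)
n <- 200

# Synthetic stand-in data with a predictor already scaled to the 0-1 range.
wines <- data.frame(
  type          = rep(c("Red Wine", "White Wine"), each = n),
  fixed.acidity = runif(2 * n)
)

# Built-in assumption for illustration: positive trend for red, negative for white.
true_slope    <- ifelse(wines$type == "Red Wine", 1, -1)
wines$quality <- 5 + true_slope * wines$fixed.acidity + rnorm(2 * n, sd = 0.5)

# Fit quality ~ fixed.acidity separately for each wine type and keep the slope.
slopes <- sapply(split(wines, wines$type), function(d) {
  coef(lm(quality ~ fixed.acidity, data = d))[["fixed.acidity"]]
})
slopes
```

A slope with opposite signs in the two groups is the numeric counterpart of regression lines that tilt in opposite directions in the plots.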
As far as the outcome variable, quality, is concerned, I noticed some distinct relationships.
It is a commonly held belief that acid is needed in wine, both to pair with food and to enjoy on its own. Being in the service industry myself, I have heard this many times. White Wine is most often the one noted for its perceived acidity. It was interesting to see not only that Red Wine typically contains more acid, but also that White Wine is negatively perceived when acid increases.
Another widely accepted idea is that alcohol diminishes your ability to taste and that higher-alcohol wines are therefore not ideal for pairing with food. It is interesting to me, however, that alcohol is well received when tasting wine. It could simply be the fact that increased alcohol content lowers the density of wine, as we have seen, or that people just like higher-proof drinks and our palates are sensitive enough to notice.
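One way to probe whether density or alcohol is doing the work is to compare a model containing density alone with one that also includes alcohol. The sketch below uses synthetic data in which, by assumption, quality depends only on alcohol and density merely falls as alcohol rises; it illustrates the reasoning and says nothing about the actual wine data.

```r
set.seed(2)
n <- 500

# Assumptions for illustration only: alcohol drives both density and quality.
alcohol <- runif(n, min = 8, max = 15)
d <- data.frame(
  alcohol = alcohol,
  density = 1.01 - 0.002 * alcohol + rnorm(n, sd = 0.001),
  quality = 2 + 0.35 * alcohol + rnorm(n, sd = 0.5)
)

# Density alone appears to predict quality ...
naive <- coef(summary(lm(quality ~ density, data = d)))
# ... but much of that apparent effect should vanish once alcohol is included.
adjusted <- coef(summary(lm(quality ~ density + alcohol, data = d)))

naive["density", c("Estimate", "Pr(>|t|)")]
adjusted[c("density", "alcohol"), c("Estimate", "Pr(>|t|)")]
```

If the density coefficient shrinks sharply once alcohol enters the model, that is consistent with density acting as a proxy for alcohol, although it does not prove causation.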
In wine tasting, you will often hear about the typicity of wine. Typicity is defined as “the degree to which a wine reflects its varietal origins, and thus demonstrate the signature characteristics of the grape from which it was produced, i.e., how much a Merlot wine ‘tastes like a Merlot’.” Looking at the box plot, you can see certain properties whose values are concentrated in a very small range, depending on the type of wine. For example, residual.sugar for Red Wine is highly concentrated in a small range. If we have an abundance of data, like ours, and such a narrow distribution of a key predictor, it could suggest Red Wine’s typicity.
When trying to extrapolate patterns in the physicochemical properties of wine, it is better to look at red and white wines separately. Even though they share a lot of similar relationships, it is clear that in some cases what works for one does not work for the other.
I also noticed some redundancy among the chosen variables. If predictive modeling were of interest, I would suggest trying dimensionality reduction techniques: for example, exploring the possibility of combining fixed.acidity, volatile.acidity, citric.acid, and pH into a new variable that preserves model integrity while speeding up any random forest or logistic regression modeling.
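Principal component analysis is one such technique. The sketch below is only an illustration on synthetic data: the four acidity-related columns are generated from a single made-up latent factor, so the strong first component is built in and should not be read as a finding about the wine data. The names acid, acids, and acidity_score are assumptions.

```r
set.seed(3)
n <- 300

# Hypothetical latent "acidity" factor driving all four measurements.
acid <- rnorm(n)
acids <- data.frame(
  fixed.acidity    = 8.0 + 1.50 * acid + rnorm(n, sd = 0.50),
  volatile.acidity = 0.5 + 0.10 * acid + rnorm(n, sd = 0.05),
  citric.acid      = 0.3 + 0.15 * acid + rnorm(n, sd = 0.05),
  pH               = 3.3 - 0.12 * acid + rnorm(n, sd = 0.05)
)

# Centre and scale so that no variable dominates because of its units.
pca <- prcomp(acids, center = TRUE, scale. = TRUE)

# Share of total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# The first component could stand in for the four original columns.
acidity_score <- pca$x[, 1]
```

On the real data, the share of variance explained by the first component would indicate whether a single combined acidity variable is a reasonable replacement for the four originals.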
All in all, I really enjoyed this assignment and would appreciate some more up-to-date data sets on this subject matter. I would also suggest using a more widely familiar varietal, such as Pinot Noir, or perhaps a wide list of varietals where the predictors would be the physicochemical properties and the outcome variable would be the type of wine, e.g. Pinot Noir, Cabernet Sauvignon, or Merlot.