Wine tasting is said to be an art. For centuries, tasters have practiced this profession, ranking wines by their smell, color and taste. This data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (excellent). The purpose of the analysis is to find which objective variables define a pattern that can predict this score. We will argue that Citric Acid and Alcohol are the main positive attributes a wine should have, while Volatile Acidity and Total Sulfur Dioxide are the main negative ones.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The data set contains 1,599 observations and 13 variables: the row index X, the 11 chemical properties, and the quality score. None of the variables contains NA values.
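The dimension and missing-value checks described above can be sketched in base R. This is only an illustration: the code that produced the output is not shown, the name `wine` is an assumption, and the data frame holds just the first ten values of three columns copied from the `str()` output.

```r
# Assumption: `wine` is a hypothetical name; the original loading code is not shown.
# Only the first ten values of three columns, copied from the str() output above.
wine <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality     = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Number of rows and columns in the data frame.
wine_dim <- dim(wine)

# Number of missing values per column; all zeros means there are no NAs.
na_counts <- colSums(is.na(wine))

print(wine_dim)
print(na_counts)
```

On the full data set the same two calls would be expected to report 1,599 rows, 13 columns and zero NAs per column, in line with the summary above.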
##
## 3 4 5 6 7 8
## 0.63 3.31 42.59 39.90 12.45 1.13
Interestingly, there isn’t much variability in the scores. About 82% of the wines scored 5 or 6, and only about 12% scored 7.
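The percentage table above can be reproduced with `table()` and `prop.table()`. The sketch below is an assumption-laden illustration: the per-score counts are not printed in the report, so they are back-computed here from the percentages and the 1,599 total.

```r
# Assumption: counts per score are back-computed from the percentages above
# (share * 1599, rounded to whole wines); they are not printed in the report.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Percentage of wines at each score, rounded to two decimals.
quality_pct <- round(100 * prop.table(table(quality)), 2)
print(quality_pct)

# Combined share of wines that scored 5 or 6.
share_5_or_6 <- 100 * mean(quality %in% c(5, 6))
print(round(share_5_or_6, 1))
```

The combined share makes the lack of variability concrete: roughly four out of five wines sit on just two of the six observed scores.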
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
pH measures how acidic or basic a solution is; since wine is acidic, we expect no values greater than 7. As we can see above, the median pH is 3.31 and most of the values appear to fall between 3.0 and 3.5. The distribution looks approximately normal.
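The `stat_bin` message above notes that the bin width was left at its default. A minimal base R sketch of choosing the bins explicitly follows; the bin width of 0.1 is an arbitrary choice for illustration, and only the first ten pH values from the `str()` output are used.

```r
# First ten pH values, copied from the str() output above.
ph <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

# Explicit 0.1-wide bins instead of the default range/30 (width chosen for illustration).
bin_edges <- seq(3.0, 3.6, by = 0.1)

# plot = FALSE returns the bin counts without drawing the histogram.
ph_hist <- hist(ph, breaks = bin_edges, plot = FALSE)

# One row per bin: left edge and number of wines falling in it.
print(data.frame(bin_start = head(bin_edges, -1), count = ph_hist$counts))
```

With ggplot2, which the `stat_bin` message suggests was used for the plots, the same choice would be passed as `geom_histogram(binwidth = 0.1)`.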
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
The alcohol distribution is right-skewed (the mean, 10.42, lies above the median, 10.20), so perhaps something else is affecting this variable’s distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
The same happens with citric acid: there is a large concentration of wines with a value of 0.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Another important variable to check is Total Sulfur Dioxide, which measures the amount of free and bound forms of SO2. Most of the wines have between 0 and 100 ppm of this substance.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
The volatile acidity distribution is more bell-shaped, centered around 0.5.
The data set contains 1,599 observations and 13 variables, all numeric. None of the variables contains NA values.
The main features of interest in the data set are citric acid, alcohol and total sulfur dioxide. We suspect that citric acid and alcohol may make an important contribution to the wine’s quality.
Volatile acidity needs a little more work. In the next section we will add quality as a control variable.
There seems to be a positive relation between citric acid concentration and wine quality: the median citric acid level tends to increase with quality.
The same seems to happen with the level of alcohol in the wine.
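The comparison behind these boxplots is the median of each variable within each quality score. A minimal sketch using `aggregate()` is shown below, under the assumption that the data sit in a data frame named `wine` (a hypothetical name). It uses only the first ten rows from the `str()` output, which are far too few to show the trend itself.

```r
# Assumption: `wine` is a hypothetical name; first ten rows from the str() output above.
wine <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality     = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Median citric acid and alcohol within each quality score,
# i.e. the center lines of the corresponding boxplots.
medians <- aggregate(cbind(citric.acid, alcohol) ~ quality, data = wine, FUN = median)
print(medians)
```

Run on the full data set, a table like this gives the numbers behind the visual impression that the medians rise with quality.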
Another important variable to check is Total Sulfur Dioxide, which measures the amount of free and bound forms of SO2 and is used as an antioxidant. According to a Google search, at concentrations over 50 ppm SO2 becomes evident in the nose and taste of wine.
## Warning: Removed 9 rows containing non-finite values (stat_boxplot).
In line with the above, there is a threshold around 50 ppm. Wines rated 5 are near this value, and better wines tend to have less of this substance. However, it is worth noting that wines rated 3 and 4 seem to have about the same level of SO2 as the better ones.
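One way to put a number on the 50 ppm threshold is the share of wines above it within each quality score. The sketch below is only an illustration on the first ten rows from the `str()` output; the names `so2` and `threshold_ppm` are assumptions.

```r
# Assumption: `so2` and `threshold_ppm` are hypothetical names.
# First ten rows of the two relevant columns, copied from the str() output above.
so2 <- data.frame(
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  quality              = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

threshold_ppm <- 50

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the share of wines above the threshold per quality score.
share_above <- tapply(so2$total.sulfur.dioxide > threshold_ppm, so2$quality, mean)
print(round(share_above, 2))
```

On the full data set, comparing these shares across scores 3 to 8 would show whether the low-scoring wines really resemble the better ones in this respect.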
This last chart shows a strong negative relation between volatile acidity (acetic acid) and wine quality.
The skewed distributions we saw earlier for alcohol and citric acid may be related to wine quality; the boxplots above support this.
We also saw a strong negative relation between volatile acidity and wine quality.
We found four relationships with quality, two positive and two negative. The positive ones are alcohol and citric acid; the negative ones are total sulfur dioxide and volatile acidity.
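The direction of these four relationships could also be checked numerically. The sketch below uses Spearman rank correlation, which is a choice made here because quality is an ordinal score; it is not a method used in the analysis above. The data frame name `wine` is an assumption, and the ten rows from the `str()` output are too few to draw conclusions from.

```r
# Assumption: `wine` is a hypothetical name; first ten rows from the str() output above.
wine <- data.frame(
  volatile.acidity     = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid          = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality              = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

predictors <- c("alcohol", "citric.acid", "total.sulfur.dioxide", "volatile.acidity")

# Spearman rank correlation of each predictor with the quality score.
# A positive value means the variable tends to rise with quality, a negative one that it falls.
rho <- sapply(wine[predictors], function(x) cor(x, wine$quality, method = "spearman"))
print(round(rho, 2))
```

On the full data set, the signs of these coefficients would be expected to match the two positive and two negative relationships described above.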
In this plot, the relationships we saw at the beginning of this analysis using boxplots are integrated into one plot. Toward the top right corner of the plot, wine quality tends to be higher, which means that wines with higher concentrations of alcohol and citric acid tend to score better.
In this second plot the pattern is reversed: quality tends to be better toward the bottom left corner. This means that wines with low concentrations of SO2 and acetic acid tend to score better than wines with higher concentrations. It is also worth noting an imaginary vertical line around 50 ppm, with good wines tending to fall to its left, as noted earlier.
The two different drivers of quality are now clearer: the two positive variables, alcohol and citric acid, and the two negative variables, total sulfur dioxide and volatile acidity.
The surprising interaction is with total sulfur dioxide. We did not expect to find a negative relationship between this variable and wine quality. However, it may be driven by concentration: wines with more than 50 ppm tend to score lower than wines with values between roughly 15 and 35 ppm. Again, one possible explanation is that although total sulfur dioxide is useful as an antibacterial and antioxidant, at high concentrations it can be detected by smell or taste.
This plot summarizes the two main positive variables we found that may explain the scores the wines received. Taken separately, citric acid and alcohol each show a positive relation with wine quality.
## Warning: Removed 9 rows containing non-finite values (stat_boxplot).
This plot summarizes the two main negative variables we found that may explain the scores the wines received. Taken separately, volatile acidity and total sulfur dioxide each show a negative relation with wine quality. However, it is worth noting that wines rated 3 and 4 seem to have about the same level of SO2 as the better ones.
Finally, we plotted the positive and negative relationships we found in the analysis. Wines with higher concentrations of alcohol and citric acid tend to score better. On the other hand, wines with low concentrations of SO2 and acetic acid tend to score better than wines with higher concentrations. Again, it is worth noting an imaginary vertical line around 50 ppm, with good wines tending to fall to its left.
To summarize, although wine tasting may seem an ancient practice carried out by professionals, sometimes seen as subjective or confined to a small group of people, we were able to find some patterns behind those judgments. Four variables appear to be the main drivers of quality in these wines. Citric Acid and Alcohol were found to be the main positive attributes a wine should have, while Volatile Acidity and Total Sulfur Dioxide were the main negative ones. Even so, more work needs to be done on the latter variable. We know that SO2 helps as an antibacterial and antioxidant, but in high concentrations it may be detected and become a negative driver because of its bad taste and smell. The right amount may lie between about 15 and 35 ppm, but this is still only a rough approximation.