Wine Quality Project Analysis

by Federico

Synopsis

Wine tasting is said to be an art. For centuries, people have practiced this profession, ranking wines by their smell, color and taste. This data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (excellent). The purpose of the analysis is to find which objective variables define a pattern that can predict this score. We will argue that Citric Acid and Alcohol are the main positive attributes a wine should have, while Volatile Acidity and Total Sulfur Dioxide are the main negative ones.

Data variables and definition

Attribute Information

  1. Fixed Acidity (tartaric acid - g / dm^3)
  2. Volatile acidity (acetic acid - g / dm^3)
  3. Citric acid (g / dm^3)
  4. Residual sugar (g / dm^3)
  5. Chlorides (sodium chloride - g / dm^3)
  6. Free sulfur dioxide (mg / dm^3)
  7. Total sulfur dioxide (mg / dm^3)
  8. Density (g / cm^3)
  9. pH
  10. Sulphates (potassium sulphate - g / dm^3)
  11. Alcohol (% by volume)
  12. Quality (score between 0 and 10)

Attribute Descriptions

  1. Fixed acidity: Most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  2. Volatile acidity: The amount of acetic acid in wine
  3. Citric acid: Found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  4. Residual sugar: The amount of sugar remaining after fermentation stops
  5. Chlorides: The amount of salt in the wine
  6. Free sulfur dioxide: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  7. Total sulfur dioxide: Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  8. Density: The density of a substance is its mass per unit volume
  9. pH: Describes how acidic or basic a substance is on a scale from 0 (very acidic) to 14 (very basic)
  10. Sulphates: A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
  11. Alcohol: The percent alcohol content of the wine
  12. Quality: Score between 0 and 10

Univariate Plot Section

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The data set contains 1,599 observations and 13 variables: an index column (X), the 11 chemical properties, and the quality score. No variable contains NA values.

## 
##     3     4     5     6     7     8 
##  0.63  3.31 42.59 39.90 12.45  1.13

Interestingly, there isn’t much variability in the scores. Most wines (about 82%) scored 5 or 6, and only about 12% scored 7.
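As a quick sanity check, the percentages in the table above can be converted back into approximate counts. The sketch below uses Python purely for illustration (the output in this report comes from R); the variable names are hypothetical, and the only inputs are the printed percentages and the 1,599 observations.

```python
# Percentages of wines per quality score, copied from the table above.
quality_pct = {3: 0.63, 4: 3.31, 5: 42.59, 6: 39.90, 7: 12.45, 8: 1.13}
n_wines = 1599  # number of observations reported by str()

# Recover approximate counts from the rounded percentages.
quality_counts = {
    score: round(pct / 100 * n_wines) for score, pct in quality_pct.items()
}

# Share of wines rated 5 or 6, the two dominant scores.
share_5_or_6 = quality_pct[5] + quality_pct[6]

print(quality_counts)
print(f"Rated 5 or 6: {share_5_or_6:.2f}%")
```

If the recovered counts add up to the number of observations, the percentage table is consistent with the data set size.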

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

pH measures how acidic or basic a solution is. Because wine is acidic, we expect no values above 7. As the summary above shows, the median pH is 3.31 and most values fall between 3.0 and 3.5. The distribution looks approximately normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Alcohol is right-skewed, so perhaps something else is affecting the distribution of this variable.
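One rough way to back up a visual impression of skewness is to compare the mean with the median and to compute a quartile-based skewness coefficient (Bowley's measure) from the summary values. This is a Python sketch for illustration only, not part of the original R analysis; the function name is hypothetical and the inputs are the alcohol summary statistics printed above.

```python
def quartile_skewness(q1, median, q3):
    """Bowley's quartile skewness: positive values suggest a longer right tail."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Summary values for alcohol, copied from the output above.
alcohol = {"q1": 9.50, "median": 10.20, "mean": 10.42, "q3": 11.10}

skew = quartile_skewness(alcohol["q1"], alcohol["median"], alcohol["q3"])
mean_above_median = alcohol["mean"] > alcohol["median"]

print(f"quartile skewness = {skew:.3f}, mean > median: {mean_above_median}")
```

A positive coefficient together with a mean above the median is consistent with a right-skewed distribution, although the histogram remains the more direct evidence.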

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

The same happens with citric acid, which also shows a large concentration of wines with a value of 0.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Another important variable to check is Total Sulfur Dioxide, which measures the amount of free and bound forms of SO2. Most of the wines have between 0 and 100 ppm of this substance.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Volatile acidity is more bell-shaped, with its center around 0.5.
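Each histogram above triggered the message that the bin width defaulted to range/30. Assuming "range" means the maximum minus the minimum, the default bin widths can be reproduced from the summary output. This is a Python sketch for illustration; the dictionary and function names are hypothetical.

```python
# (min, max) per variable, copied from the summary output above.
ranges = {
    "pH": (2.740, 4.010),
    "alcohol": (8.40, 14.90),
    "citric.acid": (0.000, 1.000),
    "total.sulfur.dioxide": (6.00, 289.00),
    "volatile.acidity": (0.1200, 1.5800),
}

def default_binwidth(lo, hi, bins=30):
    """Bin width obtained by splitting the observed range into `bins` parts."""
    return (hi - lo) / bins

binwidths = {name: default_binwidth(lo, hi) for name, (lo, hi) in ranges.items()}

for name, width in binwidths.items():
    print(f"{name}: {width:.4f}")
```

These values are a starting point for choosing an explicit `binwidth`, for example a round number close to the default.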

What is the structure of your dataset?

The data set contains 1,599 observations and 13 variables, all numeric. No variable contains NA values.

What is/are the main feature(s) of interest in your dataset?

The main features in the data set are citric acid, alcohol and total sulfur dioxide. We suspect that citric acid and alcohol make an important contribution to a wine’s quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

We will need to work a little more on volatile acidity. In the next section we will add quality as a control variable.

Bivariate Plots Section

There seems to be a positive relationship between citric acid concentration and wine quality: the median citric acid level tends to increase with quality.

The same seems to happen with the alcohol level of the wine.
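The comparison behind these boxplots is a median per quality score. The Python sketch below illustrates that grouping; it is not the original R code. It uses only the first ten rows printed by `str()` earlier, so it shows the mechanics rather than the trend, which needs all 1,599 wines. The helper name is hypothetical.

```python
from collections import defaultdict
from statistics import median

# First ten rows of the data, copied from the str() output above.
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
citric_acid = [0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36]
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5]

def median_by_quality(values, scores):
    """Group values by quality score and return the median of each group."""
    groups = defaultdict(list)
    for value, score in zip(values, scores):
        groups[score].append(value)
    return {score: median(group) for score, group in sorted(groups.items())}

citric_medians = median_by_quality(citric_acid, quality)
alcohol_medians = median_by_quality(alcohol, quality)

print("citric acid:", citric_medians)
print("alcohol:    ", alcohol_medians)
```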

Another important variable to check is Total Sulfur Dioxide, which measures the amount of free and bound forms of SO2 and is used as an antioxidant. An online search suggests that at concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

## Warning: Removed 9 rows containing non-finite values (stat_boxplot).

In line with what we said above, there is a threshold around 50 ppm. Wines that scored 5 are near this value, and better wines tend to contain less of this substance. However, it is worth noting that wines of quality 3 and 4 seem to have about the same level of SO2 as the better ones.
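One way to make the 50 ppm threshold concrete is to compute, for each quality score, the fraction of wines above it. The Python sketch below is for illustration only (the report itself was produced in R) and uses just the first ten rows printed by `str()` earlier, so the resulting fractions are not representative of the full data set. The function name is hypothetical.

```python
# First ten rows, copied from the str() output earlier in the report.
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
total_so2 = [34, 67, 54, 60, 34, 40, 59, 21, 18, 102]

THRESHOLD_PPM = 50  # threshold discussed in the text

def share_above(values, scores, threshold):
    """For each quality score, return the fraction of wines above the threshold."""
    totals, above = {}, {}
    for value, score in zip(values, scores):
        totals[score] = totals.get(score, 0) + 1
        above[score] = above.get(score, 0) + (value > threshold)
    return {score: above[score] / totals[score] for score in sorted(totals)}

shares = share_above(total_so2, quality, THRESHOLD_PPM)
print(shares)
```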

This last chart shows a strong negative relationship between acetic acid (volatile acidity) and wine quality.
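The direction of such a relationship can also be checked numerically with a Pearson correlation coefficient. The Python sketch below is for illustration and assumes nothing beyond the first ten rows printed by `str()` earlier; with so few rows it indicates only the sign, not the strength, of the relationship in the full data. The function name is hypothetical.

```python
from math import sqrt

# First ten rows, copied from the str() output earlier in the report.
volatile_acidity = [0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5]
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r = pearson_r(volatile_acidity, quality)
print(f"r = {r:.2f}")
```

A negative value of `r` agrees with the direction seen in the chart.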

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The skewed distributions we saw earlier for alcohol and citric acid may be related to the quality of the wine. The boxplots support this.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

We saw a strong negative relationship between volatile acidity and wine quality.

What was the strongest relationship you found?

We found four relationships with quality: two positive and two negative. The positive ones are alcohol and citric acid, and the negative ones are total sulfur dioxide and volatile acidity.

Multivariate Plots Section

In this plot, the relationships we saw earlier in the boxplots are integrated into a single view. Quality tends to be higher toward the top right corner of the plot, which means that wines with higher concentrations of alcohol and citric acid tend to perform better in a tasting.

In this second plot the pattern is reversed: quality tends to be better toward the bottom left corner. This means that wines with low concentrations of SO2 and acetic acid tend to perform better than wines with higher concentrations. It is also worth noting an imaginary vertical line around 50 ppm, with good wines tending to fall to its left, as we said earlier.
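The two "corners" described above can be expressed as simple rules. The Python sketch below is illustrative only: using the medians from the earlier summary output as cut points is an assumption made here, not something the original analysis states, and the function and key names are hypothetical. The 50 ppm value is the threshold discussed in the text.

```python
# Medians copied from the summary() output earlier in the report
# (used as cut points here by assumption).
MEDIANS = {
    "alcohol": 10.20,
    "citric_acid": 0.26,
    "volatile_acidity": 0.52,
}
SO2_THRESHOLD_PPM = 50  # the imaginary vertical line mentioned in the text

def favorable_regions(wine):
    """Flag whether a wine falls in the favorable region of each plot."""
    high_positive = (wine["alcohol"] > MEDIANS["alcohol"]
                     and wine["citric_acid"] > MEDIANS["citric_acid"])
    low_negative = (wine["volatile_acidity"] < MEDIANS["volatile_acidity"]
                    and wine["total_sulfur_dioxide"] < SO2_THRESHOLD_PPM)
    return {"high_alcohol_and_citric": high_positive,
            "low_so2_and_volatile": low_negative}

# Rows 4 and 10 of the data, copied from the str() output.
row_4 = {"alcohol": 9.8, "citric_acid": 0.56,
         "volatile_acidity": 0.28, "total_sulfur_dioxide": 60}
row_10 = {"alcohol": 10.5, "citric_acid": 0.36,
          "volatile_acidity": 0.50, "total_sulfur_dioxide": 102}

print(favorable_regions(row_4))
print(favorable_regions(row_10))
```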

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

We can now see the two kinds of quality drivers more clearly: the two positive variables, alcohol and citric acid, and the two negative variables, total sulfur dioxide and volatile acidity.

Were there any interesting or surprising interactions between features?

The surprising interaction concerns total sulfur dioxide. We did not expect to find a negative relationship between this variable and wine quality. However, this may be explained by its concentration: wines with more than 50 ppm tend to score lower than wines with values between 15 and 35 ppm. Again, one possible explanation is that although total sulfur dioxide is useful as an antibacterial and antioxidant, at high concentrations it can be detected by smell or taste.


Final Plots and Summary

Plot One

Description One

This plot summarizes the two main positive variables we found that may explain the scores the wines received. Taken separately, both citric acid and alcohol show a positive relationship with wine quality.

Plot Two

## Warning: Removed 9 rows containing non-finite values (stat_boxplot).

Description Two

This plot summarizes the two main negative variables we found that may explain the scores the wines received. Taken separately, both volatile acidity and total sulfur dioxide show a negative relationship with wine quality. However, it is worth noting that wines of quality 3 and 4 seem to have about the same level of SO2 as the better ones.

Plot Three

Description Three

Finally, we plotted the positive and negative relationships we found in the analysis. Wines with higher concentrations of alcohol and citric acid tend to perform better in a tasting. On the other hand, wines with low concentrations of SO2 and acetic acid tend to perform better than wines with higher concentrations. Again, it is worth noting an imaginary vertical line around 50 ppm, with good wines tending to fall to its left.


Reflection

To summarize, although wine tasting may seem an ancient practice carried out by professionals, sometimes seen as subjective and/or restricted to a small group of people, we were able to find some patterns in those decisions. Four variables appear to be the main drivers of quality in wines. Citric Acid and Alcohol were found to be the main positive attributes a wine should have, while Volatile Acidity and Total Sulfur Dioxide were the main negative ones. That said, more work needs to be done on Total Sulfur Dioxide. We know that SO2 helps as an antibacterial and antioxidant, but in high concentrations it may be detected and become a negative driver because of its bad taste and smell. The right amount may lie between 15 and 35 ppm, but this is still only a rough approximation.