Red wine quality exploration

This is a stream-of-consciousness exploration of the Red Wine Quality data set from “Modeling wine preferences by data mining from physicochemical properties” (see References). The main objective is to get a sense of the degree to which the variables in the data are related to the variable ‘quality’.

library(dplyr); library(ggplot2); library(GGally); library(reshape2)
library(tidyr); library(gridExtra); library(knitr)
theme_set(theme_bw())

Helper functions

A little wrangling before we start

First, variable “X” is an index variable, so we’ll drop it from the analysis. We’ll also reformat the variable of analysis, ‘quality’, as an ordered factor.
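The two wrangling steps can be sketched as follows. This is a minimal illustration on a made-up toy data frame (the real data would be read from the data set file), using base R only; the toy values are assumptions and not taken from the wine data.

```r
# Toy stand-in for the wine data (values are made up for illustration).
wine <- data.frame(
  X = 1:6,
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  quality = c(5, 5, 6, 3, 7, 8)
)

# Drop the index column.
wine$X <- NULL

# Turn the score into an ordered factor; levels are the sorted observed scores.
wine$quality <- factor(wine$quality, ordered = TRUE)

str(wine)
```

Keeping ‘quality’ as an ordered factor makes it behave as a grouping variable in plots and summaries while preserving the order of the scores.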

‘quality’: the main variable of analysis

Although ‘quality’ can take values between 0 and 10, the observed minimum is 3 and the maximum is 8. The following plot shows the distribution of the values.

Most wines have a mid-range score of 5 or 6.

Summaries and plotting pairs

Before checking the covariates individually, we’ll take a look at their statistical summaries and at their relationships with ‘quality’ (for the latter, we have to convert ‘quality’ to numeric).

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40   3: 10  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   4: 53  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20   5:681  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42   6:638  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   7:199  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90   8: 18

##              quality              alcohol     volatile.acidity 
##                1.000                0.476               -0.391 
##            sulphates          citric.acid total.sulfur.dioxide 
##                0.251                0.226               -0.185 
##              density            chlorides        fixed.acidity 
##               -0.175               -0.129                0.124 
##                   pH  free.sulfur.dioxide       residual.sugar 
##               -0.058               -0.051                0.014
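A correlation listing like the one above can be produced roughly as follows. This is a sketch on made-up toy values, not the code behind the original output. Note that an ordered factor has to go through `as.character()` before `as.numeric()`, otherwise R returns the level positions (1, 2, ...) instead of the scores.

```r
# Toy stand-in for the wine data (values are made up for illustration).
wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.45, 0.40, 0.35),
  quality = factor(c(5, 4, 5, 6, 7, 8), ordered = TRUE)
)

# Recover the numeric score from the ordered factor.
quality_num <- as.numeric(as.character(wine$quality))

# Pearson correlation of every covariate with the numeric score.
covariates <- wine[, setdiff(names(wine), "quality"), drop = FALSE]
cors <- cor(covariates, quality_num)[, 1]

# Sort by absolute value so the most strongly related covariates come first.
cors <- cors[order(abs(cors), decreasing = TRUE)]
round(cors, 3)
```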

The variables with the highest correlations with ‘quality’ are ‘alcohol’ and ‘volatile.acidity’. Judging from this plot, the latter may also have a bimodal distribution. Perhaps this behaviour is related to ‘quality’, which is worth checking.

Some of the covariates have very skewed distributions, some of them with outliers. There are clear outliers in ‘chlorides’ and ‘total.sulfur.dioxide’. Outliers can affect statistical measures such as correlation. Next, we present histograms of these variables, which give a clearer picture of the data distribution and the outliers.

Though ‘total.sulfur.dioxide’ and ‘free.sulfur.dioxide’ don’t show high correlations with ‘quality’, the boxplots reveal an interesting relationship, with higher sulfur values for middle quality scores. This relationship deserves a closer inspection.

Clear patterns also appear in the boxplots for ‘sulphates’ and ‘residual.sugar’. The remaining variables don’t seem to show any particular pattern, but the size of the plot hinders any deep visual analysis. We’ll have to check these variables individually anyway.

Now we’ll examine in more depth the relationship between ‘quality’ and the variables mentioned above, beginning with those where a relationship can be spotted at first glance. The analysis will be enriched with additional variables where appropriate.

Volatile acidity

the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

Volatile acidity has a correlation of -0.39 with quality. The next set of plots includes a histogram of ‘volatile.acidity’, a histogram faceted by ‘quality’, and a boxplot combined with jittered scatter points. This will be the default presentation whenever we draw a boxplot, mainly because the few wines with extreme scores (3 and 8) could give the wrong picture if only boxplots were drawn.
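A boxplot overlaid on jittered points can be sketched with ggplot2 as follows. The toy values and the styling choices (jitter width, transparency) are assumptions for illustration, not the settings used for the original figures.

```r
library(ggplot2)

# Toy stand-in for the wine data (values are made up for illustration).
wine <- data.frame(
  quality = factor(c(3, 5, 5, 5, 6, 6, 6, 8), ordered = TRUE),
  volatile.acidity = c(0.90, 0.60, 0.55, 0.62, 0.50, 0.48, 0.52, 0.40)
)

p <- ggplot(wine, aes(x = quality, y = volatile.acidity)) +
  # Jitter only horizontally so the measured values are not distorted.
  geom_jitter(width = 0.2, height = 0, alpha = 0.4) +
  # Hide boxplot outlier markers: the jittered points already show every wine.
  geom_boxplot(alpha = 0.5, outlier.shape = NA)

print(p)
```

Showing the raw points makes it obvious when a box summarises only a handful of wines.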

The general relationship is clear: higher-scored wines generally have lower levels of acetic acid.

A bimodal distribution can be seen in the histogram on the left. This behaviour could be explained by the different distributions for the quality levels with the most wines (scores 5 and 6): the histogram for score 6 seems to be shifted towards lower values relative to the histogram for score 5. We can inspect this in more detail by summarizing ‘volatile.acidity’ by ‘quality’.

## Source: local data frame [6 x 4]
## 
##   quality  mean    sd     n
##    (fctr) (dbl) (dbl) (int)
## 1       3  0.88  0.33    10
## 2       4  0.69  0.22    53
## 3       5  0.58  0.16   681
## 4       6  0.50  0.16   638
## 5       7  0.40  0.15   199
## 6       8  0.42  0.14    18
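A per-score summary of this kind (mean, standard deviation and count) can be computed as sketched below. The original output appears to come from dplyr; this version uses base R on made-up toy values so that it has no dependencies.

```r
# Toy stand-in for the wine data (values are made up for illustration).
wine <- data.frame(
  quality = factor(c(3, 5, 5, 5, 6, 6, 6, 8), ordered = TRUE),
  volatile.acidity = c(0.90, 0.60, 0.55, 0.62, 0.50, 0.48, 0.52, 0.40)
)

# Split the covariate by score, summarise each group, and stack the rows.
groups <- split(wine$volatile.acidity, wine$quality)
by_quality <- do.call(rbind, lapply(groups, function(v) {
  data.frame(mean = round(mean(v), 2), sd = round(sd(v), 2), n = length(v))
}))

by_quality
```

Reporting `n` next to each mean is a reminder of how few wines sit behind the summaries for the extreme scores (a group with a single wine has no standard deviation at all).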

‘quality’, at least when taken as a two-interval variable, doesn’t seem to explain the bimodality of ‘volatile.acidity’, but at least we can see the “shift” in the distribution more clearly when grouping by quality score.

One covariate that could explain this behaviour is ‘citric.acid’, since it is highly correlated with ‘volatile.acidity’. About 27% of its values (438 of 1,599) lie between 0 and 0.1.

## 
## (-0.001,0.1]    (0.1,0.2]    (0.2,0.3]    (0.3,0.4]    (0.4,0.5] 
##          438          193          291          234          253 
##    (0.5,0.6]    (0.6,0.7]    (0.7,0.8]    (0.8,0.9]      (0.9,1] 
##          112           62           15            0            1

We’ll split this variable in two: one part for values between 0 and 0.1 and another for the rest of the values (as a side note, the label of the first interval in the table starts below 0. This point will be addressed later).
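The split can be sketched with `cut()`. The first call below illustrates one way a label such as `(-0.001,0.1]` arises: when `cut()` is given a number of bins rather than explicit breaks, it widens the range slightly so that the minimum falls inside the first bin. The values are toy data, and the variable name `citric.cut` is borrowed from later in this report; whether the original code built it this way is an assumption.

```r
# Toy stand-in for wine$citric.acid (values are made up for illustration).
citric <- c(0, 0.02, 0.1, 0.24, 0.26, 0.49, 0.49, 1)

# Ten equal-width bins chosen by cut(): the first label starts just below 0.
table(cut(citric, breaks = 10))

# Two-part split with explicit breaks; include.lowest = TRUE keeps exact zeros
# in the first interval instead of turning them into NA.
citric.cut <- cut(citric, breaks = c(0, 0.1, 1), include.lowest = TRUE)
table(citric.cut)
```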

This division has almost removed the bimodality of the distribution and also shows that lower concentrations of ‘citric.acid’ go with higher concentrations of ‘volatile.acidity’. The next logical step is to examine the relationship between these two variables together with ‘quality’.

Citric acid

found in small quantities, citric acid can add ‘freshness’ and flavor to wines

‘citric.acid’ has a correlation of approx. 0.23 with ‘quality’. Let’s check the behaviour of the former and its relationship with the latter in a little more detail.

‘citric.acid’ has an asymmetric distribution with a high number of values close to 0. There is an outlier with value 1. From the histogram and the points in the boxplot one can make out a high concentration of values around 0.5. The next table reveals that the value in question is 0.49.

## 
##    0 0.49 0.24 0.02 0.26  0.1 
##  132   68   51   50   38   35
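A table of the most frequent values, like the one above, can be obtained by sorting a frequency table. A minimal sketch on made-up toy values:

```r
# Toy stand-in for wine$citric.acid (values are made up for illustration).
citric <- c(0, 0, 0, 0.49, 0.49, 0.24, 0.02, 0.26, 0.1)

# Count each distinct value, sort by count, and keep the six most frequent.
top_values <- head(sort(table(citric), decreasing = TRUE), 6)
top_values
```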

Taking the boxplot as a reference, the trend is also clear for this variable: wines with higher scores have a higher median citric acid concentration.

Finally, it is worth analysing this relationship together with ‘volatile.acidity’, given the results obtained for that variable earlier in this report.

Next we present a scatter plot of ‘volatile.acidity’ and ‘citric.acid’ colored by ‘quality’ (‘quality’ values equal to 0 have been omitted).

Now, if there were an inverse relationship between ‘citric.acid’ and ‘quality’, and ‘volatile.acidity’ were a confounding variable, we should observe lower scores at high levels of ‘citric.acid’ for each level of ‘volatile.acidity’. We can’t observe such a thing in this plot. Maybe we can get a better look if we use ‘cut.quality’ as the color variable instead.

Neither the scatter plot nor the smoother gives us a reason to believe that ‘volatile.acidity’ could be a confounding variable for ‘citric.acid’. Just for good measure, we’ll display the distribution of ‘quality’ faceted by ‘citric.cut’ to see more clearly whether it changes significantly depending on the value of this last variable.

Both distributions are very similar. At most, one could say there is a higher proportion of wines scored 7 among wines with higher citric acid concentrations. These last figures confirm that the relationship between ‘quality’ and ‘citric.acid’ (if there is any) is positive and not negative. So, contrary to what is suggested in the commentary for this variable in the data set description, we’ll assume for the moment that ‘citric.acid’ has either no relationship with ‘quality’ or just a slightly positive one.

Alcohol and density

the percent alcohol content of the wine

the density of wine is close to that of water, depending on the percent alcohol and sugar content

Alcohol has the highest correlation with quality: around 0.48. Looking at the figures, the amount of alcohol seems to remain stable for wines scored up to 5. But from 6 to 8 there are, proportionally, more wines with higher alcohol concentration.

Alcohol is also highly correlated with ‘density’ (approx. -0.5), so we should include this variable in some way when looking at the relationship of ‘alcohol’ with the main variable. Next we present the set of three graphs for ‘density’ and then the relationship of this variable with ‘alcohol’, together with ‘quality’.

Almost all wines with low scores have low alcohol and high density. Low scores are rare at high levels of alcohol. However, in this part of the plot, when controlling for alcohol, the wines with lower density seem to score worse. So maybe wines with higher density score worse just because they have lower levels of alcohol (the two variables are negatively correlated).

Perhaps we’ll get a better picture of what is happening on the right side of the plot if we split quality into three.

Now one could say that high quality wines have higher density at each level of alcohol, but this pattern is dubious, in part because there are too few high quality wines to draw a conclusion. If we compute means and medians by quality, the pattern isn’t clear either.

## Source: local data frame [3 x 7]
## 
##   quality.three mean_alcohol median_alcohol mean_density median_density
##          (fctr)        (dbl)          (dbl)        (dbl)          (dbl)
## 1         (2,4]        10.22           10.0    0.9966887        0.99660
## 2         (4,6]        10.25           10.0    0.9968673        0.99680
## 3         (6,8]        11.52           11.6    0.9960303        0.99572
## Variables not shown: sd_density (dbl), n (int)
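The three-way split and the grouped means and medians can be sketched in base R as follows. The breaks `c(2, 4, 6, 8)` are inferred from the interval labels in the output above; the toy values are made up, and the original summary appears to have been produced with dplyr rather than `aggregate()`.

```r
# Toy stand-in for the wine data (values are made up for illustration).
wine <- data.frame(
  quality = factor(c(3, 4, 5, 5, 6, 6, 7, 8), ordered = TRUE),
  alcohol = c(10.0, 10.4, 9.8, 10.2, 10.0, 10.6, 11.4, 11.8),
  density = c(0.9970, 0.9966, 0.9968, 0.9970, 0.9966, 0.9968, 0.9958, 0.9956)
)

# Recover the numeric score, then bin it into three intervals.
quality_num <- as.numeric(as.character(wine$quality))
wine$quality.three <- cut(quality_num, breaks = c(2, 4, 6, 8))

# Means and medians of both covariates within each interval.
means <- aggregate(cbind(alcohol, density) ~ quality.three, data = wine, FUN = mean)
medians <- aggregate(cbind(alcohol, density) ~ quality.three, data = wine, FUN = median)

means
medians
```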

Sulphates

a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

Wines with more sulphates seem to have a higher median score. Also, there is a clear outlier. Again, the few wines with the highest and lowest scores prevent us from reaching definitive conclusions.

Free and total sulfur dioxide

free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

As free and total sulfur dioxide are highly correlated, it makes sense to analyse them together in order to examine their influence on wine quality.

First let’s display the set of three plots for these two covariates.

It seems that the two covariates behave more or less the same with respect to ‘quality’, and strangely so: the wines with the highest sulfur concentrations receive mid-range scores. However, there are outliers in the ‘total.sulfur.dioxide’ boxplot that make possible trends harder to see. Here we plot this last figure again with the outliers removed.
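One simple way to drop extreme values before plotting is to keep only observations at or below an upper quantile. The sketch below uses a hypothetical helper, `keep_below_quantile()`, and made-up toy values; the cutoff actually used for the figure is not stated in this report, so the rule and the threshold are assumptions.

```r
# Toy stand-in for wine$total.sulfur.dioxide (values are made up).
tsd <- c(6, 22, 38, 46, 62, 80, 100, 120, 278, 289)

# Hypothetical helper: TRUE for values at or below the chosen upper quantile.
keep_below_quantile <- function(x, prob = 0.99) {
  x <= unname(quantile(x, probs = prob, na.rm = TRUE))
}

# A low threshold is used here only because the toy vector is so short.
keep <- keep_below_quantile(tsd, prob = 0.8)
tsd[keep]
```

Filtering changes only what is displayed; the removed wines should stay in the data used for summaries and models unless there is a reason to exclude them.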

The shape of both boxplots suggests a quadratic relationship between ‘quality’ and the two sulfur variables.

Next, we’ll draw a scatter plot of the two sulfur variables colored by ‘cut.quality’, with a smoother.

Except at very low levels of ‘free.sulfur.dioxide’, it seems that at every level of this covariate, wines with higher concentrations of ‘total.sulfur.dioxide’ score worse.

Chlorides

the amount of salt in the wine

Two things can be said by looking at these plots: there is a very slight negative correlation between ‘chlorides’ and ‘quality’, and ‘chlorides’ has clear outliers. Just to see what the distribution of ‘chlorides’ looks like without outliers, we’ll draw the histogram again with them removed.

‘chlorides’ seems to be more or less symmetrically distributed, with a long tail on the right side.

‘chlorides’ is also somewhat correlated with ‘sulphates’. Let’s see how this relates to ‘cut.quality’.

This correlation seems to be driven by the higher, rarer values of ‘chlorides’. For these values, wines also have lower scores.

##            cut.quality
##             (0,5] (5,10]
##   (0,0.3]     729    848
##   (0.3,0.8]    15      7
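A cross-tabulation like the one above can be built from two `cut()` calls. Row proportions make the comparison explicit: in the table above, 15 of the 22 high-chloride wines (about 68%) fall in the low quality class, against 729 of 1,577 (about 46%) for the rest. The sketch below uses made-up toy values; the breaks are taken from the labels in the table.

```r
# Toy stand-in for the wine data (values are made up for illustration).
wine <- data.frame(
  chlorides = c(0.05, 0.08, 0.09, 0.35, 0.41, 0.61),
  quality = c(6, 5, 7, 5, 4, 6)
)

# Bin both variables, then count wines in each combination of bins.
cut.quality <- cut(wine$quality, breaks = c(0, 5, 10))
cut.chlorides <- cut(wine$chlorides, breaks = c(0, 0.3, 0.8))
tab <- table(cut.chlorides, cut.quality)
tab

# Share of low and high quality wines within each chlorides bin.
round(prop.table(tab, margin = 1), 2)
```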

pH

From the boxplots it appears that the best-scored wines have a lower median pH. But again, the low number of observations for wines with extreme scores weakens these conclusions.

Final plots and summary

Plot one: correlations with quality

(This plot derives from the table of correlations between quality and the rest of the covariates)

The variable with the highest correlation is ‘alcohol’, followed by ‘volatile.acidity’, ‘sulphates’ and ‘citric.acid’. Though this plot presents the most obvious relationships based on correlations, some other interesting relationships remain concealed, such as those of ‘total.sulfur.dioxide’ and ‘free.sulfur.dioxide’, which seem to have a quadratic relationship with quality. At the same time, these two variables combined show a clearer relationship (Plot three). Something similar happens with density and alcohol combined.

Plot two: quality distribution for low and high levels of citric acid

One thing stands out in this plot when you take into account this description from the data set: ‘found in small quantities, citric acid can add ‘freshness’ and flavor to wines’. Yet ‘citric.acid’ has a low but positive correlation with quality.

About 27% of wines have levels of citric acid between 0 and 0.1: we’ll call these ‘low levels of citric acid’. As can be seen in the histograms, the quality distribution for the wines with low concentrations of citric acid is by no means shifted to the right (that is, with proportionally more wines at higher scores) with respect to the wines with higher concentrations, contradicting the sentence from the data set description mentioned previously.

Plot three: free and total sulfur dioxides, and quality

‘total.sulfur.dioxide’ has a correlation of about -0.18 with ‘quality’, far weaker than ‘alcohol’ and ‘volatile.acidity’. But when combining it with ‘free.sulfur.dioxide’ and coloring by ‘quality’, previously transformed into two score intervals (low: 0 to 5; high: 6 to 10), the relationship becomes pretty clear: at each level of ‘free.sulfur.dioxide’, higher quality wines have lower concentrations of total sulfur dioxide. At the same time, we observe that at each level of free sulfur dioxide we have around the same proportion of the two classes of wines (low and high quality).

Reflection

Studying the relationship between numeric variables can seem very simple at first. But in order not to be deceived by the data, possible confounders have to be taken into account. This requires a more in-depth analysis.

Here, three to four variables showed a clear correlation with the main variable, ‘quality’, and I think it is fair to say that they are strongly correlated with this variable of analysis. But there are some caveats regarding ‘quality’. First, it is a numeric variable but not a continuous one. It is a discrete variable representing a score, so we have to be very careful when using measures such as correlation.
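One common cross-check for a discrete, ordered score is a rank-based measure such as Spearman’s correlation, which relies only on the ordering of the scores. This is a suggested complement, not something computed in the analysis above; the sketch uses made-up toy values.

```r
# Toy stand-ins for wine$alcohol and the numeric quality score (made up).
alcohol <- c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8)
quality <- c(5, 4, 5, 6, 7, 8)

# Pearson treats the score as an interval-scaled number.
pearson <- cor(alcohol, quality)

# Spearman works on ranks (ties get average ranks), so only the order matters.
spearman <- cor(alcohol, quality, method = "spearman")

round(c(pearson = pearson, spearman = spearman), 3)
```

If the two measures tell the same story, the conclusions drawn from the Pearson correlations are less likely to be an artifact of treating the score as continuous.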

Second, the very distribution of this variable makes the analysis more difficult: a high proportion of wines have mid-range scores, and the highest and lowest scores have very few instances. This makes the analysis less reliable. One solution is to turn ‘quality’ into a qualitative variable by splitting it into two or three intervals, which has been done here for some specific analyses. The downside of this option is that information is lost.

Regarding the confounding variables, a controlled analysis can be performed precisely by keeping ‘quality’ split in two: a logistic regression.

A logistic regression is a more suitable model for analysing the relationships examined earlier, since we can observe the effect of each variable while controlling for the influence of the other covariates.

## 
## Calls:
## m1: glm(formula = cut.quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid, family = "binomial", data = wine)
## m2: glm(formula = cut.quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide + density + chlorides, 
##     family = "binomial", data = wine)
## m3: glm(formula = cut.quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide + I(total.sulfur.dioxide^2) + 
##     density + chlorides + fixed.acidity + pH + free.sulfur.dioxide + 
##     I(free.sulfur.dioxide^2) + residual.sugar, family = "binomial", 
##     data = wine)
## 
## ===========================================================
##                                m1         m2         m3    
## -----------------------------------------------------------
## (Intercept)                -9.323***  -62.258     46.427   
##                            (0.799)    (45.597)   (80.912)  
## alcohol                     1.000***    0.935***   0.852***
##                            (0.069)     (0.084)    (0.104)  
## volatile.acidity           -3.732***   -3.473***  -3.295***
##                            (0.440)     (0.480)    (0.489)  
## sulphates                   2.119***    2.693***   2.767***
##                            (0.385)     (0.441)    (0.452)  
## citric.acid                -1.002**    -0.840     -1.281*  
##                            (0.381)     (0.471)    (0.564)  
## total.sulfur.dioxide                   -0.012***  -0.022***
##                                        (0.002)    (0.006)  
## density                                54.094    -54.561   
##                                       (45.439)   (82.594)  
## chlorides                              -3.856**   -3.723*  
##                                        (1.491)    (1.572)  
## I(total.sulfur.dioxide^2)                          0.000   
##                                                   (0.000)  
## fixed.acidity                                      0.142   
##                                                   (0.099)  
## pH                                                -0.405   
##                                                   (0.720)  
## free.sulfur.dioxide                                0.072** 
##                                                   (0.023)  
## I(free.sulfur.dioxide^2)                          -0.001*  
##                                                   (0.000)  
## residual.sugar                                     0.086   
##                                                   (0.057)  
## -----------------------------------------------------------
## Aldrich-Nelson R-sq.           0.235      0.252      0.259 
## McFadden R-sq.                 0.222      0.244      0.253 
## Cox-Snell R-sq.                0.265      0.286      0.295 
## Nagelkerke R-sq.               0.353      0.382      0.394 
## phi                            1.000      1.000      1.000 
## Likelihood-ratio             491.451    538.543    559.171 
## p                              0.000      0.000      0.000 
## Log-likelihood              -858.761   -835.215   -824.901 
## Deviance                    1717.522   1670.430   1649.803 
## AIC                         1727.522   1686.430   1677.803 
## BIC                         1754.408   1729.447   1753.082 
## N                           1599       1599       1599     
## ===========================================================
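The coefficients above are on the log-odds scale. For example, the alcohol coefficient of 1.000 in m1 means that each additional percentage point of alcohol multiplies the odds of falling in the high quality class by exp(1.000), about 2.7, with the other covariates held fixed. The sketch below fits the same kind of model on simulated toy data; the data-generating process is hypothetical and exists only so that the model has something to fit.

```r
set.seed(42)
n <- 200

# Simulated stand-in for the wine data (not real measurements).
wine <- data.frame(
  alcohol = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58)
)

# Hypothetical data-generating process for the two-class outcome.
p_high <- plogis(-9 + 1 * wine$alcohol - 3 * wine$volatile.acidity)
wine$cut.quality <- factor(
  ifelse(rbinom(n, 1, p_high) == 1, "(5,10]", "(0,5]"),
  levels = c("(0,5]", "(5,10]")  # first level is the reference (low) class
)

# Logistic regression: log-odds of the high class as a linear function.
m <- glm(cut.quality ~ alcohol + volatile.acidity,
         family = "binomial", data = wine)

round(coef(summary(m)), 3)  # estimates, standard errors, z and p values
round(exp(coef(m)), 3)      # odds ratios per one-unit increase
```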

Apart from the significance of the variables with high correlation, it should be noted that ‘free.sulfur.dioxide’, a variable with low correlation with ‘quality’, is also significant in m3, and so is its squared term.

One downside of this model is that it assumes a very rigid structure. Given the distribution of quality values, it would make more sense to model ‘quality’ directly as a qualitative variable and try other methods, including non-parametric ones (trees, SVM, discriminant analysis, etc.).
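As an illustration of the tree-based option, the sketch below fits a classification tree with the rpart package, treating each score as a class. It was not part of the analysis above; the simulated data and the choice of covariates are assumptions made purely for illustration.

```r
library(rpart)

set.seed(7)
n <- 300

# Simulated stand-in for the wine data (not real measurements).
wine <- data.frame(
  alcohol = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58)
)

# Hypothetical score loosely tied to the two covariates, clipped to 3..8.
score <- 5 + 0.5 * (wine$alcohol - 10) -
  1.5 * (wine$volatile.acidity - 0.5) + rnorm(n, sd = 0.7)
wine$quality <- factor(pmin(8, pmax(3, round(score))))

# Classification tree with the score as a categorical response.
tree <- rpart(quality ~ alcohol + volatile.acidity,
              data = wine, method = "class")

# In-sample predictions; this accuracy is optimistic by construction.
pred <- predict(tree, wine, type = "class")
mean(pred == wine$quality)
```

A fair comparison with the logistic models would need held-out data or cross-validation, since in-sample accuracy favours flexible methods.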

The analysis performed here also raises some ‘why’ questions. Why is abundant alcohol a sign of good quality, and why is abundant acetic acid a sign of bad quality? If we took a sample of non-experts, would we get the same answers? After all, non-experts also consume a lot of wine (maybe a larger share than experts), so one could think their opinions matter even more than experts’ opinions.

It would therefore be interesting to take a sample that also includes opinions from non-experts and use these opinions as a new covariate. Or, even more interesting, create a new variable measuring the difference in scores between experts and non-experts for each wine, make this new variable the main variable of analysis, and check which variables influence these differences the most.

References

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
Stack Overflow: stackoverflow.com
functions and ggplot: how to deal with column names and environment: http://stackoverflow.com/questions/5106782/use-of-ggplot-within-another-function-in-r
Remove axis ticks from ggpairs: http://stackoverflow.com/questions/30721091/how-to-remove-axis-scale-in-ggpairs