This is a stream-of-consciousness exploration of the Red Wine Quality data set from Modeling wine preferences by data mining from physicochemical properties. The main objective of this exploration is to get a sense of the degree to which the variables in the data are related to the variable ‘quality’.
library(dplyr); library(ggplot2); library(GGally); library(reshape2)
library(tidyr); library(gridExtra); library(knitr)
theme_set(theme_bw())
First, variable “X” is an index variable, so we’ll drop it from the analysis. We’ll also reformat the variable of analysis, ‘quality’, as an ordered factor.
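A minimal sketch of this preparation step follows. The data frame name ‘wine’ and the toy values are assumptions made for illustration; only the column names ‘X’ and ‘quality’ come from the text.

```r
# Toy stand-in for the red wine data (hypothetical values).
wine <- data.frame(
  X = 1:6,
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  quality = c(5, 5, 6, 3, 7, 8)
)

# Drop the index column; it carries no information about the wines.
wine$X <- NULL

# Make 'quality' an ordered factor over the observed score range 3..8.
wine$quality <- factor(wine$quality, levels = 3:8, ordered = TRUE)

str(wine)
table(wine$quality)  # counts per score, including empty levels
```

Fixing the levels explicitly keeps scores with no wines visible in later tables and plots.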
Although ‘quality’ can take values between 0 and 10, the observed minimum is 3 and the maximum is 8. The following plot shows the distribution of the values.
Most wines have a mid-range score: about 82% of them (1,319 of 1,599) are rated 5 or 6.
Before checking the covariates individually, we’ll take a look at their statistical summaries and at their relationships with ‘quality’ (for the latter, we have to convert ‘quality’ to numeric).
## fixed.acidity  volatile.acidity  citric.acid    residual.sugar
## Min.   : 4.60  Min.   :0.1200    Min.   :0.000  Min.   : 0.900
## 1st Qu.: 7.10  1st Qu.:0.3900    1st Qu.:0.090  1st Qu.: 1.900
## Median : 7.90  Median :0.5200    Median :0.260  Median : 2.200
## Mean   : 8.32  Mean   :0.5278    Mean   :0.271  Mean   : 2.539
## 3rd Qu.: 9.20  3rd Qu.:0.6400    3rd Qu.:0.420  3rd Qu.: 2.600
## Max.   :15.90  Max.   :1.5800    Max.   :1.000  Max.   :15.500
## chlorides        free.sulfur.dioxide  total.sulfur.dioxide
## Min.   :0.01200  Min.   : 1.00        Min.   :  6.00
## 1st Qu.:0.07000  1st Qu.: 7.00        1st Qu.: 22.00
## Median :0.07900  Median :14.00        Median : 38.00
## Mean   :0.08747  Mean   :15.87        Mean   : 46.47
## 3rd Qu.:0.09000  3rd Qu.:21.00        3rd Qu.: 62.00
## Max.   :0.61100  Max.   :72.00        Max.   :289.00
## density         pH             sulphates       alcohol        quality
## Min.   :0.9901  Min.   :2.740  Min.   :0.3300  Min.   : 8.40  3: 10
## 1st Qu.:0.9956  1st Qu.:3.210  1st Qu.:0.5500  1st Qu.: 9.50  4: 53
## Median :0.9968  Median :3.310  Median :0.6200  Median :10.20  5:681
## Mean   :0.9967  Mean   :3.311  Mean   :0.6581  Mean   :10.42  6:638
## 3rd Qu.:0.9978  3rd Qu.:3.400  3rd Qu.:0.7300  3rd Qu.:11.10  7:199
## Max.   :1.0037  Max.   :4.010  Max.   :2.0000  Max.   :14.90  8: 18
##              quality              alcohol     volatile.acidity
##                1.000                0.476               -0.391
##            sulphates          citric.acid total.sulfur.dioxide
##                0.251                0.226               -0.185
##              density            chlorides        fixed.acidity
##               -0.175               -0.129                0.124
##                   pH  free.sulfur.dioxide       residual.sugar
##               -0.058               -0.051                0.014
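A ranking like the one above, sorted by absolute correlation with ‘quality’, could be produced roughly as follows. This is a sketch on simulated data: the data frame ‘wine’, the helper column ‘quality.num’ and the simulated values are assumptions, not the original code.

```r
set.seed(1)
n <- 200

# Simulated stand-in for three of the covariates (hypothetical values).
wine <- data.frame(
  alcohol = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = runif(n, min = 0.12, max = 1.2),
  pH = rnorm(n, mean = 3.3, sd = 0.15)
)
score <- 5.6 + 0.6 * (wine$alcohol - 10.4) -
  1.5 * (wine$volatile.acidity - 0.66) + rnorm(n, sd = 0.6)
wine$quality <- factor(pmin(pmax(round(score), 3), 8),
                       levels = 3:8, ordered = TRUE)

# as.numeric() on a factor returns level codes (1..6), not the scores,
# so convert through character to recover the 3..8 scale.
wine$quality.num <- as.numeric(as.character(wine$quality))

# Pearson correlation of every numeric column with quality,
# ordered by absolute value.
num_cols <- c("quality.num", "alcohol", "volatile.acidity", "pH")
cors <- cor(wine[, num_cols])[, "quality.num"]
cors <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors, 3))
```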
The variables with the highest correlations with ‘quality’ are ‘alcohol’ and ‘volatile.acidity’. Judging from this plot, the latter may also have a bimodal distribution. Perhaps this behaviour is related to ‘quality’, which is worth checking.
Some of the covariates have very skewed distributions, some of them with outliers. There are clear outliers in ‘chlorides’ and ‘total.sulfur.dioxide’. Outliers can distort statistical measures such as correlation. Next, we present histograms of these variables, which give a clearer picture of the distribution of the data and of the outliers.
Though ‘total.sulfur.dioxide’ and ‘free.sulfur.dioxide’ don’t show high correlations with ‘quality’, the boxplots suggest an interesting relationship, with higher sulfur values for mid-range quality scores. This relationship deserves a closer inspection.
Definite patterns also appear in the boxplots for ‘sulphates’ and ‘residual.sugar’. The remaining variables don’t seem to show any particular pattern, but the size of the plot makes any deeper visual analysis difficult. We’ll have to check these variables individually anyway.
Now we’ll examine in more depth the relationship between ‘quality’ and the variables mentioned above, beginning with those where a relationship can be spotted at first glance. Additional variables will be brought into the analysis where appropriate.
the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
Volatile acidity has a correlation of -0.39 with quality. The next set of plots includes a histogram of ‘volatile.acidity’, the same histogram faceted by ‘quality’, and a boxplot overlaid with jittered points. This will be the default presentation whenever we draw a boxplot, mainly because so few wines have extreme scores (3 and 8) that boxplots alone could give the wrong picture.
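The boxplot-plus-jitter presentation could look roughly like this in ggplot2, which is loaded at the top of the report. The simulated data frame and the styling choices (jitter width, transparency) are assumptions for illustration.

```r
library(ggplot2)

set.seed(2)
# Simulated stand-in: mostly mid scores, very few 3s and 8s (hypothetical).
wine <- data.frame(
  quality = factor(
    sample(3:8, 300, replace = TRUE,
           prob = c(0.01, 0.03, 0.43, 0.40, 0.12, 0.01)),
    levels = 3:8, ordered = TRUE
  ),
  volatile.acidity = runif(300, min = 0.12, max = 1.2)
)

# Boxplot per score, with the raw points jittered on top so that
# sparsely populated scores are visible as such.
p <- ggplot(wine, aes(x = quality, y = volatile.acidity)) +
  geom_boxplot(outlier.shape = NA) +      # outliers already shown by the jitter layer
  geom_jitter(width = 0.2, alpha = 0.3)   # horizontal jitter only in spirit: small width
print(p)
```

Hiding the boxplot's own outlier markers avoids drawing those observations twice.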
The general relationship is clear: higher-scored wines generally have lower levels of acetic acid.
A bimodal distribution can be seen in the histogram on the left. This could be explained by the different scores at the quality levels with the most wines (5 and 6): the histogram for score 6 seems to be shifted to the left (toward lower values) relative to the histogram for score 5. We can inspect this in more detail by summarizing ‘volatile.acidity’ by ‘quality’.
## Source: local data frame [6 x 4]
##
##   quality  mean    sd     n
##    (fctr) (dbl) (dbl) (int)
## 1       3  0.88  0.33    10
## 2       4  0.69  0.22    53
## 3       5  0.58  0.16   681
## 4       6  0.50  0.16   638
## 5       7  0.40  0.15   199
## 6       8  0.42  0.14    18
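The output above looks like a dplyr summary. As a dependency-free sketch of the same idea, the per-score mean, standard deviation and count can be computed in base R; the simulated data and the object names ‘groups’ and ‘by_quality’ are assumptions.

```r
set.seed(3)
# Simulated stand-in with three scores only (hypothetical values).
scores <- rep(c(5, 6, 7), times = c(40, 40, 20))
wine <- data.frame(
  quality = factor(scores, levels = 3:8, ordered = TRUE),
  volatile.acidity = abs(rnorm(100, mean = 1.03 - 0.09 * scores, sd = 0.1))
)

# Split the covariate by score, dropping scores with no wines.
groups <- split(wine$volatile.acidity, wine$quality, drop = TRUE)

# One row per score: mean, standard deviation and group size.
by_quality <- data.frame(
  quality = names(groups),
  mean = round(sapply(groups, mean), 2),
  sd = round(sapply(groups, sd), 2),
  n = sapply(groups, length),
  row.names = NULL
)
print(by_quality)
```

Reporting n next to each mean matters here, since the extreme scores rest on very few wines.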
Quality, at least when taken as a two-interval variable, doesn’t seem to explain the bimodality of ‘volatile.acidity’, but we can at least see the “shift” in the distribution more clearly when grouping by quality score.
One covariate that could explain this behaviour is ‘citric.acid’, since it is highly correlated with ‘volatile.acidity’. About 27% of its values (438 of 1,599) lie between 0 and 0.1.
##
## (-0.001,0.1]    (0.1,0.2]    (0.2,0.3]    (0.3,0.4]    (0.4,0.5]
##          438          193          291          234          253
##    (0.5,0.6]    (0.6,0.7]    (0.7,0.8]    (0.8,0.9]      (0.9,1]
##          112           62           15            0            1
We’ll split this variable in two: one part for values between 0 and 0.1, and another for the rest of the values (as a side note, the first interval in the table starts below 0, although the minimum of ‘citric.acid’ is 0. This point will be addressed later).
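On the side note: a first label of (-0.001,0.1] is what base R’s cut() produces when it is given a number of bins, because it widens the data range by 0.1% at both ends. Whether the original table was built that way is an assumption; the sketch below, on toy values, shows that behaviour and a two-way split with explicit breaks.

```r
# Toy values spanning the same 0-1 range as 'citric.acid' (hypothetical).
citric <- c(0, 0.02, 0.1, 0.24, 0.26, 0.49, 0.49, 0.66, 1)

# A bin count makes cut() extend the range slightly, so the first
# label starts just below the minimum even though no value is negative.
auto_bins <- cut(citric, breaks = 10)
print(levels(auto_bins)[1])

# Explicit breaks plus include.lowest = TRUE keep 0 in a closed first bin.
fixed_bins <- cut(citric, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
print(levels(fixed_bins)[1])

# Two-way split: low (0 to 0.1) versus the rest.
citric.cut <- cut(citric, breaks = c(0, 0.1, 1), include.lowest = TRUE)
print(table(citric.cut))
```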
This split has almost removed the bimodality of the distribution, and it also shows that lower concentrations of ‘citric.acid’ go with higher concentrations of ‘volatile.acidity’. The next logical step is to examine the relationship between these two variables together with ‘quality’.
found in small quantities, citric acid can add ‘freshness’ and flavor to wines
‘citric.acid’ has a correlation of approx. 0.23 with ‘quality’. Let’s check the behaviour of the former and its relationship with the latter in a little more detail.
‘citric.acid’ has an asymmetric distribution with a high number of values close to 0. There is an outlier with value 1. From the histogram and the points in the boxplot one can make out a high concentration of values around 0.5. The next table reveals that the value in question is 0.49.
##
##    0 0.49 0.24 0.02 0.26  0.1
##  132   68   51   50   38   35
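A frequency table of the most common values, like the one above, can be obtained by sorting the result of table(). The toy vector and the name ‘top_values’ are assumptions for illustration.

```r
# Toy values with a spike at 0 and a smaller one at 0.49 (hypothetical).
citric <- c(0, 0, 0, 0.49, 0.49, 0.24, 0.02, 0.26, 0.1)

# Count each distinct value, sort by frequency, keep the most common ones.
top_values <- head(sort(table(citric), decreasing = TRUE), 3)
print(top_values)
```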
Taking the boxplot as a reference, the trend is also clear for this variable: wines with higher scores have a higher median citric acid concentration.
Finally, it is worth analysing this relationship together with ‘volatile.acidity’, given the results obtained for that variable earlier in this report.
Next we present a scatter plot of ‘volatile.acidity’ against ‘citric.acid’, colored by ‘quality’ (‘quality’ values equal to 0 have been omitted).
Now, if there were an inverse relationship between ‘citric.acid’ and ‘quality’, and ‘volatile.acidity’ were a confounding variable, we should observe lower scores at high levels of ‘citric.acid’ for each level of ‘volatile.acidity’. We can’t observe any such thing in this plot. Maybe we can get a better look if we use ‘cut.quality’ as the color variable instead.
Neither the scatter plot nor the smoother gives us a reason to believe that ‘volatile.acidity’ is a confounding variable for ‘citric.acid’. Just for good measure, we’ll display the distribution of ‘quality’ faceted by ‘citric.cut’ to see more clearly whether there is any significant change depending on the value of this last variable.
Both distributions are very similar. At most, one could say there is a higher proportion of wines scored 7 among wines with higher citric acid concentrations. These last figures confirm that the relationship between ‘quality’ and ‘citric.acid’ (if there is any) is positive and not negative. So, contrary to what the commentary for this variable in the data set description suggests, we’ll assume for the moment that ‘citric.acid’ has either no relationship with ‘quality’ or just a slightly positive one.
the percent alcohol content of the wine
the density of wine is close to that of water depending on the percent alcohol and sugar content
Alcohol has the highest correlation with quality: around 0.48. Looking at the figures we can see that the amount of alcohol seems to remain stable for wines scored up to 5. But from 6 to 8 there are, in proportion, more wines with higher alcohol concentration.
Alcohol is also highly correlated with ‘density’ (approx. -0.5), so we should include this variable in some way when looking at the relationship of ‘alcohol’ with the main variable. Next we present the set of three graphs for ‘density’, and then the relationship of this variable with ‘alcohol’, together with ‘quality’.
Almost all wines with low scores have low alcohol and high density. Low scores are scarce at high levels of alcohol. However, in that part of the plot, when controlling for alcohol, the wines with lower density seem to score worse. So, maybe wines with higher density score worse just because they have lower levels of alcohol.
Perhaps we’ll get a better picture of what is happening on the right side of the plot if we split quality in three.
Now one could say that high quality wines have higher density at each level of alcohol, but this pattern is doubtful, in part because there are too few high quality wines to draw a conclusion. If we compute means and medians by quality, the pattern isn’t clear either.
## Source: local data frame [3 x 7]
##
##   quality.three mean_alcohol median_alcohol mean_density median_density
##          (fctr)        (dbl)          (dbl)        (dbl)          (dbl)
## 1         (2,4]        10.22           10.0    0.9966887        0.99660
## 2         (4,6]        10.25           10.0    0.9968673        0.99680
## 3         (6,8]        11.52           11.6    0.9960303        0.99572
## Variables not shown: sd_density (dbl), n (int)
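The three-way split and the grouped summaries can be sketched in base R as follows. The toy values and the object names ‘means’ and ‘medians’ are assumptions; the interval labels match the ‘quality.three’ levels shown above.

```r
# Toy stand-in with numeric scores (hypothetical values).
wine <- data.frame(
  quality = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol = c(9.8, 10.1, 9.5, 10.0, 10.4, 11.0, 11.5, 12.6),
  density = c(0.9975, 0.9968, 0.9970, 0.9966, 0.9969, 0.9961, 0.9958, 0.9952)
)

# Three score bands: (2,4], (4,6] and (6,8].
wine$quality.three <- cut(wine$quality, breaks = c(2, 4, 6, 8))

# Mean and median of alcohol and density within each band.
means <- aggregate(cbind(alcohol, density) ~ quality.three,
                   data = wine, FUN = mean)
medians <- aggregate(cbind(alcohol, density) ~ quality.three,
                     data = wine, FUN = median)
print(means)
print(medians)
print(table(wine$quality.three))  # band sizes
```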
a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
Wines with more sulphates seem to have a higher median score. There is also a clear outlier. Again, the few wines with higher and lower scores prevent us from reaching definitive conclusions.
free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
As free and total sulfur dioxide are highly correlated, it makes sense to analyse them together in order to examine their influence on the quality of wine.
First, let’s display the set of three plots for these two covariates.
The two covariates seem to behave in more or less the same way with respect to ‘quality’, and it is a strange behaviour: the wines with the highest sulfur concentrations receive mid-range scores. There are outliers in the ‘total.sulfur.dioxide’ boxplot, though, which make it difficult to see possible trends. Here we plot this last figure again with the outliers removed.
The shape of both boxplots suggests a possibly quadratic relationship between ‘quality’ and the two sulfur variables.
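One way to probe such a curved pattern is to compare a straight-line fit with one that adds a squared term. The sketch below does this on simulated data with an inverted-U shape built in; the variable names ‘q’ and ‘total.so2’ and all values are assumptions, and this is not a step taken in the original analysis.

```r
set.seed(4)
# Simulated scores, concentrated on 5 and 6 (hypothetical).
q <- sample(3:8, 400, replace = TRUE,
            prob = c(0.02, 0.05, 0.42, 0.38, 0.11, 0.02))

# Sulfur level peaks at mid scores and drops toward both extremes.
total.so2 <- 80 - 8 * (q - 5.5)^2 + rnorm(400, sd = 8)

linear <- lm(total.so2 ~ q)
quadratic <- lm(total.so2 ~ q + I(q^2))

print(coef(quadratic))           # a negative I(q^2) term means an inverted U
print(anova(linear, quadratic))  # F test: does the squared term help?
```

A linear correlation close to zero is compatible with a strong relationship of this shape, which is why the correlation table alone can hide it.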
Next, we’ll draw a scatter plot of the two sulfur variables, colored by ‘cut.quality’ and with a smoother.
Except at very low levels of ‘free.sulfur.dioxide’, it seems that at every level of this covariate, wines with higher concentrations of ‘total.sulfur.dioxide’ are scored worse.
the amount of salt in the wine
Two things can be said by looking at these plots: there is a very slight negative correlation between ‘chlorides’ and ‘quality’, and ‘chlorides’ has clear outliers. Just to see what the distribution of ‘chlorides’ looks like without outliers, we’ll draw the histogram again with them removed.
‘chlorides’ seems to be more or less symmetrically distributed, with a long tail on the right side.
‘chlorides’ is also somewhat correlated with ‘sulphates’. Let’s see how this relates to ‘cut.quality’.
This correlation seems to be due to the higher, rarer values of chloride. For these values, wines also have lower scores.
##            cut.quality
##             (0,5] (5,10]
##   (0,0.3]     729    848
##   (0.3,0.8]    15      7
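Read as row proportions, the table says that 15 of the 22 high-chloride wines (about 68%) fall in the low-quality class, against 729 of 1,577 (about 46%) for the rest. A cross table of this kind, with its row proportions, can be sketched as follows; the toy values and the names ‘chlorides.cut’ and ‘tab’ are assumptions.

```r
# Toy stand-in (hypothetical values).
wine <- data.frame(
  chlorides = c(0.05, 0.08, 0.09, 0.12, 0.35, 0.41, 0.61),
  quality = c(5, 6, 7, 5, 4, 5, 6)
)

# Two quality classes and two chloride bands, using the breaks
# visible in the table labels.
wine$cut.quality <- cut(wine$quality, breaks = c(0, 5, 10))
chlorides.cut <- cut(wine$chlorides, breaks = c(0, 0.3, 0.8))

tab <- table(chlorides.cut, cut.quality = wine$cut.quality)
print(tab)
print(round(prop.table(tab, margin = 1), 2))  # share of each class per band
```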
From the boxplots, it appears that the best-scored wines have a lower median pH. But again, the low number of observations for wines with extreme scores weakens these conclusions.
(This plot derives from the table of correlations between quality and the rest of the covariates.)
The variable with the highest correlation is ‘alcohol’, followed by ‘volatile.acidity’, ‘sulphates’ and ‘citric.acid’. Though this plot presents the most obvious relationships based on correlations, some other interesting relationships remain concealed, like those of ‘total.sulfur.dioxide’ and ‘free.sulfur.dioxide’, which seem to have a quadratic relationship with quality. At the same time, these two variables combined show a clearer relationship (Plot 3). Something similar happens with density and alcohol combined.
One thing stands out from this plot when you take into account this description from the data set: ‘found in small quantities, citric acid can add ’freshness’ and flavor to wines’. Yet ‘citric.acid’ has a low but positive correlation with quality.
About 27% of wines have citric acid levels between 0 and 0.1: we’ll consider these ‘low levels of citric acid’. As can be seen in the histograms, the quality distribution for wines with low concentrations of citric acid is by no means shifted to the right (that is, with proportionally more wines at higher scores) with respect to wines with higher concentrations, contradicting the sentence from the data set description mentioned previously.
‘total.sulfur.dioxide’ has a correlation of about -0.18 with ‘quality’, far weaker than ‘alcohol’ and ‘volatile.acidity’. But when combining it with ‘free.sulfur.dioxide’ and coloring by ‘quality’ previously transformed into two score intervals (low: 0 to 5; high: 6 to 10), the relationship becomes pretty clear: at each level of ‘free.sulfur.dioxide’, wines of higher quality have lower concentrations of total sulfur dioxide. At the same time, we observe that at each level of free sulfur dioxide we have around the same proportion of the two classes of wines (low and high quality).
Studying the relationship between numeric variables can seem very simple at first. But in order not to be deceived by the data, possible confounders have to be taken into account. This requires a more in-depth analysis.
Here, three to four variables showed a clear correlation with the main variable, quality, and I guess it is proper to say that they are strongly correlated with this variable of analysis. But there are some caveats regarding ‘quality’. First, it is a numeric variable, but not a continuous one. It is a discrete variable representing a score, so we have to be very careful when using measures such as correlation.
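One common way to be careful with a discrete score is to report a rank-based correlation next to Pearson’s. This was not done in the analysis above; the sketch below, on simulated data with assumed names, only shows how the three coefficients available in base R’s cor() can be put side by side.

```r
set.seed(5)
# Simulated covariate and a discrete 3..8 score driven by it (hypothetical).
alcohol <- rnorm(200, mean = 10.4, sd = 1)
quality.num <- pmin(pmax(round(5.6 + 0.6 * (alcohol - 10.4) +
                                 rnorm(200, sd = 0.7)), 3), 8)

# Pearson treats the score as interval data; Spearman and Kendall
# only use the ordering, which suits an ordinal score with many ties.
pearson <- cor(alcohol, quality.num)
spearman <- cor(alcohol, quality.num, method = "spearman")
kendall <- cor(alcohol, quality.num, method = "kendall")

print(round(c(pearson = pearson, spearman = spearman, kendall = kendall), 3))
```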
Second, the very distribution of this variable makes the analysis more difficult: a high proportion of wines have mid-range scores, and the higher and lower scores have very few instances. This makes the analysis less reliable. One solution is to turn ‘quality’ into a qualitative variable by splitting it into two or three intervals, which has been done here for some specific analyses. The downside of this option is that information is lost.
Regarding the confounding variables, a controlled analysis can be performed precisely by keeping ‘quality’ split in two: a logistic regression.
A logistic regression model is a more suitable model for analysing the relationships examined earlier, since we can observe the effect of each variable while controlling for the influence of the other covariates.
##
## Calls:
## m1: glm(formula = cut.quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid, family = "binomial", data = wine)
## m2: glm(formula = cut.quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid + total.sulfur.dioxide + density + chlorides,
## family = "binomial", data = wine)
## m3: glm(formula = cut.quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid + total.sulfur.dioxide + I(total.sulfur.dioxide^2) +
## density + chlorides + fixed.acidity + pH + free.sulfur.dioxide +
## I(free.sulfur.dioxide^2) + residual.sugar, family = "binomial",
## data = wine)
##
## =============================================================
##                                 m1          m2          m3
## -------------------------------------------------------------
## (Intercept)                 -9.323***  -62.258      46.427
##                             (0.799)    (45.597)    (80.912)
## alcohol                      1.000***    0.935***    0.852***
##                             (0.069)     (0.084)     (0.104)
## volatile.acidity            -3.732***   -3.473***   -3.295***
##                             (0.440)     (0.480)     (0.489)
## sulphates                    2.119***    2.693***    2.767***
##                             (0.385)     (0.441)     (0.452)
## citric.acid                 -1.002**    -0.840      -1.281*
##                             (0.381)     (0.471)     (0.564)
## total.sulfur.dioxide                    -0.012***   -0.022***
##                                         (0.002)     (0.006)
## density                                 54.094     -54.561
##                                        (45.439)    (82.594)
## chlorides                               -3.856**    -3.723*
##                                         (1.491)     (1.572)
## I(total.sulfur.dioxide^2)                            0.000
##                                                     (0.000)
## fixed.acidity                                        0.142
##                                                     (0.099)
## pH                                                  -0.405
##                                                     (0.720)
## free.sulfur.dioxide                                  0.072**
##                                                     (0.023)
## I(free.sulfur.dioxide^2)                            -0.001*
##                                                     (0.000)
## residual.sugar                                       0.086
##                                                     (0.057)
## -------------------------------------------------------------
## Aldrich-Nelson R-sq.         0.235       0.252       0.259
## McFadden R-sq.               0.222       0.244       0.253
## Cox-Snell R-sq.              0.265       0.286       0.295
## Nagelkerke R-sq.             0.353       0.382       0.394
## phi                          1.000       1.000       1.000
## Likelihood-ratio           491.451     538.543     559.171
## p                            0.000       0.000       0.000
## Log-likelihood            -858.761    -835.215    -824.901
## Deviance                  1717.522    1670.430    1649.803
## AIC                       1727.522    1686.430    1677.803
## BIC                       1754.408    1729.447    1753.082
## N                             1599        1599        1599
## =============================================================
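Two readings of the table are worth spelling out. The alcohol coefficient of 1.000 in m1 is on the log-odds scale, so one extra percentage point of alcohol multiplies the odds of the modelled class by exp(1), about 2.7, with the other covariates held fixed. Also, AIC is lowest for m3 (1677.8) while BIC is lowest for m2 (1729.4), so the two criteria disagree on whether the extra terms in m3 are worth their cost. The sketch below fits and compares two nested models of this kind on simulated data; the data, the reduced set of covariates and the assumption that the higher interval of ‘cut.quality’ is the modelled class are all illustrative.

```r
set.seed(6)
n <- 600

# Simulated covariates (hypothetical values); pH is pure noise here.
wine <- data.frame(
  alcohol = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = runif(n, min = 0.12, max = 1.2),
  pH = rnorm(n, mean = 3.3, sd = 0.15)
)

# Binary outcome generated from a known logistic model.
eta <- 1.0 * (wine$alcohol - 10.4) - 3.5 * (wine$volatile.acidity - 0.66)
high <- runif(n) < plogis(eta)
wine$cut.quality <- factor(ifelse(high, "(5,10]", "(0,5]"),
                           levels = c("(0,5]", "(5,10]"))

# With a factor response, glm() models the probability of the
# second level, here the higher quality interval.
m1 <- glm(cut.quality ~ alcohol + volatile.acidity,
          family = "binomial", data = wine)
m2 <- glm(cut.quality ~ alcohol + volatile.acidity + pH,
          family = "binomial", data = wine)

print(round(summary(m1)$coefficients, 3))
print(round(exp(coef(m1)), 3))        # odds ratios
print(anova(m1, m2, test = "Chisq"))  # likelihood-ratio test for the added term
print(AIC(m1, m2))
print(BIC(m1, m2))
```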
Apart from the significance of the variables with high correlation, it should be noted that ‘free.sulfur.dioxide’, a variable with low correlation with ‘quality’, is also significant. Its squared term is significant as well.
One downside of this model is that it assumes a very rigid structure. Given the distribution of quality values, it would make more sense to treat ‘quality’ directly as a qualitative variable and try other methods, including non-parametric ones (trees, SVM, discriminant analysis, etc.).
The analysis performed here also raises some ‘why’ questions. Why is an abundance of alcohol a sign of good quality, and why is an abundance of acetic acid a sign of bad quality? If we took a sample of non-experts, would we get the same answers? After all, non-experts also consume a lot of wine (maybe a larger share than experts), so one could think their opinions matter even more than experts’ opinions.
Consequently, it would be interesting to take a sample that also includes opinions from non-experts and use these opinions as a new covariate. Or, even more interesting, create a new variable measuring the difference in scores between experts and non-experts for each wine, make this new variable the main variable of analysis, and check which variables influence these differences the most.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
Stack Overflow: stackoverflow.com
functions and ggplot: how to deal with column names and environment: http://stackoverflow.com/questions/5106782/use-of-ggplot-within-another-function-in-r
Remove axis ticks from ggpairs: http://stackoverflow.com/questions/30721091/how-to-remove-axis-scale-in-ggpairs