This dataset is public available for research. The details are described in [Cortez et al., 2009].
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at:
Description of attributes:
By performing this analysis, we seek to answer the following questions:
First of all, let’s load the dataset for red wines in a data frame. We will also store an aditional variable made of the quality scores as a factor. We will use it later on for certain type of charts.
reds <- read.csv("./data/wineQualityReds.csv")
reds$quality_as_factor <- factor(reds$quality, levels=c(0,1,2,3,4,5,6,7,8,9,10))
Some basic descriptive statistics will familiarise ourselves with the data. Although the summary command will give us some of them, we will retrieve them individually and store them in variables so we can use them accross the report.
Considering our forst question, let us summarise the quality scores given to each wine in terms of its central tendency and variability.
reds_quality_min <- min(reds$quality)
reds_quality_max <- max(reds$quality)
reds_quality_mean <- mean(reds$quality)
reds_quality_median <- median(reds$quality)
reds_quality_iqr <- IQR(reds$quality)
reds_quality_q1 <- reds_quality_median - reds_quality_iqr
reds_quality_q3 <- reds_quality_median + reds_quality_iqr
The summary command gives us already quite a lot of information. We can see that, although the range of posible scores is from 0 to 10, in our dataset the minimum score is 3 and the maximum is 8. The mean is 5.6360225 and the median is 6, very close to each other.
What about the mode?
reds_quality_mode <- names(which.max(table(reds$quality)))
So the most given quality score is 5.
But how disperse is the distribution of quality scores? The IQR goes from 5 for the lower qartile, to 7 for the upper one, around the median of 6 or second quartile. We could then consider any quality value of less than 3.5 or more than 8.5 an outlier. Do we have any of them?
reds_outliers <- sum(reds$quality < reds_quality_q1 - 1.5*reds_quality_iqr)
reds_outliers <- reds_outliers + sum(reds$quality > reds_quality_q3 + 1.5*reds_quality_iqr)
That is, we have 10 outliers.
What about the dispersion around the mean?
reds_quality_sd <- sd(reds$quality)
With a standard deviation of 0.8075694, our distribution of quality scores is not very disperse.
But, let’s visualise all this information to reinforce our answer on question 1.
ggplot(data=reds, aes(x=quality)) +
geom_bar(binwidth=1, color='black', fill='white') +
coord_cartesian(xlim=c(0,10)) +
geom_vline(xintercept = reds_quality_median, linetype='longdash', alpha=.5) +
geom_vline(xintercept = reds_quality_q1 - 1.5*reds_quality_iqr, linetype='longdash', alpha=.5) +
geom_vline(xintercept = reds_quality_q3 + 1.5*reds_quality_iqr, linetype='longdash', alpha=.5) +
geom_vline(xintercept = reds_quality_mean, linetype=1, color='blue', alpha=.5) +
xlab("Wine Quality") +
ylab("Number of Samples")
So if we consider the average quality score to be 5 in a 0 to 10 scale, we can see that our quality scores are beyond average under multiple criteria.
Let have a look at how correlated our different variables are.
cor(x=reds[,2:12], y=reds$quality)
## [,1]
## fixed.acidity 0.12405165
## volatile.acidity -0.39055778
## citric.acid 0.22637251
## residual.sugar 0.01373164
## chlorides -0.12890656
## free.sulfur.dioxide -0.05065606
## total.sulfur.dioxide -0.18510029
## density -0.17491923
## pH -0.05773139
## sulphates 0.25139708
## alcohol 0.47616632
From previous we see that the following variables are correlated with quality:
Volatile acidity (—)
Citric acid (++)
Chlorides (-)
We will concentrate in the first three variables that show stronger correlation with quality.
Alcohol contributes to a wine’s body perception during taste and also to its balance vs acidity and tanicity (just for reds).
First of all, a quick summary of the differnet alcohol values.
summary(reds$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
reds_alcohol_mean <- mean(reds$alcohol)
reds_alcohol_median <- median(reds$alcohol)
In order to understand the relationship between alcohol and quality, we will do two things. First we will inspect how the alcohol values are distributed accross the different quality scores. Then we will analyse how they change together.
For example, what is the average alcohol level for each quality score? We can quickly see this by using the dplyr library or just by using the tapply command.
tapply(reds$alcohol, reds$quality, mean)
## 3 4 5 6 7 8
## 9.955000 10.265094 9.899706 10.629519 11.465913 12.094444
And in a visual way, alcohol by quality level, together with the median for the whole distribution and the mean for the quality scores.
ggplot(data=reds, aes(x=quality_as_factor, y=alcohol)) +
geom_boxplot() +
geom_hline(show_guide=T, yintercept=reds_alcohol_mean, linetype='longdash', alpha=.5, color='blue') +
geom_vline(xintercept = reds_quality_mean-reds_quality_min+1, linetype='longdash', color='blue', alpha=.5) +
xlab("Wine Quality") +
ylab("Alcohol")
We can see how wine scores that are beyond the mean quality value of 5.6360225, also tend to show values beyond the mean alcohol value of 10.4229831.
We already see a tendency in the boxplots. But let’s use a scatter plot here, including a linear regression line.
ggplot(data=reds, aes(x=as.numeric(quality), y=alcohol)) +
geom_jitter(alpha=1/3) +
geom_smooth(method='lm', aes(group = 1))+
geom_hline(yintercept=reds_alcohol_mean, linetype='longdash', alpha=.5, color='blue') +
geom_vline(xintercept = reds_quality_mean, linetype='longdash', color='blue', alpha=.5) +
xlab("Wine Quality") +
ylab("Alcohol")
And as a third variable, let’s perform a similar analysis for citric acid. Remember that this one is not so strongly correlated as the previous ones.
summary(reds$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
reds_citric.acid_mean <- mean(reds$citric.acid)
reds_citric.acid_median <- median(reds$citric.acid)
reds_citric.acid_sd <- sd(reds$citric.acid)
reds_citric.acid_max <- max(reds$citric.acid)
caout <- (reds_citric.acid_max - reds_citric.acid_mean) / reds_citric.acid_sd
The mean value for the citric acid is 0.2709756 with a standard deviation of 0.1948011. As a note, the maximum value for citric acid is 1. That is 3.7424031 standard deviations beyond the mean! Again, probably an outlier.
And now the relationship between this variable and the quality score.
tapply(reds$citric.acid, reds$quality, mean)
## 3 4 5 6 7 8
## 0.1710000 0.1741509 0.2436858 0.2738245 0.3751759 0.3911111
And in a visual way, citric acid by quality level, together with the mean for the whole distribution and the mean for the quality scores.
ggplot(data=reds, aes(x=quality_as_factor, y=citric.acid)) +
geom_boxplot() +
geom_hline(show_guide=T, yintercept=reds_citric.acid_mean, linetype='longdash', alpha=.5, color='blue') +
geom_vline(xintercept = reds_quality_mean-reds_quality_min+1, linetype='longdash', color='blue', alpha=.5) +
xlab("Wine Quality") +
ylab("Citric Acid")
ggplot(data=reds, aes(x=as.numeric(quality), y=citric.acid)) +
geom_jitter(alpha=1/3) +
geom_smooth(method='lm', aes(group = 1))+
geom_hline(yintercept=reds_citric.acid_mean, linetype='longdash', alpha=.5, color='blue') +
geom_vline(xintercept = reds_quality_mean, linetype='longdash', color='blue', alpha=.5) +
xlab("Wine Quality") +
ylab("Citric Acid")
Again, we can see a tendency. Good wines tend to have higher citric acid levels.
What characteristics can we see influence wine quality negatively?
Let’s perform a similar analysis for volatile acidity.
summary(reds$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
reds_volatile.acidity_mean <- mean(reds$volatile.acidity)
reds_volatile.acidity_median <- median(reds$volatile.acidity)
reds_volatile.acidity_sd <- sd(reds$volatile.acidity)
reds_volatile.acidity_max <- max(reds$volatile.acidity)
vaout <- (reds_volatile.acidity_max - reds_volatile.acidity_mean) / reds_volatile.acidity_sd
The mean value for the volatile acidity is 0.5278205 with a standard deviation of 0.1790597. As a note, the maximum value for volatile acidity is 1.58. That is 5.8761378 standard deviations beyond the mean! Probably an outlier…
And now the relationship between this variable and the quality score.
tapply(reds$volatile.acidity, reds$quality, mean)
## 3 4 5 6 7 8
## 0.8845000 0.6939623 0.5770411 0.4974843 0.4039196 0.4233333
And in a visual way, volatile acidity by quality level, together with the median for the whole distribution and the mean for the quality scores.
ggplot(data=reds, aes(x=quality_as_factor, y=volatile.acidity)) +
geom_boxplot() +
geom_hline(show_guide=T, yintercept=reds_volatile.acidity_mean, linetype='longdash', alpha=.5, color='blue') +
geom_vline(xintercept = reds_quality_mean-reds_quality_min+1, linetype='longdash', color='blue', alpha=.5) +
xlab("Wine Quality") +
ylab("Volatile Acidity")
ggplot(data=reds, aes(x=as.numeric(quality), y=volatile.acidity)) +
geom_jitter(alpha=1/3) +
geom_smooth(method='lm', aes(group = 1))+
geom_hline(yintercept=reds_volatile.acidity_mean, linetype='longdash', alpha=.5, color='blue') +
geom_vline(xintercept = reds_quality_mean, linetype='longdash', color='blue', alpha=.5) +
xlab("Wine Quality") +
ylab("Volatile Acidity")
In each step we can see the negative influence of volatile acidity in a wine’s quality score.
Let us say we consider faulty wines those that do not get a score of at least 5 in the scale given (0 to 10). Let us then define a new variable faulty that takes a boolean value based on that.
reds <- transform(reds, is.faulty = quality<5)
How many faulty wines do we have?
faulty.reds <- table(reds$is.faulty)
We can see a faulty wines ratio of 0.0410156 .
From those variables we determined were highly correlated, let us see which ones are characteristic of faulty wines and at what levels. Boxplots are ideal for this purpose. Let us start with volatile acidity.
ggplot(data=reds, aes(x=is.faulty, y=volatile.acidity)) +
geom_boxplot() +
geom_hline(show_guide=T, yintercept=reds_volatile.acidity_mean, linetype='longdash', alpha=.5, color='blue') +
geom_vline(xintercept = reds_quality_mean-reds_quality_min+1, linetype='longdash', color='blue', alpha=.5) +
xlab("Faulty Wine") +
ylab("Volatile Acidity")
summary(subset(reds, is.faulty)$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.5650 0.6800 0.7242 0.8825 1.5800
What about alcohol? Maybe there is a minimum threshold under that a wine is perceived as faulty?
ggplot(data=reds, aes(x=is.faulty, y=alcohol)) +
geom_boxplot() +
geom_hline(show_guide=T, yintercept=reds_alcohol_mean, linetype='longdash', alpha=.5, color='blue') +
geom_vline(xintercept = reds_quality_mean-reds_quality_min+1, linetype='longdash', color='blue', alpha=.5) +
xlab("Faulty Wine") +
ylab("Alcohol")
summary(subset(reds, is.faulty)$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.22 11.00 13.10
In this case the answer is clearly no. Although faulty wines tend to have slightly less alcohol than non faulty ones, their ranges pretty much overlap. We will talk a bit more about this in the summary.
These were the two mainly correlated variables. But let us explore some other variables.
Next let us try with citric acid that typically defines fresh wines.
ggplot(data=reds, aes(x=is.faulty, y=citric.acid)) +
geom_boxplot() +
geom_hline(show_guide=T, yintercept=reds_citric.acid_mean, linetype='longdash', alpha=.5, color='blue') +
geom_vline(xintercept = reds_quality_mean-reds_quality_min+1, linetype='longdash', color='blue', alpha=.5) +
xlab("Faulty Wine") +
ylab("Citric Acid")
summary(subset(reds, is.faulty)$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0200 0.0800 0.1737 0.2700 1.0000
This might be and important one, yes. We can confirm this using the correlation between both variables.
cor(x=reds$volatile.acidity, y=reds$citric.acid)
## [1] -0.5524957
Recalling our original questions:
We already saw that our dataset contains wines over average. This is specially clear in the scores distribution chart.
library(ggplot2)
ggplot(data=reds, aes(x=quality)) +
geom_bar(binwidth=1, color='black', fill='white') +
coord_cartesian(xlim=c(0,10)) +
geom_vline(xintercept = reds_quality_median, linetype='longdash', alpha=.5) +
geom_vline(xintercept = reds_quality_q1 - 1.5*reds_quality_iqr, linetype='longdash', alpha=.5) +
geom_vline(xintercept = reds_quality_q3 + 1.5*reds_quality_iqr, linetype='longdash', alpha=.5) +
geom_vline(xintercept = reds_quality_mean, linetype=1, color='blue', alpha=.5) +
xlab("Wine Quality") +
ylab("# Samples")
So if we consider the average quality score to be 5 in a 0 to 10 scale, we can see that our quality scores are beyond average under multiple criteria.
About good wines. It seems clear that they have certain levels of alcohol and citric acid, and lack of defects. We already saw charts and distribution in previous sections showing a positive correlation between these variable and quality. So let’s try a new plot here. Let’s plot alcohol vs citric acid so we devide the chart area y quadrants. Then let’s color each plot using the quality score and see where good wines fall.
reds$quality.cut <- cut(reds$quality, breaks=c(0,4,6,10))
ggplot(data=reds, aes(x=alcohol, y=citric.acid)) +
coord_cartesian(
xlim=c(quantile(reds$alcohol,.01),quantile(reds$alcohol,.99)),
ylim=c(quantile(reds$citric.acid,.01),quantile(reds$citric.acid,.99))
) +
geom_jitter(alpha=.5, aes(size=quality.cut, color=quality.cut)) +
geom_vline(xintercept = reds_alcohol_mean, linetype='longdash', color='blue', alpha=.5) +
geom_hline(yintercept = reds_citric.acid_mean, linetype='longdash', color='blue', alpha=.5) +
xlab("Alcohol") +
ylab("Citric Acid")
In order to improve clarity, we have defined a cut in the scores.
Something interesting emerge from this chart. And is that good wines concentrate in the ipper right quadrant. That is, even when alcohol and citric acid sometimes produce high quality wines in isolation it is when we have certain levels of both when we have a more high quality scores. This is related with the concept of gustative balance in a wine, where the alcohol body and sweet feeling balances with citric acid.
What about defects? We already saw thet volatile acidity is very determinant for faulty wines. We also saw that alcohol is not. Both faulty and correct wines have similar distributions of levels of alcohol. And we finally saw that a high level of citric acid is associated with less faulty wines.
What about wines that, under the previous premises, have high level of volatile acidity? Our plot is getting a bit too complicated, so let’s try using tables. We will make use of our previously defined variable isFaulty. We also want to assign one of the previous quadrants to each wine. They are separated by the axis mean values, so it should be easy.
reds <- transform(
reds,
quadrant = ifelse(alcohol>reds_alcohol_mean,
ifelse(citric.acid>reds_citric.acid_mean,4,2),
ifelse(citric.acid>reds_citric.acid_mean,3,1))
)
These are the quadrants:
Let’s table each quadrant to see what we have.
table(reds$quadrant)
##
## 1 2 3 4
## 561 291 355 392
Let’s see the proportions of faulty wines in each quadrant.
sort(tapply(reds$is.faulty, reds$quadrant, mean))
## 4 3 1 2
## 0.005102041 0.036619718 0.048128342 0.072164948
That is, there is a lower proportion of faulty wines in the 4th quadrant, etc.
So finally, what about the volatile acidity level?
sort(tapply(reds$volatile.acidity>reds_volatile.acidity_mean, reds$quadrant, mean))
## 4 3 2 1
## 0.1045918 0.2901408 0.6907216 0.7807487
Again, the proportion of wines with a high level of volatile acidity, typically produced by acetic bacteries present in vinegar, is lower in the 4th quadrant and higher in the 1st one. We can also see that there is a considerable gap between 3rd quadrant and the lower two, due to the reduction effect of citric acid on volatile acidity (negative correlation).
cor(x=reds$volatile.acidity, y=reds$citric.acid)
## [1] -0.5524957
There are many other factors that are related with good wines. Many of them are related with smells and flavours and not with chemical properties and gustative perceptions like these that we have in our dataset. Although our variables are kind of explanatory of what we have, we have also seen some cases where the must be other explanations for high or low quality levels.
However, within our limitations, we have discover the very interesting concept of gustative balance between alcohol and acidity, and how important is for a wine to be free of defects.