The purpose of this project is to determine what physicochemical properties affect white wine quality through exploratory data analysis.
The Wine Quality dataset is downloaded from the following website.
[Website] http://www3.dsi.uminho.pt/pcortez/wine/
[Data Link] http://www3.dsi.uminho.pt/pcortez/wine/winequality.zip
[Citation] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
The zip file contains two csv files, one for red wines (winequality-red.csv) and one for white wines (winequality-white.csv). In this project only the white wine data is used.
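As a minimal sketch (my addition, not part of the original write-up), the data could be fetched and extracted from within R, assuming the URL above is still live; adjust paths if the zip's internal layout differs from what the read.table() call below expects.
# Hypothetical fetch-and-extract step
download.file("http://www3.dsi.uminho.pt/pcortez/wine/winequality.zip",
              destfile="winequality.zip")
unzip("winequality.zip", exdir="./input")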
The white wine csv file is loaded into R as follows.
# Load csv file
ww <- read.table("./input/winequality-white.csv", header=T, sep=";")
str(ww)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The first eleven variables are physicochemical properties acquired from a computerized system (iLab), and the last variable quality is a score between 0 (very bad) and 10 (excellent) based on sensory data. Each sample was evaluated by a minimum of three sensory assessors (using blind tastes), and the final sensory score is the median of these evaluations.
For the sake of convenience in the analysis, the original variable names and the independent variable names (physicochemical properties) are defined.
# Define original variable names
ORIGINAL <- colnames(ww)
# Define independent variable names
INDEPENDENT <- colnames(ww)[1:11]
To enable looking at quality as a classification target, a factor variable converted from the numerical one is added.
ww$quality.f <- as.factor(ww$quality)
summary(ww$quality.f)
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
The following libraries are used in this project.
# For visualization
library(ggplot2)
library(gridExtra)
library(corrplot)
library(dplyr)
# For modeling
library(psych)
library(memisc)
library(nnet)
library(rpart)
library(e1071)
library(randomForest)
summary(ww)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
##
## quality quality.f
## Min. :3.000 3: 20
## 1st Qu.:5.000 4: 163
## Median :6.000 5:1457
## Mean :5.878 6:2198
## 3rd Qu.:6.000 7: 880
## Max. :9.000 8: 175
## 9: 5
All variables except residual.sugar have very similar medians and means. This suggests symmetrical distributions. Also, looking at the 1st and 3rd quartiles, they feel relatively close to the center values. So what do the distributions look like? Blue lines show the mean and red lines show the median of each distribution; dashed lines are the 1st and 3rd quartiles.
dens <- lapply(ORIGINAL, FUN=function(var) {
ggplot(ww, aes_string(x=var)) +
geom_density(fill='gray') +
geom_vline(aes(xintercept=mean(ww[, var])), color='blue', size=1) +
geom_vline(aes(xintercept=median(ww[, var])), color='red', size=1) +
geom_vline(aes(xintercept=quantile(ww[, var], 0.25)),
linetype='dashed', size=0.5) +
geom_vline(aes(xintercept=quantile(ww[, var], 0.75)),
linetype='dashed', size=0.5)
})
do.call(grid.arrange, args=c(dens, list(ncol=3)))
As expected, most of them, except residual.sugar and alcohol, have symmetrical distributions, although some have outliers on the positive side. The dependent variable quality is a little exceptional since it's a discrete numeric variable.
To get a better sense of the distributions, let's standardize them and take a look.
ww_scaled <- data.frame(scale(ww[, ORIGINAL]))
summary(ww_scaled)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :-3.61998 Min. :-1.9668 Min. :-2.7615 Min. :-1.1418
## 1st Qu.:-0.65743 1st Qu.:-0.6770 1st Qu.:-0.5304 1st Qu.:-0.9250
## Median :-0.06492 Median :-0.1810 Median :-0.1173 Median :-0.2349
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.52758 3rd Qu.: 0.4143 3rd Qu.: 0.4612 3rd Qu.: 0.6917
## Max. : 8.70422 Max. : 8.1528 Max. :10.9553 Max. :11.7129
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :-1.6831 Min. :-1.95848 Min. :-3.0439
## 1st Qu.:-0.4473 1st Qu.:-0.72370 1st Qu.:-0.7144
## Median :-0.1269 Median :-0.07691 Median :-0.1026
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.1935 3rd Qu.: 0.62867 3rd Qu.: 0.6739
## Max. :13.7417 Max. :14.91679 Max. : 7.0977
## density pH sulphates
## Min. :-2.31280 Min. :-3.10109 Min. :-2.3645
## 1st Qu.:-0.77063 1st Qu.:-0.65077 1st Qu.:-0.6996
## Median :-0.09608 Median :-0.05475 Median :-0.1739
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.69298 3rd Qu.: 0.60750 3rd Qu.: 0.5271
## Max. :15.02976 Max. : 4.18365 Max. : 5.1711
## alcohol quality
## Min. :-2.04309 Min. :-3.2495
## 1st Qu.:-0.82419 1st Qu.:-0.9913
## Median :-0.09285 Median : 0.1379
## Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.71974 3rd Qu.: 0.1379
## Max. : 2.99502 Max. : 3.5252
Since the 1st and 3rd quartiles of the standard normal distribution are -0.674 and 0.674 respectively, some of them look really close to it. So let's check normality with Q-Q plots.
par(mfrow=c(4,3), mar=c(2,2,2,2))
dummy <- lapply(ORIGINAL, FUN=function(var) {
qqnorm(ww_scaled[, var], main=var)
qqline(ww_scaled[, var])
})
Roughly speaking, all of them follow a normal distribution in the range between -2 and +2 standard deviations. (Again, quality is exceptional since it's a discrete distribution.) But close to the positive end, outliers are observed. Only alcohol has a narrower distribution on the positive side compared to the normal distribution. On the other hand, on the negative side most of them have narrower distributions than the normal distribution. These observations imply that the distributions of the independent variables are close to normal at the center, but outliers on the positive side add a touch of right skewness.
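To put a number on that right skewness, one option (my addition, not part of the original analysis) is the sample skewness from the e1071 package, which is loaded above.
# Sample skewness per physicochemical property; positive values
# indicate right-skewed distributions.
round(sapply(ww[, INDEPENDENT], e1071::skewness), 2)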
To take an overview of the relationships between pairs of variables, let's start with a correlation matrix.
corrplot.mixed(cor(ww[, ORIGINAL]))
In terms of relationships between quality and the independent variables, alcohol has the strongest correlation with quality, 0.44. The second is density at -0.31, though that seems natural since the correlation between density and alcohol is -0.78. (density does not seem to provide any new information about quality since it's already explained by alcohol.)
In terms of relationships between independent variables, some strong correlations are observed:
residual.sugar - density
free.sulfur.dioxide - total.sulfur.dioxide
total.sulfur.dioxide - density
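These pairs can also be pulled out programmatically; a small sketch (the 0.5 cutoff is my own arbitrary choice, not from the original analysis):
# List variable pairs whose absolute correlation exceeds 0.5
cm <- cor(ww[, INDEPENDENT])
cm[upper.tri(cm, diag=TRUE)] <- NA   # keep each pair only once
idx <- which(abs(cm) > 0.5, arr.ind=TRUE)
data.frame(var1=rownames(cm)[idx[, 1]],
           var2=colnames(cm)[idx[, 2]],
           r=round(cm[idx], 2))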
To see the relationships between quality and the independent variables more closely, let's create boxplots that describe each independent variable per quality level. Here I remove outliers on the positive side to visualize the majority more clearly.
showBoxplt <- function(df, y, x='quality.f', lower=0, upper=.99) {
ggplot(df, aes_string(x=x, y=y, fill=x)) +
geom_boxplot() +
ylim(quantile(df[, y], prob=lower),
quantile(df[, y], prob=upper)) +
theme(legend.position="none")
}
boxp1 <- lapply(INDEPENDENT, FUN=function(var) showBoxplt(ww, var))
do.call(grid.arrange, args=c(boxp1, list(ncol=3)))
By looking at the center lines and positions of the boxes, we can see the correlations in alcohol and density that we found in the correlation matrix. As quality goes up, alcohol also goes up and density goes down. Meanwhile, the others look almost flat.
We can see a loose trend in alcohol and density. But the others, for example pH and sulphates shown above, look flat with no distinction among quality levels. Remember, quality 9, which looks a little different, has just five data points.
Since it’s hard to find significant factor at this point, let’s create a new variable which has fewer quality levels and see if it can provide new insights.
# [Quality conversion 1]
# Create a simpler classification of quality.
# 3, 4, 5 -> bad
# 6 -> normal
# 7, 8, 9 -> good
ww$quality.f2 <- ifelse(ww$quality == 3 |
ww$quality == 4 |
ww$quality == 5,
"bad",
ifelse(ww$quality == 6,
"normal",
"good"))
ww$quality.f2 <- factor(ww$quality.f2, levels=c("bad", "normal", "good"))
summary(ww[, c('quality.f', 'quality.f2')])
## quality.f quality.f2
## 3: 20 bad :1640
## 4: 163 normal:2198
## 5:1457 good :1060
## 6:2198
## 7: 880
## 8: 175
## 9: 5
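As an aside (my own note, not in the original code), the same conversion can be written more compactly with cut(), which is equivalent here because quality only takes values 3 through 9.
# Intervals: (2,5] -> bad, (5,6] -> normal, (6,9] -> good
ww$quality.f2 <- cut(ww$quality, breaks=c(2, 5, 6, 9),
                     labels=c("bad", "normal", "good"))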
So how do the boxplots look now?
boxp2 <- lapply(INDEPENDENT,
FUN=function(var) showBoxplt(ww, var, 'quality.f2'))
do.call(grid.arrange, args=c(boxp2, list(ncol=3)))
While the differences among quality levels in density and alcohol got clearer, the others got flatter. This suggests that the other physicochemical properties do not really matter for quality.
The same zoomed-in views go as follows: density and alcohol show a nice hierarchical distinction, while the others got flatter.
To take a look at it from a different angle, let’s create density plots. Vertical lines show the mean of each distribution.
showDensplt <- function(df, x, cls='quality.f2', lower=0, upper=.99) {
# Compute mean of distribution per level of cls
mu <- summarise_(group_by_(df, cls), paste0('mean(', x, ')'))
colnames(mu)[2] <- x
ggplot(df, aes_string(x=x, fill=cls)) +
geom_density(alpha=0.3) +
geom_vline(aes_string(xintercept=x, color=cls), data=mu, size=1) +
xlim(quantile(df[, x], prob=lower),
quantile(df[, x], prob=upper)) +
theme(legend.position="none")
}
dens <- lapply(INDEPENDENT, FUN=function(col) showDensplt(ww, col))
do.call(grid.arrange, args=c(dens, list(ncol=3)))
The trends and flatness observed in the boxplots appear as follows in the density plots. The hierarchical distinction observed in the previous boxplots shows up as a wider spread of the vertical mean lines, while for the flat variables the distributions and means are very similar among quality levels.
Since it’s still hard to find any significance other than alcohol
and density
, let’s try simpler quality classifications for extreme qualities.
# [Quality conversion 2]
# Create factor variables for extreme qualities.
# excellent: 9,8 -> True
# inferior: 3,4 -> True
ww$excellent <- ifelse(ww$quality == 8 | ww$quality == 9, T, F)
ww$inferior <- ifelse(ww$quality == 3 | ww$quality == 4, T, F)
summary(ww[, c('excellent', 'inferior')])
## excellent inferior
## Mode :logical Mode :logical
## FALSE:4718 FALSE:4715
## TRUE :180 TRUE :183
## NA's :0 NA's :0
So can we see any significant independent variable in the extreme cases? First, take a look at the excellent wines.
boxp3 <- lapply(INDEPENDENT,
FUN=function(var) showBoxplt(ww, y=var, x='excellent'))
do.call(grid.arrange, args=c(boxp3, list(ncol=3)))
While the gap in alcohol and density got more significant, not so much difference can be observed in the others. Maybe pH of excellent=TRUE is slightly bigger than that of excellent=FALSE.
How about inferior wines?
boxp4 <- lapply(INDEPENDENT,
FUN=function(var) showBoxplt(ww, y=var, x='inferior'))
do.call(grid.arrange, args=c(boxp4, list(ncol=3)))
What is interesting is that alcohol does not have a significant difference when it is observed from the inferior viewpoint. Also, the difference in free.sulfur.dioxide seems to be significant.
Those observations imply that the properties separating excellent wines from the rest are not the same as those separating inferior ones.
So now let's turn to a higher-dimensional view. The idea here is that one variable is visualized by color so that the other two variables can be mapped onto 2D space. We might be able to find certain rules displayed by color on the 2D plane.
First, let’s try a higher dimensional version of scatter plot. Since the strongest factor to determine quality
we found so far is alcohol
, I’d like to map alcohol
and quality
on 2D axis and one independent variable into color. Each independent variable used for color is scaled in the following color scheme in its own data range after removing top 1 % as outliers.
showScatpltColored <- function(df, color, x='alcohol', y='quality',
lower=0.00, upper=0.99) {
ggplot(df, aes_string(x=x, y=y, color=color)) +
geom_point(size=1, alpha=0.5, position='jitter') +
xlim(quantile(df[, x], prob=lower),
quantile(df[, x], prob=upper)) +
scale_colour_gradientn(colours=c("blue", "violet", "red")) +
ggtitle(color) +
xlab(NULL) +
ylab(NULL) +
theme(legend.position="none")
}
scat_wc <- lapply(INDEPENDENT, FUN=function(var) showScatpltColored(ww, var))
do.call(grid.arrange, args=c(scat_wc, list(ncol=3)))
Since alcohol is mapped on the x-axis and quality on the y-axis, horizontal color distinction is related to alcohol and vertical color distinction is related to quality. So what we want to find is, in a sense, vertical distinction. Unfortunately, we cannot see vertical color distinction in any plot, although horizontal color distinction is observed in alcohol (which is obvious) and density (which is also expected because of its correlation with alcohol).
Next, let’s convert quality
into colors. Meaning, arbitrary two independent variables are mapped on 2D axis. Since I don’t know which combination might lead a better insight, I’m going to check all the combinations. What we want to find here is any chunck of color on 2D.
showScatpltColored2 <- function(df, x, y, color='quality.f2',
lower=0.00, upper=0.99){
ggplot(df, aes_string(x=x, y=y, color=color)) +
geom_point(alpha=0.3, position='jitter') +
xlim(quantile(df[, x], prob=lower),
quantile(df[, x], prob=upper)) +
ylim(quantile(df[, y], prob=lower),
quantile(df[, y], prob=upper))
}
scat_wc2 <- lapply(combn(INDEPENDENT, 2, simplify=F),
FUN=function(var) showScatpltColored2(ww, var[1], var[2]))
# Just check the number of plots
length(scat_wc2)
## [1] 55
After checking the 55 scatter plots covering all combinations of independent variables (11 choose 2), I noticed an interesting pattern in the scatter plot of density and residual.sugar. Let's zoom in on that plot.
# Colored by simplified quality level
scat.qf2 <- ggplot(ww, aes(x=density, y=residual.sugar, color=quality.f2)) +
geom_point(position='jitter') +
xlim(min(ww$density), quantile(ww$density, prob=0.99)) +
ylim(min(ww$residual.sugar), quantile(ww$residual.sugar, prob=0.99))
# Colored by original quality level
scat.qf <- ggplot(ww, aes(x=density, y=residual.sugar, color=quality.f)) +
geom_point() +
scale_color_brewer() +
xlim(min(ww$density), quantile(ww$density, prob=0.99)) +
ylim(min(ww$residual.sugar), quantile(ww$residual.sugar, prob=0.99))
do.call(grid.arrange, args=c(list(scat.qf2, scat.qf), list(ncol=2)))
The left one uses the simplified quality levels and the right one uses the original quality levels. In both plots, holding density constant, higher residual.sugar seems to go with better quality. Maybe the distinction is not deterministic, but it's recognizable in a sense. It seems hard to express the effect with a simple linear coefficient due to the wide spread and overlap of the data points.
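As a quick numeric check of that impression (my own addition, not in the original analysis), one could fit a small linear model with an interaction term and inspect how much variance this 2D pattern alone explains:
# Fit quality on density, residual.sugar and their interaction;
# the R-squared shows how much a purely linear version of this
# pattern would explain.
m_int <- lm(quality ~ density * residual.sugar, data=ww)
summary(m_int)$r.squared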
Based on the exploratory analysis in the previous section, there does not seem to be any simple linear relationship between quality and the physicochemical properties. If this observation is correct, a linear regression model would not perform well at predicting quality from the physicochemical properties. So let's start there to confirm it.
Before starting, I'd like to define all the formulas used in this section and a utility function to summarize model performance.
# Original numeric variable prediction
fml1 <- as.formula(paste("quality", "~",
paste(INDEPENDENT, collapse=' + ')))
# As a classification prediction
fml2 <- as.formula(paste("quality.f", "~",
paste(INDEPENDENT, collapse=' + ')))
# As a simpler classification ("bad", "normal", "good") prediction
fml3 <- as.formula(paste("quality.f2", "~",
paste(INDEPENDENT, collapse=' + ')))
# Extreme case detection for excellent ones
fml.e <- as.formula(paste("excellent", "~",
paste(INDEPENDENT, collapse=' + ')))
# Extreme case detection for inferior ones
fml.i <- as.formula(paste("inferior", "~",
paste(INDEPENDENT, collapse=' + ')))
describePerformance <- function(org, pred) {
cat("[Contingency Table]\n")
print(table(org, pred))
cat("\n")
cat("[Contingency Table by Proportion]\n")
print(round(prop.table(table(org, pred), 1), 3))
cat("\n")
cat("[Overall Accuracy]\n")
cat(sum(org == pred) / length(org) * 100, '%')
cat("\n\n")
cat("[Cohen's kappa]\n")
kp <- cohen.kappa(cbind(org, pred))
cat("Unweighted:", kp$kappa, '\n')
cat(" Weighted:", kp$weighted.kappa)
}
Since this is an extension of the exploratory analysis, all data is used for modeling (no validation or test set) and meticulous performance tuning is not performed. For linear regression, all physicochemical properties are used as independent variables first, then the variables whose p-values are not significant (> 0.05) are dropped one by one.
m1 <- lm(fml1, ww)
m2 <- update(m1, ~ . - citric.acid)
m3 <- update(m2, ~ . - chlorides)
m4 <- update(m3, ~ . - total.sulfur.dioxide)
mtable(m1, m2, m3, m4)
##
## Calls:
## m1: lm(formula = fml1, data = ww)
## m2: lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = ww)
## m3: lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar +
## free.sulfur.dioxide + total.sulfur.dioxide + density + pH +
## sulphates + alcohol, data = ww)
## m4: lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar +
## free.sulfur.dioxide + density + pH + sulphates + alcohol,
## data = ww)
##
## =====================================================================
## m1 m2 m3 m4
## ---------------------------------------------------------------------
## (Intercept) 150.193*** 149.901*** 151.256*** 154.106***
## (18.804) (18.760) (18.492) (18.100)
## fixed.acidity 0.066** 0.066** 0.068*** 0.068***
## (0.021) (0.021) (0.020) (0.020)
## volatile.acidity -1.863*** -1.868*** -1.872*** -1.888***
## (0.114) (0.112) (0.112) (0.110)
## citric.acid 0.022
## (0.096)
## residual.sugar 0.081*** 0.081*** 0.082*** 0.083***
## (0.008) (0.008) (0.007) (0.007)
## chlorides -0.247 -0.234
## (0.547) (0.543)
## free.sulfur.dioxide 0.004*** 0.004*** 0.004*** 0.003***
## (0.001) (0.001) (0.001) (0.001)
## total.sulfur.dioxide -0.000 -0.000 -0.000
## (0.000) (0.000) (0.000)
## density -150.284*** -149.987*** -151.398*** -154.291***
## (19.075) (19.029) (18.743) (18.344)
## pH 0.686*** 0.684*** 0.692*** 0.694***
## (0.105) (0.105) (0.103) (0.103)
## sulphates 0.631*** 0.632*** 0.634*** 0.629***
## (0.100) (0.100) (0.100) (0.100)
## alcohol 0.193*** 0.194*** 0.194*** 0.193***
## (0.024) (0.024) (0.024) (0.024)
## ---------------------------------------------------------------------
## R-squared 0.282 0.282 0.282 0.282
## adj. R-squared 0.280 0.280 0.281 0.281
## sigma 0.751 0.751 0.751 0.751
## F 174.344 191.810 213.138 239.730
## p 0.000 0.000 0.000 0.000
## Log-likelihood -5543.740 -5543.767 -5543.860 -5544.144
## Deviance 2758.329 2758.359 2758.463 2758.783
## AIC 11113.480 11111.534 11109.719 11108.288
## BIC 11197.936 11189.493 11181.182 11173.254
## N 4898 4898 4898 4898
## =====================================================================
As expected, it’s not performing so well. If you look at R-squared in the final model, it’s just 0.282 which means that only 28 % of variance of quality
is explained by those independent variables.
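As an aside (not used in the original analysis), the manual backward elimination above could be automated with AIC-based stepwise selection:
# Backward elimination driven by AIC instead of p-values
m_step <- step(lm(fml1, ww), direction="backward", trace=0)
formula(m_step)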
cat("RMSE:", sqrt(mean((ww$quality - m4$fitted.values)^2)))
## RMSE: 0.7504978
describePerformance(ww$quality, round(m4$fitted.values, 0))
## [Contingency Table]
## pred
## org 3 4 5 6 7
## 3 0 1 7 11 1
## 4 0 5 85 72 1
## 5 0 3 605 834 15
## 6 1 0 301 1717 179
## 7 0 0 17 647 216
## 8 0 0 1 112 62
## 9 0 0 0 1 4
##
## [Contingency Table by Proportion]
## pred
## org 3 4 5 6 7
## 3 0.000 0.050 0.350 0.550 0.050
## 4 0.000 0.031 0.521 0.442 0.006
## 5 0.000 0.002 0.415 0.572 0.010
## 6 0.000 0.000 0.137 0.781 0.081
## 7 0.000 0.000 0.019 0.735 0.245
## 8 0.000 0.000 0.006 0.640 0.354
## 9 0.000 0.000 0.000 0.200 0.800
##
## [Overall Accuracy]
## 51.91915 %
##
## [Cohen's kappa]
## Unweighted: 0.2114556
## Weighted: 0.4035553
The root mean square error (RMSE) is 0.75, meaning the deviation from the true quality value is 0.75 on average. This does not feel so bad; however, the model seems to struggle to predict cases that deviate from the mean quality. Since the number of deviant cases is small, this inability is not fairly reflected in the RMSE. A measurement more sensitive to this imbalance is Cohen's kappa, and it shows a low level of agreement.
This implies that there is no simple linear relationship between quality and physicochemical properties: when a linear regression model is applied to the data, the prediction is dragged toward the central value, which is the majority, and it's hard to predict quality as it moves away from the mean.
So let’s try some non-linear models. How does Support Vector Machine perform?
# Original numeric variable prediction
svm1 <- svm(fml1, ww)
cat("RMSE:", sqrt(mean((ww$quality - svm1$fitted)^2)))
## RMSE: 0.6327585
describePerformance(ww$quality, round(svm1$fitted, 0))
## [Contingency Table]
## pred
## org 4 5 6 7 8
## 3 4 9 7 0 0
## 4 23 103 37 0 0
## 5 0 955 493 9 0
## 6 0 312 1747 139 0
## 7 0 9 524 346 1
## 8 0 1 67 105 2
## 9 0 0 1 4 0
##
## [Contingency Table by Proportion]
## pred
## org 4 5 6 7 8
## 3 0.200 0.450 0.350 0.000 0.000
## 4 0.141 0.632 0.227 0.000 0.000
## 5 0.000 0.655 0.338 0.006 0.000
## 6 0.000 0.142 0.795 0.063 0.000
## 7 0.000 0.010 0.595 0.393 0.001
## 8 0.000 0.006 0.383 0.600 0.011
## 9 0.000 0.000 0.200 0.800 0.000
##
## [Overall Accuracy]
## 62.73989 %
##
## [Cohen's kappa]
## Unweighted: 0.4084
## Weighted: 0.6046272
Although it’s still not performing well for deviant cases, the performance is better than linear regression model. Especially there is a big improvement in Cohen’s kappa.
# As a classification prediction
svm2 <- svm(fml2, ww)
describePerformance(ww$quality.f, svm2$fitted)
## [Contingency Table]
## pred
## org 3 4 5 6 7 8 9
## 3 1 0 7 12 0 0 0
## 4 0 20 96 45 2 0 0
## 5 0 1 930 522 4 0 0
## 6 0 0 333 1793 72 0 0
## 7 0 0 21 584 275 0 0
## 8 0 0 1 119 55 0 0
## 9 0 0 0 3 2 0 0
##
## [Contingency Table by Proportion]
## pred
## org 3 4 5 6 7 8 9
## 3 0.050 0.000 0.350 0.600 0.000 0.000 0.000
## 4 0.000 0.123 0.589 0.276 0.012 0.000 0.000
## 5 0.000 0.001 0.638 0.358 0.003 0.000 0.000
## 6 0.000 0.000 0.152 0.816 0.033 0.000 0.000
## 7 0.000 0.000 0.024 0.664 0.312 0.000 0.000
## 8 0.000 0.000 0.006 0.680 0.314 0.000 0.000
## 9 0.000 0.000 0.000 0.600 0.400 0.000 0.000
##
## [Overall Accuracy]
## 61.6374 %
##
## [Cohen's kappa]
## Unweighted: 0.3797619
## Weighted: 0.5289101
When it’s converted to a classification prediction, the performance got worse. Maybe it’s because rounding off in regression model is working in a good way.
# As a simpler classification ("bad", "normal", "good") prediction
svm3 <- svm(fml3, ww)
describePerformance(ww$quality.f2, svm3$fitted)
## [Contingency Table]
## pred
## org bad normal good
## bad 1112 511 17
## normal 370 1661 167
## good 28 566 466
##
## [Contingency Table by Proportion]
## pred
## org bad normal good
## bad 0.678 0.312 0.010
## normal 0.168 0.756 0.076
## good 0.026 0.534 0.440
##
## [Overall Accuracy]
## 66.12903 %
##
## [Cohen's kappa]
## Unweighted: 0.4512158
## Weighted: 0.6147176
The simpler classification gives the model a slight improvement, but not much.
# Extreme case detection for excellent ones
ww$excellent <- as.factor(ww$excellent)
svm4 <- svm(fml.e, ww)
describePerformance(ww$excellent, svm4$fitted)
## [Contingency Table]
## pred
## org FALSE TRUE
## FALSE 4718 0
## TRUE 180 0
##
## [Contingency Table by Proportion]
## pred
## org FALSE TRUE
## FALSE 1 0
## TRUE 1 0
##
## [Overall Accuracy]
## 96.32503 %
##
## [Cohen's kappa]
## Unweighted: 0
## Weighted: 0
# Extreme case detection for inferior ones
ww$inferior <- as.factor(ww$inferior)
svm5 <- svm(fml.i, ww)
describePerformance(ww$inferior, svm5$fitted)
## [Contingency Table]
## pred
## org FALSE TRUE
## FALSE 4715 0
## TRUE 171 12
##
## [Contingency Table by Proportion]
## pred
## org FALSE TRUE
## FALSE 1.000 0.000
## TRUE 0.934 0.066
##
## [Overall Accuracy]
## 96.50878 %
##
## [Cohen's kappa]
## Unweighted: 0.1190258
## Weighted: 0.1190258
It seems to be very difficult for SVM to predict the extreme cases. This implies that there is no significant distinction for the extreme qualities, even for SVM's boundary scheme in high-dimensional space.
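One standard mitigation for this kind of class imbalance, not attempted here, is to weight the minority class; e1071's svm() accepts a class.weights argument. A hypothetical sketch (the weight of 25 roughly mirrors the FALSE/TRUE ratio and is my own guess):
# Penalize misclassification of the rare TRUE class more heavily
svm4w <- svm(fml.e, ww, class.weights=c("FALSE"=1, "TRUE"=25))
describePerformance(ww$excellent, svm4w$fitted)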
Lastly let’s see how Random Forest works.
# Original numeric variable prediction
set.seed(20160211)
rf1 <- randomForest(fml1, ww)
cat("RMSE:", sqrt(mean((ww$quality - rf1$predicted)^2)))
## RMSE: 0.5859156
describePerformance(ww$quality, round(rf1$predicted, 0))
## [Contingency Table]
## pred
## org 4 5 6 7 8
## 3 0 10 10 0 0
## 4 8 116 38 1 0
## 5 0 1010 436 11 0
## 6 0 232 1827 139 0
## 7 0 7 357 516 0
## 8 0 3 44 99 29
## 9 0 0 1 4 0
##
## [Contingency Table by Proportion]
## pred
## org 4 5 6 7 8
## 3 0.000 0.500 0.500 0.000 0.000
## 4 0.049 0.712 0.233 0.006 0.000
## 5 0.000 0.693 0.299 0.008 0.000
## 6 0.000 0.106 0.831 0.063 0.000
## 7 0.000 0.008 0.406 0.586 0.000
## 8 0.000 0.017 0.251 0.566 0.166
## 9 0.000 0.000 0.200 0.800 0.000
##
## [Overall Accuracy]
## 69.21192 %
##
## [Cohen's kappa]
## Unweighted: 0.5183606
## Weighted: 0.6755489
Even for Random Forest it is still hard to predict the deviant cases. Meanwhile, it shows better overall performance than SVM because the prediction accuracy for the central values (5, 6, 7) got better.
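A side benefit of Random Forest, not explored in the original write-up, is its built-in variable importance measure, which can cross-check the EDA findings:
# Importance (IncNodePurity) of each property in the regression forest
importance(rf1)
varImpPlot(rf1)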
# As a classification prediction
rf2 <- randomForest(fml2, ww)
describePerformance(ww$quality.f, rf2$predicted)
## [Contingency Table]
## pred
## org 3 4 5 6 7 8 9
## 3 0 0 8 12 0 0 0
## 4 0 44 79 38 2 0 0
## 5 0 8 1039 400 10 0 0
## 6 0 4 247 1837 108 2 0
## 7 0 0 15 334 527 4 0
## 8 0 0 2 50 43 80 0
## 9 0 0 0 3 2 0 0
##
## [Contingency Table by Proportion]
## pred
## org 3 4 5 6 7 8 9
## 3 0.000 0.000 0.400 0.600 0.000 0.000 0.000
## 4 0.000 0.270 0.485 0.233 0.012 0.000 0.000
## 5 0.000 0.005 0.713 0.275 0.007 0.000 0.000
## 6 0.000 0.002 0.112 0.836 0.049 0.001 0.000
## 7 0.000 0.000 0.017 0.380 0.599 0.005 0.000
## 8 0.000 0.000 0.011 0.286 0.246 0.457 0.000
## 9 0.000 0.000 0.000 0.600 0.400 0.000 0.000
##
## [Overall Accuracy]
## 72.00898 %
##
## [Cohen's kappa]
## Unweighted: 0.5654909
## Weighted: 0.701945
When it’s converted to a classification prediction, it outperforms the regression version. Here improvements in the prediction accuracy for quality level 4 and 8 are observed. It feels that the tree-based rule logic and randomness of the model enables to predict some deviant cases although it still cannot output quality level 3 and 9.
# As a simpler classification ("bad", "normal", "good") prediction
rf3 <- randomForest(fml3, ww)
describePerformance(ww$quality.f2, rf3$predicted)
## [Contingency Table]
## pred
## org bad normal good
## bad 1218 405 17
## normal 291 1743 164
## good 20 327 713
##
## [Contingency Table by Proportion]
## pred
## org bad normal good
## bad 0.743 0.247 0.010
## normal 0.132 0.793 0.075
## good 0.019 0.308 0.673
##
## [Overall Accuracy]
## 75.01021 %
##
## [Cohen's kappa]
## Unweighted: 0.6028426
## Weighted: 0.7315294
This is the best performance so far. The simplification of quality levels worked well for Random Forest.
# Extreme case detection for excellent ones
rf4 <- randomForest(fml.e, ww)
describePerformance(ww$excellent, rf4$predicted)
## [Contingency Table]
## pred
## org FALSE TRUE
## FALSE 4716 2
## TRUE 103 77
##
## [Contingency Table by Proportion]
## pred
## org FALSE TRUE
## FALSE 1.000 0.000
## TRUE 0.572 0.428
##
## [Overall Accuracy]
## 97.85627 %
##
## [Cohen's kappa]
## Unweighted: 0.5852975
## Weighted: 0.5852975
While SVM could not detect any excellent wines, Random Forest could detect about 43 % of them.
# Extreme case detection for inferior ones
rf5 <- randomForest(fml.i, ww)
describePerformance(ww$inferior, rf5$predicted)
## [Contingency Table]
## pred
## org FALSE TRUE
## FALSE 4703 12
## TRUE 142 41
##
## [Contingency Table by Proportion]
## pred
## org FALSE TRUE
## FALSE 0.997 0.003
## TRUE 0.776 0.224
##
## [Overall Accuracy]
## 96.85586 %
##
## [Cohen's kappa]
## Unweighted: 0.3363202
## Weighted: 0.3363202
This is slightly better than SVM; still, the detection rate for inferior wines is only about 22 %.
As a summary, I'm going to revisit the findings with plots.
ggplot(ww, aes(x=quality.f2, y=alcohol, fill=quality.f2)) +
geom_boxplot() +
stat_summary(fun.data=mean_cl_normal, geom="errorbar",
color='red', size=1) +
ylim(quantile(ww$alcohol, prob=0.00),
quantile(ww$alcohol, prob=0.99)) +
ggtitle("Range of Alcohol per Quality Level") +
xlab("Quality") +
ylab("Alcohol (%)") +
theme(title=element_text(size=18, face="bold"),
axis.title=element_text(size=14),
axis.text=element_text(size=12),
legend.position="none")
The 95 % confidence intervals are added as red lines; they have relatively narrow ranges because of the large number of data points. The plot shows that the means of the groups differ significantly at that confidence level. This can imply that a higher alcohol volume produces better quality wine, although the causality cannot be established from this data analysis alone. (It is unclear whether a high alcohol volume directly contributes to better quality.)
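The visual impression can also be backed by a formal test; a sketch using pairwise t-tests (my addition, not in the original analysis):
# Pairwise comparisons of mean alcohol across the three quality groups,
# with Bonferroni correction for multiple testing
pairwise.t.test(ww$alcohol, ww$quality.f2, p.adjust.method="bonferroni")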
# Compute mean of distribution per level of cls
mu <- summarise(group_by(ww, quality.f2), mean(density))
colnames(mu)[2] <- "density"
ggplot(ww, aes(x=density, fill=quality.f2)) +
geom_density(alpha=0.3) +
geom_vline(aes(xintercept=density, color=quality.f2),
data=mu, size=1, show.legend=F) +
xlim(quantile(ww$density, prob=0.00),
quantile(ww$density, prob=0.99)) +
guides(fill=guide_legend(title="Quality", reverse=T)) +
ggtitle("Distribution of Density per Quality Level") +
xlab("Density (g/ml)") +
ylab("") +
theme(title=element_text(size=18, face="bold"),
axis.title=element_text(size=14),
axis.text=element_text(size=12),
legend.text=element_text(size=12))
This is a density plot of density. As quality goes up, the center of the distribution of density gets smaller. Since density has a high correlation coefficient with alcohol, this is expected. A lower density possibly produces better quality wine, although the causality is not deterministic here either. (This might simply be due to the negative correlation (-0.78) between alcohol and density.)
ggplot(ww, aes(x=density, y=residual.sugar, color=quality.f2)) +
geom_point(position='jitter') +
geom_smooth(aes(color=quality.f2), method=lm) +
xlim(min(ww$density),
quantile(ww$density, prob=0.99)) +
ylim(min(ww$residual.sugar),
quantile(ww$residual.sugar, prob=0.99)) +
guides(color=guide_legend(title="Quality", reverse=T)) +
ggtitle("Quality on Density and Residual Suagr Dimension") +
xlab("Density (g/mL)") +
ylab("Residual Sugar (g/L)") +
theme(title=element_text(size=18, face="bold"),
axis.title=element_text(size=14),
axis.text=element_text(size=12),
legend.text=element_text(size=12))
A regression line is added to each quality group with a shaded confidence interval. When density is low, the distinction among the quality levels is clear, but as density increases the margin gets narrower, and at the high end the lines cross. This means that the distinction in this 2D space is only applicable where density is low to middle.
Through exploratory data analysis, I found that alcohol, density and residual.sugar are possible factors that affect the quality level in an interpretable manner. Our sense of taste is quite complex, and the modeling results imply that accurate prediction of wine quality requires more complexity than can be interpreted through visualization in a low-dimensional space.
When I started this project, I thought it would not be difficult to find hidden relationships between quality and the physicochemical properties, because the data set has only eleven independent variables and all of them are clean. But I soon noticed I was wrong after performing some visualizations with arbitrary variable choices. What I came up with after the struggle was a brute-force method that plots all combinations of variables. This way I could efficiently check whether there is any interesting pattern in different types of 2D spaces. By systematizing repeated plotting through concise function calls, I could also minimize the amount of code. The plot in Section 6.3 is one that I found by this brute-force method and that would have been difficult to find otherwise.
I came up with new quality level representations to make visualization more informative. Those are transformations of the dependent variable, and they worked. But I did not utilize transformations of the independent variables or feature engineering. Based on the observations in the modeling section, there is a tendency for more complex non-linear models to perform better. But the number of models applied in this project is limited, and model tuning was not performed. So those modeling-related tasks are left for a future project.
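As a possible starting point for that future tuning work, e1071 ships a simple grid-search helper; a hypothetical sketch (the parameter grids are arbitrary choices of mine):
# Grid search over SVM hyperparameters for the simplified classification
tune_res <- tune.svm(fml3, data=ww, gamma=10^(-2:0), cost=10^(0:2))
summary(tune_res)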