The purpose of this project is to determine what physicochemical properties affect white wine quality through exploratory data analysis.
The Wine Quality dataset is downloaded from the following website.
[Website] http://www3.dsi.uminho.pt/pcortez/wine/
[Data Link] http://www3.dsi.uminho.pt/pcortez/wine/winequality.zip
[Citation] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
The zip file contains two csv files, one for red wines (winequality-red.csv) and one for white wines (winequality-white.csv). In this project only the white wine data is used.
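As a minimal sketch (my addition, not part of the original write-up), the data could be fetched and extracted from within R, assuming the URL above is still live; adjust paths if the zip's internal layout differs from what the read.table() call below expects.
# Hypothetical fetch-and-extract step
download.file("http://www3.dsi.uminho.pt/pcortez/wine/winequality.zip",
              destfile="winequality.zip")
unzip("winequality.zip", exdir="./input")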
The white wine csv file is loaded into R as follows.
# Load csv file
ww <- read.table("./input/winequality-white.csv", header=T, sep=";")
str(ww)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The first eleven variables are physicochemical properties acquired from a computerized system (iLab), and the last variable quality is a score between 0 (very bad) and 10 (excellent) based on sensory data. Each sample was evaluated by a minimum of three sensory assessors (using blind tastes), and the final sensory score is the median of these evaluations.
For the sake of convenience in the analysis, the original variable names and the independent variable names (physicochemical properties) are defined.
# Define original variable names
ORIGINAL <- colnames(ww)
# Define independent variable names
INDEPENDENT <- colnames(ww)[1:11]
To enable looking at quality as a classification target, a factor variable converted from the numerical one is added.
ww$quality.f <- as.factor(ww$quality)
summary(ww$quality.f)
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
The following libraries are used in this project.
# For visualization
library(ggplot2)
library(gridExtra)
library(corrplot)
library(dplyr)
# For modeling
library(psych)
library(memisc)
library(nnet)
library(rpart)
library(e1071)
library(randomForest)
summary(ww)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
##
## quality quality.f
## Min. :3.000 3: 20
## 1st Qu.:5.000 4: 163
## Median :6.000 5:1457
## Mean :5.878 6:2198
## 3rd Qu.:6.000 7: 880
## Max. :9.000 8: 175
## 9: 5
All variables except residual.sugar have very similar medians and means. This suggests symmetrical distributions. Also, looking at the 1st and 3rd quartiles, they feel relatively close to the center values. So what do the distributions look like? Blue lines show the mean and red lines show the median of each distribution; dashed lines are the 1st and 3rd quartiles.
dens <- lapply(ORIGINAL, FUN=function(var) {
ggplot(ww, aes_string(x=var)) +
geom_density(fill='gray') +
geom_vline(aes(xintercept=mean(ww[, var])), color='blue', size=1) +
geom_vline(aes(xintercept=median(ww[, var])), color='red', size=1) +
geom_vline(aes(xintercept=quantile(ww[, var], 0.25)),
linetype='dashed', size=0.5) +
geom_vline(aes(xintercept=quantile(ww[, var], 0.75)),
linetype='dashed', size=0.5)
})
do.call(grid.arrange, args=c(dens, list(ncol=3)))
As expected, most of them, except residual.sugar and alcohol, have symmetrical distributions, although some have outliers on the positive side. The dependent variable quality is a little exceptional since it's a discrete numeric variable.
To get a better sense of the distributions, let's standardize them and take a look.
ww_scaled <- data.frame(scale(ww[, ORIGINAL]))
summary(ww_scaled)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :-3.61998 Min. :-1.9668 Min. :-2.7615 Min. :-1.1418
## 1st Qu.:-0.65743 1st Qu.:-0.6770 1st Qu.:-0.5304 1st Qu.:-0.9250
## Median :-0.06492 Median :-0.1810 Median :-0.1173 Median :-0.2349
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.52758 3rd Qu.: 0.4143 3rd Qu.: 0.4612 3rd Qu.: 0.6917
## Max. : 8.70422 Max. : 8.1528 Max. :10.9553 Max. :11.7129
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :-1.6831 Min. :-1.95848 Min. :-3.0439
## 1st Qu.:-0.4473 1st Qu.:-0.72370 1st Qu.:-0.7144
## Median :-0.1269 Median :-0.07691 Median :-0.1026
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.1935 3rd Qu.: 0.62867 3rd Qu.: 0.6739
## Max. :13.7417 Max. :14.91679 Max. : 7.0977
## density pH sulphates
## Min. :-2.31280 Min. :-3.10109 Min. :-2.3645
## 1st Qu.:-0.77063 1st Qu.:-0.65077 1st Qu.:-0.6996
## Median :-0.09608 Median :-0.05475 Median :-0.1739
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.69298 3rd Qu.: 0.60750 3rd Qu.: 0.5271
## Max. :15.02976 Max. : 4.18365 Max. : 5.1711
## alcohol quality
## Min. :-2.04309 Min. :-3.2495
## 1st Qu.:-0.82419 1st Qu.:-0.9913
## Median :-0.09285 Median : 0.1379
## Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.71974 3rd Qu.: 0.1379
## Max. : 2.99502 Max. : 3.5252
Since the 1st and 3rd quartiles of the standard normal distribution are -0.674 and 0.674 respectively, some of them look really close to it. So let's check normality with Q-Q plots.
par(mfrow=c(4,3), mar=c(2,2,2,2))
dummy <- lapply(ORIGINAL, FUN=function(var) {
qqnorm(ww_scaled[, var], main=var)
qqline(ww_scaled[, var])
})
Roughly speaking, all of them follow a normal distribution in the range between -2 and +2 standard deviations. (Again, quality is exceptional since it's a discrete distribution.) But close to the positive end, outliers are observed. Only alcohol has a narrower distribution on the positive side compared to the normal distribution. On the other hand, on the negative side most of them have narrower distributions than the normal distribution. These observations imply that the distributions of the independent variables are close to normal at the center, but outliers on the positive side add a touch of right skewness.
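To put a number on that right skewness, one option (my addition, not part of the original analysis) is the sample skewness from the e1071 package, which is loaded above.
# Sample skewness per physicochemical property; positive values
# indicate right-skewed distributions.
round(sapply(ww[, INDEPENDENT], e1071::skewness), 2)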
To take an overview of the relationships between pairs of variables, let's start with a correlation matrix.
corrplot.mixed(cor(ww[, ORIGINAL]))
In terms of relationships between quality and the independent variables, alcohol has the strongest correlation with quality, 0.44. The second is density at -0.31, though that seems natural since the correlation between density and alcohol is -0.78. (density does not seem to provide any new information about quality since it's already explained by alcohol.)
In terms of relationships between independent variables, some strong correlations are observed:
residual.sugar - density
free.sulfur.dioxide - total.sulfur.dioxide
total.sulfur.dioxide - density
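These pairs can also be pulled out programmatically; a small sketch (the 0.5 cutoff is my own arbitrary choice, not from the original analysis):
# List variable pairs whose absolute correlation exceeds 0.5
cm <- cor(ww[, INDEPENDENT])
cm[upper.tri(cm, diag=TRUE)] <- NA   # keep each pair only once
idx <- which(abs(cm) > 0.5, arr.ind=TRUE)
data.frame(var1=rownames(cm)[idx[, 1]],
           var2=colnames(cm)[idx[, 2]],
           r=round(cm[idx], 2))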
To see the relationships between quality and the independent variables more closely, let's create boxplots that describe each independent variable per quality level. Here I remove outliers on the positive side to visualize the majority more clearly.
showBoxplt <- function(df, y, x='quality.f', lower=0, upper=.99) {
ggplot(df, aes_string(x=x, y=y, fill=x)) +
geom_boxplot() +
ylim(quantile(df[, y], prob=lower),
quantile(df[, y], prob=upper)) +
theme(legend.position="none")
}
boxp1 <- lapply(INDEPENDENT, FUN=function(var) showBoxplt(ww, var))
do.call(grid.arrange, args=c(boxp1, list(ncol=3)))
By looking at the center lines and positions of the boxes, we can see the correlations in alcohol and density that we found in the correlation matrix. As quality goes up, alcohol also goes up and density goes down. Meanwhile, the others look almost flat.
We can see a loose trend in alcohol and density. But the others, for example pH and sulphates shown above, look flat with no distinction among quality levels. Remember, quality 9, which looks a little different, has just five data points.
Since it’s hard to find significant factor at this point, let’s create a new variable which has fewer quality levels and see if it can provide new insights.
# [Quality conversion 1]
# Create a simpler classification of quality.
# 3, 4, 5 -> bad
# 6 -> normal
# 7, 8, 9 -> good
ww$quality.f2 <- ifelse(ww$quality == 3 |
ww$quality == 4 |
ww$quality == 5,
"bad",
ifelse(ww$quality == 6,
"normal",
"good"))
ww$quality.f2 <- factor(ww$quality.f2, levels=c("bad", "normal", "good"))
summary(ww[, c('quality.f', 'quality.f2')])
## quality.f quality.f2
## 3: 20 bad :1640
## 4: 163 normal:2198
## 5:1457 good :1060
## 6:2198
## 7: 880
## 8: 175
## 9: 5
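As an aside (my own note, not in the original code), the same conversion can be written more compactly with cut(), which is equivalent here because quality only takes values 3 through 9.
# Intervals: (2,5] -> bad, (5,6] -> normal, (6,9] -> good
ww$quality.f2 <- cut(ww$quality, breaks=c(2, 5, 6, 9),
                     labels=c("bad", "normal", "good"))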
So how do the boxplots look now?
boxp2 <- lapply(INDEPENDENT,
FUN=function(var) showBoxplt(ww, var, 'quality.f2'))
do.call(grid.arrange, args=c(boxp2, list(ncol=3)))
While the differences among quality levels in density and alcohol got clearer, the others got flatter. This suggests that the other physicochemical properties do not really matter for quality.
The same zoomed-in views go as follows: density and alcohol show a nice hierarchical distinction, while the others got flatter.
To take a look at it from a different angle, let’s create density plots. Vertical lines show the mean of each distribution.
showDensplt <- function(df, x, cls='quality.f2', lower=0, upper=.99) {
# Compute mean of distribution per level of cls
mu <- summarise_(group_by_(df, cls), paste0('mean(', x, ')'))
colnames(mu)[2] <- x
ggplot(df, aes_string(x=x, fill=cls)) +
geom_density(alpha=0.3) +
geom_vline(aes_string(xintercept=x, color=cls), data=mu, size=1) +
xlim(quantile(df[, x], prob=lower),
quantile(df[, x], prob=upper)) +
theme(legend.position="none")
}
dens <- lapply(INDEPENDENT, FUN=function(col) showDensplt(ww, col))
do.call(grid.arrange, args=c(dens, list(ncol=3)))
The trends and flatness observed in the boxplots appear as follows in the density plots. The hierarchical distinction observed in the previous boxplots shows up as a wider spread of the vertical mean lines, while for the flat variables the distributions and means are very similar among quality levels.
Since it’s still hard to find any significance other than alcohol
and density
, let’s try simpler quality classifications for extreme qualities.
# [Quality conversion 2]
# Create factor variables for extreme qualities.
# excellent: 9,8 -> True
# inferior: 3,4 -> True
ww$excellent <- ifelse(ww$quality == 8 | ww$quality == 9, T, F)
ww$inferior <- ifelse(ww$quality == 3 | ww$quality == 4, T, F)
summary(ww[, c('excellent', 'inferior')])
## excellent inferior
## Mode :logical Mode :logical
## FALSE:4718 FALSE:4715
## TRUE :180 TRUE :183
## NA's :0 NA's :0
So can we see any significant independent variable in the extreme cases? First, take a look at the excellent wines.
boxp3 <- lapply(INDEPENDENT,
FUN=function(var) showBoxplt(ww, y=var, x='excellent'))
do.call(grid.arrange, args=c(boxp3, list(ncol=3)))
While the gap in alcohol and density got more significant, not so much difference can be observed in the others. Maybe pH of excellent=TRUE is slightly bigger than that of excellent=FALSE.
How about inferior wines?
boxp4 <- lapply(INDEPENDENT,
FUN=function(var) showBoxplt(ww, y=var, x='inferior'))
do.call(grid.arrange, args=c(boxp4, list(ncol=3)))
What is interesting is that alcohol does not have a significant difference when it is observed from the inferior viewpoint. Also, the difference in free.sulfur.dioxide seems to be significant.
Those observations imply that the properties separating excellent wines from the rest are not the same as those separating inferior ones.
So now let's turn to a higher-dimensional view. The idea here is that one variable is visualized by color so that the other two variables can be mapped onto 2D space. We might be able to find certain rules displayed by color on the 2D plane.
First, let’s try a higher dimensional version of scatter plot. Since the strongest factor to determine quality
we found so far is alcohol
, I’d like to map alcohol
and quality
on 2D axis and one independent variable into color. Each independent variable used for color is scaled in the following color scheme in its own data range after removing top 1 % as outliers.
showScatpltColored <- function(df, color, x='alcohol', y='quality',
lower=0.00, upper=0.99) {
ggplot(df, aes_string(x=x, y=y, color=color)) +
geom_point(size=1, alpha=0.5, position='jitter') +
xlim(quantile(df[, x], prob=lower),
quantile(df[, x], prob=upper)) +
scale_colour_gradientn(colours=c("blue", "violet", "red")) +
ggtitle(color) +
xlab(NULL) +
ylab(NULL) +
theme(legend.position="none")
}
scat_wc <- lapply(INDEPENDENT, FUN=function(var) showScatpltColored(ww, var))
do.call(grid.arrange, args=c(scat_wc, list(ncol=3)))
Since alcohol is mapped on the x-axis and quality on the y-axis, horizontal color distinction is related to alcohol and vertical color distinction is related to quality. So what we want to find is, in a sense, vertical distinction. Unfortunately, we cannot see vertical color distinction in any plot, although horizontal color distinction is observed in alcohol (which is obvious) and density (which is also expected because of its correlation with alcohol).
Next, let’s convert quality
into colors. Meaning, arbitrary two independent variables are mapped on 2D axis. Since I don’t know which combination might lead a better insight, I’m going to check all the combinations. What we want to find here is any chunck of color on 2D.
showScatpltColored2 <- function(df, x, y, color='quality.f2',
lower=0.00, upper=0.99){
ggplot(df, aes_string(x=x, y=y, color=color)) +
geom_point(alpha=0.3, position='jitter') +
xlim(quantile(df[, x], prob=lower),
quantile(df[, x], prob=upper)) +
ylim(quantile(df[, y], prob=lower),
quantile(df[, y], prob=upper))
}
scat_wc2 <- lapply(combn(INDEPENDENT, 2, simplify=F),
FUN=function(var) showScatpltColored2(ww, var[1], var[2]))
# Just check the number of plots
length(scat_wc2)
## [1] 55
After checking the 55 scatter plots covering all combinations of independent variables (11 choose 2), I noticed an interesting pattern in the scatter plot of density and residual.sugar. Let's zoom in on that plot.
# Colored by simplified quality level
scat.qf2 <- ggplot(ww, aes(x=density, y=residual.sugar, color=quality.f2)) +
geom_point(position='jitter') +
xlim(min(ww$density), quantile(ww$density, prob=0.99)) +
ylim(min(ww$residual.sugar), quantile(ww$residual.sugar, prob=0.99))
# Colored by original quality level
scat.qf <- ggplot(ww, aes(x=density, y=residual.sugar, color=quality.f)) +
geom_point() +
scale_color_brewer() +
xlim(min(ww$density), quantile(ww$density, prob=0.99)) +
ylim(min(ww$residual.sugar), quantile(ww$residual.sugar, prob=0.99))
do.call(grid.arrange, args=c(list(scat.qf2, scat.qf), list(ncol=2)))
The left one uses the simplified quality levels and the right one uses the original quality levels. In both plots, holding density constant, higher residual.sugar seems to go with better quality. Maybe the distinction is not deterministic, but it's recognizable in a sense. It seems hard to express the effect with a simple linear coefficient due to the wide spread and overlap of the data points.
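As a quick numeric check of that impression (my own addition, not in the original analysis), one could fit a small linear model with an interaction term and inspect how much variance this 2D pattern alone explains:
# Fit quality on density, residual.sugar and their interaction;
# the R-squared shows how much a purely linear version of this
# pattern would explain.
m_int <- lm(quality ~ density * residual.sugar, data=ww)
summary(m_int)$r.squared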
Based on the exploratory analysis in the previous section, there does not seem to be any simple linear relationship between quality and the physicochemical properties. If this observation is correct, a linear regression model would not perform well at predicting quality from the physicochemical properties. So let's start there to confirm it.
Before starting, I'd like to define all the formulas used in this section and a utility function to summarize model performance.
# Original numeric variable prediction
fml1 <- as.formula(paste("quality", "~",
paste(INDEPENDENT, collapse=' + ')))
# As a classification prediction
fml2 <- as.formula(paste("quality.f", "~",
paste(INDEPENDENT, collapse=' + ')))
# As a simpler classification ("bad", "normal", "good") prediction
fml3 <- as.formula(paste("quality.f2", "~",
paste(INDEPENDENT, collapse=' + ')))
# Extreme case detection for excellent ones
fml.e <- as.formula(paste("excellent", "~",
paste(INDEPENDENT, collapse=' + ')))
# Extreme case detection for inferior ones
fml.i <- as.formula(paste("inferior", "~",
paste(INDEPENDENT, collapse=' + ')))
describePerformance <- function(org, pred) {
cat("[Contingency Table]\n")
print(table(org, pred))
cat("\n")
cat("[Contingency Table by Proportion]\n")
print(round(prop.table(table(org, pred), 1), 3))
cat("\n")
cat("[Overall Accuracy]\n")
cat(sum(org == pred) / length(org) * 100, '%')
cat("\n\n")
cat("[Cohen's kappa]\n")
kp <- cohen.kappa(cbind(org, pred))
cat("Unweighted:", kp$kappa, '\n')
cat(" Weighted:", kp$weighted.kappa)
}
Since this is an extension of the exploratory analysis, all data is used for modeling (no validation or test set) and meticulous performance tuning is not performed. For linear regression, all physicochemical properties are used as independent variables first, then the variables whose p-values are not significant (> 0.05) are dropped one by one.
m1 <- lm(fml1, ww)
m2 <- update(m1, ~ . - citric.acid)
m3 <- update(m2, ~ . - chlorides)
m4 <- update(m3, ~ . - total.sulfur.dioxide)
mtable(m1, m2, m3, m4)
##
## Calls:
## m1: lm(formula = fml1, data = ww)
## m2: lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = ww)
## m3: lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar +
## free.sulfur.dioxide + total.sulfur.dioxide + density + pH +
## sulphates + alcohol, data = ww)
## m4: lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar +
## free.sulfur.dioxide + density + pH + sulphates + alcohol,
## data = ww)
##
## =====================================================================
## m1 m2 m3 m4
## ---------------------------------------------------------------------
## (Intercept) 150.193*** 149.901*** 151.256*** 154.106***
## (18.804) (18.760) (18.492) (18.100)
## fixed.acidity 0.066** 0.066** 0.068*** 0.068***
## (0.021) (0.021) (0.020) (0.020)
## volatile.acidity -1.863*** -1.868*** -1.872*** -1.888***
## (0.114) (0.112) (0.112) (0.110)
## citric.acid 0.022
## (0.096)
## residual.sugar 0.081*** 0.081*** 0.082*** 0.083***
## (0.008) (0.008) (0.007) (0.007)
## chlorides -0.247 -0.234
## (0.547) (0.543)
## free.sulfur.dioxide 0.004*** 0.004*** 0.004*** 0.003***
## (0.001) (0.001) (0.001) (0.001)
## total.sulfur.dioxide -0.000 -0.000 -0.000
## (0.000) (0.000) (0.000)
## density -150.284*** -149.987*** -151.398*** -154.291***
## (19.075) (19.029) (18.743) (18.344)
## pH 0.686*** 0.684*** 0.692*** 0.694***
## (0.105) (0.105) (0.103) (0.103)
## sulphates 0.631*** 0.632*** 0.634*** 0.629***
## (0.100) (0.100) (0.100) (0.100)
## alcohol 0.193*** 0.194*** 0.194*** 0.193***
## (0.024) (0.024) (0.024) (0.024)
## ---------------------------------------------------------------------
## R-squared 0.282 0.282 0.282 0.282
## adj. R-squared 0.280 0.280 0.281 0.281
## sigma 0.751 0.751 0.751 0.751
## F 174.344 191.810 213.138 239.730
## p 0.000 0.000 0.000 0.000
## Log-likelihood -5543.740 -5543.767 -5543.860 -5544.144
## Deviance 2758.329 2758.359 2758.463 2758.783
## AIC 11113.480 11111.534 11109.719 11108.288
## BIC 11197.936 11189.493 11181.182 11173.254
## N 4898 4898 4898 4898
## =====================================================================
As expected, it’s not performing so well. If you look at R-squared in the final model, it’s just 0.282 which means that only 28 % of variance of quality
is explained by those independent variables.
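As an aside (not used in the original analysis), the manual backward elimination above could be automated with AIC-based stepwise selection:
# Backward elimination driven by AIC instead of p-values
m_step <- step(lm(fml1, ww), direction="backward", trace=0)
formula(m_step)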
cat("RMSE:", sqrt(mean((ww$quality - m4$fitted.values)^2)))
## RMSE: 0.7504978
describePerformance(ww$quality, round(m4$fitted.values, 0))
## [Contingency Table]
## pred
## org 3 4 5 6 7
## 3 0 1 7 11 1
## 4 0 5 85 72 1
## 5 0 3 605 834 15
## 6 1 0 301 1717 179
## 7 0 0 17 647 216
## 8 0 0 1 112 62
## 9 0 0 0 1 4
##
## [Contingency Table by Proportion]
## pred
## org 3 4 5 6 7
## 3 0.000 0.050 0.350 0.550 0.050
## 4 0.000 0.031 0.521 0.442 0.006
## 5 0.000 0.002 0.415 0.572 0.010
## 6 0.000 0.000 0.137 0.781 0.081
## 7 0.000 0.000 0.019 0.735 0.245
## 8 0.000 0.000 0.006 0.640 0.354
## 9 0.000 0.000 0.000 0.200 0.800
##
## [Overall Accuracy]
## 51.91915 %
##
## [Cohen's kappa]
## Unweighted: 0.2114556
## Weighted: 0.4035553
The root mean square error (RMSE) is 0.75, meaning the deviation from the true quality value is 0.75 on average. This does not feel so bad; however, the model seems to struggle to predict cases that deviate from the mean quality. Since the number of deviant cases is small, this inability is not fairly reflected in the RMSE. A measurement more sensitive to this imbalance is Cohen's kappa, and it shows a low level of agreement.
This implies that there is no simple linear relationship between quality and physicochemical properties: when a linear regression model is applied to the data, the prediction is dragged toward the central value, which is the majority, and it's hard to predict quality as it moves away from the mean.
So let’s try some non-linear models. How does Support Vector Machine perform?
# Original numeric variable prediction
svm1 <- svm(fml1, ww)
cat("RMSE:", sqrt(mean((ww$quality - svm1$fitted)^2)))
## RMSE: 0.6327585
describePerformance(ww$quality, round(svm1$fitted, 0))
## [Contingency Table]
## pred
## org 4 5 6 7 8
## 3 4 9 7 0 0
## 4 23 103 37 0 0
## 5 0 955 493 9 0
## 6 0 312 1747 139 0
## 7 0 9 524 346 1
## 8 0 1 67 105 2
## 9 0 0 1 4 0
##
## [Contingency Table by Proportion]
## pred
## org 4 5 6 7 8
## 3 0.200 0.450 0.350 0.000 0.000
## 4 0.141 0.632 0.227 0.000 0.000
## 5 0.000 0.655 0.338 0.006 0.000
## 6 0.000 0.142 0.795 0.063 0.000
## 7 0.000 0.010 0.595 0.393 0.001
## 8 0.000 0.006 0.383 0.600 0.011
## 9 0.000 0.000 0.200 0.800 0.000
##
## [Overall Accuracy]
## 62.73989 %
##
## [Cohen's kappa]
## Unweighted: 0.4084
## Weighted: 0.6046272
Although it’s still not performing well for deviant cases, the performance is better than linear regression model. Especially there is a big improvement in Cohen’s kappa.
# As a classification prediction
svm2 <- svm(fml2, ww)
describePerformance(ww$quality.f, svm2$fitted)
## [Contingency Table]
## pred
## org 3 4 5 6 7 8 9
## 3 1 0 7 12 0 0 0
## 4 0 20 96 45 2 0 0
## 5 0 1 930 522 4 0 0
## 6 0 0 333 1793 72 0 0
## 7 0 0 21 584 275 0 0
## 8 0 0 1 119 55 0 0
## 9 0 0 0 3 2 0 0
##
## [Contingency Table by Proportion]
## pred
## org 3 4 5 6 7 8 9
## 3 0.050 0.000 0.350 0.600 0.000 0.000 0.000
## 4 0.000 0.123 0.589 0.276 0.012 0.000 0.000
## 5 0.000 0.001 0.638 0.358 0.003 0.000 0.000
## 6 0.000 0.000 0.152 0.816 0.033 0.000 0.000
## 7 0.000 0.000 0.024 0.664 0.312 0.000 0.000
## 8 0.000 0.000 0.006 0.680 0.314 0.000 0.000
## 9 0.000 0.000 0.000 0.600 0.400 0.000 0.000
##
## [Overall Accuracy]
## 61.6374 %
##
## [Cohen's kappa]
## Unweighted: 0.3797619
## Weighted: 0.5289101
When it’s converted to a classification prediction, the performance got worse. Maybe it’s because rounding off in regression model is working in a good way.
# As a simpler classification ("bad", "normal", "good") prediction
svm3 <- svm(fml3, ww)
describePerformance(ww$quality.f2, svm3$fitted)
## [Contingency Table]
## pred
## org bad normal good
## bad 1112 511 17
## normal 370 1661 167
## good 28 566 466
##
## [Contingency Table by Proportion]
## pred
## org bad normal good
## bad 0.678 0.312 0.010
## normal 0.168 0.756 0.076
## good 0.026 0.534 0.440
##
## [Overall Accuracy]
## 66.12903 %
##
## [Cohen's kappa]
## Unweighted: 0.4512158
## Weighted: 0.6147176
The simpler classification gives the model a slight improvement, but not much.
# Extreme case detection for excellent ones
ww$excellent <- as.factor(ww$excellent)
svm4 <- svm(fml.e, ww)
describePerformance(ww$excellent, svm4$fitted)
## [Contingency Table]
## pred
## org FALSE TRUE
## FALSE 4718 0
## TRUE 180 0
##
## [Contingency Table by Proportion]
## pred
## org FALSE TRUE
## FALSE 1 0
## TRUE 1 0
##
## [Overall Accuracy]
## 96.32503 %
##
## [Cohen's kappa]
## Unweighted: 0
## Weighted: 0
# Extreme case detection for inferior ones
ww$inferior <- as.factor(ww$inferior)
svm5 <- svm(fml.i, ww)
describePerformance(ww$inferior, svm5$fitted)
## [Contingency Table]
## pred
## org FALSE TRUE
## FALSE 4715 0
## TRUE 171 12
##
## [Contingency Table by Proportion]
## pred
## org FALSE TRUE
## FALSE 1.000 0.000
## TRUE 0.934 0.066
##
## [Overall Accuracy]
## 96.50878 %
##
## [Cohen's kappa]
## Unweighted: 0.1190258
## Weighted: 0.1190258
It seems to be very difficult for SVM to predict the extreme cases. This implies that there is no significant distinction for the extreme qualities, even for SVM's boundary scheme in high-dimensional space.
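One standard mitigation for this kind of class imbalance, not attempted here, is to weight the minority class; e1071's svm() accepts a class.weights argument. A hypothetical sketch (the weight of 25 roughly mirrors the FALSE/TRUE ratio and is my own guess):
# Penalize misclassification of the rare TRUE class more heavily
svm4w <- svm(fml.e, ww, class.weights=c("FALSE"=1, "TRUE"=25))
describePerformance(ww$excellent, svm4w$fitted)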
Lastly let’s see how Random Forest works.
# Original numeric variable prediction
set.seed(20160211)
rf1 <- randomForest(fml1, ww)
cat("RMSE:", sqrt(mean((ww$quality - rf1$predicted)^2)))
## RMSE: 0.5859156
describePerformance(ww$quality, round(rf1$predicted, 0))
## [Contingency Table]
## pred
## org 4 5 6 7 8
## 3 0 10 10 0 0
## 4 8 116 38 1 0
## 5 0 1010 436 11 0
## 6 0 232 1827 139 0
## 7 0 7 357 516 0
## 8 0 3 44 99 29
## 9 0 0 1 4 0
##
## [Contingency Table by Proportion]
## pred
## org 4 5 6 7 8
## 3 0.000 0.500 0.500 0.000 0.000
## 4 0.049 0.712 0.233 0.006 0.000
## 5 0.000 0.693 0.299 0.008 0.000
## 6 0.000 0.106 0.831 0.063 0.000
## 7 0.000 0.008 0.406 0.586 0.000
## 8 0.000 0.017 0.251 0.566 0.166
## 9 0.000 0.000 0.200 0.800 0.000
##
## [Overall Accuracy]
## 69.21192 %
##
## [Cohen's kappa]
## Unweighted: 0.5183606
## Weighted: 0.6755489
Even for Random Forest it is still hard to predict the deviant cases. Meanwhile, it shows better overall performance than SVM because the prediction accuracy for the central values (5, 6, 7) got better.
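A side benefit of Random Forest, not explored in the original write-up, is its built-in variable importance measure, which can cross-check the EDA findings:
# Importance (IncNodePurity) of each property in the regression forest
importance(rf1)
varImpPlot(rf1)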
# As a classification prediction
rf2 <- randomForest(fml2, ww)
describePerformance(ww$quality.f, rf2$predicted)
## [Contingency Table]
## pred
## org 3 4 5 6 7 8 9
## 3 0 0 8 12 0 0 0
## 4 0 44 79 38 2 0 0
## 5 0 8 1039 400 10 0 0
## 6 0 4 247 1837 108 2 0
## 7 0 0 15 334 527 4 0
## 8 0 0 2 50 43 80 0
## 9 0 0 0 3 2 0 0
##
## [Contingency Table by Proportion]
## pred
## org 3 4 5 6 7 8 9
## 3 0.000 0.000 0.400 0.600 0.000 0.000 0.000
## 4 0.000 0.270 0.485 0.233 0.012 0.000 0.000
## 5 0.000 0.005 0.713 0.275 0.007 0.000 0.000
## 6 0.000 0.002 0.112 0.836 0.049 0.001 0.000
## 7 0.000 0.000 0.017 0.380 0.599 0.005 0.000
## 8 0.000 0.000 0.011 0.286 0.246 0.457 0.000
## 9 0.000 0.000 0.000 0.600 0.400 0.000 0.000
##
## [Overall Accuracy]
## 72.00898 %
##
## [Cohen's kappa]
## Unweighted: 0.5654909
## Weighted: 0.701945
When it’s converted to a classification prediction, it outperforms the regression version. Here improvements in the prediction accuracy for quality level 4 and 8 are observed. It feels that the tree-based rule logic and randomness of the model enables to predict some deviant cases although it still cannot output quality level 3 and 9.
# As a simpler classification ("bad", "normal", "good") prediction
rf3 <- randomForest(fml3, ww)
describePerformance(ww$quality.f2, rf3$predicted)
## [Contingency Table]
## pred
## org bad normal good
## bad 1218 405 17
## normal 291 1743 164
## good 20 327 713
##
## [Contingency Table by Proportion]
## pred
## org bad normal good
## bad 0.743 0.247 0.010
## normal 0.132 0.793 0.075
## good 0.019 0.308 0.673
##
## [Overall Accuracy]
## 75.01021 %
##
## [Cohen's kappa]
## Unweighted: 0.6028426
## Weighted: 0.7315294
This is the best performance so far. The simplification of quality levels worked well for Random Forest.
# Extreme case detection for excellent ones
rf4 <- randomForest(fml.e, ww)
describePerformance(ww$excellent, rf4$predicted)
## [Contingency Table]
## pred
## org FALSE TRUE
## FALSE 4716 2
## TRUE 103 77
##
## [Contingency Table by Proportion]
## pred
## org FALSE TRUE
## FALSE 1.000 0.000
## TRUE 0.572 0.428
##
## [Overall Accuracy]
## 97.85627 %
##
## [Cohen's kappa]
## Unweighted: 0.5852975
## Weighted: 0.5852975
While SVM could not detect any excellent wines, Random Forest could detect about 43 % of them.
# Extreme case detection for inferior ones
rf5 <- randomForest(fml.i, ww)
describePerformance(ww$inferior, rf5$predicted)
## [Contingency Table]
## pred
## org FALSE TRUE
## FALSE 4703 12
## TRUE 142 41
##
## [Contingency Table by Proportion]
## pred
## org FALSE TRUE
## FALSE 0.997 0.003
## TRUE 0.776 0.224
##
## [Overall Accuracy]
## 96.85586 %
##
## [Cohen's kappa]
## Unweighted: 0.3363202
## Weighted: 0.3363202
This is slightly better than SVM; still, the detection rate for inferior wines is only about 22 %.
As a summary, I'm going to revisit the findings with plots.
ggplot(ww, aes(x=quality.f2, y=alcohol, fill=quality.f2)) +
geom_boxplot() +
stat_summary(fun.data=mean_cl_normal, geom="errorbar",
color='red', size=1) +
ylim(quantile(ww$alcohol, prob=0.00),
quantile(ww$alcohol, prob=0.99)) +
ggtitle("Range of Alcohol per Quality Level") +
xlab("Quality") +
ylab("Alcohol (%)") +
theme(title=element_text(size=18, face="bold"),
axis.title=element_text(size=14),
axis.text=element_text(size=12),
legend.position="none")
The 95 % confidence intervals are added as red lines; they have relatively narrow ranges because of the large number of data points. The plot shows that the means of the groups differ significantly at that confidence level. This can imply that a higher alcohol volume produces better quality wine, although the causality cannot be established from this data analysis alone. (It is unclear whether a high alcohol volume directly contributes to better quality.)
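The visual impression can also be backed by a formal test; a sketch using pairwise t-tests (my addition, not in the original analysis):
# Pairwise comparisons of mean alcohol across the three quality groups,
# with Bonferroni correction for multiple testing
pairwise.t.test(ww$alcohol, ww$quality.f2, p.adjust.method="bonferroni")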
# Compute mean of distribution per level of cls
mu <- summarise(group_by(ww, quality.f2), mean(density))
colnames(mu)[2] <- "density"
ggplot(ww, aes(x=density, fill=quality.f2)) +
geom_density(alpha=0.3) +
geom_vline(aes(xintercept=density, color=quality.f2),
data=mu, size=1, show.legend=F) +
xlim(quantile(ww$density, prob=0.00),
quantile(ww$density, prob=0.99)) +
guides(fill=guide_legend(title="Quality", reverse=T)) +
ggtitle("Distribution of Density per Quality Level") +
xlab("Density (g/ml)") +
ylab("") +
theme(title=element_text(size=18, face="bold"),
axis.title=element_text(size=14),
axis.text=element_text(size=12),
legend.text=element_text(size=12))
This is a density plot of density. As quality goes up, the center of the distribution of density gets smaller. Since density has a high correlation coefficient with alcohol, this is expected. A lower density possibly produces better quality wine, although the causality is not deterministic here either. (This might simply be due to the negative correlation (-0.78) between alcohol and density.)
ggplot(ww, aes(x=density, y=residual.sugar, color=quality.f2)) +
geom_point(position='jitter') +
geom_smooth(aes(color=quality.f2), method=lm) +
xlim(min(ww$density),
quantile(ww$density, prob=0.99)) +
ylim(min(ww$residual.sugar),
quantile(ww$residual.sugar, prob=0.99)) +
guides(color=guide_legend(title="Quality", reverse=T)) +
ggtitle("Quality on Density and Residual Suagr Dimension") +
xlab("Density (g/mL)") +
ylab("Residual Sugar (g/L)") +
theme(title=element_text(size=18, face="bold"),
axis.title=element_text(size=14),
axis.text=element_text(size=12),
legend.text=element_text(size=12))
A regression line is added to each quality group with a shaded confidence interval. When density is low, the distinction among the quality levels is clear, but as density increases the margin gets narrower, and at the high end the lines cross. This means that the distinction in this 2D space is only applicable where density is low to middle.
Through exploratory data analysis, I found that alcohol, density and residual.sugar are possible factors that affect the quality level in an interpretable manner. Our sense of taste is quite complex, and the modeling results imply that accurate prediction of wine quality requires more complexity than can be interpreted through visualization in a low-dimensional space.
When I started this project, I thought it would not be difficult to find hidden relationships between quality and the physicochemical properties, because the data set has only eleven independent variables and all of them are clean. But I soon noticed I was wrong after performing some visualizations with arbitrary variable choices. What I came up with after the struggle was a brute-force method that plots all combinations of variables. This way I could efficiently check whether there is any interesting pattern in different types of 2D spaces. By systematizing repeated plotting through concise function calls, I could also minimize the amount of code. The plot in Section 6.3 is one that I found by this brute-force method and that would have been difficult to find otherwise.
I came up with new quality level representations to make visualization more informative. Those are transformations of the dependent variable, and they worked. But I did not utilize transformations of the independent variables or feature engineering. Based on the observations in the modeling section, there is a tendency for more complex non-linear models to perform better. But the number of models applied in this project is limited, and model tuning was not performed. So those modeling-related tasks are left for a future project.
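As a possible starting point for that future tuning work, e1071 ships a simple grid-search helper; a hypothetical sketch (the parameter grids are arbitrary choices of mine):
# Grid search over SVM hyperparameters for the simplified classification
tune_res <- tune.svm(fml3, data=ww, gamma=10^(-2:0), cost=10^(0:2))
summary(tune_res)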