Predicting Red Wine Quality: Exploratory Data Analysis

by Jekaterina Novikova

All the packages used in the analysis are listed below:

library(ggplot2)
library(GGally) # for ggpairs
library(memisc)
library(gridExtra)

Dataset Analysis

Loading Data

The dataset consists of 13 variables, eleven of which are characteristics of red wine that potentially influence its quality. The first variable, X, is an index identifying each wine sample, and the last one, quality, is the evaluation of perceived wine quality.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Adding New Variables

The variable quality is stored as an integer, which is inconvenient for grouping and plotting. I therefore convert it to a factor and add it to the data frame as a new variable, quality.factor. In addition, I create a second variable, quality.cat, with three categories of quality: good (>= 7), bad (<= 4), and medium (5 and 6).

df$quality.factor <- factor(df$quality)
df$quality.cat <- NA
df$quality.cat <- ifelse(df$quality>=7, 'good', NA)
df$quality.cat <- ifelse(df$quality<=4, 'bad', df$quality.cat)
df$quality.cat <- ifelse(df$quality==5, 'medium', df$quality.cat)
df$quality.cat <- ifelse(df$quality==6, 'medium', df$quality.cat)

df$quality.cat <- factor(df$quality.cat, levels = c("bad", "medium", "good"))
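
The same three-way binning can also be written with base R's cut(). The sketch below is only an illustration on a toy vector of scores (the values are made up and stand in for df$quality); it is not the code used in this report.

```r
# Toy quality scores standing in for df$quality (illustrative values only)
quality <- c(3, 4, 5, 6, 7, 8)

# cut() uses right-closed intervals by default:
# (-Inf, 4] -> bad, (4, 6] -> medium, (6, Inf] -> good
quality_cat <- cut(quality,
                   breaks = c(-Inf, 4, 6, Inf),
                   labels = c("bad", "medium", "good"))

# Count how many scores fall into each category
table(quality_cat)
```

Because the labels are given in order, the resulting factor already has the levels bad, medium, good, so no separate factor() call is needed.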

Univariate Analysis and Plots

For each variable in the dataset, I plot a frequency histogram in this section. Boxplots showing how perceived quality changes with each characteristic follow in the bivariate analysis.

Normal Distribution of Frequency

The following variables have a normal or close-to-normal distribution: fixed.acidity, volatile.acidity, density, pH and alcohol. The distribution of citric.acid is not normal, but I will not transform this variable for the purpose of the analysis.

n1 <- qplot(x = fixed.acidity, data = df, 
      binwidth = 0.1) +
  scale_x_continuous(breaks = seq(4, 16, 1))

n2 <- qplot(x = volatile.acidity, data = df, 
      binwidth = 0.01) +
  scale_x_continuous(breaks = seq(0.12, 1.58, 0.1))

n3 <- qplot(x = citric.acid, data = df, 
      binwidth = 0.01) +
  scale_x_continuous(breaks = seq(0, 1, 0.1))

n4 <- qplot(x = density, data = df)

n5 <- qplot(x = pH, data = df)

n6 <- qplot(x = alcohol, data = df)

grid.arrange(n1, n2, n3, n4, n5, n6, ncol = 2)

Transforming Data

The following variables do not have a normal or close-to-normal distribution: residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates. The histograms of all these variables are strongly right-skewed and need a transformation.

l1 <- qplot(x = residual.sugar, data = df) +
  scale_x_continuous(breaks = seq(0, 16, 0.5))

l2 <- qplot(x = chlorides, data = df, 
      binwidth = 0.01) +
  scale_x_continuous(breaks = seq(0, 1, 0.1))

l3 <- qplot(x = free.sulfur.dioxide, data = df, 
      binwidth = 0.5)

l4 <- qplot(x = total.sulfur.dioxide, data = df, 
      binwidth = 0.5)

l5 <- qplot(x = sulphates, data = df)

grid.arrange(l1, l2, l3, l4, l5, ncol = 2)

I apply a log10 transformation to all of these variables so that their distributions look closer to normal.

l1a <- qplot(x = log10(residual.sugar), data = df)

l2a <- qplot(x = log10(chlorides), data = df)

l3a <- qplot(x = log10(free.sulfur.dioxide), data = df)

l4a <- qplot(x = log10(total.sulfur.dioxide), data = df)

l5a <- qplot(x = log10(sulphates), data = df)

grid.arrange(l1a, l2a, l3a, l4a, l5a, ncol = 2)
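
The effect of the log10 transformation on skew can also be checked numerically. The following is a minimal sketch on simulated log-normal data, not on the wine data; the skewness() helper is a hypothetical function defined here for illustration (moment-based sample skewness).

```r
set.seed(42)
# Simulated right-skewed measurements (log-normal); NOT the wine data
x <- rlnorm(1000, meanlog = 0, sdlog = 0.8)

# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skew_raw <- skewness(x)          # clearly positive for right-skewed data
skew_log <- skewness(log10(x))   # close to zero after the transformation
c(raw = skew_raw, log10 = skew_log)
```

A skewness near zero after the transformation is consistent with a more symmetric, closer-to-normal histogram.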

Bivariate Analysis

In this section, I analyse relationships between wine characteristics and its perceived quality, as well as possible correlations between different characteristics.

Relationship between Wine Characteristics and Its Quality

In order to compare statistical data of each variable visually, I use boxplots.

Increasing Quality of Wine

The following set of boxplots shows all the cases where perceived wine quality increases together with the value of a characteristic.

p1up = qplot(x = quality.cat, y = alcohol, 
      data = df,
      geom = "boxplot")

p2up = qplot(x = quality.cat, y = sulphates, 
      data = df,
      geom = "boxplot")

p3up = qplot(x = quality.cat, y = citric.acid, 
      data = df,
      geom = "boxplot")

p4up = qplot(x = quality.cat, y = fixed.acidity, 
      data = df,
      geom = "boxplot")

grid.arrange(p1up, p2up, p3up, p4up, ncol = 2)

Decreasing Quality of Wine

The next set of boxplots, on the contrary, shows all the cases where perceived wine quality decreases as the value of a characteristic increases.

p1d = qplot(x = quality.cat, y = volatile.acidity, 
      data = df,
      geom = "boxplot")
p2d = qplot(x = quality.cat, y = pH, 
      data = df,
      geom = "boxplot")
p3d = qplot(x = quality.cat, y = density, 
      data = df,
      geom = "boxplot")
# p4d (density by individual quality score) is defined but not included in the grid below
p4d = qplot(x = quality.factor, y = density, 
      data = df,
      geom = "boxplot")

grid.arrange(p1d, p2d, p3d, ncol = 2)
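
The trends read off the boxplots can be cross-checked with per-category medians. This is a minimal sketch on a made-up data frame (the alcohol values below are invented for illustration and are not taken from the dataset).

```r
# Toy data: alcohol values for three quality categories (made-up numbers)
toy <- data.frame(
  quality.cat = factor(rep(c("bad", "medium", "good"), each = 3),
                       levels = c("bad", "medium", "good")),
  alcohol = c(9.5, 10.0, 10.5,  9.8, 10.2, 10.9,  11.0, 11.6, 12.1)
)

# Median alcohol per category, in the order of the factor levels
med <- tapply(toy$alcohol, toy$quality.cat, median)
med

# TRUE when the medians rise from bad to medium to good
increasing <- all(diff(as.numeric(med)) > 0)
increasing
```

Applied to the real data frame, the same tapply() call would take df$alcohol and df$quality.cat as its first two arguments.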

ggpairs

In the lower triangle of the plot, the ggpairs output uses grouped histograms for qualitative/qualitative pairs of variables and scatterplots for quantitative/quantitative pairs. In the upper triangle, it provides boxplots for qualitative/quantitative pairs and correlation coefficients for quantitative/quantitative pairs.

df.subset <- df[,2:13]
ggpairs(df.subset, params = c(shape = I('.'), outlier.shape = I('.')))

Focusing on the pH column, I see possible relationships between pH and density, between pH and citric.acid, and between pH and fixed.acidity. The correlation between pH and each of these three variables is negative.

dens <- ggplot(aes(x = pH, y = density), data = df) + 
  geom_point(alpha = 1/5, position = position_jitter(h = 0)) +
  coord_trans(x = "log10") +
  geom_smooth(method = "lm", color = "red")

citr.ac <- ggplot(aes(x = pH, y = citric.acid), data = df) + 
  geom_point(alpha = 1/5, position = position_jitter(h = 0)) +
  coord_trans(x = "log10") +
  geom_smooth(method = "lm", color = "red")

fix.ac <- ggplot(aes(x = pH, y = fixed.acidity), data = df) + 
  geom_point(alpha = 1/5, position = position_jitter(h = 0)) +
  coord_trans(x = "log10") +
  geom_smooth(method = "lm", color = "red")

grid.arrange(dens, citr.ac, fix.ac, ncol = 2)

According to the ggpairs matrix, the highest positive correlations are between density and fixed.acidity and between fixed.acidity and citric.acid.

pos1 <- ggplot(aes(x = fixed.acidity, y = density), data = df) + 
  geom_jitter(alpha = 1/5) +
  geom_smooth(method = "lm", color = "red")

pos2 <- ggplot(aes(x = fixed.acidity, y = citric.acid), data = df) + 
  geom_jitter(alpha = 1/5) +
  geom_smooth(method = "lm", color = "red")

grid.arrange(pos1, pos2, ncol = 2)
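
The correlation coefficients shown by ggpairs are Pearson coefficients, which can be computed directly with cor(). The sketch below uses synthetic data with an assumed positive linear relationship (the coefficients 0.99 and 0.001 are arbitrary choices for illustration), and also shows that for a one-predictor linear fit the R-squared equals the squared correlation.

```r
set.seed(1)
# Synthetic stand-ins for two positively related characteristics (NOT the wine data)
fixed_acidity <- rnorm(200, mean = 8, sd = 1.5)
density <- 0.99 + 0.001 * fixed_acidity + rnorm(200, sd = 0.001)

# Pearson correlation (the default method of cor())
r <- cor(fixed_acidity, density)

# Simple linear regression, as drawn by geom_smooth(method = "lm")
fit <- lm(density ~ fixed_acidity)

# For a single predictor, R-squared of the fit equals r squared
c(r = r, r_squared = r^2, lm_r_squared = summary(fit)$r.squared)
```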

Multivariate Analysis and Plots

The previous boxplots showed that higher levels of both sulphates and alcohol go together with higher perceived quality of red wine. Now, I create a scatterplot to see whether a combination of these two may help distinguish between different quality levels.

ggplot(aes(x = log10(sulphates), y = alcohol, colour = quality.factor), 
       data = df) + 
  geom_point(aes(size = quality.factor)) +
  scale_color_brewer(type = 'div', palette="Set1") +
  scale_x_continuous(lim=c(quantile(log10(df$sulphates), 0.01),
                           quantile(log10(df$sulphates), 0.99)))+
  scale_y_continuous(lim=c(quantile(df$alcohol, 0.01),
                           quantile(df$alcohol, 0.99)))

The plot reveals a clear pattern: most of the orange and yellow dots (high-quality wine) lie where both alcohol and sulphates levels are high. There is also a visible band of violet dots in the middle of the plot and a zone of mostly green dots in the bottom-left corner. This suggests that the combination of these two variables helps distinguish between the two medium-quality levels (5 and 6).
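
The axis limits in the plot code above are set to the 1st and 99th percentiles of each variable, which hides the most extreme points. A minimal sketch of the same trimming in base R, on synthetic data rather than the wine data:

```r
set.seed(7)
# Synthetic right-skewed values standing in for a wine characteristic
x <- rlnorm(500)

# 1st and 99th percentiles, as used for the scale limits
lims <- quantile(x, probs = c(0.01, 0.99))

# Values that fall inside the limits and would therefore be drawn
kept <- x[x >= lims[1] & x <= lims[2]]
c(total = length(x), kept = length(kept))
```

In ggplot2, limits set on a scale drop the out-of-range points before anything is drawn, whereas coord_cartesian() only zooms the view. For plain scatterplots like the ones here, the visible result is the same.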

The previous plots show a positive correlation between density and fixed.acidity, so I create a scatterplot to see whether these two variables explain the changes in quality well.

ggplot(aes(x = fixed.acidity, y = density, colour = quality.factor), 
       data = df) + 
  geom_point(size = 4) +
  #geom_point() +
  scale_color_brewer(type = 'div', palette="Set1") +
  scale_x_continuous(lim=c(quantile(df$fixed.acidity, 0.01),
                           quantile(df$fixed.acidity, 0.99))) +
  scale_y_continuous(lim=c(quantile(df$density, 0.01),
                           quantile(df$density, 0.99)))

Although the plot is not very clear, it reveals some patterns in the data. The majority of green and violet dots are concentrated in the upper part of the plot, while the majority of orange dots are concentrated in the bottom part. Thus, this combination of variables may be useful for distinguishing medium-quality wine from high-quality wine.

Finally, I analyze the influence of pH and the two sulfur dioxide variables on the quality of red wine.

The left plot below shows the impact of the combination of pH and total.sulfur.dioxide on quality. The zone of green dots (medium-quality wine) is immediately visible here.

The variable total.sulfur.dioxide is highly correlated with free.sulfur.dioxide, so I create the right plot to see whether adding another variable adds any value. The area of orange dots (high-quality wine) seems more visible on the right plot (bottom-left corner), while the area of green dots is still clearly distinguishable.

p1 <- ggplot(aes(x = pH, y = total.sulfur.dioxide, colour = quality.factor), 
       data = df) + 
  geom_point(aes(size = quality.factor)) +
  scale_color_brewer(type = 'div', palette="Set1") +
  scale_x_continuous(lim=c(quantile(df$pH, 0.01),
                           quantile(df$pH, 0.99))) +
  scale_y_continuous(lim=c(quantile(df$total.sulfur.dioxide, 0.01),
                           quantile(df$total.sulfur.dioxide, 0.99)))

p2 <- ggplot(aes(x = log10(total.sulfur.dioxide), 
                 y = log10(free.sulfur.dioxide), colour = quality.factor), 
             data = df) + 
  geom_point(aes(size = quality.factor)) +
  #geom_point(aes(size = 12)) + 
  scale_color_brewer(type = 'div', palette="Set1") +
  scale_x_continuous(lim=c(quantile(log10(df$total.sulfur.dioxide),
                                    0.01),
                           quantile(log10(df$total.sulfur.dioxide),
                                    0.99))) +
  scale_y_continuous(lim=c(quantile(log10(df$free.sulfur.dioxide),
                                    0.01),
                           quantile(log10(df$free.sulfur.dioxide),
                                    0.99)))
grid.arrange(p1, p2, ncol = 2)

Regression Model

I mostly use combinations of two or more variables in the multiple regression model that predicts the quality of red wine. The first combination consists of all the variables whose increasing levels go together with increasing quality. The next combination is density and fixed.acidity, because their scatterplot suggested they are useful for predicting quality. Then comes volatile.acidity, as this variable has the strongest negative correlation with the quality variable. The last combination consists of pH, total.sulfur.dioxide and free.sulfur.dioxide, based on the last step of the EDA above.

m1 <- lm(quality ~ alcohol*sulphates*citric.acid*fixed.acidity, data = df)
m2 <- update(m1, ~ . + density*fixed.acidity)
m3 <- update(m2, ~ . + volatile.acidity)
m4 <- update(m3, ~ . + pH*total.sulfur.dioxide*free.sulfur.dioxide)


mtable(m1, m2, m3, m4)

## 
## Calls:
## m1: lm(formula = quality ~ alcohol * sulphates * citric.acid * fixed.acidity, 
##     data = df)
## m2: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density + alcohol:sulphates + alcohol:citric.acid + sulphates:citric.acid + 
##     alcohol:fixed.acidity + sulphates:fixed.acidity + citric.acid:fixed.acidity + 
##     fixed.acidity:density + alcohol:sulphates:citric.acid + alcohol:sulphates:fixed.acidity + 
##     alcohol:citric.acid:fixed.acidity + sulphates:citric.acid:fixed.acidity + 
##     alcohol:sulphates:citric.acid:fixed.acidity, data = df)
## m3: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density + volatile.acidity + alcohol:sulphates + alcohol:citric.acid + 
##     sulphates:citric.acid + alcohol:fixed.acidity + sulphates:fixed.acidity + 
##     citric.acid:fixed.acidity + fixed.acidity:density + alcohol:sulphates:citric.acid + 
##     alcohol:sulphates:fixed.acidity + alcohol:citric.acid:fixed.acidity + 
##     sulphates:citric.acid:fixed.acidity + alcohol:sulphates:citric.acid:fixed.acidity, 
##     data = df)
## m4: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density + volatile.acidity + pH + total.sulfur.dioxide + 
##     free.sulfur.dioxide + alcohol:sulphates + alcohol:citric.acid + 
##     sulphates:citric.acid + alcohol:fixed.acidity + sulphates:fixed.acidity + 
##     citric.acid:fixed.acidity + fixed.acidity:density + pH:total.sulfur.dioxide + 
##     pH:free.sulfur.dioxide + total.sulfur.dioxide:free.sulfur.dioxide + 
##     alcohol:sulphates:citric.acid + alcohol:sulphates:fixed.acidity + 
##     alcohol:citric.acid:fixed.acidity + sulphates:citric.acid:fixed.acidity + 
##     pH:total.sulfur.dioxide:free.sulfur.dioxide + alcohol:sulphates:citric.acid:fixed.acidity, 
##     data = df)
## 
## ==============================================================================================
##                                                        m1         m2         m3         m4    
## ----------------------------------------------------------------------------------------------
## (Intercept)                                           6.004    -16.407   -28.327    -47.730   
##                                                      (8.287)   (52.025)  (50.640)   (52.164)  
## alcohol                                              -0.248     -0.031    -0.006      0.166   
##                                                      (0.791)    (0.788)   (0.767)    (0.772)  
## sulphates                                            -1.259      1.733     1.544      1.326   
##                                                     (11.824)   (11.869)  (11.550)   (11.605)  
## citric.acid                                          24.816     31.954    33.083     36.242*  
##                                                     (18.651)   (18.861)  (18.354)   (18.386)  
## fixed.acidity                                         0.078     10.120     8.308      7.842   
##                                                      (1.148)    (5.646)   (5.497)    (5.526)  
## alcohol x sulphates                                   0.400      0.151     0.199      0.126   
##                                                      (1.120)    (1.121)   (1.091)    (1.098)  
## alcohol x citric.acid                                -1.790     -2.483    -2.711     -3.038   
##                                                      (1.769)    (1.784)   (1.736)    (1.742)  
## sulphates x citric.acid                             -55.541*   -64.136*  -62.580*   -61.781*  
##                                                     (25.491)   (25.795)  (25.101)   (25.161)  
## alcohol x fixed.acidity                               0.004     -0.029    -0.014     -0.039   
##                                                      (0.111)    (0.111)   (0.108)    (0.108)  
## sulphates x fixed.acidity                            -0.679     -1.025    -0.781     -0.833   
##                                                      (1.639)    (1.640)   (1.596)    (1.602)  
## citric.acid x fixed.acidity                          -3.359     -4.130    -4.076     -4.447*  
##                                                      (2.308)    (2.325)   (2.262)    (2.263)  
## alcohol x sulphates x citric.acid                     4.484      5.231*    5.177*     5.190*  
##                                                      (2.400)    (2.427)   (2.362)    (2.369)  
## alcohol x sulphates x fixed.acidity                   0.052      0.082     0.048      0.067   
##                                                      (0.158)    (0.158)   (0.154)    (0.155)  
## alcohol x citric.acid x fixed.acidity                 0.273      0.349     0.342      0.385   
##                                                      (0.221)    (0.222)   (0.216)    (0.216)  
## sulphates x citric.acid x fixed.acidity               6.979*     7.874*    7.330*     7.300*  
##                                                      (3.168)    (3.191)   (3.105)    (3.109)  
## alcohol x sulphates x citric.acid x fixed.acidity    -0.597*    -0.675*   -0.621*    -0.634*  
##                                                      (0.302)    (0.304)   (0.295)    (0.296)  
## density                                                         19.745    31.965     55.838   
##                                                                (52.053)  (50.668)   (52.270)  
## fixed.acidity x density                                         -9.685    -7.953     -7.355   
##                                                                 (5.584)   (5.436)    (5.465)  
## volatile.acidity                                                          -1.106***  -0.985***
##                                                                           (0.117)    (0.119)  
## pH                                                                                   -1.566***
##                                                                                      (0.375)  
## total.sulfur.dioxide                                                                 -0.072** 
##                                                                                      (0.024)  
## free.sulfur.dioxide                                                                  -0.207** 
##                                                                                      (0.066)  
## pH x total.sulfur.dioxide                                                             0.021** 
##                                                                                      (0.007)  
## pH x free.sulfur.dioxide                                                              0.065** 
##                                                                                      (0.020)  
## total.sulfur.dioxide x free.sulfur.dioxide                                            0.003***
##                                                                                      (0.001)  
## pH x total.sulfur.dioxide x free.sulfur.dioxide                                      -0.001***
##                                                                                      (0.000)  
## ----------------------------------------------------------------------------------------------
## R-squared                                              0.333      0.342      0.377      0.391 
## adj. R-squared                                         0.327      0.335      0.370      0.381 
## sigma                                                  0.663      0.659      0.641      0.635 
## F                                                     52.712     48.362     53.221     40.373 
## p                                                      0.000      0.000      0.000      0.000 
## Log-likelihood                                     -1602.739  -1591.863  -1547.716  -1530.317 
## Deviance                                             695.015    685.624    648.792    634.825 
## AIC                                                 3239.478   3221.725   3135.432   3114.633 
## BIC                                                 3330.889   3323.891   3242.975   3259.816 
## N                                                   1599       1599       1599       1599     
## ==============================================================================================
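
The long formulas in the Calls section come from how R expands the * operator: a * b stands for both main effects plus their interaction, a + b + a:b. A minimal sketch with hypothetical predictor names (a, b, c, d are placeholders, and no data is needed to inspect a formula):

```r
# Three-way crossing: all main effects, two-way and three-way interactions
f <- y ~ a * b * c
labels_full <- attr(terms(f), "term.labels")
labels_full

# update() keeps the existing terms ("." on the right-hand side) and appends new ones
f2 <- update(f, ~ . + d)
labels_upd <- attr(terms(f2), "term.labels")
labels_upd

# Crossing four predictors yields 2^4 - 1 = 15 terms
n_terms_four <- length(attr(terms(y ~ a * b * c * d), "term.labels"))
n_terms_four
```

The count of 15 terms for four crossed predictors matches the 15 coefficient rows (excluding the intercept) that m1 has in the table.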

The full model (m4) explains about 39% of the variance in quality (R-squared = 0.391). Most of this comes from the first combination of parameters (alcohol, sulphates, citric.acid, fixed.acidity), which alone gives R-squared = 0.333. Each of the next three sets of features adds roughly 0.01 to 0.035 to the previous R-squared value.
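
The remaining fit statistics in the table can be reproduced from the R-squared and log-likelihood rows. The sketch below copies those values from the mtable output; the predictor counts p are obtained by counting the coefficient rows of each model (excluding the intercept), and the formulas are the standard ones for adjusted R-squared, AIC and BIC of a linear model, which I assume match what mtable reports.

```r
# Values copied from the mtable output above
n <- 1599
fits <- data.frame(
  model  = c("m1", "m2", "m3", "m4"),
  r2     = c(0.333, 0.342, 0.377, 0.391),
  loglik = c(-1602.739, -1591.863, -1547.716, -1530.317),
  p      = c(15, 17, 18, 25)   # coefficient rows excluding the intercept
)

# Adjusted R-squared penalises R-squared for the number of predictors
fits$adj_r2 <- 1 - (1 - fits$r2) * (n - 1) / (n - fits$p - 1)

# Parameter count for AIC/BIC: p slopes + intercept + residual variance
k <- fits$p + 2
fits$aic <- -2 * fits$loglik + 2 * k
fits$bic <- -2 * fits$loglik + log(n) * k

print(fits, digits = 7)
```

The computed values agree with the table up to rounding. AIC is lowest for m4, while BIC, which penalises extra parameters more heavily, is lowest for m3.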

This model has limitations. It is based on a limited dataset that contains no very high (above 8) or very low (below 3) quality scores. Collecting data with more cases of extreme scores, as well as additional data for the existing low-quality scores (3 and 4), could significantly improve the model’s predictive power.

Final Plots and Summary

Plot One

Description One

Alcohol and citric acid are the two characteristics that increase the perceived quality of wine the most. pH and volatile acidity, on the contrary, reduce the perceived quality the most.

Plot Two

Description Two

Alcohol and sulphates, together with the other quality-increasing characteristics, contribute the most to predicting red wine quality.

Summary

The multiple regression model explains up to 39% of the variance in quality in the dataset. An additional dataset with more cases of extreme quality (both high and low) should help improve the results of this model. Moreover, more sophisticated prediction models should be able to provide more accurate predictions of wine quality based on its chemical characteristics.