Introduction

Wine is an alcoholic drink made from fermented grapes. Yeast consumes the sugar in the grapes and converts it to ethanol, carbon dioxide, and heat. Different varieties of grapes and strains of yeasts produce different styles of wine. These variations result from the complex interactions between the biochemical development of the grape, the reactions involved in fermentation, the terroir, and the production process. Many countries enact legal appellations intended to define styles and qualities of wine. These typically restrict the geographical origin and permitted varieties of grapes, as well as other aspects of wine production. Wines not made from grapes include rice wine and fruit wines such as plum, cherry, pomegranate, currant and elderberry. Source: Wikipedia

The dataset for the following analysis comes from the UCI Machine Learning Repository. The repository provides two datasets, covering the red and white variants of the Portuguese “Vinho Verde” wine; this analysis uses the red wine data. Due to privacy and logistical issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

Exploratory Data Analysis

First, load the required packages and the dataset, then look at the summary statistics and the first few rows.

library(dplyr)
library(gridExtra)
library(knitr)
library(kableExtra)
library(ggplot2)
library(corrplot)
library(GGally)
library(pander)

wine <- read.csv("winequality-red.csv", sep = ';')
summary(wine)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
kable(head(wine), "html") %>%
  kable_styling("striped") %>%
  scroll_box(width = "100%")
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5
7.8 0.88 0.00 2.6 0.098 25 67 0.9968 3.20 0.68 9.8 5
7.8 0.76 0.04 2.3 0.092 15 54 0.9970 3.26 0.65 9.8 5
11.2 0.28 0.56 1.9 0.075 17 60 0.9980 3.16 0.58 9.8 6
7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5
7.4 0.66 0.00 1.8 0.075 13 40 0.9978 3.51 0.56 9.4 5

Univariate Analysis

p1 <- ggplot(wine) + geom_histogram(aes(alcohol), color="black", fill="#ce2d4f")
p2 <- ggplot(wine) + geom_histogram(aes(chlorides), color="black", fill="#ce6d8b")
p3 <- ggplot(wine) + geom_histogram(aes(citric.acid), color="black", fill="#cebbc9")
p4 <- ggplot(wine) + geom_histogram(aes(density), color="black", fill="#4056f4")
p5 <- ggplot(wine) + geom_histogram(aes(fixed.acidity), color="black", fill="#470ff4")
p6 <- ggplot(wine) + geom_histogram(aes(free.sulfur.dioxide), color="black", fill="#e54b4b")
p7 <- ggplot(wine) + geom_histogram(aes(pH), color="black", fill="#ffa987")
p8 <- ggplot(wine) + geom_histogram(aes(quality), color="black", fill="#c8d5b9")
p9 <- ggplot(wine) + geom_histogram(aes(residual.sugar), color="black", fill="#4a7c59")
p10 <- ggplot(wine) + geom_histogram(aes(sulphates), color="black", fill="#c4b7cb")
p11 <- ggplot(wine) + geom_histogram(aes(total.sulfur.dioxide), color="black", fill="#98e2c6")
p12 <- ggplot(wine) + geom_histogram(aes(volatile.acidity), color="black", fill="#06bee1")

grid.arrange(p1, p2, p3, p4, ncol= 2)

grid.arrange(p5, p6, p7, p8, ncol= 2)

grid.arrange(p9, p10, p11, p12, ncol= 2)

Observations:

  • Some of the variables have approximately normal distributions (density, fixed acidity, pH, volatile acidity).
  • Some variables are right-skewed, with most values concentrated at the lower end (chlorides, citric acid, residual sugar, total sulfur dioxide).
  • The quality variable takes only 6 discrete values (3 to 8).
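
A rough numerical check of these shape observations is possible from the summary() output alone. The sketch below copies the quartiles, medians and means printed above for four variables and scales the mean-minus-median gap by the interquartile range. The helper name `skew_gap` and the choice of this particular measure are assumptions made for illustration, not part of the original analysis. Values near zero suggest a roughly symmetric distribution, while clearly positive values suggest a longer right tail.

```r
# Summary statistics copied from the summary(wine) output above
stats <- data.frame(
  variable = c("density", "pH", "residual.sugar", "chlorides"),
  q1       = c(0.9956, 3.210, 1.900, 0.07000),
  median   = c(0.9968, 3.310, 2.200, 0.07900),
  mean     = c(0.9967, 3.311, 2.539, 0.08747),
  q3       = c(0.9978, 3.400, 2.600, 0.09000)
)

# Hypothetical helper: gap between mean and median, in units of the IQR
skew_gap <- function(mean, median, q1, q3) (mean - median) / (q3 - q1)

stats$gap <- with(stats, skew_gap(mean, median, q1, q3))
print(stats[, c("variable", "gap")])
```

This is only a coarse indicator; the histograms remain the better guide to the overall shape of each distribution.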
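
Next, group the quality scores into three rating bands and plot how many wines fall into each.

# Bin quality scores: <= 5 is Average, 6-7 is Good, >= 8 is Excellent
wine$rating <- NA_character_
wine$rating[wine$quality <= 5] <- 'Average'
wine$rating[wine$quality > 5 & wine$quality < 8] <- 'Good'
wine$rating[wine$quality >= 8] <- 'Excellent'
# Set the levels explicitly so the bars appear in ascending order of quality
wine$rating <- factor(wine$rating, levels = c('Average', 'Good', 'Excellent'))
# Bar chart of the number of wines per rating
ggplot(wine, aes(x = rating)) + geom_bar()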
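
The same three bands can also be produced in one step with base R's cut(), which returns a factor whose levels are already in ascending order. The following is a sketch on a hypothetical vector of scores spanning the observed range of 3 to 8, not the author's code; the name `scores` is illustrative, and the break points assume integer quality scores.

```r
# Hypothetical quality scores covering the observed range (3 to 8)
scores <- c(3, 4, 5, 6, 7, 8)

# Intervals are right-closed by default: (-Inf, 5], (5, 7], (7, Inf]
rating <- cut(scores,
              breaks = c(-Inf, 5, 7, Inf),
              labels = c("Average", "Good", "Excellent"))

print(table(rating))
```

Applied to the full data, `scores` would be replaced by `wine$quality`.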

Bivariate Analysis

Next, we explore how each variable relates to quality.

p1 <- ggplot(wine, aes(group = cut_width(quality, 1))) + 
  geom_boxplot(aes(quality, alcohol), colour = "#9cb380")

p2 <- ggplot(wine, aes(group = cut_width(quality, 1))) + 
  geom_boxplot(aes(quality, chlorides), colour = "#522a27")

p3 <- ggplot(wine, aes(group = cut_width(quality, 1))) + 
  geom_boxplot(aes(quality, citric.acid), colour = "#c73e1d")

p4 <- ggplot(wine, aes(group = cut_width(quality, 1))) + 
  geom_boxplot(aes(quality, density), colour = "#c59849")

p5 <- ggplot(wine, aes(group = cut_width(quality, 1))) + 
  geom_boxplot(aes(quality, fixed.acidity), colour = "#1f2041")

p6 <- ggplot(wine, aes(group = cut_width(quality, 1))) + 
  geom_boxplot(aes(quality, free.sulfur.dioxide), colour = "#4b3f72")

p7 <- ggplot(wine, aes(group = cut_width(quality, 1))) + 
  geom_boxplot(aes(quality, pH), colour = "#417b5a")

p8 <- ggplot(wine, aes(group = cut_width(quality, 1))) + 
  geom_boxplot(aes(quality, residual.sugar), colour = "#417b5a")

p9 <- ggplot(wine, aes(group = cut_width(quality, 1))) + 
  geom_boxplot(aes(quality, sulphates), colour = "#17183b")

p10 <- ggplot(wine, aes(group = cut_width(quality, 1))) + 
  geom_boxplot(aes(quality, total.sulfur.dioxide), colour = "#a11692")

p11 <- ggplot(wine, aes(group = cut_width(quality, 1))) + 
  geom_boxplot(aes(quality, volatile.acidity), colour = "#a11692")

grid.arrange(p1, p2, ncol= 2)

grid.arrange(p3, p4, ncol= 2)

grid.arrange(p5, p6, ncol= 2)

grid.arrange(p7, p8, ncol= 2)

grid.arrange(p9, p10, ncol= 2)

grid.arrange(p11, ncol= 2)

Observations:

  • Chlorides have a minimal effect on wine quality.
  • Citric acid seems to be positively correlated with wine quality: better wines tend to have higher citric acid.
  • Better wines seem to have lower densities. It may be wise not to draw conclusions here, though, because the low density could be due to higher alcohol content, which may be the actual driving factor for better wines.
  • Fixed acidity has almost no effect on quality. The median fixed acidity remains almost unchanged as quality increases.
  • Very low concentrations of free sulfur dioxide are associated with poor wine, and very high concentrations with average wine.
  • Residual sugar has almost no effect on wine quality.
  • Volatile acidity seems to have a negative impact on wine quality: as the volatile acidity level goes up, quality degrades.
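
Observations like these rest on how the median of each variable shifts across quality levels, and that can be tabulated directly instead of being read off the boxplots. To stay self-contained, the sketch below uses only the six rows shown by head(wine) above, so it covers just quality 5 and 6. The name `wine_head` is illustrative; with the full data one would pass `wine` instead.

```r
# The six preview rows from head(wine), restricted to four columns
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# Median of each variable within each quality level
medians <- aggregate(cbind(fixed.acidity, volatile.acidity, alcohol) ~ quality,
                     data = wine_head, FUN = median)
print(medians)
```

Six rows are far too few to support any conclusion; the point is only the shape of the call.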

Let’s view the correlation plot to get more insights about the relationships between different variables.

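# Pairwise correlations of the numeric columns; column 13 is the rating factor added earlier
corMatrix <- cor(wine[, -13])
# Lower-triangle heat map, with variables ordered by the first principal component
corrplot(corMatrix, order = "FPC", method = "color", type = "lower", 
         tl.cex = 0.6, tl.col = 'black')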

emphasize.strong.cells(which(abs(corMatrix) > .3 & corMatrix != 1, arr.ind = TRUE))
pandoc.table(corMatrix)
## 
## ---------------------------------------------------------------------------
##           &nbsp;            fixed.acidity   volatile.acidity   citric.acid 
## -------------------------- --------------- ------------------ -------------
##     **fixed.acidity**             1             -0.2561        **0.6717**  
## 
##    **volatile.acidity**        -0.2561             1           **-0.5525** 
## 
##      **citric.acid**         **0.6717**       **-0.5525**           1      
## 
##     **residual.sugar**         0.1148           0.001918         0.1436    
## 
##       **chlorides**            0.09371           0.0613          0.2038    
## 
##  **free.sulfur.dioxide**       -0.1538          -0.0105         -0.06098   
## 
##  **total.sulfur.dioxide**      -0.1132          0.07647          0.03553   
## 
##        **density**            **0.668**         0.02203        **0.3649**  
## 
##           **pH**             **-0.683**          0.2349        **-0.5419** 
## 
##       **sulphates**             0.183            -0.261        **0.3128**  
## 
##        **alcohol**            -0.06167          -0.2023          0.1099    
## 
##        **quality**             0.1241         **-0.3906**        0.2264    
## ---------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## ------------------------------------------------------------------------------
##           &nbsp;            residual.sugar   chlorides    free.sulfur.dioxide 
## -------------------------- ---------------- ------------ ---------------------
##     **fixed.acidity**           0.1148        0.09371           -0.1538       
## 
##    **volatile.acidity**        0.001918        0.0613           -0.0105       
## 
##      **citric.acid**            0.1436         0.2038          -0.06098       
## 
##     **residual.sugar**            1           0.05561            0.187        
## 
##       **chlorides**            0.05561           1             0.005562       
## 
##  **free.sulfur.dioxide**        0.187         0.005562             1          
## 
##  **total.sulfur.dioxide**       0.203          0.0474         **0.6677**      
## 
##        **density**            **0.3553**       0.2006          -0.02195       
## 
##           **pH**               -0.08565        -0.265           0.07038       
## 
##       **sulphates**            0.005527      **0.3713**         0.05166       
## 
##        **alcohol**             0.04208        -0.2211          -0.06941       
## 
##        **quality**             0.01373        -0.1289          -0.05066       
## ------------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -----------------------------------------------------------------------------
##           &nbsp;            total.sulfur.dioxide     density         pH      
## -------------------------- ---------------------- ------------- -------------
##     **fixed.acidity**             -0.1132           **0.668**    **-0.683**  
## 
##    **volatile.acidity**           0.07647            0.02203       0.2349    
## 
##      **citric.acid**              0.03553          **0.3649**    **-0.5419** 
## 
##     **residual.sugar**             0.203           **0.3553**     -0.08565   
## 
##       **chlorides**                0.0474            0.2006        -0.265    
## 
##  **free.sulfur.dioxide**         **0.6677**         -0.02195       0.07038   
## 
##  **total.sulfur.dioxide**            1               0.07127      -0.06649   
## 
##        **density**                0.07127               1        **-0.3417** 
## 
##           **pH**                  -0.06649         **-0.3417**        1      
## 
##       **sulphates**               0.04295            0.1485        -0.1966   
## 
##        **alcohol**                -0.2057          **-0.4962**     0.2056    
## 
##        **quality**                -0.1851            -0.1749      -0.05773   
## -----------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -------------------------------------------------------------------
##           &nbsp;            sulphates      alcohol       quality   
## -------------------------- ------------ ------------- -------------
##     **fixed.acidity**         0.183       -0.06167       0.1241    
## 
##    **volatile.acidity**       -0.261       -0.2023     **-0.3906** 
## 
##      **citric.acid**        **0.3128**     0.1099        0.2264    
## 
##     **residual.sugar**       0.005527      0.04208       0.01373   
## 
##       **chlorides**         **0.3713**     -0.2211       -0.1289   
## 
##  **free.sulfur.dioxide**     0.05166      -0.06941      -0.05066   
## 
##  **total.sulfur.dioxide**    0.04295       -0.2057       -0.1851   
## 
##        **density**            0.1485     **-0.4962**     -0.1749   
## 
##           **pH**             -0.1966       0.2056       -0.05773   
## 
##       **sulphates**             1          0.09359       0.2514    
## 
##        **alcohol**           0.09359          1        **0.4762**  
## 
##        **quality**            0.2514     **0.4762**         1      
## -------------------------------------------------------------------
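
To see at a glance which variables are most associated with quality, the quality column of the table above can be sorted by absolute value. The sketch below copies those eleven coefficients by hand; with the data loaded, `corMatrix[, "quality"]` (after dropping the self-correlation of 1) would supply the same numbers. The vector name `quality_cor` is illustrative.

```r
# Correlation of each variable with quality, copied from the table above
quality_cor <- c(
  fixed.acidity        =  0.1241,
  volatile.acidity     = -0.3906,
  citric.acid          =  0.2264,
  residual.sugar       =  0.01373,
  chlorides            = -0.1289,
  free.sulfur.dioxide  = -0.05066,
  total.sulfur.dioxide = -0.1851,
  density              = -0.1749,
  pH                   = -0.05773,
  sulphates            =  0.2514,
  alcohol              =  0.4762
)

# Sort by strength of association, ignoring sign
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(ranked, 3))
```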

Observations:

  • As expected, we see moderate to strong correlations among the variables representing acidity: citric acid, pH, volatile acidity and fixed acidity.
  • Volatile acidity has a positive correlation with pH (0.23). But we know that as acidity increases, pH decreases. This apparently paradoxical relationship needs further investigation.
  • Density has a fairly strong positive correlation with fixed acidity (0.67).
  • The variables most strongly correlated with quality are alcohol (0.48) and volatile acidity (-0.39).
  • Alcohol is negatively correlated with density (-0.50). This is consistent with the fact that alcohol is less dense than water.
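
Two of the questions raised in this analysis involve a possible third variable: whether density matters for quality beyond its link to alcohol, and why volatile acidity rises with pH. One common way to probe this is a first-order partial correlation, which can be computed from three pairwise coefficients. The sketch below applies the standard formula to values copied from the table above. The function name `partial_cor` is hypothetical, the choice of fixed acidity as the control variable for the pH question is an assumption, and the results inherit the rounding of the printed coefficients.

```r
# First-order partial correlation of x and y, controlling for z:
# (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
partial_cor <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

# x = quality, y = density, z = alcohol
q_density <- partial_cor(r_xy = -0.1749, r_xz = 0.4762, r_yz = -0.4962)

# x = volatile acidity, y = pH, z = fixed acidity
va_ph <- partial_cor(r_xy = 0.2349, r_xz = -0.2561, r_yz = -0.683)

print(round(c(quality_density = q_density, volatile_pH = va_ph), 3))
```

Both partial correlations come out below 0.1 in magnitude, smaller than the raw coefficients. That is consistent with, though not proof of, the idea that alcohol and fixed acidity account for much of the raw associations.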

(Work still in progress…)