Wine is an alcoholic drink made from fermented grapes. Yeast consumes the sugar in the grapes and converts it to ethanol, carbon dioxide, and heat. Different varieties of grapes and strains of yeasts produce different styles of wine. These variations result from the complex interactions between the biochemical development of the grape, the reactions involved in fermentation, the terroir, and the production process. Many countries enact legal appellations intended to define styles and qualities of wine. These typically restrict the geographical origin and permitted varieties of grapes, as well as other aspects of wine production. Wines not made from grapes include rice wine and fruit wines such as plum, cherry, pomegranate, currant and elderberry. Source: Wikipedia
The data for the following analysis come from the UCI Machine Learning Repository, which provides two datasets covering the red and white variants of the Portuguese “Vinho Verde” wine. This analysis uses the red wine file. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g., there is no data about grape types, wine brand, or selling price).
First, load the required packages and the red wine dataset.
library(dplyr)
library(gridExtra)
library(knitr)
library(kableExtra)
library(ggplot2)
library(corrplot)
library(GGally)
library(pander)
wine <- read.csv("winequality-red.csv", sep = ';')
summary(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
kable(head(wine), "html") %>%
  kable_styling("striped") %>%
  scroll_box(width = "100%")
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
p1 <- ggplot(wine) + geom_histogram(aes(alcohol), color="black", fill="#ce2d4f")
p2 <- ggplot(wine) + geom_histogram(aes(chlorides), color="black", fill="#ce6d8b")
p3 <- ggplot(wine) + geom_histogram(aes(citric.acid), color="black", fill="#cebbc9")
p4 <- ggplot(wine) + geom_histogram(aes(density), color="black", fill="#4056f4")
p5 <- ggplot(wine) + geom_histogram(aes(fixed.acidity), color="black", fill="#470ff4")
p6 <- ggplot(wine) + geom_histogram(aes(free.sulfur.dioxide), color="black", fill="#e54b4b")
p7 <- ggplot(wine) + geom_histogram(aes(pH), color="black", fill="#ffa987")
p8 <- ggplot(wine) + geom_histogram(aes(quality), color="black", fill="#c8d5b9")
p9 <- ggplot(wine) + geom_histogram(aes(residual.sugar), color="black", fill="#4a7c59")
p10 <- ggplot(wine) + geom_histogram(aes(sulphates), color="black", fill="#c4b7cb")
p11 <- ggplot(wine) + geom_histogram(aes(total.sulfur.dioxide), color="black", fill="#98e2c6")
p12 <- ggplot(wine) + geom_histogram(aes(volatile.acidity), color="black", fill="#06bee1")
grid.arrange(p1, p2, p3, p4, ncol= 2)
grid.arrange(p5, p6, p7, p8, ncol= 2)
grid.arrange(p9, p10, p11, p12, ncol= 2)
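The twelve histogram calls above differ only in the column name and fill colour, so they could also be generated from a small helper. The following is a minimal sketch, not the original author's code: `make_hist` and `toy` are hypothetical names, and `toy` holds three columns from the six rows printed by `head(wine)` so that the example runs on its own. Setting `bins` explicitly also avoids the message that `geom_histogram()` prints when it falls back to its default of 30 bins.

```r
library(ggplot2)
library(gridExtra)

# Stand-in data: three columns from the six rows shown by head(wine).
# In the real session you would pass `wine` itself.
toy <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66)
)

# Hypothetical helper: draw one histogram for the column named `var`.
make_hist <- function(data, var, bins = 30) {
  ggplot(data, aes(x = .data[[var]])) +
    geom_histogram(bins = bins, colour = "black", fill = "grey70") +
    labs(x = var)
}

# One plot per column, arranged two per row.
plots <- lapply(names(toy), make_hist, data = toy)
grid.arrange(grobs = plots, ncol = 2)
```

With the full data, `lapply(names(wine), make_hist, data = wine)` would produce one histogram per variable. The trade-off is a single fill colour unless a vector of colours is passed in as well.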
Observations:
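# Bucket the numeric quality score into three rating levels
wine$rating[wine$quality <= 5] <- 'Average'
wine$rating[wine$quality > 5 & wine$quality < 8] <- 'Good'
wine$rating[wine$quality >= 8] <- 'Excellent'
wine$rating <- as.factor(wine$rating)
# Make 'Average' the reference (first) level
wine$rating <- relevel(wine$rating, 'Average')
# qplot() is deprecated in recent ggplot2 releases; ggplot() + geom_bar() draws the same bar chart
ggplot(wine, aes(x = rating)) + geom_bar()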
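The same three-way bucketing can be written in one step with base R's `cut()`. This is a sketch of an alternative, not the code used in this analysis. The `quality` vector here is a hypothetical set of scores spanning the observed range (3 to 8), and the equivalence with the assignments above assumes integer scores.

```r
# Hypothetical scores covering the observed range of `quality` (3 to 8).
quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are right-closed by default: (-Inf, 5], (5, 7], (7, Inf)
rating <- cut(
  quality,
  breaks = c(-Inf, 5, 7, Inf),
  labels = c("Average", "Good", "Excellent")
)

table(rating)
```

One difference worth noting: `cut()` keeps the levels in the order given (Average, Good, Excellent), whereas `as.factor()` followed by `relevel()` yields Average, Excellent, Good. The former puts the bars of a rating plot in their natural order.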
Now we will explore each variable with respect to quality.
p1 <- ggplot(wine, aes(group = cut_width(quality, 1))) +
  geom_boxplot(aes(quality, alcohol), colour = "#9cb380")
p2 <- ggplot(wine, aes(group = cut_width(quality, 1))) +
  geom_boxplot(aes(quality, chlorides), colour = "#522a27")
p3 <- ggplot(wine, aes(group = cut_width(quality, 1))) +
  geom_boxplot(aes(quality, citric.acid), colour = "#c73e1d")
p4 <- ggplot(wine, aes(group = cut_width(quality, 1))) +
  geom_boxplot(aes(quality, density), colour = "#c59849")
p5 <- ggplot(wine, aes(group = cut_width(quality, 1))) +
  geom_boxplot(aes(quality, fixed.acidity), colour = "#1f2041")
p6 <- ggplot(wine, aes(group = cut_width(quality, 1))) +
  geom_boxplot(aes(quality, free.sulfur.dioxide), colour = "#4b3f72")
p7 <- ggplot(wine, aes(group = cut_width(quality, 1))) +
  geom_boxplot(aes(quality, pH), colour = "#417b5a")
p8 <- ggplot(wine, aes(group = cut_width(quality, 1))) +
  geom_boxplot(aes(quality, residual.sugar), colour = "#417b5a")
p9 <- ggplot(wine, aes(group = cut_width(quality, 1))) +
  geom_boxplot(aes(quality, sulphates), colour = "#17183b")
p10 <- ggplot(wine, aes(group = cut_width(quality, 1))) +
  geom_boxplot(aes(quality, total.sulfur.dioxide), colour = "#a11692")
p11 <- ggplot(wine, aes(group = cut_width(quality, 1))) +
  geom_boxplot(aes(quality, volatile.acidity), colour = "#a11692")
grid.arrange(p1, p2, ncol= 2)
grid.arrange(p3, p4, ncol= 2)
grid.arrange(p5, p6, ncol= 2)
grid.arrange(p7, p8, ncol= 2)
grid.arrange(p9, p10, ncol= 2)
grid.arrange(p11, ncol= 2)
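The boxplots can be complemented with a numeric summary per quality score. Below is a minimal sketch using `dplyr`, which is loaded at the top of this analysis. The `toy` data frame is a stand-in built from the six rows printed by `head(wine)` so the example runs on its own, and the names `toy` and `medians` are illustrative assumptions rather than part of the original code.

```r
library(dplyr)

# Stand-in data: three columns from the six rows shown by head(wine).
# In the real session you would start the pipeline from `wine`.
toy <- data.frame(
  quality          = c(5, 5, 5, 6, 5, 5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66)
)

# One row per quality score: group size and the median of each variable.
medians <- toy %>%
  group_by(quality) %>%
  summarise(
    n                       = n(),
    median.alcohol          = median(alcohol),
    median.volatile.acidity = median(volatile.acidity)
  )

print(medians)
```

The medians correspond to the centre lines of the boxplots, and the `n` column shows how many wines sit behind each box, which matters because the extreme quality scores tend to have few observations.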
Observations:
Let’s view the correlation plot to get more insights about the relationships between different variables.
# Drop the categorical rating column (column 13) before computing correlations
corMatrix <- cor(wine[, names(wine) != "rating"])
corrplot(corMatrix, order = "FPC", method = "color", type = "lower",
         tl.cex = 0.6, tl.col = 'black')
emphasize.strong.cells(which(abs(corMatrix) > .3 & corMatrix != 1, arr.ind = TRUE))
pandoc.table(corMatrix)
##
## ---------------------------------------------------------------------------
## fixed.acidity volatile.acidity citric.acid
## -------------------------- --------------- ------------------ -------------
## **fixed.acidity** 1 -0.2561 **0.6717**
##
## **volatile.acidity** -0.2561 1 **-0.5525**
##
## **citric.acid** **0.6717** **-0.5525** 1
##
## **residual.sugar** 0.1148 0.001918 0.1436
##
## **chlorides** 0.09371 0.0613 0.2038
##
## **free.sulfur.dioxide** -0.1538 -0.0105 -0.06098
##
## **total.sulfur.dioxide** -0.1132 0.07647 0.03553
##
## **density** **0.668** 0.02203 **0.3649**
##
## **pH** **-0.683** 0.2349 **-0.5419**
##
## **sulphates** 0.183 -0.261 **0.3128**
##
## **alcohol** -0.06167 -0.2023 0.1099
##
## **quality** 0.1241 **-0.3906** 0.2264
## ---------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## ------------------------------------------------------------------------------
## residual.sugar chlorides free.sulfur.dioxide
## -------------------------- ---------------- ------------ ---------------------
## **fixed.acidity** 0.1148 0.09371 -0.1538
##
## **volatile.acidity** 0.001918 0.0613 -0.0105
##
## **citric.acid** 0.1436 0.2038 -0.06098
##
## **residual.sugar** 1 0.05561 0.187
##
## **chlorides** 0.05561 1 0.005562
##
## **free.sulfur.dioxide** 0.187 0.005562 1
##
## **total.sulfur.dioxide** 0.203 0.0474 **0.6677**
##
## **density** **0.3553** 0.2006 -0.02195
##
## **pH** -0.08565 -0.265 0.07038
##
## **sulphates** 0.005527 **0.3713** 0.05166
##
## **alcohol** 0.04208 -0.2211 -0.06941
##
## **quality** 0.01373 -0.1289 -0.05066
## ------------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------------------------------------
## total.sulfur.dioxide density pH
## -------------------------- ---------------------- ------------- -------------
## **fixed.acidity** -0.1132 **0.668** **-0.683**
##
## **volatile.acidity** 0.07647 0.02203 0.2349
##
## **citric.acid** 0.03553 **0.3649** **-0.5419**
##
## **residual.sugar** 0.203 **0.3553** -0.08565
##
## **chlorides** 0.0474 0.2006 -0.265
##
## **free.sulfur.dioxide** **0.6677** -0.02195 0.07038
##
## **total.sulfur.dioxide** 1 0.07127 -0.06649
##
## **density** 0.07127 1 **-0.3417**
##
## **pH** -0.06649 **-0.3417** 1
##
## **sulphates** 0.04295 0.1485 -0.1966
##
## **alcohol** -0.2057 **-0.4962** 0.2056
##
## **quality** -0.1851 -0.1749 -0.05773
## -----------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -------------------------------------------------------------------
## sulphates alcohol quality
## -------------------------- ------------ ------------- -------------
## **fixed.acidity** 0.183 -0.06167 0.1241
##
## **volatile.acidity** -0.261 -0.2023 **-0.3906**
##
## **citric.acid** **0.3128** 0.1099 0.2264
##
## **residual.sugar** 0.005527 0.04208 0.01373
##
## **chlorides** **0.3713** -0.2211 -0.1289
##
## **free.sulfur.dioxide** 0.05166 -0.06941 -0.05066
##
## **total.sulfur.dioxide** 0.04295 -0.2057 -0.1851
##
## **density** 0.1485 **-0.4962** -0.1749
##
## **pH** -0.1966 0.2056 -0.05773
##
## **sulphates** 1 0.09359 0.2514
##
## **alcohol** 0.09359 1 **0.4762**
##
## **quality** 0.2514 **0.4762** 1
## -------------------------------------------------------------------
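To read the table with quality in mind, it can help to rank the variables by the absolute value of their correlation with `quality`. The sketch below copies the values from the last column of the table above so that it runs on its own. In the session itself the same vector would be `corMatrix[, "quality"]` with the `quality` entry removed. The names `quality_cor`, `ranked` and `strong` are illustrative.

```r
# Correlations with quality, copied from the last column of the table above.
quality_cor <- c(
  fixed.acidity        = 0.1241,
  volatile.acidity     = -0.3906,
  citric.acid          = 0.2264,
  residual.sugar       = 0.01373,
  chlorides            = -0.1289,
  free.sulfur.dioxide  = -0.05066,
  total.sulfur.dioxide = -0.1851,
  density              = -0.1749,
  pH                   = -0.05773,
  sulphates            = 0.2514,
  alcohol              = 0.4762
)

# Sort by strength of association, ignoring the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(ranked, 2))

# Variables that clear the 0.3 threshold used for emphasis in the table.
strong <- names(ranked)[abs(ranked) > 0.3]
print(strong)
```

Only alcohol (positive) and volatile acidity (negative) exceed the 0.3 threshold, with sulphates and citric acid next in line. These are linear correlations, so they say nothing about non-linear relationships.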
Observations: