The Wine Quality dataset consists of red wine samples, with 1,599 observations in total. This is an observational study. All of the explanatory variables (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, and alcohol) are numerical. The relationships between these variables are explored throughout this study.
The data is available on Kaggle.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. Available at: http://dx.doi.org/10.1016/j.dss.2009.05.016
#Load packages
library(tidyverse)
library(gridExtra)
library(GGally)
library(scales)
library(lattice)
library(MASS)
library(memisc)
#Read the data from the CSV file.
wine <- read.csv('wineQualityReds.csv')
names(wine)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
glimpse(wine)
## Observations: 1,599
## Variables: 13
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ fixed.acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,...
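The statement that every explanatory variable is numerical can be checked directly, along with a count of missing values. The following is a minimal sketch in base R: `check_columns` is a hypothetical helper, and `toy` is a small made-up data frame standing in for `wine` so that the snippet runs without the CSV file.

```r
# Hypothetical helper: report, per column, whether it is numeric
# and how many missing values it contains.
check_columns <- function(df) {
  data.frame(
    column     = names(df),
    is_numeric = vapply(df, is.numeric, logical(1)),
    n_missing  = vapply(df, function(col) sum(is.na(col)), integer(1)),
    row.names  = NULL
  )
}

# Toy stand-in for `wine` (values are illustrative, not taken from the dataset).
toy <- data.frame(
  alcohol = c(9.4, 9.8, NA),
  pH      = c(3.51, 3.20, 3.26),
  quality = c(5L, 5L, 6L)
)

report <- check_columns(toy)
print(report)
```

On the real data the same call would be `check_columns(wine)`.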
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at levels that are too high can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
grid.arrange(qplot(wine$fixed.acidity),
qplot(wine$volatile.acidity),
qplot(wine$citric.acid),
qplot(wine$residual.sugar),
qplot(wine$chlorides),
qplot(wine$free.sulfur.dioxide),
qplot(wine$total.sulfur.dioxide),
qplot(wine$density),
qplot(wine$pH),
qplot(wine$sulphates),
qplot(wine$alcohol),
qplot(wine$quality),
ncol = 3)
Let’s begin by exploring the dataset one variable at a time.
summary(wine$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
ggplot(aes(x = quality, color = I('white')), data = wine)+
geom_bar()+ scale_x_continuous(breaks = seq(3, 8, 1))
From the plot we can see that quality level 5 is the most common, and that more than 90% of the wines fall between level 5 and level 7.
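The share of wines in a given quality range can also be computed rather than read off the bar chart. This is a small sketch in base R; `pct_in_range` is a hypothetical helper and `toy_quality` is a made-up vector used only so that the snippet is self-contained.

```r
# Hypothetical helper: percentage of values in x that lie in [lo, hi].
pct_in_range <- function(x, lo, hi) {
  mean(x >= lo & x <= hi) * 100
}

# Toy quality scores (illustrative only, not the real data).
toy_quality <- c(3, 5, 5, 5, 6, 6, 6, 7, 7, 8)

counts <- table(toy_quality)             # number of wines per quality level
share  <- pct_in_range(toy_quality, 5, 7)  # percentage of scores from 5 to 7

print(counts)
print(share)
```

With the real data, `pct_in_range(wine$quality, 5, 7)` would give the figure referred to in the text.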
ggplot(aes(x=residual.sugar),data=wine)+
geom_histogram()
Most of the red wines do not contain much residual sugar.
ggplot(aes(x=alcohol),data=wine)+
geom_histogram()
ggplot(aes(x=pH),data=wine)+
geom_histogram()
summary(wine$fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
ggplot(aes(x=fixed.acidity),data=wine)+
geom_histogram()+
scale_x_continuous(breaks = seq(0,16,2))
summary(wine$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
ggplot(aes(x=volatile.acidity),data=wine)+
geom_histogram()+
scale_x_continuous(breaks = seq(0,1.6,0.2))
ggplot(aes(x=density),data=wine)+
geom_histogram()
summary(wine$sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
ggplot(aes(x=sulphates),data=wine)+
geom_histogram()+
scale_x_continuous(breaks = seq(0.3,1.3,0.1),
limits = c(0.3,1.3))
summary(wine$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
ggplot(aes(x = citric.acid, color = I('white')), data = wine)+
geom_histogram(binwidth = 0.01)+
scale_x_continuous(breaks = seq(0, 1, 0.05))
sum(wine$citric.acid > 0.0 & wine$citric.acid < 0.5) /
nrow(wine) * 100
## [1] 78.61163
sum(wine$citric.acid == 0)
## [1] 132
summary(wine$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
ggplot(aes(x = chlorides), data = wine) +
geom_histogram(bins = 50) +
scale_x_continuous(breaks = seq(0, 0.7, 0.1))
ggplot(aes(x = chlorides, color = I('white')), data = wine) +
geom_histogram(bins = 30)+
scale_x_continuous(breaks = seq(0.01, 0.12, 0.01), limits= c(0.03, 0.12))
First, let’s plot the correlations of all variables against each other.
wine$quality <- as.numeric(wine$quality)
ggcorr(wine, geom = "circle", nbreaks = 5)
# scatter plot of each pair of variables
theme_set(theme_minimal(20))
wine_sub <- wine[,c(2:13)]
wine_sample <- wine_sub[sample.int(nrow(wine_sub),1000),]
ggpairs(wine_sample,upper = list(continuous = wrap("cor", size=3)))
We can see that some pairs of variables are noticeably correlated, such as total.sulfur.dioxide and free.sulfur.dioxide, volatile.acidity and fixed.acidity, total.sulfur.dioxide and fixed.acidity, and fixed.acidity and pH.
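Reading the strongest pairs off a large scatter-plot matrix is error-prone, so it can help to list the pairwise correlations in order of strength. The sketch below uses only base R; `rank_correlations` is a hypothetical helper, and `toy` is a made-up data frame standing in for the wine data.

```r
# Hypothetical helper: list each pair of variables once, sorted by the
# absolute value of their Pearson correlation (strongest first).
rank_correlations <- function(df) {
  cm  <- cor(df, use = "pairwise.complete.obs")
  idx <- which(upper.tri(cm), arr.ind = TRUE)  # upper triangle: no diagonal, no duplicates
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  pairs[order(-abs(pairs$r)), ]
}

# Toy data (illustrative only): b is an exact linear function of a,
# d is a set of arbitrary values.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 4, 6, 8, 10, 12),
  d = c(3, 1, 4, 1, 5, 9)
)

ranked <- rank_correlations(toy)
print(ranked)
```

For the data in this report, the call would be something like `rank_correlations(wine[, 2:13])`, which excludes the index column `X` in the same way as `wine_sub`.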
ggplot(aes(x=free.sulfur.dioxide,y=total.sulfur.dioxide),data=wine)+
geom_point()
As mentioned above, there is a strong correlation between the two variables.
ggplot(aes(x=sulphates,y=pH),data=wine)+
geom_point(alpha=1/10,position = 'jitter')
cor.test(wine$sulphates,wine$pH)
##
## Pearson's product-moment correlation
##
## data: wine$sulphates and wine$pH
## t = -8.015, df = 1597, p-value = 2.107e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2433231 -0.1490634
## sample estimates:
## cor
## -0.1966476
There is a weak negative correlation (r ≈ -0.20) between the two variables.
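The figures in the `cor.test` output for sulphates and pH can be reproduced by hand from the sample correlation and the number of observations. The sketch below does this in base R, using the standard t statistic for a Pearson correlation and a confidence interval based on the Fisher z-transformation; the values of `r` and `n` are copied from the output and from the dataset description.

```r
r  <- -0.1966476   # sample correlation reported by cor.test
n  <- 1599         # number of observations in the dataset
df <- n - 2        # degrees of freedom (1597, as in the output)

# t statistic for the null hypothesis that the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci, 4))
print(p_value)
```

The very small p-value reflects the large sample size; the interval, which stays well below 0.3 in absolute value, is the better guide to how weak the relationship is.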
However, what we are really interested in is the relationship between quality and the other variables, so let’s look at those next.
ggplot(aes(x=quality,y=volatile.acidity),data=wine)+
geom_point(position = 'jitter')+
geom_smooth(method = 'lm',color='red')
cor.test(wine$quality,wine$volatile.acidity)
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
Higher-quality red wines tend to have lower volatile.acidity; the correlation is moderate and negative (r ≈ -0.39).
ggplot(aes(x=quality,y=citric.acid),data=wine)+
geom_point(position = 'jitter')+
geom_smooth(method = 'lm',color='red')
cor.test(wine$quality,wine$citric.acid)
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
Higher-quality red wines tend to have more citric.acid, although the correlation is weak (r ≈ 0.23).
ggplot(aes(x=quality,y=density),data=wine)+
geom_point(position = 'jitter')+
geom_smooth(method = 'lm',color='red')
cor.test(wine$quality,wine$density)
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
Higher-quality red wines tend to have slightly lower density; the correlation is weak (r ≈ -0.17).
ggplot(aes(x=quality,y=sulphates),data=wine)+
geom_point(position = 'jitter')+
geom_smooth(method = 'lm',color='red')
cor.test(wine$quality,wine$sulphates)
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
Higher-quality red wines tend to have more sulphates, although the correlation is fairly weak (r ≈ 0.25).
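The tests above use Pearson's correlation, which measures linear association. Because quality is an integer score, a rank-based measure such as Spearman's correlation is a common alternative; it is not what the analysis above uses, and is shown here only as a sketch for comparison. The toy vectors are made up so that the snippet runs on its own.

```r
# Toy data (illustrative only): sulphates rise with quality,
# but not along a straight line.
toy_quality   <- c(3, 4, 5, 6, 7, 8)
toy_sulphates <- c(0.40, 0.45, 0.50, 0.60, 0.90, 1.60)

# Pearson measures how close the points are to a straight line.
pearson <- cor(toy_quality, toy_sulphates, method = "pearson")

# Spearman works on ranks, so any strictly increasing relationship gives 1.
spearman <- cor(toy_quality, toy_sulphates, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

On the real data, `cor.test(wine$quality, wine$sulphates, method = "spearman", exact = FALSE)` would give the corresponding rank-based test; `exact = FALSE` avoids the warning that R issues when the data contain ties.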
Let’s try a different way of exploring the relationship.
#### 3.27 quality VS alcohol
ggplot(aes(x=factor(quality),y=alcohol),data=wine)+
geom_point(position = 'jitter',alpha=1/5)+
geom_boxplot(alpha=1/2,color='green')+
stat_summary(fun = 'mean',geom='point',color='red')+
geom_smooth(method = 'lm',aes(group=1))
The plot shows that higher-quality red wines tend to have a higher alcohol content.
ggplot(aes(x=factor(quality),y=chlorides),data=wine)+
geom_point(position = 'jitter',alpha=1/5)+
geom_boxplot(alpha=1/2,color='green')+
stat_summary(fun = 'mean',geom = 'point',color='red')+
geom_smooth(method = 'lm',aes(group=1))
Higher-quality red wines appear to have slightly lower chlorides.
All of the physicochemical properties provided were needed to understand the interactions between them. We are trying to determine ‘Quality’, which is a sensory perception, from physicochemical properties that interact in complex ways to produce the desired flavors. That being said, the analysis of these interactions would have been more fruitful if the data had also included flavor categories, sensory preferences, or a profile that maps to quality. The dataset also contains few or no samples of very low-quality or very high-quality wines, which biases the analysis.
Overall, this analysis showed that wines with higher levels of alcohol and citric acid tend to have higher quality, while those with high levels of total sulfur dioxide or volatile acidity tend to have lower quality. However, a dataset with more randomly sampled examples and an approximately uniform distribution of wine quality would allow a better analysis.