The scope of this analysis is to understand the relationship between the various parameters that impact the quality ratings of both Red and White wine. The data set used for the analysis was downloaded from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Wine+Quality) and consists of 6,497 samples of Red and White wine combined.
The analysis will have additional focus on the following key parameters:
pH value is considered to be a key parameter in determining the quality of wine, and hence the analysis focuses on the impact of pH on the final quality rating.
SO2 has always been a debated topic because of the allergic reactions associated with it. The current analysis tries to determine the impact of SO2 on pH values and on the final quality ratings of the wine samples.
Alcohol content is an important parameter that buyers consider when purchasing any alcoholic product. This analysis tries to unravel the relationship of Alcohol content with parameters like pH and SO2 content, and its impact on quality.
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.1.1
## Loading required package: grid
library(GGally)
## Warning: package 'GGally' was built under R version 3.1.1
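# stringsAsFactors = TRUE keeps 'type' a factor on R >= 4.0, matching the str() output below
wine <- read.csv('redwhitewine.csv', stringsAsFactors = TRUE)
A summary of the data set is given below: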
dim(wine)
## [1] 6497 13
names(wine)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "type"
## [13] "quality"
str(wine)
## 'data.frame': 6497 obs. of 13 variables:
## $ fixed.acidity : num 4.6 4.7 4.9 5 5 5 5 5 5 5.1 ...
## $ volatile.acidity : num 0.52 0.6 0.42 0.38 0.4 0.42 0.74 1.02 1.04 0.42 ...
## $ citric.acid : num 0.15 0.17 0 0.01 0.5 0.24 0 0.04 0.24 0 ...
## $ residual.sugar : num 2.1 2.3 2.1 1.6 4.3 2 1.2 1.4 1.6 1.8 ...
## $ chlorides : num 0.054 0.058 0.048 0.048 0.046 0.06 0.041 0.045 0.05 0.044 ...
## $ free.sulfur.dioxide : num 8 17 16 26 29 19 16 41 32 18 ...
## $ total.sulfur.dioxide: num 65 106 42 60 80 50 46 85 96 88 ...
## $ density : num 0.993 0.993 0.992 0.991 0.99 ...
## $ pH : num 3.9 3.85 3.71 3.7 3.49 3.72 4.01 3.75 3.74 3.68 ...
## $ sulphates : num 0.56 0.6 0.74 0.75 0.66 0.74 0.59 0.48 0.62 0.73 ...
## $ alcohol : num 13.1 12.9 14 14 13.6 14 12.5 10.5 11.5 13.6 ...
## $ type : Factor w/ 2 levels "R","W": 1 1 1 1 1 1 1 1 1 1 ...
## $ quality : int 4 6 7 6 6 8 6 4 5 7 ...
summary(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.80 Min. :0.08 Min. :0.000 Min. : 0.60
## 1st Qu.: 6.40 1st Qu.:0.23 1st Qu.:0.250 1st Qu.: 1.80
## Median : 7.00 Median :0.29 Median :0.310 Median : 3.00
## Mean : 7.21 Mean :0.34 Mean :0.319 Mean : 5.45
## 3rd Qu.: 7.70 3rd Qu.:0.40 3rd Qu.:0.390 3rd Qu.: 8.10
## Max. :15.90 Max. :1.58 Max. :1.660 Max. :65.80
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.009 Min. : 1.0 Min. : 6 Min. :0.987
## 1st Qu.:0.038 1st Qu.: 17.0 1st Qu.: 77 1st Qu.:0.992
## Median :0.047 Median : 29.0 Median :118 Median :0.995
## Mean :0.056 Mean : 30.5 Mean :116 Mean :0.995
## 3rd Qu.:0.065 3rd Qu.: 41.0 3rd Qu.:156 3rd Qu.:0.997
## Max. :0.611 Max. :289.0 Max. :440 Max. :1.039
## pH sulphates alcohol type quality
## Min. :2.72 Min. :0.220 Min. : 8.0 R:1599 Min. :3.00
## 1st Qu.:3.11 1st Qu.:0.430 1st Qu.: 9.5 W:4898 1st Qu.:5.00
## Median :3.21 Median :0.510 Median :10.3 Median :6.00
## Mean :3.22 Mean :0.531 Mean :10.5 Mean :5.82
## 3rd Qu.:3.32 3rd Qu.:0.600 3rd Qu.:11.3 3rd Qu.:6.00
## Max. :4.01 Max. :2.000 Max. :14.9 Max. :9.00
A distribution analysis of the key parameters pH, Total SO2, Free SO2 and Alcohol is explored for Red and White wine as follows:
p1<-ggplot(aes(x=pH),
data = subset(wine,type %in% c("R")))+
geom_histogram(color =I('black'),fill = I('#099009'))+
ggtitle('pH distribution for Red wine')
p2<-ggplot(aes(x=free.sulfur.dioxide),
data = subset(wine,type %in% c("R")))+
geom_histogram(color =I('black'),fill = I('#099009'))+
ggtitle('Free SO2 distribution for Red wine')
p3<-ggplot(aes(x=total.sulfur.dioxide),
data = subset(wine,type %in% c("R")))+
geom_histogram(color =I('black'),fill = I('#099009'))+
ggtitle('Total SO2 distribution for Red wine')
p4<-ggplot(aes(x=alcohol),
data = subset(wine,type %in% c("R")))+
geom_histogram(color =I('black'),fill = I('#099009'))+
ggtitle('Alcohol distribution for Red wine')
grid.arrange(p1,p2,p3,p4,ncol=2)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
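The `stat_bin` messages above appear because no bin width was supplied to `geom_histogram`. A minimal sketch of setting one explicitly is shown here. The helper name `ph_hist`, the bin width of 0.05 and the toy data frame are illustrative assumptions, not values taken from the original analysis.

```r
library(ggplot2)

# Toy stand-in for the wine data (made-up values, for illustration only).
toy <- data.frame(
  pH   = c(3.10, 3.20, 3.25, 3.30, 3.40, 3.50),
  type = factor(c("R", "R", "R", "W", "W", "W"))
)

# Hypothetical helper: pH histogram for one wine type with an explicit
# bin width, so ggplot2 does not fall back to its default binning.
ph_hist <- function(data, wine_type, binwidth = 0.05) {
  rows <- data[data$type == wine_type, , drop = FALSE]
  ggplot(rows, aes(x = pH)) +
    geom_histogram(binwidth = binwidth, color = "black", fill = "#099009") +
    ggtitle(paste("pH distribution for wine type", wine_type))
}

p <- ph_hist(toy, "R")
```

With the real data, `ph_hist(wine, "R")` would be the equivalent call. A suitable bin width depends on the variable's range, so each of the four panels would likely need its own value.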
The following observations are obtained from the above plot
A similar distribution analysis for White wine is given below:
q1<-ggplot(aes(x=pH),
data = subset(wine,type %in% c("W")))+
geom_histogram(color =I('black'),fill = I('#099009'))+
ggtitle('pH distribution for White wine')
q2<-ggplot(aes(x=free.sulfur.dioxide),
data = subset(wine,type %in% c("W")))+
geom_histogram(color =I('black'),fill = I('#099009'))+
ggtitle('Free SO2 distribution for White wine')
q3<-ggplot(aes(x=total.sulfur.dioxide),
data = subset(wine,type %in% c("W")))+
geom_histogram(color =I('black'),fill = I('#099009'))+
ggtitle('Total SO2 distribution for White wine')
q4<-ggplot(aes(x=alcohol),
data = subset(wine,type %in% c("W")))+
geom_histogram(color =I('black'),fill = I('#099009'))+
ggtitle('Alcohol distribution for White wine')
grid.arrange(q1,q2,q3,q4,ncol=2)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Some observations from the plot are as below:
The spread of the quality ratings for the Red and White types is given below:
r1<-ggplot(aes(x=quality),
data = subset(wine,type %in% c("W")))+
geom_histogram(color =I('black'),fill = I('#099009'))+
ggtitle('Quality distribution for White wine')
r2<-ggplot(aes(x=quality),
data = subset(wine,type %in% c("R")))+
geom_histogram(color =I('black'),fill = I('#099009'))+
ggtitle('Quality distribution for Red wine')
grid.arrange(r1,r2,ncol=1)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
The quality ratings for both Red and White wine appear to follow a similar, roughly normal distribution. The White wine distribution peaks at a quality rating of 6, while the Red wine distribution peaks at approximately 5. In this sample, only White wines have been rated with a quality of 9.
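The peak rating per wine type can also be read off a frequency table rather than from the histograms. The sketch below uses base R only. The toy data frame and the name `modal_quality` are assumptions made for illustration, and the numbers are not taken from the real sample.

```r
# Toy stand-in for the wine data (made-up values, for illustration only).
toy <- data.frame(
  type    = factor(c("R", "R", "R", "W", "W", "W", "W")),
  quality = c(5L, 5L, 6L, 6L, 6L, 7L, 9L)
)

# Cross-tabulate wine type against quality rating.
counts <- table(toy$type, toy$quality)

# Most frequent quality rating within each wine type.
modal_quality <- tapply(toy$quality, toy$type, function(q) {
  as.integer(names(which.max(table(q))))
})
```

On the real data, replacing `toy` with `wine` would give the counts behind the two histograms, including how many samples (if any) of each type received a rating of 9.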
A scatterplot matrix is derived to understand the overall variable behaviour and correlations
library(GGally)
theme_set(theme_minimal(20))
set.seed(2183)
ggpairs(wine[sample.int(nrow(wine),1000),])
The scatterplot output indicates the following correlation behaviour:
Noticeable positive correlation with Total SO2 and Residual sugar; negative correlation with pH, Sulphates and Alcohol.
Positive correlation between Free SO2 and Residual sugar; negative correlation with pH, Sulphates and Alcohol.
Positive correlation with Sulphates, Alcohol and Volatile acidity; negative correlation with Total and Free SO2, Residual sugar, Citric acid and acidity (Volatile and Fixed).
Positive correlation with pH and Quality; negative correlation with Density, Total and Free SO2 and Chlorides.
Positive correlation with Alcohol; negative correlation with Density, Chlorides and Volatile acidity.
Significant disparity in the mean, median and q-q values for Residual sugar, Citric acid, Fixed acidity and Volatile acidity.
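The correlations read from the scatterplot matrix can be put into numbers with `cor()`. The following is a minimal sketch on a toy data frame. The helper name `numeric_cor` and all values are assumptions for illustration; the real coefficients would come from running it on `wine`.

```r
# Toy stand-in for the wine data (made-up values, for illustration only).
toy <- data.frame(
  alcohol = c(9.0, 10.0, 11.0, 12.0, 13.0),
  density = c(0.998, 0.997, 0.995, 0.993, 0.991),
  quality = c(5L, 5L, 6L, 6L, 7L),
  type    = factor(c("R", "W", "R", "W", "W"))
)

# Hypothetical helper: correlation matrix over the numeric columns only
# (the 'type' factor is dropped), rounded for easier reading.
numeric_cor <- function(data, method = "pearson") {
  num <- data[vapply(data, is.numeric, logical(1))]
  round(cor(num, method = method), 2)
}

cors <- numeric_cor(toy)
```

Passing `method = "spearman"` gives rank correlations, which may be the safer choice for skewed variables such as Residual sugar.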
The scatterplot matrix indicates a strong positive correlation between Alcohol content and Quality, without any bias towards Red or White wine. It will be interesting to see the distribution of Alcohol content across both Red and White wine.
ggplot(aes(x=alcohol),data =wine) +
geom_density(aes(fill = type))+
facet_wrap(~quality)+
ggtitle('Alcohol and Quality Relationship')
There seems to be no significant bias in the Alcohol content, even though Red wine samples with higher Alcohol content exhibit a higher density reading than White wine for the quality levels of 3 and 5.
From the earlier scatterplot matrix, Alcohol seems to exhibit a strong correlation with pH value.
ggplot(aes(x=alcohol,y=pH),data = wine) +
geom_jitter(aes(color = type, fill = type), alpha = 1/10, shape = 21, size = 4)+
facet_wrap(~quality)+
scale_color_brewer(type = 'div')+
ggtitle('Alcohol and pH Relationship')
This indicates an interesting observation: as the quality rating increases, Red wine has more instances of higher pH values than White wine for similar Alcohol content.
Alcohol also exhibited a strong negative correlation with density, and a further analysis of this is provided below.
ggplot(aes(x=alcohol,y=density),
data = wine)+
# 'fun' replaces the deprecated 'fun.y' argument (ggplot2 >= 3.3.0)
geom_line(aes(color=type),stat='summary',fun=median)+
ggtitle('Alcohol content across Wine types')
As expected, density seems to dip as the Alcohol content increases, and White wine exhibits a more prominent dip.
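The line in the plot above is a summary statistic: for each wine type, the median density is taken at every distinct Alcohol value. The same numbers can be computed directly with `aggregate()`, as in this sketch. The toy data frame is an assumption for illustration and its values are made up.

```r
# Toy stand-in for the wine data (made-up values, for illustration only).
toy <- data.frame(
  alcohol = c(9, 9, 9, 10, 10, 9, 9),
  type    = factor(c("R", "R", "R", "R", "R", "W", "W")),
  density = c(0.996, 0.998, 0.997, 0.994, 0.996, 0.993, 0.995)
)

# Median density for every (alcohol, type) combination; these are the
# points that a median summary line connects for each wine type.
med <- aggregate(density ~ alcohol + type, data = toy, FUN = median)
```

Inspecting a table like `med` alongside the plot makes it easier to tell whether a dip rests on many samples or on a single observation at that Alcohol level.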
The negative correlation of Alcohol with Total SO2, Free SO2 and Chlorides is analysed below:
# Median of each variable per alcohol level, drawn as one line per wine type
a1<-ggplot(aes(x=alcohol,y=total.sulfur.dioxide),
data = wine)+
geom_line(aes(color=type),stat='summary',fun=median)
a2<-ggplot(aes(x=alcohol,y=free.sulfur.dioxide),
data = wine)+
geom_line(aes(color=type),stat='summary',fun=median)
a3<-ggplot(aes(x=alcohol,y=chlorides),
data = wine)+
geom_line(aes(color=type),stat='summary',fun=median)
grid.arrange(a1,a2,a3,ncol=2)
The observations from the above analysis are as follows:
White wine exhibits higher Total SO2 content than Red wine across all Alcohol levels.
Total SO2 content decreases with Alcohol content for White wine.
White wine also exhibits higher Free SO2 levels across all Alcohol content, though the difference between Red and White wine seems to be smaller than the Total SO2 difference.
Free SO2 content decreases as the Alcohol content increases for White wine.
Red wine indicates a higher Chloride content than White wine with increasing Alcohol content.
The Chloride content is quite high at lower Alcohol content, between 8 and 9, but then exhibits a steady reduction up to an Alcohol content of 13 before a spike.
White wine exhibits lower Chloride levels across Alcohol content levels and holds a steady pattern throughout.
The use of SO2 in wines has been a topic of discussion for a long time because of the related health issues. It will be interesting to see the distribution of SO2 across Red and White wine and its final impact on quality.
An analysis of Free SO2 across Red and White wine is provided below:
ggplot(aes(x = quality, y = free.sulfur.dioxide), data = wine) +
geom_point(aes(color=type),alpha=1/4, position = 'jitter')+
ggtitle('Free SO2 and Quality Relationship')
This indicates the existence of White wine samples with higher Free SO2 at the same quality ratings, with the Red wine samples exhibiting lower Free SO2 across the quality ratings.
An analysis of Total SO2 is provided below:
ggplot(aes(x = quality, y = total.sulfur.dioxide), data = wine) +
geom_point(aes(color=type),alpha=1/4, position = 'jitter')+
ggtitle('Total SO2 and Quality Relationship')
The plot again indicates higher Total SO2 for the White wine samples compared to Red wine.
The relationship of Total SO2 with sulphates and residual sugar is analysed below:
b1<-ggplot(aes(x=total.sulfur.dioxide,y=sulphates),
data = wine)+
geom_line(aes(color=type),stat='summary',fun=median)
b2<-ggplot(aes(x=total.sulfur.dioxide,y=residual.sugar),
data = wine)+
geom_line(aes(color=type),stat='summary',fun=median)
grid.arrange(b1,b2,ncol=2)
The observations from the above analysis are provided below:
The sulphate level is quite high for Red wine compared to White wine, with a large spike around the 150 unit mark of Total SO2.
At a higher Total SO2 level of around 250, the sulphate level of White wine is higher.
White wine seems to exhibit Total SO2 levels higher than 280 units.
White wine exhibits a high level of Residual sugar around the 250 unit mark of Total SO2 compared to Red wine, and in general the quantity of Residual sugar seems to be higher beyond the 150 unit level of Total SO2.
The relationship of Free SO2 with sulphates and residual sugar is analysed below:
c1<-ggplot(aes(x=free.sulfur.dioxide,y=sulphates),
data = wine)+
geom_line(aes(color=type),stat='summary',fun=median)
c2<-ggplot(aes(x=free.sulfur.dioxide,y=residual.sugar),
data = wine)+
geom_line(aes(color=type),stat='summary',fun=median)
grid.arrange(c1,c2,ncol=2)
The observations from the above plot are provided below:
The sulphate level is quite high for Red wine compared to White wine.
Red wine does not exhibit a Free SO2 level beyond 70 units.
White wine exhibits a higher level of Residual sugar and peaks around the 150 mark.
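Range statements such as the Free SO2 ceiling for Red wine can be checked without a plot by taking the maximum per wine type. This is a small sketch on made-up data. The toy data frame and the name `max_by_type` are assumptions for illustration.

```r
# Toy stand-in for the wine data (made-up values, for illustration only).
toy <- data.frame(
  type                = factor(c("R", "R", "W", "W")),
  free.sulfur.dioxide = c(10, 68, 35, 120)
)

# Largest Free SO2 value observed for each wine type.
max_by_type <- tapply(toy$free.sulfur.dioxide, toy$type, max)
```

Swapping `max` for `range` or `quantile` in the same call would show how far the bulk of each type sits from its extreme values.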
A final comparison is made between Red and White wine to understand the difference between the two variants in terms of Total SO2, Free SO2 and pH values.
s1<-ggplot(aes(x=pH,y=free.sulfur.dioxide),
data = wine)+
geom_line(aes(color=type),stat='summary',fun=median)
s2<-ggplot(aes(x=pH,y=total.sulfur.dioxide),
data = wine)+
geom_line(aes(color=type),stat='summary',fun=median)
grid.arrange(s1,s2,ncol=2)
The above plot indicates that White wine exhibits higher SO2 components than Red wine for similar pH values, across all pH values within the sample. There seems to be higher variation in both SO2 values, for both Red and White wines, between pH values of 3.5 and 4.0. A closer look at this pH interval is given below:
t1<-ggplot(aes(x=pH,y=free.sulfur.dioxide),
data = wine)+
geom_line(aes(color=type),stat='summary',fun=median)+
xlim(3.5,4.0)
t2<-ggplot(aes(x=pH,y=total.sulfur.dioxide),
data = wine)+
geom_line(aes(color=type),stat='summary',fun=median)+
xlim(3.5,4.0)
grid.arrange(t1,t2,ncol=2)
## Warning: Removed 6161 rows containing missing values (stat_summary).
## Warning: Removed 6161 rows containing missing values (stat_summary).
The above plot indicates a high peak of 60 units of Free SO2 for White wine at a pH value of 3.65, while Red wine peaks at 41 units at a pH value of 3.75. For Total SO2, White wine peaks at around 180 units at a pH of around 3.62, while Red wine exhibits a peak of around 105 units at a pH of 3.85. It is also observed that only Red wines in the sample have a pH value beyond 3.85, and the Total and Free SO2 levels at this pH are low.
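Peak positions like those quoted above can be located programmatically instead of being read off the chart. The sketch below restricts the data to a pH window, takes the median per pH value and wine type, and keeps the highest median for each type. The helper name `peak_in_window` and the toy data are assumptions for illustration only.

```r
# Toy stand-in for the wine data (made-up values, for illustration only).
toy <- data.frame(
  pH                  = c(3.4, 3.6, 3.6, 3.7, 3.6, 3.8),
  type                = factor(c("W", "W", "W", "W", "R", "R")),
  free.sulfur.dioxide = c(90, 50, 70, 40, 20, 30)
)

# Hypothetical helper: within [lo, hi], find the pH at which the median of
# 'value_col' is largest, separately for each wine type.
peak_in_window <- function(data, value_col, lo = 3.5, hi = 4.0) {
  win <- data[data$pH >= lo & data$pH <= hi, , drop = FALSE]
  med <- aggregate(win[[value_col]],
                   by = list(pH = win$pH, type = win$type),
                   FUN = median)
  names(med)[3] <- "median_value"
  groups <- split(med, med$type, drop = TRUE)
  do.call(rbind, lapply(groups, function(g) g[which.max(g$median_value), ]))
}

peaks <- peak_in_window(toy, "free.sulfur.dioxide")
```

Calling `peak_in_window(wine, "total.sulfur.dioxide")` would be the analogous check for Total SO2 on the real data.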
ggplot(aes(x=alcohol,y=free.sulfur.dioxide),
data = wine)+
geom_line(aes(color=type),stat='summary',fun=median)+
ggtitle('Alcohol and Free SO2 relationship')
The above plot indicates that, for the same Alcohol content, the Free SO2 content is higher for White wine than for Red wine, and that Free SO2 decreases quite significantly as the Alcohol content increases.
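The strength of this decreasing trend can be summarised per wine type with a rank correlation. The sketch below is one possible way to do it. The helper name `cor_by_type`, the choice of Spearman's method and the toy values are all assumptions for illustration, not part of the original analysis.

```r
# Toy stand-in for the wine data (made-up values, for illustration only).
toy <- data.frame(
  type                = factor(c("R", "R", "R", "W", "W", "W", "W")),
  alcohol             = c(9, 10, 11, 9, 10, 11, 12),
  free.sulfur.dioxide = c(20, 15, 10, 50, 45, 30, 35)
)

# Hypothetical helper: correlation between two columns, computed
# separately within each wine type.
cor_by_type <- function(data, x, y, method = "spearman") {
  sapply(split(data, data$type), function(g) {
    cor(g[[x]], g[[y]], method = method)
  })
}

rho <- cor_by_type(toy, "alcohol", "free.sulfur.dioxide")
```

A value close to -1 for a type would support the decreasing pattern seen in the plot, while a value near 0 would suggest the visual trend depends on a few Alcohol levels only.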
Since pH value is considered to be a key contributor in determining the quality of wine, an analysis plot is created for both variants as given below:
ggplot(aes(x=pH),data =wine) +
geom_density(aes(fill = type))+
facet_wrap(~quality)+
ggtitle('pH values relationship with Quality')
From the above analysis plot, there doesn't seem to be any specific relationship between pH values and quality in terms of the spread. However, Red wine tends to exhibit a higher pH value density than White wine for quality ratings up to 7, while the quality rating of 8 has more similar density values. The quality rating of 9 exhibits a narrower spread of pH values, between 3.1 and 3.6.
The analysis performed on the sample dataset can be summarised as below:
pH value is considered an important parameter when determining the quality of wine. The analysis of the samples, however, indicates that no specific pH values bias the quality ratings, although a higher density of Red wine samples showed higher pH values than White wine samples for the same quality ratings. The pH value was nevertheless found to be optimum between 3.0 and 3.5. A pH value higher than 3.5 tends to be associated with higher SO2 values, which can be a concern for people with SO2-related health issues. Samples with higher Alcohol content exhibited lower SO2 levels, and White wine samples exhibited higher levels of SO2 components than Red wine for the same level of Alcohol.
Some of the learnings from the analysis were as follows:
A limitation of the current analysis is that the data consists of samples collected from a specific region of Portugal. It will be interesting to obtain datasets from various wine-making regions to eliminate any bias created by any specific qualities of the product.