Wine Quality Analysis


Analysis

This analysis examines how various parameters relate to the quality ratings of both red and white wine. The data set was downloaded from the UCI repository https://archive.ics.uci.edu/ml/datasets/Wine+Quality and contains 6497 samples of red and white wine combined.

The analysis will have additional focus on the following key parameters:

pH value

pH is considered a key parameter in determining wine quality, so the analysis focuses on how pH values affect the final quality rating.

SO2 values (Free and Total)

SO2 has long been a debated topic because of the allergic reactions associated with it. This analysis tries to determine the effect of SO2 on pH and on the final quality ratings of the wine samples.

Alcohol content

Alcohol content is an important consideration for a buyer of any alcoholic product. This analysis tries to unravel the relationship of alcohol content with parameters such as pH and SO2 content, and its impact on quality.

Loading the Data and required libraries

library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.1.1
## Loading required package: grid
library(GGally)
## Warning: package 'GGally' was built under R version 3.1.1
wine<-read.csv('redwhitewine.csv')
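
The `str()` output in the next section shows `type` as a factor, which was the `read.csv` default when this report was generated. From R 4.0.0 onward, strings stay as character unless factors are requested. A minimal sketch of requesting them explicitly, using a temporary file with made-up rows (the file and its values are illustrative assumptions, not the real `redwhitewine.csv`):

```r
# Write a tiny CSV to a temporary file (made-up rows, not the real data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("pH,type,quality", "3.51,R,5", "3.20,W,6", "3.16,W,6"), tmp)

# Ask for factors explicitly so `type` is a factor on any R version
toy <- read.csv(tmp, stringsAsFactors = TRUE)
str(toy)

unlink(tmp)
```

On the real file the equivalent call would be `read.csv('redwhitewine.csv', stringsAsFactors = TRUE)`.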

Summary of the Data Set

The data set is summarised below:

dim(wine)
## [1] 6497   13
names(wine)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "type"                
## [13] "quality"
str(wine)
## 'data.frame':    6497 obs. of  13 variables:
##  $ fixed.acidity       : num  4.6 4.7 4.9 5 5 5 5 5 5 5.1 ...
##  $ volatile.acidity    : num  0.52 0.6 0.42 0.38 0.4 0.42 0.74 1.02 1.04 0.42 ...
##  $ citric.acid         : num  0.15 0.17 0 0.01 0.5 0.24 0 0.04 0.24 0 ...
##  $ residual.sugar      : num  2.1 2.3 2.1 1.6 4.3 2 1.2 1.4 1.6 1.8 ...
##  $ chlorides           : num  0.054 0.058 0.048 0.048 0.046 0.06 0.041 0.045 0.05 0.044 ...
##  $ free.sulfur.dioxide : num  8 17 16 26 29 19 16 41 32 18 ...
##  $ total.sulfur.dioxide: num  65 106 42 60 80 50 46 85 96 88 ...
##  $ density             : num  0.993 0.993 0.992 0.991 0.99 ...
##  $ pH                  : num  3.9 3.85 3.71 3.7 3.49 3.72 4.01 3.75 3.74 3.68 ...
##  $ sulphates           : num  0.56 0.6 0.74 0.75 0.66 0.74 0.59 0.48 0.62 0.73 ...
##  $ alcohol             : num  13.1 12.9 14 14 13.6 14 12.5 10.5 11.5 13.6 ...
##  $ type                : Factor w/ 2 levels "R","W": 1 1 1 1 1 1 1 1 1 1 ...
##  $ quality             : int  4 6 7 6 6 8 6 4 5 7 ...
summary(wine)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar 
##  Min.   : 3.80   Min.   :0.08     Min.   :0.000   Min.   : 0.60  
##  1st Qu.: 6.40   1st Qu.:0.23     1st Qu.:0.250   1st Qu.: 1.80  
##  Median : 7.00   Median :0.29     Median :0.310   Median : 3.00  
##  Mean   : 7.21   Mean   :0.34     Mean   :0.319   Mean   : 5.45  
##  3rd Qu.: 7.70   3rd Qu.:0.40     3rd Qu.:0.390   3rd Qu.: 8.10  
##  Max.   :15.90   Max.   :1.58     Max.   :1.660   Max.   :65.80  
##    chlorides     free.sulfur.dioxide total.sulfur.dioxide    density     
##  Min.   :0.009   Min.   :  1.0       Min.   :  6          Min.   :0.987  
##  1st Qu.:0.038   1st Qu.: 17.0       1st Qu.: 77          1st Qu.:0.992  
##  Median :0.047   Median : 29.0       Median :118          Median :0.995  
##  Mean   :0.056   Mean   : 30.5       Mean   :116          Mean   :0.995  
##  3rd Qu.:0.065   3rd Qu.: 41.0       3rd Qu.:156          3rd Qu.:0.997  
##  Max.   :0.611   Max.   :289.0       Max.   :440          Max.   :1.039  
##        pH         sulphates        alcohol     type        quality    
##  Min.   :2.72   Min.   :0.220   Min.   : 8.0   R:1599   Min.   :3.00  
##  1st Qu.:3.11   1st Qu.:0.430   1st Qu.: 9.5   W:4898   1st Qu.:5.00  
##  Median :3.21   Median :0.510   Median :10.3            Median :6.00  
##  Mean   :3.22   Mean   :0.531   Mean   :10.5            Mean   :5.82  
##  3rd Qu.:3.32   3rd Qu.:0.600   3rd Qu.:11.3            3rd Qu.:6.00  
##  Max.   :4.01   Max.   :2.000   Max.   :14.9            Max.   :9.00

Observations from the Summary

  1. Sulfur dioxide (both free and total) varies over a wide range across the samples.
  2. The sample consists of 1599 red wines and 4898 white wines.
  3. Alcohol content varies from 8.0 to 14.9 across the samples.
  4. Quality ratings range from 3 to 9, with a median of 6.
  5. Fixed acidity has a wide range, from a minimum of 3.8 to a maximum of 15.9.
  6. pH varies from 2.72 to 4.01, with a median of 3.21.
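
The ranges above are for red and white wine combined. A per-type view can be sketched in base R as follows; the `toy` data frame uses the dataset's column names but its values are made up for illustration, and `range_by_type` is a hypothetical helper name:

```r
# Toy stand-in for `wine` (made-up values, same column names as the dataset)
toy <- data.frame(
  type                 = factor(c("R", "R", "R", "W", "W", "W")),
  free.sulfur.dioxide  = c(8, 17, 26, 30, 45, 60),
  total.sulfur.dioxide = c(42, 65, 106, 120, 150, 210)
)

# Hypothetical helper: minimum and maximum of one column for each wine type
range_by_type <- function(df, column) {
  out <- sapply(split(df[[column]], df$type), range)
  rownames(out) <- c("min", "max")
  out
}

free_rng <- range_by_type(toy, "free.sulfur.dioxide")
print(free_rng)
```

Calling `range_by_type(wine, "total.sulfur.dioxide")` on the real data would show whether the wide SO2 range comes from one wine type or both.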

Understanding the overall distribution

The distributions of the key parameters pH, total SO2, free SO2 and alcohol are explored for red and white wine as follows

p1<-ggplot(aes(x=pH),
       data =  subset(wine,type %in% c("R")))+
   geom_histogram(color =I('black'),fill = I('#099009'))+
  ggtitle('pH distribution for Red wine')
p2<-ggplot(aes(x=free.sulfur.dioxide),
       data =  subset(wine,type %in% c("R")))+
   geom_histogram(color =I('black'),fill = I('#099009'))+
  ggtitle('Free SO2 distribution for Red wine')
p3<-ggplot(aes(x=total.sulfur.dioxide),
       data =  subset(wine,type %in% c("R")))+
   geom_histogram(color =I('black'),fill = I('#099009'))+
  ggtitle('Total SO2 distribution for Red wine')
p4<-ggplot(aes(x=alcohol),
       data =  subset(wine,type %in% c("R")))+
   geom_histogram(color =I('black'),fill = I('#099009'))+
  ggtitle('Alcohol distribution for Red wine')

grid.arrange(p1,p2,p3,p4,ncol=2)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk PF,Total and Free SO2 ,Alcohol distribution for Red Wine

The following observations can be made from the above plot

  1. pH appears approximately normally distributed, with most samples between 3.0 and 3.5.
  2. Free sulfur dioxide lies roughly between 1 and 100, peaking around the 50 mark.
  3. Total sulfur dioxide is spread between 0 and 300, with a peak around 150. There also seems to be a lower peak around the 50 mark.
  4. Alcohol content varies from 8 to 14, with the major peak around 10 and lower counts between 13 and 14.
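
The `stat_bin` messages printed with the plot appear because no bin width was supplied, so ggplot2 fell back to one thirtieth of the data range. A small sketch of choosing the width explicitly; the `ph` vector is made up, and the width of 0.05 is an assumed value picked only for illustration:

```r
ph <- c(2.9, 3.0, 3.1, 3.2, 3.2, 3.3, 3.3, 3.4, 3.5, 3.9)  # made-up pH values

# The fallback named in the message: range / 30
default_width <- diff(range(ph)) / 30

# An explicit, rounder width (assumed value, tune to the variable)
chosen_width <- 0.05

# If ggplot2 is installed, pass the width to geom_histogram() directly
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(data.frame(pH = ph), ggplot2::aes(x = pH)) +
    ggplot2::geom_histogram(binwidth = chosen_width,
                            color = "black", fill = "#099009")
}
```

Each variable has a different scale, so a width that suits pH will not suit total SO2; one width per histogram is the safer choice.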

A similar distribution analysis for white wine is given below

q1<-ggplot(aes(x=pH),
       data =  subset(wine,type %in% c("W")))+
   geom_histogram(color =I('black'),fill = I('#099009'))+
  ggtitle('pH distribution for White wine')
q2<-ggplot(aes(x=free.sulfur.dioxide),
       data =  subset(wine,type %in% c("W")))+
   geom_histogram(color =I('black'),fill = I('#099009'))+
  ggtitle('Free SO2 distribution for White wine')
q3<-ggplot(aes(x=total.sulfur.dioxide),
       data =  subset(wine,type %in% c("W")))+
   geom_histogram(color =I('black'),fill = I('#099009'))+
  ggtitle('Total SO2 distribution for White wine')
q4<-ggplot(aes(x=alcohol),
       data =  subset(wine,type %in% c("W")))+
   geom_histogram(color =I('black'),fill = I('#099009'))+
  ggtitle('Alcohol distribution for White wine')

grid.arrange(q1,q2,q3,q4,ncol=2)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk PF,Total and Free SO2 ,Alcohol distribution for White Wine

Some observations from the plot are as below:

  1. Here too pH is close to normally distributed, but there are samples with pH values beyond 3.5 and up to 4, although they are few.
  2. Free sulfur dioxide is spread between 0 and 120, with a peak around 50.
  3. Total sulfur dioxide is spread between 0 and 250, with a peak around 150. There also seems to be a lower peak around the 20 mark.
  4. Alcohol content varies from 7 to 14, with the major peak around 10 and a large sample frequency at higher alcohol levels.

The distribution of quality ratings for the red and white types is given below

r1<-ggplot(aes(x=quality),
       data =  subset(wine,type %in% c("W")))+
   geom_histogram(color =I('black'),fill = I('#099009'))+
  ggtitle('Quality distribution for White wine')
r2<-ggplot(aes(x=quality),
       data =  subset(wine,type %in% c("R")))+
   geom_histogram(color =I('black'),fill = I('#099009'))+
  ggtitle('Quality distribution for Red wine')

grid.arrange(r1,r2,ncol=1)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk Quality factor  distribution for Red and White Wine

The quality ratings for red and white wine show similar, roughly normal distributions, except that the white wine distribution peaks at a rating of 6 while the red wine distribution peaks at approximately 5. Only white wines were rated 9 in the given sample.
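
Because quality is an integer rating, the peaks can also be read from a contingency table rather than a histogram. A minimal sketch, assuming a data frame with `type` and `quality` columns; the `toy` values are made up and only mimic the column layout:

```r
# Toy stand-in for `wine` (made-up ratings)
toy <- data.frame(
  type    = factor(c("R", "R", "R", "R", "W", "W", "W", "W", "W")),
  quality = c(5, 5, 6, 4, 6, 6, 7, 5, 9)
)

# Count of samples for every type / rating combination
quality_counts <- table(toy$type, toy$quality)
print(quality_counts)

# Most frequent rating for each wine type
modal_quality <- apply(quality_counts, 1,
                       function(counts) names(counts)[which.max(counts)])
print(modal_quality)
```

On the real data, `table(wine$type, wine$quality)` would also show directly which types received a rating of 9.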

A scatterplot matrix is derived to understand the overall variable behaviour and correlations

Running Scatterplot matrices

library(GGally)
theme_set(theme_minimal(20))
set.seed(2183)
ggpairs(wine[sample.int(nrow(wine),1000),])

plot of chunk Scatterplot Matrix

Scatterplot output indicates the following correlation behaviour

Free SO2

Noticeable positive correlation with total SO2 and residual sugar

Negative correlation with pH, sulphates and alcohol

Total SO2

Positive correlation with free SO2 and residual sugar

Negative correlation with pH, sulphates and alcohol

pH

Positive correlation with sulphates, alcohol and volatile acidity

Negative correlation with total and free SO2, residual sugar, citric acid and fixed acidity

Alcohol

Positive correlation with pH and quality

Negative correlation with density, total and free SO2, and chlorides

Quality

Positive correlation with alcohol

Negative correlation with density, chlorides and volatile acidity

Red and White wine types

Significant disparity in the mean, median and quartile values for residual sugar, citric acid, and fixed and volatile acidity
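
The signs read off the scatterplot matrix can be cross-checked numerically with `cor()`. The sketch below runs on synthetic data in which a negative alcohol/density link and a positive alcohol/quality link are built in by construction, so it only demonstrates the mechanics and says nothing about the real samples:

```r
set.seed(42)
n <- 200

# Synthetic stand-in for `wine`; the relationships are made up on purpose
alcohol <- runif(n, min = 8, max = 14)
toy <- data.frame(
  alcohol = alcohol,
  density = 1.010 - 0.0015 * alcohol + rnorm(n, sd = 0.001),
  quality = round(3 + 0.3 * alcohol + rnorm(n)),
  type    = factor(sample(c("R", "W"), n, replace = TRUE))
)

# cor() needs numeric input, so drop the factor column first
numeric_cols <- vapply(toy, is.numeric, logical(1))
cor_mat <- cor(toy[, numeric_cols])

# Correlation of every numeric variable with alcohol, rounded for reading
print(round(cor_mat["alcohol", ], 2))
```

Applied to the real data frame, the same two lines (`vapply` then `cor`) would give one coefficient per variable pair to set beside the visual impressions listed above.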

Alcohol

The scatterplot matrix indicates a strong positive correlation between alcohol content and quality, without any bias towards red or white wine. It will be interesting to see the distribution of alcohol content across both red and white wine.

ggplot(aes(x=alcohol),data =wine) + 
  geom_density(aes(fill = type))+
  facet_wrap(~quality)+
  ggtitle('Alcohol and  Quality Relationship')

plot of chunk Alcohol content across Wine types and impact on quality

There seems to be no significant bias in alcohol content, even though at quality levels 3 and 5 the red wine samples with higher alcohol content show a higher density reading than white wine.

From the earlier scatterplot matrix, alcohol seems to exhibit a strong correlation with pH.

ggplot(aes(x=alcohol,y=pH),data = wine) + 
  # shape 21 is a filled circle, so colour sets the outline and fill the interior
  geom_jitter(aes(color = type, fill = type), alpha = 1/10, shape = 21, size = 4)+
  facet_wrap(~quality)+
   scale_color_brewer(type = 'div')+
  ggtitle('Alcohol and pH Relationship')

plot of chunk Alcohol and pH relation across Wine types and impact on quality

This shows an interesting pattern: as the quality rating increases, red wine has more samples with higher pH than white wine at similar alcohol content.

Alcohol also exhibited a strong negative correlation with density; a further analysis of this is provided below.

ggplot(aes(x=alcohol,y=density),
       data = wine)+
   geom_line(aes(color=type),stat='summary',fun.y=median)+
  ggtitle('Alcohol content across Wine types')

plot of chunk Alcohol content across Wine types

As expected, density dips as alcohol content increases, and white wine exhibits a more prominent dip.
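
The line in the plot above is a summary: for each wine type it traces the median density at every distinct alcohol value. The same numbers can be tabulated in base R with `aggregate()`. A minimal sketch on made-up rows (the values are illustrative assumptions only):

```r
# Toy stand-in for `wine` (made-up values)
toy <- data.frame(
  type    = c("R", "R", "R", "R", "W", "W", "W", "W"),
  alcohol = c(9.5, 9.5, 12.0, 12.0, 9.5, 9.5, 12.0, 12.0),
  density = c(0.998, 0.996, 0.995, 0.993, 0.997, 0.995, 0.991, 0.989)
)

# Median density for each (alcohol, type) pair, as the summary line computes it
med_density <- aggregate(density ~ alcohol + type, data = toy, FUN = median)
print(med_density)
```

Having the medians as a table makes it possible to compare the size of the dip between the two types, not just its direction.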

The negative correlations of alcohol with total SO2, free SO2 and chlorides are analysed below:

a1<-ggplot(aes(x=alcohol,y=total.sulfur.dioxide),
       data = wine)+
   geom_density(aes(color=type),stat='summary',fun.y=median)
a2<-ggplot(aes(x=alcohol,y=free.sulfur.dioxide),
       data = wine)+
   geom_density(aes(color=type),stat='summary',fun.y=median)
a3<-ggplot(aes(x=alcohol,y=chlorides),
       data = wine)+
   geom_density(aes(color=type),stat='summary',fun.y=median)

grid.arrange(a1,a2,a3,ncol=2)

plot of chunk Alcohol content correlation with Total and Free SO2 and Chlorides

The observations from the above analysis are as follows:

Total SO2

White wine exhibits higher total SO2 content than red wine across all alcohol levels

Total SO2 content decreases with Alcohol content for White wine

Free SO2

Again, white wine exhibits higher free SO2 levels across all alcohol content, though the difference between red and white wine is smaller than for total SO2

The free SO2 content decreases as the alcohol content increases for white wine.

Chloride

Red wine shows a higher chloride content than white wine as alcohol content increases

Chloride content is quite high at low alcohol content, between 8 and 9, then declines steadily until an alcohol level of 13 before spiking.

White wine exhibits lower Chloride levels across Alcohol content levels and holds a steady pattern throughout
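
Per-value medians can be noisy where only a few samples share an alcohol value. One way to steady the comparison is to group alcohol into bins first. A minimal sketch using `cut()`; the one-unit bin width is an assumption, and the `toy` rows are made up:

```r
# Toy stand-in for `wine` (made-up values)
toy <- data.frame(
  type                 = c("R", "R", "W", "W", "W", "W"),
  alcohol              = c(9.2, 11.4, 9.1, 9.8, 11.2, 11.9),
  total.sulfur.dioxide = c(60, 40, 170, 150, 120, 100)
)

# Bins [8,9), [9,10), ... covering the 8 to 14.9 range seen in the summary
toy$alcohol_bin <- cut(toy$alcohol, breaks = seq(8, 15, by = 1), right = FALSE)

# Median total SO2 per bin and wine type (empty combinations are dropped)
binned <- aggregate(total.sulfur.dioxide ~ alcohol_bin + type,
                    data = toy, FUN = median)
print(binned)
```

The same pattern applies to `free.sulfur.dioxide` and `chlorides` by changing the left-hand side of the formula.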

Sulphur Dioxide

The use of SO2 in wine has long been a topic of discussion because of health-related issues. It will be interesting to see the distribution of SO2 across red and white wine and its impact on quality.

Analysis of Free SO2 across the Red and White wine is provided below

ggplot(aes(x = quality, y = free.sulfur.dioxide), data = wine) + 
  geom_point(aes(color=type),alpha=1/4, position = 'jitter')+
  ggtitle(' Free SO2 and  Quality Relationship')

plot of chunk Impact of Free Sulphur Dioxide on Quality

This indicates that, at the same quality rating, white wines include samples with higher free SO2, while red wine samples exhibit lower free SO2 across the quality ratings

Analysis of Total SO2 is provided below

ggplot(aes(x = quality, y = total.sulfur.dioxide), data = wine) + 
  geom_point(aes(color=type),alpha=1/4, position = 'jitter')+
  ggtitle('Total SO2 and Quality Relationship')

plot of chunk Impact of Total Sulphur Dioxide on Quality

The plot again indicates higher total SO2 for the white wine samples than for red wine
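
A jittered scatterplot shows the overlap between types but not the typical level at each rating. A compact quality-by-type table of medians can complement it. A minimal sketch with `tapply()`, on made-up values:

```r
# Toy stand-in for `wine` (made-up values)
toy <- data.frame(
  type                 = c("R", "R", "R", "R", "W", "W", "W", "W"),
  quality              = c(5, 5, 6, 6, 5, 5, 6, 6),
  total.sulfur.dioxide = c(40, 60, 30, 50, 140, 160, 120, 130)
)

# Rows are quality ratings, columns are wine types, cells are median total SO2
med_table <- with(toy, tapply(total.sulfur.dioxide, list(quality, type), median))
print(med_table)
```

Cells come back as `NA` for any rating a type never received, which is worth keeping in mind for the sparsely populated ratings 3 and 9.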

The relationship of total SO2 with sulphates and residual sugar is analysed below:

b1<-ggplot(aes(x=total.sulfur.dioxide,y=sulphates),
       data = wine)+
   geom_density(aes(color=type),stat='summary',fun.y=median)
b2<-ggplot(aes(x=total.sulfur.dioxide,y=residual.sugar),
       data = wine)+
   geom_density(aes(color=type),stat='summary',fun.y=median)

grid.arrange(b1,b2,ncol=2)

plot of chunk Relationship of Sulphate and Residual sugar with Total SO2

The observations from the above analysis are provided below:

Sulphate

The sulphate level is considerably higher for red wine than for white wine, with a large spike around the 150 unit mark of total.sulfur.dioxide.

At a higher total SO2 level of around 250, the sulphate level of white wine is higher.

White wine seems to exhibit a total SO2 level higher than 280 units

Residual Sugar

White wine exhibits a high level of residual sugar around the 250 unit mark of total SO2 compared with red wine, and residual sugar generally seems to be higher beyond the 150 unit level of total SO2

The relationship of free SO2 with sulphates and residual sugar is analysed below:

c1<-ggplot(aes(x=free.sulfur.dioxide,y=sulphates),
       data = wine)+
   geom_density(aes(color=type),stat='summary',fun.y=median)
c2<-ggplot(aes(x=free.sulfur.dioxide,y=residual.sugar),
       data = wine)+
   geom_density(aes(color=type),stat='summary',fun.y=median)

grid.arrange(c1,c2,ncol=2)

plot of chunk Relationship of Sulphate and Residual sugar with Free  SO2

The observations from the above plot are provided below:

Sulphate

The sulphate level is considerably higher for red wine than for white wine.

Red wine does not exhibit a free SO2 level beyond 70 units

Residual Sugar

White wine exhibits a higher level of residual sugar, with peaks around the 150 mark.

Final Plots

Final Plot One

A final comparison is made between red and white wine to understand how the two variants differ in total SO2, free SO2 and pH

s1<-ggplot(aes(x=pH,y=free.sulfur.dioxide),
       data = wine)+
   geom_line(aes(color=type),stat='summary',fun.y=median)
s2<-ggplot(aes(x=pH,y=total.sulfur.dioxide),
       data = wine)+
   geom_line(aes(color=type),stat='summary',fun.y=median)

grid.arrange(s1,s2,ncol=2)

plot of chunk PF,Total and Free SO2 comparison for Red and White Wine

The above plot indicates that white wine exhibits higher SO2 levels than red wine at similar pH across all pH values in the sample. There seems to be higher variation in both SO2 values, for both red and white wines, between pH 3.5 and 4.0. A closer look at this pH interval is given below

t1<-ggplot(aes(x=pH,y=free.sulfur.dioxide),
       data = wine)+
   geom_line(aes(color=type),stat='summary',fun.y=median)+
  xlim(3.5,4.0)
t2<-ggplot(aes(x=pH,y=total.sulfur.dioxide),
       data = wine)+
   geom_line(aes(color=type),stat='summary',fun.y=median)+
  xlim(3.5,4.0)

grid.arrange(t1,t2,ncol=2)
## Warning: Removed 6161 rows containing missing values (stat_summary).
## Warning: Removed 6161 rows containing missing values (stat_summary).

plot of chunk PF,Total and Free SO2 comparison for Red and White Wine for pH values between 3.5 and 4.0

The above plot indicates a peak of 60 units of free SO2 for white wine at a pH of 3.65, and a peak of 41 units for red wine at a pH of 3.75. For total SO2, white wine peaks at around 180 units at a pH of about 3.62, while red wine peaks at around 105 units at a pH of 3.85. Also, only red wines in the sample have a pH beyond 3.85, and the total and free SO2 levels there are low.
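
The "Removed 6161 rows" warnings arise because `xlim()` discards samples outside the limits before the median is computed. The same window can be taken explicitly, which also makes it easy to locate each type's peak. A minimal sketch on made-up rows (values are illustrative assumptions, not read from the dataset):

```r
# Toy stand-in for `wine` (made-up values)
toy <- data.frame(
  type                = c("W", "W", "W", "R", "R", "R", "W", "R"),
  pH                  = c(3.55, 3.65, 3.65, 3.75, 3.75, 3.90, 3.20, 3.30),
  free.sulfur.dioxide = c(30, 55, 65, 38, 44, 10, 35, 12)
)

# Rows outside the window are what xlim() would report as removed
in_window <- toy$pH >= 3.5 & toy$pH <= 4.0
n_dropped <- sum(!in_window)
window    <- toy[in_window, ]

# Median free SO2 at each pH value, per type, inside the window
med <- aggregate(free.sulfur.dioxide ~ pH + type, data = window, FUN = median)

# The row holding the highest median for each type
peaks <- do.call(rbind, lapply(split(med, med$type), function(d) {
  d[which.max(d$free.sulfur.dioxide), ]
}))
print(peaks)
```

Filtering first also makes the number of samples behind each peak visible, which matters when only a few hundred rows remain in the window.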

Final Plot Two

ggplot(aes(x=alcohol,y=free.sulfur.dioxide),
       data = wine)+
   geom_line(aes(color=type),stat='summary',fun.y=median)+
  ggtitle('Alcohol and Free SO2 relationship')

plot of chunk Alcohol and Free SO2 relationship

The above plot indicates that, for the same alcohol content, free SO2 is higher for white wine than for red wine, and that free SO2 decreases quite significantly as alcohol content increases

Final Plot Three

Since pH is considered a key contributor in determining wine quality, a plot is created for both variants as given below

ggplot(aes(x=pH),data =wine) + 
  geom_density(aes(fill = type))+
  facet_wrap(~quality)+
  ggtitle('pH values  relationship with Quality')

plot of chunk pH values  across Wine types and impact on quality

From the above plot, there does not seem to be any specific relationship between pH and quality in terms of spread. However, red wine tends to exhibit a higher pH density than white wine for quality ratings up to 7, while at a rating of 8 the densities are more similar. The quality rating of 9 exhibits a narrower spread of pH, between 3.1 and 3.6.
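
The visual impression of "no specific relationship" can be backed by a rank correlation, which suits an ordinal rating such as quality, together with the pH spread within each rating. The toy data below is deliberately built so that pH is unrelated to quality; it illustrates the calculation only and is not evidence about the real samples:

```r
# Toy data constructed so that pH does not depend on quality (made-up values)
toy <- data.frame(
  quality = rep(c(5, 6, 7), each = 3),
  pH      = rep(c(3.1, 3.2, 3.3), times = 3)
)

# Spearman rank correlation between pH and the quality rating
rho <- cor(toy$pH, toy$quality, method = "spearman")

# Width of the pH range inside each quality rating
ph_spread <- tapply(toy$pH, toy$quality, function(x) diff(range(x)))

print(rho)
print(ph_spread)
```

On the real data, a coefficient near zero alongside similar spreads across ratings would be consistent with the reading of the density plot above.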

Summary

The analysis performed on the sample dataset can be summarised as below:

pH is considered an important parameter in determining wine quality. The analysis of the samples, however, indicates that no specific pH values bias the quality ratings, and that red wine samples tend to show higher pH values than white wine samples at the same quality rating. pH values were, however, found to lie mostly between 3.0 and 3.5. Samples with a pH above 3.5 tend to show higher SO2 values, which may matter to people concerned about SO2-related health issues. Samples with higher alcohol content exhibited lower SO2 levels, and white wine samples exhibited higher SO2 levels than red wine at the same alcohol level.

Some of the learnings from the analysis were as follows:

  1. The understanding that red wine generally contains more SO2 than white wine does not hold for the samples considered. The analysis indicates that white wine exhibits higher SO2 levels.
  2. It always seemed that pH was a key factor in determining wine quality, but in this analysis pH does not exhibit any pattern that could serve as a key deterministic variable for wine quality testing by sensory analysis.
  3. Among the samples analysed, wines with higher alcohol content exhibited lower SO2 content than samples with lower alcohol content.
  4. For the buyer conscious of sugar content, white wine exhibits more residual sugar, and there were spikes in residual sugar for certain ranges of free and total SO2, primarily in white wine.

A limitation of the current analysis is that the data consists of samples collected from a specific region of Portugal. It would be interesting to obtain datasets from various wine-making regions to eliminate any bias created by the specific qualities of the product.