White Wine Data Analysis by Mohit Kumar

The white wine data set used in this report contains 4898 observations with 11 features quantifying the chemical properties of different variants of Portuguese “Vinho Verde” wine from a 2009 source, measured by physicochemical tests. Each property plays an important role in defining the wine’s taste and quality. The quality variable is based on sensory data: it contains the median of the ratings given by at least 3 experts, from 0 (very poor) to 10 (very excellent).

Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)

I’ll analyze and explore the white wine data set to find the factors that are responsible for determining the quality of the wine. I’ll start by visualizing and understanding each variable, then I’ll find the correlation between quality and the other variables. Lastly, I’ll create a linear regression model to predict the quality.

Univariate Plots Section

Data Summary

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

The wine data set contains 13 variables and 4898 observations: 12 variables are numerical and one (quality) is treated as categorical. X is only a row index, so it is not considered a meaningful variable.

Fixed Acidity:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

## [1] 14.2 11.8

The distribution is positively skewed, as the histogram (left) shows, and contains some outliers, which are clearly visible in the box plot. One observation has the value 14.2, while the next-highest value is 11.8. The middle histogram is plotted after omitting the top 0.1% of values.
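The trimming step described above can be sketched in base R as follows. This is a minimal illustration, not the report's actual code: `trim_top` is a hypothetical helper name, and the input vector is just the first ten `fixed.acidity` values from the data summary plus the two largest values reported above.

```r
# Drop the top fraction `p` of a numeric vector before plotting.
# `trim_top` is a hypothetical helper, not taken from the report.
trim_top <- function(x, p) {
  cutoff <- quantile(x, probs = 1 - p, names = FALSE)
  x[x <= cutoff]
}

# First ten fixed.acidity values from the str() output, plus the two
# largest values reported in the text (14.2 and 11.8).
fixed_acidity <- c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1, 14.2, 11.8)

trimmed <- trim_top(fixed_acidity, p = 0.001)
max(trimmed)  # the single extreme value 14.2 has been dropped
```

The same helper with `p = 0.025` or `p = 0.005` would correspond to the other cut-offs used in this section.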

Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The range is small, and the values are measured to up to three decimal places. The distribution looks positively skewed (left histogram) and has a long tail. The data contain some outliers; removing them gives a distribution closer to normal, so the middle histogram is plotted after omitting the top 2.5% of values.

Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

## 
## 0.49 
##  215

## 
## 0.74 
##   41

I found two interesting peaks while analysing the left histogram: one at 0.49 with 215 observations and a second at 0.74 with 41 observations. These might be due to some standard value. I’ve also checked the most frequent value, which is 0.3 with 307 observations and gives the highest peak. Plotting after omitting the top 0.5% of values gives a more readable graph, which also shows that extreme outliers are present.

Residual Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

## 
## 65.8 
##    1

The data are positively skewed, with a high peak near the minimum (0.6), while the first quartile is only 1.7, so 25% of the data lie at or below 1.7. This may be desirable, since low residual sugar indicates that the wine is thoroughly fermented. The distribution is bimodal, which is also visible on the log scale. I then normalized it by mean-centering, i.e. subtracting the mean from the residual sugar values. The data contain one extreme observation (65.8), which is the only value above 32.
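The mean-subtraction and log-scale steps mentioned above can be sketched as follows. This is only an illustration on the first ten `residual.sugar` values from the data summary; the variable names are assumptions, not the report's own.

```r
# First ten residual.sugar values from the str() output (illustration only).
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Mean-centering: subtract the mean from every observation.
# This shifts the values so their mean is zero; the spacing between
# observations is unchanged.
centered <- residual_sugar - mean(residual_sugar)

# Log10 scale, which compresses the long right tail.
log_sugar <- log10(residual_sugar)

summary(centered)
summary(log_sugar)
```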

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The values are measured to three decimal places. The maximum value is 0.346, which is far above the mean (0.04577). The middle histogram is plotted after omitting the top 2.5% of values.

Free Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

## 
## 289 
##   1

This is the amount of free sulfur dioxide left after fermentation. There is one outlier with the extreme value 289. The data are positively skewed, and the middle histogram is plotted after omitting the top 2.5% of values.

Total Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

## [1] 313.0 366.5 307.5 344.0 303.0 440.0

The maximum value is 440, while the third quartile is 167. Some of the high observations are surprising.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The histogram contains small peaks and the range is narrow. The values are measured to 4 decimal places. The maximum value is 1.0390. The distribution is roughly normal, as the box plot shows.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

Here the distance from the first quartile to the median (0.09) is approximately equal to the distance from the median to the third quartile (0.10). I did not find anything surprising in the plots.
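One way to put a number on this kind of quartile comparison is the quartile (Bowley) skewness coefficient, computed from the summary values alone. This is a swapped-in illustration rather than something the report itself computes, and `quartile_skew` is a hypothetical helper name.

```r
# Quartile (Bowley) skewness: 0 for quartiles symmetric around the median,
# positive when the upper half of the box is wider than the lower half.
quartile_skew <- function(q1, med, q3) {
  (q3 + q1 - 2 * med) / (q3 - q1)
}

# Quartiles taken from the summary output in this report.
ph_skew    <- quartile_skew(q1 = 3.09, med = 3.18, q3 = 3.28)
sugar_skew <- quartile_skew(q1 = 1.7,  med = 5.2,  q3 = 9.9)

c(pH = ph_skew, residual.sugar = sugar_skew)
```

A value close to zero for pH is consistent with the near-symmetric box described above, while residual sugar gives a larger positive value.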

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The data are positively skewed. After a log10 transformation, the distribution looks reasonable. The mean is 0.4898 and the median is 0.47.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The histogram contains a number of peaks, but the box plot shows no outliers, which is surprising. The distribution is roughly normal.

Quality

##     Max.   Median  3rd Qu.     Mean  1st Qu.     Min. 
## 9.000000 6.000000 6.000000 5.877909 5.000000 3.000000

##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

## [1] "Values in Percentage"

## 
##     3     4     5     6     7     8     9 
##  0.41  3.33 29.75 44.88 17.97  3.57  0.10

I’ve converted quality from integer to factor, since it is a categorical variable. Most wines are of average quality: around 92.6% of the wines are rated 5, 6 or 7.
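The percentages above can be reproduced from the frequency table alone. The sketch below rebuilds the quality column from the reported counts (an assumption made so the example is self-contained; the report works on the full data frame) and converts it to a factor.

```r
# Counts per quality score, copied from the frequency table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Rebuild the quality column from the counts and convert it to a factor.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)
quality_factor <- factor(quality)

# Share of each score in percent, and the combined share of scores 5 to 7.
quality_pct <- round(100 * quality_counts / sum(quality_counts), 2)
mid_share   <- 100 * sum(quality_counts[c("5", "6", "7")]) / sum(quality_counts)

quality_pct
round(mid_share, 1)
```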

Rating

## 
##    poor average    good 
##     183    3655    1060

## [1] "Values in Percentage"

## 
##    poor average    good 
##    3.74   74.62   21.64

I have grouped the quality scores into three categories: poor (quality 3-4), average (5-6) and good (7-9).
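A grouping like this can be done with base R's `cut()`. The sketch below is one possible way to do it, not necessarily how the report did; the quality column is rebuilt from the frequency table so the example runs on its own.

```r
# Quality scores rebuilt from the reported counts for scores 3 to 9.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Intervals are right-closed by default: (2,4], (4,6], (6,9].
rating <- cut(quality,
              breaks = c(2, 4, 6, 9),
              labels = c("poor", "average", "good"))

table(rating)
round(100 * prop.table(table(rating)), 2)
```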

Univariate Analysis

Structure of dataset:

The dataset contains 4898 observations of 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality). The first 11 features are numerical and reflect the physical and chemical properties of the wine, while the last one (quality) is a categorical variable.

Most of the data describe average wine: 92.6% of the wines are rated 5, 6 or 7.

The main feature(s) of interest:

The main features I’m interested in are “quality” and “alcohol”. I’ll continue my analysis to find out which other factors determine quality.

Other interesting features and relationships:

I think alcohol and acidity (fixed, volatile and citric acid) will affect quality. Higher acid content also leads to a lower pH value, so I expect to find a relationship there. Density should be affected by alcohol content and residual sugar, since sugar is denser than water and alcohol is less dense, so I may find something there too. Free and total sulfur dioxide are closely related, so they may correlate with quality in similar ways, but this is only a possibility that I’ll confirm after analysing these features.

New variables created:

I’ve created new categorical variable “rating” and used following unions:
-Quality 3-4 : Poor
-Quality 5-6 : Average
-Quality 7-9 : Good

Unusual distribution and new calculations performed:

-Alcohol, pH and density are approximately normally distributed with few outliers.
-Fixed acidity, free sulfur dioxide and total sulfur dioxide show positively skewed distributions with a short tail and a number of outliers.
-Volatile acidity and chlorides are positively skewed with a long tail and contain outliers.
-Residual sugar shows a bimodal distribution with an extreme outlier.
-I’ve normalized residual sugar by mean-centering (i.e. subtracting the mean from all observations). The distribution of residual sugar is bimodal on the log scale; on the centered linear scale it is positively skewed.
-I’ve plotted the histogram of the log10 of sulphates, which changes the shape of the distribution.

Bivariate Plots Section

Correlation Plot

Correlation Plot: I didn’t observe any strong correlation between quality and the other variables. Most variables are weakly correlated with quality.
- Weakly correlated: alcohol (0.44), density (0.31, negative), chlorides (0.21, negative), volatile acidity (0.19, negative)
- The remaining variables do not show any considerable correlation.

Some stronger correlations are also found between the independent variables.
- Density and Residual Sugar (.84)
- Density and Alcohol (.78, negative)
- Density and Total Sulfur Dioxide (.53)
- Alcohol and Residual Sugar (0.45,negative)
- Alcohol and Total Sulfur Dioxide (.45,negative)
- Alcohol and Chlorides (0.36, negative)
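Pairwise coefficients like the ones listed above are Pearson correlations, which base R computes with `cor()`. The sketch below uses only the first ten rows shown in the data summary, so its values will not match the coefficients reported for the full 4898 rows; the data frame name `wine_head` is an assumption.

```r
# First ten rows of three columns, copied from the str() output.
# Illustration only: ten rows cannot reproduce the full-data coefficients.
wine_head <- data.frame(
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pearson correlation matrix between all pairs of columns.
cor_matrix <- cor(wine_head, method = "pearson")
round(cor_matrix, 2)
```

Even on this small sample the sign of the alcohol and residual sugar correlation is negative, in line with the direction reported above.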

Box Plot of Quality vs other features

There are many outliers in the data. This may also be the reason I didn’t find any strong correlation between quality and the other variables.

Alcohol tends to increase with the quality of the wine, while density tends to decrease.

It is hard to interpret the other variables, as they contain a high number of outliers and, as the correlation matrix shows, they are either weakly correlated or uncorrelated with quality.

Scatter Plot between Residual Sugar and other features

The Density vs Residual Sugar plot shows a positive correlation, and the Alcohol vs Residual Sugar plot shows a negative correlation. Here, all plots shown are scaled.

Scatter Plot between Total Sulfur Dioxide and other features

Total Sulfur Dioxide vs Free Sulfur Dioxide and Total Sulfur Dioxide vs Density both show a positive correlation. The remaining plots show little relationship with the other variables. Here, all scatter plots are scaled.

Scatter Plot between Density and other features

Density vs Residual Sugar and Density vs Total Sulfur Dioxide show a positive correlation, while Density vs Alcohol shows a negative trend. There are also some extreme outliers.

Scatter Plot between Alcohol and other features

Alcohol and chlorides don’t show any clear relationship. The data are noisier than in the other plots. Here, the plots are scaled views.

Density Plot of Density and Alcohol by Rating

Some interpretations can be drawn:
-Most good wines have low density and high alcohol content. There is also one small peak at 9 in alcohol and at 0.99875 in density.
- The average and poor wine curves nearly overlap. This might be because both also depend on other features.

Bivariate Analysis

New Relationships found:

The relationships I have observed are:
-Quality is positively correlated with alcohol (0.44).
-Quality is negatively correlated with density (-0.31), chlorides (-0.21), volatile acidity (-0.19) and total sulfur dioxide (-0.17).

Inter-relationships among other features:

I’ve found several strong relationships between the other features. They’re listed below.

Positive correlation is noted between density and residual sugar (0.84), total sulfur dioxide and free sulfur dioxide (0.62), and density and total sulfur dioxide (0.53).
Negative correlation is noted between alcohol and density (-0.78) and between alcohol and residual sugar (-0.45).

Strongest relationship

It’s between Residual Sugar and Density.

Multivariate Plots Section

Scatter Plot between Alcohol and Density by Quality and Rating

I’ve plotted density vs alcohol, faceted by rating. The alcohol concentration differs between ratings: average wines are most frequent in the range 8.5-11, while good wines are most frequent in the range 10-13.

Scatter Plot between Free and Total Sulfur Dioxide by Quality and Rating

This shows that good wines mostly have low total and free sulfur dioxide values. Here, the view is scaled.

Scatter Plot between Density and other features by Quality and Rating

Here, the left plots are colored by quality and the right ones by rating. It seems that good wine has low residual sugar, low total sulfur dioxide and high alcohol. Poor wine has high sulfur dioxide content and less alcohol.

Scatter Plot between Alcohol and other features by Quality and Rating

The plots against alcohol don’t help much in interpreting the results. Wine quality seems to be affected mostly by alcohol concentration rather than by the other plotted factors. The poor wines have high chloride content. Here, views are scaled for all plots.

Linear Regression Model

The exploratory data analysis doesn’t show any strong linear relationship between wine quality and the other features. Some weak relationships were observed with alcohol, density, chlorides and volatile acidity. I’ll fit linear regression models with these features, together with residual sugar.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine_data)
## m2: lm(formula = quality ~ alcohol + density, data = wine_data)
## m3: lm(formula = quality ~ alcohol + density + chlorides, data = wine_data)
## m4: lm(formula = quality ~ alcohol + density + chlorides + residual.sugar, 
##     data = wine_data)
## m5: lm(formula = quality ~ alcohol + density + chlorides + residual.sugar + 
##     volatile.acidity, data = wine_data)
## 
## ==========================================================================================
##                          m1            m2            m3            m4            m5       
## ------------------------------------------------------------------------------------------
##   (Intercept)           2.582***    -22.492***    -21.150***     87.563***     73.271***  
##                        (0.098)       (6.165)       (6.162)      (12.392)      (11.999)    
##   alcohol               0.313***      0.360***      0.343***      0.237***      0.283***  
##                        (0.009)       (0.015)       (0.015)       (0.018)       (0.018)    
##   density                            24.728***     23.671***    -84.931***    -70.514***  
##                                      (6.079)       (6.074)      (12.340)      (11.949)    
##   chlorides                                        -2.382***     -1.776**      -0.692     
##                                                    (0.558)       (0.555)       (0.540)    
##   residual.sugar                                                  0.052***      0.052***  
##                                                                  (0.005)       (0.005)    
##   volatile.acidity                                                             -2.044***  
##                                                                                (0.110)    
## ------------------------------------------------------------------------------------------
##   R-squared             0.190         0.192         0.195         0.212         0.264     
##   adj. R-squared        0.190         0.192         0.195         0.211         0.263     
##   sigma                 0.797         0.796         0.795         0.787         0.760     
##   F                  1146.395       583.290       396.315       328.736       351.293     
##   p                     0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -5839.391     -5831.127     -5822.011     -5771.696     -5603.301     
##   Deviance           3112.257      3101.773      3090.247      3027.406      2826.235     
##   AIC               11684.782     11670.255     11654.021     11555.391     11220.603     
##   BIC               11704.272     11696.241     11686.504     11594.371     11266.079     
##   N                  4898          4898          4898          4898          4898         
## ==========================================================================================
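The AIC and BIC rows of the table follow from the log-likelihood row, which gives a quick way to cross-check the output. The sketch below assumes the usual convention for `lm` fits, where the parameter count is the number of coefficients plus one for the residual variance; the vector names are illustrative.

```r
n <- 4898  # number of observations (row N of the table)

# Log-likelihoods copied from the table above.
log_lik <- c(m1 = -5839.391, m2 = -5831.127, m3 = -5822.011,
             m4 = -5771.696, m5 = -5603.301)

# Parameters per model: coefficients (incl. intercept) + residual variance.
n_par <- c(m1 = 3, m2 = 4, m3 = 5, m4 = 6, m5 = 7)

aic <- -2 * log_lik + 2 * n_par
bic <- -2 * log_lik + log(n) * n_par

round(rbind(AIC = aic, BIC = bic), 3)
```

Both criteria decrease from m1 to m5, so by these measures m5 is the preferred model among the five, even though its fit is still weak.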

Looking over the statistics of the linear models, the R-squared of the best model (m5) is only 0.264, so it explains about 26% of the variance in quality and its predictions will have a high error rate.

This model is not suitable to predict quality.
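To see where the R-squared and sigma rows come from, they can be recomputed from the Deviance row (the residual sum of squares for a linear model) and the total sum of squares of quality. The sketch below rebuilds the quality column from the frequency table earlier in the report; the variable names are assumptions.

```r
# Quality scores rebuilt from the reported counts for scores 3 to 9.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Total sum of squares around the mean quality.
tss <- sum((quality - mean(quality))^2)

# Residual sums of squares (Deviance row) for the first and last model.
rss <- c(m1 = 3112.257, m5 = 2826.235)

# R-squared: share of the total variation explained by the model.
r_squared <- 1 - rss / tss

# Residual standard error: sqrt(RSS / (n - number of coefficients)).
n_coef <- c(m1 = 2, m5 = 6)
sigma  <- sqrt(rss / (length(quality) - n_coef))

round(rbind(r_squared, sigma), 3)
```

A sigma of about 0.76 on a scale where most wines are rated 5, 6 or 7 means the typical prediction error is close to one full quality point.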

Multivariate Analysis

Newly observed relationship:

The relationships observed are among wine quality, alcohol and density. Density and residual sugar also show a strong relationship.

Linear Model and Its efficiency:

I’ve created a linear regression model. Judging by its statistics (adjusted R-squared = 0.26), the model doesn’t predict quality well. I might need more variables with a higher correlation with quality. The number of observations of physicochemical properties is also limited (4898), and more observations may be needed to improve the model.
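The adjusted R-squared quoted above is the plain R-squared penalised for the number of predictors. The sketch below applies the standard formula to the rounded values from the regression table; `adjusted_r2` is a hypothetical helper name.

```r
# Adjusted R-squared for a model with p predictors fitted on n observations.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# m4 has 4 predictors and m5 has 5; R-squared values are from the table.
adj_m4 <- adjusted_r2(r2 = 0.212, n = 4898, p = 4)
adj_m5 <- adjusted_r2(r2 = 0.264, n = 4898, p = 5)

round(c(m4 = adj_m4, m5 = adj_m5), 3)
```

With 4898 observations and at most five predictors the penalty is tiny, which is why the adjusted and unadjusted values in the table are nearly identical.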

Final Plots and Summary

Plot One

Description One

The plot visualizes the relationship between wine quality and the physicochemical property alcohol. It is a bit surprising that wine with quality rating 3 has a relatively high alcohol content. This is because wine quality also depends on factors other than alcohol. It appears that experts gave higher ratings to wines with higher alcohol concentration.

Looking at the statistics behind the plot makes the interpretation clearer. In terms of the share of wines per rating, wines rated 3 make up only 0.41% of the data, while wines rated 6 make up 44.88%.

Here are some statistical values:

Alcohol (% by volume)	Quality 3	Quality 4	Quality 5	Quality 6	Quality 7	Quality 8	Quality 9
Minimum value	8.00	8.40	8.00	8.50	8.60	8.50	10.40
First quartile	9.55	9.40	9.20	9.60	10.60	11.00	12.40
Median	10.45	10.10	9.50	10.50	11.40	12.00	12.50
Third quartile	11.00	10.75	10.30	11.40	12.30	12.60	12.70
Maximum value	12.60	13.50	13.60	14.00	14.20	14.00	12.90
Mean	10.35	10.15	9.81	10.58	11.37	11.64	12.18
Percentage of all wines	0.41	3.33	29.75	44.88	17.97	3.57	0.10
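As a consistency check on the table, the per-quality means weighted by the number of wines in each group should give back the overall mean alcohol reported in the data summary (10.51). A minimal sketch, with the counts taken from the quality frequency table and illustrative variable names:

```r
# Mean alcohol per quality score (row "Mean" of the table above).
mean_alcohol <- c(10.35, 10.15, 9.81, 10.58, 11.37, 11.64, 12.18)

# Number of wines per quality score 3 to 9, from the frequency table.
n_wines <- c(20, 163, 1457, 2198, 880, 175, 5)

# Group means weighted by group size approximate the overall mean.
overall_mean <- weighted.mean(mean_alcohol, w = n_wines)
overall_mean
```

The result agrees with the overall mean up to the rounding of the group means, and it shows how strongly the large quality 5 and 6 groups dominate that figure.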

Plot Two

Description Two

Density and alcohol are found to be related to quality. Good wine has lower density and higher alcohol concentration. This is also scientifically plausible: alcohol has a low density, so a higher alcohol content leads to a wine with lower density.

Two peaks can be seen in both the density and the alcohol plot for good wine. The poor and average wine curves nearly overlap, which might be because wine quality depends not only on density and alcohol but also on other features.

We can consider that a wine with low density and high alcohol content has a higher probability of tasting good.

Plot Three

Description Three

The above plots reveal relationships between the physicochemical properties and alcohol. Good wine tends to have low density, low total sulfur dioxide, low residual sugar and a high alcohol value. Poor wine has higher density and residual sugar. High sugar content gives the wine a sweeter taste, which appears not to be preferred.

Sulfur dioxide is mostly undetectable at low concentrations, but when the free sulfur dioxide concentration exceeds 50 ppm it becomes evident in the nose and taste of the wine. This may also be a reason why poor wine has a high total sulfur dioxide content.

Alcohol and density are negatively correlated, so an increase in alcohol comes with a decrease in density. This is consistent with good wine having high alcohol and therefore low density.

Reflection

The quality of wine depends only weakly on alcohol and density, although I had expected them to play an important role in defining quality. I didn’t find any strong correlations between the features and wine quality, but I was able to find some interesting relationships among the other physicochemical measurements, such as between alcohol and density and between residual sugar and density. There are also many outliers, which reduce the efficiency of the linear model. After completing the analysis, I can say that good wine tends to have lower density, higher alcohol, less residual sugar and less sulfur dioxide.

The dataset is based on only 4898 observations of white variants of Portuguese “Vinho Verde” wine from a 2009 source. The conclusions I’ve shared are based only on this dataset. There are a number of outliers, which means I would have to recheck them against the original data source or repeat the tests in a controlled environment, and both are almost impossible. I’m also using a sample to describe the population, which is not reliable because the dataset still lacks features that help define wine quality. Adding features such as the grape source, region and name of the wine would improve the dataset and help decide which wine is better.