The white wine data set used in this report contains 4898 observations with 11 features quantifying the chemical properties of different variants of Portuguese “Vinho Verde” wine, taken from a 2009 source and measured by physicochemical tests. Each property plays an important role in defining the taste and quality of the wine. The quality variable is based on sensory data and contains the median of the ratings given by at least 3 experts, from 0 (very poor) to 10 (excellent).
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
I’ll explore the white wine data set to find the factors responsible for determining wine quality. I’ll start by visualizing and understanding each variable, then look at the correlation between quality and the other variables. Lastly, I’ll create a linear regression model to predict quality.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
The wine data set contains 13 variables and 4898 observations. Twelve of them hold numerical values and one (quality) is treated as categorical. X is only a row index, so it is not considered a meaningful variable.
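A minimal sketch of dropping the index column before analysis. The data frame below is a toy stand-in built from the first three rows of a few columns in the `str()` output above; with the real file one would load it with `read.csv()` instead (the file name is not given in this report, so it is left out).

```r
# Toy stand-in: first three rows of a few columns, copied from the str() output.
wine_data <- data.frame(
  X             = 1:3,
  fixed.acidity = c(7.0, 6.3, 8.1),
  alcohol       = c(8.8, 9.5, 10.1),
  quality       = c(6L, 6L, 6L)
)

# X only repeats the row number, so drop it before computing statistics.
wine_data$X <- NULL

names(wine_data)
```

Removing the column up front keeps it out of later correlation matrices and model formulas.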
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
## [1] 14.2 11.8
The distribution is positively skewed, as shown by the histogram (left), and contains some outliers, which can be seen clearly in the box plot. One observation has the value 14.2, while the next highest value is 11.8. The middle histogram is plotted after limiting the values, i.e. omitting the top 0.1%.
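The trimming step can be sketched as follows. The vector is a toy stand-in (an assumption, not the real column): 999 evenly spaced values up to 11.8 plus the single extreme value 14.2 mentioned above. The variable names are hypothetical.

```r
# Toy vector: 999 evenly spaced values up to 11.8 plus one extreme value.
fixed_acidity <- c(seq(3.8, 11.8, length.out = 999), 14.2)

# Two largest values, to see how isolated the maximum is.
top_two <- sort(fixed_acidity, decreasing = TRUE)[1:2]

# Keep everything at or below the 99.9th percentile for plotting.
cutoff  <- quantile(fixed_acidity, 0.999)
trimmed <- fixed_acidity[fixed_acidity <= cutoff]

# Number of values dropped by the cutoff.
length(fixed_acidity) - length(trimmed)
```

The cutoff is applied only to the values that are plotted; the underlying data set keeps every observation.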
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The range is small, and the values are measured to at most three decimal places. The distribution looks positively skewed (histogram at left) with a long tail. It contains some outliers; with these removed the distribution looks closer to normal, so the middle histogram is plotted after omitting the top 2.5% of values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
##
## 0.49
## 215
##
## 0.74
## 41
I found two interesting peaks while analysing this histogram (left): one at 0.49 with 215 observations and a second at 0.74 with 41 observations. These might be due to some standard value. I also checked the most frequent value, 0.3, which has 307 observations and gives the highest peak. Plotting after omitting the top 0.5% of values gives a more readable graph, which also shows that extreme outliers are present.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
##
## 65.8
## 1
The data is positively skewed, with a high peak at the low end of the range (the minimum is 0.6), and the first quartile is only 1.7, so 25% of the wines are at or below 1.7. This might be desirable, as low residual sugar indicates that the wine is well fermented. The distribution is bimodal, which is also visible on a log scale. I also centered the values by subtracting the mean. The data contains one extreme observation (65.8), which is the only value above 32.
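A short sketch of the difference between the two transformations mentioned here, using a handful of toy values (an assumption, not the real measurements). Subtracting the mean shifts the values so their mean is zero but leaves the spread and shape unchanged, while a log10 transform compresses the long right tail.

```r
# Toy right-skewed values, chosen for illustration only.
residual_sugar <- c(0.6, 1.1, 1.7, 5.2, 9.9, 14.0, 20.7, 65.8)

# Centering: subtract the mean. Location changes, spread and shape do not.
centered <- residual_sugar - mean(residual_sugar)

# Log transform: compresses the long right tail.
logged <- log10(residual_sugar)

c(sd_raw = sd(residual_sugar), sd_centered = sd(centered))
range(logged)
```

Because centering only moves the histogram along the x-axis, a log scale is usually the more informative view of a skewed variable like this one.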
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Values are measured to at most three decimal places. The maximum value is 0.346, far above the mean (0.04577). The middle histogram is plotted after omitting the top 2.5% of values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
##
## 289
## 1
This is the amount of sulfur dioxide left after fermentation. There is one outlier with the extreme value 289. The data is positively skewed, and the middle histogram is plotted after omitting the top 2.5% of values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
## [1] 313.0 366.5 307.5 344.0 303.0 440.0
The maximum value is 440, while the third quartile is 167. There are some surprisingly high observations.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The histogram shows small peaks and the range is narrow. The values are measured to 4 decimal places. The maximum value is 1.0390. The distribution is roughly normal, as shown in the box plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The distance from the first quartile to the median (0.09) is approximately equal to the distance from the median to the third quartile (0.10). I did not find anything surprising in the plots.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The data is positively skewed. After a log10 transformation, the distribution looks reasonable. The mean is 0.4898 and the median is 0.47.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The histogram contains a number of peaks, but the box plot shows no outliers, which is surprising. The distribution is roughly normal.
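The absence of outliers can be checked by hand from the summary above, assuming the box plot uses the conventional 1.5 × IQR whisker rule (an assumption; plotting functions differ slightly in how they compute the quartiles).

```r
# Quartiles and extremes copied from the summary() output for alcohol.
q1 <- 9.5
q3 <- 11.4
alc_min <- 8.0
alc_max <- 14.2

iqr   <- q3 - q1
lower <- q1 - 1.5 * iqr   # lower whisker limit
upper <- q3 + 1.5 * iqr   # upper whisker limit

# TRUE means no observation can fall outside the whiskers.
no_outliers <- alc_min >= lower && alc_max <= upper
c(lower = lower, upper = upper)
no_outliers
```

The limits come out at 6.65 and 14.25, so both the minimum (8.0) and the maximum (14.2) sit inside them, which is consistent with a box plot that shows no outlier points.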
## Max. Median 3rd Qu. Mean 1st Qu. Min.
## 9.000000 6.000000 6.000000 5.877909 5.000000 3.000000
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
## [1] "Values in Percentage"
##
## 3 4 5 6 7 8 9
## 0.41 3.33 29.75 44.88 17.97 3.57 0.10
I’ve converted quality from integer to factor, as it is a categorical variable. Most wines are of average taste: around 92.6% of the wines have a rating of 5, 6 or 7.
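The percentages can be reproduced from the counts printed above. This sketch rebuilds a quality vector from those counts rather than reading the original file; the variable names are hypothetical.

```r
# Rebuild the quality column from the counts printed in the table above.
counts  <- c(20, 163, 1457, 2198, 880, 175, 5)
quality <- factor(rep(3:9, times = counts))

# Share of each rating, in percent.
pct <- round(100 * prop.table(table(quality)), 2)
pct

# Combined share of ratings 5, 6 and 7.
mid_share <- 100 * mean(quality %in% c("5", "6", "7"))
round(mid_share, 1)
```

Ratings 5, 6 and 7 together cover 4535 of the 4898 wines, which is where the 92.6% figure comes from.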
##
## poor average good
## 183 3655 1060
## [1] "Values in Percentage"
##
## poor average good
## 3.74 74.62 21.64
I have grouped the quality ratings into three categories: poor (ratings 3-4), average (5-6) and good (7-9).
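One way to build such a grouping is with `cut()`; this is a sketch, not necessarily how the report's own variable was created. The quality vector is again rebuilt from the printed counts.

```r
# Rebuild the quality scores from the counts printed earlier.
counts  <- c(20, 163, 1457, 2198, 880, 175, 5)
quality <- rep(3:9, times = counts)

# Intervals are (2,4], (4,6] and (6,9], i.e. 3-4, 5-6 and 7-9.
rating <- cut(quality, breaks = c(2, 4, 6, 9),
              labels = c("poor", "average", "good"))
table(rating)
```

The resulting counts (183, 3655 and 1060) match the table above.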
The dataset contains 4898 observations of 12 features (Fixed Acidity, Volatile Acidity, Citric Acid, Residual Sugar, Chlorides, Free Sulfur Dioxide, Total Sulfur Dioxide, Density, pH, Sulphates, Alcohol and Quality). Eleven features are numerical and reflect the physical and chemical properties of the wine, while the last one (Quality) is a categorical variable.
Most of the data describes average wine: 92.6% of wines are rated 5, 6 or 7.
The main features I’m interested in are “quality” and “alcohol”. I’ll continue my analysis to find out which other factors determine quality.
I think alcohol and acidity (fixed, volatile and citric acid) will have an effect on quality. Higher acid content also leads to a lower pH value, so I expect to find a relationship there. Density should be affected by alcohol content and residual sugar, since sugar is denser than water and alcohol is less dense, so I might find something there too. Free and total sulfur dioxide are closely related, so they might have similar correlations with quality, but I can only be sure after analysing these features.
I’ve created a new categorical variable, “rating”, using the following groups:
- Quality 3-4: Poor
- Quality 5-6: Average
- Quality 7-9: Good
- Alcohol, pH and density are approximately normally distributed with few outliers.
- Fixed acidity, free sulfur dioxide and total sulfur dioxide show positively skewed distributions with a small tail and a number of outliers.
- Volatile acidity and chlorides are positively skewed with a long tail and contain outliers.
- Residual sugar shows a bimodal distribution with an extreme outlier.
- I’ve centered residual sugar (i.e. subtracted the mean value from all observations). The distribution of residual sugar is bimodal, and after centering it shows a positively skewed distribution.
- I’ve plotted the histogram of the log10 of sulphates, which changes the shape of the distribution.
Correlation plot: I didn’t observe any strong correlation between quality and any other variable. Most variables are only weakly correlated with quality.
- Weakly correlated: Alcohol (0.44), Density (0.31, negative), Chlorides (0.21, negative), Volatile Acidity (0.19, negative)
- The remaining variables do not show any considerable correlation.
Some strong correlations are also found between independent variables.
- Density and Residual Sugar (.84)
- Density and Alcohol (.78, negative)
- Density and Total Sulfur Dioxide (.53)
- Alcohol and Residual Sugar (0.45,negative)
- Alcohol and Total Sulfur Dioxide (.45,negative)
- Alcohol and Chlorides (0.36)
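A correlation matrix like the one summarized above can be computed with `cor()`. The sketch below uses simulated stand-in data; the coefficients are made up for illustration and only mimic the direction of the relationships (density falling with alcohol, rising with residual sugar).

```r
set.seed(1)
n <- 500

# Simulated stand-ins; the coefficients below are made up for illustration.
alcohol        <- runif(n, 8, 14)
residual.sugar <- runif(n, 0.6, 20)
density <- 1.005 - 0.0012 * alcohol + 0.0004 * residual.sugar +
  rnorm(n, sd = 0.0005)
toy <- data.frame(alcohol, residual.sugar, density)

# Pairwise Pearson correlations, rounded for reading.
cors <- round(cor(toy), 2)
cors
```

On the real data the same call would be applied to the numeric columns after dropping X. Pearson correlation is sensitive to outliers, so a rank-based alternative such as `cor(toy, method = "spearman")` can be a useful cross-check.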
Many outliers are present in the data. This may also be a reason that I didn’t find any strong correlation between quality and the other variables.
Mean quality increases with alcohol, while it decreases steadily with density.
It’s hard to interpret the other variables, as they contain a high number of outliers and, as can be seen in the correlation matrix, are weakly correlated or uncorrelated with quality.
A positive correlation is shown in the Density vs Residual Sugar plot and a negative correlation in Alcohol vs Residual Sugar. All plots shown are scaled.
A positive correlation is shown in Total Sulfur Dioxide vs Free Sulfur Dioxide and in Total Sulfur Dioxide vs Density. The remaining plots show little relationship between the other variables. All scatter plots are scaled.
Density vs Residual Sugar and Density vs Total Sulfur Dioxide show a positive correlation, while Density vs Alcohol shows a negative trend. Some extreme outliers are also present.
Alcohol and Chlorides don’t show any clear relationship. The data is noisier than in the other plots. The plots are scaled views.
Some interpretations follow:
- Most good wines have low density and high alcohol content. There is also one small peak at 9 in alcohol and at 0.99875 in density.
- The curves for average and poor wine nearly overlap. This might be because quality also depends on other features.
Some relationships that I have observed:
- Quality is positively correlated with Alcohol (0.44).
- Quality is negatively correlated with Density (-0.31), Chlorides (-0.21), Volatile Acidity (-0.19) and Total Sulfur Dioxide (-0.17).
I’ve also found several strong relationships between the other features, for example between density and alcohol and between density and residual sugar.
The strongest relationship is between Residual Sugar and Density.
I’ve plotted density vs alcohol, faceted by rating. Alcohol concentration differs between the groups: average wines are most frequent in the range 8.5-11, while good wines mostly lie between 10 and 13.
This shows that good wines mostly have low total and free sulfur dioxide values. The view is scaled.
Here, the left plot is colored by quality and the right one by rating. It seems that good wine has low residual sugar and total sulfur dioxide and high alcohol. Poor wine has high sulfur dioxide content and less alcohol.
The plots against alcohol don’t help much in interpreting the results. Wine quality seems to be affected mostly by alcohol concentration rather than by the other factors plotted. Poor wines have high chloride content. Views are scaled for all plots.
The exploratory data analysis doesn’t reveal any strong linear relationship between wine quality and the other features. Some weak relationships were observed with alcohol, density, chlorides and volatile acidity. I’ll train a linear regression model with these features, along with residual sugar.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine_data)
## m2: lm(formula = quality ~ alcohol + density, data = wine_data)
## m3: lm(formula = quality ~ alcohol + density + chlorides, data = wine_data)
## m4: lm(formula = quality ~ alcohol + density + chlorides + residual.sugar,
## data = wine_data)
## m5: lm(formula = quality ~ alcohol + density + chlorides + residual.sugar +
## volatile.acidity, data = wine_data)
##
## ==========================================================================================
## m1 m2 m3 m4 m5
## ------------------------------------------------------------------------------------------
## (Intercept) 2.582*** -22.492*** -21.150*** 87.563*** 73.271***
## (0.098) (6.165) (6.162) (12.392) (11.999)
## alcohol 0.313*** 0.360*** 0.343*** 0.237*** 0.283***
## (0.009) (0.015) (0.015) (0.018) (0.018)
## density 24.728*** 23.671*** -84.931*** -70.514***
## (6.079) (6.074) (12.340) (11.949)
## chlorides -2.382*** -1.776** -0.692
## (0.558) (0.555) (0.540)
## residual.sugar 0.052*** 0.052***
## (0.005) (0.005)
## volatile.acidity -2.044***
## (0.110)
## ------------------------------------------------------------------------------------------
## R-squared 0.190 0.192 0.195 0.212 0.264
## adj. R-squared 0.190 0.192 0.195 0.211 0.263
## sigma 0.797 0.796 0.795 0.787 0.760
## F 1146.395 583.290 396.315 328.736 351.293
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -5831.127 -5822.011 -5771.696 -5603.301
## Deviance 3112.257 3101.773 3090.247 3027.406 2826.235
## AIC 11684.782 11670.255 11654.021 11555.391 11220.603
## BIC 11704.272 11696.241 11686.504 11594.371 11266.079
## N 4898 4898 4898 4898 4898
## ==========================================================================================
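As a sanity check on the table, the adjusted R-squared and sigma rows can be recomputed from the reported R-squared, deviance and N using the standard formulas. The values below are copied from the table; the variable names are hypothetical.

```r
# Values copied from the model table above.
n   <- 4898
p   <- 1:5                                   # predictors in m1..m5
r2  <- c(0.190, 0.192, 0.195, 0.212, 0.264)
rss <- c(3112.257, 3101.773, 3090.247, 3027.406, 2826.235)  # "Deviance" row

# Adjusted R-squared penalizes each extra predictor.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Residual standard error: sqrt(RSS / residual degrees of freedom).
sigma <- sqrt(rss / (n - p - 1))

round(adj_r2, 3)
round(sigma, 3)
```

Both rows agree with the table to three decimals. With nearly 4900 observations the adjustment is tiny, so the low R-squared reflects weak predictors rather than overfitting.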
Looking at the statistics of the linear models, the R-squared of the best model (m5) is only 0.264, so the model explains about a quarter of the variance in quality and its predictions will have a high error rate.
This model is not suitable for predicting quality.
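One way to put a number on the prediction error is to hold out part of the data and compute the root mean squared error (RMSE) on it. The sketch below uses simulated stand-in data, with made-up coefficients and noise level, since the original file is not included here; it is not the evaluation used in this report.

```r
set.seed(42)
n <- 1000

# Simulated stand-in data; coefficients and noise level are made up.
toy <- data.frame(alcohol          = runif(n, 8, 14),
                  volatile.acidity = runif(n, 0.1, 1.0))
toy$quality <- 3 + 0.3 * toy$alcohol - 2 * toy$volatile.acidity +
  rnorm(n, sd = 0.75)

# Hold out 20% of rows to estimate prediction error on unseen data.
test_idx <- sample(n, size = 0.2 * n)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

fit  <- lm(quality ~ alcohol + volatile.acidity, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error, in quality points.
rmse <- sqrt(mean((test$quality - pred)^2))
rmse
```

An RMSE close to the residual standard error reported in the table (about 0.76 for m5) would mean a typical prediction is off by roughly three quarters of a quality point.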
The main relationships observed are between wine quality, alcohol and density. Density and residual sugar also show a strong relationship.
I’ve created a linear regression model. Judging by its statistics (adjusted R-squared = 0.26), the model doesn’t predict quality well. I might need more variables with a higher correlation with quality. The number of observations (4898) is also limited, and more observations may be needed to improve the model.
The plot visualizes the relationship between wine quality and the physicochemical property alcohol. It seems a bit surprising that wines with quality rating 3 have a relatively high alcohol content. This is because quality also depends on factors other than alcohol. Overall, experts seem to have given higher ratings to wines with a higher alcohol concentration.
Looking at the statistics behind the plot makes the interpretation clearer. In terms of the share of wines per rating, wines rated 3 make up only 0.41% of the data, while wines rated 6 make up 44.88%.
Here are some summary statistics of alcohol (% by volume) for each quality rating:
| | Quality 3 | Quality 4 | Quality 5 | Quality 6 | Quality 7 | Quality 8 | Quality 9 |
|---|---|---|---|---|---|---|---|
| Minimum Value | 8.00 | 8.40 | 8.00 | 8.50 | 8.60 | 8.50 | 10.40 |
| First Quartile | 9.55 | 9.40 | 9.20 | 9.60 | 10.60 | 11.00 | 12.40 |
| Median | 10.45 | 10.10 | 9.50 | 10.50 | 11.40 | 12.00 | 12.50 |
| Third Quartile | 11.00 | 10.75 | 10.30 | 11.40 | 12.30 | 12.60 | 12.70 |
| Maximum Value | 12.60 | 13.50 | 13.60 | 14.00 | 14.20 | 14.00 | 12.90 |
| Mean | 10.35 | 10.15 | 9.81 | 10.58 | 11.37 | 11.64 | 12.18 |
| Percentage of overall wine | 0.41 | 3.33 | 29.75 | 44.88 | 17.97 | 3.57 | 0.10 |
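A quick consistency check on this table: the count-weighted average of the per-rating means should land close to the overall alcohol mean of 10.51 reported in the summary. The counts are taken from the quality frequency table earlier in the report.

```r
# Counts per quality rating (3 to 9) and mean alcohol per rating, from the tables.
counts       <- c(20, 163, 1457, 2198, 880, 175, 5)
mean_alcohol <- c(10.35, 10.15, 9.81, 10.58, 11.37, 11.64, 12.18)

# The count-weighted average of group means equals the overall mean,
# up to the rounding of the group means.
overall <- weighted.mean(mean_alcohol, w = counts)
round(overall, 2)
```

The result differs from 10.51 only in the second decimal place, which is explained by the group means being rounded to two decimals. It also shows why the overall mean sits near the values for ratings 5 and 6: those two groups hold about three quarters of the wines.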
Density and alcohol are found to be related to quality. Good wine has lower density and a higher alcohol concentration. This is also plausible physically: alcohol has a low density, so a higher alcohol content leads to wine with lower density.
Two peaks can be seen in both the density and the alcohol plot for good wine. The curves for poor and average wine nearly overlap, which might be because wine quality depends not only on density and alcohol but also on other features.
There is a higher probability that a wine tastes good if we select one with low density and high alcohol content.
The above plots reveal relationships between the physicochemical properties and alcohol. Good wine tends to have low density, low total sulfur dioxide, low residual sugar and high alcohol. Poor wine has higher density and residual sugar. High sugar content gives wine a sweeter taste, which is not preferred.
Sulfur dioxide is mostly undetectable at low concentrations, but when free sulfur dioxide exceeds about 50 ppm it becomes evident in the nose and taste of the wine. This may also be a reason why poor wines have a high total sulfur dioxide content.
Alcohol and density are negatively correlated, so an increase in alcohol is associated with a decrease in density. This is consistent with good wine having high alcohol and therefore low density.
The quality of wine depends only weakly on alcohol and density, although I had expected them to play an important role in defining quality. I didn’t find any strong correlations between the features and wine quality, but I was able to find some interesting relationships among the other physicochemical measurements, such as between alcohol and density and between residual sugar and density. Many outliers are also present, which reduces the effectiveness of the linear model. After completing the analysis, I can conclude that good wine tends to have lower density, higher alcohol, less residual sugar and less sulfur dioxide.
The given dataset is based on only 4898 observations of white variants of Portuguese “Vinho Verde” wine from a 2009 source. The conclusions I’ve shared are based only on this dataset. A number of outliers are present; to validate them I would have to recheck them against the original data source or repeat the tests in a controlled environment, and both are almost impossible. I’m also using a sample to describe the population, which is limited because the dataset lacks features that help define wine quality. Collecting additional information such as grape source, wine region and wine name and adding it to this dataset would help decide which wine is better.