In this project, I analyze which factors determine the quality of red wine. After the analysis, I build a linear model to predict wine quality from the given characteristics.
## [1] 1599 12
Our data set consists of 12 variables (11 physicochemical properties plus the quality score) with 1599 observations.
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
## [1] "3" "4" "5" "6" "7" "8"
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
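The change in `quality` from `int` to an ordered factor between the two `str()` outputs can be reproduced with a call along these lines. This is a minimal sketch, not necessarily the exact call used in the analysis: it uses only the first ten quality scores printed above, and the vector names `quality_int` and `quality_fac` are hypothetical.

```r
# First ten quality scores as shown in the str() output above.
quality_int <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Declare all six levels explicitly so that scores missing from this
# small sample (3, 4 and 8) are still part of the ordering.
quality_fac <- factor(quality_int, levels = 3:8, ordered = TRUE)

levels(quality_fac)      # "3" "4" "5" "6" "7" "8"
as.integer(quality_fac)  # level codes: 3 3 3 4 3 3 3 5 5 3
```

The level codes printed by the last line match the codes shown at the end of the `quality` row in the second `str()` output.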
The distribution of each variable in the dataset is plotted below to understand its shape (normal, right-skewed, or left-skewed) and to check for extreme outliers.
## 3 4 5 6 7 8
## 10 53 681 638 199 18
Most wines in our dataset are of average quality (5 and 6), with far fewer poor and good quality wines. This imbalance may bias the results and reduce the accuracy of the wine quality model.
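To put a number on this imbalance, the counts from the table above can be turned into shares. The sketch below simply re-enters those six counts by hand; the names `quality_counts`, `quality_share` and `average_share` are hypothetical.

```r
# Counts per quality score, copied from the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of each quality score in the whole dataset.
quality_share <- quality_counts / sum(quality_counts)
round(100 * quality_share, 1)

# Combined share of the two "average" scores, 5 and 6 (about 82.5%).
average_share <- sum(quality_share[c("5", "6")])
round(100 * average_share, 1)
```

Roughly four out of five wines are rated 5 or 6, so the extreme scores contribute very little information to any model fitted on this data.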
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution of fixed acidity is right (positively) skewed, with a median of 7.90 and the mean pulled up to 8.32 by high-value outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The distribution of volatile acidity appears to be bimodal, with peaks around 0.4 and 0.6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The citric acid distribution does not appear to follow any standard distribution and has more than two peaks. 75% of the values are less than or equal to 0.42, but the maximum is 1.0, which points to the presence of outliers.
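One way to make the outlier claim concrete is Tukey's 1.5 × IQR rule, applied to the quartiles printed above. This rule is my own choice for illustration and is not necessarily how outliers were judged in the plots; the variable names are hypothetical.

```r
# Quartiles and maximum of citric.acid, copied from the summary above.
q1 <- 0.09
q3 <- 0.42
max_value <- 1.00

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as outliers.
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr   # 0.42 + 1.5 * 0.33 = 0.915

max_value > upper_fence         # TRUE: the maximum lies beyond the fence
```

The summary alone only proves that at least one value is beyond the fence; counting how many would require the full column.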
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugar is also right-skewed, like fixed acidity, with the mean (2.539) greater than the median (2.200) due to outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The distribution of chlorides is similar to that of residual sugar, i.e. right-skewed due to the presence of outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Free sulfur dioxide also follows a positively skewed distribution, like most of the variables, with a peak at 6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Total sulfur dioxide follows a similar pattern to free sulfur dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Density appears to be approximately normally distributed, with the mean and median both close to 0.997.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The pH distribution is also approximately normal, similar to density.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates also follow a positively skewed distribution, like most of the variables, with a peak at about 0.6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Alcohol is also right-skewed, but with less skewness and fewer outliers than the other skewed variables.
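The skewness remarks above rest on comparing each mean with its median. As a crude cross-check (a heuristic I am adding, not a formal skewness statistic), the sketch below collects the medians and means from the summaries above and expresses the gap as a percentage of the median. The data frame `stats` and the column `rel_gap_pct` are hypothetical names.

```r
# Medians and means copied from the summary outputs above.
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  median = c(7.90, 0.52, 0.26, 2.20, 0.079, 14.00, 38.00, 0.9968, 3.310, 0.62, 10.20),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87, 46.47, 0.9967, 3.311, 0.6581, 10.42)
)

# A positive gap means the mean sits above the median (a hint of right skew).
stats$rel_gap_pct <- round(100 * (stats$mean - stats$median) / stats$median, 1)

# Largest gaps first.
stats[order(-stats$rel_gap_pct), ]
```

Density and pH show almost no gap, which agrees with the near-normal shape described above, while the sulfur dioxide variables and residual sugar show the largest gaps.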
The red wine dataset has 1599 observations of 12 variables. Quality is the categorical variable (an ordered factor), and the remaining 11 are numerical variables that reflect the chemical and physical properties of the wine.
The main feature of interest in this dataset is ‘Quality’. I would like to determine which factors best explain the quality of a wine.
I think ‘Alcohol’ might play a key role in determining the quality of red wine. ‘Acidity’ may also affect quality. Considering taste, I believe ‘residual sugar’ will also affect the quality of red wine.
I created a correlation matrix of all the variables to get an overview of which variables may affect quality and which variables are correlated with each other.
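A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below is only illustrative: it uses the first ten rows printed in the `str()` output and a subset of columns, so its coefficients will differ from those of the full 1599-row dataset. The data frame name `wine_sample` is hypothetical.

```r
# First ten observations of five columns, copied from the str() output.
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(wine_sample, method = "pearson")
round(cor_matrix, 2)

# Correlations of each predictor with quality, strongest positive first.
sort(cor_matrix["quality", colnames(cor_matrix) != "quality"], decreasing = TRUE)
```

Note that `cor()` needs numeric columns, so `quality` is kept as a number here rather than as an ordered factor.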
Now, let's take a closer look at which variables are helpful in determining the quality of red wine.
The plot shows that fixed acidity does not have a significant impact on red wine quality, as the median values remain approximately the same as quality increases.
The plot clearly shows that volatile acidity is negatively correlated with wine quality: as volatile acidity decreases, quality increases.
Citric acid also seems to play a role in determining red wine quality. Good wines have a higher concentration of citric acid.
This contradicts my initial assumption: residual sugar does not seem to affect quality, as its median values are approximately the same across quality levels.
The plots above imply that chlorides follow a similar pattern to residual sugar and have no major impact on red wine quality.
This is an interesting observation: poor and good quality wines seem to have low concentrations of free sulfur dioxide, while average quality wines seem to have higher concentrations.
As expected, total sulfur dioxide follows a similar pattern to free sulfur dioxide.
Better wines seem to be less dense than poor quality wines. This may be due to their higher alcohol content.
Better wines seem to have a lower pH, i.e. they are more acidic.
Let's check how the various acids affect pH.
It is strange that volatile acidity has a positive correlation with pH. It might be due to Simpson's paradox or to the limited number of observations.
The plot clearly shows that better wines tend to have a higher concentration of sulphates.
This appears to be the strongest correlation with quality among all the variables. Better quality wines clearly have higher alcohol content than poor quality wines.
Volatile acidity had a positive correlation with pH, which was unexpected. This may be due to Simpson's paradox or to the limited number of observations.
Alcohol seems to be the strongest factor in determining wine quality, with the highest Pearson correlation with quality (0.48) among all the variables.
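For a linear model with a single predictor, the R-squared equals the square of the Pearson correlation, so the 0.48 quoted above can be cross-checked against the R-squared values in the regression summaries later in this report (0.2267 for alcohol and 0.1525 for volatile acidity). The short sketch below does that arithmetic; the variable names are hypothetical, and the signs are taken from the slopes in those summaries.

```r
# R-squared values copied from the single-predictor regression summaries.
r_squared_alcohol  <- 0.2267
r_squared_volatile <- 0.1525

# |r| = sqrt(R^2); the sign follows the sign of the fitted slope.
r_alcohol  <-  sqrt(r_squared_alcohol)   # slope is positive
r_volatile <- -sqrt(r_squared_volatile)  # slope is negative

round(r_alcohol, 2)   # 0.48
round(r_volatile, 2)  # -0.39
```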
As we saw, alcohol seems to be the strongest factor in determining wine quality, so let's plot it against the other significant variables for a deeper view.
Higher alcohol and higher citric acid concentrations seem to produce better wines.
Higher alcohol produces better wines when the sulphates concentration is also high.
This plot shows that better wines tend to have high alcohol but low volatile acidity.
The plot shows that high alcohol and low pH tend to produce better quality wines.
Let's analyze how the other acids affect pH:
The plot shows that high citric acid and low pH tend to produce high quality wines.
This was not expected: pH increases as volatile acidity increases, while the better quality wines tend to have low volatile acidity and low pH.
Now, after all the analysis, let's determine how much each variable actually contributes to determining wine quality.
Before deciding on the final model, I first create a separate linear model of quality against each independent variable. Based on each variable's contribution, I then decide which variables should go into the final model of red wine quality.
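The one-model-per-variable procedure described above can be sketched as a loop over the predictor names. This is only an illustration under stated assumptions: it uses the first ten rows printed in the `str()` output (so the R-squared values will not match the full-data results), keeps `quality` as a plain number, and the names `wine_sample`, `predictors` and `r_squared` are hypothetical.

```r
# First ten observations of four predictors plus quality, from str().
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

predictors <- setdiff(names(wine_sample), "quality")

# Fit quality ~ <predictor> for each predictor and keep its R-squared.
r_squared <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "quality"), data = wine_sample)
  summary(fit)$r.squared
})

# Predictors ranked by the share of variance they explain on their own.
sort(r_squared, decreasing = TRUE)
```

`reformulate()` builds the formula from the column name, which avoids writing out each `lm()` call by hand.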
Quality vs Alcohol
##
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = redwine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.12503 0.17471 -0.716 0.474
## alcohol 0.36084 0.01668 21.639 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
Based on the R-squared value (0.2267), alcohol alone explains about 23% of the variance in red wine quality.
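One caveat about the `as.numeric(quality)` term in the model call: if `quality` was still the ordered factor shown in the second `str()` output when the model was fitted (an assumption on my part, since the code chunk itself is not shown), then `as.numeric()` returns the level codes 1 to 6 rather than the scores 3 to 8. The sketch below shows the difference on the first ten quality values and checks the fitted line against the sample means reported earlier.

```r
# Ordered factor built from the first ten quality scores in str().
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5), levels = 3:8, ordered = TRUE)

as.numeric(quality)                # level codes:     3 3 3 4 3 3 3 5 5 3
as.numeric(as.character(quality))  # original scores: 5 5 5 6 5 5 5 7 7 5

# Fitted value at the mean alcohol (10.42), using the coefficients above.
code_at_mean <- -0.12503 + 0.36084 * 10.42   # about 3.635

# Adding 2 recovers the mean quality score of 5.636 from the summary,
# which is consistent with the response being level codes.
code_at_mean + 2
```

A least-squares line passes through the point of means, so the fitted value near 3.64 instead of 5.64 suggests the response is shifted by 2. Slopes, R-squared and p-values are unaffected by this shift; only the intercepts and the scale of the predictions change.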
Quality vs Sulphates
##
## Call:
## lm(formula = as.numeric(quality) ~ sulphates, data = redwine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2432 -0.5424 0.1102 0.4456 2.3977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.84775 0.07842 36.31 <2e-16 ***
## sulphates 1.19771 0.11539 10.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7819 on 1597 degrees of freedom
## Multiple R-squared: 0.0632, Adjusted R-squared: 0.06261
## F-statistic: 107.7 on 1 and 1597 DF, p-value: < 2.2e-16
Sulphates explain only about 6% of the variance in red wine quality. This is a low value, but the effect is statistically significant, as the p-value is less than 0.05.
Quality vs pH
##
## Call:
## lm(formula = as.numeric(quality) ~ pH, data = redwine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6817 -0.6394 0.3032 0.3878 2.4874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6359 0.4332 10.703 <2e-16 ***
## pH -0.3020 0.1307 -2.311 0.021 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8065 on 1597 degrees of freedom
## Multiple R-squared: 0.003333, Adjusted R-squared: 0.002709
## F-statistic: 5.34 on 1 and 1597 DF, p-value: 0.02096
pH explains only about 0.3% of the variance in wine quality (R-squared = 0.0033), although its coefficient is significant at the 5% level (p = 0.021).
Quality vs density
##
## Call:
## lm(formula = as.numeric(quality) ~ density, data = redwine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7885 -0.6216 0.1554 0.4271 2.5177
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.24 10.51 7.446 1.57e-13 ***
## density -74.85 10.54 -7.100 1.87e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7954 on 1597 degrees of freedom
## Multiple R-squared: 0.0306, Adjusted R-squared: 0.02999
## F-statistic: 50.41 on 1 and 1597 DF, p-value: 1.875e-12
Quality vs citric acid
##
## Call:
## lm(formula = as.numeric(quality) ~ citric.acid, data = redwine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0011 -0.5976 0.1021 0.5057 2.5901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.38172 0.03372 100.294 <2e-16 ***
## citric.acid 0.93845 0.10104 9.288 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7869 on 1597 degrees of freedom
## Multiple R-squared: 0.05124, Adjusted R-squared: 0.05065
## F-statistic: 86.26 on 1 and 1597 DF, p-value: < 2.2e-16
Citric acid explains only about 5% of the variance in red wine quality. This is a low value, but the effect is statistically significant.
Quality vs Volatile acidity
##
## Call:
## lm(formula = as.numeric(quality) ~ volatile.acidity, data = redwine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.79071 -0.54411 -0.00687 0.47350 2.93148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.56575 0.05791 78.85 <2e-16 ***
## volatile.acidity -1.76144 0.10389 -16.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7437 on 1597 degrees of freedom
## Multiple R-squared: 0.1525, Adjusted R-squared: 0.152
## F-statistic: 287.4 on 1 and 1597 DF, p-value: < 2.2e-16
This is not what I expected: volatile acidity explains about 15% of the variance in red wine quality.
Quality vs Fixed acidity
##
## Call:
## lm(formula = as.numeric(quality) ~ fixed.acidity, data = redwine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8248 -0.6061 0.1925 0.4341 2.5550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.15732 0.09789 32.253 < 2e-16 ***
## fixed.acidity 0.05754 0.01152 4.996 6.5e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8016 on 1597 degrees of freedom
## Multiple R-squared: 0.01539, Adjusted R-squared: 0.01477
## F-statistic: 24.96 on 1 and 1597 DF, p-value: 6.496e-07
Fixed acidity contributes very little to red wine quality, with an R-squared of only about 0.015 (1.5%), even though its coefficient is statistically significant.
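To compare the seven single-predictor models at a glance, their R-squared values can be collected and sorted. The sketch below re-enters the "Multiple R-squared" figures from the summaries above by hand; the names `r_squared` and `ranking` are hypothetical.

```r
# Multiple R-squared of each single-predictor model, copied from above.
r_squared <- c(
  alcohol          = 0.2267,
  sulphates        = 0.0632,
  pH               = 0.003333,
  density          = 0.0306,
  citric.acid      = 0.05124,
  volatile.acidity = 0.1525,
  fixed.acidity    = 0.01539
)

# Rank predictors by the share of variance in quality each explains alone.
ranking <- sort(r_squared, decreasing = TRUE)
round(100 * ranking, 1)   # percentages, largest first
```

Alcohol and volatile acidity lead by a wide margin, followed by sulphates and citric acid; density, fixed acidity and pH each explain about 3% or less.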
Considering the individual R-squared values and the correlations with quality, I decided to build the final linear model with the following four independent variables:
Dependent Variable: Quality
Independent variables: Alcohol, Sulphates, Citric acid, Volatile Acidity
Here are our model results:
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = redwine)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = redwine)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + citric.acid,
## data = redwine)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + citric.acid +
## volatile.acidity, data = redwine)
##
## ============================================================================
## m1 m2 m3 m4
## ----------------------------------------------------------------------------
## (Intercept) -0.125 -0.625*** -0.566** 0.646**
## (0.175) (0.177) (0.176) (0.201)
## alcohol 0.361*** 0.346*** 0.338*** 0.309***
## (0.017) (0.016) (0.016) (0.016)
## sulphates 0.994*** 0.814*** 0.696***
## (0.102) (0.107) (0.103)
## citric.acid 0.513*** -0.079
## (0.093) (0.104)
## volatile.acidity -1.265***
## (0.113)
## ----------------------------------------------------------------------------
## R-squared 0.227 0.270 0.284 0.336
## adj. R-squared 0.226 0.269 0.282 0.334
## sigma 0.710 0.690 0.684 0.659
## F 468.267 294.988 210.501 201.777
## p 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1675.142 -1659.955 -1599.093
## Deviance 805.870 760.894 746.576 691.852
## AIC 3448.114 3358.284 3329.910 3210.186
## BIC 3464.245 3379.793 3356.795 3242.448
## N 1599 1599 1599 1599
## ============================================================================
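The AIC and BIC rows in the table follow directly from the log-likelihood row. As a worked check, the sketch below applies the standard definitions AIC = 2k − 2·logLik and BIC = k·log(n) − 2·logLik, where k counts the regression coefficients plus one for the residual variance. The vector names are hypothetical.

```r
n <- 1599  # number of observations (row N in the table)

# Log-likelihoods copied from the table above.
log_lik <- c(m1 = -1721.057, m2 = -1675.142, m3 = -1659.955, m4 = -1599.093)

# Estimated parameters: intercept + slopes + residual variance.
k <- c(m1 = 3, m2 = 4, m3 = 5, m4 = 6)

aic <- 2 * k - 2 * log_lik
bic <- log(n) * k - 2 * log_lik

round(aic, 3)  # reproduces the AIC row, e.g. m1 = 3448.114
round(bic, 3)
```

Both criteria fall at every step from m1 to m4, so each added variable improves the fit by more than its complexity penalty, with the largest drop coming from adding volatile acidity.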
I created several single-predictor linear models to observe how much of the variance in red wine quality each variable explains individually. Alcohol and volatile acidity came out as the most important factors, explaining about 23% and 15% of the variance respectively.
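Consolidating the key variables, my final model includes alcohol, volatile acidity, citric acid, and sulphates, which together explain approximately 33% of the variance in red wine quality (adjusted R-squared of 0.334). Note that the citric acid coefficient is no longer significant once volatile acidity is added (model m4).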
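To show how the final model turns measurements into a prediction, the sketch below plugs the rounded m4 coefficients from the table into a small function. The function name `predict_quality_m4` is hypothetical, and the `+ 2` step assumes the response was the level code of the ordered factor (1 to 6) rather than the score itself (3 to 8); that assumption is consistent with the fitted value at the sample means, but the original code chunk is not shown.

```r
# Prediction from the rounded m4 coefficients in the table above.
predict_quality_m4 <- function(alcohol, sulphates, citric.acid, volatile.acidity) {
  code <- 0.646 +
    0.309 * alcohol +
    0.696 * sulphates -
    0.079 * citric.acid -
    1.265 * volatile.acidity
  # Assumption: the model was fitted on level codes 1..6, so shift by 2
  # to get back to the 3..8 quality scale.
  code + 2
}

# First wine in the dataset (values from the str() output); its actual
# quality score is 5.
predict_quality_m4(alcohol = 9.4, sulphates = 0.56,
                   citric.acid = 0, volatile.acidity = 0.7)   # about 5.05
```

With a residual standard error of about 0.66, individual predictions like this one should be read as rough estimates rather than exact scores.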
This modest R-squared may be partly due to the fact that our dataset consists mainly of ‘Average’ quality wines, with very few observations of ‘Good’ and ‘Bad’ quality wines. More observations of ‘Bad’ and ‘Good’ quality wines would have helped me predict the quality of red wine more effectively.
In this dataset, even though most of the observations are for average quality wines, we can see from the above plot that the mean and median nearly coincide for all the boxes, which suggests that the distribution within each quality level is roughly symmetric and close to normal.
Also, from our linear model, the R-squared value shows that alcohol alone explains about 23% of the variance in wine quality.
The lower the volatile acidity, the better the wine quality. This is not the case for the other acidity variables in the data set.
Also, from our linear model, volatile acidity explains around 15% of the variance in red wine quality.
The red wine dataset consists of 1599 observations of 12 variables describing various physical and chemical properties of red wine, together with its quality score.
First I plotted the distribution of each variable on its own (univariate analysis), and then I plotted each variable against quality to see which ones are significant in determining wine quality. The factors that affected quality the most were alcohol percentage, sulphates, citric acid, and volatile acidity.
I tried to work out the effect of each individual acid on the overall pH of the wine. Here I found a strange phenomenon: for volatile acidity, pH increased with acidity, which goes against the general property of acids.
In the final part of my analysis, I created multivariate plots to see whether there were interesting combinations of variables that together affected the overall quality of the wine. These plots also helped me choose the final variables to put in the linear model for predicting wine quality.
While performing the analysis, my main struggle was to get a higher confidence level for predicting the ‘Good’ and the ‘Bad’ quality wines, as the data was heavily concentrated around ‘Average’ quality. The dataset did not have enough observations at the extremes, i.e. for quality 3, 4, 7, and 8, to build a model that predicts wine quality from the other variables with a smaller margin of error. In future, with a red wine dataset that has more observations of every quality level, I could build more effective models.