College: “IIT Madras”

Email: “jaswisai@gmail.com”

Name: “Jaswanth Sai Venkat Mutcherla”

Project title: “Analysis of Wine Prices”

1.Introduction

Because of its popularity and widespread use, wine spans a wide price range that depends on its quality, which in turn stems from variables such as age, taste and the amounts of the chemicals that go into its making. This is what we hope to decode: the dependence of wine quality on various factors.

2.Overview

For this analysis we’ll be using a dataset describing the red variant of the Portuguese “Vinho Verde” wine, available at Kaggle (source: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009). The dataset was obtained from the publication: “P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009”.

3.Procedure in brief

We aim to propose a model that predicts the quality of a wine from the values of its measured variables. We shall first find out which variables most strongly affect wine quality in the given dataset and, based on that, suggest models to predict wine quality. So without further ado, let’s start right away!

4.Analysing and preparing the Red Wine Dataset

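# Load the red wine data (adjust the path to where the CSV is stored)
winequality <- read.csv("C:/Users/Jaswanth/Desktop/winequality-red.csv")
# attach() copies the columns onto the search path under their original names (quality, alcohol, ...), which later code uses
attach(winequality)
library(psych)  # provides describe()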
colnames(winequality)[colnames(winequality)=="fixed.acidity"] <- "Fixed acidity"
colnames(winequality)[colnames(winequality)=="volatile.acidity"] <- "Volatile acidity"
colnames(winequality)[colnames(winequality)=="citric.acid"] <- "Citric Acid"
colnames(winequality)[colnames(winequality)=="residual.sugar"] <- "Residual Sugar"
colnames(winequality)[colnames(winequality)=="free.sulfur.dioxide"] <- "Free Sulfur Dioxide"
colnames(winequality)[colnames(winequality)=="total.sulfur.dioxide"] <- "Total Sulfur Dioxide"
colnames(winequality)[colnames(winequality)=="chlorides"] <- "Chlorides"
colnames(winequality)[colnames(winequality)=="density"] <- "Density"
colnames(winequality)[colnames(winequality)=="sulphates"] <- "Sulphates"
colnames(winequality)[colnames(winequality)=="alcohol"] <- "Alcohol"
colnames(winequality)[colnames(winequality)=="quality"] <- "Quality"
describe(winequality)

5.Visualizing the overall Correlations

library(corrgram)
corrgram(x=cor(winequality))

Here we can see the various correlations. Alcohol (alcohol content), Sulphates, Residual Sugar, Citric Acid content and Fixed acidity are positively correlated with Quality (signified by the blue regions).

We shall now examine the individual relationships for the variables that correlate positively with Quality, then propose hypotheses for a model and choose the best one. In descending order of correlation with Quality, the variables are: Alcohol, Sulphates, Citric acid, Fixed acidity and Residual Sugar.
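The ordering above can also be read off numerically rather than from the colours of the corrgram. The following is a minimal sketch on simulated stand-in data; the data frame `toy`, its values and the names `cors` and `ranked` are assumptions made for illustration, and with the real data the same two lines at the end would be applied to the wine data frame.

```r
# Simulated stand-in data, NOT the wine dataset (illustration only).
set.seed(42)
n <- 200
toy <- data.frame(
  alcohol        = rnorm(n, mean = 10.4, sd = 1.0),
  sulphates      = rnorm(n, mean = 0.66, sd = 0.17),
  residual.sugar = rnorm(n, mean = 2.5,  sd = 1.4)
)
# Hypothetical quality score that depends on alcohol and sulphates plus noise.
toy$quality <- 2 + 0.35 * toy$alcohol + 0.8 * toy$sulphates + rnorm(n, sd = 0.7)

# Pearson correlation of every column with quality ...
cors <- cor(toy)[, "quality"]
# ... dropping quality itself and sorting from largest to smallest.
ranked <- sort(cors[names(cors) != "quality"], decreasing = TRUE)
print(round(ranked, 3))
```

Sorting the `quality` column of the correlation matrix gives the ranking directly, which avoids misreading neighbouring shades of blue.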

6.Performing t-tests to see if the less correlated variables (Fixed Acidity and Residual Sugar) are actually significant

T-test for Fixed acidity

t.test(winequality$`Fixed acidity`,winequality$Quality)
## 
##  Welch Two Sample t-test
## 
## data:  winequality$`Fixed acidity` and winequality$Quality
## t = 55.913, df = 2255.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.589492 2.777737
## sample estimates:
## mean of x mean of y 
##  8.319637  5.636023

The p-value is less than 0.05, which implies that the difference between the mean Fixed acidity and the mean Quality is significant.

T-test for Residual sugar

t.test(winequality$`Residual Sugar`,winequality$Quality)
## 
##  Welch Two Sample t-test
## 
## data:  winequality$`Residual Sugar` and winequality$Quality
## t = -76.223, df = 2544.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.176895 -3.017539
## sample estimates:
## mean of x mean of y 
##  2.538806  5.636023

The p-value is less than 0.05, which implies that the difference between the mean Residual Sugar and the mean Quality is significant.
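The two-sample t-tests above compare the mean of a variable with the mean of Quality. A more direct check of whether a variable is linearly associated with Quality is a correlation test. The following is a minimal sketch using `cor.test()` from base R on simulated stand-in data; the vectors and their values are assumptions for illustration and are not taken from the wine dataset.

```r
# Simulated stand-in data, NOT the wine dataset (illustration only).
set.seed(1)
n <- 200
fixed.acidity <- rnorm(n, mean = 8.3, sd = 1.7)
quality <- 5 + 0.05 * fixed.acidity + rnorm(n, sd = 0.8)

# Pearson correlation test: the null hypothesis is that the true correlation is 0.
assoc <- cor.test(fixed.acidity, quality)
print(assoc$estimate)  # sample correlation coefficient
print(assoc$p.value)   # p-value for the null of zero correlation
print(assoc$conf.int)  # 95% confidence interval for the correlation
```

A small p-value here would indicate a linear association between the two variables, which is the question the model-building step depends on.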

7.Individual correlations between variables and the Wine quality

In the earlier steps, we’ve seen that Alcohol, Sulphates, Residual Sugar, Citric Acid and Fixed acidity are positively correlated with Quality. We’ll now plot each of them against Wine quality.

Plots between Fixed acidity and Quality

plot(quality,fixed.acidity,xlab = "Quality",ylab = "Fixed acidity")

Here we can see that the higher-quality wines lie towards the higher end of the fixed acidity range.

Plots between Citric Acid and Quality

plot(quality,citric.acid,xlab = "Quality",ylab = "Citric Acid")

We can’t immediately conclude anything regarding the quality and Citric Acid content.

Plots between Residual Sugar and Quality

plot(quality,residual.sugar,xlab = "Quality",ylab = "Residual Sugar")

Once again, there is no conclusive evidence here about their interdependence.

Plots between Sulphates and Quality

plot(quality,sulphates,xlab = "Quality",ylab = "Sulphates")

Here we can see that as we move along the X-axis (Quality), the Y-coordinate values (Sulphates) tend to shift upwards.

Plots between Alcohol content and Quality

plot(quality,alcohol,xlab = "Quality",ylab = "Alcohol content")

Here the dependency is more pronounced due to the higher correlation. We see a trend of wine quality increasing with Alcohol content.
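Because Quality takes only a handful of integer values, the trends in these scatter plots can be hard to judge by eye. One way to back them up is to compute the mean of a variable within each quality level. The following is a minimal sketch on simulated stand-in data; `toy`, its values and `mean_by_quality` are assumptions for illustration only.

```r
# Simulated stand-in data, NOT the wine dataset (illustration only).
set.seed(7)
n <- 300
quality <- sample(3:8, n, replace = TRUE)
alcohol <- 8.5 + 0.4 * quality + rnorm(n, sd = 0.9)
toy <- data.frame(quality, alcohol)

# Mean alcohol content within each quality level.
mean_by_quality <- tapply(toy$alcohol, toy$quality, mean)
print(round(mean_by_quality, 2))
```

If the group means rise steadily with the quality level, the visual trend is supported. A call such as `boxplot(alcohol ~ quality, data = toy)` would show the same grouping graphically.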

8.Formulating Hypotheses and Creating a model

Taking a look at the corrgram, we can see that Alcohol content, Sulphates and Citric acid are more strongly correlated with Wine quality than Residual sugar and Fixed acidity. So we’ll hypothesise that a regression model involving Alcohol content, Sulphates and Citric acid (let’s call this Model 1) would be a better fit to the data than a model involving all five variables (let’s call this Model 2).

9.Testing regression models

Model 1: Involving Alcohol content, Sulphates, Citric Acid

Model1<- lm(quality~alcohol+sulphates+citric.acid,data=winequality)
summary(Model1)
## 
## Call:
## lm(formula = quality ~ alcohol + sulphates + citric.acid, data = winequality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7565 -0.3535 -0.1007  0.5067  2.2125 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.43392    0.17615   8.140 7.86e-16 ***
## alcohol      0.33841    0.01619  20.903  < 2e-16 ***
## sulphates    0.81403    0.10651   7.643 3.65e-14 ***
## citric.acid  0.51345    0.09284   5.531 3.72e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6842 on 1595 degrees of freedom
## Multiple R-squared:  0.2836, Adjusted R-squared:  0.2823 
## F-statistic: 210.5 on 3 and 1595 DF,  p-value: < 2.2e-16

Model 2: Involving all 5 variables

Model2<- lm(quality~fixed.acidity+residual.sugar+alcohol+citric.acid+sulphates,data=winequality)
summary(Model2)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + residual.sugar + alcohol + 
##     citric.acid + sulphates, data = winequality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7521 -0.3533 -0.0909  0.5133  2.1554 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.16472    0.21552   5.404 7.50e-08 ***
## fixed.acidity   0.03298    0.01349   2.445  0.01458 *  
## residual.sugar -0.01483    0.01227  -1.209  0.22701    
## alcohol         0.34631    0.01645  21.055  < 2e-16 ***
## citric.acid     0.32568    0.12528   2.600  0.00942 ** 
## sulphates       0.81556    0.10647   7.660 3.21e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.683 on 1593 degrees of freedom
## Multiple R-squared:  0.2869, Adjusted R-squared:  0.2846 
## F-statistic: 128.2 on 5 and 1593 DF,  p-value: < 2.2e-16

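By comparing the adjusted R-squared values of the two models (0.2823 for Model 1 and 0.2846 for Model 2), we can see that adding “Residual Sugar” and “Fixed acidity” results in a marginally better model for predicting the Wine quality, contrary to our hypothesis. Note, however, that the coefficient of Residual Sugar is not significant in Model 2 (p = 0.227).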
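Adjusted R-squared is one way to compare the two models. Because Model 1 is nested in Model 2, a partial F-test via `anova()` and the AIC values give a complementary check. The following is a minimal sketch on simulated stand-in data; `toy`, `small`, `full` and all simulated values are assumptions for illustration and do not reproduce the wine results.

```r
# Simulated stand-in data, NOT the wine dataset (illustration only).
set.seed(123)
n <- 500
toy <- data.frame(
  alcohol        = rnorm(n, mean = 10.4, sd = 1.0),
  sulphates      = rnorm(n, mean = 0.66, sd = 0.17),
  citric.acid    = runif(n, min = 0, max = 0.8),
  fixed.acidity  = rnorm(n, mean = 8.3, sd = 1.7),
  residual.sugar = rnorm(n, mean = 2.5, sd = 1.4)
)
toy$quality <- 1.2 + 0.35 * toy$alcohol + 0.8 * toy$sulphates +
  0.4 * toy$citric.acid + rnorm(n, sd = 0.7)

# Three-variable model and five-variable model, fitted to the same rows.
small <- lm(quality ~ alcohol + sulphates + citric.acid, data = toy)
full  <- lm(quality ~ alcohol + sulphates + citric.acid +
              fixed.acidity + residual.sugar, data = toy)

# Partial F-test: do the two extra predictors jointly reduce the residual error?
comparison <- anova(small, full)
print(comparison)

# AIC penalises extra parameters; the lower value is preferred.
print(AIC(small, full))
```

With the objects from this report, the equivalent call would be `anova(Model1, Model2)`. A large p-value in the second row would suggest that the two extra variables add little beyond the three-variable model.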

10.Conclusion

We have finally obtained a model (Model 2) which predicts the Wine quality from the variables we selected based on correlations and t-tests, namely: Alcohol content, Fixed acidity, Residual sugar, Citric acid and Sulphates. With an adjusted R-squared of 0.2846, it explains a little under 30% of the variance in Quality. Since we have obtained the required coefficients and intercept for Model 2, we can now write the model as a linear equation.

\[\text{Wine Quality} = \alpha_0 + \alpha_1 \,\text{Fixed Acidity} + \alpha_2 \,\text{Residual Sugar} + \alpha_3 \,\text{Alcohol Content} + \alpha_4 \,\text{Citric Acid} + \alpha_5 \,\text{Sulphates}\]

The model fit yielded the values of the coefficients, so, substituting them, we get the equation:

\[\text{Wine Quality} = 1.16472 + 0.03298 \times \text{Fixed Acidity} - 0.01483 \times \text{Residual Sugar} + 0.34631 \times \text{Alcohol Content} + 0.32568 \times \text{Citric Acid} + 0.81556 \times \text{Sulphates}\]

We have thus obtained an equation for predicting the Wine quality using the selected attributes from the data.
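To see the equation in use, the following sketch plugs one wine into it. The coefficients are copied from the Model 2 summary in this report; the wine's measurements and the names `coefs`, `wine` and `predicted` are hypothetical values chosen only for illustration.

```r
# Coefficients copied from the Model 2 summary above.
coefs <- c(
  intercept      = 1.16472,
  fixed.acidity  = 0.03298,
  residual.sugar = -0.01483,
  alcohol        = 0.34631,
  citric.acid    = 0.32568,
  sulphates      = 0.81556
)

# Hypothetical wine (made-up measurements, not a row from the dataset).
wine <- c(
  fixed.acidity  = 8.0,
  residual.sugar = 2.5,
  alcohol        = 10.0,
  citric.acid    = 0.30,
  sulphates      = 0.65
)

# Intercept plus the sum of coefficient * value, matched by name.
predicted <- unname(coefs["intercept"] + sum(coefs[names(wine)] * wine))
print(round(predicted, 2))  # 5.48
```

Term by term this is 1.16472 + 0.26384 - 0.03708 + 3.46310 + 0.09770 + 0.53011, which is about 5.48. Calling `predict()` on the fitted model with a matching `newdata` data frame should agree up to the rounding of the coefficients.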

Thank You !