We have a dataset describing 1599 samples of Portuguese red wine, “Vinho Verde”. This multivariate data records the physicochemical composition of each sample together with its quality, rated by wine tasters on a 10-point scale (10 = excellent quality, 0 = bad quality). The ultimate goal is to fit a linear model to this data and find out which factors contribute to a good-quality red wine, so that we do not have to rely on human tasters alone to assess quality.
Now, let us extract basic information about the dataset using descriptive statistics, starting with the summary function.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Observations from the above summary:
There are no missing values in the data. The dataset also contains no negative values; every entry is a real number.
There are 12 columns in the dataset, all of numeric type.
All columns except ‘quality’ are continuous. The quality column is discrete, taking only integer values from 3 to 8.
Comparing the mean, third quartile (75th percentile) and maximum in the summary suggests possible outliers in several columns, for example residual.sugar (max = 15.5), total.sulfur.dioxide (max = 289) and sulphates (max = 2). Whether these really are outliers will be checked in the subsequent steps.
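The checks behind these observations could be sketched as follows. This is only an illustration: `wine_demo` is a hypothetical stand-in with a few made-up rows, not the real data, and the variable names are chosen for this example.

```r
# Stand-in data with made-up values (NOT the real wine dataset).
wine_demo <- data.frame(
  alcohol   = c(9.4, 9.8, 10.5, 11.2, 12.8),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.75),
  quality   = c(5L, 5L, 6L, 6L, 7L)
)

# Min, quartiles, mean and max for every column.
print(summary(wine_demo))

# Checks corresponding to the observations in the text.
n_missing     <- colSums(is.na(wine_demo))        # missing values per column
all_numeric   <- all(sapply(wine_demo, is.numeric))
any_negative  <- any(wine_demo < 0)
quality_range <- range(wine_demo$quality)

print(n_missing)
print(c(all_numeric = all_numeric, any_negative = any_negative))
print(quality_range)
```

On the real data the same calls would be applied to the full data frame instead of `wine_demo`.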
Now, let us check the distribution of the data using histograms.
From the above set of histograms, we can observe how the data is distributed. Fixed acidity is right skewed. Volatile acidity is also right skewed, but with two modes: a histogram with a smaller bin width shows bimodal characteristics. This suggests that volatile acidity may be concentrated in two groups, although it could also be an artifact, given that the dataset is not very large. Similarly, the distribution of citric acid has three modes. The distribution of residual sugar has many isolated extreme values, some of which may be outliers.
Chlorides also shows many extreme values that may be outliers. Free sulfur dioxide and total sulfur dioxide are also right skewed. Density is approximately normally distributed, which is good.
pH is also approximately normally distributed. Sulphates has many extreme values, and alcohol is also right skewed. We do not need to examine the quality histogram closely, since quality is our response. As a general comment, 5 and 6 are the most commonly given ratings, and the ratings in our data range only from 3 to 8.
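How the bin width changes the picture can be illustrated with a small sketch. The vector `x` below is simulated right-skewed data standing in for a column such as residual sugar; it is an assumption for the example and not the real measurements.

```r
set.seed(1)
# Simulated right-skewed stand-in for one wine variable (made-up data).
x <- rlnorm(500, meanlog = 0.8, sdlog = 0.6)

# The same data with coarse and fine bins; finer bins can reveal extra modes
# or isolated extreme values that coarse bins hide.
op <- par(mfrow = c(1, 2))
hist(x, breaks = 10, main = "About 10 bins", xlab = "x")
hist(x, breaks = 40, main = "About 40 bins", xlab = "x")
par(op)

# A crude numeric hint of right skew: the mean sits above the median.
skew_gap <- mean(x) - median(x)
print(skew_gap)
```

The mean-minus-median gap is only a rough indicator; the histogram remains the main tool here.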
Our data does not have any missing values or inconsistent or technically incorrect entries. The next step in data cleaning is identifying outliers. To detect them, let us use Tukey’s box-and-whisker method (box plots): we flag any point lying more than 1.5 times the IQR (interquartile range) below the first quartile or above the third quartile as a possible outlier.
As seen in these box plots, there is some degree of right skewness in fixed acidity, volatile acidity, free.sulfur.dioxide and total.sulfur.dioxide. The box plots show that there may be some outliers in residual.sugar, chlorides and sulphates. In particular, chlorides has 2 outliers at the far end of the distribution (at a chloride value of about 0.6), and total sulfur dioxide also has 2 outliers. However, we do not know whether these data points are bad measurements or correct ones. Since we cannot be sure they are invalid, let us carry these ‘potential outliers’ over to the subsequent analysis and try to eliminate them at a later stage by calculating Cook’s distance.
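The 1.5 × IQR rule described above can be written out directly. This is a minimal sketch: `tukey_outliers` is a helper name invented for this example, and the input vector is made up. Note that `boxplot()` computes its hinges slightly differently from `quantile()`, so the flagged points can differ marginally from those drawn on a box plot.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR] (Tukey's fences).
tukey_outliers <- function(x, k = 1.5) {
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Made-up values with one obviously extreme point.
x <- c(1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.6, 15.5)
flags <- tukey_outliers(x)

print(x[flags])   # the flagged candidate outliers
```

Flagging is only a screening step; as in the text, a flagged point is a candidate to investigate, not automatically a bad measurement.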
We can also look at the Q-Q plots.
All of these plots show the right skewness of the data.
Let us look at the pair plots and see whether there are any correlations.
From the above scatterplot and correlation matrix, we can see that there is positive correlation between:
1. Citric acid and fixed acidity
2. Total sulphur dioxide and free sulphur dioxide
3. Density and fixed acidity
and negative correlation between:
1. Citric acid and volatile acidity
2. pH and fixed acidity
3. pH and citric acid
4. Alcohol and density
Pairs with an absolute correlation of about 0.5 or more, in descending order (high to low): pH and fixed acidity, citric acid and fixed acidity, density and fixed acidity, citric acid and volatile acidity, pH and free sulfur dioxide, quality and alcohol, alcohol and density.
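Listing the strongly correlated pairs can be automated along the following lines. The data frame `demo` is simulated for illustration (the coefficients used to generate it are arbitrary assumptions), so the resulting numbers are not those of the wine data.

```r
set.seed(42)
n <- 200
fixed_acidity <- rnorm(n, 8.3, 1.7)

# Simulated stand-in: pH is built to fall as fixed acidity rises,
# alcohol is generated independently.
demo <- data.frame(
  fixed.acidity = fixed_acidity,
  pH            = 3.9 - 0.07 * fixed_acidity + rnorm(n, 0, 0.1),
  alcohol       = rnorm(n, 10.4, 1)
)

cor_mat <- cor(demo)

# Keep each pair once (upper triangle) when |r| exceeds the threshold.
idx <- which(abs(cor_mat) > 0.5 & upper.tri(cor_mat), arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
strong <- strong[order(-abs(strong$r)), ]
print(strong)
```

Sorting by the absolute value keeps strong negative correlations from being pushed to the bottom of the list.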
Now let us fit a linear model to the data using a stepwise regression procedure. We choose backward elimination: starting from the full model, we remove insignificant predictors one at a time until we are left with an adequate model that explains a major part of the data. At each step the candidate for removal is judged by its significance test (the t-test p-value in the summary, which is equivalent to a partial F test).
This is our initial model, with quality as the response and all the other variables as predictors. Variables can be eliminated either with a stepwise AIC procedure or by inspecting the test statistics.
##
## Call:
## lm(formula = RedWine$quality ~ RedWine$fixed.acidity + RedWine$volatile.acidity +
## RedWine$citric.acid + RedWine$residual.sugar + RedWine$chlorides +
## RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide +
## RedWine$density + RedWine$pH + RedWine$sulphates + RedWine$alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68911 -0.36652 -0.04699 0.45202 2.02498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.197e+01 2.119e+01 1.036 0.3002
## RedWine$fixed.acidity 2.499e-02 2.595e-02 0.963 0.3357
## RedWine$volatile.acidity -1.084e+00 1.211e-01 -8.948 < 2e-16 ***
## RedWine$citric.acid -1.826e-01 1.472e-01 -1.240 0.2150
## RedWine$residual.sugar 1.633e-02 1.500e-02 1.089 0.2765
## RedWine$chlorides -1.874e+00 4.193e-01 -4.470 8.37e-06 ***
## RedWine$free.sulfur.dioxide 4.361e-03 2.171e-03 2.009 0.0447 *
## RedWine$total.sulfur.dioxide -3.265e-03 7.287e-04 -4.480 8.00e-06 ***
## RedWine$density -1.788e+01 2.163e+01 -0.827 0.4086
## RedWine$pH -4.137e-01 1.916e-01 -2.159 0.0310 *
## RedWine$sulphates 9.163e-01 1.143e-01 8.014 2.13e-15 ***
## RedWine$alcohol 2.762e-01 2.648e-02 10.429 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared: 0.3606, Adjusted R-squared: 0.3561
## F-statistic: 81.35 on 11 and 1587 DF, p-value: < 2.2e-16
The adjusted R squared is 0.356, which is modest but acceptable. Let us try to simplify the above model by eliminating the variable “density”, which has the highest p-value (0.4086).
##
## Call:
## lm(formula = RedWine$quality ~ RedWine$fixed.acidity + RedWine$volatile.acidity +
## RedWine$citric.acid + RedWine$residual.sugar + RedWine$chlorides +
## RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide +
## RedWine$pH + RedWine$sulphates + RedWine$alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.67204 -0.36527 -0.04523 0.45628 2.03894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4538341 0.6125783 7.271 5.59e-13 ***
## RedWine$fixed.acidity 0.0081441 0.0160586 0.507 0.61212
## RedWine$volatile.acidity -1.0964449 0.1200866 -9.130 < 2e-16 ***
## RedWine$citric.acid -0.1836098 0.1471561 -1.248 0.21232
## RedWine$residual.sugar 0.0089507 0.0120542 0.743 0.45787
## RedWine$chlorides -1.9067341 0.4173928 -4.568 5.30e-06 ***
## RedWine$free.sulfur.dioxide 0.0045147 0.0021631 2.087 0.03704 *
## RedWine$total.sulfur.dioxide -0.0033120 0.0007264 -4.560 5.52e-06 ***
## RedWine$pH -0.5042762 0.1571117 -3.210 0.00136 **
## RedWine$sulphates 0.8928974 0.1107548 8.062 1.46e-15 ***
## RedWine$alcohol 0.2927427 0.0173394 16.883 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6479 on 1588 degrees of freedom
## Multiple R-squared: 0.3603, Adjusted R-squared: 0.3562
## F-statistic: 89.43 on 10 and 1588 DF, p-value: < 2.2e-16
The adjusted R squared has stayed essentially the same at 0.356. Now let us remove the variable with the highest remaining p-value, fixed acidity (p = 0.612).
##
## Call:
## lm(formula = RedWine$quality ~ RedWine$volatile.acidity + RedWine$citric.acid +
## RedWine$residual.sugar + RedWine$chlorides + RedWine$free.sulfur.dioxide +
## RedWine$total.sulfur.dioxide + RedWine$pH + RedWine$sulphates +
## RedWine$alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.65598 -0.36981 -0.04546 0.45701 2.03629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6583175 0.4610666 10.103 < 2e-16 ***
## RedWine$volatile.acidity -1.0815017 0.1163884 -9.292 < 2e-16 ***
## RedWine$citric.acid -0.1426059 0.1229263 -1.160 0.2462
## RedWine$residual.sugar 0.0093998 0.0120188 0.782 0.4343
## RedWine$chlorides -1.9615892 0.4030404 -4.867 1.25e-06 ***
## RedWine$free.sulfur.dioxide 0.0045912 0.0021574 2.128 0.0335 *
## RedWine$total.sulfur.dioxide -0.0034134 0.0006982 -4.889 1.12e-06 ***
## RedWine$pH -0.5465105 0.1331940 -4.103 4.28e-05 ***
## RedWine$sulphates 0.8968998 0.1104474 8.121 9.21e-16 ***
## RedWine$alcohol 0.2916526 0.0172016 16.955 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6478 on 1589 degrees of freedom
## Multiple R-squared: 0.3602, Adjusted R-squared: 0.3565
## F-statistic: 99.39 on 9 and 1589 DF, p-value: < 2.2e-16
Again, the adjusted R squared is essentially unchanged (0.3565). Now let us remove the variable with the highest remaining p-value, residual sugar (p = 0.434).
##
## Call:
## lm(formula = RedWine$quality ~ RedWine$volatile.acidity + RedWine$citric.acid +
## RedWine$chlorides + RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide +
## RedWine$pH + RedWine$sulphates + RedWine$alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.66890 -0.37044 -0.04474 0.45697 2.02363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6680876 0.4608410 10.129 < 2e-16 ***
## RedWine$volatile.acidity -1.0736123 0.1159362 -9.260 < 2e-16 ***
## RedWine$citric.acid -0.1295444 0.1217717 -1.064 0.2876
## RedWine$chlorides -1.9494185 0.4026906 -4.841 1.42e-06 ***
## RedWine$free.sulfur.dioxide 0.0047601 0.0021463 2.218 0.0267 *
## RedWine$total.sulfur.dioxide -0.0033658 0.0006954 -4.840 1.42e-06 ***
## RedWine$pH -0.5491501 0.1331350 -4.125 3.90e-05 ***
## RedWine$sulphates 0.8914283 0.1102122 8.088 1.19e-15 ***
## RedWine$alcohol 0.2928780 0.0171280 17.099 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6477 on 1590 degrees of freedom
## Multiple R-squared: 0.3599, Adjusted R-squared: 0.3567
## F-statistic: 111.8 on 8 and 1590 DF, p-value: < 2.2e-16
The adjusted R squared has actually increased slightly to 0.3567. Now let us remove the variable with the highest remaining p-value, citric acid (p = 0.288).
##
## Call:
## lm(formula = RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides +
## RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide +
## RedWine$pH + RedWine$sulphates + RedWine$alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68918 -0.36757 -0.04653 0.46081 2.02954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4300987 0.4029168 10.995 < 2e-16 ***
## RedWine$volatile.acidity -1.0127527 0.1008429 -10.043 < 2e-16 ***
## RedWine$chlorides -2.0178138 0.3975417 -5.076 4.31e-07 ***
## RedWine$free.sulfur.dioxide 0.0050774 0.0021255 2.389 0.017 *
## RedWine$total.sulfur.dioxide -0.0034822 0.0006868 -5.070 4.43e-07 ***
## RedWine$pH -0.4826614 0.1175581 -4.106 4.23e-05 ***
## RedWine$sulphates 0.8826651 0.1099084 8.031 1.86e-15 ***
## RedWine$alcohol 0.2893028 0.0167958 17.225 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6477 on 1591 degrees of freedom
## Multiple R-squared: 0.3595, Adjusted R-squared: 0.3567
## F-statistic: 127.6 on 7 and 1591 DF, p-value: < 2.2e-16
The adjusted R squared is unchanged at 0.3567. The p-values of all remaining variables are now below 0.05, so we stop the elimination based on p-values and proceed with the ANOVA method of elimination.
Note: a stepwise AIC procedure also ends up with the same model as above.
Now let us compare the full model against reduced models using ANOVA and see whether any more variables can be eliminated. The null hypothesis in each test is that the reduced model (without the variable) fits as well as the full model, i.e. that removing the variable does not affect the fit.
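The elimination procedure followed above could be automated roughly as below. This is a sketch under assumptions: `backward_by_p` is a helper written for this example, the data frame `demo` is simulated (two real effects plus two pure-noise columns), and base R's `step()` is used for the AIC comparison, which may differ in detail from the stepwise AIC function used for this report. Passing `data = demo` to `lm()` also gives shorter coefficient names than the `RedWine$...` style seen in the output.

```r
set.seed(7)
n <- 300
# Simulated stand-in: quality depends on two predictors; noise1/noise2 do not matter.
demo <- data.frame(
  alcohol          = rnorm(n, 10.4, 1),
  volatile.acidity = rnorm(n, 0.53, 0.18),
  noise1           = rnorm(n),
  noise2           = rnorm(n)
)
demo$quality <- 3 + 0.3 * demo$alcohol - 1.0 * demo$volatile.acidity +
  rnorm(n, 0, 0.6)

# Repeatedly drop the predictor with the largest p-value until all are <= alpha.
backward_by_p <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response), data = data)
    p <- coef(summary(fit))[, "Pr(>|t|)"][-1]   # drop the intercept
    if (length(predictors) == 1 || max(p) <= alpha) return(fit)
    predictors <- setdiff(predictors, names(p)[which.max(p)])
  }
}

fit_p   <- backward_by_p(demo, "quality")
fit_aic <- step(lm(quality ~ ., data = demo), direction = "backward", trace = 0)

kept_p   <- names(coef(fit_p))[-1]
kept_aic <- names(coef(fit_aic))[-1]
print(kept_p)
print(kept_aic)
```

The two criteria need not always agree: AIC can retain a variable whose p-value is somewhat above 0.05.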
Test on volatile acidity
## Analysis of Variance Table
##
## Model 1: RedWine$quality ~ RedWine$chlorides + RedWine$free.sulfur.dioxide +
## RedWine$total.sulfur.dioxide + RedWine$pH + RedWine$sulphates +
## RedWine$alcohol
## Model 2: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides +
## RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide +
## RedWine$pH + RedWine$sulphates + RedWine$alcohol
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1592 709.85
## 2 1591 667.54 1 42.318 100.86 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 2.015325
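The F value is 100.86, which is far greater than the 5% critical value. For a comparison that drops a single variable, the relevant critical value is F(0.95; 1, 1591), about 3.85; the 2.015 printed above is the value for 7 and 1591 degrees of freedom. Either way, we reject the null hypothesis and do not eliminate volatile acidity.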
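A single full-versus-reduced comparison and its critical value could be computed as in the sketch below. The data frame `demo` is simulated for illustration and the model is deliberately tiny. The sketch also shows that, when one variable is dropped, the partial F statistic equals the square of that variable's t statistic in the full model, so these tests repeat the information already in the coefficient table.

```r
set.seed(11)
n <- 200
# Simulated stand-in: quality depends on alcohol, not on pH.
demo <- data.frame(alcohol = rnorm(n, 10.4, 1), pH = rnorm(n, 3.3, 0.15))
demo$quality <- 2 + 0.35 * demo$alcohol + rnorm(n, 0, 0.6)

reduced <- lm(quality ~ pH, data = demo)            # without alcohol
full    <- lm(quality ~ pH + alcohol, data = demo)  # with alcohol

tab    <- anova(reduced, full)
f_stat <- tab$F[2]

# Critical value at the 5% level: numerator df = number of dropped terms,
# denominator df = residual df of the full model.
f_crit <- qf(0.95, df1 = tab$Df[2], df2 = tab$Res.Df[2])

# For a one-variable comparison, F equals t^2 from the full model.
t_alcohol <- coef(summary(full))["alcohol", "t value"]

print(tab)
print(c(F = f_stat, critical = f_crit, t_squared = t_alcohol^2))

# The critical value for the wine model's one-variable tests.
print(qf(0.95, df1 = 1, df2 = 1591))
```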
Test on chlorides
## Analysis of Variance Table
##
## Model 1: RedWine$quality ~ RedWine$volatile.acidity + RedWine$free.sulfur.dioxide +
## RedWine$total.sulfur.dioxide + RedWine$pH + RedWine$sulphates +
## RedWine$alcohol
## Model 2: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides +
## RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide +
## RedWine$pH + RedWine$sulphates + RedWine$alcohol
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1592 678.35
## 2 1591 667.54 1 10.809 25.763 4.314e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F value is 25.763, which is again far greater than the 5% critical value of about 3.85 (1 and 1591 degrees of freedom). So we reject the null hypothesis and do not eliminate chlorides.
Test on Free sulphur dioxide
## Analysis of Variance Table
##
## Model 1: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides +
## RedWine$total.sulfur.dioxide + RedWine$pH + RedWine$sulphates +
## RedWine$alcohol
## Model 2: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides +
## RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide +
## RedWine$pH + RedWine$sulphates + RedWine$alcohol
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1592 669.93
## 2 1591 667.54 1 2.3941 5.7061 0.01702 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F value is 5.7061, which is greater than the 5% critical value of about 3.85 (1 and 1591 degrees of freedom). So we reject the null hypothesis and do not eliminate free sulphur dioxide.
Test on Total sulphur dioxide
## Analysis of Variance Table
##
## Model 1: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides +
## RedWine$free.sulfur.dioxide + RedWine$pH + RedWine$sulphates +
## RedWine$alcohol
## Model 2: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides +
## RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide +
## RedWine$pH + RedWine$sulphates + RedWine$alcohol
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1592 678.32
## 2 1591 667.54 1 10.787 25.709 4.435e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F value is 25.709, which is again far greater than the 5% critical value of about 3.85 (1 and 1591 degrees of freedom). So we reject the null hypothesis and do not eliminate total sulphur dioxide.
Test on pH
## Analysis of Variance Table
##
## Model 1: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides +
## RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide +
## RedWine$sulphates + RedWine$alcohol
## Model 2: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides +
## RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide +
## RedWine$pH + RedWine$sulphates + RedWine$alcohol
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1592 674.61
## 2 1591 667.54 1 7.0727 16.857 4.235e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F value is 16.857, which is again far greater than the 5% critical value of about 3.85 (1 and 1591 degrees of freedom). So we reject the null hypothesis and do not eliminate pH.
Test on Sulphates
## Analysis of Variance Table
##
## Model 1: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides +
## RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide +
## RedWine$pH + RedWine$alcohol
## Model 2: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides +
## RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide +
## RedWine$pH + RedWine$sulphates + RedWine$alcohol
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1592 694.60
## 2 1591 667.54 1 27.06 64.496 1.865e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F value is 64.496, which is again far greater than the 5% critical value of about 3.85 (1 and 1591 degrees of freedom). So we reject the null hypothesis and do not eliminate sulphates.
Test on Alcohol
## Analysis of Variance Table
##
## Model 1: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides +
## RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide +
## RedWine$pH + RedWine$sulphates
## Model 2: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides +
## RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide +
## RedWine$pH + RedWine$sulphates + RedWine$alcohol
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1592 792.02
## 2 1591 667.54 1 124.48 296.69 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F value is 296.69, which is again far greater than the 5% critical value of about 3.85 (1 and 1591 degrees of freedom). So we reject the null hypothesis and do not eliminate alcohol.
We could consider eliminating free sulphur dioxide, as it has a low F value compared to the others, but the adjusted R squared decreases slightly if we remove it. So let us finalize our model with these seven predictors.
Based on the above analysis, the two most significant variables affecting quality are alcohol and volatile acidity. This can also be seen visually below.
Let us plot the residuals of this model.
There is no pattern observed.
Normal Q-Q plot of the residuals:
Plotting the residuals against each predictor:
All of these plots are satisfactory. Total sulphur dioxide has a funnel-shaped residual plot; however, the effect of this variable on the overall model is very small. So we proceed, satisfied with our X-residual plots.
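The three kinds of residual plot discussed above could be produced along these lines. The data frame `demo` and the one-predictor model are simulated stand-ins for illustration, not the wine model itself.

```r
set.seed(5)
n <- 200
# Simulated stand-in data with a single predictor.
demo <- data.frame(alcohol = rnorm(n, 10.4, 1))
demo$quality <- 2 + 0.35 * demo$alcohol + rnorm(n, 0, 0.6)

fit <- lm(quality ~ alcohol, data = demo)
res <- resid(fit)

op <- par(mfrow = c(1, 3))
# 1. Residuals vs fitted values: look for curvature or a funnel shape.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")
# 2. Normal Q-Q plot: points near the line suggest roughly normal residuals.
qqnorm(res)
qqline(res, col = "red")
# 3. Residuals vs a predictor: look for remaining structure in that variable.
plot(demo$alcohol, res, xlab = "alcohol", ylab = "Residuals")
abline(h = 0, col = "red")
par(op)
```

For a model with several predictors, the third plot would be repeated once per predictor.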
Plotting leverage against fitted values:
The above plot shows a non-linear (parabolic) relationship between leverage and fitted values, so we apply a log transformation to the response and to the predictors. Refitting the model on the log-transformed variables, we get the following:
##
## Call:
## lm(formula = RedWine$Logquality ~ RedWine$Logvolatile.acidity +
## RedWine$Logchlorides + RedWine$Logfree.sulfur.dioxide + RedWine$Logtotal.sulfur.dioxide +
## RedWine$LogpH + RedWine$Logsulphates + RedWine$Logalcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62682 -0.05980 -0.00302 0.08249 0.31602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.813884 0.104769 7.768 1.42e-14 ***
## RedWine$Logvolatile.acidity -0.085475 0.009494 -9.003 < 2e-16 ***
## RedWine$Logchlorides -0.047306 0.010435 -4.533 6.24e-06 ***
## RedWine$Logfree.sulfur.dioxide 0.021078 0.007187 2.933 0.003407 **
## RedWine$Logtotal.sulfur.dioxide -0.026936 0.007134 -3.776 0.000165 ***
## RedWine$LogpH -0.290110 0.071232 -4.073 4.87e-05 ***
## RedWine$Logsulphates 0.134798 0.014873 9.063 < 2e-16 ***
## RedWine$Logalcohol 0.503209 0.034405 14.626 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1186 on 1591 degrees of freedom
## Multiple R-squared: 0.3389, Adjusted R-squared: 0.336
## F-statistic: 116.5 on 7 and 1591 DF, p-value: < 2.2e-16
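A log-log fit of this kind could be set up as sketched below. The natural logarithm is an assumption here (the report does not say which base was used), the data frame `demo` is simulated, and the transformation is written inside the formula rather than stored in separate `Log...` columns.

```r
set.seed(13)
n <- 200
# Simulated stand-in: log(quality) is linear in log(alcohol) with slope 0.5.
demo <- data.frame(alcohol = runif(n, 8.5, 14.5))
demo$quality <- exp(0.5 + 0.5 * log(demo$alcohol) + rnorm(n, 0, 0.1))

# Transformations can be applied directly in the model formula.
log_fit <- lm(log(quality) ~ log(alcohol), data = demo)
slope   <- coef(log_fit)[["log(alcohol)"]]

# Predictions come out on the log scale; exp() maps them back.
pred_quality <- exp(predict(log_fit, newdata = data.frame(alcohol = 10)))

print(slope)
print(pred_quality)
```

Two cautions apply to the real data: the logarithm is undefined for zero values (citric acid has a minimum of 0, though it is not in this model), and an R squared computed on the log scale is not directly comparable to one computed on the original scale.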
Now the residual plots against each predictor. The plotting code was printed as a string instead of being run; unescaped, it reads:
    par(mfrow=c(4,2))
    plot(RedWine$volatile.acidity, WineQuality5$residuals)
    abline(h=0,col="red")
    plot(RedWine$chlorides, WineQuality5$residuals)
    abline(h=0,col="red")
    plot(RedWine$free.sulfur.dioxide, WineQuality5$residuals)
    abline(h=0,col="red")
    plot(RedWine$total.sulfur.dioxide, WineQuality5$residuals)
    abline(h=0,col="red")
    plot(RedWine$pH, WineQuality5$residuals)
    abline(h=0,col="red")
    plot(RedWine$sulphates, WineQuality5$residuals)
    abline(h=0,col="red")
    plot(RedWine$alcohol, WineQuality5$residuals)
    abline(h=0,col="red")
Cook’s distance:
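Cook's distance for every observation can be obtained from a fitted model as sketched below. The data are simulated with one deliberately influential point, and the cutoff of 4/n is a common rule of thumb assumed for this example; the report does not state which cutoff was used.

```r
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, 0, 0.5)

# Plant one point that is far out in x and far off the line.
x[n] <- 6
y[n] <- -10

fit <- lm(y ~ x)
d   <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption, other cutoffs are also in use).
cutoff  <- 4 / n
flagged <- which(d > cutoff)

plot(d, type = "h", xlab = "Observation", ylab = "Cook's distance")
abline(h = cutoff, lty = 2, col = "red")
print(flagged)
```

Points above the cutoff are candidates for closer inspection; refitting without them shows how much they actually change the coefficients.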
Checking for multicollinearity in the data using variance inflation factors (VIF):
## RedWine$Logvolatile.acidity RedWine$Logchlorides
## 1.280896 1.324482
## RedWine$Logfree.sulfur.dioxide RedWine$Logtotal.sulfur.dioxide
## 2.755839 2.877134
## RedWine$LogpH RedWine$Logsulphates
## 1.250469 1.256140
## RedWine$Logalcohol
## 1.315354
None of the VIF values is greater than 5. Hence there appear to be no multicollinearity issues, and we proceed to the next step.
Centering the variables, we get:
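The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A base-R sketch of this definition follows; `vif_manual` is a helper written for this example and the data are simulated. The output above was presumably produced by a packaged function such as `car::vif`, which is an assumption.

```r
# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the rest.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    r2 <- summary(lm(reformulate(others, v), data = data))$r.squared
    1 / (1 - r2)
  })
}

set.seed(9)
n <- 300
free_so2 <- rnorm(n, 16, 5)
# Simulated stand-in: total SO2 is built from free SO2, alcohol is independent.
demo <- data.frame(
  free.so2  = free_so2,
  total.so2 = 2 * free_so2 + rnorm(n, 0, 8),
  alcohol   = rnorm(n, 10.4, 1)
)

vifs <- vif_manual(demo, c("free.so2", "total.so2", "alcohol"))
print(round(vifs, 2))
```

In this simulation the two related sulfur dioxide columns get clearly higher VIFs than the independent alcohol column, mirroring the pattern in the output above.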
The final fitted linear regression model equation, with coefficients rounded to four decimal places, is:
Quality = 4.4301 - 1.0128 × volatile.acidity - 2.0178 × chlorides + 0.0051 × free.sulfur.dioxide - 0.0035 × total.sulfur.dioxide - 0.4827 × pH + 0.8827 × sulphates + 0.2893 × alcohol
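As a worked example, the equation can be evaluated for a hypothetical wine whose measurements equal the column medians from the summary table. The coefficients are copied from the seven-predictor model output above; `predict_quality` and `median_wine` are names introduced only for this illustration.

```r
# Coefficients copied from the seven-predictor model summary.
coefs <- c(
  intercept            =  4.4300987,
  volatile.acidity     = -1.0127527,
  chlorides            = -2.0178138,
  free.sulfur.dioxide  =  0.0050774,
  total.sulfur.dioxide = -0.0034822,
  pH                   = -0.4826614,
  sulphates            =  0.8826651,
  alcohol              =  0.2893028
)

# Intercept plus the sum of coefficient * value over the named predictors.
predict_quality <- function(x) {
  coefs[["intercept"]] + sum(coefs[names(x)] * x)
}

# A hypothetical wine at the median of each predictor (from the summary table).
median_wine <- c(
  volatile.acidity = 0.52, chlorides = 0.079,
  free.sulfur.dioxide = 14, total.sulfur.dioxide = 38,
  pH = 3.31, sulphates = 0.62, alcohol = 10.2
)

pred <- predict_quality(median_wine)
print(round(pred, 2))   # about 5.58
```

The predicted score of about 5.58 is close to the mean quality of 5.636 reported in the summary, which is what one would expect for a wine that is typical in every predictor.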
This model explains about 36% of the variation in the quality ratings (R squared = 0.3595). This seems like a small number, but quality rankings given by wine tasters are subjective and not standardized. With more data, a better-fitting model could perhaps be obtained.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.