We have a dataset containing data about 1599 samples of Portuguese red wine, “Vinho Verde”. This multivariate data shows the physicochemical composition and the corresponding quality of each sample, rated by wine tasters on a scale from 0 (bad quality) to 10 (excellent quality). The ultimate goal is to fit a linear model to this data and find out which factors contribute to a good-quality red wine, so that we don't have to rely on humans for assessing the quality of wine.

Data Preparation, Loading and Data Exploration

Now, let us extract basic information about our dataset with descriptive statistics, starting with the summary function.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Observations from the above summary:

  1. There are no null values in the data. Also, the dataset contains no negative values and has only real numbers.

  2. There are 12 columns in the data set. The data type of all these columns is numeric.

  3. Data in all columns except the ‘quality’ column is continuous. Quality column has discrete data with only integer values ranging from 3 to 8.

  4. From the mean, min, max and third-quartile (75th percentile) values in the summary, we suspect outliers in some columns: residual.sugar (max = 15.5), total.sulfur.dioxide (max = 289), sulphates (max = 2), etc. Whether these really are outliers will be checked in subsequent steps.
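The checks behind these observations can be sketched in a few lines of R. The snippet uses a tiny made-up data frame, `wine_demo`, as a stand-in (its values are illustrative and not taken from the dataset); with the real data the same calls would presumably be applied to `RedWine`.

```r
# Tiny illustrative stand-in for the wine data (values are made up).
wine_demo <- data.frame(
  alcohol   = c(9.4, 9.8, 10.5, 11.2),
  sulphates = c(0.56, 0.68, 0.65, 0.58),
  quality   = c(5L, 5L, 6L, 7L)
)

# Count missing values per column; all zeros means no NAs.
na_counts <- colSums(is.na(wine_demo))

# Confirm that every column is numeric.
all_numeric <- all(vapply(wine_demo, is.numeric, logical(1)))

# Check for negative entries anywhere in the table.
any_negative <- any(wine_demo < 0)

# Identify columns that hold only whole numbers (discrete ratings).
is_whole <- vapply(wine_demo, function(col) all(col == round(col)), logical(1))

print(na_counts)
print(is_whole)
```

In this toy frame only `quality` is reported as whole-numbered, mirroring observation 3.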

Now, let us check the distribution of the data using histograms.

From the above set of histograms, we can observe the data distribution. Fixed acidity is right skewed; volatile acidity is also right skewed, but with 2 modes. A histogram with a smaller bin width shows bimodal characteristics. This means that volatile acidity might be concentrated in 2 groups, although it could also be an artifact of the binning, considering that our data is not too big. Similarly, the distribution of citric acid has 3 modes. The distribution of residual sugar has a lot of extreme values in its right tail. Some of these can be outliers.

Chlorides also shows a lot of extreme values which can be outliers. The distributions of free sulfur dioxide and total sulfur dioxide are also right skewed. Density is approximately normally distributed, which is good.

pH is also approximately normally distributed. Sulphates has a lot of extreme values, and alcohol is also right skewed. We don't need to consider the quality histogram, as we have chosen quality as the response. However, as a general comment, 5 and 6 are the most commonly given ratings for our wine. Also, our wine only has ratings from 3 to 8.

Our data doesn't have any missing values or any inconsistent or technically incorrect entries. The next step in data cleaning involves identifying outliers. To detect the outliers, let us use Tukey's box-and-whisker method (box plots). We flag any point lying more than 1.5 times the IQR (interquartile range) below the first quartile or above the third quartile as a possible outlier.
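As a minimal sketch of Tukey's rule, the helper below flags values outside the fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. The function name `tukey_outliers` and the vector `sugar_demo` are hypothetical; the numbers are made up, apart from 15.5, which mirrors the residual.sugar maximum in the summary.

```r
# Flag values beyond Tukey's fences: Q1 - k*IQR and Q3 + k*IQR.
tukey_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Illustrative values only; 15.5 mirrors the residual.sugar maximum.
sugar_demo <- c(1.9, 2.0, 2.2, 2.3, 2.6, 2.1, 2.4, 15.5)
flagged <- tukey_outliers(sugar_demo)
print(sugar_demo[flagged])
```

This is the same rule that `boxplot()` uses by default to draw individual points beyond the whiskers, so the flagged values should match what the box plots show.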

As seen in these box plots, there is some degree of right skewness in fixed acidity, volatile acidity, free.sulfur.dioxide and total.sulfur.dioxide. The box plots show that there can be some outliers in residual.sugar, chlorides and sulphates. In particular, chlorides has 2 outliers at the far end of the distribution (around 0.6). Total sulfur dioxide also has 2 outliers. However, we do not know whether these data points are bad measurements or correct ones. As we are unsure, let's carry these ‘potential outliers’ over to the subsequent analysis and try to eliminate them at a later stage by calculating Cook's distance.

We can also look at the Q-Q plots.

All of these plots show right skewness in the data.

Let us see their pair plots and try to see if there are any correlations.

From the above scatterplot and correlation matrix, we can see positive correlation between:

  1. Citric acid and fixed acidity
  2. Total sulfur dioxide and free sulfur dioxide
  3. Density and fixed acidity

and negative correlation between:

  1. Citric acid and volatile acidity
  2. pH and fixed acidity
  3. pH and citric acid
  4. Alcohol and density

Pairs with correlation magnitude above 0.5, in descending order (high to low): pH & fixed acidity, citric acid & fixed acidity, density & fixed acidity, citric acid & volatile acidity, pH & free sulfur dioxide, quality & alcohol, alcohol & density.
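A ranked list like the one above can be produced from a correlation matrix roughly as follows. This is a sketch on made-up data (`demo`, with hypothetical columns `x`, `y`, `z`); on the real data, `cor(RedWine)` would presumably take the place of `cor(demo)`.

```r
# Made-up illustrative data: y tracks x exactly, z alternates in sign.
demo <- data.frame(x = 1:10, y = 2 * (1:10), z = rep(c(-1, 1), 5))

cor_mat <- cor(demo)

# Keep each pair once (upper triangle) and filter on absolute correlation.
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > 0.5, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Sort from strongest to weakest, regardless of sign.
strong_pairs <- strong_pairs[order(-abs(strong_pairs$r)), ]
print(strong_pairs)
```

Filtering on the absolute value keeps strong negative correlations (such as pH and fixed acidity) in the list alongside the positive ones.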

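Now let us fit a linear model to our data using a stepwise regression method. We chose ‘backward elimination’: starting from the full model, we remove insignificant predictors one at a time until we have an adequate model that explains the major part of the variation in our data. Each elimination is judged with a significance test on the coefficient (for a single coefficient, the t test in the model summary is equivalent to a partial F test).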
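The procedure can be sketched as a small loop. Everything here is illustrative: `backward_eliminate` is a hypothetical helper, and `demo` is simulated stand-in data with two real effects and two pure-noise columns, not the wine measurements. Passing `data =` to `lm()` also keeps coefficient names short, in contrast to the `RedWine$...` terms in the printed calls.

```r
set.seed(42)  # simulated stand-in data, not the wine measurements
n <- 300
demo <- data.frame(
  alcohol   = rnorm(n, mean = 10.4, sd = 1.0),
  sulphates = rnorm(n, mean = 0.66, sd = 0.17),
  noise_a   = rnorm(n),
  noise_b   = rnorm(n)
)
demo$quality <- 2 + 0.3 * demo$alcohol + 0.9 * demo$sulphates + rnorm(n, sd = 0.3)

# Repeatedly drop the predictor with the largest p-value above alpha.
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    tab <- summary(fit)$coefficients[-1, , drop = FALSE]  # skip intercept
    pvals <- setNames(tab[, "Pr(>|t|)"], rownames(tab))
    worst <- names(which.max(pvals))
    # Stop when everything left is significant or one predictor remains.
    if (pvals[[worst]] <= alpha || length(predictors) == 1) return(fit)
    predictors <- setdiff(predictors, worst)
  }
}

final_fit <- backward_eliminate(demo, response = "quality")
print(names(coef(final_fit)))
```

The model is refitted after every removal because p-values of the remaining predictors change once a correlated variable is dropped.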

Test on individual regression coefficients

This is our initial model, with quality as the response and all the other variables as predictors. We can eliminate variables either with the step AIC method or by inspecting the test statistics and p-values of the coefficients.

## 
## Call:
## lm(formula = RedWine$quality ~ RedWine$fixed.acidity + RedWine$volatile.acidity + 
##     RedWine$citric.acid + RedWine$residual.sugar + RedWine$chlorides + 
##     RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide + 
##     RedWine$density + RedWine$pH + RedWine$sulphates + RedWine$alcohol)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68911 -0.36652 -0.04699  0.45202  2.02498 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   2.197e+01  2.119e+01   1.036   0.3002    
## RedWine$fixed.acidity         2.499e-02  2.595e-02   0.963   0.3357    
## RedWine$volatile.acidity     -1.084e+00  1.211e-01  -8.948  < 2e-16 ***
## RedWine$citric.acid          -1.826e-01  1.472e-01  -1.240   0.2150    
## RedWine$residual.sugar        1.633e-02  1.500e-02   1.089   0.2765    
## RedWine$chlorides            -1.874e+00  4.193e-01  -4.470 8.37e-06 ***
## RedWine$free.sulfur.dioxide   4.361e-03  2.171e-03   2.009   0.0447 *  
## RedWine$total.sulfur.dioxide -3.265e-03  7.287e-04  -4.480 8.00e-06 ***
## RedWine$density              -1.788e+01  2.163e+01  -0.827   0.4086    
## RedWine$pH                   -4.137e-01  1.916e-01  -2.159   0.0310 *  
## RedWine$sulphates             9.163e-01  1.143e-01   8.014 2.13e-15 ***
## RedWine$alcohol               2.762e-01  2.648e-02  10.429  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3561 
## F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16

The adjusted R squared is 0.356, which is not high but is acceptable. Let's refit the model without the variable density, which has the highest p-value (0.4086).

## 
## Call:
## lm(formula = RedWine$quality ~ RedWine$fixed.acidity + RedWine$volatile.acidity + 
##     RedWine$citric.acid + RedWine$residual.sugar + RedWine$chlorides + 
##     RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide + 
##     RedWine$pH + RedWine$sulphates + RedWine$alcohol)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.67204 -0.36527 -0.04523  0.45628  2.03894 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   4.4538341  0.6125783   7.271 5.59e-13 ***
## RedWine$fixed.acidity         0.0081441  0.0160586   0.507  0.61212    
## RedWine$volatile.acidity     -1.0964449  0.1200866  -9.130  < 2e-16 ***
## RedWine$citric.acid          -0.1836098  0.1471561  -1.248  0.21232    
## RedWine$residual.sugar        0.0089507  0.0120542   0.743  0.45787    
## RedWine$chlorides            -1.9067341  0.4173928  -4.568 5.30e-06 ***
## RedWine$free.sulfur.dioxide   0.0045147  0.0021631   2.087  0.03704 *  
## RedWine$total.sulfur.dioxide -0.0033120  0.0007264  -4.560 5.52e-06 ***
## RedWine$pH                   -0.5042762  0.1571117  -3.210  0.00136 ** 
## RedWine$sulphates             0.8928974  0.1107548   8.062 1.46e-15 ***
## RedWine$alcohol               0.2927427  0.0173394  16.883  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6479 on 1588 degrees of freedom
## Multiple R-squared:  0.3603, Adjusted R-squared:  0.3562 
## F-statistic: 89.43 on 10 and 1588 DF,  p-value: < 2.2e-16

The adjusted R squared stays essentially the same at 0.356. Now let's remove the variable with the highest remaining p-value, fixed acidity (p = 0.612).

## 
## Call:
## lm(formula = RedWine$quality ~ RedWine$volatile.acidity + RedWine$citric.acid + 
##     RedWine$residual.sugar + RedWine$chlorides + RedWine$free.sulfur.dioxide + 
##     RedWine$total.sulfur.dioxide + RedWine$pH + RedWine$sulphates + 
##     RedWine$alcohol)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.65598 -0.36981 -0.04546  0.45701  2.03629 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   4.6583175  0.4610666  10.103  < 2e-16 ***
## RedWine$volatile.acidity     -1.0815017  0.1163884  -9.292  < 2e-16 ***
## RedWine$citric.acid          -0.1426059  0.1229263  -1.160   0.2462    
## RedWine$residual.sugar        0.0093998  0.0120188   0.782   0.4343    
## RedWine$chlorides            -1.9615892  0.4030404  -4.867 1.25e-06 ***
## RedWine$free.sulfur.dioxide   0.0045912  0.0021574   2.128   0.0335 *  
## RedWine$total.sulfur.dioxide -0.0034134  0.0006982  -4.889 1.12e-06 ***
## RedWine$pH                   -0.5465105  0.1331940  -4.103 4.28e-05 ***
## RedWine$sulphates             0.8968998  0.1104474   8.121 9.21e-16 ***
## RedWine$alcohol               0.2916526  0.0172016  16.955  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6478 on 1589 degrees of freedom
## Multiple R-squared:  0.3602, Adjusted R-squared:  0.3565 
## F-statistic: 99.39 on 9 and 1589 DF,  p-value: < 2.2e-16

Again, the adjusted R squared is essentially unchanged (0.3565). Now let's remove the variable with the highest remaining p-value, residual sugar (p = 0.434).

## 
## Call:
## lm(formula = RedWine$quality ~ RedWine$volatile.acidity + RedWine$citric.acid + 
##     RedWine$chlorides + RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide + 
##     RedWine$pH + RedWine$sulphates + RedWine$alcohol)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.66890 -0.37044 -0.04474  0.45697  2.02363 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   4.6680876  0.4608410  10.129  < 2e-16 ***
## RedWine$volatile.acidity     -1.0736123  0.1159362  -9.260  < 2e-16 ***
## RedWine$citric.acid          -0.1295444  0.1217717  -1.064   0.2876    
## RedWine$chlorides            -1.9494185  0.4026906  -4.841 1.42e-06 ***
## RedWine$free.sulfur.dioxide   0.0047601  0.0021463   2.218   0.0267 *  
## RedWine$total.sulfur.dioxide -0.0033658  0.0006954  -4.840 1.42e-06 ***
## RedWine$pH                   -0.5491501  0.1331350  -4.125 3.90e-05 ***
## RedWine$sulphates             0.8914283  0.1102122   8.088 1.19e-15 ***
## RedWine$alcohol               0.2928780  0.0171280  17.099  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6477 on 1590 degrees of freedom
## Multiple R-squared:  0.3599, Adjusted R-squared:  0.3567 
## F-statistic: 111.8 on 8 and 1590 DF,  p-value: < 2.2e-16

The adjusted R squared has actually increased slightly, to 0.3567. Now let's remove the variable with the highest remaining p-value, citric acid (p = 0.288).

## 
## Call:
## lm(formula = RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides + 
##     RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide + 
##     RedWine$pH + RedWine$sulphates + RedWine$alcohol)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68918 -0.36757 -0.04653  0.46081  2.02954 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   4.4300987  0.4029168  10.995  < 2e-16 ***
## RedWine$volatile.acidity     -1.0127527  0.1008429 -10.043  < 2e-16 ***
## RedWine$chlorides            -2.0178138  0.3975417  -5.076 4.31e-07 ***
## RedWine$free.sulfur.dioxide   0.0050774  0.0021255   2.389    0.017 *  
## RedWine$total.sulfur.dioxide -0.0034822  0.0006868  -5.070 4.43e-07 ***
## RedWine$pH                   -0.4826614  0.1175581  -4.106 4.23e-05 ***
## RedWine$sulphates             0.8826651  0.1099084   8.031 1.86e-15 ***
## RedWine$alcohol               0.2893028  0.0167958  17.225  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6477 on 1591 degrees of freedom
## Multiple R-squared:  0.3595, Adjusted R-squared:  0.3567 
## F-statistic: 127.6 on 7 and 1591 DF,  p-value: < 2.2e-16

The adjusted R squared remains the same (0.3567). The p-values of all remaining variables are now below 0.05, so we stop the elimination based on p-values and proceed with the ANOVA method of elimination.

Note: following the step AIC procedure, we end up with the same model as above.
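The reason adjusted R squared barely moves while R squared drops slightly is the penalty for the number of predictors p: adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1). As a quick check, the sketch below recomputes it from the rounded R squared values printed in the summaries (the helper name `adjusted_r2` is hypothetical).

```r
# Adjusted R-squared from R-squared, sample size n and predictor count p.
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n_obs <- 1599
full_adj    <- adjusted_r2(0.3606, n_obs, p = 11)  # initial model
reduced_adj <- adjusted_r2(0.3595, n_obs, p = 7)   # after elimination

print(round(c(full = full_adj, reduced = reduced_adj), 4))
```

Because the inputs are rounded to four decimals, the results agree with the printed 0.3561 and 0.3567 only up to rounding in the last digit.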

Global Test of Model Adequacy (ANOVA)

Now let's compare the full and reduced models with ANOVA and see whether any more variables can be eliminated. For each variable, the null hypothesis is that the model without the variable (Model 1) fits as well as the model with it (Model 2), i.e. that removing the variable will not affect the result.
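A comparison of this kind can be sketched with `anova()` on two nested `lm` fits. The data below is simulated stand-in data (the names `demo`, `full` and `reduced` are hypothetical); the sketch also checks the known identity that, when a single term is tested, the partial F statistic equals the square of that term's t statistic in the full model.

```r
set.seed(7)  # simulated stand-in data, not the wine measurements
n <- 200
demo <- data.frame(alcohol = rnorm(n, 10.4, 1), pH = rnorm(n, 3.3, 0.15))
demo$quality <- 4 + 0.3 * demo$alcohol - 0.5 * demo$pH + rnorm(n, sd = 0.6)

full    <- lm(quality ~ alcohol + pH, data = demo)
reduced <- lm(quality ~ alcohol, data = demo)  # pH removed

# Partial F test: does adding pH reduce the residual sum of squares enough?
cmp <- anova(reduced, full)
print(cmp)

f_stat <- cmp$F[2]
t_stat <- summary(full)$coefficients["pH", "t value"]

# Critical value at the 5% level for 1 and n - 3 degrees of freedom.
crit <- qf(0.95, df1 = 1, df2 = df.residual(full))
print(c(F = f_stat, t_squared = t_stat^2, critical = crit))
```

The identity explains why the p-values in the ANOVA tables match those in the final model summary.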

Test on volatile acidity

## Analysis of Variance Table
## 
## Model 1: RedWine$quality ~ RedWine$chlorides + RedWine$free.sulfur.dioxide + 
##     RedWine$total.sulfur.dioxide + RedWine$pH + RedWine$sulphates + 
##     RedWine$alcohol
## Model 2: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides + 
##     RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide + 
##     RedWine$pH + RedWine$sulphates + RedWine$alcohol
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1592 709.85                                  
## 2   1591 667.54  1    42.318 100.86 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 2.015325

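The F value is 100.86, which is far greater than the printed critical value of 2.015 (5% significance level). Note that 2.015 is the F quantile for 7 and 1591 degrees of freedom; for a test on a single coefficient the relevant quantile, with 1 and 1591 degrees of freedom, is about 3.85, and the conclusion is the same. So we reject the null hypothesis and do not eliminate volatile acidity.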
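The F statistic in the table can be reproduced by hand from the printed sums of squares: F = (Sum of Sq / q) / (RSS_full / df_full), where q is the number of coefficients tested. The sketch below plugs in the values from the volatile acidity table and computes the critical value for both 1 and 7 numerator degrees of freedom; the variable names are hypothetical.

```r
rss_full <- 667.54  # residual sum of squares of Model 2 (with the variable)
df_full  <- 1591    # residual degrees of freedom of Model 2
sum_sq   <- 42.318  # drop in RSS when volatile acidity is added
q        <- 1       # number of coefficients being tested

# Partial F statistic.
f_stat <- (sum_sq / q) / (rss_full / df_full)

# 5% critical values: one tested term versus seven numerator df.
crit_one_term <- qf(0.95, df1 = q, df2 = df_full)
crit_seven    <- qf(0.95, df1 = 7, df2 = df_full)

# Upper-tail p-value of the observed statistic.
p_value <- pf(f_stat, q, df_full, lower.tail = FALSE)

print(c(F = f_stat, crit_1 = crit_one_term, crit_7 = crit_seven))
```

Either critical value leads to the same decision for every variable tested in this section, since the smallest F value (5.7061, for free sulfur dioxide) exceeds both.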

Test on chlorides

## Analysis of Variance Table
## 
## Model 1: RedWine$quality ~ RedWine$volatile.acidity + RedWine$free.sulfur.dioxide + 
##     RedWine$total.sulfur.dioxide + RedWine$pH + RedWine$sulphates + 
##     RedWine$alcohol
## Model 2: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides + 
##     RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide + 
##     RedWine$pH + RedWine$sulphates + RedWine$alcohol
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1592 678.35                                  
## 2   1591 667.54  1    10.809 25.763 4.314e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F value is 25.763, which is again far greater than the critical value (2.015 as printed above; about 3.85 for 1 and 1591 degrees of freedom). So we reject the null hypothesis and do not eliminate chlorides.

Test on Free sulphur dioxide

## Analysis of Variance Table
## 
## Model 1: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides + 
##     RedWine$total.sulfur.dioxide + RedWine$pH + RedWine$sulphates + 
##     RedWine$alcohol
## Model 2: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides + 
##     RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide + 
##     RedWine$pH + RedWine$sulphates + RedWine$alcohol
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1   1592 669.93                              
## 2   1591 667.54  1    2.3941 5.7061 0.01702 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F value is 5.7061, which is again greater than the critical value (2.015 as printed above; about 3.85 for 1 and 1591 degrees of freedom). So we reject the null hypothesis and do not eliminate free sulfur dioxide.

Test on Total sulphur dioxide

## Analysis of Variance Table
## 
## Model 1: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides + 
##     RedWine$free.sulfur.dioxide + RedWine$pH + RedWine$sulphates + 
##     RedWine$alcohol
## Model 2: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides + 
##     RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide + 
##     RedWine$pH + RedWine$sulphates + RedWine$alcohol
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1592 678.32                                  
## 2   1591 667.54  1    10.787 25.709 4.435e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F value is 25.709, which is again far greater than the critical value (2.015 as printed above; about 3.85 for 1 and 1591 degrees of freedom). So we reject the null hypothesis and do not eliminate total sulfur dioxide.

Test on pH

## Analysis of Variance Table
## 
## Model 1: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides + 
##     RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide + 
##     RedWine$sulphates + RedWine$alcohol
## Model 2: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides + 
##     RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide + 
##     RedWine$pH + RedWine$sulphates + RedWine$alcohol
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1592 674.61                                  
## 2   1591 667.54  1    7.0727 16.857 4.235e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F value is 16.857, which is again far greater than the critical value (2.015 as printed above; about 3.85 for 1 and 1591 degrees of freedom). So we reject the null hypothesis and do not eliminate pH.

Test on Sulphates

## Analysis of Variance Table
## 
## Model 1: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides + 
##     RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide + 
##     RedWine$pH + RedWine$alcohol
## Model 2: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides + 
##     RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide + 
##     RedWine$pH + RedWine$sulphates + RedWine$alcohol
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1592 694.60                                  
## 2   1591 667.54  1     27.06 64.496 1.865e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F value is 64.496, which is again far greater than the critical value (2.015 as printed above; about 3.85 for 1 and 1591 degrees of freedom). So we reject the null hypothesis and do not eliminate sulphates.

Test on Alcohol

## Analysis of Variance Table
## 
## Model 1: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides + 
##     RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide + 
##     RedWine$pH + RedWine$sulphates
## Model 2: RedWine$quality ~ RedWine$volatile.acidity + RedWine$chlorides + 
##     RedWine$free.sulfur.dioxide + RedWine$total.sulfur.dioxide + 
##     RedWine$pH + RedWine$sulphates + RedWine$alcohol
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1592 792.02                                  
## 2   1591 667.54  1    124.48 296.69 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F value is 296.69, which is again far greater than the critical value (2.015 as printed above; about 3.85 for 1 and 1591 degrees of freedom). So we reject the null hypothesis and do not eliminate alcohol.

We could eliminate free sulfur dioxide, as it has a low F value compared to the others, but the adjusted R squared decreases slightly if we remove it. So let us finalize our model.

Based on the above analysis, the two most significant variables affecting quality are alcohol and volatile acidity. This can also be seen visually below.

Linear model: Quality = 4.4301 - 1.0128 volatile.acidity - 2.0178 chlorides + 0.0051 free.sulfur.dioxide - 0.0035 total.sulfur.dioxide - 0.4827 pH + 0.8827 sulphates + 0.2893 alcohol

Model Adequacy Checking

Let's plot the residuals of this model against its fitted values.

There is no pattern observed.

Normal probability plot of the residuals:

Plotting the residuals against each predictor:

All of these plots are satisfactory. Total sulfur dioxide has a funnel-shaped residual plot; however, the effect of this variable on the overall model is very small. So we proceed, satisfied with our residual-versus-predictor plots.
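The three kinds of diagnostic plots described in this section can be sketched as follows. The fit here uses simulated stand-in data with a single predictor (`demo` and `fit` are hypothetical names); for the real model, one residual-versus-predictor panel would be drawn per retained variable.

```r
set.seed(3)  # simulated stand-in data, not the wine measurements
n <- 150
demo <- data.frame(alcohol = rnorm(n, 10.4, 1))
demo$quality <- 2.5 + 0.3 * demo$alcohol + rnorm(n, sd = 0.6)
fit <- lm(quality ~ alcohol, data = demo)

res <- resid(fit)

par(mfrow = c(1, 3))

# 1. Residuals vs fitted values: look for curvature or a funnel shape.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")

# 2. Normal Q-Q plot of the residuals: points should follow the line.
qqnorm(res)
qqline(res, col = "red")

# 3. Residuals vs a predictor: spread should be roughly constant.
plot(demo$alcohol, res, xlab = "alcohol", ylab = "Residuals")
abline(h = 0, col = "red")
```

A funnel shape in the third kind of panel, as noted for total sulfur dioxide, suggests that the residual variance changes with the predictor.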

Plotting leverage against fitted values:

The above plot shows a non-linear (parabolic) relationship between leverage and fitted values, so we try a log transformation.

Applying the log transformation to the response and the predictors, we get the following:

## 
## Call:
## lm(formula = RedWine$Logquality ~ RedWine$Logvolatile.acidity + 
##     RedWine$Logchlorides + RedWine$Logfree.sulfur.dioxide + RedWine$Logtotal.sulfur.dioxide + 
##     RedWine$LogpH + RedWine$Logsulphates + RedWine$Logalcohol)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62682 -0.05980 -0.00302  0.08249  0.31602 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      0.813884   0.104769   7.768 1.42e-14 ***
## RedWine$Logvolatile.acidity     -0.085475   0.009494  -9.003  < 2e-16 ***
## RedWine$Logchlorides            -0.047306   0.010435  -4.533 6.24e-06 ***
## RedWine$Logfree.sulfur.dioxide   0.021078   0.007187   2.933 0.003407 ** 
## RedWine$Logtotal.sulfur.dioxide -0.026936   0.007134  -3.776 0.000165 ***
## RedWine$LogpH                   -0.290110   0.071232  -4.073 4.87e-05 ***
## RedWine$Logsulphates             0.134798   0.014873   9.063  < 2e-16 ***
## RedWine$Logalcohol               0.503209   0.034405  14.626  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1186 on 1591 degrees of freedom
## Multiple R-squared:  0.3389, Adjusted R-squared:  0.336 
## F-statistic: 116.5 on 7 and 1591 DF,  p-value: < 2.2e-16
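The `Log`-prefixed columns in the call above could be created along these lines. This is a sketch on a few made-up rows (`RedWine_demo` is a hypothetical stand-in); the naming scheme simply copies the one visible in the printed formula, and the natural logarithm is an assumption since the base is not stated.

```r
# Illustrative rows only (made-up values in plausible ranges).
RedWine_demo <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.28),
  alcohol          = c(9.4, 9.8, 11.2),
  quality          = c(5, 5, 6)
)

vars <- names(RedWine_demo)

# log() needs strictly positive input, so check before transforming.
stopifnot(all(RedWine_demo[vars] > 0))

# Add one "Log<name>" column per variable (natural log assumed).
log_cols <- lapply(RedWine_demo[vars], log)
names(log_cols) <- paste0("Log", vars)
RedWine_demo <- cbind(RedWine_demo, as.data.frame(log_cols))

print(names(RedWine_demo))
```

The positivity check matters for this dataset: citric.acid has a minimum of 0 in the summary, so it could not be log-transformed directly, although it is not among the variables in the final model. In a log-log model each coefficient is read as an elasticity, so the alcohol coefficient of about 0.50 would correspond to roughly a 0.5% change in quality per 1% change in alcohol.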

Now the time of curve

Generate QQ plot

Residual plots against regressors

    par(mfrow=c(4,2))
    plot(RedWine$volatile.acidity, WineQuality5$residuals)
    abline(h=0,col="red")
    plot(RedWine$chlorides, WineQuality5$residuals)
    abline(h=0,col="red")
    plot(RedWine$free.sulfur.dioxide, WineQuality5$residuals)
    abline(h=0,col="red")
    plot(RedWine$total.sulfur.dioxide, WineQuality5$residuals)
    abline(h=0,col="red")
    plot(RedWine$pH, WineQuality5$residuals)
    abline(h=0,col="red")
    plot(RedWine$sulphates, WineQuality5$residuals)
    abline(h=0,col="red")
    plot(RedWine$alcohol, WineQuality5$residuals)
    abline(h=0,col="red")

Cook's Distance
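Cook's distance measures how much the fitted values change when a single observation is left out, which makes it a natural tool for revisiting the ‘potential outliers’ carried over from the box plots. A minimal sketch on made-up data follows; the 4/n cutoff is a common rule of thumb and an assumption here, not something stated in this analysis.

```r
# Made-up data: 20 points close to a line plus one far-off point (index 21).
x <- c(1:20, 40)
y <- c(2 * (1:20) + sin(1:20), 0)

fit <- lm(y ~ x)
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption; other cutoffs such as 1 are also used).
threshold <- 4 / length(cd)
influential <- which(cd > threshold)

print(influential)
plot(cd, type = "h", ylab = "Cook's distance")
abline(h = threshold, col = "red")
```

Points above the line are candidates for closer inspection rather than automatic removal, since a high-influence point can still be a correct measurement.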

Checking for multicollinearity in the data:

##     RedWine$Logvolatile.acidity            RedWine$Logchlorides 
##                        1.280896                        1.324482 
##  RedWine$Logfree.sulfur.dioxide RedWine$Logtotal.sulfur.dioxide 
##                        2.755839                        2.877134 
##                   RedWine$LogpH            RedWine$Logsulphates 
##                        1.250469                        1.256140 
##              RedWine$Logalcohol 
##                        1.315354

None of the VIF values are greater than 5. Hence, we proceed to the next step as there appear to be no multi-collinearity issues.
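The variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on simulated stand-in data; `vif_manual` is a hypothetical helper, and a packaged function such as `vif` in the car package is an alternative.

```r
# VIF of each column: regress it on the others and use 1 / (1 - R^2).
vif_manual <- function(data) {
  vapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
}

set.seed(11)  # simulated stand-in data, not the wine measurements
n <- 200
free_so2 <- rnorm(n)
demo <- data.frame(
  free_so2  = free_so2,
  total_so2 = free_so2 + rnorm(n, sd = 0.3),  # strongly tied to free_so2
  pH        = rnorm(n)                        # unrelated to the others
)

vifs <- vif_manual(demo)
print(round(vifs, 2))
```

In the output above, the two sulfur dioxide variables have the highest VIF values (about 2.8), which is consistent with the positive correlation between them noted earlier.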

Centering the variables, we get:

Conclusion:

The final fitted linear regression model equation is:

Quality = 4.4301 - 1.0128 volatile.acidity - 2.0178 chlorides + 0.0051 free.sulfur.dioxide - 0.0035 total.sulfur.dioxide - 0.4827 pH + 0.8827 sulphates + 0.2893 alcohol
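As a worked example of using this equation, the sketch below evaluates it for a hypothetical wine whose measurements equal the median of each predictor in the summary table. The coefficients are copied from the final model output; the object names are illustrative.

```r
# Coefficients of the final model, copied from its summary output.
coefs <- c(
  "(Intercept)"        = 4.4300987,
  volatile.acidity     = -1.0127527,
  chlorides            = -2.0178138,
  free.sulfur.dioxide  = 0.0050774,
  total.sulfur.dioxide = -0.0034822,
  pH                   = -0.4826614,
  sulphates            = 0.8826651,
  alcohol              = 0.2893028
)

# Hypothetical wine: each predictor set to its median from the summary.
median_wine <- c(
  volatile.acidity = 0.52, chlorides = 0.079,
  free.sulfur.dioxide = 14, total.sulfur.dioxide = 38,
  pH = 3.31, sulphates = 0.62, alcohol = 10.2
)

# Intercept plus the sum of coefficient * value over all predictors.
predicted <- coefs[["(Intercept)"]] + sum(coefs[names(median_wine)] * median_wine)
print(round(predicted, 2))
```

The result is about 5.58, close to the mean quality of 5.636, as one would expect for a wine with typical measurements.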

This model explains about 36% of the variance in wine quality (R squared = 0.3595). This seems like a small number, but the quality ratings given by wine tasters are not exact or standardized. Also, if more data were available, perhaps a better-fitting model could be found.

Citations:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.