Dataset

The dataset are related to red wine of Portuguese “Vinho Verde” wine. Can we predict the wine quality based on the characteristics provided in the dataset?

library(DT)
data <- read.csv("https://raw.githubusercontent.com/suswong/DATA-605/main/winequality-red.csv", head = TRUE, sep=";")
datatable(data, options = list(scrollX = TRUE))

Packages

library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(psych)
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(MASS)
## 
## Attaching package: 'MASS'
## The following objects are masked from 'package:openintro':
## 
##     housing, mammals

Correlation

The distribution of the following variables:

Variable Distribution
fixed.acidity skewed right
volatile.acidity skewed right
citric.acid skewed right
residual.sugar skewed right
chlorides skewed right
free.sulfur.dioxide skewed right
total.sulfur.dioxide skewed right
density normal
pH skewed right
sulphates skewed right
alcohol skewed right
quality normal
suppressWarnings({
chart.Correlation(data[1:12], histogram = TRUE, method = "pearson") })

Model

Using the backward method of stepwise regression, we have the following model.

model <- step(lm(quality ~ ., data = data), direction = "backward")
## Start:  AIC=-1375.49
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS     AIC
## - density               1     0.287 666.70 -1376.8
## - fixed.acidity         1     0.389 666.80 -1376.5
## - residual.sugar        1     0.498 666.91 -1376.3
## - citric.acid           1     0.646 667.06 -1375.9
## <none>                              666.41 -1375.5
## - free.sulfur.dioxide   1     1.694 668.10 -1373.4
## - pH                    1     1.957 668.37 -1372.8
## - chlorides             1     8.391 674.80 -1357.5
## - total.sulfur.dioxide  1     8.427 674.84 -1357.4
## - sulphates             1    26.971 693.38 -1314.0
## - volatile.acidity      1    33.620 700.03 -1298.8
## - alcohol               1    45.672 712.08 -1271.5
## 
## Step:  AIC=-1376.8
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS     AIC
## - fixed.acidity         1     0.108 666.81 -1378.5
## - residual.sugar        1     0.231 666.93 -1378.2
## - citric.acid           1     0.654 667.35 -1377.2
## <none>                              666.70 -1376.8
## - free.sulfur.dioxide   1     1.829 668.53 -1374.4
## - pH                    1     4.325 671.02 -1368.5
## - total.sulfur.dioxide  1     8.728 675.43 -1358.0
## - chlorides             1     8.761 675.46 -1357.9
## - sulphates             1    27.287 693.98 -1314.7
## - volatile.acidity      1    35.000 701.70 -1297.0
## - alcohol               1   119.669 786.37 -1114.8
## 
## Step:  AIC=-1378.54
## quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates + 
##     alcohol
## 
##                        Df Sum of Sq    RSS     AIC
## - residual.sugar        1     0.257 667.06 -1379.9
## - citric.acid           1     0.565 667.37 -1379.2
## <none>                              666.81 -1378.5
## - free.sulfur.dioxide   1     1.901 668.71 -1376.0
## - pH                    1     7.065 673.87 -1363.7
## - chlorides             1     9.940 676.75 -1356.9
## - total.sulfur.dioxide  1    10.031 676.84 -1356.7
## - sulphates             1    27.673 694.48 -1315.5
## - volatile.acidity      1    36.234 703.04 -1295.9
## - alcohol               1   120.633 787.44 -1114.7
## 
## Step:  AIC=-1379.93
## quality ~ volatile.acidity + citric.acid + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS     AIC
## - citric.acid           1     0.475 667.54 -1380.8
## <none>                              667.06 -1379.9
## - free.sulfur.dioxide   1     2.064 669.13 -1377.0
## - pH                    1     7.138 674.20 -1364.9
## - total.sulfur.dioxide  1     9.828 676.89 -1358.5
## - chlorides             1     9.832 676.89 -1358.5
## - sulphates             1    27.446 694.51 -1317.5
## - volatile.acidity      1    35.977 703.04 -1297.9
## - alcohol               1   122.667 789.73 -1112.0
## 
## Step:  AIC=-1380.79
## quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol
## 
##                        Df Sum of Sq    RSS     AIC
## <none>                              667.54 -1380.8
## - free.sulfur.dioxide   1     2.394 669.93 -1377.1
## - pH                    1     7.073 674.61 -1365.9
## - total.sulfur.dioxide  1    10.787 678.32 -1357.2
## - chlorides             1    10.809 678.35 -1357.1
## - sulphates             1    27.060 694.60 -1319.2
## - volatile.acidity      1    42.318 709.85 -1284.5
## - alcohol               1   124.483 792.02 -1109.4

Since the p-value (2.2e-16) associated with the F-statistic (127.6 with degrees of freedom = 7) is less than 0.05, it means that at least 1 independent value is related to the average life expectancy.

All of the p-values of the independent variables is less than 0.05. This indicates they are statistically significant predictor of our dependent variable.

The adjusted R-squared is 0.3567, which means this model can explain 35.67% of the variability in the quality of the red wine.

summary(model)
## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68918 -0.36757 -0.04653  0.46081  2.02954 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.4300987  0.4029168  10.995  < 2e-16 ***
## volatile.acidity     -1.0127527  0.1008429 -10.043  < 2e-16 ***
## chlorides            -2.0178138  0.3975417  -5.076 4.31e-07 ***
## free.sulfur.dioxide   0.0050774  0.0021255   2.389    0.017 *  
## total.sulfur.dioxide -0.0034822  0.0006868  -5.070 4.43e-07 ***
## pH                   -0.4826614  0.1175581  -4.106 4.23e-05 ***
## sulphates             0.8826651  0.1099084   8.031 1.86e-15 ***
## alcohol               0.2893028  0.0167958  17.225  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6477 on 1591 degrees of freedom
## Multiple R-squared:  0.3595, Adjusted R-squared:  0.3567 
## F-statistic: 127.6 on 7 and 1591 DF,  p-value: < 2.2e-16

Model Diagnostic

This model meets the conditions.

Linearity and Constant Variability: Residuals indicates the difference between the actual and predicted value. We want the the mean of residuals to be close to 0. Based on the plot below, the points are distributed around 0 in no apparent pattern.

Both linearity and constant variability are met.

ggplot(data = model, aes(x = .fitted, y = .resid)) + geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") +
  ylab("Residuals")  + 
  ggtitle("Residuals v. Fitted") + theme(plot.title=element_text(hjust=0.5))

Normality of Residuals: All of the points should lie roughly on the line. The assumption of normality is met.

ggplot(data = model, aes(x = .resid)) + geom_histogram() + xlab("Residuals")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = model, aes(sample = .resid)) + stat_qq() +stat_qq_line(col = "red")  + 
  xlab("Theoretical Quantities") + ylab("Standardized Resididuals")+ggtitle("Normal Q~Q ") + theme(plot.title=element_text(hjust=0.5))