1 Introduction

This article is written to complete the Regression Model course at Algoritma. The dataset used is the Wine Quality Data Set from the UCI Machine Learning Repository. Two datasets are provided (red wine and white wine), but only the red wine dataset is discussed in this article.

The source code is available in my GitHub account here. I also wrote a Python version of this article as a Jupyter notebook, which you can see here.

1.1 Dataset Background

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: Web Link or the reference Cortez et al., 2009. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.

1.2 Aim

The goal is to model red wine quality based on physicochemical tests.

1.3 Objectives

  1. To derive the final model equation

  2. To report the statistical values: (adjusted) R-squared, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE)

  3. To examine the model with statistics and visualizations:

  • Assess linearity of the model (parameters)
  • Assess serial independence of errors
  • Assess heteroscedasticity
  • Assess normality of the residual distribution
  • Assess multicollinearity

  4. To interpret the model

  5. To consider other factors, such as:

  • Are there any outliers?
  • Are there any missing values?

  6. To test the model on the test dataset and discuss the results

2 Metadata

There are 12 columns in the dataset. They are briefly described below; you can also read about them in this file. For more information, read Cortez et al., 2009.

Input variables (based on physicochemical tests):

  1. fixed acidity
  2. volatile acidity
  3. citric acid
  4. residual sugar
  5. chlorides
  6. free sulfur dioxide
  7. total sulfur dioxide
  8. density
  9. pH
  10. sulphates
  11. alcohol

Output variable (based on sensory data):

  1. quality (score between 0 and 10)

2.1 Source

Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez; A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal, 2009.

2.2 Relevant Papers

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Available at: Web Link

3 Preparation

Prepare the performance indicators and all necessary functions.
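
The original code chunks are not echoed here, so below is a minimal sketch of what the Preparation step might look like. The file name and the exact bodies of indicator(), CheckNormal(), and CheckHomos() are assumptions, reconstructed from the outputs shown later in this article.

# Assumed setup; the UCI red wine file is semicolon-separated
library(lmtest)   # bptest() for the Breusch-Pagan test
library(car)      # vif() for the multicollinearity check

wine_red <- read.csv("winequality-red.csv", sep = ";")
colSums(is.na(wine_red))   # count the missing values per column

# Performance indicators: MSE, RMSE, and MAE
indicator <- function(y_true, y_pred) {
  mse  <- mean((y_true - y_pred)^2)
  rmse <- sqrt(mse)
  mae  <- mean(abs(y_true - y_pred))
  c(MSE = mse, RMSE = rmse, MAE = mae)
}

# Normality check: residual histogram plus the Shapiro-Wilk test
CheckNormal <- function(model) {
  hist(model$residuals, breaks = 30, main = "Model residuals")
  res <- shapiro.test(model$residuals)
  print(res)
  if (res$p.value < 0.05) {
    print("H0 rejected: the residuals are NOT distributed normally")
  } else {
    print("Failed to reject H0: the residuals are distributed normally")
  }
}

# Homoscedasticity check: residuals vs fitted plot plus the Breusch-Pagan test
CheckHomos <- function(model) {
  plot(model$fitted.values, model$residuals,
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, col = "red")
  res <- bptest(model)
  print(res)
  if (res$p.value < 0.05) {
    print("H0 rejected: Error variance spreads INCONSTANTLY/generating patterns (Heteroscedasticity)")
  } else {
    print("Failed to reject H0: Error variance spreads constantly (Homoscedasticity)")
  }
}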

##        fixed.acidity     volatile.acidity          citric.acid 
##                    0                    0                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    0                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              alcohol              quality 
##                    0                    0                    0

Perfect. Since the dataset is clean to begin with, there are no missing values. So, let’s move on and explore the data.

4 Exploratory Data Analysis

In order to explore the dataset, we can use scatter plots, histograms, correlation values, and p-values.
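
The figure below is a correlation pairs chart. The original plotting code is not shown; one possible sketch that produces this kind of figure, assuming the PerformanceAnalytics package:

library(PerformanceAnalytics)
# Scatter plots (lower triangle), histograms (diagonal), and correlation
# values with significance stars (upper triangle)
chart.Correlation(wine_red, histogram = TRUE, pch = 19)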

The preceding figure conveys a lot of information. In general, it shows four things: scatter plots between each pair of variables, histograms of each variable, correlation values between each pair, and p-values for each pair tested against a significance level of 0.05.

4.1 Scatter plots

Surprisingly, we find something interesting here. The scatter plots between quality and each predictor variable all form the same pattern: the target variable quality groups its values into several discrete classes, i.e. 3, 4, 5, 6, 7, and 8. To examine this further, we will still proceed to fit a linear regression model to the data.

Moreover, several predictors have a strong relationship with each other, e.g. fixed.acidity and citric.acid, as indicated by their tendency to form an upward- or downward-sloping line. This is discussed further under Correlation values below.

4.2 Histograms

Each predictor variable shows a reasonably spread-out distribution. The target variable, however, does not: its values pile up on a few discrete levels. This supports the finding from the scatter plot analysis above. We can check the summary of that variable to make sure.
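
A quick way to confirm is a sketch like the following (assuming the data frame name from Preparation):

summary(wine_red$quality)   # five-number summary of the target
table(wine_red$quality)     # frequency of each distinct quality score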

Unsurprisingly, quality does take only discrete values. Based on this finding, it seems that linear regression is not suitable for this dataset. This is our initial hypothesis.

4.3 Correlation values

The figure above shows that the following pairs have a strong correlation:

  • Between density and fixed.acidity (0.67)
  • Between fixed.acidity and citric.acid (0.67)
  • Between fixed.acidity and pH (-0.68)
  • Between free.sulfur.dioxide and total.sulfur.dioxide (0.67)

These may indicate a sufficiently high degree of multicollinearity. We will highlight this issue and discuss it later in the assumptions section.

4.4 P-values

Among the predictors, volatile.acidity and alcohol have the largest correlations with quality. However, we also need to check Pearson’s correlation test via the p-values. As seen in the figure above, the red stars in the upper triangle of the matrix indicate significance: the more stars, the more significant the relationship. To be significant at our chosen significance level (alpha = 0.05), a pair needs at least one star.

In this p-value analysis, we are only interested in the p-values for the relationship between quality and each predictor variable. All variables have at least one star (meaning a p-value less than the pre-determined alpha of 0.05), except residual.sugar. So, we will not consider that variable any longer.
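
For reference, the test behind each star can be reproduced for a single pair with cor.test(); a sketch for the insignificant pair found above:

# Pearson's correlation test between residual.sugar and quality;
# a p-value above 0.05 means the correlation is not significant
cor.test(wine_red$residual.sugar, wine_red$quality, method = "pearson")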

5 Modelling

5.1 Splitting Train Datasets and Test Datasets

We split the dataset into 80% training data and 20% test data.
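
A sketch of the split; the seed value and the test-set object name are assumptions, while X.train_red matches the training data referenced in the model output below:

set.seed(100)   # assumed seed, for reproducibility
idx <- sample(nrow(wine_red), size = 0.8 * nrow(wine_red))
X.train_red <- wine_red[idx, ]    # 80% for training
X.test_red  <- wine_red[-idx, ]   # 20% for testing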

5.2 Create the Model

As mentioned in the exploratory data analysis, we will employ all predictor variables except residual.sugar. Let’s create a linear model from those variables.
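
A sketch of the fit (the object name model_red is an assumption; the formula matches the Call shown in the output below):

model_red <- lm(quality ~ fixed.acidity + volatile.acidity + citric.acid +
                  chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
                  density + pH + sulphates + alcohol,
                data = X.train_red)
summary(model_red)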

## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = X.train_red)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.67536 -0.38553 -0.06879  0.45454  1.97578 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.510e+01  1.912e+01   0.790 0.429935    
## fixed.acidity         1.547e-02  2.696e-02   0.574 0.566286    
## volatile.acidity     -1.057e+00  1.358e-01  -7.780 1.49e-14 ***
## citric.acid          -1.774e-01  1.657e-01  -1.070 0.284621    
## chlorides            -1.779e+00  4.627e-01  -3.845 0.000126 ***
## free.sulfur.dioxide   3.392e-03  2.480e-03   1.368 0.171691    
## total.sulfur.dioxide -3.645e-03  8.201e-04  -4.444 9.58e-06 ***
## density              -1.102e+01  1.954e+01  -0.564 0.572664    
## pH                   -3.835e-01  2.010e-01  -1.908 0.056598 .  
## sulphates             7.945e-01  1.217e-01   6.527 9.67e-11 ***
## alcohol               2.924e-01  2.462e-02  11.878  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6476 on 1268 degrees of freedom
## Multiple R-squared:  0.3695, Adjusted R-squared:  0.3645 
## F-statistic:  74.3 on 10 and 1268 DF,  p-value: < 2.2e-16

From the summary() output above, roughly half of the predictor variables are insignificant. Furthermore, the adjusted R-squared is poor. Before we tackle these issues, we should check the model’s assumptions.

5.3 Check the assumptions

Since the linearity assumption has been discussed earlier, here we check three further assumptions: normality, homoscedasticity, and multicollinearity.

5.3.1 Normality

Under the normality assumption, we would like the residuals of the predicted values to approximate a normal distribution. We can check this by plotting the residuals and using the Shapiro-Wilk normality test. For the latter, we expect a p-value greater than the significance level (0.05), so that we fail to reject the null hypothesis.

In the Preparation chapter, we declared a function called CheckNormal() to carry out this task. So, let’s use it.
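
Assuming the model object from the previous section is named model_red, the call is simply:

CheckNormal(model_red)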

## 
##  Shapiro-Wilk normality test
## 
## data:  model$residuals
## W = 0.9897, p-value = 7.95e-08
## 
## [1] "H0 rejected: the residuals are NOT distributed normally"

Although the figure above suggests that the residuals cluster around 0 (i.e. approach a normal distribution), we cannot take this at face value. We also have to check the result of the Shapiro-Wilk normality test. Unfortunately, with respect to normality, our model performs poorly: the p-value is so small that H0 is rejected, meaning the residuals are not normally distributed. We don’t want this.

5.3.2 Homoscedasticity

For homoscedasticity, we would like the residuals to spread constantly and randomly, without forming any pattern. We have two ways to examine this: plotting the residuals against the predicted values and performing the Breusch-Pagan test. Since the function CheckHomos() was already declared in Preparation, we just need to use it.
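
Again assuming the model object name from above:

CheckHomos(model_red)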

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 56.615, df = 10, p-value = 1.575e-08
## 
## [1] "H0 rejected: Error variance spreads INCONSTANTLY/generating patterns (Heteroscedasticity)"

As read above, the p-value is so small that the null hypothesis is rejected. Moreover, the figure also shows line-like patterns. The residuals do form patterns, meaning that heteroscedasticity exists. We don’t want this either.

5.3.3 Multicollinearity

For multicollinearity, we would like each predictor variable in the model to show no strong relationship with any other predictor. We can examine this by inspecting the VIF (Variance Inflation Factor) scores; we expect no score greater than 10. The vif() function from the car package performs this task.
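
The call, with the model object name as assumed above:

vif(model_red)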

##        fixed.acidity     volatile.acidity          citric.acid 
##             6.935469             1.779712             3.229450 
##            chlorides  free.sulfur.dioxide total.sulfur.dioxide 
##             1.497186             1.996028             2.312040 
##              density                   pH            sulphates 
##             4.168414             2.962262             1.368089 
##              alcohol 
##             2.218353

Reading the scores above, the largest value belongs to fixed.acidity, at about 6.9. Fortunately, this is still lower than 10. Therefore, with respect to multicollinearity, our model performs satisfactorily.

6 Model Improvements

As stated in the previous chapter, based on the summary() output and the assumption checks, the model performs poorly except with respect to multicollinearity. Thus, some improvements have to be made to reduce these drawbacks.

6.1 Check the Outliers

Firstly, let’s check the dataset for outliers, i.e. whether any high-leverage, high-influence observations exist. We can use four plots here: Residuals vs Fitted, Normal Q-Q, Cook’s Distance, and Residuals vs Leverage. For background on these plots, you can read here.
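
In base R, these four plots come straight from plot() on the fitted model; a sketch:

# which = 1: Residuals vs Fitted, 2: Normal Q-Q,
# 4: Cook's Distance, 5: Residuals vs Leverage
plot(model_red, which = c(1, 2, 4, 5))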

From the four figures above, we find some high-leverage, high-influence observations: those with indices 78, 202, 245, 274, and 1161. We are going to remove those rows.

After the outliers are removed, a new model is generated; let’s check its summary as well.
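
A sketch of the removal and refit (the object name model_red2 is an assumption):

outliers <- c(78, 202, 245, 274, 1161)   # indices identified above
X.train_red <- X.train_red[-outliers, ]
model_red2 <- lm(quality ~ fixed.acidity + volatile.acidity + citric.acid +
                   chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
                   density + pH + sulphates + alcohol,
                 data = X.train_red)
summary(model_red2)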

## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = X.train_red)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.22403 -0.38124 -0.06838  0.45623  1.94962 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.869e+01  1.878e+01   0.995   0.3200    
## fixed.acidity         2.520e-02  2.681e-02   0.940   0.3475    
## volatile.acidity     -1.038e+00  1.333e-01  -7.788 1.41e-14 ***
## citric.acid          -2.014e-01  1.635e-01  -1.231   0.2184    
## chlorides            -1.496e+00  4.643e-01  -3.223   0.0013 ** 
## free.sulfur.dioxide   3.912e-03  2.445e-03   1.600   0.1099    
## total.sulfur.dioxide -3.551e-03  8.097e-04  -4.385 1.26e-05 ***
## density              -1.488e+01  1.920e+01  -0.775   0.4384    
## pH                   -3.837e-01  1.984e-01  -1.934   0.0534 .  
## sulphates             8.907e-01  1.234e-01   7.221 8.92e-13 ***
## alcohol               2.998e-01  2.425e-02  12.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6344 on 1263 degrees of freedom
## Multiple R-squared:  0.3875, Adjusted R-squared:  0.3827 
## F-statistic: 79.92 on 10 and 1263 DF,  p-value: < 2.2e-16

The new model seems more reliable. To make sure, let’s compare the adjusted R-squared values of the two models.
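
A sketch of the comparison, using the assumed object names from above:

c(before = summary(model_red)$adj.r.squared,
  after  = summary(model_red2)$adj.r.squared)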

Well done. The adjusted R-squared increases by almost 2 percentage points (from 0.3645 to 0.3827). Now, we move on to feature selection to improve the model further.

6.2 Feature Selection Implementation

We are going to employ the step-wise algorithm for feature selection, using its three directions: backward, forward, and both. First of all, we have to define the models that serve as the lower and upper bounds of the search.

6.2.1 Create two models as threshold for the step wise algorithm
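
A sketch of the two threshold models (object names are assumptions):

model_lower <- lm(quality ~ 1, data = X.train_red)   # intercept-only lower bound
model_upper <- model_red2                            # full model as upper bound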

Now, let’s carry out the three directions of the step-wise algorithm.
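
A sketch of the three calls, using step() from base R (object names are assumptions; trace = 0 suppresses the step-by-step log):

model_back <- step(model_upper, direction = "backward",
                   scope = list(lower = formula(model_lower),
                                upper = formula(model_upper)),
                   trace = 0)
model_fwd  <- step(model_lower, direction = "forward",
                   scope = list(lower = formula(model_lower),
                                upper = formula(model_upper)),
                   trace = 0)
model_both <- step(model_lower, direction = "both",
                   scope = list(lower = formula(model_lower),
                                upper = formula(model_upper)),
                   trace = 0)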

6.2.2 Backward approach

## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol, data = X.train_red)
## 
## Coefficients:
##          (Intercept)      volatile.acidity             chlorides  
##             4.088908             -0.965571             -1.673087  
##  free.sulfur.dioxide  total.sulfur.dioxide                    pH  
##             0.004563             -0.003924             -0.430205  
##            sulphates               alcohol  
##             0.875497              0.306260
## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol, data = X.train_red)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.24844 -0.38304 -0.06432  0.46086  1.96053 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.0889079  0.4368685   9.360  < 2e-16 ***
## volatile.acidity     -0.9655711  0.1106198  -8.729  < 2e-16 ***
## chlorides            -1.6730870  0.4384609  -3.816 0.000142 ***
## free.sulfur.dioxide   0.0045626  0.0023992   1.902 0.057436 .  
## total.sulfur.dioxide -0.0039238  0.0007458  -5.261 1.68e-07 ***
## pH                   -0.4302048  0.1266785  -3.396 0.000705 ***
## sulphates             0.8754966  0.1213830   7.213 9.42e-13 ***
## alcohol               0.3062600  0.0178667  17.141  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6342 on 1266 degrees of freedom
## Multiple R-squared:  0.3865, Adjusted R-squared:  0.3831 
## F-statistic: 113.9 on 7 and 1266 DF,  p-value: < 2.2e-16

6.2.3 Forward approach

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     total.sulfur.dioxide + chlorides + pH + free.sulfur.dioxide, 
##     data = X.train_red)
## 
## Coefficients:
##          (Intercept)               alcohol      volatile.acidity  
##             4.088908              0.306260             -0.965571  
##            sulphates  total.sulfur.dioxide             chlorides  
##             0.875497             -0.003924             -1.673087  
##                   pH   free.sulfur.dioxide  
##            -0.430205              0.004563
## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     total.sulfur.dioxide + chlorides + pH + free.sulfur.dioxide, 
##     data = X.train_red)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.24844 -0.38304 -0.06432  0.46086  1.96053 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.0889079  0.4368685   9.360  < 2e-16 ***
## alcohol               0.3062600  0.0178667  17.141  < 2e-16 ***
## volatile.acidity     -0.9655711  0.1106198  -8.729  < 2e-16 ***
## sulphates             0.8754966  0.1213830   7.213 9.42e-13 ***
## total.sulfur.dioxide -0.0039238  0.0007458  -5.261 1.68e-07 ***
## chlorides            -1.6730870  0.4384609  -3.816 0.000142 ***
## pH                   -0.4302048  0.1266785  -3.396 0.000705 ***
## free.sulfur.dioxide   0.0045626  0.0023992   1.902 0.057436 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6342 on 1266 degrees of freedom
## Multiple R-squared:  0.3865, Adjusted R-squared:  0.3831 
## F-statistic: 113.9 on 7 and 1266 DF,  p-value: < 2.2e-16

6.2.4 Both approach

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     total.sulfur.dioxide + chlorides + pH + free.sulfur.dioxide, 
##     data = X.train_red)
## 
## Coefficients:
##          (Intercept)               alcohol      volatile.acidity  
##             4.088908              0.306260             -0.965571  
##            sulphates  total.sulfur.dioxide             chlorides  
##             0.875497             -0.003924             -1.673087  
##                   pH   free.sulfur.dioxide  
##            -0.430205              0.004563
## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     total.sulfur.dioxide + chlorides + pH + free.sulfur.dioxide, 
##     data = X.train_red)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.24844 -0.38304 -0.06432  0.46086  1.96053 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.0889079  0.4368685   9.360  < 2e-16 ***
## alcohol               0.3062600  0.0178667  17.141  < 2e-16 ***
## volatile.acidity     -0.9655711  0.1106198  -8.729  < 2e-16 ***
## sulphates             0.8754966  0.1213830   7.213 9.42e-13 ***
## total.sulfur.dioxide -0.0039238  0.0007458  -5.261 1.68e-07 ***
## chlorides            -1.6730870  0.4384609  -3.816 0.000142 ***
## pH                   -0.4302048  0.1266785  -3.396 0.000705 ***
## free.sulfur.dioxide   0.0045626  0.0023992   1.902 0.057436 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6342 on 1266 degrees of freedom
## Multiple R-squared:  0.3865, Adjusted R-squared:  0.3831 
## F-statistic: 113.9 on 7 and 1266 DF,  p-value: < 2.2e-16

All three approaches have been run. Now, we compare all of our models so far by their adjusted R-squared.
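
A sketch of the comparison; here model_redAll refits on the cleaned training data with every available predictor, residual.sugar included, which matches the coefficients reported in section 7.1:

model_redAll <- lm(quality ~ ., data = X.train_red)   # all 11 predictors
sapply(list(model_redAll = model_redAll, backward = model_back,
            forward = model_fwd, both = model_both),
       function(m) summary(m)$adj.r.squared)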

Evidently, feature selection does not give us a model with much higher performance. Instead, the best model so far comes not from that selection but from manually including all available predictor variables. Hence, from now on, the best model is the one with all predictor variables, i.e. model_redAll.

7 Results and Discussions

In this chapter, we discuss the best model so far and use it to predict the test dataset. First, we interpret the selected model; then we discuss its performance and carry out the predictions.

7.1 Model Interpretation

The selected model is the one with all available predictor variables. We defined it as model_redAll. It is expressed by the following equation:

\(\hat{Y} = \beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+\beta_4X_4+\beta_5X_5+\beta_6X_6+\beta_7X_7+\beta_8X_8+\beta_9X_9+\beta_{10} X_{10}+\beta_{11}X_{11}\)

where the following are values from the \(\beta_0\) to \(\beta_{11}\) and from \(X_1\) to \(X_{11}\):

##          (Intercept)        fixed.acidity     volatile.acidity 
##         37.877431553          0.042728629         -1.027665001 
##          citric.acid       residual.sugar            chlorides 
##         -0.209753167          0.024577257         -1.443925680 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##          0.003734745         -0.003638972        -34.476340336 
##                   pH            sulphates              alcohol 
##         -0.285439955          0.919299454          0.279276642

From the equation above, we can interpret that the intercept of 37.88 is the predicted quality when every predictor is 0. Each coefficient \(\beta_i\) then adjusts the prediction by that amount per unit increase in the corresponding \(X_i\), holding the other predictors fixed.

For example, and for simplicity, if we had \(X_{alcohol} = 1\) and every other predictor at 0, the predicted value would be: \(\hat{Y} = \beta_0 + \beta_{alcohol}X_{alcohol} = 37.88 + 0.28 \times 1 = 38.16\)

7.2 Check the Performances

Here, we are going to check the performance of the chosen model. The metrics used are Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). We just need the indicator() function defined at the beginning.
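
A sketch of the call, using the indicator() helper sketched in Preparation:

indicator(y_true = X.train_red$quality,
          y_pred = predict(model_redAll))   # predictions on the training data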

With the performance shown above, we can now compare it against the performance on the test dataset.

7.3 Predictions

Here, we compare the prediction performance on the train dataset against that on the test dataset. The metrics used are MSE, RMSE, MAE, the correlation between y_pred and y_true, and the R-squared between y_pred and y_true. Plots of y_pred against y_true for both the train and test datasets are also shown.
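
A sketch of the comparison (X.test_red is the assumed test-set name from the split in section 5.1):

pred_train <- predict(model_redAll, newdata = X.train_red)
pred_test  <- predict(model_redAll, newdata = X.test_red)
rbind(train = indicator(X.train_red$quality, pred_train),
      test  = indicator(X.test_red$quality,  pred_test))
cor(X.test_red$quality, pred_test)     # correlation between y_true and y_pred
cor(X.test_red$quality, pred_test)^2   # R-squared between y_true and y_pred
plot(pred_test, X.test_red$quality, xlab = "y_pred", ylab = "y_true")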

There are several points we can infer from the performance results above:

  1. The model overfits the train dataset, so it performs poorly on the test dataset.
  2. Given these defective results, the model cannot satisfactorily predict the target variable.
  3. The plots confirm point 2: the model is in fact ineffective at predicting the target variable.

8 Conclusions

We have now finished this article. Below are the points we can conclude:

  • A linear model has been created. The target variable is quality, and the attributes of the physicochemical tests serve as the predictor variables.

  • The selected model is the one with all available variables. So, the model equation is as follows:

\(\hat{Y} = \beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+\beta_4X_4+\beta_5X_5+\beta_6X_6+\beta_7X_7+\beta_8X_8+\beta_9X_9+\beta_{10} X_{10}+\beta_{11}X_{11}\)

  • The statistical values (adjusted) R-squared, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) of the selected model have been calculated.
## [1] "Adjusted R-squared: 0.3832"
## [1] "MSE: 0.3983"
## [1] "RMSE: 0.6311"
## [1] "MAE: 0.4966"
  • The selected model has been examined and improved using statistical tests (the assumption checks and feature selection) and visualizations in the Modelling and Model Improvements chapters.

  • The selected model has been interpreted in the Results and Discussions chapter.

  • The selected model has been tested on the test dataset and discussed in the Results and Discussions chapter.

  • The selected model is ineffective at modelling the train dataset. The best adjusted R-squared achieved is only 0.3832.

  • Since the selected model performs poorly, it also produces deficient results when predicting the test dataset.

  • As stated earlier in the histogram and scatter plot sections, our initial hypothesis was that the target variable takes discrete class values rather than continuous values, so linear regression would not suit this dataset. That hypothesis has been confirmed; all the results and discussions in this article verify it.

  • Therefore, for this type of dataset, in particular one whose target variable takes discrete class values, modelling the data with linear regression is not recommended.

  • For future study, other algorithms will be applied to model this wine quality dataset.