This article aims to accomplish Regression Model course at Algoritma. The dataset used is obtained from UCI Machine Learning Repository, i.e. Wine Quality Data Set. There are two datasets provided (red wine and white wine), but only the red wine one will be discussed in this article.
You could see the source code in my GitHub account here. I also wrote this article in Python language using Jupyter notebook. You could see it here.
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: Web Link or the reference Cortez et al., 2009. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
The goal is to model red wine quality based on physicochemical tests.
To solve the final model equation
To output the statistical values (adjusted) R-squared, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and MAE (Mean Absolute Error)
To examine the model including statistics and visualizations:
To interpretate the model
To consider other factors, such as:
This article is arranged as follows.
There are 12 columns available in the dataset. They are briefly described below or you can read in this file. For more information, read Cortez et al., 2009.
Input variables (based on physicochemical tests):
Output variable (based on sensory data):
Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region(CVRVV), Porto, Portugal @2009
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Available at: Web Link
Prepare the performance indicators and all necessary functions.
library(MLmetrics)
indicator <- function(model, y_pred, y_true) {
adj.r.sq <- summary(model)$adj.r.squared
mse <- MSE(y_pred, y_true)
rmse <- RMSE(y_pred, y_true)
mae <- MAE(y_pred, y_true)
print(paste0("Adjusted R-squared: ", round(adj.r.sq, 4)))
print(paste0("MSE: ", round(mse, 4)))
print(paste0("RMSE: ", round(rmse, 4)))
print(paste0("MAE: ", round(mae, 4)))
}
metrics <- function(y_pred, y_true){
mse <- MSE(y_pred, y_true)
rmse <- RMSE(y_pred, y_true)
mae <- MAE(y_pred, y_true)
print(paste0("MSE: ", round(mse, 6)))
print(paste0("RMSE: ", round(rmse, 6)))
print(paste0("MAE: ", round(mae, 6)))
corPredAct <- cor(y_pred, y_true)
print(paste0("Correlation: ", round(corPredAct, 6)))
print(paste0("R^2 between y_pred & y_true: ", round(corPredAct^2, 6)))
}
CheckNormal <- function(model) {
hist(model$residuals, breaks = 30)
shaptest <- shapiro.test(model$residuals)
print(shaptest)
if (shaptest$p.value <= 0.05) {
print("H0 rejected: the residuals are NOT distributed normally")
} else {
print("H0 failed to reject: the residuals ARE distributed normally")
}
}
library(lmtest)
CheckHomos <- function(model){
plot(model$fitted.values, model$residuals)
abline(h = 0, col = "red")
BP <- bptest(model)
print(BP)
if (BP$p.value <= 0.05) {
print("H0 rejected: Error variance spreads INCONSTANTLY/generating patterns (Heteroscedasticity)")
} else {
print("H0 failed to reject: Error variance spreads CONSTANTLY (Homoscedasticity)")
}
}## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
Perfect. Since the used datasets are originally clean, we will not find any missing value. So, let’s move on to explore the data.
In order to explore the dataset, we could use scatter plots, histograms, correlation value, and p-value.
# library(psych)
# pairs.panels(redDF)
# library(ggcorrplot)
# ggcorrplot(redDF %>% cor(),
# hc.order = TRUE, type = "lower",
# lab = TRUE,
# digits = 1,
# ggtheme = ggplot2::theme_dark(),
# )
library(PerformanceAnalytics)
chart.Correlation(redDF, hist = T)The preceding figure tells many things. But, in general, it shows four points: scatter plots between each variable, histograms of each variable, correlation values between each value, and p-values between each value against significance value of 0.05.
Surprisingly, we found something interesting here. The scatter plots of between quality and each predictor variable form the same pattern that the target variable quality classifies the values into several classess, i.e. 3, 4, 5, 6, 7, and 8. To examine this case, we will go through to assess linear regression to model the data.
Moreover, there are several predictors which have strong relationship, e.g. between fixed.acidity and citric.acid. They are indicated by their tendency to have inclined or declined line. This case is discussed further in the Correlation values point below.
Each predictor variable shows values distributed appropriately. However, the target variable exhibits poor distribution. This supports the above finding from scatter plots analysis. We could check the summary of such variable to make sure.
summary(redDF$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
table(redDF$quality)
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18Unsuprisingly, quality does have classified values. Based on this finding, it seems that linear regression is not suitable for this dataset. This is our initial hypothesis.
The figure above shows that below relationships have a strong correlation.
density and fixed.acidity (0.67)fixed.acidity and citric.acid (0.67)fixed.acidity and pH (-0.68)free.sulfur.dioxide and total.sulfur.dioxide (0.67)Those perhaps indicate sufficiently high multicollinearity. We will highlight this issue and discuss it later in the assumptions section.
In addition, it is only volatile.acidity and alcohol which have the largest correlation value with quality. However, we also need to check the Pearson’s correlation test based on the p-value. As seen in the figure above, the red stars in the upper triangle of the matrix indicate the significance. The more the stars exist, the more significant the relationship is. In order to be significant enough to the significance value (we use significance value (alpha) of 0.05), we need at least one star.
In this p-value analysis, we’re only interested in considering the p-values of relationship between quality and each predictor variable. We can see that all variables have at least one star (meaning p-value less than pre-determined alpha (i.e. 0.05)), except residual.sugar. So, we won’t consider such variable any longer.
By using the dataset, we’re going to split it up into 80% of data for train datasets and 20% of data for test datasets.
As mentioned in the exploratory data analysis, we will employ all predictor variables, except residual.sugar, for the model. Let’s create linear model from those variables.
model_red1 <- lm(quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol,
data = X.train_red)
summary(model_red1)##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = X.train_red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.67536 -0.38553 -0.06879 0.45454 1.97578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.510e+01 1.912e+01 0.790 0.429935
## fixed.acidity 1.547e-02 2.696e-02 0.574 0.566286
## volatile.acidity -1.057e+00 1.358e-01 -7.780 1.49e-14 ***
## citric.acid -1.774e-01 1.657e-01 -1.070 0.284621
## chlorides -1.779e+00 4.627e-01 -3.845 0.000126 ***
## free.sulfur.dioxide 3.392e-03 2.480e-03 1.368 0.171691
## total.sulfur.dioxide -3.645e-03 8.201e-04 -4.444 9.58e-06 ***
## density -1.102e+01 1.954e+01 -0.564 0.572664
## pH -3.835e-01 2.010e-01 -1.908 0.056598 .
## sulphates 7.945e-01 1.217e-01 6.527 9.67e-11 ***
## alcohol 2.924e-01 2.462e-02 11.878 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6476 on 1268 degrees of freedom
## Multiple R-squared: 0.3695, Adjusted R-squared: 0.3645
## F-statistic: 74.3 on 10 and 1268 DF, p-value: < 2.2e-16
From the summary() function above, it can be seen that approximately a half number of all predictor variables exhibits insignificance. Furthermore, the adjusted R-squared also performs poor result. Before we tackle with this issue, we should check the assumptions of the model.
Since the linearity assumption has been discussed earlier in this section, here, we’re going to use three assumptions, i.e. normality, homoscedasticity, and multicollinearity.
By employing normality assumption, we’d like to have the residuals of the predicted value to approach the normal distribution. We can check this by plotting the residuals and using Shapiro-Wilk normality test. For the latter, we expect to have p-value more than significance value (i.e. 0.05) so that the null hypothesis is failed to reject.
In the Preparation chapter, we have declared a function to carry out this task called CheckNormal(). So, let’s use it.
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.9897, p-value = 7.95e-08
##
## [1] "H0 rejected: the residuals are NOT distributed normally"
Although it seems the figure above indicates that the residuals tend to gather around 0 number (i.e. approaching to have normal distribution), we are unable to immediately believe this. We also have to check the results of Shapiro-Wilk normality test. And unfortunately, in the case of normality, our model shows poor results. The p-value is so small that H0 is rejected, meaning that the residuals is not distributed normally. We don’t want this.
In homoscedasticity aspect, we’d like to have residuals spreading constantly randomly, without generating any pattern. We have two approaches to examine this aspect, i.e. plotting the residuals vs the predicted values and performing the Breusch-Pagan test. As the function CheckHomos() to carry out this task has been declared already in Preparation, we just need to use it.
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 56.615, df = 10, p-value = 1.575e-08
##
## [1] "H0 rejected: Error variance spreads INCONSTANTLY/generating patterns (Heteroscedasticity)"
As read above, the p-value is so small that null hypothesis is rejected. Moreover, the figure above also points out line-like patterns. This indeed states that the residuals generate patterns, meaning that heteroscedasticity exists. We don’t want this.
In multicollinearity factor, inside the model, we’d like to have each predictor variable not demonstrating strong relationship with each other. We could examine this factor by inspecting their VIF (Variance Inflation Factor) score. We expect to have VIF score not greater than 10. We can perform this task by using the function vif() from the car package.
## fixed.acidity volatile.acidity citric.acid
## 6.935469 1.779712 3.229450
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## 1.497186 1.996028 2.312040
## density pH sulphates
## 4.168414 2.962262 1.368089
## alcohol
## 2.218353
By reading their score above, we see that the only largest value is fixed.acidity, i.e. ± 6.9. Fortunately, such score is still lower than 10. Therefore, in case of multicollinearity, our model performs satisfactorily.
As stated in the previous chapter, by using summary() function and checking the assumptions, the model performs poor results, except in case of multicollinearity. Thus, any improvement has to be executed to decrease its drawbacks.
Firstly, let’s check outliers of the dataset whether any high leverage high influence exist. We could use four plots here, i.e. Residuals vs Fitted, Normal Q-Q, Cook’s Distance, and Residuals vs Leverage. For your information regarding those plots, you could read here.
par(mfrow=c(2,2)) # Change the panel layout to 2 x 2
lapply(c(1,2,4,5), # showing 4 types of plots
function(x) plot(model_red1,
which = x,
# labels.id = 1:nrow(X.train_red),
cook.levels = c(0.05, 0.1))) %>% invisible()From four figures above, we found there are some leverages with high influence, i.e. the observations with index 78, 202, 245, 274, and 1161. We’re going to remove those rows.
to.rm <- c(78,202,245,274,1161)
# X.train_red[to.rm,]
X.train_red <- X.train_red[-to.rm,]
rownames(X.train_red) <- NULLAfter the outliers removed, a new model is generated, and also check its summary.
model_red2 <- lm(quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol,
data = X.train_red)
summary(model_red2)##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = X.train_red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.22403 -0.38124 -0.06838 0.45623 1.94962
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.869e+01 1.878e+01 0.995 0.3200
## fixed.acidity 2.520e-02 2.681e-02 0.940 0.3475
## volatile.acidity -1.038e+00 1.333e-01 -7.788 1.41e-14 ***
## citric.acid -2.014e-01 1.635e-01 -1.231 0.2184
## chlorides -1.496e+00 4.643e-01 -3.223 0.0013 **
## free.sulfur.dioxide 3.912e-03 2.445e-03 1.600 0.1099
## total.sulfur.dioxide -3.551e-03 8.097e-04 -4.385 1.26e-05 ***
## density -1.488e+01 1.920e+01 -0.775 0.4384
## pH -3.837e-01 1.984e-01 -1.934 0.0534 .
## sulphates 8.907e-01 1.234e-01 7.221 8.92e-13 ***
## alcohol 2.998e-01 2.425e-02 12.366 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6344 on 1263 degrees of freedom
## Multiple R-squared: 0.3875, Adjusted R-squared: 0.3827
## F-statistic: 79.92 on 10 and 1263 DF, p-value: < 2.2e-16
It seems the new model performs more reliable. To make sure, let’s check the adjusted R-squared values between two models.
print("Adjusted R-squared for 1st model:")
## [1] "Adjusted R-squared for 1st model:"
ad.r.sq1 <- summary(model_red1)$adj.r.squared
ad.r.sq1
## [1] 0.3644797
print("Adjusted R-squared for 2nd model:")
## [1] "Adjusted R-squared for 2nd model:"
ad.r.sq2 <- summary(model_red2)$adj.r.squared
ad.r.sq2
## [1] 0.3826897
print(paste0("The difference between both is ", round(ad.r.sq2-ad.r.sq1, 5)*100, "%"))
## [1] "The difference between both is 1.821%"Well done. Adjusted R-squared increases by almost 2%. Now, we move on to try feature selection to improve the model.
We’re going to employ step-wise algorithm for the feature selection method. We will use three directions of the algorithm, i.e. backward, forward, and both. First of all, we have to define the models for lower and upper threshold of the algorithm.
model_redAlc <- lm(quality ~ alcohol, data = X.train_red)
# summary(model_redAlc)
model_redAll <- lm(quality ~ ., data = X.train_red)
# summary(model_redAll)Now, let’s carry out three approaches of step-wise algorithm.
##
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol, data = X.train_red)
##
## Coefficients:
## (Intercept) volatile.acidity chlorides
## 4.088908 -0.965571 -1.673087
## free.sulfur.dioxide total.sulfur.dioxide pH
## 0.004563 -0.003924 -0.430205
## sulphates alcohol
## 0.875497 0.306260
model.back_red <- lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
total.sulfur.dioxide + pH + sulphates + alcohol, data = X.train_red)
summary(model.back_red)##
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol, data = X.train_red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.24844 -0.38304 -0.06432 0.46086 1.96053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0889079 0.4368685 9.360 < 2e-16 ***
## volatile.acidity -0.9655711 0.1106198 -8.729 < 2e-16 ***
## chlorides -1.6730870 0.4384609 -3.816 0.000142 ***
## free.sulfur.dioxide 0.0045626 0.0023992 1.902 0.057436 .
## total.sulfur.dioxide -0.0039238 0.0007458 -5.261 1.68e-07 ***
## pH -0.4302048 0.1266785 -3.396 0.000705 ***
## sulphates 0.8754966 0.1213830 7.213 9.42e-13 ***
## alcohol 0.3062600 0.0178667 17.141 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6342 on 1266 degrees of freedom
## Multiple R-squared: 0.3865, Adjusted R-squared: 0.3831
## F-statistic: 113.9 on 7 and 1266 DF, p-value: < 2.2e-16
step(model_redAlc, scope = list(lower = model_redAlc, upper = model_redAll),
direction = "forward",
trace = F)##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + pH + free.sulfur.dioxide,
## data = X.train_red)
##
## Coefficients:
## (Intercept) alcohol volatile.acidity
## 4.088908 0.306260 -0.965571
## sulphates total.sulfur.dioxide chlorides
## 0.875497 -0.003924 -1.673087
## pH free.sulfur.dioxide
## -0.430205 0.004563
model.forw_red <- lm(formula = quality ~ alcohol + volatile.acidity + sulphates + total.sulfur.dioxide + chlorides + pH + free.sulfur.dioxide,
data = X.train_red)
summary(model.forw_red)##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + pH + free.sulfur.dioxide,
## data = X.train_red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.24844 -0.38304 -0.06432 0.46086 1.96053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0889079 0.4368685 9.360 < 2e-16 ***
## alcohol 0.3062600 0.0178667 17.141 < 2e-16 ***
## volatile.acidity -0.9655711 0.1106198 -8.729 < 2e-16 ***
## sulphates 0.8754966 0.1213830 7.213 9.42e-13 ***
## total.sulfur.dioxide -0.0039238 0.0007458 -5.261 1.68e-07 ***
## chlorides -1.6730870 0.4384609 -3.816 0.000142 ***
## pH -0.4302048 0.1266785 -3.396 0.000705 ***
## free.sulfur.dioxide 0.0045626 0.0023992 1.902 0.057436 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6342 on 1266 degrees of freedom
## Multiple R-squared: 0.3865, Adjusted R-squared: 0.3831
## F-statistic: 113.9 on 7 and 1266 DF, p-value: < 2.2e-16
step(model_redAlc, scope = list(lower = model_redAlc, upper = model_redAll),
direction = "both",
trace = F)##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + pH + free.sulfur.dioxide,
## data = X.train_red)
##
## Coefficients:
## (Intercept) alcohol volatile.acidity
## 4.088908 0.306260 -0.965571
## sulphates total.sulfur.dioxide chlorides
## 0.875497 -0.003924 -1.673087
## pH free.sulfur.dioxide
## -0.430205 0.004563
model.both_red <- lm(formula = quality ~ alcohol + volatile.acidity + sulphates + total.sulfur.dioxide + chlorides + pH + free.sulfur.dioxide,
data = X.train_red)
summary(model.both_red)##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + pH + free.sulfur.dioxide,
## data = X.train_red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.24844 -0.38304 -0.06432 0.46086 1.96053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0889079 0.4368685 9.360 < 2e-16 ***
## alcohol 0.3062600 0.0178667 17.141 < 2e-16 ***
## volatile.acidity -0.9655711 0.1106198 -8.729 < 2e-16 ***
## sulphates 0.8754966 0.1213830 7.213 9.42e-13 ***
## total.sulfur.dioxide -0.0039238 0.0007458 -5.261 1.68e-07 ***
## chlorides -1.6730870 0.4384609 -3.816 0.000142 ***
## pH -0.4302048 0.1266785 -3.396 0.000705 ***
## free.sulfur.dioxide 0.0045626 0.0023992 1.902 0.057436 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6342 on 1266 degrees of freedom
## Multiple R-squared: 0.3865, Adjusted R-squared: 0.3831
## F-statistic: 113.9 on 7 and 1266 DF, p-value: < 2.2e-16
All three approaches have been defined. Now, we’re going to compare our all models so far by their adjusted R-squared.
cat("Adjusted R-squared for 1st model:\n")
## Adjusted R-squared for 1st model:
ad.r.sq1 <- summary(model_red1)$adj.r.squared
ad.r.sq1
## [1] 0.3644797
cat("\nAdjusted R-squared for 2nd model:\n")
##
## Adjusted R-squared for 2nd model:
ad.r.sq2 <- summary(model_red2)$adj.r.squared
ad.r.sq2
## [1] 0.3826897
cat("\nAdjusted R-squared for model using 'alcohol' variable only:\n")
##
## Adjusted R-squared for model using 'alcohol' variable only:
ad.r.sqAlc <- summary(model_redAlc)$adj.r.squared
ad.r.sqAlc
## [1] 0.2591122
cat("\nAdjusted R-squared for model using all variables:\n")
##
## Adjusted R-squared for model using all variables:
ad.r.sqAll <- summary(model_redAll)$adj.r.squared
ad.r.sqAll
## [1] 0.3831521
cat("\nAdjusted R-squared for model with backward approach:\n")
##
## Adjusted R-squared for model with backward approach:
ad.r.sqBack <- summary(model.back_red)$adj.r.squared
ad.r.sqBack
## [1] 0.3831096
cat("\nAdjusted R-squared for model with forward approach:\n")
##
## Adjusted R-squared for model with forward approach:
ad.r.sqForw <- summary(model.forw_red)$adj.r.squared
ad.r.sqForw
## [1] 0.3831096
cat("\nAdjusted R-squared for model with both approach:\n")
##
## Adjusted R-squared for model with both approach:
ad.r.sqBoth <- summary(model.both_red)$adj.r.squared
ad.r.sqBoth
## [1] 0.3831096Evidently, after we have performed feature selection, we don’t obtain the model with much higher performance. Instead, the best model so far is achieved not from such selection, but from manually including all available predictor variables. Hence, from now on, the best model used will be the one with all predictor variables, i.e. model_redAll
In this chapter, we’re going to discuss the best model so far and use it to predict the test dataset. Firstly, we should interpret the selected model. Subsequently, the performance of the model is discussed and the predictions will be carried out later.
The selected model is the one with all available preditor variables. We defined it as model_redAll. It consist of the following equation:
\(\hat{Y} = \beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+\beta_4X_4+\beta_5X_5+\beta_6X_6+\beta_7X_7+\beta_8X_8+\beta_9X_9+\beta_{10} X_{10}+\beta_{11}X_{11}\)
where the following are values from the \(\beta_0\) to \(\beta_{11}\) and from \(X_1\) to \(X_{11}\):
## (Intercept) fixed.acidity volatile.acidity
## 37.877431553 0.042728629 -1.027665001
## citric.acid residual.sugar chlorides
## -0.209753167 0.024577257 -1.443925680
## free.sulfur.dioxide total.sulfur.dioxide density
## 0.003734745 -0.003638972 -34.476340336
## pH sulphates alcohol
## -0.285439955 0.919299454 0.279276642
From the equation above, we can interpret that the line starts from the Cartesian coordinate of (0, 37.87), as pointed by the intercept. Furthermore, along with the increase of any \(X_i\), the related \(\beta_i\) will adjust the line according to both values.
For example and for simplicity, if we were to have \(X_{alcohol} = 1\), then the y coordinate (or so-called the predicted value) would be: \(\hat{Y} = \beta_0 + \beta_{alcohol}.X_{alcohol} = 37.87 + 0.27*1 = 38.14\)
Here, we’re going to check the performances of the chosen model. The metrics used are Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). We just need the indicator() function defined in the beginning.
indicator(model = model_redAll, y_pred = model_redAll$fitted.values, y_true = X.train_red$quality)
## [1] "Adjusted R-squared: 0.3832"
## [1] "MSE: 0.3983"
## [1] "RMSE: 0.6311"
## [1] "MAE: 0.4966"With performances as shown above, we will compare them to those of the prediction using test dataset.
Here, we’re going to compare the performances of prediction using train dataset to those of using test dataset. The metrics used are MSE, RMSE, MAE, correlation between y_pred and y_true, and R-squared between y_pred and y_true. The plots between y_pred and y_true for each train dataset and test dataset also will be shown.
cat("Performance using train dataset:\n")
## Performance using train dataset:
metrics(y_pred = model_redAll$fitted.values, y_true = X.train_red$quality)
## [1] "MSE: 0.39834"
## [1] "RMSE: 0.631142"
## [1] "MAE: 0.496614"
## [1] "Correlation: 0.623283"
## [1] "R^2 between y_pred & y_true: 0.388482"
redPredict.back <- predict(model_redAll, newdata = X.test_red)
cat("\nPerformances using test dataset:\n")
##
## Performances using test dataset:
metrics(y_pred = redPredict.back, y_true = X.test_red$quality)
## [1] "MSE: 0.427251"
## [1] "RMSE: 0.653644"
## [1] "MAE: 0.493648"
## [1] "Correlation: 0.56356"
## [1] "R^2 between y_pred & y_true: 0.317599"redFitted.back <- data.frame(qualityPred = model.back_red$fitted.values,
qualityAct = X.train_red$quality)
ggplot(redFitted.back, aes(x = qualityPred,
y = qualityAct)) +
geom_point(aes(color = as.factor(qualityAct)), show.legend = F) +
geom_smooth(method = "lm", se = F) +
labs(title = "Predicted vs Actual Values Using Train Dataset",
x = "Predicted quality",
y = "Actual quality")redPredict.backDF <- data.frame(qualityPred = redPredict.back,
qualityAct = X.test_red$quality)
ggplot(redPredict.backDF, aes(x = qualityPred,
y = qualityAct)) +
geom_point(aes(color = as.factor(qualityAct)), show.legend = F) +
geom_smooth(method = "lm", se = F) +
labs(title = "Predicted vs Actual Values Using Test Dataset",
x = "Predicted quality",
y = "Actual quality")There are several points we can infer from the performance results above:
We have finished this article. Below are the points we can conclude from this article:
A linear model has been created. The target variable is quality, whereas the attributes of physicochemical tests are as the predictor variables.
The selected model is the one with all available variables. So, the model equation is as follows:
\(\hat{Y} = \beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+\beta_4X_4+\beta_5X_5+\beta_6X_6+\beta_7X_7+\beta_8X_8+\beta_9X_9+\beta_{10} X_{10}+\beta_{11}X_{11}\)
## [1] "Adjusted R-squared: 0.3832"
## [1] "MSE: 0.3983"
## [1] "RMSE: 0.6311"
## [1] "MAE: 0.4966"
The selected model has been examined and improved using statistics tests (the assumptions and feature selection) and visualizations in Modelling and Model Improvements chapters.
The selected model has been interpretated in Results and Discussions chapter.
The selected model has been tested using test dataset test and discussed in Results and Discussions chapter.
The selected model performs ineffective in modelling the train dataset. The best adjusted R-squared value produced is only at 0.3832.
As the selected model show poor performances, it also demonstrates deficient results when predicting the test dataset.
As stated earlier in histogram and scatter plots sections that it is found an initial hypothesis that the target variable has classified values instead of continuous values so it seems the linear regression is not suitable with the dataset, such hypothesis has been proven. All results and discussions in this article verify it.
As mentioned above, therefore, it can be concluded that for the type of this dataset, in particular a target variable with classified values, it is not recommended to model the data using linear regression.
For future study, other algorithms will be applied to model this wine quality dataset.