RPubs Project Link: https://rpubs.com/asakpal/775841
1. Introduction
2. Problem Statement
3. Data Set
4. Preprocessing the Data
4.1 Loading Packages
4.2 Reading the Dataset
4.3 Checking for Missing Values
4.4 Proportion of Wine Quality in the Dataset
5.1 Categorize the Data
5.2 Inspecting the Distribution
5.3 Downsample the Data
5.4 Normalize the Data
5.5 Splitting the Data
5.6 Multicollinearity of the Predictors
5.7 Logistic Stepwise Model Implementation
5.8 Log of Odds Ratio
5.9 Distribution of Predicted Values
5.10 Confusion Matrix for the Logistic Model
6.1 Reading the Data
6.2 Splitting the Data Set into Test and Train
6.3 Linear Model 1 - LM1
6.3.1 Model Implementation LM1
6.3.2 Checking VIF Values LM1
6.3.3 Confusion Matrix for LM1
6.3.4 Evaluation of LM1
6.4 Linear Model 2 - LM2
6.4.1 Model Implementation LM2
6.4.2 VIF Values of LM2
6.4.3 Confusion Matrix of LM2
6.4.4 Evaluation of LM2
6.5 Linear Model 3 - LM3
6.5.1 Model Implementation LM3
6.5.2 VIF Values of LM3
6.5.3 Confusion Matrix of LM3
6.5.4 Evaluation of LM3
6.6 Comparing the Linear Models
Wine is an alcoholic drink typically made from fermented grapes. Yeast consumes the sugar in the grapes and converts it to ethanol, carbon dioxide and heat. Different varieties of grapes and strains of yeasts are major factors in different styles of wine. These differences result from the complex interactions between the biochemical development of the grape, the reactions involved in fermentation, the grape’s growing environment (terroir), and the wine production process. Many countries enact legal appellations intended to define styles and qualities of wine. These typically restrict the geographical origin and permitted varieties of grapes, as well as other aspects of wine production. Wines not made from grapes involve fermentation of additional crops, including rice wine and other fruit wines such as plum, cherry, pomegranate, currant and elderberry.(WikiLink: https://en.wikipedia.org/wiki/Wine )
The essence of wine lies in its composition, and understanding its elements is necessary for determining quality.
Different elements of wine contribute to its quality. Our data set contains measurements from one wine manufacturer.
It would be very useful if we could model the data to predict wine quality. We will fit logistic and linear models to analyze whether the quality of wine can be predicted. The project involves transforming the data to make it suitable for logistic regression. We will also implement three linear models to predict the quality of wine and evaluate their performance. The accuracy of the models is assessed using confusion matrices.
The details of the dataset can be found here:
Link: https://archive.ics.uci.edu/ml/datasets/wine+quality
The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: [Web Link] or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
Attribute Information:
For more information, read [Cortez et al., 2009]. Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
library(ggplot2)   # to plot
library(inspectdf) # to inspect proportions of categorical variables
library(GGally)    # to inspect correlation between variables
library(caret)     # to perform cross-validation and downsampling
library(class)     # package for knn
library(car)       # to inspect multicollinearity
library(readr)     # to read the csv dataset
library(e1071)     # miscellaneous statistical functions used by some caret utilities
winequality_red <- read_csv("D:/Sem 1 2021/Regression Analysis/Project/winequality-red.csv")
head(winequality_red)
anyNA(winequality_red)
## [1] FALSE
Since anyNA() returned FALSE, there are no missing values in the dataset.
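Had anyNA() returned TRUE, a per-column count would locate the gaps; a quick sketch of that check:
colSums(is.na(winequality_red)) # number of missing values in each column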
round(prop.table(table(winequality_red$quality)), 3)*100
##
## 3 4 5 6 7 8
## 0.6 3.3 42.6 39.9 12.4 1.1
We can see that the wine quality in the dataset ranges from 3 to 8. From the description of the dataset, the higher the score, the better the wine: 3 is the lowest quality present in the data and 8 the highest.
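Since ggplot2 is already loaded, the same proportions can also be visualized as a bar chart; a minimal sketch:
ggplot(winequality_red, aes(x = factor(quality))) +
  geom_bar() +
  labs(x = "quality score", y = "count", title = "Distribution of red wine quality")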
As we are going to implement logistic regression, we factorize the wine quality. We will label wines with a quality score of 7 or above as Excellent (1) and those below 7 as Good (0).
winequality_red$quality <- as.factor(ifelse(winequality_red$quality>=7, 1, 0))
summary(winequality_red$quality)
## 0 1
## 1382 217
From the results, we can see that there are 217 Excellent wines and 1382 Good wines.
inspect_cat(winequality_red)
As the numbers of Good and Excellent wines differ substantially, the model can produce biased results. Hence we downsample the data for our model implementation; as the output below shows, each class is then present in equal proportion.
# Downsample the data:
set.seed(89)
winequality_red.d <- downSample(x = winequality_red[,-12], y = winequality_red$quality,
                                list = FALSE, yname = "quality")
inspect_cat(winequality_red.d)
# Summary of the downsampled data
summary(winequality_red.d)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 4.900 Min. :0.1200 Min. :0.0000 Min. : 1.200
## 1st Qu.: 7.200 1st Qu.:0.3400 1st Qu.:0.1500 1st Qu.: 1.900
## Median : 8.200 Median :0.4300 Median :0.3300 Median : 2.200
## Mean : 8.567 Mean :0.4694 Mean :0.3195 Mean : 2.575
## 3rd Qu.: 9.800 3rd Qu.:0.5837 3rd Qu.:0.4700 3rd Qu.: 2.600
## Max. :15.600 Max. :1.3300 Max. :0.7600 Max. :13.800
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.01200 Min. : 3.00 Min. : 6.00 Min. :0.9906
## 1st Qu.:0.06700 1st Qu.: 6.25 1st Qu.: 19.00 1st Qu.:0.9952
## Median :0.07800 Median :12.00 Median : 31.00 Median :0.9964
## Mean :0.08335 Mean :15.10 Mean : 42.21 Mean :0.9965
## 3rd Qu.:0.08900 3rd Qu.:20.00 3rd Qu.: 53.00 3rd Qu.:0.9977
## Max. :0.46700 Max. :57.00 Max. :289.00 Max. :1.0032
## pH sulphates alcohol quality
## Min. :2.870 Min. :0.3900 Min. : 8.70 0:217
## 1st Qu.:3.210 1st Qu.:0.5800 1st Qu.: 9.80 1:217
## Median :3.300 Median :0.6700 Median :10.80
## Mean :3.306 Mean :0.6916 Mean :10.86
## 3rd Qu.:3.400 3rd Qu.:0.7800 3rd Qu.:11.70
## Max. :3.780 Max. :1.5600 Max. :14.00
From the summary of the downsampled data we can see that the predictors sit on very different scales (for example, total_sulfur_dioxide runs up to 289 while density stays near 1), so we normalize the data for our model implementation.
# min-max normalization to the [0, 1] range
normalize <- function(x){
  (x - min(x)) / (max(x) - min(x))
}
# normalize our data
winequality_red.n <- winequality_red.d
winequality_red.n[,-12] <- sapply(winequality_red.n[,-12], normalize)
summary(winequality_red.n)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.2150 1st Qu.:0.1818 1st Qu.:0.1974 1st Qu.:0.05556
## Median :0.3084 Median :0.2562 Median :0.4342 Median :0.07937
## Mean :0.3427 Mean :0.2888 Mean :0.4204 Mean :0.10910
## 3rd Qu.:0.4579 3rd Qu.:0.3833 3rd Qu.:0.6184 3rd Qu.:0.11111
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.1209 1st Qu.:0.06019 1st Qu.:0.04594 1st Qu.:0.3617
## Median :0.1451 Median :0.16667 Median :0.08834 Median :0.4586
## Mean :0.1568 Mean :0.22414 Mean :0.12797 Mean :0.4641
## 3rd Qu.:0.1692 3rd Qu.:0.31481 3rd Qu.:0.16608 3rd Qu.:0.5611
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## pH sulphates alcohol quality
## Min. :0.0000 Min. :0.0000 Min. :0.0000 0:217
## 1st Qu.:0.3736 1st Qu.:0.1624 1st Qu.:0.2075 1:217
## Median :0.4725 Median :0.2393 Median :0.3962
## Mean :0.4787 Mean :0.2578 Mean :0.4081
## 3rd Qu.:0.5824 3rd Qu.:0.3333 3rd Qu.:0.5660
## Max. :1.0000 Max. :1.0000 Max. :1.0000
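As a quick sanity check, the helper maps any numeric vector onto the [0, 1] range:
normalize(c(2, 4, 6)) # returns 0.0 0.5 1.0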
We are going to split the data into training and testing subsets (70%/30%).
set.seed(4709)
train <- sample(nrow(winequality_red.n), nrow(winequality_red.n)*0.7) # sample indices for 70% of the normalized data
winequality_red.train <- winequality_red.n[train,] # 70% of the data, for training
winequality_red.test <- winequality_red.n[-train,] # the remaining 30%, for testing
winequality_red.train.label <- winequality_red.n[train,12] # `quality` labels for the train set
winequality_red.test.label <- winequality_red.n[-train,12] # `quality` labels for the test set
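The simple random split is adequate here because downsampling has already balanced the classes; for unbalanced data, a stratified split via caret's createDataPartition() is a common alternative. A sketch (the names idx, strat_train and strat_test are hypothetical):
idx <- createDataPartition(winequality_red.n$quality, p = 0.7, list = FALSE) # stratified 70% indices
strat_train <- winequality_red.n[idx, ]
strat_test <- winequality_red.n[-idx, ]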
ggcorr(winequality_red.train[,-c(5,8,11)], hjust=1, layout.exp = 1, label = T, label_size = 3.0)
## Warning in ggcorr(winequality_red.train[, -c(5, 8, 11)], hjust = 1, layout.exp =
## 1, : data in column(s) 'quality' are not numeric and were ignored
We remove fixed_acidity and total_sulfur_dioxide from the predictor set, as they are responsible for high pairwise correlations (around 0.7) with other predictors.
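For a numeric view of the same pairwise correlations, the factor column quality (column 12) can be dropped and cor() applied; a quick sketch:
round(cor(winequality_red.train[, -12]), 2) # correlation matrix of the 11 numeric predictors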
log <- glm(formula = quality ~ volatile_acidity + citric_acid + residual_sugar +
    free_sulfur_dioxide + pH + sulphates, data = winequality_red.train, family = "binomial")
step(log, direction = "backward")
## Start: AIC=367.72
## quality ~ volatile_acidity + citric_acid + residual_sugar + free_sulfur_dioxide +
## pH + sulphates
##
## Df Deviance AIC
## - citric_acid 1 353.74 365.74
## - pH 1 353.86 365.86
## <none> 353.72 367.72
## - free_sulfur_dioxide 1 359.32 371.32
## - residual_sugar 1 361.10 373.10
## - volatile_acidity 1 367.54 379.54
## - sulphates 1 372.65 384.65
##
## Step: AIC=365.74
## quality ~ volatile_acidity + residual_sugar + free_sulfur_dioxide +
## pH + sulphates
##
## Df Deviance AIC
## - pH 1 353.87 363.87
## <none> 353.74 365.74
## - free_sulfur_dioxide 1 359.45 369.45
## - residual_sugar 1 361.51 371.51
## - volatile_acidity 1 370.32 380.32
## - sulphates 1 373.16 383.16
##
## Step: AIC=363.87
## quality ~ volatile_acidity + residual_sugar + free_sulfur_dioxide +
## sulphates
##
## Df Deviance AIC
## <none> 353.87 363.87
## - free_sulfur_dioxide 1 359.86 367.86
## - residual_sugar 1 362.66 370.66
## - sulphates 1 373.34 381.34
## - volatile_acidity 1 373.61 381.61
##
## Call: glm(formula = quality ~ volatile_acidity + residual_sugar + free_sulfur_dioxide +
## sulphates, family = "binomial", data = winequality_red.train)
##
## Coefficients:
## (Intercept) volatile_acidity residual_sugar
## 0.03197 -4.56695 3.53315
## free_sulfur_dioxide sulphates
## -1.69027 4.87004
##
## Degrees of Freedom: 302 Total (i.e. Null); 298 Residual
## Null Deviance: 420
## Residual Deviance: 353.9 AIC: 363.9
log_step <- glm(formula = quality ~ volatile_acidity + residual_sugar + sulphates, family = "binomial",
    data = winequality_red.train) # reduced logistic model kept for prediction
The log of odds helps determine how much the odds of the target variable change with a unit increase or decrease in each predictor in the model.
In multivariate analysis:
A positive coefficient describes a positive association between a predictor variable and the odds of our target variable.
A negative coefficient describes a negative association between a predictor variable and the odds of our target variable.
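For example, taking the volatile_acidity coefficient of roughly -4.57 from the stepwise output above:
exp(-4.57) # ~0.010: a one-unit increase in normalized volatile acidity multiplies the odds of an Excellent wine by about 0.01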
data.frame(coefficient = round(coef(log_step),2),
    odds_ratio = round(exp(coef(log_step)),2))
As the odds ratios show, volatile acidity does not produce significant results, while the other predictors are significant in the logistic regression.
winequality_red.test$pred <- predict(log_step, winequality_red.test, type = "response") # predicted probabilities on the test set
hist(winequality_red.test$pred, breaks=30, main = "Histogram of Predicted Wine Quality") # inspect the distribution of predicted value
abline(v = 0.7, col="blue") # adding abline where the distribution separate
From the histogram of the predicted values, we do not see a clean separation between the two classes. We therefore assume that the split of the distribution is at 0.7 for our model evaluation.
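We keep the 0.7 cutoff below, but a less ad hoc choice is the threshold that maximizes Youden's J statistic on a ROC curve; a sketch assuming the pROC package (not part of this project's library list) is installed:
library(pROC)
roc_obj <- roc(winequality_red.test.label, winequality_red.test$pred) # ROC curve from labels and predicted probabilities
coords(roc_obj, "best", best.method = "youden") # cutoff with the best sensitivity/specificity trade-off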
winequality_red.test$pred.label <- ifelse(winequality_red.test$pred>0.7, 1, 0)
confusionMatrix(as.factor(winequality_red.test$pred.label), winequality_red.test.label, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 58 39
## 1 9 25
##
## Accuracy : 0.6336
## 95% CI : (0.545, 0.716)
## No Information Rate : 0.5115
## P-Value [Acc > NIR] : 0.003218
##
## Kappa : 0.259
##
## Mcnemar's Test P-Value : 2.842e-05
##
## Sensitivity : 0.3906
## Specificity : 0.8657
## Pos Pred Value : 0.7353
## Neg Pred Value : 0.5979
## Prevalence : 0.4885
## Detection Rate : 0.1908
## Detection Prevalence : 0.2595
## Balanced Accuracy : 0.6281
##
## 'Positive' Class : 1
##
From the analysis of the confusion matrix, we can say that logistic regression achieves only 63.36% accuracy in predicting whether a wine is Good or Excellent.
Linear models can help us predict the wine quality. However, there are certain assumptions about the residuals we need to consider when implementing linear models: the error terms are assumed to be independent, normally distributed with mean zero, and of constant variance.
# Reading the data for linear models:
winequality_red <- read_csv("D:/Sem 1 2021/Regression Analysis/Project/winequality-red.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## fixed_acidity = col_double(),
## volatile_acidity = col_double(),
## citric_acid = col_double(),
## residual_sugar = col_double(),
## chlorides = col_double(),
## free_sulfur_dioxide = col_double(),
## total_sulfur_dioxide = col_double(),
## density = col_double(),
## pH = col_double(),
## sulphates = col_double(),
## alcohol = col_double(),
## quality = col_double()
## )
head(winequality_red)
train_lm <- sample(nrow(winequality_red), nrow(winequality_red)*0.7) # indices for 70% of the full data
lm_train <- winequality_red[train_lm,] # 70% of the data, for training
lm_test <- winequality_red[-train_lm,] # the remaining 30%, for testing
linear_model1 <- lm(quality ~ . , lm_train) # taking all the points.
summary(linear_model1)
##
## Call:
## lm(formula = quality ~ ., data = lm_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.41584 -0.39933 -0.06416 0.41147 1.86593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.576491 51.306879 0.401 0.6887
## fixed_acidity 0.119674 0.058203 2.056 0.0407 *
## volatile_acidity -1.009153 0.247263 -4.081 5.79e-05 ***
## citric_acid -0.506567 0.306298 -1.654 0.0992 .
## residual_sugar 0.045232 0.036331 1.245 0.2141
## chlorides -0.816095 0.682091 -1.196 0.2325
## free_sulfur_dioxide -0.005036 0.005424 -0.928 0.3540
## total_sulfur_dioxide -0.002956 0.001745 -1.695 0.0912 .
## density -19.999682 52.359080 -0.382 0.7028
## pH 0.360725 0.446791 0.807 0.4201
## sulphates 0.436545 0.209296 2.086 0.0379 *
## alcohol 0.331011 0.055900 5.921 8.98e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6007 on 291 degrees of freedom
## Multiple R-squared: 0.4042, Adjusted R-squared: 0.3816
## F-statistic: 17.94 on 11 and 291 DF, p-value: < 2.2e-16
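Before reading too much into the coefficients, the residual assumptions stated earlier can be inspected visually with base R's diagnostic plots; a sketch:
par(mfrow = c(2, 2)) # arrange the four diagnostic plots in a grid
plot(linear_model1) # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))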
vif(linear_model1)
## fixed_acidity volatile_acidity citric_acid
## 8.397381 1.658356 3.397631
## residual_sugar chlorides free_sulfur_dioxide
## 2.004281 1.895356 2.447268
## total_sulfur_dioxide density pH
## 2.626095 5.775649 3.595172
## sulphates alcohol
## 1.758435 2.121034
As we can see, fixed_acidity and density have high VIF values and can be considered for removal in our next model. Ideally, VIF values should be close to 1.
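As a reminder of what VIF measures, the value for one predictor can be reproduced by hand: regress that predictor on the remaining ones and apply 1 / (1 - R^2). A sketch (r2 is a hypothetical name):
r2 <- summary(lm(fixed_acidity ~ . - quality, data = lm_train))$r.squared # R^2 of fixed_acidity on the other predictors
1 / (1 - r2) # should match the car::vif() value above (about 8.4)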
linear_model1_predict <- round(predict(linear_model1, lm_test),0) # round continuous predictions to the nearest quality score
conf_mat_lm1 <- table(predicted = linear_model1_predict, actual = lm_test$quality)
conf_mat_lm1
## actual
## predicted 3 4 5 6 7 8
## 4 0 0 3 0 0 0
## 5 7 18 329 102 3 0
## 6 3 24 191 378 100 8
## 7 0 1 9 48 64 7
## 8 0 0 1 0 0 0
sum(diag(conf_mat_lm1))/length(lm_test$quality)
## [1] 0.1983025
The accuracy of linear model 1 is only 19.83 percent.
For this model, we considered removing the variables with high VIF values from model 1.
linear_model2 <- lm(quality ~ . -fixed_acidity -density , lm_train)
summary(linear_model2)
##
## Call:
## lm(formula = quality ~ . - fixed_acidity - density, data = lm_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.51691 -0.38414 -0.05599 0.43074 1.83128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.425116 1.143362 2.996 0.002973 **
## volatile_acidity -0.919396 0.243868 -3.770 0.000197 ***
## citric_acid -0.051678 0.259133 -0.199 0.842067
## residual_sugar 0.051945 0.027361 1.898 0.058617 .
## chlorides -1.582326 0.629769 -2.513 0.012523 *
## free_sulfur_dioxide -0.002000 0.005317 -0.376 0.707043
## total_sulfur_dioxide -0.004756 0.001632 -2.915 0.003833 **
## pH -0.276943 0.328796 -0.842 0.400310
## sulphates 0.448152 0.204108 2.196 0.028900 *
## alcohol 0.355592 0.042941 8.281 4.4e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6067 on 293 degrees of freedom
## Multiple R-squared: 0.388, Adjusted R-squared: 0.3692
## F-statistic: 20.64 on 9 and 293 DF, p-value: < 2.2e-16
vif(linear_model2)
## volatile_acidity citric_acid residual_sugar
## 1.581266 2.383795 1.114361
## chlorides free_sulfur_dioxide total_sulfur_dioxide
## 1.583816 2.305330 2.251913
## pH sulphates alcohol
## 1.908540 1.639317 1.226878
The VIF values are close to 1, which indicates negligible multicollinearity between the predictors.
linear_model2_predict <- round(predict(linear_model2, lm_test),0)
conf_mat_lm2 <- table(predicted = linear_model2_predict, actual = lm_test$quality)
conf_mat_lm2
## actual
## predicted 3 4 5 6 7 8
## 4 0 0 2 0 0 0
## 5 7 17 317 96 2 0
## 6 3 24 204 369 95 8
## 7 0 2 9 63 70 7
## 8 0 0 1 0 0 0
sum(diag(conf_mat_lm2))/length(lm_test$quality)
## [1] 0.2191358
The accuracy of linear model 2 is only 21.91%.
In model 3, we removed the predictors that were insignificant in model 2 (judging from its coefficient table) and analysed the performance of the resulting model.
linear_model3 <- lm(quality ~ . -fixed_acidity -citric_acid -residual_sugar -free_sulfur_dioxide -pH , lm_train)
summary(linear_model3)
##
## Call:
## lm(formula = quality ~ . - fixed_acidity - citric_acid - residual_sugar -
## free_sulfur_dioxide - pH, data = lm_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.43103 -0.41105 -0.05954 0.40211 1.94758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -55.150550 23.032365 -2.394 0.0173 *
## volatile_acidity -0.888760 0.201958 -4.401 1.51e-05 ***
## chlorides -1.415238 0.607468 -2.330 0.0205 *
## total_sulfur_dioxide -0.004617 0.001103 -4.186 3.75e-05 ***
## density 57.707918 23.032675 2.505 0.0128 *
## sulphates 0.393553 0.200777 1.960 0.0509 .
## alcohol 0.374913 0.040290 9.305 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6023 on 296 degrees of freedom
## Multiple R-squared: 0.3908, Adjusted R-squared: 0.3784
## F-statistic: 31.64 on 6 and 296 DF, p-value: < 2.2e-16
vif(linear_model3)
## volatile_acidity chlorides total_sulfur_dioxide
## 1.100581 1.495516 1.044419
## density sulphates alcohol
## 1.111846 1.609801 1.096136
linear_model3_predict <- round(predict(linear_model3, lm_test),0)
conf_mat_lm3 <- table(predicted = linear_model3_predict, actual = lm_test$quality)
conf_mat_lm3
## actual
## predicted 3 4 5 6 7 8
## 4 0 0 2 0 0 0
## 5 7 13 312 103 4 0
## 6 3 29 215 383 107 9
## 7 0 1 3 42 56 6
## 8 0 0 1 0 0 0
sum(diag(conf_mat_lm3))/length(lm_test$quality)
## [1] 0.2083333
The accuracy of linear model 3 is only 20.83%.
None of the linear models performed well in predicting the quality of wine. Linear model LM1 used all the predictors, many of which were insignificant, suggesting the model could be improved. We analysed the VIF values of model 1 and removed the variables with high VIF: dropping fixed_acidity and density improved the results, although the accuracy rose by only about two percentage points. In linear model 3 we removed all the insignificant coefficients, which led to poorer results. Hence the best-performing linear model among those implemented was model 2.
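Accuracy computed from rounded predictions is a coarse yardstick; adjusted R-squared and AIC put the three fits on a common footing. A sketch:
data.frame(model = c("LM1", "LM2", "LM3"),
           adj_r2 = c(summary(linear_model1)$adj.r.squared,
                      summary(linear_model2)$adj.r.squared,
                      summary(linear_model3)$adj.r.squared),
           AIC = AIC(linear_model1, linear_model2, linear_model3)$AIC)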
Thus, in this project we implemented a logistic model and linear models for classification and prediction of wine quality. The logistic model achieved an accuracy of 63.36%, while the linear models achieved accuracies of 19.83%, 21.91% and 20.83%. The assumptions of the linear models clearly fail, given how poorly the models performed, so residual analysis would offer limited insight. The linear models need further scrutiny with respect to outliers, multicollinearity and feature selection. The same analysis could be carried out for white wine as required.