RPubs Project Link: https://rpubs.com/asakpal/775841
1. Introduction
2. Problem Statement
3. Data Set
4. Preprocessing the Data
4.1 Loading Packages
4.2 Reading the Dataset
4.3 Checking for Missing Values
4.4 Proportion of Wine Quality in the Dataset
5.1 Categorize the Data
5.2 Inspecting the Distribution
5.3 Downsample the Data
5.4 Normalize the Data
5.5 Splitting the Data
5.6 Multicollinearity of the Predictors
5.7 Logistic Stepwise Model Implementation
5.8 Log of Odds Ratio
5.9 Distribution of Predicted Values
5.10 Confusion Matrix for the Logistic Model
6.1 Reading the Data
6.2 Splitting the Data Set into Test and Train
6.3 Linear Model 1 - LM1
6.3.1 Model Implementation LM1
6.3.2 Checking VIF Values LM1
6.3.3 Confusion Matrix for LM1
6.3.4 Evaluation of LM1
6.4 Linear Model 2 - LM2
6.4.1 Model Implementation LM2
6.4.2 VIF Values of LM2
6.4.3 Confusion Matrix of LM2
6.4.4 Evaluation of LM2
6.5 Linear Model 3 - LM3
6.5.1 Model Implementation LM3
6.5.2 VIF Values of LM3
6.5.3 Confusion Matrix of LM3
6.5.4 Evaluation of LM3
6.6 Comparing the Linear Models
Wine is an alcoholic drink typically made from fermented grapes. Yeast consumes the sugar in the grapes and converts it to ethanol, carbon dioxide and heat. Different varieties of grapes and strains of yeasts are major factors in different styles of wine. These differences result from the complex interactions between the biochemical development of the grape, the reactions involved in fermentation, the grape’s growing environment (terroir), and the wine production process. Many countries enact legal appellations intended to define styles and qualities of wine. These typically restrict the geographical origin and permitted varieties of grapes, as well as other aspects of wine production. Wines not made from grapes involve fermentation of additional crops, including rice wine and other fruit wines such as plum, cherry, pomegranate, currant and elderberry.(WikiLink: https://en.wikipedia.org/wiki/Wine )
The essence of wine lies in its composition, and understanding its elements is necessary for determining quality.
Different elements of wine contribute to its quality. Our data set contains measurements from one wine manufacturer.
It would be very useful if we could model the data to predict wine quality. We will fit logistic and linear models to analyze whether the quality of wine can be predicted. The project involves transforming the data to make it suitable for logistic regression. We will also implement three linear models to predict the quality of wine and evaluate their performance. The accuracy of the models is assessed using confusion matrices.
The details of the dataset can be found here:
Link: https://archive.ics.uci.edu/ml/datasets/wine+quality
The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: [Web Link] or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
Attribute Information:
For more information, read [Cortez et al., 2009]. Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
library(ggplot2)   # to plot
library(inspectdf) # to inspect proportions of categorical variables
library(GGally)    # to inspect correlation between variables
library(caret)     # to perform cross-validation and downsampling
library(class)     # package for knn
library(car)       # to inspect multicollinearity
library(readr)     # to read the csv dataset
library(e1071)     # miscellaneous statistical functions used by some caret utilities
winequality_red <- read_csv("D:/Sem 1 2021/Regression Analysis/Project/winequality-red.csv")
head(winequality_red)
anyNA(winequality_red)
## [1] FALSE
Since anyNA() returned FALSE, there are no missing values in the dataset.
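Had anyNA() returned TRUE, a per-column count would locate the gaps; a quick sketch of that check:
colSums(is.na(winequality_red)) # number of missing values in each column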
round(prop.table(table(winequality_red$quality)), 3)*100
##
## 3 4 5 6 7 8
## 0.6 3.3 42.6 39.9 12.4 1.1
We can see that the wine quality in the dataset ranges from 3 to 8. From the description of the dataset, the higher the score, the better the wine: 3 is the lowest quality present in the data and 8 the highest.
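Since ggplot2 is already loaded, the same proportions can also be visualized as a bar chart; a minimal sketch:
ggplot(winequality_red, aes(x = factor(quality))) +
  geom_bar() +
  labs(x = "quality score", y = "count", title = "Distribution of red wine quality")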
As we are going to implement logistic regression, we factorize the wine quality. We will label wines with a quality score of 7 or above as Excellent (1) and those below 7 as Good (0).
winequality_red$quality <- as.factor(ifelse(winequality_red$quality>=7, 1, 0))
summary(winequality_red$quality)
## 0 1
## 1382 217
From the results, we can see that there are 217 Excellent wines and 1382 Good wines.
inspect_cat(winequality_red)
As the numbers of Good and Excellent wines differ substantially, the model can produce biased results. Hence we downsample the data for our model implementation; as the output below shows, each class is then present in equal proportion.
# Downsample the data:
set.seed(89)
winequality_red.d <- downSample(x = winequality_red[,-12], y = winequality_red$quality,
                                list = FALSE, yname = "quality")
inspect_cat(winequality_red.d)
# Summary of the downsampled data
summary(winequality_red.d)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 4.900 Min. :0.1200 Min. :0.0000 Min. : 1.200
## 1st Qu.: 7.200 1st Qu.:0.3400 1st Qu.:0.1500 1st Qu.: 1.900
## Median : 8.200 Median :0.4300 Median :0.3300 Median : 2.200
## Mean : 8.567 Mean :0.4694 Mean :0.3195 Mean : 2.575
## 3rd Qu.: 9.800 3rd Qu.:0.5837 3rd Qu.:0.4700 3rd Qu.: 2.600
## Max. :15.600 Max. :1.3300 Max. :0.7600 Max. :13.800
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.01200 Min. : 3.00 Min. : 6.00 Min. :0.9906
## 1st Qu.:0.06700 1st Qu.: 6.25 1st Qu.: 19.00 1st Qu.:0.9952
## Median :0.07800 Median :12.00 Median : 31.00 Median :0.9964
## Mean :0.08335 Mean :15.10 Mean : 42.21 Mean :0.9965
## 3rd Qu.:0.08900 3rd Qu.:20.00 3rd Qu.: 53.00 3rd Qu.:0.9977
## Max. :0.46700 Max. :57.00 Max. :289.00 Max. :1.0032
## pH sulphates alcohol quality
## Min. :2.870 Min. :0.3900 Min. : 8.70 0:217
## 1st Qu.:3.210 1st Qu.:0.5800 1st Qu.: 9.80 1:217
## Median :3.300 Median :0.6700 Median :10.80
## Mean :3.306 Mean :0.6916 Mean :10.86
## 3rd Qu.:3.400 3rd Qu.:0.7800 3rd Qu.:11.70
## Max. :3.780 Max. :1.5600 Max. :14.00
From the summary of the downsampled data we can see that the predictors sit on very different scales (for example, total_sulfur_dioxide runs up to 289 while density stays near 1), so we normalize the data for our model implementation.
# min-max normalization to the [0, 1] range
normalize <- function(x){
  (x - min(x)) / (max(x) - min(x))
}
# normalize our data
winequality_red.n <- winequality_red.d
winequality_red.n[,-12] <- sapply(winequality_red.n[,-12], normalize)
summary(winequality_red.n)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.2150 1st Qu.:0.1818 1st Qu.:0.1974 1st Qu.:0.05556
## Median :0.3084 Median :0.2562 Median :0.4342 Median :0.07937
## Mean :0.3427 Mean :0.2888 Mean :0.4204 Mean :0.10910
## 3rd Qu.:0.4579 3rd Qu.:0.3833 3rd Qu.:0.6184 3rd Qu.:0.11111
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.1209 1st Qu.:0.06019 1st Qu.:0.04594 1st Qu.:0.3617
## Median :0.1451 Median :0.16667 Median :0.08834 Median :0.4586
## Mean :0.1568 Mean :0.22414 Mean :0.12797 Mean :0.4641
## 3rd Qu.:0.1692 3rd Qu.:0.31481 3rd Qu.:0.16608 3rd Qu.:0.5611
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## pH sulphates alcohol quality
## Min. :0.0000 Min. :0.0000 Min. :0.0000 0:217
## 1st Qu.:0.3736 1st Qu.:0.1624 1st Qu.:0.2075 1:217
## Median :0.4725 Median :0.2393 Median :0.3962
## Mean :0.4787 Mean :0.2578 Mean :0.4081
## 3rd Qu.:0.5824 3rd Qu.:0.3333 3rd Qu.:0.5660
## Max. :1.0000 Max. :1.0000 Max. :1.0000
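As a quick sanity check, the helper maps any numeric vector onto the [0, 1] range:
normalize(c(2, 4, 6)) # returns 0.0 0.5 1.0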
We are going to split the data into training and testing subsets (70%/30%).
set.seed(4709)
train <- sample(nrow(winequality_red.n), nrow(winequality_red.n)*0.7) # sample indices for 70% of the normalized data
winequality_red.train <- winequality_red.n[train,] # 70% of the data, for training
winequality_red.test <- winequality_red.n[-train,] # the remaining 30%, for testing
winequality_red.train.label <- winequality_red.n[train,12] # `quality` labels for the train set
winequality_red.test.label <- winequality_red.n[-train,12] # `quality` labels for the test set
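The simple random split is adequate here because downsampling has already balanced the classes; for unbalanced data, a stratified split via caret's createDataPartition() is a common alternative. A sketch (the names idx, strat_train and strat_test are hypothetical):
idx <- createDataPartition(winequality_red.n$quality, p = 0.7, list = FALSE) # stratified 70% indices
strat_train <- winequality_red.n[idx, ]
strat_test <- winequality_red.n[-idx, ]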
ggcorr(winequality_red.train[,-c(5,8,11)], hjust=1, layout.exp = 1, label = T, label_size = 3.0)
## Warning in ggcorr(winequality_red.train[, -c(5, 8, 11)], hjust = 1, layout.exp =
## 1, : data in column(s) 'quality' are not numeric and were ignored
We remove fixed_acidity and total_sulfur_dioxide from the predictor set, as they are responsible for high pairwise correlations (around 0.7) with other predictors.
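For a numeric view of the same pairwise correlations, the factor column quality (column 12) can be dropped and cor() applied; a quick sketch:
round(cor(winequality_red.train[, -12]), 2) # correlation matrix of the 11 numeric predictors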
log <- glm(formula = quality ~ volatile_acidity + citric_acid + residual_sugar +
    free_sulfur_dioxide + pH + sulphates, data = winequality_red.train, family = "binomial")
step(log, direction = "backward")
## Start: AIC=367.72
## quality ~ volatile_acidity + citric_acid + residual_sugar + free_sulfur_dioxide +
## pH + sulphates
##
## Df Deviance AIC
## - citric_acid 1 353.74 365.74
## - pH 1 353.86 365.86
## <none> 353.72 367.72
## - free_sulfur_dioxide 1 359.32 371.32
## - residual_sugar 1 361.10 373.10
## - volatile_acidity 1 367.54 379.54
## - sulphates 1 372.65 384.65
##
## Step: AIC=365.74
## quality ~ volatile_acidity + residual_sugar + free_sulfur_dioxide +
## pH + sulphates
##
## Df Deviance AIC
## - pH 1 353.87 363.87
## <none> 353.74 365.74
## - free_sulfur_dioxide 1 359.45 369.45
## - residual_sugar 1 361.51 371.51
## - volatile_acidity 1 370.32 380.32
## - sulphates 1 373.16 383.16
##
## Step: AIC=363.87
## quality ~ volatile_acidity + residual_sugar + free_sulfur_dioxide +
## sulphates
##
## Df Deviance AIC
## <none> 353.87 363.87
## - free_sulfur_dioxide 1 359.86 367.86
## - residual_sugar 1 362.66 370.66
## - sulphates 1 373.34 381.34
## - volatile_acidity 1 373.61 381.61
##
## Call: glm(formula = quality ~ volatile_acidity + residual_sugar + free_sulfur_dioxide +
## sulphates, family = "binomial", data = winequality_red.train)
##
## Coefficients:
## (Intercept) volatile_acidity residual_sugar
## 0.03197 -4.56695 3.53315
## free_sulfur_dioxide sulphates
## -1.69027 4.87004
##
## Degrees of Freedom: 302 Total (i.e. Null); 298 Residual
## Null Deviance: 420
## Residual Deviance: 353.9 AIC: 363.9
log_step <- glm(formula = quality ~ volatile_acidity + residual_sugar + sulphates, family = "binomial",
    data = winequality_red.train) # reduced logistic model kept for prediction
The log of odds helps determine how much the odds of the target variable change with a unit increase or decrease in each predictor in the model.
In multivariate analysis:
A positive coefficient describes a positive association between a predictor variable and the odds of our target variable.
A negative coefficient describes a negative association between a predictor variable and the odds of our target variable.
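For example, taking the volatile_acidity coefficient of roughly -4.57 from the stepwise output above:
exp(-4.57) # ~0.010: a one-unit increase in normalized volatile acidity multiplies the odds of an Excellent wine by about 0.01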
data.frame(coefficient = round(coef(log_step),2),
    odds_ratio = round(exp(coef(log_step)),2))
As the odds ratios show, volatile acidity does not produce significant results, while the other predictors are significant in the logistic regression.
winequality_red.test$pred <- predict(log_step, winequality_red.test, type = "response") # predicted probabilities on the test set
hist(winequality_red.test$pred, breaks=30, main = "Histogram of Predicted Wine Quality") # inspect the distribution of predicted value
abline(v = 0.7, col="blue") # adding abline where the distribution separate
From the histogram of the predicted values, we do not see a clean separation between the two classes. We therefore assume that the split of the distribution is at 0.7 for our model evaluation.
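We keep the 0.7 cutoff below, but a less ad hoc choice is the threshold that maximizes Youden's J statistic on a ROC curve; a sketch assuming the pROC package (not part of this project's library list) is installed:
library(pROC)
roc_obj <- roc(winequality_red.test.label, winequality_red.test$pred) # ROC curve from labels and predicted probabilities
coords(roc_obj, "best", best.method = "youden") # cutoff with the best sensitivity/specificity trade-off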
winequality_red.test$pred.label <- ifelse(winequality_red.test$pred>0.7, 1, 0)
confusionMatrix(as.factor(winequality_red.test$pred.label), winequality_red.test.label, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 58 39
## 1 9 25
##
## Accuracy : 0.6336
## 95% CI : (0.545, 0.716)
## No Information Rate : 0.5115
## P-Value [Acc > NIR] : 0.003218
##
## Kappa : 0.259
##
## Mcnemar's Test P-Value : 2.842e-05
##
## Sensitivity : 0.3906
## Specificity : 0.8657
## Pos Pred Value : 0.7353
## Neg Pred Value : 0.5979
## Prevalence : 0.4885
## Detection Rate : 0.1908
## Detection Prevalence : 0.2595
## Balanced Accuracy : 0.6281
##
## 'Positive' Class : 1
##
From the analysis of the confusion matrix, we can say that logistic regression achieves only 63.36% accuracy in predicting whether a wine is Good or Excellent.
Linear models can help us predict the wine quality. However, there are certain assumptions about the residuals we need to consider when implementing linear models: the error terms are assumed to be independent, normally distributed with mean zero, and of constant variance.
# Reading the data for linear models:
winequality_red <- read_csv("D:/Sem 1 2021/Regression Analysis/Project/winequality-red.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## fixed_acidity = col_double(),
## volatile_acidity = col_double(),
## citric_acid = col_double(),
## residual_sugar = col_double(),
## chlorides = col_double(),
## free_sulfur_dioxide = col_double(),
## total_sulfur_dioxide = col_double(),
## density = col_double(),
## pH = col_double(),
## sulphates = col_double(),
## alcohol = col_double(),
## quality = col_double()
## )
head(winequality_red)
train_lm <- sample(nrow(winequality_red), nrow(winequality_red)*0.7) # indices for 70% of the full data
lm_train <- winequality_red[train_lm,] # 70% of the data, for training
lm_test <- winequality_red[-train_lm,] # the remaining 30%, for testing
linear_model1 <- lm(quality ~ . , lm_train) # taking all the points.
summary(linear_model1)
##
## Call:
## lm(formula = quality ~ ., data = lm_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.41584 -0.39933 -0.06416 0.41147 1.86593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.576491 51.306879 0.401 0.6887
## fixed_acidity 0.119674 0.058203 2.056 0.0407 *
## volatile_acidity -1.009153 0.247263 -4.081 5.79e-05 ***
## citric_acid -0.506567 0.306298 -1.654 0.0992 .
## residual_sugar 0.045232 0.036331 1.245 0.2141
## chlorides -0.816095 0.682091 -1.196 0.2325
## free_sulfur_dioxide -0.005036 0.005424 -0.928 0.3540
## total_sulfur_dioxide -0.002956 0.001745 -1.695 0.0912 .
## density -19.999682 52.359080 -0.382 0.7028
## pH 0.360725 0.446791 0.807 0.4201
## sulphates 0.436545 0.209296 2.086 0.0379 *
## alcohol 0.331011 0.055900 5.921 8.98e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6007 on 291 degrees of freedom
## Multiple R-squared: 0.4042, Adjusted R-squared: 0.3816
## F-statistic: 17.94 on 11 and 291 DF, p-value: < 2.2e-16
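Before reading too much into the coefficients, the residual assumptions stated earlier can be inspected visually with base R's diagnostic plots; a sketch:
par(mfrow = c(2, 2)) # arrange the four diagnostic plots in a grid
plot(linear_model1) # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))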
vif(linear_model1)
## fixed_acidity volatile_acidity citric_acid
## 8.397381 1.658356 3.397631
## residual_sugar chlorides free_sulfur_dioxide
## 2.004281 1.895356 2.447268
## total_sulfur_dioxide density pH
## 2.626095 5.775649 3.595172
## sulphates alcohol
## 1.758435 2.121034
As we can see, fixed_acidity and density have high VIF values and can be considered for removal in our next model. Ideally, VIF values should be close to 1.
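As a reminder of what VIF measures, the value for one predictor can be reproduced by hand: regress that predictor on the remaining ones and apply 1 / (1 - R^2). A sketch (r2 is a hypothetical name):
r2 <- summary(lm(fixed_acidity ~ . - quality, data = lm_train))$r.squared # R^2 of fixed_acidity on the other predictors
1 / (1 - r2) # should match the car::vif() value above (about 8.4)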
linear_model1_predict <- round(predict(linear_model1, lm_test),0) # round continuous predictions to the nearest quality score
conf_mat_lm1 <- table(predicted = linear_model1_predict, actual = lm_test$quality)
conf_mat_lm1
## actual
## predicted 3 4 5 6 7 8
## 4 0 0 3 0 0 0
## 5 7 18 329 102 3 0
## 6 3 24 191 378 100 8
## 7 0 1 9 48 64 7
## 8 0 0 1 0 0 0
sum(diag(conf_mat_lm1))/length(lm_test$quality)
## [1] 0.1983025
The accuracy of linear model 1 is only 19.83 percent.
For this model, we considered removing the variables with high VIF values from model 1.
linear_model2 <- lm(quality ~ . -fixed_acidity -density , lm_train)
summary(linear_model2)
##
## Call:
## lm(formula = quality ~ . - fixed_acidity - density, data = lm_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.51691 -0.38414 -0.05599 0.43074 1.83128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.425116 1.143362 2.996 0.002973 **
## volatile_acidity -0.919396 0.243868 -3.770 0.000197 ***
## citric_acid -0.051678 0.259133 -0.199 0.842067
## residual_sugar 0.051945 0.027361 1.898 0.058617 .
## chlorides -1.582326 0.629769 -2.513 0.012523 *
## free_sulfur_dioxide -0.002000 0.005317 -0.376 0.707043
## total_sulfur_dioxide -0.004756 0.001632 -2.915 0.003833 **
## pH -0.276943 0.328796 -0.842 0.400310
## sulphates 0.448152 0.204108 2.196 0.028900 *
## alcohol 0.355592 0.042941 8.281 4.4e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6067 on 293 degrees of freedom
## Multiple R-squared: 0.388, Adjusted R-squared: 0.3692
## F-statistic: 20.64 on 9 and 293 DF, p-value: < 2.2e-16
vif(linear_model2)
## volatile_acidity citric_acid residual_sugar
## 1.581266 2.383795 1.114361
## chlorides free_sulfur_dioxide total_sulfur_dioxide
## 1.583816 2.305330 2.251913
## pH sulphates alcohol
## 1.908540 1.639317 1.226878
The VIF values are close to 1, which indicates negligible multicollinearity between the predictors.
linear_model2_predict <- round(predict(linear_model2, lm_test),0)
conf_mat_lm2 <- table(predicted = linear_model2_predict, actual = lm_test$quality)
conf_mat_lm2
## actual
## predicted 3 4 5 6 7 8
## 4 0 0 2 0 0 0
## 5 7 17 317 96 2 0
## 6 3 24 204 369 95 8
## 7 0 2 9 63 70 7
## 8 0 0 1 0 0 0
sum(diag(conf_mat_lm2))/length(lm_test$quality)
## [1] 0.2191358
The accuracy of linear model 2 is only 21.91%.
In model 3, we removed the predictors that were insignificant in model 2 (judging from its coefficient table) and analysed the performance of the resulting model.
linear_model3 <- lm(quality ~ . -fixed_acidity -citric_acid -residual_sugar -free_sulfur_dioxide -pH , lm_train)
summary(linear_model3)
##
## Call:
## lm(formula = quality ~ . - fixed_acidity - citric_acid - residual_sugar -
## free_sulfur_dioxide - pH, data = lm_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.43103 -0.41105 -0.05954 0.40211 1.94758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -55.150550 23.032365 -2.394 0.0173 *
## volatile_acidity -0.888760 0.201958 -4.401 1.51e-05 ***
## chlorides -1.415238 0.607468 -2.330 0.0205 *
## total_sulfur_dioxide -0.004617 0.001103 -4.186 3.75e-05 ***
## density 57.707918 23.032675 2.505 0.0128 *
## sulphates 0.393553 0.200777 1.960 0.0509 .
## alcohol 0.374913 0.040290 9.305 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6023 on 296 degrees of freedom
## Multiple R-squared: 0.3908, Adjusted R-squared: 0.3784
## F-statistic: 31.64 on 6 and 296 DF, p-value: < 2.2e-16
vif(linear_model3)
## volatile_acidity chlorides total_sulfur_dioxide
## 1.100581 1.495516 1.044419
## density sulphates alcohol
## 1.111846 1.609801 1.096136
linear_model3_predict <- round(predict(linear_model3, lm_test),0)
conf_mat_lm3 <- table(predicted = linear_model3_predict, actual = lm_test$quality)
conf_mat_lm3
## actual
## predicted 3 4 5 6 7 8
## 4 0 0 2 0 0 0
## 5 7 13 312 103 4 0
## 6 3 29 215 383 107 9
## 7 0 1 3 42 56 6
## 8 0 0 1 0 0 0
sum(diag(conf_mat_lm3))/length(lm_test$quality)
## [1] 0.2083333
The accuracy of linear model 3 is only 20.83%.
None of the linear models performed well in predicting the quality of wine. Linear model LM1 used all the predictors, many of which were insignificant, suggesting the model could be improved. We analysed the VIF values of model 1 and removed the variables with high VIF: dropping fixed_acidity and density improved the results, although the accuracy rose by only about two percentage points. In linear model 3 we removed all the insignificant coefficients, which led to poorer results. Hence the best-performing linear model among those implemented was model 2.
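Accuracy computed from rounded predictions is a coarse yardstick; adjusted R-squared and AIC put the three fits on a common footing. A sketch:
data.frame(model = c("LM1", "LM2", "LM3"),
           adj_r2 = c(summary(linear_model1)$adj.r.squared,
                      summary(linear_model2)$adj.r.squared,
                      summary(linear_model3)$adj.r.squared),
           AIC = AIC(linear_model1, linear_model2, linear_model3)$AIC)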
Thus, in this project we implemented a logistic model and linear models for classification and prediction of wine quality. The logistic model achieved an accuracy of 63.36%, while the linear models achieved accuracies of 19.83%, 21.91% and 20.83%. The assumptions of the linear models clearly fail, given how poorly the models performed, so residual analysis would offer limited insight. The linear models need further scrutiny with respect to outliers, multicollinearity and feature selection. The same analysis could be carried out for white wine as required.