Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

image of data description

Count Regression Model

Count Regression model is a regression model that relates a non-negative integer value (0, 1, 2, 3, etc.) field of interest (a target variable) to one or more fields that are expected to have an influence on the target variable, and are often called predictor variables.

Objective

Our objective is to build poisson regression models, negative binomial regression models and multiple linear regression models on the training data and predict the number of cases of wine that will be sold given certain properties of the wine. Given 2 datasets (i.e training and evaluation), Using the training data set, will evaluate the performance of the count regression model and make predictions using the evaluation data set.

Library load

library(ggplot2)
library(dplyr)
library(kableExtra)
library(hrbrthemes)
library(ggthemes)
library(visdat)
library(pscl)
library(MASS)
library(corrplot)
library(VIM)
library(tidyr)

Data Load

Train dataset consists of 12795 observations (rows) and 15 features (variables).

wine_train <-
  read.csv("https://raw.githubusercontent.com/mharrisonbaker/DATA621_GroupWork2/main/HW5/wine-training-data.csv", 
           header = TRUE) %>% dplyr::select(-ï..INDEX)
dim(wine_train)
## [1] 12795    15
kable(wine_train[1:10,]) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS
3 3.2 1.160 -0.98 54.20 -0.567 NA 268 0.99280 3.33 -0.59 9.9 0 8 2
3 4.5 0.160 -0.81 26.10 -0.425 15 -327 1.02792 3.38 0.70 NA -1 7 3
5 7.1 2.640 -0.88 14.80 0.037 214 142 0.99518 3.12 0.48 22.0 -1 8 3
3 5.7 0.385 0.04 18.80 -0.425 22 115 0.99640 2.24 1.83 6.2 -1 6 1
4 8.0 0.330 -1.26 9.40 NA -167 108 0.99457 3.12 1.77 13.7 0 9 2
0 11.3 0.320 0.59 2.20 0.556 -37 15 0.99940 3.20 1.29 15.4 0 11 NA
0 7.7 0.290 -0.40 21.50 0.060 287 156 0.99572 3.49 1.21 10.3 0 8 NA
4 6.5 -1.220 0.34 1.40 0.040 523 551 1.03236 3.20 NA 11.6 1 7 3
3 14.8 0.270 1.05 11.25 -0.007 -213 NA 0.99620 4.93 0.26 15.0 0 6 NA
6 5.5 -0.220 0.39 1.80 -0.277 62 180 0.94724 3.09 0.75 12.6 0 8 4

Data Exploration

Summary of data shows some variables having NAs, we will see the disribution of these using boxplot to identify outliers.

summary(wine_train)
##      TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
##  Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
##  1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
##  Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
##  Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
##  3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
##  Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
##                                                                       
##  ResidualSugar        Chlorides       FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00   Min.   :-823.0    
##  1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00   1st Qu.:  27.0    
##  Median :   3.900   Median : 0.0460   Median :  30.00   Median : 123.0    
##  Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85   Mean   : 120.7    
##  3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00   3rd Qu.: 208.0    
##  Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00   Max.   :1057.0    
##  NA's   :616        NA's   :638       NA's   :647       NA's   :682       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
##  1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
##  Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
##  Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
##  Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
##                   NA's   :395     NA's   :1210      NA's   :653    
##   LabelAppeal          AcidIndex          STARS      
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##                                       NA's   :3359

Data distributions

All variables are symetrically unimodal distributed except TARGET.

Correlation of variables

STARTS and LabelAppeal seems to have a positive correlation with the number of cases sold.

kable(cor(drop_na(wine_train))[,14], "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T) %>%
  scroll_box(height = "500px")
x
TARGET -0.1676431
FixedAcidity 0.1541678
VolatileAcidity 0.0250530
CitricAcid 0.0545838
ResidualSugar -0.0203019
Chlorides -0.0017134
FreeSulfurDioxide -0.0147337
TotalSulfurDioxide -0.0221293
Density 0.0477788
pH -0.0537129
Sulphates 0.0310718
Alcohol -0.0558919
LabelAppeal 0.0103010
AcidIndex 1.0000000
STARS -0.0954826
library(corrgram)
corrgram(drop_na(wine_train), order=TRUE,
         upper.panel=panel.cor, main="correlation")

Missing Values

visdat::vis_miss(wine_train)

Data Preparation

ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol, and STARS have NAs. Replace NAs with median value.

wine_train$ResidualSugar[is.na(wine_train$ResidualSugar)] <- median(wine_train$ResidualSugar, na.rm=TRUE)
wine_train$Chlorides[is.na(wine_train$Chlorides)] <- median(wine_train$Chlorides, na.rm=TRUE)
wine_train$FreeSulfurDioxide[is.na(wine_train$FreeSulfurDioxide)] <- median(wine_train$FreeSulfurDioxide, na.rm=TRUE)
wine_train$TotalSulfurDioxide[is.na(wine_train$TotalSulfurDioxide)] <- median(wine_train$TotalSulfurDioxide, na.rm=TRUE)
wine_train$pH[is.na(wine_train$pH)] <- median(wine_train$pH, na.rm=TRUE)
wine_train$Sulphates[is.na(wine_train$Sulphates)] <- median(wine_train$Sulphates, na.rm=TRUE)
wine_train$Alcohol[is.na(wine_train$Alcohol)] <- median(wine_train$Alcohol, na.rm=TRUE)
wine_train$STARS[is.na(wine_train$STARS)] <- median(wine_train$STARS, na.rm=TRUE)
# to check NAS exist or not
visdat::vis_miss(wine_train)

Build Models

We will build poisson regression models, negative binomial regression models, multiple linear regression models, and zero-inflated poisson model using all variables and backward approach.

poisson regression model1: all variables
model1 = glm(TARGET ~  ., data=wine_train, family=poisson)
summary(model1)
## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4819  -0.5261   0.2041   0.6365   2.5504  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.017e+00  1.957e-01  10.305  < 2e-16 ***
## FixedAcidity       -4.515e-04  8.195e-04  -0.551 0.581697    
## VolatileAcidity    -5.045e-02  6.492e-03  -7.771 7.82e-15 ***
## CitricAcid          1.331e-02  5.892e-03   2.258 0.023925 *  
## ResidualSugar       1.442e-04  1.545e-04   0.934 0.350499    
## Chlorides          -6.005e-02  1.645e-02  -3.650 0.000262 ***
## FreeSulfurDioxide   1.424e-04  3.513e-05   4.054 5.04e-05 ***
## TotalSulfurDioxide  1.065e-04  2.268e-05   4.695 2.67e-06 ***
## Density            -4.315e-01  1.921e-01  -2.247 0.024651 *  
## pH                 -2.389e-02  7.639e-03  -3.128 0.001762 ** 
## Sulphates          -1.877e-02  5.739e-03  -3.271 0.001073 ** 
## Alcohol             5.368e-03  1.410e-03   3.807 0.000141 ***
## LabelAppeal         1.965e-01  6.021e-03  32.634  < 2e-16 ***
## AcidIndex          -1.222e-01  4.463e-03 -27.381  < 2e-16 ***
## STARS               2.198e-01  6.468e-03  33.986  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18419  on 12780  degrees of freedom
## AIC: 50391
## 
## Number of Fisher Scoring iterations: 5
plot(model1)

  • Residual vs fitted plot shows red line is fairly flat so linearty assumption is met.
  • QQ-plot shows some observations are fall on the line with minimal deviatation
  • Scale-Location plot helps us to check the assumption of equal variance (homoscedasticity). Here red color line not completely staright but it is not bad.
  • Residual vs Leverage plot helps us to determine if we have influential outliers in our data. Residuals are not present outside of the cooks distance
poisson regression model2: with selected variables

Remove acidity variables and chemical variables from models. We will build one more glm model, using the variables:Density, PH, Sulphates, Alcohol, LabelAppeal, AcidIndex, and STARS.

model2 = glm(TARGET ~ 
                 Alcohol + LabelAppeal + AcidIndex + STARS, data=wine_train, family=poisson)
summary(model2)
## 
## Call:
## glm(formula = TARGET ~ Alcohol + LabelAppeal + AcidIndex + STARS, 
##     family = poisson, data = wine_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2829  -0.4886   0.2126   0.6327   2.7117  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.517121   0.040464  37.493  < 2e-16 ***
## Alcohol      0.005317   0.001410   3.771 0.000163 ***
## LabelAppeal  0.196742   0.006015  32.709  < 2e-16 ***
## AcidIndex   -0.124712   0.004369 -28.543  < 2e-16 ***
## STARS        0.222430   0.006461  34.427  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18568  on 12790  degrees of freedom
## AIC: 50520
## 
## Number of Fisher Scoring iterations: 5
plot(model2)

This model looks better than first model.

negative binomial regression model3: with all variables
model3 <- glm.nb(TARGET ~ ., data = wine_train)
summary(model3)
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = wine_train, init.theta = 39421.16225, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4818  -0.5261   0.2041   0.6365   2.5503  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.017e+00  1.957e-01  10.305  < 2e-16 ***
## FixedAcidity       -4.515e-04  8.196e-04  -0.551 0.581705    
## VolatileAcidity    -5.045e-02  6.493e-03  -7.770 7.83e-15 ***
## CitricAcid          1.331e-02  5.892e-03   2.258 0.023931 *  
## ResidualSugar       1.443e-04  1.545e-04   0.934 0.350498    
## Chlorides          -6.005e-02  1.645e-02  -3.650 0.000262 ***
## FreeSulfurDioxide   1.424e-04  3.513e-05   4.054 5.04e-05 ***
## TotalSulfurDioxide  1.065e-04  2.269e-05   4.695 2.67e-06 ***
## Density            -4.315e-01  1.921e-01  -2.247 0.024655 *  
## pH                 -2.389e-02  7.639e-03  -3.128 0.001763 ** 
## Sulphates          -1.877e-02  5.739e-03  -3.271 0.001073 ** 
## Alcohol             5.368e-03  1.410e-03   3.806 0.000141 ***
## LabelAppeal         1.965e-01  6.021e-03  32.633  < 2e-16 ***
## AcidIndex          -1.222e-01  4.464e-03 -27.380  < 2e-16 ***
## STARS               2.198e-01  6.468e-03  33.984  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(39421.16) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18419  on 12780  degrees of freedom
## AIC: 50394
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  39421 
##           Std. Err.:  59707 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50361.62
plot(model3)

negative binomial regression model4: with selected variables
model4 <- glm.nb(TARGET ~ Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train)
summary(model4)
## 
## Call:
## glm.nb(formula = TARGET ~ Alcohol + LabelAppeal + AcidIndex + 
##     STARS, data = wine_train, init.theta = 38799.08563, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2828  -0.4885   0.2126   0.6327   2.7116  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.517138   0.040465  37.492  < 2e-16 ***
## Alcohol      0.005317   0.001410   3.771 0.000163 ***
## LabelAppeal  0.196743   0.006015  32.708  < 2e-16 ***
## AcidIndex   -0.124714   0.004369 -28.542  < 2e-16 ***
## STARS        0.222428   0.006461  34.426  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(38799.09) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18567  on 12790  degrees of freedom
## AIC: 50522
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  38799 
##           Std. Err.:  60155 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50510.44
plot(model4)

Linear Model model5: with all variables
model5 <- lm(TARGET ~ ., data = wine_train)
summary(model5)
## 
## Call:
## lm(formula = TARGET ~ ., data = wine_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2211 -0.7540  0.3598  1.1254  4.3550 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.355e+00  5.517e-01   9.707  < 2e-16 ***
## FixedAcidity       -1.168e-03  2.315e-03  -0.505 0.613911    
## VolatileAcidity    -1.549e-01  1.838e-02  -8.429  < 2e-16 ***
## CitricAcid          3.976e-02  1.673e-02   2.377 0.017476 *  
## ResidualSugar       4.716e-04  4.371e-04   1.079 0.280670    
## Chlorides          -1.931e-01  4.638e-02  -4.164 3.15e-05 ***
## FreeSulfurDioxide   4.286e-04  9.941e-05   4.312 1.63e-05 ***
## TotalSulfurDioxide  3.098e-04  6.387e-05   4.851 1.25e-06 ***
## Density            -1.274e+00  5.427e-01  -2.347 0.018959 *  
## pH                 -6.387e-02  2.154e-02  -2.965 0.003028 ** 
## Sulphates          -5.485e-02  1.623e-02  -3.380 0.000728 ***
## Alcohol             1.883e-02  3.972e-03   4.739 2.17e-06 ***
## LabelAppeal         5.945e-01  1.686e-02  35.250  < 2e-16 ***
## AcidIndex          -3.259e-01  1.117e-02 -29.169  < 2e-16 ***
## STARS               7.478e-01  1.946e-02  38.431  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.626 on 12780 degrees of freedom
## Multiple R-squared:  0.2879, Adjusted R-squared:  0.2871 
## F-statistic: 369.1 on 14 and 12780 DF,  p-value: < 2.2e-16
plot(model5)

Linear Model model6: with selected variables

Added CitricAcid, Sulphates, VolatileAcidity, Chlorides in to the model, these varaibles having low p-value to get better fit model.

model6 <- lm(TARGET ~ CitricAcid + Sulphates + VolatileAcidity + Chlorides +
             Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train)
summary(model6)
## 
## Call:
## lm(formula = TARGET ~ CitricAcid + Sulphates + VolatileAcidity + 
##     Chlorides + Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0223 -0.7411  0.3702  1.1218  4.4756 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.969865   0.106568  37.252  < 2e-16 ***
## CitricAcid       0.041882   0.016760   2.499  0.01247 *  
## Sulphates       -0.054633   0.016256  -3.361  0.00078 ***
## VolatileAcidity -0.158457   0.018409  -8.608  < 2e-16 ***
## Chlorides       -0.199935   0.046443  -4.305 1.68e-05 ***
## Alcohol          0.018274   0.003978   4.594 4.40e-06 ***
## LabelAppeal      0.594530   0.016895  35.190  < 2e-16 ***
## AcidIndex       -0.330626   0.010972 -30.133  < 2e-16 ***
## STARS            0.748890   0.019491  38.422  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.63 on 12786 degrees of freedom
## Multiple R-squared:  0.2846, Adjusted R-squared:  0.2842 
## F-statistic:   636 on 8 and 12786 DF,  p-value: < 2.2e-16
plot(model6)

Above model shows Multiple R-squared 0.2846 means model explains 29% variation in the response variable.

Zero inflation model7: poisson

Zero inflation model:

Zero-inflated poisson regression is used to model count data that has an excess of zero counts. Further, theory suggests that the excess zeros are generated by a separate process from the count values and that the excess zeros can be modeled independently. Thus, the zip model has two parts, a poisson count model and the logit model for predicting excess zeros.

model7 <- zeroinfl(TARGET ~ . | STARS, data = wine_train, dist = 'poisson')
summary(model7)
## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = wine_train, dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5795 -0.2927  0.1833  0.5092  2.2406 
## 
## Count model coefficients (poisson with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.646e+00  2.061e-01   7.987 1.38e-15 ***
## FixedAcidity        9.837e-05  8.538e-04   0.115  0.90828    
## VolatileAcidity    -1.900e-02  6.843e-03  -2.777  0.00548 ** 
## CitricAcid          2.249e-03  6.119e-03   0.368  0.71316    
## ResidualSugar      -4.630e-05  1.614e-04  -0.287  0.77420    
## Chlorides          -2.728e-02  1.722e-02  -1.584  0.11313    
## FreeSulfurDioxide   4.982e-05  3.593e-05   1.386  0.16561    
## TotalSulfurDioxide -3.468e-06  2.296e-05  -0.151  0.87995    
## Density            -3.195e-01  2.016e-01  -1.585  0.11288    
## pH                  7.112e-04  7.996e-03   0.089  0.92913    
## Sulphates          -3.625e-03  6.027e-03  -0.601  0.54755    
## Alcohol             7.189e-03  1.461e-03   4.919 8.69e-07 ***
## LabelAppeal         2.435e-01  6.342e-03  38.399  < 2e-16 ***
## AcidIndex          -4.335e-02  5.400e-03  -8.028 9.91e-16 ***
## STARS               1.025e-01  6.463e-03  15.852  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.36082    0.06746  -5.349 8.85e-08 ***
## STARS       -0.58527    0.03405 -17.188  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 21 
## Log-likelihood: -2.302e+04 on 17 Df
Zero inflation model8: negative binomial
model8 <- zeroinfl(TARGET ~ . | STARS, data = wine_train, dist = 'negbin')
summary(model8)
## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = wine_train, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5795 -0.2927  0.1833  0.5092  2.2406 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.646e+00  2.061e-01   7.987 1.38e-15 ***
## FixedAcidity        9.837e-05  8.538e-04   0.115  0.90828    
## VolatileAcidity    -1.900e-02  6.843e-03  -2.777  0.00548 ** 
## CitricAcid          2.249e-03  6.119e-03   0.368  0.71317    
## ResidualSugar      -4.630e-05  1.614e-04  -0.287  0.77421    
## Chlorides          -2.728e-02  1.722e-02  -1.584  0.11313    
## FreeSulfurDioxide   4.982e-05  3.593e-05   1.386  0.16561    
## TotalSulfurDioxide -3.468e-06  2.296e-05  -0.151  0.87995    
## Density            -3.195e-01  2.016e-01  -1.585  0.11289    
## pH                  7.111e-04  7.996e-03   0.089  0.92913    
## Sulphates          -3.625e-03  6.027e-03  -0.601  0.54754    
## Alcohol             7.189e-03  1.461e-03   4.919 8.69e-07 ***
## LabelAppeal         2.435e-01  6.342e-03  38.399  < 2e-16 ***
## AcidIndex          -4.335e-02  5.400e-03  -8.028 9.91e-16 ***
## STARS               1.025e-01  6.463e-03  15.852  < 2e-16 ***
## Log(theta)          1.797e+01         NA      NA       NA    
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.36082    0.06746  -5.349 8.85e-08 ***
## STARS       -0.58527    0.03405 -17.188  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 63639583.2798 
## Number of iterations in BFGS optimization: 21 
## Log-likelihood: -2.302e+04 on 18 Df

Model Selection

Models by matrices MSE and AIC.

aic1 <- model1$aic
aic2 <- model2$aic
aic3 <- model3$aic
aic4 <- model4$aic
aic5 <- model5$aic
aic6 <- model6$aic
aic7 <- model7$aic
aic8 <- model8$aic
mse1 <- mean((wine_train$TARGET - predict(model1))^2)
mse2 <- mean((wine_train$TARGET - predict(model2))^2)
mse3 <- mean((wine_train$TARGET - predict(model3))^2)
mse4 <- mean((wine_train$TARGET - predict(model4))^2)
mse5 <- mean((wine_train$TARGET - predict(model5))^2)
mse6 <- mean((wine_train$TARGET - predict(model6))^2)
mse7 <- mean((wine_train$TARGET - predict(model7))^2)
mse8 <- mean((wine_train$TARGET - predict(model8))^2)
MSE <- c(mse1, mse2, mse3, mse4, mse5, mse6, mse7, mse8)
AIC <- c(aic1, aic2, aic3, aic4, aic5, aic6, aic7, aic8)
compare_df <- cbind(MSE,AIC)
rownames(compare_df) <- c("Model1", "Model2", "Model3", "Model4", "Model5", "Model6", "Model7", "Model8")
DT::datatable(compare_df)

Model 6 working fine than other models. Apply model 6 on test data.

Load test dataset to apply model and predict the TARGET.

wine_test =
  read.csv("https://raw.githubusercontent.com/mharrisonbaker/DATA621_GroupWork2/main/HW5/wine-evaluation-data.csv",
           header = TRUE) %>% dplyr::select(-IN)
dim(wine_test)
## [1] 3335   15
DT::datatable(head(wine_test))
Test data cleaning

Remove Index column and replace NAS with Median value.

summary(wine_test)
##   TARGET         FixedAcidity     VolatileAcidity     CitricAcid     
##  Mode:logical   Min.   :-18.200   Min.   :-2.8300   Min.   :-3.1200  
##  NA's:3335      1st Qu.:  5.200   1st Qu.: 0.0800   1st Qu.: 0.0000  
##                 Median :  6.900   Median : 0.2800   Median : 0.3100  
##                 Mean   :  6.864   Mean   : 0.3103   Mean   : 0.3124  
##                 3rd Qu.:  9.000   3rd Qu.: 0.6300   3rd Qu.: 0.6050  
##                 Max.   : 33.500   Max.   : 3.6100   Max.   : 3.7600  
##                                                                      
##  ResidualSugar        Chlorides        FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-128.300   Min.   :-1.15000   Min.   :-563.00   Min.   :-769.00   
##  1st Qu.:  -2.600   1st Qu.: 0.01600   1st Qu.:   3.00   1st Qu.:  27.25   
##  Median :   3.600   Median : 0.04700   Median :  30.00   Median : 124.00   
##  Mean   :   5.319   Mean   : 0.06143   Mean   :  34.95   Mean   : 123.41   
##  3rd Qu.:  17.200   3rd Qu.: 0.17100   3rd Qu.:  79.25   3rd Qu.: 210.00   
##  Max.   : 145.400   Max.   : 1.26300   Max.   : 617.00   Max.   :1004.00   
##  NA's   :168        NA's   :138        NA's   :152       NA's   :157       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8898   Min.   :0.600   Min.   :-3.0700   Min.   :-4.20  
##  1st Qu.:0.9883   1st Qu.:2.980   1st Qu.: 0.3300   1st Qu.: 9.00  
##  Median :0.9946   Median :3.210   Median : 0.5000   Median :10.40  
##  Mean   :0.9947   Mean   :3.237   Mean   : 0.5346   Mean   :10.58  
##  3rd Qu.:1.0005   3rd Qu.:3.490   3rd Qu.: 0.8200   3rd Qu.:12.50  
##  Max.   :1.0998   Max.   :6.210   Max.   : 4.1800   Max.   :25.60  
##                   NA's   :104     NA's   :310       NA's   :185    
##   LabelAppeal         AcidIndex          STARS     
##  Min.   :-2.00000   Min.   : 5.000   Min.   :1.00  
##  1st Qu.:-1.00000   1st Qu.: 7.000   1st Qu.:1.00  
##  Median : 0.00000   Median : 8.000   Median :2.00  
##  Mean   : 0.01349   Mean   : 7.748   Mean   :2.04  
##  3rd Qu.: 1.00000   3rd Qu.: 8.000   3rd Qu.:3.00  
##  Max.   : 2.00000   Max.   :17.000   Max.   :4.00  
##                                      NA's   :841
wine_test$ResidualSugar[is.na(wine_test$ResidualSugar)] <- median(wine_test$ResidualSugar, na.rm=TRUE)
wine_test$Chlorides[is.na(wine_test$Chlorides)] <- median(wine_test$Chlorides, na.rm=TRUE)
wine_test$FreeSulfurDioxide[is.na(wine_test$FreeSulfurDioxide)] <- median(wine_test$FreeSulfurDioxide, na.rm=TRUE)
wine_test$TotalSulfurDioxide[is.na(wine_test$TotalSulfurDioxide)] <- median(wine_test$TotalSulfurDioxide, na.rm=TRUE)
wine_test$pH[is.na(wine_test$pH)] <- median(wine_test$pH, na.rm=TRUE)
wine_test$Sulphates[is.na(wine_test$Sulphates)] <- median(wine_test$Sulphates, na.rm=TRUE)
wine_test$Alcohol[is.na(wine_test$Alcohol)] <- median(wine_test$Alcohol, na.rm=TRUE)
wine_test$STARS[is.na(wine_test$STARS)] <- median(wine_test$STARS, na.rm=TRUE)
# to check NAS exist or not
visdat::vis_miss(wine_test)

wine_test$TARGET <- round(predict(model6, newdata=wine_test))
head(wine_test)
##   TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1      3          5.4          -0.860       0.27         -10.7     0.092
## 2      3         12.4           0.385      -0.76         -19.7     1.169
## 3      2          7.2           1.750       0.17         -33.0     0.065
## 4      2          6.2           0.100       1.80           1.0    -0.179
## 5      2         11.4           0.210       0.28           1.2     0.038
## 6      5         17.6           0.040      -1.15           1.4     0.535
##   FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
## 1                23                398 0.98527 5.02      0.64   12.30
## 2               -37                 68 0.99048 3.37      1.09   16.00
## 3                 9                 76 1.04641 4.61      0.68    8.55
## 4               104                 89 0.98877 3.20      2.11   12.30
## 5                70                 53 1.02899 2.54     -0.07    4.80
## 6              -250                140 0.95028 3.06     -0.02   11.40
##   LabelAppeal AcidIndex STARS
## 1          -1         6     2
## 2           0         6     2
## 3           0         8     1
## 4          -1         8     1
## 5           0        10     2
## 6           1         8     4

Add 3 new column in wine_test for a classification of wine - poor, okay, and good. Poor will be those with scores of 0,1,2,3; 4 will be okay and more than 4 will be good.

These newly added 3 column will show TRUE and FALSE values based on TARGET.

wine_test$Poor <- wine_test$TARGET <= 3
wine_test$Okay <- wine_test$TARGET == 4
wine_test$Good <- wine_test$TARGET >= 5
# created copy of test_wine and move last 3 col to 2,3,4 index 
wine_test_new <-   wine_test %>% dplyr::select(TARGET, Poor, Okay, Good, everything())
kable(wine_test_new[1:20,])  %>% 
  kable_styling(full_width = T)
TARGET Poor Okay Good FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS
3 TRUE FALSE FALSE 5.4 -0.860 0.27 -10.7 0.092 23 398 0.98527 5.02 0.64 12.30 -1 6 2
3 TRUE FALSE FALSE 12.4 0.385 -0.76 -19.7 1.169 -37 68 0.99048 3.37 1.09 16.00 0 6 2
2 TRUE FALSE FALSE 7.2 1.750 0.17 -33.0 0.065 9 76 1.04641 4.61 0.68 8.55 0 8 1
2 TRUE FALSE FALSE 6.2 0.100 1.80 1.0 -0.179 104 89 0.98877 3.20 2.11 12.30 -1 8 1
2 TRUE FALSE FALSE 11.4 0.210 0.28 1.2 0.038 70 53 1.02899 2.54 -0.07 4.80 0 10 2
5 FALSE FALSE TRUE 17.6 0.040 -1.15 1.4 0.535 -250 140 0.95028 3.06 -0.02 11.40 1 8 4
2 TRUE FALSE FALSE 15.5 0.530 -0.53 4.6 1.263 10 17 1.00020 3.07 0.75 8.50 0 12 3
4 FALSE TRUE FALSE 15.9 1.190 1.14 31.9 -0.299 115 381 1.03416 2.99 0.31 11.40 1 7 2
1 TRUE FALSE FALSE 11.6 0.320 0.55 -50.9 0.076 35 83 1.00020 3.32 2.18 -0.50 0 12 2
3 TRUE FALSE FALSE 3.8 0.220 0.31 -7.7 0.039 40 129 0.90610 4.72 -0.64 10.90 0 7 2
2 TRUE FALSE FALSE 6.8 1.680 0.44 -13.3 0.046 30 583 1.00833 3.21 1.64 12.60 0 8 1
3 TRUE FALSE FALSE 9.0 -0.210 0.04 51.4 0.237 -213 -527 0.99516 3.16 0.70 14.70 1 10 2
3 TRUE FALSE FALSE 24.6 0.030 -1.20 1.3 0.035 241 297 0.99232 2.22 0.50 9.80 0 9 2
3 TRUE FALSE FALSE 13.0 0.210 0.32 -3.2 -0.263 111 141 0.95918 3.20 0.50 4.20 0 8 2
2 TRUE FALSE FALSE 17.9 -0.420 -0.91 7.1 0.045 -177 169 0.95307 3.17 -1.12 13.20 -1 9 2
1 TRUE FALSE FALSE 10.0 0.200 1.27 30.9 0.050 19 152 0.99400 3.21 0.42 13.80 -1 11 2
2 TRUE FALSE FALSE 7.4 0.290 0.50 8.5 -0.480 178 647 0.97275 3.45 0.50 10.20 -1 8 1
1 TRUE FALSE FALSE 11.7 1.180 -0.94 -62.0 0.675 7 -393 0.99974 3.96 0.69 5.20 1 13 2
3 TRUE FALSE FALSE 9.7 0.410 -1.00 3.6 -0.235 24 113 0.99772 3.44 0.53 9.80 0 7 2
5 FALSE FALSE TRUE -5.2 -0.980 -0.08 6.4 0.046 180 166 0.99400 3.30 2.18 9.90 1 5 3

Reference:

https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/

https://stats.idre.ucla.edu/r/dae/zip/