DATA 621: Homework 5

Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

image of data description

Count Regression Model

Count Regression model is a regression model that relates a non-negative integer value (0, 1, 2, 3, etc.) field of interest (a target variable) to one or more fields that are expected to have an influence on the target variable, and are often called predictor variables.

Objective

Our objective is to build poisson regression models, negative binomial regression models and multiple linear regression models on the training data and predict the number of cases of wine that will be sold given certain properties of the wine. Given 2 datasets (i.e training and evaluation), Using the training data set, will evaluate the performance of the count regression model and make predictions using the evaluation data set.

Library load

library(ggplot2)
library(dplyr)
library(kableExtra)
library(hrbrthemes)
library(ggthemes)
library(visdat)
library(pscl)
library(MASS)
library(corrplot)
library(VIM)
library(tidyr)

Data Load

Train dataset consists of 12795 observations (rows) and 15 features (variables).

wine_train <-
  read.csv("https://raw.githubusercontent.com/mharrisonbaker/DATA621_GroupWork2/main/HW5/wine-training-data.csv", 
           header = TRUE) %>% dplyr::select(-ï..INDEX)
dim(wine_train)

## [1] 12795    15

kable(wine_train[1:10,]) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")

TARGET	FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	Density	pH	Sulphates	Alcohol	LabelAppeal	AcidIndex	STARS
3	3.2	1.160	-0.98	54.20	-0.567	NA	268	0.99280	3.33	-0.59	9.9	0	8	2
3	4.5	0.160	-0.81	26.10	-0.425	15	-327	1.02792	3.38	0.70	NA	-1	7	3
5	7.1	2.640	-0.88	14.80	0.037	214	142	0.99518	3.12	0.48	22.0	-1	8	3
3	5.7	0.385	0.04	18.80	-0.425	22	115	0.99640	2.24	1.83	6.2	-1	6	1
4	8.0	0.330	-1.26	9.40	NA	-167	108	0.99457	3.12	1.77	13.7	0	9	2
0	11.3	0.320	0.59	2.20	0.556	-37	15	0.99940	3.20	1.29	15.4	0	11	NA
0	7.7	0.290	-0.40	21.50	0.060	287	156	0.99572	3.49	1.21	10.3	0	8	NA
4	6.5	-1.220	0.34	1.40	0.040	523	551	1.03236	3.20	NA	11.6	1	7	3
3	14.8	0.270	1.05	11.25	-0.007	-213	NA	0.99620	4.93	0.26	15.0	0	6	NA
6	5.5	-0.220	0.39	1.80	-0.277	62	180	0.94724	3.09	0.75	12.6	0	8	4

Data Exploration

Summary of data shows some variables having NAs, we will see the disribution of these using boxplot to identify outliers.

summary(wine_train)

##      TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
##  Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
##  1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
##  Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
##  Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
##  3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
##  Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
##                                                                       
##  ResidualSugar        Chlorides       FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00   Min.   :-823.0    
##  1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00   1st Qu.:  27.0    
##  Median :   3.900   Median : 0.0460   Median :  30.00   Median : 123.0    
##  Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85   Mean   : 120.7    
##  3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00   3rd Qu.: 208.0    
##  Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00   Max.   :1057.0    
##  NA's   :616        NA's   :638       NA's   :647       NA's   :682       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
##  1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
##  Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
##  Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
##  Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
##                   NA's   :395     NA's   :1210      NA's   :653    
##   LabelAppeal          AcidIndex          STARS      
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##                                       NA's   :3359

Data distributions

All variables are symetrically unimodal distributed except TARGET.

Correlation of variables

STARTS and LabelAppeal seems to have a positive correlation with the number of cases sold.

kable(cor(drop_na(wine_train))[,14], "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T) %>%
  scroll_box(height = "500px")

	x
TARGET	-0.1676431
FixedAcidity	0.1541678
VolatileAcidity	0.0250530
CitricAcid	0.0545838
ResidualSugar	-0.0203019
Chlorides	-0.0017134
FreeSulfurDioxide	-0.0147337
TotalSulfurDioxide	-0.0221293
Density	0.0477788
pH	-0.0537129
Sulphates	0.0310718
Alcohol	-0.0558919
LabelAppeal	0.0103010
AcidIndex	1.0000000
STARS	-0.0954826

library(corrgram)
corrgram(drop_na(wine_train), order=TRUE,
         upper.panel=panel.cor, main="correlation")

Missing Values

visdat::vis_miss(wine_train)

Data Preparation

ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol, and STARS have NAs. Replace NAs with median value.

wine_train$ResidualSugar[is.na(wine_train$ResidualSugar)] <- median(wine_train$ResidualSugar, na.rm=TRUE)
wine_train$Chlorides[is.na(wine_train$Chlorides)] <- median(wine_train$Chlorides, na.rm=TRUE)
wine_train$FreeSulfurDioxide[is.na(wine_train$FreeSulfurDioxide)] <- median(wine_train$FreeSulfurDioxide, na.rm=TRUE)
wine_train$TotalSulfurDioxide[is.na(wine_train$TotalSulfurDioxide)] <- median(wine_train$TotalSulfurDioxide, na.rm=TRUE)
wine_train$pH[is.na(wine_train$pH)] <- median(wine_train$pH, na.rm=TRUE)
wine_train$Sulphates[is.na(wine_train$Sulphates)] <- median(wine_train$Sulphates, na.rm=TRUE)
wine_train$Alcohol[is.na(wine_train$Alcohol)] <- median(wine_train$Alcohol, na.rm=TRUE)
wine_train$STARS[is.na(wine_train$STARS)] <- median(wine_train$STARS, na.rm=TRUE)
# to check NAS exist or not
visdat::vis_miss(wine_train)

Build Models

We will build poisson regression models, negative binomial regression models, multiple linear regression models, and zero-inflated poisson model using all variables and backward approach.

poisson regression model1: all variables

model1 = glm(TARGET ~  ., data=wine_train, family=poisson)
summary(model1)

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4819  -0.5261   0.2041   0.6365   2.5504  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.017e+00  1.957e-01  10.305  < 2e-16 ***
## FixedAcidity       -4.515e-04  8.195e-04  -0.551 0.581697    
## VolatileAcidity    -5.045e-02  6.492e-03  -7.771 7.82e-15 ***
## CitricAcid          1.331e-02  5.892e-03   2.258 0.023925 *  
## ResidualSugar       1.442e-04  1.545e-04   0.934 0.350499    
## Chlorides          -6.005e-02  1.645e-02  -3.650 0.000262 ***
## FreeSulfurDioxide   1.424e-04  3.513e-05   4.054 5.04e-05 ***
## TotalSulfurDioxide  1.065e-04  2.268e-05   4.695 2.67e-06 ***
## Density            -4.315e-01  1.921e-01  -2.247 0.024651 *  
## pH                 -2.389e-02  7.639e-03  -3.128 0.001762 ** 
## Sulphates          -1.877e-02  5.739e-03  -3.271 0.001073 ** 
## Alcohol             5.368e-03  1.410e-03   3.807 0.000141 ***
## LabelAppeal         1.965e-01  6.021e-03  32.634  < 2e-16 ***
## AcidIndex          -1.222e-01  4.463e-03 -27.381  < 2e-16 ***
## STARS               2.198e-01  6.468e-03  33.986  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18419  on 12780  degrees of freedom
## AIC: 50391
## 
## Number of Fisher Scoring iterations: 5

plot(model1)

Residual vs fitted plot shows red line is fairly flat so linearty assumption is met.
QQ-plot shows some observations are fall on the line with minimal deviatation
Scale-Location plot helps us to check the assumption of equal variance (homoscedasticity). Here red color line not completely staright but it is not bad.
Residual vs Leverage plot helps us to determine if we have influential outliers in our data. Residuals are not present outside of the cooks distance

poisson regression model2: with selected variables

Remove acidity variables and chemical variables from models. We will build one more glm model, using the variables:Density, PH, Sulphates, Alcohol, LabelAppeal, AcidIndex, and STARS.

model2 = glm(TARGET ~ 
                 Alcohol + LabelAppeal + AcidIndex + STARS, data=wine_train, family=poisson)
summary(model2)

## 
## Call:
## glm(formula = TARGET ~ Alcohol + LabelAppeal + AcidIndex + STARS, 
##     family = poisson, data = wine_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2829  -0.4886   0.2126   0.6327   2.7117  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.517121   0.040464  37.493  < 2e-16 ***
## Alcohol      0.005317   0.001410   3.771 0.000163 ***
## LabelAppeal  0.196742   0.006015  32.709  < 2e-16 ***
## AcidIndex   -0.124712   0.004369 -28.543  < 2e-16 ***
## STARS        0.222430   0.006461  34.427  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18568  on 12790  degrees of freedom
## AIC: 50520
## 
## Number of Fisher Scoring iterations: 5

plot(model2)

This model looks better than first model.

negative binomial regression model3: with all variables

model3 <- glm.nb(TARGET ~ ., data = wine_train)
summary(model3)

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = wine_train, init.theta = 39421.16225, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4818  -0.5261   0.2041   0.6365   2.5503  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.017e+00  1.957e-01  10.305  < 2e-16 ***
## FixedAcidity       -4.515e-04  8.196e-04  -0.551 0.581705    
## VolatileAcidity    -5.045e-02  6.493e-03  -7.770 7.83e-15 ***
## CitricAcid          1.331e-02  5.892e-03   2.258 0.023931 *  
## ResidualSugar       1.443e-04  1.545e-04   0.934 0.350498    
## Chlorides          -6.005e-02  1.645e-02  -3.650 0.000262 ***
## FreeSulfurDioxide   1.424e-04  3.513e-05   4.054 5.04e-05 ***
## TotalSulfurDioxide  1.065e-04  2.269e-05   4.695 2.67e-06 ***
## Density            -4.315e-01  1.921e-01  -2.247 0.024655 *  
## pH                 -2.389e-02  7.639e-03  -3.128 0.001763 ** 
## Sulphates          -1.877e-02  5.739e-03  -3.271 0.001073 ** 
## Alcohol             5.368e-03  1.410e-03   3.806 0.000141 ***
## LabelAppeal         1.965e-01  6.021e-03  32.633  < 2e-16 ***
## AcidIndex          -1.222e-01  4.464e-03 -27.380  < 2e-16 ***
## STARS               2.198e-01  6.468e-03  33.984  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(39421.16) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18419  on 12780  degrees of freedom
## AIC: 50394
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  39421 
##           Std. Err.:  59707 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50361.62

plot(model3)

negative binomial regression model4: with selected variables

model4 <- glm.nb(TARGET ~ Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train)
summary(model4)

## 
## Call:
## glm.nb(formula = TARGET ~ Alcohol + LabelAppeal + AcidIndex + 
##     STARS, data = wine_train, init.theta = 38799.08563, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2828  -0.4885   0.2126   0.6327   2.7116  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.517138   0.040465  37.492  < 2e-16 ***
## Alcohol      0.005317   0.001410   3.771 0.000163 ***
## LabelAppeal  0.196743   0.006015  32.708  < 2e-16 ***
## AcidIndex   -0.124714   0.004369 -28.542  < 2e-16 ***
## STARS        0.222428   0.006461  34.426  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(38799.09) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18567  on 12790  degrees of freedom
## AIC: 50522
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  38799 
##           Std. Err.:  60155 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50510.44

plot(model4)

Linear Model model5: with all variables

model5 <- lm(TARGET ~ ., data = wine_train)
summary(model5)

## 
## Call:
## lm(formula = TARGET ~ ., data = wine_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2211 -0.7540  0.3598  1.1254  4.3550 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.355e+00  5.517e-01   9.707  < 2e-16 ***
## FixedAcidity       -1.168e-03  2.315e-03  -0.505 0.613911    
## VolatileAcidity    -1.549e-01  1.838e-02  -8.429  < 2e-16 ***
## CitricAcid          3.976e-02  1.673e-02   2.377 0.017476 *  
## ResidualSugar       4.716e-04  4.371e-04   1.079 0.280670    
## Chlorides          -1.931e-01  4.638e-02  -4.164 3.15e-05 ***
## FreeSulfurDioxide   4.286e-04  9.941e-05   4.312 1.63e-05 ***
## TotalSulfurDioxide  3.098e-04  6.387e-05   4.851 1.25e-06 ***
## Density            -1.274e+00  5.427e-01  -2.347 0.018959 *  
## pH                 -6.387e-02  2.154e-02  -2.965 0.003028 ** 
## Sulphates          -5.485e-02  1.623e-02  -3.380 0.000728 ***
## Alcohol             1.883e-02  3.972e-03   4.739 2.17e-06 ***
## LabelAppeal         5.945e-01  1.686e-02  35.250  < 2e-16 ***
## AcidIndex          -3.259e-01  1.117e-02 -29.169  < 2e-16 ***
## STARS               7.478e-01  1.946e-02  38.431  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.626 on 12780 degrees of freedom
## Multiple R-squared:  0.2879, Adjusted R-squared:  0.2871 
## F-statistic: 369.1 on 14 and 12780 DF,  p-value: < 2.2e-16

plot(model5)

Linear Model model6: with selected variables

Added CitricAcid, Sulphates, VolatileAcidity, Chlorides in to the model, these varaibles having low p-value to get better fit model.

model6 <- lm(TARGET ~ CitricAcid + Sulphates + VolatileAcidity + Chlorides +
             Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train)
summary(model6)

## 
## Call:
## lm(formula = TARGET ~ CitricAcid + Sulphates + VolatileAcidity + 
##     Chlorides + Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0223 -0.7411  0.3702  1.1218  4.4756 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.969865   0.106568  37.252  < 2e-16 ***
## CitricAcid       0.041882   0.016760   2.499  0.01247 *  
## Sulphates       -0.054633   0.016256  -3.361  0.00078 ***
## VolatileAcidity -0.158457   0.018409  -8.608  < 2e-16 ***
## Chlorides       -0.199935   0.046443  -4.305 1.68e-05 ***
## Alcohol          0.018274   0.003978   4.594 4.40e-06 ***
## LabelAppeal      0.594530   0.016895  35.190  < 2e-16 ***
## AcidIndex       -0.330626   0.010972 -30.133  < 2e-16 ***
## STARS            0.748890   0.019491  38.422  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.63 on 12786 degrees of freedom
## Multiple R-squared:  0.2846, Adjusted R-squared:  0.2842 
## F-statistic:   636 on 8 and 12786 DF,  p-value: < 2.2e-16

plot(model6)

Above model shows Multiple R-squared 0.2846 means model explains 29% variation in the response variable.

Zero inflation model7: poisson

Zero inflation model:

Zero-inflated poisson regression is used to model count data that has an excess of zero counts. Further, theory suggests that the excess zeros are generated by a separate process from the count values and that the excess zeros can be modeled independently. Thus, the zip model has two parts, a poisson count model and the logit model for predicting excess zeros.

model7 <- zeroinfl(TARGET ~ . | STARS, data = wine_train, dist = 'poisson')
summary(model7)

## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = wine_train, dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5795 -0.2927  0.1833  0.5092  2.2406 
## 
## Count model coefficients (poisson with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.646e+00  2.061e-01   7.987 1.38e-15 ***
## FixedAcidity        9.837e-05  8.538e-04   0.115  0.90828    
## VolatileAcidity    -1.900e-02  6.843e-03  -2.777  0.00548 ** 
## CitricAcid          2.249e-03  6.119e-03   0.368  0.71316    
## ResidualSugar      -4.630e-05  1.614e-04  -0.287  0.77420    
## Chlorides          -2.728e-02  1.722e-02  -1.584  0.11313    
## FreeSulfurDioxide   4.982e-05  3.593e-05   1.386  0.16561    
## TotalSulfurDioxide -3.468e-06  2.296e-05  -0.151  0.87995    
## Density            -3.195e-01  2.016e-01  -1.585  0.11288    
## pH                  7.112e-04  7.996e-03   0.089  0.92913    
## Sulphates          -3.625e-03  6.027e-03  -0.601  0.54755    
## Alcohol             7.189e-03  1.461e-03   4.919 8.69e-07 ***
## LabelAppeal         2.435e-01  6.342e-03  38.399  < 2e-16 ***
## AcidIndex          -4.335e-02  5.400e-03  -8.028 9.91e-16 ***
## STARS               1.025e-01  6.463e-03  15.852  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.36082    0.06746  -5.349 8.85e-08 ***
## STARS       -0.58527    0.03405 -17.188  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 21 
## Log-likelihood: -2.302e+04 on 17 Df

Zero inflation model8: negative binomial

model8 <- zeroinfl(TARGET ~ . | STARS, data = wine_train, dist = 'negbin')
summary(model8)

## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = wine_train, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5795 -0.2927  0.1833  0.5092  2.2406 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.646e+00  2.061e-01   7.987 1.38e-15 ***
## FixedAcidity        9.837e-05  8.538e-04   0.115  0.90828    
## VolatileAcidity    -1.900e-02  6.843e-03  -2.777  0.00548 ** 
## CitricAcid          2.249e-03  6.119e-03   0.368  0.71317    
## ResidualSugar      -4.630e-05  1.614e-04  -0.287  0.77421    
## Chlorides          -2.728e-02  1.722e-02  -1.584  0.11313    
## FreeSulfurDioxide   4.982e-05  3.593e-05   1.386  0.16561    
## TotalSulfurDioxide -3.468e-06  2.296e-05  -0.151  0.87995    
## Density            -3.195e-01  2.016e-01  -1.585  0.11289    
## pH                  7.111e-04  7.996e-03   0.089  0.92913    
## Sulphates          -3.625e-03  6.027e-03  -0.601  0.54754    
## Alcohol             7.189e-03  1.461e-03   4.919 8.69e-07 ***
## LabelAppeal         2.435e-01  6.342e-03  38.399  < 2e-16 ***
## AcidIndex          -4.335e-02  5.400e-03  -8.028 9.91e-16 ***
## STARS               1.025e-01  6.463e-03  15.852  < 2e-16 ***
## Log(theta)          1.797e+01         NA      NA       NA    
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.36082    0.06746  -5.349 8.85e-08 ***
## STARS       -0.58527    0.03405 -17.188  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 63639583.2798 
## Number of iterations in BFGS optimization: 21 
## Log-likelihood: -2.302e+04 on 18 Df

Model Selection

Models by matrices MSE and AIC.

aic1 <- model1$aic
aic2 <- model2$aic
aic3 <- model3$aic
aic4 <- model4$aic
aic5 <- model5$aic
aic6 <- model6$aic
aic7 <- model7$aic
aic8 <- model8$aic
mse1 <- mean((wine_train$TARGET - predict(model1))^2)
mse2 <- mean((wine_train$TARGET - predict(model2))^2)
mse3 <- mean((wine_train$TARGET - predict(model3))^2)
mse4 <- mean((wine_train$TARGET - predict(model4))^2)
mse5 <- mean((wine_train$TARGET - predict(model5))^2)
mse6 <- mean((wine_train$TARGET - predict(model6))^2)
mse7 <- mean((wine_train$TARGET - predict(model7))^2)
mse8 <- mean((wine_train$TARGET - predict(model8))^2)
MSE <- c(mse1, mse2, mse3, mse4, mse5, mse6, mse7, mse8)
AIC <- c(aic1, aic2, aic3, aic4, aic5, aic6, aic7, aic8)
compare_df <- cbind(MSE,AIC)
rownames(compare_df) <- c("Model1", "Model2", "Model3", "Model4", "Model5", "Model6", "Model7", "Model8")
DT::datatable(compare_df)

Model 6 working fine than other models. Apply model 6 on test data.

Load test dataset to apply model and predict the TARGET.

wine_test =
  read.csv("https://raw.githubusercontent.com/mharrisonbaker/DATA621_GroupWork2/main/HW5/wine-evaluation-data.csv",
           header = TRUE) %>% dplyr::select(-IN)
dim(wine_test)

## [1] 3335   15

DT::datatable(head(wine_test))

Test data cleaning

Remove Index column and replace NAS with Median value.

summary(wine_test)

##   TARGET         FixedAcidity     VolatileAcidity     CitricAcid     
##  Mode:logical   Min.   :-18.200   Min.   :-2.8300   Min.   :-3.1200  
##  NA's:3335      1st Qu.:  5.200   1st Qu.: 0.0800   1st Qu.: 0.0000  
##                 Median :  6.900   Median : 0.2800   Median : 0.3100  
##                 Mean   :  6.864   Mean   : 0.3103   Mean   : 0.3124  
##                 3rd Qu.:  9.000   3rd Qu.: 0.6300   3rd Qu.: 0.6050  
##                 Max.   : 33.500   Max.   : 3.6100   Max.   : 3.7600  
##                                                                      
##  ResidualSugar        Chlorides        FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-128.300   Min.   :-1.15000   Min.   :-563.00   Min.   :-769.00   
##  1st Qu.:  -2.600   1st Qu.: 0.01600   1st Qu.:   3.00   1st Qu.:  27.25   
##  Median :   3.600   Median : 0.04700   Median :  30.00   Median : 124.00   
##  Mean   :   5.319   Mean   : 0.06143   Mean   :  34.95   Mean   : 123.41   
##  3rd Qu.:  17.200   3rd Qu.: 0.17100   3rd Qu.:  79.25   3rd Qu.: 210.00   
##  Max.   : 145.400   Max.   : 1.26300   Max.   : 617.00   Max.   :1004.00   
##  NA's   :168        NA's   :138        NA's   :152       NA's   :157       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8898   Min.   :0.600   Min.   :-3.0700   Min.   :-4.20  
##  1st Qu.:0.9883   1st Qu.:2.980   1st Qu.: 0.3300   1st Qu.: 9.00  
##  Median :0.9946   Median :3.210   Median : 0.5000   Median :10.40  
##  Mean   :0.9947   Mean   :3.237   Mean   : 0.5346   Mean   :10.58  
##  3rd Qu.:1.0005   3rd Qu.:3.490   3rd Qu.: 0.8200   3rd Qu.:12.50  
##  Max.   :1.0998   Max.   :6.210   Max.   : 4.1800   Max.   :25.60  
##                   NA's   :104     NA's   :310       NA's   :185    
##   LabelAppeal         AcidIndex          STARS     
##  Min.   :-2.00000   Min.   : 5.000   Min.   :1.00  
##  1st Qu.:-1.00000   1st Qu.: 7.000   1st Qu.:1.00  
##  Median : 0.00000   Median : 8.000   Median :2.00  
##  Mean   : 0.01349   Mean   : 7.748   Mean   :2.04  
##  3rd Qu.: 1.00000   3rd Qu.: 8.000   3rd Qu.:3.00  
##  Max.   : 2.00000   Max.   :17.000   Max.   :4.00  
##                                      NA's   :841

wine_test$ResidualSugar[is.na(wine_test$ResidualSugar)] <- median(wine_test$ResidualSugar, na.rm=TRUE)
wine_test$Chlorides[is.na(wine_test$Chlorides)] <- median(wine_test$Chlorides, na.rm=TRUE)
wine_test$FreeSulfurDioxide[is.na(wine_test$FreeSulfurDioxide)] <- median(wine_test$FreeSulfurDioxide, na.rm=TRUE)
wine_test$TotalSulfurDioxide[is.na(wine_test$TotalSulfurDioxide)] <- median(wine_test$TotalSulfurDioxide, na.rm=TRUE)
wine_test$pH[is.na(wine_test$pH)] <- median(wine_test$pH, na.rm=TRUE)
wine_test$Sulphates[is.na(wine_test$Sulphates)] <- median(wine_test$Sulphates, na.rm=TRUE)
wine_test$Alcohol[is.na(wine_test$Alcohol)] <- median(wine_test$Alcohol, na.rm=TRUE)
wine_test$STARS[is.na(wine_test$STARS)] <- median(wine_test$STARS, na.rm=TRUE)
# to check NAS exist or not
visdat::vis_miss(wine_test)

wine_test$TARGET <- round(predict(model6, newdata=wine_test))
head(wine_test)

##   TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1      3          5.4          -0.860       0.27         -10.7     0.092
## 2      3         12.4           0.385      -0.76         -19.7     1.169
## 3      2          7.2           1.750       0.17         -33.0     0.065
## 4      2          6.2           0.100       1.80           1.0    -0.179
## 5      2         11.4           0.210       0.28           1.2     0.038
## 6      5         17.6           0.040      -1.15           1.4     0.535
##   FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
## 1                23                398 0.98527 5.02      0.64   12.30
## 2               -37                 68 0.99048 3.37      1.09   16.00
## 3                 9                 76 1.04641 4.61      0.68    8.55
## 4               104                 89 0.98877 3.20      2.11   12.30
## 5                70                 53 1.02899 2.54     -0.07    4.80
## 6              -250                140 0.95028 3.06     -0.02   11.40
##   LabelAppeal AcidIndex STARS
## 1          -1         6     2
## 2           0         6     2
## 3           0         8     1
## 4          -1         8     1
## 5           0        10     2
## 6           1         8     4

Add 3 new column in wine_test for a classification of wine - poor, okay, and good. Poor will be those with scores of 0,1,2,3; 4 will be okay and more than 4 will be good.

These newly added 3 column will show TRUE and FALSE values based on TARGET.

wine_test$Poor <- wine_test$TARGET <= 3
wine_test$Okay <- wine_test$TARGET == 4
wine_test$Good <- wine_test$TARGET >= 5
# created copy of test_wine and move last 3 col to 2,3,4 index 
wine_test_new <-   wine_test %>% dplyr::select(TARGET, Poor, Okay, Good, everything())
kable(wine_test_new[1:20,])  %>% 
  kable_styling(full_width = T)

TARGET	Poor	Okay	Good	FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	Density	pH	Sulphates	Alcohol	LabelAppeal	AcidIndex	STARS
3	TRUE	FALSE	FALSE	5.4	-0.860	0.27	-10.7	0.092	23	398	0.98527	5.02	0.64	12.30	-1	6	2
3	TRUE	FALSE	FALSE	12.4	0.385	-0.76	-19.7	1.169	-37	68	0.99048	3.37	1.09	16.00	0	6	2
2	TRUE	FALSE	FALSE	7.2	1.750	0.17	-33.0	0.065	9	76	1.04641	4.61	0.68	8.55	0	8	1
2	TRUE	FALSE	FALSE	6.2	0.100	1.80	1.0	-0.179	104	89	0.98877	3.20	2.11	12.30	-1	8	1
2	TRUE	FALSE	FALSE	11.4	0.210	0.28	1.2	0.038	70	53	1.02899	2.54	-0.07	4.80	0	10	2
5	FALSE	FALSE	TRUE	17.6	0.040	-1.15	1.4	0.535	-250	140	0.95028	3.06	-0.02	11.40	1	8	4
2	TRUE	FALSE	FALSE	15.5	0.530	-0.53	4.6	1.263	10	17	1.00020	3.07	0.75	8.50	0	12	3
4	FALSE	TRUE	FALSE	15.9	1.190	1.14	31.9	-0.299	115	381	1.03416	2.99	0.31	11.40	1	7	2
1	TRUE	FALSE	FALSE	11.6	0.320	0.55	-50.9	0.076	35	83	1.00020	3.32	2.18	-0.50	0	12	2
3	TRUE	FALSE	FALSE	3.8	0.220	0.31	-7.7	0.039	40	129	0.90610	4.72	-0.64	10.90	0	7	2
2	TRUE	FALSE	FALSE	6.8	1.680	0.44	-13.3	0.046	30	583	1.00833	3.21	1.64	12.60	0	8	1
3	TRUE	FALSE	FALSE	9.0	-0.210	0.04	51.4	0.237	-213	-527	0.99516	3.16	0.70	14.70	1	10	2
3	TRUE	FALSE	FALSE	24.6	0.030	-1.20	1.3	0.035	241	297	0.99232	2.22	0.50	9.80	0	9	2
3	TRUE	FALSE	FALSE	13.0	0.210	0.32	-3.2	-0.263	111	141	0.95918	3.20	0.50	4.20	0	8	2
2	TRUE	FALSE	FALSE	17.9	-0.420	-0.91	7.1	0.045	-177	169	0.95307	3.17	-1.12	13.20	-1	9	2
1	TRUE	FALSE	FALSE	10.0	0.200	1.27	30.9	0.050	19	152	0.99400	3.21	0.42	13.80	-1	11	2
2	TRUE	FALSE	FALSE	7.4	0.290	0.50	8.5	-0.480	178	647	0.97275	3.45	0.50	10.20	-1	8	1
1	TRUE	FALSE	FALSE	11.7	1.180	-0.94	-62.0	0.675	7	-393	0.99974	3.96	0.69	5.20	1	13	2
3	TRUE	FALSE	FALSE	9.7	0.410	-1.00	3.6	-0.235	24	113	0.99772	3.44	0.53	9.80	0	7	2
5	FALSE	FALSE	TRUE	-5.2	-0.980	-0.08	6.4	0.046	180	166	0.99400	3.30	2.18	9.90	1	5	3

Reference:

https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/

https://stats.idre.ucla.edu/r/dae/zip/