Data 621 Homework5

Introduction

In this assignment, we are tasked with predicting the number of cases of wine sold to wine distribution companies following a sampling. The target variable, cases of wine sold, is count data and therefore will be modeled using appropriate techniques such as Poisson and Negative Binomial regressions.

data_train <- read.csv("https://raw.githubusercontent.com/salma71/Data_621/master/HW_5/wine-training-data.csv", header = TRUE)
data_test <- read.csv("https://raw.githubusercontent.com/salma71/Data_621/master/HW_5/wine-evaluation-data.csv", header = TRUE)

Data Exploration

Data Exploration

All the variable in this dataset are numeric and continuous except for AcidIndex, STARS and LabelAppeal which are discrete. The target variable TARGET is also discrete. There are a number of missing observations for certain chemical compositiion variables as well as a large number of wines with no STARS rating. The distribution of the continous variables appear well centered.

skim(data_train)
Data summary
Name data_train
Number of rows 12795
Number of columns 16
_______________________
Column type frequency:
numeric 16
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
INDEX 0 1.00 8069.98 4656.91 1.00 4037.50 8110.00 12106.50 16129.00 ▇▇▇▇▇
TARGET 0 1.00 3.03 1.93 0.00 2.00 3.00 4.00 8.00 ▆▇▇▆▁
FixedAcidity 0 1.00 7.08 6.32 -18.10 5.20 6.90 9.50 34.40 ▁▂▇▂▁
VolatileAcidity 0 1.00 0.32 0.78 -2.79 0.13 0.28 0.64 3.68 ▁▂▇▂▁
CitricAcid 0 1.00 0.31 0.86 -3.24 0.03 0.31 0.58 3.86 ▁▂▇▂▁
ResidualSugar 616 0.95 5.42 33.75 -127.80 -2.00 3.90 15.90 141.15 ▁▂▇▂▁
Chlorides 638 0.95 0.05 0.32 -1.17 -0.03 0.05 0.15 1.35 ▁▂▇▂▁
FreeSulfurDioxide 647 0.95 30.85 148.71 -555.00 0.00 30.00 70.00 623.00 ▁▂▇▂▁
TotalSulfurDioxide 682 0.95 120.71 231.91 -823.00 27.00 123.00 208.00 1057.00 ▁▂▇▂▁
Density 0 1.00 0.99 0.03 0.89 0.99 0.99 1.00 1.10 ▁▂▇▂▁
pH 395 0.97 3.21 0.68 0.48 2.96 3.20 3.47 6.13 ▁▂▇▂▁
Sulphates 1210 0.91 0.53 0.93 -3.13 0.28 0.50 0.86 4.24 ▁▂▇▂▁
Alcohol 653 0.95 10.49 3.73 -4.70 9.00 10.40 12.40 26.50 ▁▂▇▂▁
LabelAppeal 0 1.00 -0.01 0.89 -2.00 -1.00 0.00 1.00 2.00 ▁▅▇▅▁
AcidIndex 0 1.00 7.77 1.32 4.00 7.00 8.00 8.00 17.00 ▁▇▁▁▁
STARS 3359 0.74 2.04 0.90 1.00 1.00 2.00 3.00 4.00 ▇▇▁▅▂
skim(data_test)
Data summary
Name data_test
Number of rows 3335
Number of columns 16
_______________________
Column type frequency:
logical 1
numeric 15
________________________
Group variables None

Variable type: logical

skim_variable n_missing complete_rate mean count
TARGET 3335 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
IN 0 1.00 8048.31 4655.48 3.00 4018.50 7906.00 12061.00 16130.00 ▇▇▇▇▇
FixedAcidity 0 1.00 6.86 6.32 -18.20 5.20 6.90 9.00 33.50 ▁▂▇▂▁
VolatileAcidity 0 1.00 0.31 0.81 -2.83 0.08 0.28 0.63 3.61 ▁▂▇▂▁
CitricAcid 0 1.00 0.31 0.87 -3.12 0.00 0.31 0.60 3.76 ▁▂▇▂▁
ResidualSugar 168 0.95 5.32 34.37 -128.30 -2.60 3.60 17.20 145.40 ▁▂▇▂▁
Chlorides 138 0.96 0.06 0.31 -1.15 0.02 0.05 0.17 1.26 ▁▂▇▂▁
FreeSulfurDioxide 152 0.95 34.95 149.63 -563.00 3.00 30.00 79.25 617.00 ▁▂▇▂▁
TotalSulfurDioxide 157 0.95 123.41 225.80 -769.00 27.25 124.00 210.00 1004.00 ▁▂▇▂▁
Density 0 1.00 0.99 0.03 0.89 0.99 0.99 1.00 1.10 ▁▂▇▂▁
pH 104 0.97 3.24 0.68 0.60 2.98 3.21 3.49 6.21 ▁▂▇▂▁
Sulphates 310 0.91 0.53 0.91 -3.07 0.33 0.50 0.82 4.18 ▁▂▇▂▁
Alcohol 185 0.94 10.58 3.76 -4.20 9.00 10.40 12.50 25.60 ▁▂▇▂▁
LabelAppeal 0 1.00 0.01 0.89 -2.00 -1.00 0.00 1.00 2.00 ▁▅▇▅▁
AcidIndex 0 1.00 7.75 1.32 5.00 7.00 8.00 8.00 17.00 ▇▇▁▁▁
STARS 841 0.75 2.04 0.91 1.00 1.00 2.00 3.00 4.00 ▇▇▁▅▂
# remove index column as it is not needed
data_train <- data_train %>% 
  dplyr::select(-"INDEX")
data_test <- data_test %>% 
  dplyr::select(-"IN")

Visualization

The visualizations below display the distributions of the dataset. All continuous variables are centered and close to normally distributed. We note that some variables are centered around zero and take negative values which is unexpected and will be investigated further and transformed for our analysis. The distribution of the TARGET variable looks like it could be well described by the poisson distribution which has equal mean and variance but the high number of zero values justifies the use of a zero-inflated model as well. The mean and variance of TARGET are 3.02 and 3.71 respectively, which is close enough to satisfy the equal mean-variance assumption of the poisson distribution.

# histogram
data_train %>% 
  dplyr::select(-c("AcidIndex", "STARS", "TARGET", "LabelAppeal")) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~key, scale = "free",  ncol = 3) +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill="pink") +
  theme_minimal()

data_train %>% 
  ggplot(aes(x=TARGET)) + 
  geom_histogram(fill='pink')

In the box plot below, we plot several variables in to two panels. TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar variables have large ranges compared to other variables. Therefore, we separated those variables in to a different panel to view their distribution. From both panel, we can tell a high number of variables have numerous outliers.

library(ggpubr)
# boxplot
p1 <- data_train %>% 
  dplyr::select(-c("TotalSulfurDioxide", "FreeSulfurDioxide", "ResidualSugar")) %>% 
  gather(na.rm = TRUE) %>% 
  ggplot(aes(factor(key), value)) +
  geom_boxplot(outlier.colour = "#e281cf", outlier.shape = 1,  color = "#5aa1ed") +
  coord_flip() +
  labs(title = "Boxplot of Chemical Properties of Wine", x = "Chemical Properties", y = "Values") +
  theme_minimal()
p2 <- data_train %>% 
  dplyr::select(c("TotalSulfurDioxide", "FreeSulfurDioxide", "ResidualSugar")) %>% 
  gather(na.rm = TRUE) %>% 
  ggplot(aes(factor(key), value)) +
  geom_boxplot(outlier.colour = "#e281cf", outlier.shape = 1, color = "#5aa1ed") +
  #labs(title = "Boxplot of Chemical Properties of Wine", x = "Chemical Properties", y = "Values") +
  theme_minimal()
ggarrange(p1, p2)

In the bar char below, AcidIndex tells us large quantity of wine were sold with the index number 7 and 8. LabelAppeal tells us generic labeled wine sells the most; however, better label does yield higher number of wine samples per order. Lastly, STARS tells us excellent quality does not result in high wine orders. It could be due to high star wine bottle’s high price tag.

# barchart
p3 <- data_train %>% 
  dplyr::select(TARGET, STARS) %>% 
  mutate(STARS = as.factor(STARS),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(STARS)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()
p4 <- data_train %>%
  dplyr::select(TARGET, LabelAppeal) %>% 
  mutate(STARS = as.factor(LabelAppeal),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(LabelAppeal)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()
p5 <- data_train %>% 
  dplyr::select(TARGET, AcidIndex) %>% 
  mutate(STARS = as.factor(AcidIndex),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(AcidIndex)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()
ggarrange(p5, ggarrange(p3, p4, ncol = 2, nrow = 1, legend = "none"), nrow = 2, common.legend = TRUE)

Correlations with Response Variable

# top correlation
wine_train_corr <- data_train %>% 
  drop_na() %>% 
  cor()
kable(sort(wine_train_corr[,1], decreasing = T), col.names = c("Correlation")) %>% 
  kable_styling(full_width = F)
Correlation
TARGET 1.0000000
STARS 0.5546857
LabelAppeal 0.4979465
Alcohol 0.0737771
FreeSulfurDioxide 0.0226398
TotalSulfurDioxide 0.0216021
ResidualSugar 0.0035196
CitricAcid 0.0023450
pH 0.0002199
FixedAcidity -0.0125381
Sulphates -0.0212204
Chlorides -0.0304301
Density -0.0475989
VolatileAcidity -0.0759979
AcidIndex -0.1676431

In the correlation table and plot below, we see, STARS and LabelAppeal are most positively correlated variables with the response variable. We expected this because our variable description mentions these variable’s theoretical affect are higher than other variables. Also, we some mild negative correlation between the response variable and AcidIndex variable.

library(corrplot)
library(RColorBrewer)
# correlation plot
corrplot(wine_train_corr, 
         method = "number", 
         type = "lower",
         col = brewer.pal(n = 15, name = "Reds"),
         number.cex = .7, tl.cex = .7,
         tl.col = "black", tl.srt = 45)

Data Preparation

Using the aggr function from VIM package, We see several variables have missing values. According to UCI Machine Learning, who published this dataset, all wine contain some natural sulfites. Therefore, we will impute the missing values for sulfite chemical properties. Also, it’s rare to find wines with less than 1 gram/liter of sugar. We will impute this as well. Matter of fact, since all missing values are missing at random, we will impute all the missing values using the mice package and random forest method. Mice uses multivariate imputations to estimate the missing values. Using multiple imputations helps in resolving the uncertainty for the missingness. Our target variable will be removed as a predictor variable but still will be imputed. Our response variables will be removed as predictor variables but still will be imputed.

library(VIM)
# missing value columns
aggr(data_train, 
     sortVars=TRUE, 
     labels=names(data_train), 
     cex.axis=.5, 
     bars = FALSE, 
     col = c("white", "#E46726"),
     combined = TRUE,
     #border = NA,
     ylab = "Missing Values") 


 Variables sorted by number of missings: 
           Variable Count
              STARS  3359
          Sulphates  1210
 TotalSulfurDioxide   682
            Alcohol   653
  FreeSulfurDioxide   647
          Chlorides   638
      ResidualSugar   616
                 pH   395
             TARGET     0
       FixedAcidity     0
    VolatileAcidity     0
         CitricAcid     0
            Density     0
        LabelAppeal     0
          AcidIndex     0
library(mice)
# imputating train data
init <- mice(data_train)

 iter imp variable
  1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
meth <- init$method
predM <- init$predictorMatrix
predM[, c("TARGET")] <- 0 #this code will remove the variable as a predictor but still will be imputed
data_train_impute <- mice(data_train, method = 'rf', predictorMatrix=predM)

 iter imp variable
  1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
data_train_imputed <- complete(data_train_impute)
print(paste0("Missing value after imputation: ", sum(is.na(data_train_imputed))))
[1] "Missing value after imputation: 0"

Model Building

Model 1: Poisson (Raw data)

# poisson model with the missing values
model1 <- glm(TARGET ~ ., family = poisson, data_train)
summary(model1)

Call:
glm(formula = TARGET ~ ., family = poisson, data = data_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.2158  -0.2734   0.0616   0.3732   1.6830  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.593e+00  2.506e-01   6.359 2.03e-10 ***
FixedAcidity        3.293e-04  1.053e-03   0.313  0.75447    
VolatileAcidity    -2.560e-02  8.353e-03  -3.065  0.00218 ** 
CitricAcid         -7.259e-04  7.575e-03  -0.096  0.92365    
ResidualSugar      -6.141e-05  1.941e-04  -0.316  0.75165    
Chlorides          -3.007e-02  2.056e-02  -1.463  0.14346    
FreeSulfurDioxide   6.734e-05  4.404e-05   1.529  0.12620    
TotalSulfurDioxide  2.081e-05  2.855e-05   0.729  0.46618    
Density            -3.725e-01  2.462e-01  -1.513  0.13026    
pH                 -4.661e-03  9.598e-03  -0.486  0.62722    
Sulphates          -5.164e-03  7.051e-03  -0.732  0.46398    
Alcohol             3.948e-03  1.771e-03   2.229  0.02579 *  
LabelAppeal         1.771e-01  7.954e-03  22.271  < 2e-16 ***
AcidIndex          -4.870e-02  5.903e-03  -8.251  < 2e-16 ***
STARS               1.871e-01  7.487e-03  24.993  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5844.1  on 6435  degrees of freedom
Residual deviance: 4009.1  on 6421  degrees of freedom
  (6359 observations deleted due to missingness)
AIC: 23172

Number of Fisher Scoring iterations: 5
print('Goodness of Fit Test:')
[1] "Goodness of Fit Test:"
with(model1, cbind(res.deviance = deviance, df = df.residual,  p = pchisq(deviance, df.residual, lower.tail=FALSE)))
     res.deviance   df p
[1,]     4009.142 6421 1

The deviance residuals is quite symmetrical. This means that the predicted points are close to actual observed points. As expected and seen in our correlation table, STARS, LabelAppeal and AcidIndex are significant variables. And the variation in standard error is low. The goodness of fit test has a high p value which indicates that the model fits the data well.

Model 2: Poisson (Imputed Data)

# poisson model with the imputed values
model2 <- glm(TARGET ~ ., family = poisson, data_train_imputed)
summary(model2)

Call:
glm(formula = TARGET ~ ., family = poisson, data = data_train_imputed)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.2375  -0.5140   0.2139   0.6265   2.5573  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         2.020e+00  1.955e-01  10.331  < 2e-16 ***
FixedAcidity       -3.926e-04  8.194e-04  -0.479 0.631854    
VolatileAcidity    -4.870e-02  6.486e-03  -7.507 6.03e-14 ***
CitricAcid          1.263e-02  5.891e-03   2.144 0.032025 *  
ResidualSugar       1.644e-04  1.515e-04   1.085 0.277860    
Chlorides          -5.719e-02  1.615e-02  -3.542 0.000397 ***
FreeSulfurDioxide   1.492e-04  3.448e-05   4.326 1.52e-05 ***
TotalSulfurDioxide  1.017e-04  2.221e-05   4.578 4.68e-06 ***
Density            -3.779e-01  1.919e-01  -1.969 0.048955 *  
pH                 -2.450e-02  7.565e-03  -3.239 0.001200 ** 
Sulphates          -1.833e-02  5.542e-03  -3.308 0.000941 ***
Alcohol             5.327e-03  1.385e-03   3.847 0.000120 ***
LabelAppeal         1.951e-01  6.039e-03  32.309  < 2e-16 ***
AcidIndex          -1.216e-01  4.455e-03 -27.307  < 2e-16 ***
STARS               1.946e-01  5.799e-03  33.566  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 22861  on 12794  degrees of freedom
Residual deviance: 18447  on 12780  degrees of freedom
AIC: 50419

Number of Fisher Scoring iterations: 5
print('Goodness of Fit Test:')
[1] "Goodness of Fit Test:"
with(model2, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE)))
     res.deviance    df             p
[1,]     18446.92 12780 1.038084e-214

Again the deviance residuals are quite the same. However, with imputation, there are more significant variables introduced in the model. Additionally, the AIC score improved dramatically from 23172 to 50384. Also, the deviance residuals decreased significantly to 18,412 from 40,000. However, the goodness of fit test has a very small p value, suggesting that this model does not fit the data well. Since the residual deviance is greater than the degrees of freedom, then some over-dispersion exists.

Given what we observed when looking at the distribution of TARGET, we should expext the inflated count of zeros to affect the model and bias results. For this reason, we move to a zero-inflated model to reflect this.

Model 3: Quasipoisson Model

We try a quasipoisson model to account for nay overdispersion and to see if the results change significantly. As seen in the summary below, the models are nearly identical.

# poisson model with the imputed values
model3 <- glm(TARGET ~ ., family = quasipoisson(link='log'), data_train_imputed)
summary(model3)

Call:
glm(formula = TARGET ~ ., family = quasipoisson(link = "log"), 
    data = data_train_imputed)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.2375  -0.5140   0.2139   0.6265   2.5573  

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.020e+00  1.914e-01  10.556  < 2e-16 ***
FixedAcidity       -3.926e-04  8.019e-04  -0.490 0.624443    
VolatileAcidity    -4.870e-02  6.348e-03  -7.671 1.83e-14 ***
CitricAcid          1.263e-02  5.765e-03   2.191 0.028478 *  
ResidualSugar       1.644e-04  1.483e-04   1.109 0.267529    
Chlorides          -5.719e-02  1.580e-02  -3.619 0.000296 ***
FreeSulfurDioxide   1.492e-04  3.374e-05   4.421 9.91e-06 ***
TotalSulfurDioxide  1.017e-04  2.173e-05   4.678 2.92e-06 ***
Density            -3.779e-01  1.878e-01  -2.012 0.044246 *  
pH                 -2.450e-02  7.404e-03  -3.310 0.000937 ***
Sulphates          -1.833e-02  5.424e-03  -3.380 0.000727 ***
Alcohol             5.327e-03  1.355e-03   3.931 8.51e-05 ***
LabelAppeal         1.951e-01  5.910e-03  33.014  < 2e-16 ***
AcidIndex          -1.216e-01  4.360e-03 -27.903  < 2e-16 ***
STARS               1.946e-01  5.675e-03  34.298  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 0.9577432)

    Null deviance: 22861  on 12794  degrees of freedom
Residual deviance: 18447  on 12780  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
print('Goodness of Fit Test:')
[1] "Goodness of Fit Test:"
with(model3, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE)))
     res.deviance    df             p
[1,]     18446.92 12780 1.038084e-214

Model 4: Zero Inflated

We saw earlier that the dependent variable had an excess number of zeros which skewed the distribution from a typical poisson. The zero inflated model generates coefficients for the zero count part of the model as well as for the count part.

model4 <- zeroinfl(TARGET ~., data_train_imputed, dist = 'poisson')
summary(model4)

Call:
zeroinfl(formula = TARGET ~ ., data = data_train_imputed, dist = "poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.5523 -0.3699  0.1592  0.4997  4.1466 

Count model coefficients (poisson with log link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.416e+00  2.068e-01   6.847 7.56e-12 ***
FixedAcidity        3.457e-04  8.573e-04   0.403 0.686765    
VolatileAcidity    -1.292e-02  6.876e-03  -1.878 0.060345 .  
CitricAcid         -7.268e-04  6.141e-03  -0.118 0.905787    
ResidualSugar      -1.076e-04  1.589e-04  -0.677 0.498245    
Chlorides          -1.685e-02  1.700e-02  -0.991 0.321551    
FreeSulfurDioxide   3.159e-05  3.543e-05   0.892 0.372519    
TotalSulfurDioxide -2.811e-05  2.257e-05  -1.246 0.212943    
Density            -2.906e-01  2.027e-01  -1.434 0.151694    
pH                  6.640e-03  7.965e-03   0.834 0.404528    
Sulphates           9.380e-04  5.863e-03   0.160 0.872902    
Alcohol             7.578e-03  1.444e-03   5.249 1.53e-07 ***
LabelAppeal         2.509e-01  6.423e-03  39.065  < 2e-16 ***
AcidIndex          -1.695e-02  4.991e-03  -3.396 0.000683 ***
STARS               9.212e-02  6.291e-03  14.644  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -5.6217507  1.0555623  -5.326 1.00e-07 ***
FixedAcidity        0.0034100  0.0043691   0.780   0.4351    
VolatileAcidity     0.2313922  0.0350883   6.595 4.26e-11 ***
CitricAcid         -0.0802892  0.0318198  -2.523   0.0116 *  
ResidualSugar      -0.0014474  0.0008129  -1.780   0.0750 .  
Chlorides           0.2487335  0.0866513   2.871   0.0041 ** 
FreeSulfurDioxide  -0.0007772  0.0001843  -4.216 2.49e-05 ***
TotalSulfurDioxide -0.0008167  0.0001184  -6.899 5.25e-12 ***
Density             0.7365822  1.0363870   0.711   0.4773    
pH                  0.1942998  0.0406918   4.775 1.80e-06 ***
Sulphates           0.1219135  0.0299885   4.065 4.80e-05 ***
Alcohol             0.0114074  0.0074290   1.536   0.1247    
LabelAppeal         0.3612193  0.0338672  10.666  < 2e-16 ***
AcidIndex           0.4786918  0.0199582  23.985  < 2e-16 ***
STARS              -0.6768722  0.0371960 -18.197  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 35 
Log-likelihood: -2.248e+04 on 30 Df
model4b <- zeroinfl(TARGET ~ . - FixedAcidity - Density, data_train_imputed, dist = 'poisson')
summary(model4b)

Call:
zeroinfl(formula = TARGET ~ . - FixedAcidity - Density, data = data_train_imputed, 
    dist = "poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.5516 -0.3710  0.1591  0.4985  3.9624 

Count model coefficients (poisson with log link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.130e+00  5.243e-02  21.543  < 2e-16 ***
VolatileAcidity    -1.297e-02  6.876e-03  -1.887 0.059225 .  
CitricAcid         -6.628e-04  6.140e-03  -0.108 0.914042    
ResidualSugar      -1.083e-04  1.589e-04  -0.682 0.495384    
Chlorides          -1.743e-02  1.700e-02  -1.025 0.305133    
FreeSulfurDioxide   3.086e-05  3.541e-05   0.871 0.383511    
TotalSulfurDioxide -2.894e-05  2.256e-05  -1.283 0.199660    
pH                  6.653e-03  7.966e-03   0.835 0.403656    
Sulphates           1.048e-03  5.862e-03   0.179 0.858106    
Alcohol             7.595e-03  1.444e-03   5.261 1.44e-07 ***
LabelAppeal         2.508e-01  6.422e-03  39.059  < 2e-16 ***
AcidIndex          -1.702e-02  4.944e-03  -3.442 0.000578 ***
STARS               9.233e-02  6.290e-03  14.680  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -4.8891475  0.2417675 -20.223  < 2e-16 ***
VolatileAcidity     0.2321447  0.0350717   6.619 3.61e-11 ***
CitricAcid         -0.0806133  0.0317915  -2.536   0.0112 *  
ResidualSugar      -0.0014497  0.0008128  -1.784   0.0745 .  
Chlorides           0.2492837  0.0866164   2.878   0.0040 ** 
FreeSulfurDioxide  -0.0007745  0.0001843  -4.202 2.65e-05 ***
TotalSulfurDioxide -0.0008167  0.0001184  -6.900 5.20e-12 ***
pH                  0.1942247  0.0406838   4.774 1.81e-06 ***
Sulphates           0.1223315  0.0299698   4.082 4.47e-05 ***
Alcohol             0.0113516  0.0074270   1.528   0.1264    
LabelAppeal         0.3601954  0.0338376  10.645  < 2e-16 ***
AcidIndex           0.4818269  0.0196286  24.547  < 2e-16 ***
STARS              -0.6765979  0.0371710 -18.202  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 31 
Log-likelihood: -2.248e+04 on 26 Df

Model Selection

Based on the models tested, we select the Zero Inflated model due to the highest accuracy of the 4 models

pred_train <- data.frame(TARGET=data_train_imputed$TARGET, 
                         model2=model2$fitted.values, 
                         model3=model3$fitted.values, 
                         model4b=model4b$fitted.values)
pred_train <- round(pred_train, 0)
colnames(pred_train) <- c("TARGET","Poisson (Imputed)" ,"Quasipoisson Model", "Zero Inflated")

pred_train %>%
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~key, scale = "free",  ncol = 4) +
  geom_bar(fill="pink") +
  theme_minimal() + labs(x="Cases Bought", y = "Count", title = "Prediction Histogram")

model2_fitted.values <- factor(round(model2$fitted.values),levels=rev(0:9))
model3_fitted.values <- factor(round(model3$fitted.values),levels=rev(0:9))
model4b_fitted.values <- factor(round(model4b$fitted.values),levels=rev(0:9))


m2_cfm <- confusionMatrix(model2_fitted.values, factor(data_train_imputed$TARGET,levels=rev(0:9)))
m3_cfm <- confusionMatrix(model3_fitted.values, factor(data_train_imputed$TARGET,levels=rev(0:9)))
m4_cfm <- confusionMatrix(model4b_fitted.values, factor(data_train_imputed$TARGET,levels=rev(0:9)))

models_sum <- data.frame(m2_cfm$overall, m3_cfm$overall, m4_cfm$overall)
colnames(models_sum) <- c("Poisson (Imputed)" ,"Quasipoisson Model", "Zero Inflated")
round(models_sum, 2)
init <- mice(data_test)

 iter imp variable
  1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
meth <- init$method
predM <- init$predictorMatrix
predM[, c("TARGET")] <- 0 #this code will remove the variable as a predictor but still will be imputed
data_test_impute <- mice(data_test, method = 'rf', predictorMatrix=predM)

 iter imp variable
  1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
data_test_imputed <- complete(data_test_impute)
print(paste0("Missing value after imputation: ", sum(is.na(data_test_imputed))))
[1] "Missing value after imputation: 3335"
test_predict <- predict(model4b, newdata=data_test_imputed)
test_predict <- round(test_predict,0)
data_pred <- data.frame(TARGET=test_predict)

ggplot(data_pred, aes(x=TARGET)) + geom_bar(fill="orchid3") + theme_minimal() +
  labs(y="Count", title = "Prediction: Zero Inflated Model (Model4b)") + 
    scale_x_discrete(name = "Cases Bought", limits=c("1","2","3","4", "5", "6", "7", "8"))

write.csv(test_predict, "WinePredictions.csv")