Predicting Wine Sales with Count Regression

library(tidyverse)
library(forcats)
library(modelr)
library(skimr)
library(knitr)
library(kableExtra)
library(broom)
library(caTools)
library(pscl)
library(grid)
library(gridExtra)
library(GGally)
library(mice)
library(car)
library(MASS)
library(caret)
library(corrplot)
library(reshape)
library(ggthemes)
library(moments)
library(qqplotr)
library(gridExtra)
library(geoR)
library(alr3)
library(caret)
library(pROC)
library(DataExplorer)
library(visdat)
library(janitor)
library(pander)

evaluate_model <- function(model, test_df, yhat = FALSE){
  temp <- data.frame(yhat=c(0:8), TARGET = c(0:8), n=c(0))
  
  if(yhat){
    test_df$yhat <- yhat
  } else {
    test_df$yhat <- round(predict.glm(model, newdata=test_df, type="response"), 0)
  }
  
  test_df <- test_df %>%
    group_by(yhat, TARGET) %>%
    tally() %>%
    mutate(accuracy = ifelse(yhat > TARGET, "Over", ifelse(yhat < TARGET, "Under", "Accurate"))) %>%
    mutate(cases_sold = ifelse(yhat > TARGET, TARGET, yhat) * n,
           glut = ifelse(yhat > TARGET, yhat - TARGET, 0) * n,
           missed_opportunity = ifelse(yhat < TARGET, TARGET - yhat, 0) * n) %>%
    mutate(net_cases_sold = cases_sold - glut,
           adj_net_cases_sold = cases_sold - glut - missed_opportunity)
  
  results <- test_df %>%
    group_by(accuracy) %>%
    summarise(n = sum(n)) %>%
    spread(accuracy, n)
  
  accurate <- results$Accurate
  over <- results$Over
  under <- results$Under
  
  cases_sold <- sum(test_df$cases_sold)
  net_cases_sold <- sum(test_df$net_cases_sold)
  adj_net_cases_sold <- sum(test_df$adj_net_cases_sold)
  missed_opportunity <- sum(test_df$missed_opportunity)
  glut <- sum(test_df$glut)
  
  confusion_matrix <- test_df %>%
    bind_rows(temp) %>%
    group_by(yhat, TARGET) %>%
    summarise(n = sum(n)) %>%
    spread(TARGET, n, fill = 0)
  
  return(list("confusion_matrix" = confusion_matrix, "results" = results, "df" = test_df, "accurate" = accurate, "over" = over, "under" = under, "cases_sold" = cases_sold, "net_cases_sold" = net_cases_sold, "adj_net_cases_sold" = adj_net_cases_sold, "glut" = glut, "missed_opportunity" = missed_opportunity))
}

Analysis

The objective of this analysis to predict the number of cases of wine that will be sold given certain properties of the wine. The data available for this analysis is set forth below:

VARIABLE NAME	DEFINITION	THEORETICAL EFFECT
`INDEX`	Identification Variable (do not use)	None
`TARGET`	Number of Cases Purchased	None
`AcidIndex`	acidity of wine by using a weighted average	TBD
`Alcohol`	Alcohol Content
`Chlorides`	Chloride content of wine
`CitricAcid`	Citric Acid Content
`Density`	Density of Wine
`FixedAcidity`	Fixed Acidity of Wine
`FreeSulfurDioxide`	Sulfur Dioxide content of wine
`LabelAppeal`	Marketing Score indicator the appeal of label
`ResidualSugar`	Residual Sugar of wine
`STARS`	Wine rating by a team of experts.	A high number of stars suggests high sales
…	4 Stars = Excellent, 1 Star = Poor
`Sulphates`	Sulfate content of wine
`TotalSulfurDioxide`	Total Sulfur Dioxide of Wine
`VolatileAcidity`	Volatile Acid content of wine
`pH`	pH of wine

Data Preprocessing

After the initial data load, a variety of packages and functions were employed to preprocess the data. The object of the preprocessing phase is two fold. First we seek to develop a understanding of the data set and its variables and second we seek to identify potential problems that could undermine our prediction of wine sales.

Data Set Overview

The data set is comprised of 12,795 observations and 16 variables. Review of the str() function output below indicates that all variables are integer or numeric. There are also signs of missing data that will have to be explored further. The CitrixAcid, VolatileAcidity, Chlorides, Sulphtes and TotalSulfurDioxiode are also presenting negative value. This is conterintuitive for variables that are supposed to reflect chemical properties of wine and warrants some additional investigation.

Figure 1b, Summary Function Output, provides a convenient means to review descriptive statistics on the data set variables. Using the combination of Figures 1b and Figure 2 its evident that our dependent variable, TARGET, has a range of 0 to 8 (cases purchased) and a mode of four (4). The Summary table also re-enforces our concerns with missing and negative values. Next, we employed histograms (Figure 2) to provide an additional perspective of the data. One observation from the histograms is that many of the variables have similar histograms: uni modal, near-normal shapes with relatively fat tails. It also appears that the LabelAppeal, AcidIndex and STARS variable may be better suited as factors. The STARS variables seems to have a large percentage of missing data. Finally, there appears to be a fairly high number of zero values in the TARGET variable. Accordingly, our models will have to be be able to predict both positive and zero-value outcomes.

To drill-down on the missing data concern we employed the vis_dat() (Figure 3). This function provides a simple and visually appealing way to assess the level of missing data in a data set. Figure 3 indicates that numerous variable have a fairly a large percentage of missing data. The STARS variable suffers from particularly large percentage of missing value. This will be addressed in the preprocessing action section below.

df <- read.csv("C:/Users/mutue/OneDrive/Documents/Data 621/wine-training-data.csv")
eval <- read.csv("C:/Users/mutue/OneDrive/Documents/Data 621/wine-evaluation-data.csv")

Figure 1a. Str Function Output

str(df)

'data.frame':   12795 obs. of  16 variables:
 $ ï..INDEX          : int  1 2 4 5 6 7 8 11 12 13 ...
 $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
 $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
 $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
 $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
 $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
 $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
 $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
 $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
 $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
 $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
 $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
 $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
 $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
 $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
 $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...

Figure 1b. Summary Function Output

summary(df)

    ï..INDEX         TARGET      FixedAcidity   VolatileAcidity   CitricAcid    ResidualSugar    Chlorides  
 Min.   :    1   Min.   :0.00   Min.   :-18.1   Min.   :-2.79   Min.   :-3.24   Min.   :-128   Min.   :-1   
 1st Qu.: 4038   1st Qu.:2.00   1st Qu.:  5.2   1st Qu.: 0.13   1st Qu.: 0.03   1st Qu.:  -2   1st Qu.: 0   
 Median : 8110   Median :3.00   Median :  6.9   Median : 0.28   Median : 0.31   Median :   4   Median : 0   
 Mean   : 8070   Mean   :3.03   Mean   :  7.1   Mean   : 0.32   Mean   : 0.31   Mean   :   5   Mean   : 0   
 3rd Qu.:12106   3rd Qu.:4.00   3rd Qu.:  9.5   3rd Qu.: 0.64   3rd Qu.: 0.58   3rd Qu.:  16   3rd Qu.: 0   
 Max.   :16129   Max.   :8.00   Max.   : 34.4   Max.   : 3.68   Max.   : 3.86   Max.   : 141   Max.   : 1   
                                                                                NA's   :616    NA's   :638  
 FreeSulfurDioxide TotalSulfurDioxide    Density            pH        Sulphates       Alcohol     LabelAppeal    
 Min.   :-555      Min.   :-823       Min.   :0.888   Min.   :0     Min.   :-3     Min.   :-5    Min.   :-2.000  
 1st Qu.:   0      1st Qu.:  27       1st Qu.:0.988   1st Qu.:3     1st Qu.: 0     1st Qu.: 9    1st Qu.:-1.000  
 Median :  30      Median : 123       Median :0.994   Median :3     Median : 0     Median :10    Median : 0.000  
 Mean   :  31      Mean   : 121       Mean   :0.994   Mean   :3     Mean   : 1     Mean   :10    Mean   :-0.009  
 3rd Qu.:  70      3rd Qu.: 208       3rd Qu.:1.001   3rd Qu.:3     3rd Qu.: 1     3rd Qu.:12    3rd Qu.: 1.000  
 Max.   : 623      Max.   :1057       Max.   :1.099   Max.   :6     Max.   : 4     Max.   :26    Max.   : 2.000  
 NA's   :647       NA's   :682                        NA's   :395   NA's   :1210   NA's   :653                   
   AcidIndex         STARS     
 Min.   : 4.00   Min.   :1     
 1st Qu.: 7.00   1st Qu.:1     
 Median : 8.00   Median :2     
 Mean   : 7.77   Mean   :2     
 3rd Qu.: 8.00   3rd Qu.:3     
 Max.   :17.00   Max.   :4     
                 NA's   :3359

Figure 2. Data Histograms

mylevels <- names(df[,2:15])
summary_plot <- df %>%
  gather() %>% 
  mutate(facet = factor(key, levels=mylevels)) %>% 
  ggplot(aes(value)) +  
  facet_wrap(~ facet, scales = "free") + 
  geom_histogram()  + theme_pander()  + 
  theme(axis.text.y = element_text(size=7),
        strip.text.x = element_text(size= 9),
        axis.text.x = element_text(size=6),
        plot.title = element_text(hjust = 0.5, size=10),
        axis.title.y = element_text(size=8)) + 
  labs(x=NULL, title="Wine Data Histograms")
summary_plot

Figure 3. Vis_dat Function Output

vis_dat(df)

Data Preprocessing Actions

Given our review of the data we will take the following preprocessing actions:

Issues	Action
Missing Data	Use the mice function to impute missing values
Missing STARS	Make NA its own level
Negative Values1	Take absolute value of negative values in the models (abs(x))
New Factors	Make STARS, LabelAppeal and AcidIndex Factors
Change Factors	Use Forcat’s Recode function to collapse AcidIndex into four buckets
Index Variable	Remove from data set via dplyr

Update str and summary tables are set forth below, post pre-processing actions. Note - negative values will still be visible given the decision to transform the negative values in the model specification utilizing the abs() function.

# Factors

df <- df %>% 
  dplyr::select(-"ï..INDEX") %>% 
  mutate(STARS = factor(STARS)) %>% 
  mutate(STARS = fct_explicit_na(STARS,na_level = "0")) %>% 
  mutate(LabelAppeal = factor(LabelAppeal)) %>%
  mutate(AcidIndex = if_else(AcidIndex <= 7,4L,AcidIndex)) %>%
  mutate(AcidIndex = if_else(AcidIndex == 8 | AcidIndex == 9 ,3L,AcidIndex)) %>%
  mutate(AcidIndex = if_else(AcidIndex == 10 | AcidIndex == 15 ,2L,AcidIndex)) %>%
  mutate(AcidIndex = if_else(AcidIndex == 16 | AcidIndex == 17 | AcidIndex == 11 | AcidIndex == 12 | AcidIndex == 13 |  AcidIndex == 14 ,1L,AcidIndex)) %>%
  mutate(AcidIndex = factor(AcidIndex))  
  
  #Remove negativ values

# Impute missing value. 
  
tmp_data <- mice(df,maxit=3, method='pmm',seed=20, print=F)
df <- complete(tmp_data,1)

df$FixedAcidity <- abs(df$FixedAcidity)
df$VolatileAcidity <- abs(df$VolatileAcidity) 
df$CitricAcid <- abs(df$CitricAcid)
df$ResidualSugar <- abs(df$ResidualSugar)
df$Chlorides <- abs(df$Chlorides) 
df$FreeSulfurDioxide <- abs(df$FreeSulfurDioxide)
df$TotalSulfurDioxide <- abs(df$TotalSulfurDioxide)
df$Sulphates <- abs(df$Sulphates)
df$Alcohol <- abs(df$Alcohol)

Figure 1c. Str Function Output - Post Pre-Processing

str(df)

'data.frame':   12795 obs. of  15 variables:
 $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
 $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
 $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 1.22 0.27 0.22 ...
 $ CitricAcid        : num  0.98 0.81 0.88 0.04 1.26 0.59 0.4 0.34 1.05 0.39 ...
 $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
 $ Chlorides         : num  0.567 0.425 0.037 0.425 0.047 0.556 0.06 0.04 0.007 0.277 ...
 $ FreeSulfurDioxide : num  50 15 214 22 167 37 287 523 213 62 ...
 $ TotalSulfurDioxide: num  268 327 142 115 108 15 156 551 307 180 ...
 $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
 $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
 $ Sulphates         : num  0.59 0.7 0.48 1.83 1.77 1.29 1.21 0.42 0.26 0.75 ...
 $ Alcohol           : num  9.9 3.1 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
 $ LabelAppeal       : Factor w/ 5 levels "-2","-1","0",..: 3 2 2 2 3 3 3 4 3 3 ...
 $ AcidIndex         : Factor w/ 4 levels "1","2","3","4": 3 4 3 4 3 1 3 4 4 3 ...
 $ STARS             : Factor w/ 5 levels "1","2","3","4",..: 2 3 3 1 2 5 5 3 5 4 ...

Figure 1d. Summary Function Output - Post Pre-Processing

summary(df)

     TARGET      FixedAcidity  VolatileAcidity   CitricAcid   ResidualSugar     Chlorides     FreeSulfurDioxide
 Min.   :0.00   Min.   : 0.0   Min.   :0.00    Min.   :0.00   Min.   :  0.0   Min.   :0.000   Min.   :  0      
 1st Qu.:2.00   1st Qu.: 5.6   1st Qu.:0.25    1st Qu.:0.28   1st Qu.:  3.6   1st Qu.:0.045   1st Qu.: 28      
 Median :3.00   Median : 7.0   Median :0.41    Median :0.44   Median : 12.9   Median :0.098   Median : 56      
 Mean   :3.03   Mean   : 8.1   Mean   :0.64    Mean   :0.69   Mean   : 23.4   Mean   :0.222   Mean   :107      
 3rd Qu.:4.00   3rd Qu.: 9.8   3rd Qu.:0.91    3rd Qu.:0.97   3rd Qu.: 38.6   3rd Qu.:0.368   3rd Qu.:171      
 Max.   :8.00   Max.   :34.4   Max.   :3.68    Max.   :3.86   Max.   :141.2   Max.   :1.351   Max.   :623      
 TotalSulfurDioxide    Density            pH         Sulphates       Alcohol     LabelAppeal AcidIndex STARS   
 Min.   :   0       Min.   :0.888   Min.   :0.48   Min.   :0.00   Min.   : 0.0   -2: 504     1: 514    1:3042  
 1st Qu.: 100       1st Qu.:0.988   1st Qu.:2.96   1st Qu.:0.44   1st Qu.: 9.0   -1:3136     2: 559    2:3570  
 Median : 154       Median :0.994   Median :3.20   Median :0.59   Median :10.4   0 :5617     3:5569    3:2212  
 Mean   : 205       Mean   :0.994   Mean   :3.21   Mean   :0.85   Mean   :10.5   1 :3048     4:6153    4: 612  
 3rd Qu.: 263       3rd Qu.:1.001   3rd Qu.:3.47   3rd Qu.:1.10   3rd Qu.:12.4   2 : 490               0:3359  
 Max.   :1057       Max.   :1.099   Max.   :6.13   Max.   :4.24   Max.   :26.5

Data Exploration

With out preprocessing complete we will not utilize some visual EDA to gain additional insight into the data and to inform our opinion on models specification and variable selection. EDA will utilize a variety of charts and plots including bar, box plots, correlation plots and more.

Bar Plots - Factors

The box plots below set forth show fairly strong relationships between the TARGET value and out factors. Theses are encouraging plots that indicate these variables will likely play a role in our prediction models. Note that the , per our pre-processing actions, has be recoded into four buckets. Also missing values in the STAR variable have been assigned to the 0 bucket.

bp1 <- ggplot(df, aes(LabelAppeal,TARGET)) + geom_boxplot() + theme_fivethirtyeight() + 
  theme(axis.title = element_text(size=10), 
        plot.title = element_text(hjust= 0.5, size = 12)) +
  labs(title = 'TARGET vs LabelAppeal')
  
bp2 <- ggplot(df, aes(STARS,TARGET)) + geom_boxplot() + theme_fivethirtyeight() +
  theme(axis.title = element_text(size=10), 
        plot.title = element_text(hjust= 0.5, size = 12)) +
  labs(title = 'TARGET vs STARS')
  
bp3 <- ggplot(df, aes(AcidIndex,TARGET)) + geom_boxplot() + theme_fivethirtyeight() + 
  theme(axis.title = element_text(size=10), 
        plot.title = element_text(hjust= 0.5, size = 12)) +
  labs(title = 'TARGET vs AcidIndex')  
  
grid.arrange(bp1, bp2, bp3, ncol = 3)

Numeric Variable Loess vs Target

The Loess plots are bit difficult to interpret. For some variables there does appear to be a near-linear relationship between the variable and Target. Alcohol and VolatileAcidity, for example are two variables that show promise from a modeling perspective. That said, the correlation plot and the loess plots lead one to believe that the factor variables with be the stars in our modeling phase.

var_plot <- df %>% 
  dplyr::select(-STARS, -AcidIndex, -LabelAppeal) %>% 
  gather(key="var", value="val", -TARGET) %>% 
  ggplot(aes(val, TARGET))+
  facet_wrap(~var, scales = "free") +
  geom_smooth(method="loess") +
  theme_fivethirtyeight()
  
var_plot

Correlation Plot

The correlation plot is striking for its apparent lack of correlation. This would seem to suggest that the chemical makeup of wine has no bearing on sales. This would also lead one to believe that these variables may play a diminished role in our models. There does not seem to be any particularly strong correlation between variables.

set.seed(4763)
train_index <- createDataPartition(df$TARGET, p = .7, list = FALSE, times = 1)
train <- df[train_index,]
test <- df[-train_index,]

Build Models

Before starting our modeling our data was partitioned into a train set (70%) and a test set (30%). The training data will be utilized to create the models and then newly created models will be applied to the test data for predictions and matrix analysis.

Our approach to modeling was to make strong use of the factor variable and limited use of the continuous variables given the uncertainty around the negative values. When continuous values were employed the absolute value of the variable is utilized in the model. We also employed three varieties of models in our analysis: Linear, Poisson, Negative Binomial Zero-Inflated.

A manually iterative process was employed to narrow the models down to the five contenders. Model summaries and confusion matrix data is presented for each model. The model evaluation section then picks a winner based upon a variety of factors, including: prediction ability (can the model predict all relevant value ranges), accuracy, AIC, BIC and LogLik.

Mod1 (Poisson) - glm(TARGET ~ STARS + AcidIndex + LabelAppeal + abs(Alcohol)

mod1 <- glm(TARGET ~ STARS + AcidIndex + LabelAppeal + Alcohol, family = poisson, train)
summary(mod1)


Call:
glm(formula = TARGET ~ STARS + AcidIndex + LabelAppeal + Alcohol, 
    family = poisson, data = train)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-3.218  -0.652   0.001   0.460   3.757  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.02884    0.06718   -0.43    0.668    
STARS2         0.32492    0.01718   18.92  < 2e-16 ***
STARS3         0.44835    0.01862   24.08  < 2e-16 ***
STARS4         0.57340    0.02623   21.86  < 2e-16 ***
STARS0        -0.75863    0.02335  -32.49  < 2e-16 ***
AcidIndex2     0.31552    0.05954    5.30  1.2e-07 ***
AcidIndex3     0.54272    0.04764   11.39  < 2e-16 ***
AcidIndex4     0.60928    0.04752   12.82  < 2e-16 ***
LabelAppeal-1  0.24764    0.04581    5.41  6.5e-08 ***
LabelAppeal0   0.43080    0.04469    9.64  < 2e-16 ***
LabelAppeal1   0.57071    0.04542   12.56  < 2e-16 ***
LabelAppeal2   0.69160    0.05112   13.53  < 2e-16 ***
Alcohol        0.00430    0.00169    2.55    0.011 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 16053.8  on 8957  degrees of freedom
Residual deviance:  9614.9  on 8945  degrees of freedom
AIC: 31980

Number of Fisher Scoring iterations: 6

Mod1 Confusion Matrix

pred <- predict(mod1, newdata=test, type='response')
predRound <- as.factor(round(pred,0)-1)
testData <- as.factor(test$TARGET)
levels(predRound) <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "0")
levels(testData) <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
cm <- confusionMatrix(predRound, testData)
cm$overall[1]

Accuracy 
   0.217

cm$table

          Reference
Prediction   0   1   2   3   4   5   6   7   8   9  10
        0    0   0   0   0   0   0   0   0   0   0   0
        1    3   0   0   0   0   0   0   0   0   0   0
        2  546  39 100 135  76  20   1   1   0   0   0
        3  144  32 142 143  45  22   6   1   1   0   0
        4   99   9  74 292 264  87   6   2   0   0   0
        5   13   0  15 205 406 197  41   0   0   0   0
        6    4   0   0   8 156 233 110  18   1   0   0
        7    0   0   0   0   6  46  59  16   5   0   0
        8    0   0   0   0   0   1   4   2   1   0   0
        9    0   0   0   0   0   0   0   0   0   0   0
        10   0   0   0   0   0   0   0   0   0   0   0

Mod2 (Linear) - lm(TARGET ~ STARS + AcidIndex + LabelAppeal + abs(Alcohol)

mod2 <- lm(TARGET ~ STARS + AcidIndex + LabelAppeal + Alcohol, data = train)
summary(mod2)


Call:
lm(formula = TARGET ~ STARS + AcidIndex + LabelAppeal + Alcohol, 
    data = train)

Residuals:
   Min     1Q Median     3Q    Max 
-4.936 -0.849  0.060  0.822  6.201 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.70717    0.11117    6.36  2.1e-10 ***
STARS2         1.05570    0.03924   26.90  < 2e-16 ***
STARS3         1.63031    0.04514   36.12  < 2e-16 ***
STARS4         2.33799    0.07308   31.99  < 2e-16 ***
STARS0        -1.36059    0.03964  -34.33  < 2e-16 ***
AcidIndex2     0.45676    0.09666    4.73  2.3e-06 ***
AcidIndex3     0.96510    0.07308   13.21  < 2e-16 ***
AcidIndex4     1.17590    0.07303   16.10  < 2e-16 ***
LabelAppeal-1  0.38411    0.07623    5.04  4.8e-07 ***
LabelAppeal0   0.82929    0.07431   11.16  < 2e-16 ***
LabelAppeal1   1.31885    0.07754   17.01  < 2e-16 ***
LabelAppeal2   1.84886    0.10192   18.14  < 2e-16 ***
Alcohol        0.01346    0.00384    3.50  0.00046 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.31 on 8945 degrees of freedom
Multiple R-squared:  0.536, Adjusted R-squared:  0.535 
F-statistic:  861 on 12 and 8945 DF,  p-value: <2e-16

mod2_results <- evaluate_model(mod2, test)

Mod2 Confusion Matrix

pred <- predict(mod2, newdata=test)
predRound <- as.factor(round(pred,0))
levels(predRound) <- levels(as.factor(test$TARGET))
confusionMatrix(predRound, as.factor(test$TARGET))

Confusion Matrix and Statistics

          Reference
Prediction   0   1   2   3   4   5   6   7   8
         0   2   0   0   0   0   0   0   0   0
         1  67   7   9   6   2   0   0   0   0
         2 386  32  88 125  52   9   1   0   0
         3 224  31 134 137  55  26   5   2   1
         4 109  10  78 279 254  78   5   0   0
         5  20   0  22 229 465 262  63   2   0
         6   1   0   0   7 120 188 103  20   2
         7   0   0   0   0   5  43  50  16   5
         8   0   0   0   0   0   0   0   0   0

Overall Statistics
                                       
               Accuracy : 0.226        
                 95% CI : (0.213, 0.24)
    No Information Rate : 0.248        
    P-Value [Acc > NIR] : 0.999        
                                       
                  Kappa : 0.087        
                                       
 Mcnemar's Test P-Value : NA           

Statistics by Class:

                     Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity          0.002472  0.08750   0.2659   0.1750   0.2665   0.4323   0.4537  0.40000  0.00000
Specificity          1.000000  0.97764   0.8274   0.8435   0.8062   0.7521   0.9064  0.97287  1.00000
Pos Pred Value       1.000000  0.07692   0.1270   0.2228   0.3124   0.2465   0.2336  0.13445      NaN
Neg Pred Value       0.789570  0.98051   0.9227   0.7995   0.7688   0.8760   0.9635  0.99354  0.99792
Prevalence           0.210842  0.02085   0.0863   0.2041   0.2484   0.1579   0.0592  0.01042  0.00208
Detection Rate       0.000521  0.00182   0.0229   0.0357   0.0662   0.0683   0.0268  0.00417  0.00000
Detection Prevalence 0.000521  0.02372   0.1806   0.1603   0.2119   0.2770   0.1149  0.03101  0.00000
Balanced Accuracy    0.501236  0.53257   0.5466   0.5092   0.5363   0.5922   0.6801  0.68644  0.50000

Mod3 (Poisson) - glm(TARGET ~ STARS + AcidIndex + LabelAppeal + abs(VolatileAcidity)

mod3 <- glm(TARGET ~ STARS + AcidIndex + LabelAppeal +  VolatileAcidity, family = poisson, train)
summary(mod3)


Call:
glm(formula = TARGET ~ STARS + AcidIndex + LabelAppeal + VolatileAcidity, 
    family = poisson, data = train)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-3.239  -0.644  -0.010   0.463   3.751  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)       0.0431     0.0653    0.66   0.5094    
STARS2            0.3234     0.0172   18.82  < 2e-16 ***
STARS3            0.4484     0.0186   24.09  < 2e-16 ***
STARS4            0.5761     0.0262   22.01  < 2e-16 ***
STARS0           -0.7589     0.0233  -32.50  < 2e-16 ***
AcidIndex2        0.3126     0.0595    5.25  1.5e-07 ***
AcidIndex3        0.5399     0.0476   11.33  < 2e-16 ***
AcidIndex4        0.6072     0.0475   12.77  < 2e-16 ***
LabelAppeal-1     0.2453     0.0458    5.35  8.6e-08 ***
LabelAppeal0      0.4279     0.0447    9.58  < 2e-16 ***
LabelAppeal1      0.5672     0.0454   12.49  < 2e-16 ***
LabelAppeal2      0.6901     0.0511   13.50  < 2e-16 ***
VolatileAcidity  -0.0332     0.0113   -2.94   0.0033 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 16053.8  on 8957  degrees of freedom
Residual deviance:  9612.6  on 8945  degrees of freedom
AIC: 31977

Number of Fisher Scoring iterations: 6

Mod 3 Confusion Matrix

pred <- predict(mod3, newdata=test, type='response')
predRound <- as.factor(round(pred,0)-1)
testData <- as.factor(test$TARGET)
levels(predRound) <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "0")
levels(testData) <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
cm <- confusionMatrix(predRound, testData)
cm$overall[1]

Accuracy 
    0.22

cm$table

          Reference
Prediction   0   1   2   3   4   5   6   7   8   9  10
        0    0   0   0   0   0   0   0   0   0   0   0
        1    3   0   0   0   0   0   0   0   0   0   0
        2  552  39 100 135  76  22   1   1   0   0   0
        3  136  32 142 141  50  19   6   1   1   0   0
        4  100   9  75 294 265  86   6   2   0   0   0
        5   14   0  14 194 392 206  40   1   0   0   0
        6    4   0   0  19 164 219 110  12   0   0   0
        7    0   0   0   0   6  53  61  22   5   0   0
        8    0   0   0   0   0   1   3   1   2   0   0
        9    0   0   0   0   0   0   0   0   0   0   0
        10   0   0   0   0   0   0   0   0   0   0   0

Mod4 (Linear) - lm(TARGET ~ STARS + AcidIndex + LabelAppeal + abs(VolatileAcidity)

mod4 <- lm(TARGET ~ STARS + AcidIndex + LabelAppeal +  VolatileAcidity, data = train)
summary(mod4)


Call:
lm(formula = TARGET ~ STARS + AcidIndex + LabelAppeal + VolatileAcidity, 
    data = train)

Residuals:
   Min     1Q Median     3Q    Max 
-4.977 -0.854  0.053  0.831  6.189 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.9320     0.1047    8.90  < 2e-16 ***
STARS2            1.0511     0.0392   26.78  < 2e-16 ***
STARS3            1.6306     0.0451   36.14  < 2e-16 ***
STARS4            2.3454     0.0730   32.14  < 2e-16 ***
STARS0           -1.3618     0.0396  -34.37  < 2e-16 ***
AcidIndex2        0.4483     0.0966    4.64  3.5e-06 ***
AcidIndex3        0.9541     0.0731   13.06  < 2e-16 ***
AcidIndex4        1.1664     0.0730   15.97  < 2e-16 ***
LabelAppeal-1     0.3776     0.0762    4.95  7.4e-07 ***
LabelAppeal0      0.8227     0.0743   11.07  < 2e-16 ***
LabelAppeal1      1.3102     0.0775   16.90  < 2e-16 ***
LabelAppeal2      1.8465     0.1019   18.12  < 2e-16 ***
VolatileAcidity  -0.1028     0.0252   -4.09  4.4e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.31 on 8945 degrees of freedom
Multiple R-squared:  0.536, Adjusted R-squared:  0.536 
F-statistic:  862 on 12 and 8945 DF,  p-value: <2e-16

Mod 4 Confusion Matrix

pred <- predict(mod4, newdata=test)
predRound <- as.factor(round(pred,0))
levels(predRound) <- levels(as.factor(test$TARGET))
confusionMatrix(predRound, as.factor(test$TARGET))

Confusion Matrix and Statistics

          Reference
Prediction   0   1   2   3   4   5   6   7   8
         0   1   0   0   0   0   0   0   0   0
         1  69   9  10   5   2   0   0   0   0
         2 376  30  90 113  46   9   0   0   0
         3 234  31 131 151  68  26   6   2   1
         4 109  10  74 271 243  81   5   0   0
         5  19   0  26 225 454 258  62   2   0
         6   1   0   0  18 134 183 101  15   0
         7   0   0   0   0   6  49  53  21   7
         8   0   0   0   0   0   0   0   0   0

Overall Statistics
                                        
               Accuracy : 0.228         
                 95% CI : (0.215, 0.241)
    No Information Rate : 0.248         
    P-Value [Acc > NIR] : 0.999         
                                        
                  Kappa : 0.09          
                                        
 Mcnemar's Test P-Value : NA            

Statistics by Class:

                     Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity          0.001236  0.11250   0.2719   0.1928   0.2550   0.4257   0.4449  0.52500  0.00000
Specificity          1.000000  0.97711   0.8363   0.8366   0.8093   0.7561   0.9028  0.96971  1.00000
Pos Pred Value       1.000000  0.09474   0.1355   0.2323   0.3064   0.2467   0.2235  0.15441      NaN
Neg Pred Value       0.789364  0.98103   0.9240   0.8017   0.7668   0.8753   0.9628  0.99487  0.99792
Prevalence           0.210842  0.02085   0.0863   0.2041   0.2484   0.1579   0.0592  0.01042  0.00208
Detection Rate       0.000261  0.00235   0.0235   0.0394   0.0633   0.0672   0.0263  0.00547  0.00000
Detection Prevalence 0.000261  0.02476   0.1731   0.1694   0.2067   0.2726   0.1178  0.03544  0.00000
Balanced Accuracy    0.500618  0.54480   0.5541   0.5147   0.5321   0.5909   0.6739  0.74736  0.50000

Mod5 (ZINB) - zeroinfl(TARGET ~ STARS + LabelAppeal + AcidIndex + TotalSulfurDioxide + VolatileAcidity

mod5 <- zeroinfl(TARGET ~ STARS  + LabelAppeal + AcidIndex + TotalSulfurDioxide + VolatileAcidity, data=train, dist="negbin")

summary(mod5)


Call:
zeroinfl(formula = TARGET ~ STARS + LabelAppeal + AcidIndex + TotalSulfurDioxide + VolatileAcidity, 
    data = train, dist = "negbin")

Pearson residuals:
   Min     1Q Median     3Q    Max 
-2.289 -0.447  0.034  0.400  4.371 

Count model coefficients (negbin with log link):
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         4.76e-01   7.00e-02    6.80  1.1e-11 ***
STARS2              1.19e-01   1.80e-02    6.61  3.9e-11 ***
STARS3              2.20e-01   1.93e-02   11.38  < 2e-16 ***
STARS4              3.28e-01   2.67e-02   12.26  < 2e-16 ***
STARS0             -6.56e-02   2.53e-02   -2.60   0.0094 ** 
LabelAppeal-1       4.43e-01   5.07e-02    8.74  < 2e-16 ***
LabelAppeal0        7.35e-01   4.96e-02   14.81  < 2e-16 ***
LabelAppeal1        9.26e-01   5.04e-02   18.38  < 2e-16 ***
LabelAppeal2        1.08e+00   5.57e-02   19.37  < 2e-16 ***
AcidIndex2         -5.76e-02   6.16e-02   -0.94   0.3497    
AcidIndex3          1.91e-02   4.90e-02    0.39   0.6973    
AcidIndex4          4.88e-02   4.88e-02    1.00   0.3172    
TotalSulfurDioxide -4.77e-05   3.71e-05   -1.28   0.1995    
VolatileAcidity    -8.65e-03   1.15e-02   -0.75   0.4506    
Log(theta)          1.97e+01         NA      NA       NA    

Zero-inflation model coefficients (binomial with logit link):
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -1.63e+00   4.80e-01   -3.38  0.00072 ***
STARS2             -3.89e+00   4.44e-01   -8.78  < 2e-16 ***
STARS3             -1.84e+01   4.68e+02   -0.04  0.96855    
STARS4             -1.86e+01   8.95e+02   -0.02  0.98340    
STARS0              2.04e+00   8.95e-02   22.79  < 2e-16 ***
LabelAppeal-1       1.44e+00   4.46e-01    3.22  0.00128 ** 
LabelAppeal0        2.26e+00   4.43e-01    5.10  3.5e-07 ***
LabelAppeal1        2.91e+00   4.49e-01    6.48  9.2e-11 ***
LabelAppeal2        3.48e+00   5.08e-01    6.84  7.9e-12 ***
AcidIndex2         -1.32e+00   2.44e-01   -5.39  7.0e-08 ***
AcidIndex3         -2.03e+00   1.94e-01  -10.49  < 2e-16 ***
AcidIndex4         -2.50e+00   1.96e-01  -12.75  < 2e-16 ***
TotalSulfurDioxide -1.22e-03   2.59e-04   -4.70  2.5e-06 ***
VolatileAcidity     2.12e-01   7.06e-02    3.01  0.00262 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 352438663.38 
Number of iterations in BFGS optimization: 41 
Log-likelihood: -1.43e+04 on 29 Df

Mod 5 Confusion Matrix

pred <- predict(mod5, newdata=test, type='response')
predRound <- as.factor(round(pred,0))
testData <- as.factor(test$TARGET)
cm <- confusionMatrix(predRound, testData)
cm$overall[1]

Accuracy 
   0.346

cm$table

          Reference
Prediction   0   1   2   3   4   5   6   7   8
         0  91   1   0   4   2   1   0   0   0
         1 431  32  66  79  54  20   3   2   1
         2 168  38 181 189  62  17   4   0   0
         3 100   9  74 315 275  81   5   0   0
         4  15   0  10 191 391 177  34   1   0
         5   4   0   0   5 161 247 105  10   0
         6   0   0   0   0   8  58  66  22   3
         7   0   0   0   0   0   5  10   5   4
         8   0   0   0   0   0   0   0   0   0

Select Models

The selection process is fairly straightforward. Only the Zero-Inflated Negative Binomial model (ZINB) was able to meet or prediction ability criteria. Specifically all other models struggled or simply were unable to predict zero values. Given the business purpose of this analysis, this is a fundamental requirement.

The ZINB model also outperformed all other models in terms of confusion matrix accuracy, AIC, BIC, logLik and length of model name. Summary results are set forth below.

mod1_out <- cbind(AIC=AIC(mod1), BIC = BIC(mod1), loglik=logLik(mod1))
mod2_out <- cbind(AIC=AIC(mod2),BIC = BIC(mod2), loglik=logLik(mod2))
mod3_out <- cbind(AIC=AIC(mod3),BIC = BIC(mod3), loglik=logLik(mod3))
mod4_out <- cbind(AIC=AIC(mod4), BIC = BIC(mod4), loglik=logLik(mod4))
mod5_out <- cbind(AIC=AIC(mod5), BIC = BIC(mod5), loglik=logLik(mod5))
model_comp <- rbind(mod1_out, mod2_out,mod3_out,mod4_out,mod5_out)
  
  
rownames(model_comp) <- c("mod1_out","mod2_out","mod3_out","mod4_out","mod5_out")

c <-kable(model_comp) %>% 
  kable_styling()
c

	AIC	BIC	loglik
mod1_out	31980	32072	-15977
mod2_out	30331	30431	-15152
mod3_out	31977	32070	-15976
mod4_out	30327	30426	-15149
mod5_out	28603	28809	-14273

The final step of our analysis was to apply the same transformations and imputations to the evaluation data set that we applied to the training data set and predict target values using our selected model.

Key Findings

Key findings from our analysis are set forth below.

STAR ratings, LabelAppeal are likely better predictor of sales than variables that capture the chemical composition of wine.
Poisson and linear models had difficulty forecasting zero values
The missing values in the STARS rating may have provided explanatory power to our models.