Predicting Wine Sales With Certain Properties

INTRODUCTION:

In this study we will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Our objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. Below is a short description of the variables of interest in the data set:

DATA EXPLORATION:

In this section we load and explore the training data set. We will try get familiarize ourselves with different variables i.e. dependent and independent variables, and check out their distributions. The problem at hand is about counting the number of wines sold that contains certain properties which indicates that we will be dealing with a lot of variables since these kind of problems are dependent on multiple factors. So before any further due let’s begin by loading the data set.

Loading The Data set:

Below code chunk loads the required data set that we can use to train our model.

training <- read_csv("https://raw.githubusercontent.com/Umerfarooq122/predict-the-number-of-cases-of-wine-that-will-be-sold-given-certain-properties-of-the-wine/main/wine-training-data.csv")

Let’s display the fist five row of the data set to check if everything has been loaded into our work environment correctly

knitr::kable(head(training))
INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS
1 3 3.2 1.160 -0.98 54.2 -0.567 NA 268 0.99280 3.33 -0.59 9.9 0 8 2
2 3 4.5 0.160 -0.81 26.1 -0.425 15 -327 1.02792 3.38 0.70 NA -1 7 3
4 5 7.1 2.640 -0.88 14.8 0.037 214 142 0.99518 3.12 0.48 22.0 -1 8 3
5 3 5.7 0.385 0.04 18.8 -0.425 22 115 0.99640 2.24 1.83 6.2 -1 6 1
6 4 8.0 0.330 -1.26 9.4 NA -167 108 0.99457 3.12 1.77 13.7 0 9 2
7 0 11.3 0.320 0.59 2.2 0.556 -37 15 0.99940 3.20 1.29 15.4 0 11 NA

Checking Out The Dimensions, Descriptive Summary And Distributions:

We can see that we have got all the columns that are mentioned in the introduction about data set. Let’s check out the dimension of the data set

dim(training)
## [1] 12795    16

As we can see that we have got 16 columns in total and 12795 observations. One of those columns in an index column and we usually do not need it for the analysis so lets remove that from our data set.

training <- training[-1]

Let’s quickly peek into the descriptive summary of our data set

knitr::kable(summary(training))
TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS
Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400 Min. :-127.800 Min. :-1.1710 Min. :-555.00 Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00 1st Qu.: 27.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100 Median : 3.900 Median : 0.0460 Median : 30.00 Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084 Mean : 5.419 Mean : 0.0548 Mean : 30.85 Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49 Mean :-0.009066 Mean : 7.773 Mean :2.042
3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00 3rd Qu.: 208.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600 Max. : 141.150 Max. : 1.3510 Max. : 623.00 Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
NA NA NA NA NA’s :616 NA’s :638 NA’s :647 NA’s :682 NA NA’s :395 NA’s :1210 NA’s :653 NA NA NA’s :3359

As we See that are multiple missing values in the dataset and we can also see that our columns have a variety of ranges. The TARGET columns mean is at around 3 and we can also check the variance to see if later on during modeling the poisson distribution would be the right model.

var(training$TARGET)
## [1] 3.710895

Variance is very close to the mean of the TARGET column so I think poisson would be an optimal fit. We can check out the structure of data set.

str(training)
## tibble [12,795 × 15] (S3: tbl_df/tbl/data.frame)
##  $ TARGET            : num [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : num [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : num [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : num [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...

We can see that some column does not have the right data type so we have to fix that. For instance, STARS column was suppose to a nominal data rather than numeric so we have to fix that. We can also look the distribution of all the columns and its correlation with our TARGET column using visualization as shown below:

par(mfrow = c(3,5))
plot_histogram(training)

pairs.panels(training[, c(1, 2:6)], main = "Scatter Plot Matrix for Training Dataset")

pairs.panels(training[, c(1, 7:11)], main = "")

pairs.panels(training[, c(1, 11:15)], main = "")

DATA PREPARATION:

In this section we will prepare our data for modeling. We can set the data type for column like STARS and convert them into factors which a much more acceptable data type when it comes to Modeling.

Fixing The Data Types:

training$STARS <- as.factor(training$STARS)

Imputing The Missing Values:

Since our data set has a mixture of continuous and categorical variables so we will consider a method that can handle both types and my personal pick would be to use random forest method to look at. Random forest can handle both data type plus it is an ensemble method which is a better approach to predict something.

set.seed(32)
training <- mice(training, m=5, maxit = 3, method = 'rf')
## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
training <- complete(training)

let’s do a quick check of any missing values in the data set:

sum(is.na(training))
## [1] 0

As our data set is ready so now we can go ahead and create our models

plot_histogram(training)

BUILDING MODELS:

Splitting the Data set:

Now let’s split the data into training and testing. We will split the data set training into partial_train and validation. Partial_train contains 80% of the data from training and the rest is in the validation that we will use for testing or evaluating the performance of our model.

set.seed(32)
split <- createDataPartition(training$TARGET, p=.80, list=FALSE)
partial_train <- training[split, ]
validation <- training[ -split, ]

Poisson Regression:

p1 <- glm(formula = TARGET~. ,family = 'poisson', data = partial_train)
summary(p1)
## 
## Call:
## glm(formula = TARGET ~ ., family = "poisson", data = partial_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3869  -0.6120   0.0528   0.5468   2.7575  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.592e+00  2.174e-01   7.324 2.40e-13 ***
## FixedAcidity       -1.379e-04  9.156e-04  -0.151  0.88026    
## VolatileAcidity    -3.292e-02  7.304e-03  -4.507 6.56e-06 ***
## CitricAcid          1.119e-02  6.553e-03   1.708  0.08759 .  
## ResidualSugar       2.846e-05  1.696e-04   0.168  0.86670    
## Chlorides          -7.553e-02  1.802e-02  -4.191 2.78e-05 ***
## FreeSulfurDioxide   1.235e-04  3.812e-05   3.240  0.00120 ** 
## TotalSulfurDioxide  7.042e-05  2.450e-05   2.874  0.00405 ** 
## Density            -3.029e-01  2.132e-01  -1.421  0.15544    
## pH                 -6.771e-03  8.405e-03  -0.806  0.42047    
## Sulphates          -1.349e-02  6.107e-03  -2.209  0.02715 *  
## Alcohol             4.374e-03  1.537e-03   2.846  0.00443 ** 
## LabelAppeal         1.518e-01  6.858e-03  22.134  < 2e-16 ***
## AcidIndex          -9.285e-02  5.080e-03 -18.277  < 2e-16 ***
## STARS2              6.553e-01  1.491e-02  43.954  < 2e-16 ***
## STARS3              8.180e-01  1.645e-02  49.726  < 2e-16 ***
## STARS4              9.316e-01  2.325e-02  40.076  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18190  on 10237  degrees of freedom
## Residual deviance: 12155  on 10221  degrees of freedom
## AIC: 37790
## 
## Number of Fisher Scoring iterations: 5

Poisson Regression Reduced:

p2 <- glm(formula = TARGET~ VolatileAcidity + Chlorides + TotalSulfurDioxide + FreeSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS , data = partial_train, family = 'poisson')
summary(p2)
## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + 
##     FreeSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex + 
##     STARS, family = "poisson", data = partial_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3568  -0.6124   0.0531   0.5459   2.7593  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.268e+00  4.484e-02  28.268  < 2e-16 ***
## VolatileAcidity    -3.342e-02  7.301e-03  -4.577 4.71e-06 ***
## Chlorides          -7.558e-02  1.801e-02  -4.196 2.72e-05 ***
## TotalSulfurDioxide  7.023e-05  2.448e-05   2.869  0.00412 ** 
## FreeSulfurDioxide   1.230e-04  3.810e-05   3.229  0.00124 ** 
## Sulphates          -1.369e-02  6.103e-03  -2.244  0.02484 *  
## Alcohol             4.465e-03  1.536e-03   2.907  0.00364 ** 
## LabelAppeal         1.518e-01  6.856e-03  22.135  < 2e-16 ***
## AcidIndex          -9.245e-02  4.997e-03 -18.501  < 2e-16 ***
## STARS2              6.563e-01  1.490e-02  44.059  < 2e-16 ***
## STARS3              8.188e-01  1.644e-02  49.798  < 2e-16 ***
## STARS4              9.315e-01  2.324e-02  40.079  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18190  on 10237  degrees of freedom
## Residual deviance: 12161  on 10226  degrees of freedom
## AIC: 37785
## 
## Number of Fisher Scoring iterations: 5

Negative Binomial Model:

nbp1 <- glm.nb(formula = TARGET~., data = partial_train, link = log, maxit=2000)
## Warning in sqrt(1/i): NaNs produced

## Warning in sqrt(1/i): NaNs produced
#summary(nbp1)

Negative Binomial Reduced:

nbp2 <- glm.nb(formula = TARGET~VolatileAcidity + Chlorides + TotalSulfurDioxide + FreeSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + STARS , data = partial_train, link = log, maxit = 2000)
summary(nbp2)
## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + 
##     FreeSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + STARS, 
##     data = partial_train, maxit = 2000, link = log, init.theta = 2.45087425e+17)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## 0.0000  0.0000  0.4522  1.5316  4.1016  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.262e+00  4.478e-02  28.190  < 2e-16 ***
## VolatileAcidity    -3.340e-02  7.302e-03  -4.574 4.78e-06 ***
## Chlorides          -7.508e-02  1.801e-02  -4.169 3.06e-05 ***
## TotalSulfurDioxide  7.011e-05  2.448e-05   2.864  0.00418 ** 
## FreeSulfurDioxide   1.217e-04  3.810e-05   3.193  0.00141 ** 
## Alcohol             4.458e-03  1.536e-03   2.903  0.00370 ** 
## LabelAppeal         1.517e-01  6.856e-03  22.120  < 2e-16 ***
## AcidIndex          -9.271e-02  4.995e-03 -18.559  < 2e-16 ***
## STARS2              6.563e-01  1.490e-02  44.056  < 2e-16 ***
## STARS3              8.194e-01  1.644e-02  49.840  < 2e-16 ***
## STARS4              9.326e-01  2.324e-02  40.128  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.450874e+17) family taken to be 1)
## 
##     Null deviance: 18190  on 10237  degrees of freedom
## Residual deviance: 12166  on 10227  degrees of freedom
## AIC: 24
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.450874e+17 
##           Std. Err.:  3.560756e+14 
## 
##  2 x log-likelihood:  0

NOTE: As we can see that coefficients from poisson and negative binomial model are exactly alike mainly because of the closeness in the value of mean and variance for response variable.

Multiple Linear Regression:

ml1 <- glm(formula = TARGET~., data = partial_train, family = gaussian())
summary(ml1)
## 
## Call:
## glm(formula = TARGET ~ ., family = gaussian(), data = partial_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.5639  -0.9518   0.1146   0.9943   4.3821  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.468e+00  5.267e-01   8.484  < 2e-16 ***
## FixedAcidity       -2.496e-04  2.232e-03  -0.112 0.910975    
## VolatileAcidity    -9.971e-02  1.778e-02  -5.607 2.11e-08 ***
## CitricAcid          3.572e-02  1.608e-02   2.221 0.026356 *  
## ResidualSugar       1.101e-04  4.134e-04   0.266 0.790036    
## Chlorides          -2.313e-01  4.380e-02  -5.282 1.31e-07 ***
## FreeSulfurDioxide   3.680e-04  9.301e-05   3.957 7.65e-05 ***
## TotalSulfurDioxide  2.136e-04  5.950e-05   3.589 0.000333 ***
## Density            -8.467e-01  5.190e-01  -1.631 0.102826    
## pH                 -1.185e-02  2.039e-02  -0.581 0.560972    
## Sulphates          -3.642e-02  1.495e-02  -2.436 0.014883 *  
## Alcohol             1.518e-02  3.733e-03   4.068 4.78e-05 ***
## LabelAppeal         4.495e-01  1.639e-02  27.422  < 2e-16 ***
## AcidIndex          -2.359e-01  1.085e-02 -21.749  < 2e-16 ***
## STARS2              1.628e+00  3.261e-02  49.913  < 2e-16 ***
## STARS3              2.343e+00  3.991e-02  58.700  < 2e-16 ***
## STARS4              3.012e+00  6.691e-02  45.006  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.962023)
## 
##     Null deviance: 37813  on 10237  degrees of freedom
## Residual deviance: 20054  on 10221  degrees of freedom
## AIC: 35973
## 
## Number of Fisher Scoring iterations: 2

Multiple Linear Regression Reduced:

ml2 <- glm(formula = TARGET~VolatileAcidity + Chlorides + TotalSulfurDioxide + FreeSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS , data = partial_train, family = gaussian())
summary(ml2)
## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + 
##     FreeSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex + 
##     STARS, family = gaussian(), data = partial_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.5088  -0.9478   0.1142   0.9970   4.3845  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.587e+00  9.817e-02  36.541  < 2e-16 ***
## VolatileAcidity    -1.012e-01  1.778e-02  -5.692 1.29e-08 ***
## Chlorides          -2.318e-01  4.379e-02  -5.293 1.23e-07 ***
## TotalSulfurDioxide  2.144e-04  5.948e-05   3.605 0.000313 ***
## FreeSulfurDioxide   3.670e-04  9.298e-05   3.946 7.98e-05 ***
## Sulphates          -3.696e-02  1.495e-02  -2.472 0.013434 *  
## Alcohol             1.539e-02  3.731e-03   4.125 3.74e-05 ***
## LabelAppeal         4.497e-01  1.639e-02  27.433  < 2e-16 ***
## AcidIndex          -2.348e-01  1.063e-02 -22.098  < 2e-16 ***
## STARS2              1.630e+00  3.258e-02  50.034  < 2e-16 ***
## STARS3              2.344e+00  3.989e-02  58.766  < 2e-16 ***
## STARS4              3.012e+00  6.692e-02  45.003  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.962636)
## 
##     Null deviance: 37813  on 10237  degrees of freedom
## Residual deviance: 20070  on 10226  degrees of freedom
## AIC: 35972
## 
## Number of Fisher Scoring iterations: 2

Zero Inflation Poisson:

zip1 <- zeroinfl(TARGET~.|STARS, data = partial_train)
summary(zip1)
## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = partial_train)
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.38259 -0.49502  0.06398  0.49093  2.09559 
## 
## Count model coefficients (poisson with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.669e+00  2.260e-01   7.385 1.52e-13 ***
## FixedAcidity        9.777e-05  9.435e-04   0.104   0.9175    
## VolatileAcidity    -1.534e-02  7.545e-03  -2.033   0.0420 *  
## CitricAcid          3.973e-03  6.733e-03   0.590   0.5552    
## ResidualSugar      -7.725e-05  1.744e-04  -0.443   0.6578    
## Chlorides          -3.634e-02  1.872e-02  -1.941   0.0523 .  
## FreeSulfurDioxide   4.443e-05  3.871e-05   1.148   0.2511    
## TotalSulfurDioxide -7.829e-07  2.453e-05  -0.032   0.9745    
## Density            -3.177e-01  2.215e-01  -1.434   0.1515    
## pH                  3.379e-03  8.685e-03   0.389   0.6973    
## Sulphates          -1.669e-03  6.351e-03  -0.263   0.7927    
## Alcohol             6.882e-03  1.567e-03   4.392 1.12e-05 ***
## LabelAppeal         2.268e-01  7.168e-03  31.639  < 2e-16 ***
## AcidIndex          -3.718e-02  5.717e-03  -6.504 7.82e-11 ***
## STARS2              1.470e-01  1.611e-02   9.125  < 2e-16 ***
## STARS3              2.503e-01  1.749e-02  14.310  < 2e-16 ***
## STARS4              3.392e-01  2.440e-02  13.903  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.38037    0.03323 -11.447  < 2e-16 ***
## STARS2       -2.75256    0.11609 -23.712  < 2e-16 ***
## STARS3      -19.25488  529.84908  -0.036 0.971011    
## STARS4       -5.51533    1.42253  -3.877 0.000106 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 41 
## Log-likelihood: -1.726e+04 on 21 Df

Zero Inflation Poisson Reduced:

zip2 <- zeroinfl(TARGET ~ VolatileAcidity + Chlorides + 
    Alcohol + LabelAppeal + AcidIndex | STARS, data =partial_train)
summary(zip2)
## 
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + Chlorides + Alcohol + LabelAppeal + 
##     AcidIndex | STARS, data = partial_train)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2263 -0.4010  0.1354  0.4714  2.2612 
## 
## Count model coefficients (poisson with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.515934   0.046473  32.619  < 2e-16 ***
## VolatileAcidity -0.018742   0.007499  -2.499   0.0125 *  
## Chlorides       -0.036041   0.018637  -1.934   0.0531 .  
## Alcohol          0.008768   0.001553   5.645 1.65e-08 ***
## LabelAppeal      0.272548   0.006562  41.536  < 2e-16 ***
## AcidIndex       -0.039840   0.005591  -7.126 1.04e-12 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.33864    0.03212  -10.54   <2e-16 ***
## STARS2       -2.79757    0.11348  -24.65   <2e-16 ***
## STARS3      -13.72141   29.86030   -0.46    0.646    
## STARS4      -12.66981   57.62269   -0.22    0.826    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 22 
## Log-likelihood: -1.74e+04 on 10 Df

SELECTING MODELS AND EVALUATION:

Model Selection:

p1_pred <- predict(p1, newdata = validation, type = "response")
p2_pred <- predict(p2, newdata = validation, type = "response")
nbp1_pred <- predict(nbp1, newdata = validation, type = "response")
nbp2_pred <- predict(nbp2 , newdata = validation, type = "response")
ml1_pred <- predict(ml1, newdata = validation, type = "response")
ml2_pred <- predict(ml2, newdata = validation, type = "response")
zip1_pred <- predict(zip1, newdata = validation, type = "response")
zip2_pred <- predict(zip2, newdata = validation, type = "response")
p1_mae <- mae(validation$TARGET, p1_pred)
p1_mse <- mse(validation$TARGET, p1_pred)
p2_mae <- mae(validation$TARGET, p2_pred)
p2_mse <- mse(validation$TARGET, p2_pred)
nbp1_mae <- mae(validation$TARGET, nbp1_pred)
nbp1_mse <- mse(validation$TARGET, nbp1_pred)
nbp2_mae <- mae(validation$TARGET, nbp2_pred)
nbp2_mse <- mse(validation$TARGET, nbp2_pred)
ml1_mae <- mae(validation$TARGET, ml1_pred)
ml1_mse <- mse(validation$TARGET, ml1_pred)
ml2_mae <- mae(validation$TARGET, ml2_pred)
ml2_mse <- mse(validation$TARGET, ml2_pred)
zip1_mae <- mae(validation$TARGET, zip1_pred)
zip1_mse <- mse(validation$TARGET, zip1_pred)
zip2_mae <- mae(validation$TARGET, zip2_pred)
zip2_mse <- mse(validation$TARGET, zip2_pred)
model_names <- c("Poisson", "Poisson Reduced", "Negative Binomial", "Negative Binomial Reduced", "Multiple Linear Regression", "Multiple Linear Regression Reduced", "Zero Inflation Poisson", "Zero Inflation Poisson Reduced")
mae_values <- c(p1_mae, p2_mae, nbp1_mae,nbp2_mae, ml1_mae, ml2_mae, zip1_mae, zip2_mae)
mse_values <- c(p1_mse, p2_mse, nbp1_mse,nbp2_mse, ml1_mse, ml2_mse, zip1_mse, zip2_mse)
result_df <- data.frame(Model = model_names, MAE = mae_values, MSE = mse_values)
knitr::kable(result_df)
Model MAE MSE
Poisson 1.141460 1.967859
Poisson Reduced 1.140504 1.967155
Negative Binomial 1.141460 1.967859
Negative Binomial Reduced 1.140341 1.968341
Multiple Linear Regression 1.132244 1.976968
Multiple Linear Regression Reduced 1.132479 1.978670
Zero Inflation Poisson 1.142168 1.993674
Zero Inflation Poisson Reduced 1.178747 2.171632

Predictions:

Let’s load the testing data set for predictions:

testing <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/predict-the-number-of-cases-of-wine-that-will-be-sold-given-certain-properties-of-the-wine/main/wine-evaluation-data.csv")

Displaying the first few row of our testing data set:

knitr::kable(head(testing))
IN TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS
3 NA 5.4 -0.860 0.27 -10.7 0.092 23 398 0.98527 5.02 0.64 12.30 -1 6 NA
9 NA 12.4 0.385 -0.76 -19.7 1.169 -37 68 0.99048 3.37 1.09 16.00 0 6 2
10 NA 7.2 1.750 0.17 -33.0 0.065 9 76 1.04641 4.61 0.68 8.55 0 8 1
18 NA 6.2 0.100 1.80 1.0 -0.179 104 89 0.98877 3.20 2.11 12.30 -1 8 1
21 NA 11.4 0.210 0.28 1.2 0.038 70 53 1.02899 2.54 -0.07 4.80 0 10 NA
30 NA 17.6 0.040 -1.15 1.4 0.535 -250 140 0.95028 3.06 -0.02 11.40 1 8 4

Removing unwanted Columns

testing <- testing[-c(1,2)]

Looking for any missing values in our testing data set:

colSums(is.na(testing))
##       FixedAcidity    VolatileAcidity         CitricAcid      ResidualSugar 
##                  0                  0                  0                168 
##          Chlorides  FreeSulfurDioxide TotalSulfurDioxide            Density 
##                138                152                157                  0 
##                 pH          Sulphates            Alcohol        LabelAppeal 
##                104                310                185                  0 
##          AcidIndex              STARS 
##                  0                841
testing$STARS <- as.factor(testing$STARS)
set.seed(32)
testing <- mice(testing, m=5, maxit = 3, method = 'rf')
## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
testing <- complete(testing)

We are going to use the poisson reduced model to predict the outcomes of our testing set.

predictions <- predict(p2, newdata = testing, type = "response")

Here is the histogram of our predictions

hist(predictions)

CONCLUSION:

This particular study was focused in solving the count of wines sold to restaurants using poisson regression. We started off with a data set that required some cleaning and pre-processing before feeding it to an algorithm to train a model. After cleaning and preparing the data set we tried multiple techniques to perform count regression. We fed the data to poisson regression model, negative binomial model, Multiple linear regression and Zero Inflation poisson model. We compared the results of all the models based on mean absolute error (MSE) and mean absolute error (MAE), and selected the on with relatively less predictors and comparatively accepting values for MSE and MAE. Then we load the evaluation data set and performed predictions using reduced poisson regression model. The results from different wqe very similar to each mainly because of the pre-processing where random forest was utilized to fill in for missing values.