Predicting Wine Sales With Certain Properties

INTRODUCTION:

In this study we will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Our objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. Below is a short description of the variables of interest in the data set:

DATA EXPLORATION:

In this section we load and explore the training data set. We will try get familiarize ourselves with different variables i.e. dependent and independent variables, and check out their distributions. The problem at hand is about counting the number of wines sold that contains certain properties which indicates that we will be dealing with a lot of variables since these kind of problems are dependent on multiple factors. So before any further due let’s begin by loading the data set.

Loading The Data set:

Below code chunk loads the required data set that we can use to train our model.

training <- read_csv("https://raw.githubusercontent.com/Umerfarooq122/predict-the-number-of-cases-of-wine-that-will-be-sold-given-certain-properties-of-the-wine/main/wine-training-data.csv")

Let’s display the fist five row of the data set to check if everything has been loaded into our work environment correctly

knitr::kable(head(training))

INDEX	TARGET	FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	Density	pH	Sulphates	Alcohol	LabelAppeal	AcidIndex	STARS
1	3	3.2	1.160	-0.98	54.2	-0.567	NA	268	0.99280	3.33	-0.59	9.9	0	8	2
2	3	4.5	0.160	-0.81	26.1	-0.425	15	-327	1.02792	3.38	0.70	NA	-1	7	3
4	5	7.1	2.640	-0.88	14.8	0.037	214	142	0.99518	3.12	0.48	22.0	-1	8	3
5	3	5.7	0.385	0.04	18.8	-0.425	22	115	0.99640	2.24	1.83	6.2	-1	6	1
6	4	8.0	0.330	-1.26	9.4	NA	-167	108	0.99457	3.12	1.77	13.7	0	9	2
7	0	11.3	0.320	0.59	2.2	0.556	-37	15	0.99940	3.20	1.29	15.4	0	11	NA

Checking Out The Dimensions, Descriptive Summary And Distributions:

We can see that we have got all the columns that are mentioned in the introduction about data set. Let’s check out the dimension of the data set

dim(training)

## [1] 12795    16

As we can see that we have got 16 columns in total and 12795 observations. One of those columns in an index column and we usually do not need it for the analysis so lets remove that from our data set.

training <- training[-1]

Let’s quickly peek into the descriptive summary of our data set

knitr::kable(summary(training))

TARGET	FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	Density	pH	Sulphates	Alcohol	LabelAppeal	AcidIndex	STARS
Min. :0.000	Min. :-18.100	Min. :-2.7900	Min. :-3.2400	Min. :-127.800	Min. :-1.1710	Min. :-555.00	Min. :-823.0	Min. :0.8881	Min. :0.480	Min. :-3.1300	Min. :-4.70	Min. :-2.000000	Min. : 4.000	Min. :1.000
1st Qu.:2.000	1st Qu.: 5.200	1st Qu.: 0.1300	1st Qu.: 0.0300	1st Qu.: -2.000	1st Qu.:-0.0310	1st Qu.: 0.00	1st Qu.: 27.0	1st Qu.:0.9877	1st Qu.:2.960	1st Qu.: 0.2800	1st Qu.: 9.00	1st Qu.:-1.000000	1st Qu.: 7.000	1st Qu.:1.000
Median :3.000	Median : 6.900	Median : 0.2800	Median : 0.3100	Median : 3.900	Median : 0.0460	Median : 30.00	Median : 123.0	Median :0.9945	Median :3.200	Median : 0.5000	Median :10.40	Median : 0.000000	Median : 8.000	Median :2.000
Mean :3.029	Mean : 7.076	Mean : 0.3241	Mean : 0.3084	Mean : 5.419	Mean : 0.0548	Mean : 30.85	Mean : 120.7	Mean :0.9942	Mean :3.208	Mean : 0.5271	Mean :10.49	Mean :-0.009066	Mean : 7.773	Mean :2.042
3rd Qu.:4.000	3rd Qu.: 9.500	3rd Qu.: 0.6400	3rd Qu.: 0.5800	3rd Qu.: 15.900	3rd Qu.: 0.1530	3rd Qu.: 70.00	3rd Qu.: 208.0	3rd Qu.:1.0005	3rd Qu.:3.470	3rd Qu.: 0.8600	3rd Qu.:12.40	3rd Qu.: 1.000000	3rd Qu.: 8.000	3rd Qu.:3.000
Max. :8.000	Max. : 34.400	Max. : 3.6800	Max. : 3.8600	Max. : 141.150	Max. : 1.3510	Max. : 623.00	Max. :1057.0	Max. :1.0992	Max. :6.130	Max. : 4.2400	Max. :26.50	Max. : 2.000000	Max. :17.000	Max. :4.000
NA	NA	NA	NA	NA’s :616	NA’s :638	NA’s :647	NA’s :682	NA	NA’s :395	NA’s :1210	NA’s :653	NA	NA	NA’s :3359

As we See that are multiple missing values in the dataset and we can also see that our columns have a variety of ranges. The TARGET columns mean is at around 3 and we can also check the variance to see if later on during modeling the poisson distribution would be the right model.

var(training$TARGET)

## [1] 3.710895

Variance is very close to the mean of the TARGET column so I think poisson would be an optimal fit. We can check out the structure of data set.

str(training)

## tibble [12,795 × 15] (S3: tbl_df/tbl/data.frame)
##  $ TARGET            : num [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : num [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : num [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : num [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...

We can see that some column does not have the right data type so we have to fix that. For instance, STARS column was suppose to a nominal data rather than numeric so we have to fix that. We can also look the distribution of all the columns and its correlation with our TARGET column using visualization as shown below:

par(mfrow = c(3,5))
plot_histogram(training)

pairs.panels(training[, c(1, 2:6)], main = "Scatter Plot Matrix for Training Dataset")

pairs.panels(training[, c(1, 7:11)], main = "")

pairs.panels(training[, c(1, 11:15)], main = "")

DATA PREPARATION:

In this section we will prepare our data for modeling. We can set the data type for column like STARS and convert them into factors which a much more acceptable data type when it comes to Modeling.

Fixing The Data Types:

training$STARS <- as.factor(training$STARS)

Imputing The Missing Values:

Since our data set has a mixture of continuous and categorical variables so we will consider a method that can handle both types and my personal pick would be to use random forest method to look at. Random forest can handle both data type plus it is an ensemble method which is a better approach to predict something.

set.seed(32)
training <- mice(training, m=5, maxit = 3, method = 'rf')

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

training <- complete(training)

let’s do a quick check of any missing values in the data set:

sum(is.na(training))

## [1] 0

As our data set is ready so now we can go ahead and create our models

plot_histogram(training)

BUILDING MODELS:

Splitting the Data set:

Now let’s split the data into training and testing. We will split the data set training into partial_train and validation. Partial_train contains 80% of the data from training and the rest is in the validation that we will use for testing or evaluating the performance of our model.

set.seed(32)
split <- createDataPartition(training$TARGET, p=.80, list=FALSE)
partial_train <- training[split, ]
validation <- training[ -split, ]

Poisson Regression:

p1 <- glm(formula = TARGET~. ,family = 'poisson', data = partial_train)

summary(p1)

## 
## Call:
## glm(formula = TARGET ~ ., family = "poisson", data = partial_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3869  -0.6120   0.0528   0.5468   2.7575  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.592e+00  2.174e-01   7.324 2.40e-13 ***
## FixedAcidity       -1.379e-04  9.156e-04  -0.151  0.88026    
## VolatileAcidity    -3.292e-02  7.304e-03  -4.507 6.56e-06 ***
## CitricAcid          1.119e-02  6.553e-03   1.708  0.08759 .  
## ResidualSugar       2.846e-05  1.696e-04   0.168  0.86670    
## Chlorides          -7.553e-02  1.802e-02  -4.191 2.78e-05 ***
## FreeSulfurDioxide   1.235e-04  3.812e-05   3.240  0.00120 ** 
## TotalSulfurDioxide  7.042e-05  2.450e-05   2.874  0.00405 ** 
## Density            -3.029e-01  2.132e-01  -1.421  0.15544    
## pH                 -6.771e-03  8.405e-03  -0.806  0.42047    
## Sulphates          -1.349e-02  6.107e-03  -2.209  0.02715 *  
## Alcohol             4.374e-03  1.537e-03   2.846  0.00443 ** 
## LabelAppeal         1.518e-01  6.858e-03  22.134  < 2e-16 ***
## AcidIndex          -9.285e-02  5.080e-03 -18.277  < 2e-16 ***
## STARS2              6.553e-01  1.491e-02  43.954  < 2e-16 ***
## STARS3              8.180e-01  1.645e-02  49.726  < 2e-16 ***
## STARS4              9.316e-01  2.325e-02  40.076  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18190  on 10237  degrees of freedom
## Residual deviance: 12155  on 10221  degrees of freedom
## AIC: 37790
## 
## Number of Fisher Scoring iterations: 5

Poisson Regression Reduced:

p2 <- glm(formula = TARGET~ VolatileAcidity + Chlorides + TotalSulfurDioxide + FreeSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS , data = partial_train, family = 'poisson')

summary(p2)

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + 
##     FreeSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex + 
##     STARS, family = "poisson", data = partial_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3568  -0.6124   0.0531   0.5459   2.7593  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.268e+00  4.484e-02  28.268  < 2e-16 ***
## VolatileAcidity    -3.342e-02  7.301e-03  -4.577 4.71e-06 ***
## Chlorides          -7.558e-02  1.801e-02  -4.196 2.72e-05 ***
## TotalSulfurDioxide  7.023e-05  2.448e-05   2.869  0.00412 ** 
## FreeSulfurDioxide   1.230e-04  3.810e-05   3.229  0.00124 ** 
## Sulphates          -1.369e-02  6.103e-03  -2.244  0.02484 *  
## Alcohol             4.465e-03  1.536e-03   2.907  0.00364 ** 
## LabelAppeal         1.518e-01  6.856e-03  22.135  < 2e-16 ***
## AcidIndex          -9.245e-02  4.997e-03 -18.501  < 2e-16 ***
## STARS2              6.563e-01  1.490e-02  44.059  < 2e-16 ***
## STARS3              8.188e-01  1.644e-02  49.798  < 2e-16 ***
## STARS4              9.315e-01  2.324e-02  40.079  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18190  on 10237  degrees of freedom
## Residual deviance: 12161  on 10226  degrees of freedom
## AIC: 37785
## 
## Number of Fisher Scoring iterations: 5

Negative Binomial Model:

nbp1 <- glm.nb(formula = TARGET~., data = partial_train, link = log, maxit=2000)

## Warning in sqrt(1/i): NaNs produced

## Warning in sqrt(1/i): NaNs produced

#summary(nbp1)

Negative Binomial Reduced:

nbp2 <- glm.nb(formula = TARGET~VolatileAcidity + Chlorides + TotalSulfurDioxide + FreeSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + STARS , data = partial_train, link = log, maxit = 2000)

summary(nbp2)

## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + 
##     FreeSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + STARS, 
##     data = partial_train, maxit = 2000, link = log, init.theta = 2.45087425e+17)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## 0.0000  0.0000  0.4522  1.5316  4.1016  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.262e+00  4.478e-02  28.190  < 2e-16 ***
## VolatileAcidity    -3.340e-02  7.302e-03  -4.574 4.78e-06 ***
## Chlorides          -7.508e-02  1.801e-02  -4.169 3.06e-05 ***
## TotalSulfurDioxide  7.011e-05  2.448e-05   2.864  0.00418 ** 
## FreeSulfurDioxide   1.217e-04  3.810e-05   3.193  0.00141 ** 
## Alcohol             4.458e-03  1.536e-03   2.903  0.00370 ** 
## LabelAppeal         1.517e-01  6.856e-03  22.120  < 2e-16 ***
## AcidIndex          -9.271e-02  4.995e-03 -18.559  < 2e-16 ***
## STARS2              6.563e-01  1.490e-02  44.056  < 2e-16 ***
## STARS3              8.194e-01  1.644e-02  49.840  < 2e-16 ***
## STARS4              9.326e-01  2.324e-02  40.128  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.450874e+17) family taken to be 1)
## 
##     Null deviance: 18190  on 10237  degrees of freedom
## Residual deviance: 12166  on 10227  degrees of freedom
## AIC: 24
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.450874e+17 
##           Std. Err.:  3.560756e+14 
## 
##  2 x log-likelihood:  0

NOTE: As we can see that coefficients from poisson and negative binomial model are exactly alike mainly because of the closeness in the value of mean and variance for response variable.

Multiple Linear Regression:

ml1 <- glm(formula = TARGET~., data = partial_train, family = gaussian())

summary(ml1)

## 
## Call:
## glm(formula = TARGET ~ ., family = gaussian(), data = partial_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.5639  -0.9518   0.1146   0.9943   4.3821  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.468e+00  5.267e-01   8.484  < 2e-16 ***
## FixedAcidity       -2.496e-04  2.232e-03  -0.112 0.910975    
## VolatileAcidity    -9.971e-02  1.778e-02  -5.607 2.11e-08 ***
## CitricAcid          3.572e-02  1.608e-02   2.221 0.026356 *  
## ResidualSugar       1.101e-04  4.134e-04   0.266 0.790036    
## Chlorides          -2.313e-01  4.380e-02  -5.282 1.31e-07 ***
## FreeSulfurDioxide   3.680e-04  9.301e-05   3.957 7.65e-05 ***
## TotalSulfurDioxide  2.136e-04  5.950e-05   3.589 0.000333 ***
## Density            -8.467e-01  5.190e-01  -1.631 0.102826    
## pH                 -1.185e-02  2.039e-02  -0.581 0.560972    
## Sulphates          -3.642e-02  1.495e-02  -2.436 0.014883 *  
## Alcohol             1.518e-02  3.733e-03   4.068 4.78e-05 ***
## LabelAppeal         4.495e-01  1.639e-02  27.422  < 2e-16 ***
## AcidIndex          -2.359e-01  1.085e-02 -21.749  < 2e-16 ***
## STARS2              1.628e+00  3.261e-02  49.913  < 2e-16 ***
## STARS3              2.343e+00  3.991e-02  58.700  < 2e-16 ***
## STARS4              3.012e+00  6.691e-02  45.006  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.962023)
## 
##     Null deviance: 37813  on 10237  degrees of freedom
## Residual deviance: 20054  on 10221  degrees of freedom
## AIC: 35973
## 
## Number of Fisher Scoring iterations: 2

Multiple Linear Regression Reduced:

ml2 <- glm(formula = TARGET~VolatileAcidity + Chlorides + TotalSulfurDioxide + FreeSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS , data = partial_train, family = gaussian())

summary(ml2)

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + 
##     FreeSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex + 
##     STARS, family = gaussian(), data = partial_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.5088  -0.9478   0.1142   0.9970   4.3845  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.587e+00  9.817e-02  36.541  < 2e-16 ***
## VolatileAcidity    -1.012e-01  1.778e-02  -5.692 1.29e-08 ***
## Chlorides          -2.318e-01  4.379e-02  -5.293 1.23e-07 ***
## TotalSulfurDioxide  2.144e-04  5.948e-05   3.605 0.000313 ***
## FreeSulfurDioxide   3.670e-04  9.298e-05   3.946 7.98e-05 ***
## Sulphates          -3.696e-02  1.495e-02  -2.472 0.013434 *  
## Alcohol             1.539e-02  3.731e-03   4.125 3.74e-05 ***
## LabelAppeal         4.497e-01  1.639e-02  27.433  < 2e-16 ***
## AcidIndex          -2.348e-01  1.063e-02 -22.098  < 2e-16 ***
## STARS2              1.630e+00  3.258e-02  50.034  < 2e-16 ***
## STARS3              2.344e+00  3.989e-02  58.766  < 2e-16 ***
## STARS4              3.012e+00  6.692e-02  45.003  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.962636)
## 
##     Null deviance: 37813  on 10237  degrees of freedom
## Residual deviance: 20070  on 10226  degrees of freedom
## AIC: 35972
## 
## Number of Fisher Scoring iterations: 2

Zero Inflation Poisson:

zip1 <- zeroinfl(TARGET~.|STARS, data = partial_train)

summary(zip1)

## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = partial_train)
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.38259 -0.49502  0.06398  0.49093  2.09559 
## 
## Count model coefficients (poisson with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.669e+00  2.260e-01   7.385 1.52e-13 ***
## FixedAcidity        9.777e-05  9.435e-04   0.104   0.9175    
## VolatileAcidity    -1.534e-02  7.545e-03  -2.033   0.0420 *  
## CitricAcid          3.973e-03  6.733e-03   0.590   0.5552    
## ResidualSugar      -7.725e-05  1.744e-04  -0.443   0.6578    
## Chlorides          -3.634e-02  1.872e-02  -1.941   0.0523 .  
## FreeSulfurDioxide   4.443e-05  3.871e-05   1.148   0.2511    
## TotalSulfurDioxide -7.829e-07  2.453e-05  -0.032   0.9745    
## Density            -3.177e-01  2.215e-01  -1.434   0.1515    
## pH                  3.379e-03  8.685e-03   0.389   0.6973    
## Sulphates          -1.669e-03  6.351e-03  -0.263   0.7927    
## Alcohol             6.882e-03  1.567e-03   4.392 1.12e-05 ***
## LabelAppeal         2.268e-01  7.168e-03  31.639  < 2e-16 ***
## AcidIndex          -3.718e-02  5.717e-03  -6.504 7.82e-11 ***
## STARS2              1.470e-01  1.611e-02   9.125  < 2e-16 ***
## STARS3              2.503e-01  1.749e-02  14.310  < 2e-16 ***
## STARS4              3.392e-01  2.440e-02  13.903  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.38037    0.03323 -11.447  < 2e-16 ***
## STARS2       -2.75256    0.11609 -23.712  < 2e-16 ***
## STARS3      -19.25488  529.84908  -0.036 0.971011    
## STARS4       -5.51533    1.42253  -3.877 0.000106 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 41 
## Log-likelihood: -1.726e+04 on 21 Df

Zero Inflation Poisson Reduced:

zip2 <- zeroinfl(TARGET ~ VolatileAcidity + Chlorides + 
    Alcohol + LabelAppeal + AcidIndex | STARS, data =partial_train)

summary(zip2)

## 
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + Chlorides + Alcohol + LabelAppeal + 
##     AcidIndex | STARS, data = partial_train)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2263 -0.4010  0.1354  0.4714  2.2612 
## 
## Count model coefficients (poisson with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.515934   0.046473  32.619  < 2e-16 ***
## VolatileAcidity -0.018742   0.007499  -2.499   0.0125 *  
## Chlorides       -0.036041   0.018637  -1.934   0.0531 .  
## Alcohol          0.008768   0.001553   5.645 1.65e-08 ***
## LabelAppeal      0.272548   0.006562  41.536  < 2e-16 ***
## AcidIndex       -0.039840   0.005591  -7.126 1.04e-12 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.33864    0.03212  -10.54   <2e-16 ***
## STARS2       -2.79757    0.11348  -24.65   <2e-16 ***
## STARS3      -13.72141   29.86030   -0.46    0.646    
## STARS4      -12.66981   57.62269   -0.22    0.826    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 22 
## Log-likelihood: -1.74e+04 on 10 Df

SELECTING MODELS AND EVALUATION:

Model Selection:

p1_pred <- predict(p1, newdata = validation, type = "response")
p2_pred <- predict(p2, newdata = validation, type = "response")
nbp1_pred <- predict(nbp1, newdata = validation, type = "response")
nbp2_pred <- predict(nbp2 , newdata = validation, type = "response")
ml1_pred <- predict(ml1, newdata = validation, type = "response")
ml2_pred <- predict(ml2, newdata = validation, type = "response")
zip1_pred <- predict(zip1, newdata = validation, type = "response")
zip2_pred <- predict(zip2, newdata = validation, type = "response")

p1_mae <- mae(validation$TARGET, p1_pred)
p1_mse <- mse(validation$TARGET, p1_pred)
p2_mae <- mae(validation$TARGET, p2_pred)
p2_mse <- mse(validation$TARGET, p2_pred)
nbp1_mae <- mae(validation$TARGET, nbp1_pred)
nbp1_mse <- mse(validation$TARGET, nbp1_pred)
nbp2_mae <- mae(validation$TARGET, nbp2_pred)
nbp2_mse <- mse(validation$TARGET, nbp2_pred)
ml1_mae <- mae(validation$TARGET, ml1_pred)
ml1_mse <- mse(validation$TARGET, ml1_pred)
ml2_mae <- mae(validation$TARGET, ml2_pred)
ml2_mse <- mse(validation$TARGET, ml2_pred)
zip1_mae <- mae(validation$TARGET, zip1_pred)
zip1_mse <- mse(validation$TARGET, zip1_pred)
zip2_mae <- mae(validation$TARGET, zip2_pred)
zip2_mse <- mse(validation$TARGET, zip2_pred)

model_names <- c("Poisson", "Poisson Reduced", "Negative Binomial", "Negative Binomial Reduced", "Multiple Linear Regression", "Multiple Linear Regression Reduced", "Zero Inflation Poisson", "Zero Inflation Poisson Reduced")
mae_values <- c(p1_mae, p2_mae, nbp1_mae,nbp2_mae, ml1_mae, ml2_mae, zip1_mae, zip2_mae)
mse_values <- c(p1_mse, p2_mse, nbp1_mse,nbp2_mse, ml1_mse, ml2_mse, zip1_mse, zip2_mse)
result_df <- data.frame(Model = model_names, MAE = mae_values, MSE = mse_values)
knitr::kable(result_df)

Model	MAE	MSE
Poisson	1.141460	1.967859
Poisson Reduced	1.140504	1.967155
Negative Binomial	1.141460	1.967859
Negative Binomial Reduced	1.140341	1.968341
Multiple Linear Regression	1.132244	1.976968
Multiple Linear Regression Reduced	1.132479	1.978670
Zero Inflation Poisson	1.142168	1.993674
Zero Inflation Poisson Reduced	1.178747	2.171632

Predictions:

Let’s load the testing data set for predictions:

testing <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/predict-the-number-of-cases-of-wine-that-will-be-sold-given-certain-properties-of-the-wine/main/wine-evaluation-data.csv")

Displaying the first few row of our testing data set:

knitr::kable(head(testing))

IN	TARGET	FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	Density	pH	Sulphates	Alcohol	LabelAppeal	AcidIndex	STARS
3	NA	5.4	-0.860	0.27	-10.7	0.092	23	398	0.98527	5.02	0.64	12.30	-1	6	NA
9	NA	12.4	0.385	-0.76	-19.7	1.169	-37	68	0.99048	3.37	1.09	16.00	0	6	2
10	NA	7.2	1.750	0.17	-33.0	0.065	9	76	1.04641	4.61	0.68	8.55	0	8	1
18	NA	6.2	0.100	1.80	1.0	-0.179	104	89	0.98877	3.20	2.11	12.30	-1	8	1
21	NA	11.4	0.210	0.28	1.2	0.038	70	53	1.02899	2.54	-0.07	4.80	0	10	NA
30	NA	17.6	0.040	-1.15	1.4	0.535	-250	140	0.95028	3.06	-0.02	11.40	1	8	4

Removing unwanted Columns

testing <- testing[-c(1,2)]

Looking for any missing values in our testing data set:

colSums(is.na(testing))

##       FixedAcidity    VolatileAcidity         CitricAcid      ResidualSugar 
##                  0                  0                  0                168 
##          Chlorides  FreeSulfurDioxide TotalSulfurDioxide            Density 
##                138                152                157                  0 
##                 pH          Sulphates            Alcohol        LabelAppeal 
##                104                310                185                  0 
##          AcidIndex              STARS 
##                  0                841

testing$STARS <- as.factor(testing$STARS)

set.seed(32)
testing <- mice(testing, m=5, maxit = 3, method = 'rf')

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

testing <- complete(testing)

We are going to use the poisson reduced model to predict the outcomes of our testing set.

predictions <- predict(p2, newdata = testing, type = "response")

Here is the histogram of our predictions

hist(predictions)

CONCLUSION:

This particular study was focused in solving the count of wines sold to restaurants using poisson regression. We started off with a data set that required some cleaning and pre-processing before feeding it to an algorithm to train a model. After cleaning and preparing the data set we tried multiple techniques to perform count regression. We fed the data to poisson regression model, negative binomial model, Multiple linear regression and Zero Inflation poisson model. We compared the results of all the models based on mean absolute error (MSE) and mean absolute error (MAE), and selected the on with relatively less predictors and comparatively accepting values for MSE and MAE. Then we load the evaluation data set and performed predictions using reduced poisson regression model. The results from different wqe very similar to each mainly because of the pre-processing where random forest was utilized to fill in for missing values.