Predicting Wine Sales With Certain Properties
INTRODUCTION:
In this study we will explore, analyze, and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely a wine is to be sold at a high-end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust its wine offering to maximize sales.
Our objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. The variables of interest can be seen in the data preview in the next section.
DATA EXPLORATION:
In this section we load and explore the training data set. We will familiarize ourselves with the different variables, i.e. the dependent and independent variables, and check out their distributions. The problem at hand is predicting the number of cases of a wine sold given certain properties of that wine, which suggests that we will be dealing with a lot of variables, since this kind of outcome depends on multiple factors. So without further ado, let's begin by loading the data set.
Loading The Data set:
The code chunk below loads the training data set that we will use to build our model (read_csv() comes from the readr package, which is assumed to be loaded).
training <- read_csv("https://raw.githubusercontent.com/Umerfarooq122/predict-the-number-of-cases-of-wine-that-will-be-sold-given-certain-properties-of-the-wine/main/wine-training-data.csv")
Let's display the first few rows of the data set to check that everything has been loaded into our work environment correctly.
| INDEX | TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 3.2 | 1.160 | -0.98 | 54.2 | -0.567 | NA | 268 | 0.99280 | 3.33 | -0.59 | 9.9 | 0 | 8 | 2 |
| 2 | 3 | 4.5 | 0.160 | -0.81 | 26.1 | -0.425 | 15 | -327 | 1.02792 | 3.38 | 0.70 | NA | -1 | 7 | 3 |
| 4 | 5 | 7.1 | 2.640 | -0.88 | 14.8 | 0.037 | 214 | 142 | 0.99518 | 3.12 | 0.48 | 22.0 | -1 | 8 | 3 |
| 5 | 3 | 5.7 | 0.385 | 0.04 | 18.8 | -0.425 | 22 | 115 | 0.99640 | 2.24 | 1.83 | 6.2 | -1 | 6 | 1 |
| 6 | 4 | 8.0 | 0.330 | -1.26 | 9.4 | NA | -167 | 108 | 0.99457 | 3.12 | 1.77 | 13.7 | 0 | 9 | 2 |
| 7 | 0 | 11.3 | 0.320 | 0.59 | 2.2 | 0.556 | -37 | 15 | 0.99940 | 3.20 | 1.29 | 15.4 | 0 | 11 | NA |
Checking Out The Dimensions, Descriptive Summary And Distributions:
We can see that we have all the columns mentioned in the introduction to the data set. Let's check out the dimensions of the data set:
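A one-line check, assuming the data is stored in training as loaded above:
dim(training)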
## [1] 12795 16
As we can see, we have 16 columns in total and 12,795 observations. One of those columns is an index column that we usually do not need for the analysis, so let's remove it from our data set.
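A minimal sketch of dropping that column (the original chunk is not shown), assuming it is named INDEX as in the preview above:
training <- training[, names(training) != "INDEX"]  # drop the index column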
Let's take a quick peek at the descriptive summary of our data set:
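The table below can be produced with a call such as the following, assuming knitr is available for table formatting:
knitr::kable(summary(training))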
| TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :0.000 | Min. :-18.100 | Min. :-2.7900 | Min. :-3.2400 | Min. :-127.800 | Min. :-1.1710 | Min. :-555.00 | Min. :-823.0 | Min. :0.8881 | Min. :0.480 | Min. :-3.1300 | Min. :-4.70 | Min. :-2.000000 | Min. : 4.000 | Min. :1.000 | |
| 1st Qu.:2.000 | 1st Qu.: 5.200 | 1st Qu.: 0.1300 | 1st Qu.: 0.0300 | 1st Qu.: -2.000 | 1st Qu.:-0.0310 | 1st Qu.: 0.00 | 1st Qu.: 27.0 | 1st Qu.:0.9877 | 1st Qu.:2.960 | 1st Qu.: 0.2800 | 1st Qu.: 9.00 | 1st Qu.:-1.000000 | 1st Qu.: 7.000 | 1st Qu.:1.000 | |
| Median :3.000 | Median : 6.900 | Median : 0.2800 | Median : 0.3100 | Median : 3.900 | Median : 0.0460 | Median : 30.00 | Median : 123.0 | Median :0.9945 | Median :3.200 | Median : 0.5000 | Median :10.40 | Median : 0.000000 | Median : 8.000 | Median :2.000 | |
| Mean :3.029 | Mean : 7.076 | Mean : 0.3241 | Mean : 0.3084 | Mean : 5.419 | Mean : 0.0548 | Mean : 30.85 | Mean : 120.7 | Mean :0.9942 | Mean :3.208 | Mean : 0.5271 | Mean :10.49 | Mean :-0.009066 | Mean : 7.773 | Mean :2.042 | |
| 3rd Qu.:4.000 | 3rd Qu.: 9.500 | 3rd Qu.: 0.6400 | 3rd Qu.: 0.5800 | 3rd Qu.: 15.900 | 3rd Qu.: 0.1530 | 3rd Qu.: 70.00 | 3rd Qu.: 208.0 | 3rd Qu.:1.0005 | 3rd Qu.:3.470 | 3rd Qu.: 0.8600 | 3rd Qu.:12.40 | 3rd Qu.: 1.000000 | 3rd Qu.: 8.000 | 3rd Qu.:3.000 | |
| Max. :8.000 | Max. : 34.400 | Max. : 3.6800 | Max. : 3.8600 | Max. : 141.150 | Max. : 1.3510 | Max. : 623.00 | Max. :1057.0 | Max. :1.0992 | Max. :6.130 | Max. : 4.2400 | Max. :26.50 | Max. : 2.000000 | Max. :17.000 | Max. :4.000 | |
| NA | NA | NA | NA | NA’s :616 | NA’s :638 | NA’s :647 | NA’s :682 | NA | NA’s :395 | NA’s :1210 | NA’s :653 | NA | NA | NA’s :3359 |
As we can see, there are multiple missing values in the data set, and the columns span a variety of ranges. The mean of the TARGET column is around 3; we can also check its variance to see whether, later on during modeling, a Poisson distribution would be the right choice.
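A one-line check of the variance of the response, assuming the data is still in training:
var(training$TARGET)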
## [1] 3.710895
The variance (about 3.71) is fairly close to the mean of the TARGET column (about 3), so a Poisson model looks like a reasonable fit. We can also check out the structure of the data set.
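The structure shown below can be printed with a call such as:
str(training)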
## tibble [12,795 × 15] (S3: tbl_df/tbl/data.frame)
## $ TARGET : num [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : num [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : num [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : num [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...
We can see that some columns do not have the right data type, so we have to fix that. For instance, the STARS column is supposed to be nominal (a rating) rather than numeric. We can also look at the distribution of each column and its correlation with our TARGET column using visualizations, as sketched below:
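A minimal sketch of such a check using base R graphics (the original may have used ggplot2 or a similar package; the resulting figure is not reproduced here):
par(mfrow = c(4, 4))                       # one panel per column
for (col in names(training)) {
  hist(training[[col]], main = col, xlab = col)
}
cor(training, use = "pairwise.complete.obs")[, "TARGET"]  # correlation of each column with TARGET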
DATA PREPARATION:
In this section we will prepare our data for modeling. We can set the data type for columns like STARS and convert them into factors, which is a much more suitable data type when it comes to modeling.
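A minimal sketch of that conversion, assuming the training data frame is still named training:
training$STARS <- as.factor(training$STARS)  # treat the star rating as a categorical variable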
Imputing The Missing Values:
Since our data set has a mixture of continuous and categorical variables, we will use an imputation method that can handle both types. My personal pick is random-forest-based imputation: random forests handle both data types, and as an ensemble method they tend to produce robust estimates of the missing values.
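The log below comes from multiple imputation. A minimal sketch of that step, assuming the mice package with its random forest method (the exact chunk is not shown in the source, and the seed is an arbitrary choice):
library(mice)
imp <- mice(training, method = "rf", m = 5, maxit = 3, seed = 42)  # 3 iterations x 5 imputations, matching the log
training <- complete(imp)                                          # keep the first completed data set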
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
Let's do a quick check for any remaining missing values in the data set:
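A one-line check, assuming the imputed data has replaced training:
sum(is.na(training))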
## [1] 0
Now that our data set is ready, we can go ahead and create our models.
BUILDING MODELS:
Splitting the Data set:
Now let's split the data into training and testing portions. We will split training into partial_train and validation: partial_train contains 80% of the data from training, and the remaining 20% goes into validation, which we will use to evaluate the performance of our models.
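A minimal sketch of the 80/20 split, assuming the caret package (a base R sample() split would work just as well; the seed is arbitrary):
library(caret)
set.seed(123)
idx <- createDataPartition(training$TARGET, p = 0.8, list = FALSE)
partial_train <- training[idx, ]
validation <- training[-idx, ]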
Poisson Regression:
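The full Poisson model regresses TARGET on every remaining predictor; the fit below reproduces the call shown in the summary:
p1 <- glm(TARGET ~ ., family = "poisson", data = partial_train)
summary(p1)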
##
## Call:
## glm(formula = TARGET ~ ., family = "poisson", data = partial_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3869 -0.6120 0.0528 0.5468 2.7575
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.592e+00 2.174e-01 7.324 2.40e-13 ***
## FixedAcidity -1.379e-04 9.156e-04 -0.151 0.88026
## VolatileAcidity -3.292e-02 7.304e-03 -4.507 6.56e-06 ***
## CitricAcid 1.119e-02 6.553e-03 1.708 0.08759 .
## ResidualSugar 2.846e-05 1.696e-04 0.168 0.86670
## Chlorides -7.553e-02 1.802e-02 -4.191 2.78e-05 ***
## FreeSulfurDioxide 1.235e-04 3.812e-05 3.240 0.00120 **
## TotalSulfurDioxide 7.042e-05 2.450e-05 2.874 0.00405 **
## Density -3.029e-01 2.132e-01 -1.421 0.15544
## pH -6.771e-03 8.405e-03 -0.806 0.42047
## Sulphates -1.349e-02 6.107e-03 -2.209 0.02715 *
## Alcohol 4.374e-03 1.537e-03 2.846 0.00443 **
## LabelAppeal 1.518e-01 6.858e-03 22.134 < 2e-16 ***
## AcidIndex -9.285e-02 5.080e-03 -18.277 < 2e-16 ***
## STARS2 6.553e-01 1.491e-02 43.954 < 2e-16 ***
## STARS3 8.180e-01 1.645e-02 49.726 < 2e-16 ***
## STARS4 9.316e-01 2.325e-02 40.076 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18190 on 10237 degrees of freedom
## Residual deviance: 12155 on 10221 degrees of freedom
## AIC: 37790
##
## Number of Fisher Scoring iterations: 5
Poisson Regression Reduced:
p2 <- glm(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + FreeSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, data = partial_train, family = 'poisson')
summary(p2)
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide +
## FreeSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex +
## STARS, family = "poisson", data = partial_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3568 -0.6124 0.0531 0.5459 2.7593
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.268e+00 4.484e-02 28.268 < 2e-16 ***
## VolatileAcidity -3.342e-02 7.301e-03 -4.577 4.71e-06 ***
## Chlorides -7.558e-02 1.801e-02 -4.196 2.72e-05 ***
## TotalSulfurDioxide 7.023e-05 2.448e-05 2.869 0.00412 **
## FreeSulfurDioxide 1.230e-04 3.810e-05 3.229 0.00124 **
## Sulphates -1.369e-02 6.103e-03 -2.244 0.02484 *
## Alcohol 4.465e-03 1.536e-03 2.907 0.00364 **
## LabelAppeal 1.518e-01 6.856e-03 22.135 < 2e-16 ***
## AcidIndex -9.245e-02 4.997e-03 -18.501 < 2e-16 ***
## STARS2 6.563e-01 1.490e-02 44.059 < 2e-16 ***
## STARS3 8.188e-01 1.644e-02 49.798 < 2e-16 ***
## STARS4 9.315e-01 2.324e-02 40.079 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18190 on 10237 degrees of freedom
## Residual deviance: 12161 on 10226 degrees of freedom
## AIC: 37785
##
## Number of Fisher Scoring iterations: 5
Negative Binomial Model:
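The warnings below were produced while fitting the full negative binomial model. A minimal sketch of that fit, assuming MASS::glm.nb with the full set of predictors and the same options as the reduced model shown later:
library(MASS)
nbp1 <- glm.nb(TARGET ~ ., data = partial_train, link = log, maxit = 2000)
summary(nbp1)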
## Warning in sqrt(1/i): NaNs produced
## Warning in sqrt(1/i): NaNs produced
Negative Binomial Reduced:
nbp2 <- glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + FreeSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + STARS, data = partial_train, link = log, maxit = 2000)
summary(nbp2)
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide +
## FreeSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + STARS,
## data = partial_train, maxit = 2000, link = log, init.theta = 2.45087425e+17)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## 0.0000 0.0000 0.4522 1.5316 4.1016
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.262e+00 4.478e-02 28.190 < 2e-16 ***
## VolatileAcidity -3.340e-02 7.302e-03 -4.574 4.78e-06 ***
## Chlorides -7.508e-02 1.801e-02 -4.169 3.06e-05 ***
## TotalSulfurDioxide 7.011e-05 2.448e-05 2.864 0.00418 **
## FreeSulfurDioxide 1.217e-04 3.810e-05 3.193 0.00141 **
## Alcohol 4.458e-03 1.536e-03 2.903 0.00370 **
## LabelAppeal 1.517e-01 6.856e-03 22.120 < 2e-16 ***
## AcidIndex -9.271e-02 4.995e-03 -18.559 < 2e-16 ***
## STARS2 6.563e-01 1.490e-02 44.056 < 2e-16 ***
## STARS3 8.194e-01 1.644e-02 49.840 < 2e-16 ***
## STARS4 9.326e-01 2.324e-02 40.128 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.450874e+17) family taken to be 1)
##
## Null deviance: 18190 on 10237 degrees of freedom
## Residual deviance: 12166 on 10227 degrees of freedom
## AIC: 24
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.450874e+17
## Std. Err.: 3.560756e+14
##
## 2 x log-likelihood: 0
NOTE: The coefficients from the Poisson and negative binomial models are nearly identical, mainly because the mean and variance of the response variable are so close; the enormous estimated theta shows that the negative binomial fit effectively collapses to the Poisson.
Multiple Linear Regression:
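The multiple linear regression is fit as a Gaussian GLM so it can be compared with the other models on the same footing; the fit below reproduces the call shown in the summary:
ml1 <- glm(TARGET ~ ., family = gaussian(), data = partial_train)
summary(ml1)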
##
## Call:
## glm(formula = TARGET ~ ., family = gaussian(), data = partial_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.5639 -0.9518 0.1146 0.9943 4.3821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.468e+00 5.267e-01 8.484 < 2e-16 ***
## FixedAcidity -2.496e-04 2.232e-03 -0.112 0.910975
## VolatileAcidity -9.971e-02 1.778e-02 -5.607 2.11e-08 ***
## CitricAcid 3.572e-02 1.608e-02 2.221 0.026356 *
## ResidualSugar 1.101e-04 4.134e-04 0.266 0.790036
## Chlorides -2.313e-01 4.380e-02 -5.282 1.31e-07 ***
## FreeSulfurDioxide 3.680e-04 9.301e-05 3.957 7.65e-05 ***
## TotalSulfurDioxide 2.136e-04 5.950e-05 3.589 0.000333 ***
## Density -8.467e-01 5.190e-01 -1.631 0.102826
## pH -1.185e-02 2.039e-02 -0.581 0.560972
## Sulphates -3.642e-02 1.495e-02 -2.436 0.014883 *
## Alcohol 1.518e-02 3.733e-03 4.068 4.78e-05 ***
## LabelAppeal 4.495e-01 1.639e-02 27.422 < 2e-16 ***
## AcidIndex -2.359e-01 1.085e-02 -21.749 < 2e-16 ***
## STARS2 1.628e+00 3.261e-02 49.913 < 2e-16 ***
## STARS3 2.343e+00 3.991e-02 58.700 < 2e-16 ***
## STARS4 3.012e+00 6.691e-02 45.006 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.962023)
##
## Null deviance: 37813 on 10237 degrees of freedom
## Residual deviance: 20054 on 10221 degrees of freedom
## AIC: 35973
##
## Number of Fisher Scoring iterations: 2
Multiple Linear Regression Reduced:
ml2 <- glm(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + FreeSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, data = partial_train, family = gaussian())
summary(ml2)
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide +
## FreeSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex +
## STARS, family = gaussian(), data = partial_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.5088 -0.9478 0.1142 0.9970 4.3845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.587e+00 9.817e-02 36.541 < 2e-16 ***
## VolatileAcidity -1.012e-01 1.778e-02 -5.692 1.29e-08 ***
## Chlorides -2.318e-01 4.379e-02 -5.293 1.23e-07 ***
## TotalSulfurDioxide 2.144e-04 5.948e-05 3.605 0.000313 ***
## FreeSulfurDioxide 3.670e-04 9.298e-05 3.946 7.98e-05 ***
## Sulphates -3.696e-02 1.495e-02 -2.472 0.013434 *
## Alcohol 1.539e-02 3.731e-03 4.125 3.74e-05 ***
## LabelAppeal 4.497e-01 1.639e-02 27.433 < 2e-16 ***
## AcidIndex -2.348e-01 1.063e-02 -22.098 < 2e-16 ***
## STARS2 1.630e+00 3.258e-02 50.034 < 2e-16 ***
## STARS3 2.344e+00 3.989e-02 58.766 < 2e-16 ***
## STARS4 3.012e+00 6.692e-02 45.003 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.962636)
##
## Null deviance: 37813 on 10237 degrees of freedom
## Residual deviance: 20070 on 10226 degrees of freedom
## AIC: 35972
##
## Number of Fisher Scoring iterations: 2
Zero Inflation Poisson:
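The zero-inflated Poisson model uses all predictors for the count component and STARS alone for the zero-inflation component; the fit below reproduces the call shown in the summary, assuming zeroinfl() from the pscl package:
library(pscl)
zip1 <- zeroinfl(TARGET ~ . | STARS, data = partial_train)
summary(zip1)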
##
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = partial_train)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.38259 -0.49502 0.06398 0.49093 2.09559
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.669e+00 2.260e-01 7.385 1.52e-13 ***
## FixedAcidity 9.777e-05 9.435e-04 0.104 0.9175
## VolatileAcidity -1.534e-02 7.545e-03 -2.033 0.0420 *
## CitricAcid 3.973e-03 6.733e-03 0.590 0.5552
## ResidualSugar -7.725e-05 1.744e-04 -0.443 0.6578
## Chlorides -3.634e-02 1.872e-02 -1.941 0.0523 .
## FreeSulfurDioxide 4.443e-05 3.871e-05 1.148 0.2511
## TotalSulfurDioxide -7.829e-07 2.453e-05 -0.032 0.9745
## Density -3.177e-01 2.215e-01 -1.434 0.1515
## pH 3.379e-03 8.685e-03 0.389 0.6973
## Sulphates -1.669e-03 6.351e-03 -0.263 0.7927
## Alcohol 6.882e-03 1.567e-03 4.392 1.12e-05 ***
## LabelAppeal 2.268e-01 7.168e-03 31.639 < 2e-16 ***
## AcidIndex -3.718e-02 5.717e-03 -6.504 7.82e-11 ***
## STARS2 1.470e-01 1.611e-02 9.125 < 2e-16 ***
## STARS3 2.503e-01 1.749e-02 14.310 < 2e-16 ***
## STARS4 3.392e-01 2.440e-02 13.903 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.38037 0.03323 -11.447 < 2e-16 ***
## STARS2 -2.75256 0.11609 -23.712 < 2e-16 ***
## STARS3 -19.25488 529.84908 -0.036 0.971011
## STARS4 -5.51533 1.42253 -3.877 0.000106 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 41
## Log-likelihood: -1.726e+04 on 21 Df
Zero Inflation Poisson Reduced:
zip2 <- zeroinfl(TARGET ~ VolatileAcidity + Chlorides + Alcohol + LabelAppeal + AcidIndex | STARS, data = partial_train)
summary(zip2)
##
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + Chlorides + Alcohol + LabelAppeal +
## AcidIndex | STARS, data = partial_train)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.2263 -0.4010 0.1354 0.4714 2.2612
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.515934 0.046473 32.619 < 2e-16 ***
## VolatileAcidity -0.018742 0.007499 -2.499 0.0125 *
## Chlorides -0.036041 0.018637 -1.934 0.0531 .
## Alcohol 0.008768 0.001553 5.645 1.65e-08 ***
## LabelAppeal 0.272548 0.006562 41.536 < 2e-16 ***
## AcidIndex -0.039840 0.005591 -7.126 1.04e-12 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.33864 0.03212 -10.54 <2e-16 ***
## STARS2 -2.79757 0.11348 -24.65 <2e-16 ***
## STARS3 -13.72141 29.86030 -0.46 0.646
## STARS4 -12.66981 57.62269 -0.22 0.826
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 22
## Log-likelihood: -1.74e+04 on 10 Df
SELECTING MODELS AND EVALUATION:
Model Selection:
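Below we generate predictions from each model on the validation set and score them with mean absolute error (MAE) and mean squared error (MSE); the mae() and mse() helpers are assumed to come from the Metrics package.
library(Metrics)  # assumed source of mae() and mse()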
p1_pred <- predict(p1, newdata = validation, type = "response")
p2_pred <- predict(p2, newdata = validation, type = "response")
nbp1_pred <- predict(nbp1, newdata = validation, type = "response")
nbp2_pred <- predict(nbp2 , newdata = validation, type = "response")
ml1_pred <- predict(ml1, newdata = validation, type = "response")
ml2_pred <- predict(ml2, newdata = validation, type = "response")
zip1_pred <- predict(zip1, newdata = validation, type = "response")
zip2_pred <- predict(zip2, newdata = validation, type = "response")
p1_mae <- mae(validation$TARGET, p1_pred)
p1_mse <- mse(validation$TARGET, p1_pred)
p2_mae <- mae(validation$TARGET, p2_pred)
p2_mse <- mse(validation$TARGET, p2_pred)
nbp1_mae <- mae(validation$TARGET, nbp1_pred)
nbp1_mse <- mse(validation$TARGET, nbp1_pred)
nbp2_mae <- mae(validation$TARGET, nbp2_pred)
nbp2_mse <- mse(validation$TARGET, nbp2_pred)
ml1_mae <- mae(validation$TARGET, ml1_pred)
ml1_mse <- mse(validation$TARGET, ml1_pred)
ml2_mae <- mae(validation$TARGET, ml2_pred)
ml2_mse <- mse(validation$TARGET, ml2_pred)
zip1_mae <- mae(validation$TARGET, zip1_pred)
zip1_mse <- mse(validation$TARGET, zip1_pred)
zip2_mae <- mae(validation$TARGET, zip2_pred)
zip2_mse <- mse(validation$TARGET, zip2_pred)
model_names <- c("Poisson", "Poisson Reduced", "Negative Binomial", "Negative Binomial Reduced", "Multiple Linear Regression", "Multiple Linear Regression Reduced", "Zero Inflation Poisson", "Zero Inflation Poisson Reduced")
mae_values <- c(p1_mae, p2_mae, nbp1_mae,nbp2_mae, ml1_mae, ml2_mae, zip1_mae, zip2_mae)
mse_values <- c(p1_mse, p2_mse, nbp1_mse,nbp2_mse, ml1_mse, ml2_mse, zip1_mse, zip2_mse)
result_df <- data.frame(Model = model_names, MAE = mae_values, MSE = mse_values)
knitr::kable(result_df)
| Model | MAE | MSE |
|---|---|---|
| Poisson | 1.141460 | 1.967859 |
| Poisson Reduced | 1.140504 | 1.967155 |
| Negative Binomial | 1.141460 | 1.967859 |
| Negative Binomial Reduced | 1.140341 | 1.968341 |
| Multiple Linear Regression | 1.132244 | 1.976968 |
| Multiple Linear Regression Reduced | 1.132479 | 1.978670 |
| Zero Inflation Poisson | 1.142168 | 1.993674 |
| Zero Inflation Poisson Reduced | 1.178747 | 2.171632 |
Predictions:
Let’s load the testing data set for predictions:
testing <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/predict-the-number-of-cases-of-wine-that-will-be-sold-given-certain-properties-of-the-wine/main/wine-evaluation-data.csv")
Displaying the first few rows of our testing data set:
| IN | TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | NA | 5.4 | -0.860 | 0.27 | -10.7 | 0.092 | 23 | 398 | 0.98527 | 5.02 | 0.64 | 12.30 | -1 | 6 | NA |
| 9 | NA | 12.4 | 0.385 | -0.76 | -19.7 | 1.169 | -37 | 68 | 0.99048 | 3.37 | 1.09 | 16.00 | 0 | 6 | 2 |
| 10 | NA | 7.2 | 1.750 | 0.17 | -33.0 | 0.065 | 9 | 76 | 1.04641 | 4.61 | 0.68 | 8.55 | 0 | 8 | 1 |
| 18 | NA | 6.2 | 0.100 | 1.80 | 1.0 | -0.179 | 104 | 89 | 0.98877 | 3.20 | 2.11 | 12.30 | -1 | 8 | 1 |
| 21 | NA | 11.4 | 0.210 | 0.28 | 1.2 | 0.038 | 70 | 53 | 1.02899 | 2.54 | -0.07 | 4.80 | 0 | 10 | NA |
| 30 | NA | 17.6 | 0.040 | -1.15 | 1.4 | 0.535 | -250 | 140 | 0.95028 | 3.06 | -0.02 | 11.40 | 1 | 8 | 4 |
Removing Unwanted Columns:
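A minimal sketch of dropping the identifier and empty response columns, assuming they are named IN and TARGET as in the preview above:
testing <- testing[, !(names(testing) %in% c("IN", "TARGET"))]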
Looking for any missing values in our testing data set:
## FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## 0 0 0 168
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density
## 138 152 157 0
## pH Sulphates Alcohol LabelAppeal
## 104 310 185 0
## AcidIndex STARS
## 0 841
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
We are going to use the reduced Poisson model to predict the outcomes for our testing set.
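A minimal sketch of scoring the test set with the reduced Poisson model p2, assuming the imputed test data is still in testing; test_pred is an illustrative name, and the factor conversion mirrors what was done on the training data:
testing$STARS <- as.factor(testing$STARS)  # match the factor encoding used when fitting p2
test_pred <- predict(p2, newdata = testing, type = "response")
hist(test_pred, main = "Predicted cases of wine sold", xlab = "Predicted TARGET")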
Here is the histogram of our predictions
CONCLUSION:
This study focused on modeling the count of wine cases sold to restaurants using count regression. We started with a data set that required some cleaning and pre-processing before it could be fed to an algorithm to train a model. After cleaning and preparing the data, we tried multiple count regression techniques: a Poisson regression model, a negative binomial model, multiple linear regression, and a zero-inflated Poisson model. We compared the results of all the models based on mean squared error (MSE) and mean absolute error (MAE), and selected the one with relatively few predictors and acceptable MSE and MAE values. We then loaded the evaluation data set and performed predictions using the reduced Poisson regression model. The results from the different models were very similar to one another, partly because the mean and variance of the response are close and partly because of the pre-processing, where random forest imputation was used to fill in the missing values.