Background

Given a set of metrics describing the quality of different wines, can we predict the number of sales of each wine? Since the target is a count response variable, we will build models appropriate for count regression. After evaluating our set of models, we will select the best one and predict the number of sales in a wine data set the model has not yet seen.

Data Exploration

Statistical Summary of Variables

##     ï..INDEX         TARGET       FixedAcidity     VolatileAcidity  
##  Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
##  1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
##  Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
##  Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
##  3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
##  Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
##                                                                     
##    CitricAcid      ResidualSugar        Chlorides       FreeSulfurDioxide
##  Min.   :-3.2400   Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00  
##  1st Qu.: 0.0300   1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00  
##  Median : 0.3100   Median :   3.900   Median : 0.0460   Median :  30.00  
##  Mean   : 0.3084   Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85  
##  3rd Qu.: 0.5800   3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00  
##  Max.   : 3.8600   Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00  
##                    NA's   :616        NA's   :638       NA's   :647      
##  TotalSulfurDioxide    Density             pH          Sulphates      
##  Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
##  1st Qu.:  27.0     1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800  
##  Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
##  Mean   : 120.7     Mean   :0.9942   Mean   :3.208   Mean   : 0.5271  
##  3rd Qu.: 208.0     3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600  
##  Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
##  NA's   :682                         NA's   :395     NA's   :1210     
##     Alcohol       LabelAppeal          AcidIndex          STARS      
##  Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.: 9.00   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :10.49   Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.:12.40   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##  NA's   :653                                          NA's   :3359

Description of Variables

## 'data.frame':    12795 obs. of  16 variables:
##  $ ï..INDEX          : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...
  • There are 12,795 observations and 16 columns.
  • Of these columns, 14 are predictors.
  • Of the remaining 2, there is an index column with invalid characters, so it will need to be renamed. The column with the response variable is named TARGET.
  • TARGET, the response variable, is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine.
  • Aside from STARS and LabelAppeal, the predictors are mostly chemical metrics of the wines.
  • Values in STARS are ratings given to the wines by experts, whereas values in LabelAppeal are marketing scores that indicate the visual appeal of the wine label to customers. Note that LabelAppeal is not a score given by customers themselves, but by marketing tools that have used other sources to make assumptions.
  • The target has a small range, 0-8. This indicates that fewer than 10 cases of each wine have been sold.
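
The garbled index name can be repaired as sketched below. The stray characters in front of INDEX are typically a UTF-8 byte-order mark that was read as text. The toy data frame is hypothetical and only mimics the first rows of the structure shown above.

```r
# Toy stand-in for the raw data: the first column name carries the
# byte-order-mark debris seen in the str() output.
wine <- data.frame(c(1L, 2L, 4L), c(3L, 3L, 5L))
names(wine) <- c("\u00ef..INDEX", "TARGET")

# Rename by position so the code does not depend on the garbled name.
names(wine)[1] <- "INDEX"

# Alternative (assumption: the file is UTF-8 with a BOM): read it with
# read.csv(path, fileEncoding = "UTF-8-BOM") so the prefix never appears.
print(names(wine))
```

Renaming by position is a deliberate choice here, since the exact garbled prefix can differ between platforms and locales.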

Missing values

##           ï..INDEX             TARGET       FixedAcidity    VolatileAcidity 
##           0.000000           0.000000           0.000000           0.000000 
##         CitricAcid      ResidualSugar          Chlorides  FreeSulfurDioxide 
##           0.000000           4.814381           4.986323           5.056663 
## TotalSulfurDioxide            Density                 pH          Sulphates 
##           5.330207           0.000000           3.087143           9.456819 
##            Alcohol        LabelAppeal          AcidIndex              STARS 
##           5.103556           0.000000           0.000000          26.252442

Over 25% of the wines don’t have a value for STARS, meaning they have not been rated by experts. What is the relationship between lack of a rating and number of sales?
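
A minimal sketch of how both the missing-value percentages and the question above could be checked. The values in this toy data frame are made up for illustration and say nothing about the real data; only the column names TARGET and STARS come from the report.

```r
# Toy data (hypothetical values) with the same column names as the report.
wine <- data.frame(
  TARGET = c(3, 3, 5, 0, 0, 4, 0, 6),
  STARS  = c(2, 3, 3, NA, NA, 3, NA, 4)
)

# Percentage of missing values per column.
pct_missing <- colMeans(is.na(wine)) * 100

# Mean sales for rated versus unrated wines. tapply() orders the
# groups FALSE (rated) then TRUE (unrated).
mean_by_rating <- tapply(wine$TARGET, is.na(wine$STARS), mean)
names(mean_by_rating) <- c("rated", "unrated")

print(pct_missing)
print(mean_by_rating)
```

If unrated wines turn out to sell differently on the real data, the missingness itself is informative, and an indicator such as `is.na(STARS)` may be worth keeping as a predictor rather than dropping those rows.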

Correlation Plot

  • TARGET is positively correlated with both STARS and LabelAppeal. This makes sense: the more appealing a wine label, the more likely a customer is to buy the wine, and a high expert rating suggests that other people will like the wine as well and decide to buy it.
  • There is a slight negative correlation between AcidIndex and TARGET, and it is interesting that this does not appear to be the case for pH and TARGET. Since pH is also a metric for acidity, we might have expected a relationship to exist.

Distribution of Variables

Histograms of Distributions of Predictors & Target

Note that TARGET contains many 0 values. This means that many of the wines have 0 sales.

Upon further inspection, we determine that 21% of the wines have 0 sales. We will explore how much the frequency of zeroes in TARGET affects our models later.
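
As a rough check on whether 21% zeros is unusual, the share of zeros can be compared with what a single Poisson distribution with the same mean would produce. This sketch uses only the summary figures reported above (mean 3.029, 21% zeros) and ignores the predictors, so it is a back-of-the-envelope comparison rather than a formal test.

```r
# Summary values taken from the report.
target_mean   <- 3.029
observed_zero <- 0.21

# Probability of a zero count under a Poisson with that mean: exp(-lambda).
expected_zero <- dpois(0, lambda = target_mean)

# How many times more zeros are observed than a plain Poisson would predict.
zero_ratio <- observed_zero / expected_zero
cat(sprintf("expected %.3f, observed %.2f\n", expected_zero, observed_zero))
```

A plain Poisson with this mean puts only about 5% of its mass at zero, so the observed share is several times larger, which is the kind of pattern that zero-inflated or hurdle models are designed for.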

Scatterplots of Target vs Predictors

Because AcidIndex, STARS, LabelAppeal, and the target variable each take only a few distinct values, their relationships may be easier to see if the data are jittered. Also, how large is the difference between pH and AcidIndex?

  • There is a clear positive relationship between LabelAppeal, STARS, and TARGET.
  • The AcidIndex plot reveals that most of the wines have an AcidIndex between 5 and 10, toward the lower end of its range (4 to 17).
  • The pH of most of the wine samples is also low, between 2 and 5.

Build Models

Model #1: Full Multiple Linear Regression

## 
## Call:
## lm(formula = TARGET ~ ., data = select(wine.train, -"INDEX"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0614 -0.5143  0.1240  0.7170  3.2419 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.563e+00  5.530e-01   8.251  < 2e-16 ***
## FixedAcidity        1.685e-03  2.319e-03   0.727   0.4675    
## VolatileAcidity    -9.466e-02  1.846e-02  -5.129 3.00e-07 ***
## CitricAcid         -4.836e-03  1.675e-02  -0.289   0.7728    
## ResidualSugar      -2.513e-04  4.276e-04  -0.588   0.5567    
## Chlorides          -1.134e-01  4.546e-02  -2.494   0.0126 *  
## FreeSulfurDioxide   2.264e-04  9.711e-05   2.332   0.0198 *  
## TotalSulfurDioxide  7.810e-05  6.288e-05   1.242   0.2142    
## Density            -1.281e+00  5.435e-01  -2.357   0.0185 *  
## pH                 -9.441e-03  2.121e-02  -0.445   0.6563    
## Sulphates          -1.727e-02  1.558e-02  -1.109   0.2676    
## Alcohol             1.653e-02  3.887e-03   4.252 2.15e-05 ***
## LabelAppeal         6.442e-01  1.743e-02  36.947  < 2e-16 ***
## AcidIndex          -1.649e-01  1.235e-02 -13.346  < 2e-16 ***
## STARS               7.278e-01  1.710e-02  42.571  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.153 on 6421 degrees of freedom
##   (6359 observations deleted due to missingness)
## Multiple R-squared:  0.445,  Adjusted R-squared:  0.4438 
## F-statistic: 367.8 on 14 and 6421 DF,  p-value: < 2.2e-16

44% of the variability observed in the number of sales is explained by the model.

Model #2: Multiple regression with Manually Selected Features

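We keep the predictors that were highly significant in the previous model (VolatileAcidity, Alcohol, LabelAppeal, AcidIndex, and STARS) and also retain Sulphates, even though it was not significant there (p = 0.27):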

  • VolatileAcidity
  • Sulphates
  • Alcohol
  • LabelAppeal
  • AcidIndex
  • STARS
## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + Sulphates + Alcohol + 
##     LabelAppeal + AcidIndex + STARS, data = wine.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0351 -0.5123  0.1289  0.7224  3.1485 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.282082   0.100168  32.766  < 2e-16 ***
## VolatileAcidity -0.090141   0.016389  -5.500 3.91e-08 ***
## Sulphates       -0.014904   0.013840  -1.077    0.282    
## Alcohol          0.020198   0.003436   5.879 4.29e-09 ***
## LabelAppeal      0.653326   0.015456  42.269  < 2e-16 ***
## AcidIndex       -0.167059   0.010937 -15.274  < 2e-16 ***
## STARS            0.719125   0.015136  47.511  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.152 on 8127 degrees of freedom
##   (4661 observations deleted due to missingness)
## Multiple R-squared:  0.4472, Adjusted R-squared:  0.4468 
## F-statistic:  1096 on 6 and 8127 DF,  p-value: < 2.2e-16

The adjusted \(R^{2}\) is hardly better (0.4468 versus 0.4438 for Model 1), as seen in the output below:

## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.447         0.447  1.15     1096.       0     6 -12687. 25391. 25447.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Evaluation of the Multiple Linear Regression Models

Multicollinearity occurs when independent variables are correlated with one another. The variance inflation factor (VIF) is a metric for detecting multicollinearity among the variables in a model. A score over 5 is considered severe: the standard error of that variable's coefficient is inflated, so the variable appears less statistically significant than it otherwise would. If there is a problem with multicollinearity, one solution is to carefully trim the model by removing some of the offending variables.
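
To make the definition concrete, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below computes it by hand on simulated data; the variable names x1, x2, x3 are hypothetical. The report does not show which function produced its scores, and a packaged implementation such as `car::vif` would normally be used instead.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # strongly correlated with x1
x3 <- rnorm(n)                  # independent of the others
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing
# predictor j on all the other predictors.
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

In this simulation x1 and x2 receive large VIFs while x3 stays near 1, which is the pattern the wine predictors show: all of their scores are close to 1.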

Variance Inflation and Diagnostic Plots for Model 1

##       FixedAcidity    VolatileAcidity         CitricAcid      ResidualSugar 
##           1.027295           1.003524           1.006280           1.002597 
##          Chlorides  FreeSulfurDioxide TotalSulfurDioxide            Density 
##           1.003084           1.004023           1.002931           1.004776 
##                 pH          Sulphates            Alcohol        LabelAppeal 
##           1.004116           1.004483           1.009935           1.117057 
##          AcidIndex              STARS 
##           1.048563           1.134199

Variance Inflation and Diagnostic Plots for Model 2

## VolatileAcidity       Sulphates         Alcohol     LabelAppeal       AcidIndex 
##        1.001844        1.000869        1.006664        1.126009        1.012806 
##           STARS 
##        1.138001

In each model, the VIF scores are very close to 1, which is good. Still, there is overall evidence that a multiple linear regression is not the right fit for this data.

Model #3: Full Poisson

## 
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), family = poisson, 
##     data = wine.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2212  -0.2662   0.0461   0.3943   1.7274  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               7.306e-01  5.130e-01   1.424  0.15441    
## FixedAcidity              5.803e-04  1.055e-03   0.550  0.58216    
## VolatileAcidity          -2.371e-02  8.369e-03  -2.833  0.00461 ** 
## CitricAcid               -2.313e-03  7.581e-03  -0.305  0.76034    
## ResidualSugar            -7.068e-05  1.943e-04  -0.364  0.71598    
## Chlorides                -3.261e-02  2.056e-02  -1.587  0.11260    
## FreeSulfurDioxide         5.617e-05  4.399e-05   1.277  0.20173    
## TotalSulfurDioxide        1.985e-05  2.858e-05   0.695  0.48734    
## Density                  -3.803e-01  2.464e-01  -1.543  0.12274    
## pH                       -1.103e-03  9.614e-03  -0.115  0.90867    
## Sulphates                -5.343e-03  7.055e-03  -0.757  0.44883    
## Alcohol                   4.762e-03  1.774e-03   2.685  0.00726 ** 
## as.factor(LabelAppeal)-1  2.701e-01  5.337e-02   5.061 4.18e-07 ***
## as.factor(LabelAppeal)0   4.943e-01  5.205e-02   9.497  < 2e-16 ***
## as.factor(LabelAppeal)1   6.493e-01  5.277e-02  12.305  < 2e-16 ***
## as.factor(LabelAppeal)2   7.637e-01  5.840e-02  13.077  < 2e-16 ***
## as.factor(AcidIndex)5     1.186e-01  4.553e-01   0.260  0.79455    
## as.factor(AcidIndex)6     1.902e-01  4.487e-01   0.424  0.67166    
## as.factor(AcidIndex)7     1.536e-01  4.484e-01   0.343  0.73190    
## as.factor(AcidIndex)8     1.286e-01  4.485e-01   0.287  0.77430    
## as.factor(AcidIndex)9     7.794e-02  4.488e-01   0.174  0.86215    
## as.factor(AcidIndex)10   -1.910e-02  4.502e-01  -0.042  0.96617    
## as.factor(AcidIndex)11   -2.417e-01  4.540e-01  -0.532  0.59456    
## as.factor(AcidIndex)12   -2.259e-01  4.605e-01  -0.490  0.62379    
## as.factor(AcidIndex)13   -1.328e-01  4.634e-01  -0.287  0.77447    
## as.factor(AcidIndex)14   -1.998e-01  4.813e-01  -0.415  0.67806    
## as.factor(AcidIndex)15    2.322e-02  5.591e-01   0.042  0.96687    
## as.factor(AcidIndex)16   -2.005e-01  6.341e-01  -0.316  0.75185    
## as.factor(AcidIndex)17    7.303e-02  6.351e-01   0.115  0.90845    
## as.factor(STARS)2         3.175e-01  1.744e-02  18.198  < 2e-16 ***
## as.factor(STARS)3         4.320e-01  1.899e-02  22.747  < 2e-16 ***
## as.factor(STARS)4         5.519e-01  2.692e-02  20.504  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5844.1  on 6435  degrees of freedom
## Residual deviance: 3890.7  on 6404  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: 23087
## 
## Number of Fisher Scoring iterations: 5

Test Dispersion on Model 3

## 
##  Overdispersion test
## 
## data:  Poisson_Model1
## z = -46.626, p-value = 1
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##      alpha 
## -0.5779479

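Since the p-value is 1, we fail to reject the null hypothesis in favor of over-dispersion, which is good. The negative estimate of alpha (-0.578) instead suggests under-dispersion.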
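
Output of this form is produced by, for example, `AER::dispersiontest`; that is an assumption, since the report does not show the call. A simpler cross-check that needs only base R is the Pearson dispersion estimate, sketched here on simulated under-dispersed counts (the data and variable names are hypothetical).

```r
set.seed(42)
n <- 500
x <- rnorm(n)

# Binomial counts have variance below their mean, so they are
# under-dispersed relative to a Poisson with the same mean.
y <- rbinom(n, size = 8, prob = plogis(-0.4 + 0.3 * x))

fit <- glm(y ~ x, family = poisson)

# Pearson dispersion estimate: about 1 for Poisson-like data,
# above 1 for over-dispersion, below 1 for under-dispersion.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)
```

This is the same quantity that `summary()` reports as the dispersion parameter for a quasi-Poisson fit.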

Model #4: Poisson With Selected Predictors

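We refit the Poisson model on a reduced set of predictors, dropping FixedAcidity, CitricAcid, ResidualSugar, Density, and pH. Not every retained predictor was significant in Model 3 (for example, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, and Sulphates were not):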

  • VolatileAcidity
  • Chlorides
  • FreeSulfurDioxide
  • TotalSulfurDioxide
  • Sulphates
  • Alcohol
  • LabelAppeal
  • AcidIndex
  • STARS
## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), family = poisson, 
##     data = wine.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2167  -0.2669   0.0479   0.3933   2.0507  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               3.750e-01  4.512e-01   0.831  0.40593    
## VolatileAcidity          -2.133e-02  8.008e-03  -2.663  0.00775 ** 
## Chlorides                -2.800e-02  1.973e-02  -1.420  0.15571    
## FreeSulfurDioxide         7.109e-05  4.206e-05   1.690  0.09101 .  
## TotalSulfurDioxide        2.420e-05  2.745e-05   0.881  0.37807    
## Sulphates                -4.931e-03  6.766e-03  -0.729  0.46614    
## Alcohol                   4.738e-03  1.703e-03   2.782  0.00541 ** 
## as.factor(LabelAppeal)-1  2.515e-01  5.140e-02   4.892 9.96e-07 ***
## as.factor(LabelAppeal)0   4.722e-01  5.016e-02   9.414  < 2e-16 ***
## as.factor(LabelAppeal)1   6.277e-01  5.086e-02  12.340  < 2e-16 ***
## as.factor(LabelAppeal)2   7.395e-01  5.605e-02  13.194  < 2e-16 ***
## as.factor(AcidIndex)5     1.177e-01  4.545e-01   0.259  0.79571    
## as.factor(AcidIndex)6     1.790e-01  4.482e-01   0.399  0.68958    
## as.factor(AcidIndex)7     1.493e-01  4.479e-01   0.333  0.73883    
## as.factor(AcidIndex)8     1.213e-01  4.480e-01   0.271  0.78663    
## as.factor(AcidIndex)9     6.778e-02  4.483e-01   0.151  0.87982    
## as.factor(AcidIndex)10   -1.896e-02  4.496e-01  -0.042  0.96636    
## as.factor(AcidIndex)11   -2.527e-01  4.533e-01  -0.557  0.57724    
## as.factor(AcidIndex)12   -2.125e-01  4.590e-01  -0.463  0.64335    
## as.factor(AcidIndex)13   -1.314e-01  4.622e-01  -0.284  0.77627    
## as.factor(AcidIndex)14   -3.701e-01  4.807e-01  -0.770  0.44131    
## as.factor(AcidIndex)15    1.800e-02  5.585e-01   0.032  0.97430    
## as.factor(AcidIndex)16   -1.916e-01  6.333e-01  -0.303  0.76226    
## as.factor(AcidIndex)17    3.635e-02  6.337e-01   0.057  0.95425    
## as.factor(STARS)2         3.226e-01  1.679e-02  19.210  < 2e-16 ***
## as.factor(STARS)3         4.381e-01  1.826e-02  23.999  < 2e-16 ***
## as.factor(STARS)4         5.566e-01  2.565e-02  21.701  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6339.3  on 6952  degrees of freedom
## Residual deviance: 4216.8  on 6926  degrees of freedom
##   (5842 observations deleted due to missingness)
## AIC: 24948
## 
## Number of Fisher Scoring iterations: 5

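Both the residual deviance (4216.8 on 6926 degrees of freedom, versus 3890.7 on 6404) and the AIC (24948 versus 23087) are higher than for Model 3, so by these measures Model 3 (Poisson_Model1) appears to fit better than Model 4 (Poisson_Model2). Note, however, that the two models were fit to different numbers of observations (6436 versus 6953) because of missing values, so their deviance and AIC are not strictly comparable.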
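
AIC is only comparable between models fit to the same rows. One way to put two fits on an equal footing is to restrict both to the cases that are complete for the larger model, as in this sketch on simulated data (the variables x1, x2, and y are hypothetical).

```r
set.seed(7)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rpois(n, exp(0.5 + 0.3 * d$x1))
d$x2[sample(n, 60)] <- NA       # x2 is missing for some rows

# Fitted naively, the two models use different numbers of rows.
full_naive    <- glm(y ~ x1 + x2, family = poisson, data = d)
reduced_naive <- glm(y ~ x1,      family = poisson, data = d)

# Restrict both fits to rows that are complete for the larger model.
cc      <- d[complete.cases(d), ]
full    <- glm(y ~ x1 + x2, family = poisson, data = cc)
reduced <- glm(y ~ x1,      family = poisson, data = cc)

print(c(naive_full = nobs(full_naive), naive_reduced = nobs(reduced_naive)))
print(AIC(full, reduced))
```

The trade-off is that the reduced model no longer uses every row available to it, so imputing the missing values is another option when many rows would be lost.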

Since the residual deviance is smaller than the residual degrees of freedom (4216.8 / 6926 is about 0.61), the data appear to be under-dispersed.

Test Dispersion on Model 4

## 
##  Overdispersion test
## 
## data:  Poisson_Model2
## z = -48.265, p-value = 1
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##      alpha 
## -0.5757886

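Since the p-value is 1, we again fail to reject the null hypothesis in favor of over-dispersion, which is good. The negative estimate of alpha (-0.576) instead suggests under-dispersion.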

Model #5: Full Negative Binomial

## 
## Call:
## glm.nb(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), data = wine.train, 
##     init.theta = 134433.0376, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2212  -0.2662   0.0461   0.3943   1.7274  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               7.306e-01  5.131e-01   1.424  0.15442    
## FixedAcidity              5.803e-04  1.055e-03   0.550  0.58216    
## VolatileAcidity          -2.371e-02  8.370e-03  -2.833  0.00461 ** 
## CitricAcid               -2.313e-03  7.581e-03  -0.305  0.76034    
## ResidualSugar            -7.068e-05  1.943e-04  -0.364  0.71599    
## Chlorides                -3.261e-02  2.056e-02  -1.587  0.11260    
## FreeSulfurDioxide         5.617e-05  4.400e-05   1.277  0.20174    
## TotalSulfurDioxide        1.985e-05  2.858e-05   0.695  0.48734    
## Density                  -3.803e-01  2.464e-01  -1.543  0.12274    
## pH                       -1.103e-03  9.614e-03  -0.115  0.90867    
## Sulphates                -5.343e-03  7.055e-03  -0.757  0.44884    
## Alcohol                   4.762e-03  1.774e-03   2.685  0.00726 ** 
## as.factor(LabelAppeal)-1  2.701e-01  5.337e-02   5.060 4.18e-07 ***
## as.factor(LabelAppeal)0   4.943e-01  5.205e-02   9.497  < 2e-16 ***
## as.factor(LabelAppeal)1   6.493e-01  5.277e-02  12.305  < 2e-16 ***
## as.factor(LabelAppeal)2   7.637e-01  5.840e-02  13.077  < 2e-16 ***
## as.factor(AcidIndex)5     1.186e-01  4.553e-01   0.260  0.79455    
## as.factor(AcidIndex)6     1.902e-01  4.487e-01   0.424  0.67167    
## as.factor(AcidIndex)7     1.536e-01  4.484e-01   0.343  0.73190    
## as.factor(AcidIndex)8     1.286e-01  4.485e-01   0.287  0.77430    
## as.factor(AcidIndex)9     7.794e-02  4.489e-01   0.174  0.86215    
## as.factor(AcidIndex)10   -1.910e-02  4.502e-01  -0.042  0.96617    
## as.factor(AcidIndex)11   -2.417e-01  4.540e-01  -0.532  0.59456    
## as.factor(AcidIndex)12   -2.259e-01  4.605e-01  -0.490  0.62380    
## as.factor(AcidIndex)13   -1.328e-01  4.634e-01  -0.287  0.77447    
## as.factor(AcidIndex)14   -1.998e-01  4.813e-01  -0.415  0.67806    
## as.factor(AcidIndex)15    2.322e-02  5.591e-01   0.042  0.96687    
## as.factor(AcidIndex)16   -2.005e-01  6.341e-01  -0.316  0.75185    
## as.factor(AcidIndex)17    7.303e-02  6.351e-01   0.115  0.90845    
## as.factor(STARS)2         3.175e-01  1.744e-02  18.198  < 2e-16 ***
## as.factor(STARS)3         4.320e-01  1.899e-02  22.747  < 2e-16 ***
## as.factor(STARS)4         5.519e-01  2.692e-02  20.504  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(134433) family taken to be 1)
## 
##     Null deviance: 5843.9  on 6435  degrees of freedom
## Residual deviance: 3890.6  on 6404  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: 23089
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  134433 
##           Std. Err.:  217492 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -23023.41

Model #6: Negative Binomial with Selected Variables

We keep the same reduced set of predictors used in Model 4:

  • VolatileAcidity
  • Chlorides
  • FreeSulfurDioxide
  • TotalSulfurDioxide
  • Sulphates
  • Alcohol
  • LabelAppeal
  • AcidIndex
  • STARS
## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), data = wine.train, 
##     init.theta = 133561.251, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2167  -0.2669   0.0479   0.3933   2.0506  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               3.750e-01  4.512e-01   0.831  0.40593    
## VolatileAcidity          -2.133e-02  8.008e-03  -2.663  0.00775 ** 
## Chlorides                -2.800e-02  1.973e-02  -1.420  0.15572    
## FreeSulfurDioxide         7.109e-05  4.206e-05   1.690  0.09101 .  
## TotalSulfurDioxide        2.420e-05  2.745e-05   0.881  0.37807    
## Sulphates                -4.931e-03  6.766e-03  -0.729  0.46614    
## Alcohol                   4.738e-03  1.703e-03   2.782  0.00541 ** 
## as.factor(LabelAppeal)-1  2.515e-01  5.140e-02   4.892 9.96e-07 ***
## as.factor(LabelAppeal)0   4.722e-01  5.016e-02   9.414  < 2e-16 ***
## as.factor(LabelAppeal)1   6.277e-01  5.086e-02  12.340  < 2e-16 ***
## as.factor(LabelAppeal)2   7.395e-01  5.605e-02  13.194  < 2e-16 ***
## as.factor(AcidIndex)5     1.177e-01  4.545e-01   0.259  0.79572    
## as.factor(AcidIndex)6     1.790e-01  4.482e-01   0.399  0.68958    
## as.factor(AcidIndex)7     1.493e-01  4.479e-01   0.333  0.73883    
## as.factor(AcidIndex)8     1.213e-01  4.480e-01   0.271  0.78663    
## as.factor(AcidIndex)9     6.778e-02  4.483e-01   0.151  0.87982    
## as.factor(AcidIndex)10   -1.896e-02  4.496e-01  -0.042  0.96636    
## as.factor(AcidIndex)11   -2.527e-01  4.533e-01  -0.557  0.57725    
## as.factor(AcidIndex)12   -2.125e-01  4.590e-01  -0.463  0.64335    
## as.factor(AcidIndex)13   -1.314e-01  4.622e-01  -0.284  0.77628    
## as.factor(AcidIndex)14   -3.701e-01  4.807e-01  -0.770  0.44132    
## as.factor(AcidIndex)15    1.799e-02  5.585e-01   0.032  0.97430    
## as.factor(AcidIndex)16   -1.916e-01  6.333e-01  -0.303  0.76226    
## as.factor(AcidIndex)17    3.635e-02  6.337e-01   0.057  0.95425    
## as.factor(STARS)2         3.226e-01  1.679e-02  19.209  < 2e-16 ***
## as.factor(STARS)3         4.381e-01  1.826e-02  23.999  < 2e-16 ***
## as.factor(STARS)4         5.566e-01  2.565e-02  21.700  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(133561.3) family taken to be 1)
## 
##     Null deviance: 6339.2  on 6952  degrees of freedom
## Residual deviance: 4216.8  on 6926  degrees of freedom
##   (5842 observations deleted due to missingness)
## AIC: 24950
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  133561 
##           Std. Err.:  206991 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -24894.47

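Model 5 has the lower AIC (23089 versus 24950 for Model 6), which suggests it is the better of the two, although the models were again fit to different numbers of observations.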
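
Both negative binomial fits report a very large theta together with a warning that the iteration limit was reached. The sketch below illustrates why that matters: as theta grows, the negative binomial variance mu + mu^2 / theta approaches mu and the distribution approaches the Poisson. The mean of 3 is a hypothetical value chosen to be close to the average of TARGET; theta is the value reported for Model 5.

```r
mu    <- 3        # hypothetical mean, close to the average of TARGET
theta <- 134433   # theta reported for Model 5
x     <- 0:8      # observed range of TARGET

# With a very large theta, the negative binomial pmf is nearly
# identical to the Poisson pmf with the same mean.
p_pois  <- dpois(x, lambda = mu)
p_nb    <- dnbinom(x, size = theta, mu = mu)
max_gap <- max(abs(p_nb - p_pois))

# For contrast, a small theta gives a clearly different distribution.
gap_small_theta <- max(abs(dnbinom(x, size = 1, mu = mu) - p_pois))

print(c(large_theta = max_gap, small_theta = gap_small_theta))
```

This is consistent with the negative binomial summaries above matching the Poisson ones almost digit for digit.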

Model #7: Full Quasi-Poisson

Since the data set shows under-dispersion, it is a good idea to fit a Quasi-Poisson regression model and check whether the standard error estimates for the regression coefficients change.

## 
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), family = quasipoisson, 
##     data = wine.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2212  -0.2662   0.0461   0.3943   1.7274  
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               7.306e-01  3.338e-01   2.189   0.0287 *  
## FixedAcidity              5.803e-04  6.862e-04   0.846   0.3978    
## VolatileAcidity          -2.371e-02  5.446e-03  -4.354 1.36e-05 ***
## CitricAcid               -2.313e-03  4.933e-03  -0.469   0.6392    
## ResidualSugar            -7.068e-05  1.264e-04  -0.559   0.5761    
## Chlorides                -3.261e-02  1.337e-02  -2.438   0.0148 *  
## FreeSulfurDioxide         5.617e-05  2.863e-05   1.962   0.0498 *  
## TotalSulfurDioxide        1.985e-05  1.860e-05   1.067   0.2858    
## Density                  -3.803e-01  1.603e-01  -2.372   0.0177 *  
## pH                       -1.103e-03  6.256e-03  -0.176   0.8601    
## Sulphates                -5.343e-03  4.590e-03  -1.164   0.2445    
## Alcohol                   4.762e-03  1.154e-03   4.126 3.73e-05 ***
## as.factor(LabelAppeal)-1  2.701e-01  3.473e-02   7.777 8.57e-15 ***
## as.factor(LabelAppeal)0   4.943e-01  3.387e-02  14.596  < 2e-16 ***
## as.factor(LabelAppeal)1   6.493e-01  3.433e-02  18.912  < 2e-16 ***
## as.factor(LabelAppeal)2   7.637e-01  3.800e-02  20.098  < 2e-16 ***
## as.factor(AcidIndex)5     1.186e-01  2.963e-01   0.400   0.6890    
## as.factor(AcidIndex)6     1.902e-01  2.919e-01   0.651   0.5148    
## as.factor(AcidIndex)7     1.536e-01  2.917e-01   0.527   0.5985    
## as.factor(AcidIndex)8     1.286e-01  2.918e-01   0.441   0.6594    
## as.factor(AcidIndex)9     7.794e-02  2.920e-01   0.267   0.7896    
## as.factor(AcidIndex)10   -1.910e-02  2.929e-01  -0.065   0.9480    
## as.factor(AcidIndex)11   -2.417e-01  2.954e-01  -0.818   0.4134    
## as.factor(AcidIndex)12   -2.259e-01  2.996e-01  -0.754   0.4510    
## as.factor(AcidIndex)13   -1.328e-01  3.015e-01  -0.440   0.6597    
## as.factor(AcidIndex)14   -1.998e-01  3.132e-01  -0.638   0.5235    
## as.factor(AcidIndex)15    2.322e-02  3.638e-01   0.064   0.9491    
## as.factor(AcidIndex)16   -2.005e-01  4.126e-01  -0.486   0.6270    
## as.factor(AcidIndex)17    7.303e-02  4.132e-01   0.177   0.8597    
## as.factor(STARS)2         3.175e-01  1.135e-02  27.968  < 2e-16 ***
## as.factor(STARS)3         4.320e-01  1.236e-02  34.960  < 2e-16 ***
## as.factor(STARS)4         5.519e-01  1.751e-02  31.513  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 0.4233659)
## 
##     Null deviance: 5844.1  on 6435  degrees of freedom
## Residual deviance: 3890.7  on 6404  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

Model #8: Quasi-Poisson with Selected Variables

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), family = quasipoisson, 
##     data = wine.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2167  -0.2669   0.0479   0.3933   2.0507  
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.750e-01  2.942e-01   1.275  0.20247    
## VolatileAcidity          -2.133e-02  5.221e-03  -4.084 4.47e-05 ***
## Chlorides                -2.800e-02  1.286e-02  -2.177  0.02949 *  
## FreeSulfurDioxide         7.109e-05  2.743e-05   2.592  0.00956 ** 
## TotalSulfurDioxide        2.420e-05  1.790e-05   1.352  0.17645    
## Sulphates                -4.931e-03  4.411e-03  -1.118  0.26371    
## Alcohol                   4.738e-03  1.111e-03   4.267 2.01e-05 ***
## as.factor(LabelAppeal)-1  2.515e-01  3.351e-02   7.504 6.98e-14 ***
## as.factor(LabelAppeal)0   4.722e-01  3.271e-02  14.438  < 2e-16 ***
## as.factor(LabelAppeal)1   6.277e-01  3.316e-02  18.926  < 2e-16 ***
## as.factor(LabelAppeal)2   7.395e-01  3.654e-02  20.237  < 2e-16 ***
## as.factor(AcidIndex)5     1.177e-01  2.964e-01   0.397  0.69132    
## as.factor(AcidIndex)6     1.790e-01  2.922e-01   0.613  0.54015    
## as.factor(AcidIndex)7     1.493e-01  2.920e-01   0.511  0.60912    
## as.factor(AcidIndex)8     1.213e-01  2.921e-01   0.415  0.67803    
## as.factor(AcidIndex)9     6.778e-02  2.923e-01   0.232  0.81662    
## as.factor(AcidIndex)10   -1.896e-02  2.931e-01  -0.065  0.94843    
## as.factor(AcidIndex)11   -2.527e-01  2.955e-01  -0.855  0.39262    
## as.factor(AcidIndex)12   -2.125e-01  2.993e-01  -0.710  0.47764    
## as.factor(AcidIndex)13   -1.314e-01  3.014e-01  -0.436  0.66296    
## as.factor(AcidIndex)14   -3.701e-01  3.134e-01  -1.181  0.23766    
## as.factor(AcidIndex)15    1.800e-02  3.642e-01   0.049  0.96059    
## as.factor(AcidIndex)16   -1.916e-01  4.129e-01  -0.464  0.64268    
## as.factor(AcidIndex)17    3.635e-02  4.131e-01   0.088  0.92989    
## as.factor(STARS)2         3.226e-01  1.095e-02  29.462  < 2e-16 ***
## as.factor(STARS)3         4.381e-01  1.190e-02  36.808  < 2e-16 ***
## as.factor(STARS)4         5.566e-01  1.672e-02  33.283  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 0.4251092)
## 
##     Null deviance: 6339.3  on 6952  degrees of freedom
## Residual deviance: 4216.8  on 6926  degrees of freedom
##   (5842 observations deleted due to missingness)
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

Comparison of Negative Binomial, Poisson, and Quasi-Poisson Regression Models: Coefficients and Standard Errors

                           pois.coef  negbinom.coef  pois.quasi.coef  pois.stderr  negbinom.stderr  pois.quasi.stderr
(Intercept)                0.3749910      0.3749910        0.3749910    0.4512069        0.4512152          0.2941887
VolatileAcidity           -0.0213258     -0.0213259       -0.0213258    0.0080083        0.0080084          0.0052214
Chlorides                 -0.0280045     -0.0280046       -0.0280045    0.0197264        0.0197267          0.0128617
FreeSulfurDioxide          0.0000711      0.0000711        0.0000711    0.0000421        0.0000421          0.0000274
TotalSulfurDioxide         0.0000242      0.0000242        0.0000242    0.0000275        0.0000275          0.0000179
Sulphates                 -0.0049309     -0.0049309       -0.0049309    0.0067660        0.0067661          0.0044115
Alcohol                    0.0047382      0.0047382        0.0047382    0.0017033        0.0017033          0.0011106
as.factor(LabelAppeal)-1   0.2514805      0.2514805        0.2514805    0.0514025        0.0514029          0.0335146
as.factor(LabelAppeal)0    0.4722327      0.4722327        0.4722327    0.0501642        0.0501646          0.0327072
as.factor(LabelAppeal)1    0.6276530      0.6276530        0.6276530    0.0508641        0.0508645          0.0331636

  • From the above table we can see that the model coefficients and standard errors for the Poisson and Negative Binomial regression models are the same up to 4 decimal places. This is expected: the negative binomial model can only account for over-dispersion, so with under-dispersed data its estimated theta becomes very large (about 133,561 for this model) and it effectively reduces to the Poisson model.

  • The model coefficients for the Poisson and Quasi-Poisson regression models are the same, but the standard error estimates differ. This is expected since the data set is under-dispersed: the Quasi-Poisson model scales the Poisson standard errors by the square root of the estimated dispersion parameter (0.425), which makes them about 35% smaller.

  • Because of the under-dispersion, the standard errors from the Poisson regression model are overstated. If we need to use these coefficients for inference, for example to calculate confidence intervals, it is better to rely on the standard error estimates from the Quasi-Poisson regression model, which is better suited for data sets exhibiting under-dispersion or over-dispersion.
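The link between the Poisson and Quasi-Poisson standard errors can be checked by hand. The following is a minimal sketch that uses only numbers copied from the outputs above (three rows of the comparison table and the reported dispersion parameter); the variable names are introduced here for illustration and are not part of the original analysis.

```r
# Standard errors copied from the comparison table (first three rows).
pois_se <- c(`(Intercept)`   = 0.4512069,
             VolatileAcidity = 0.0080083,
             Chlorides       = 0.0197264)
quasi_se_reported <- c(`(Intercept)`   = 0.2941887,
                       VolatileAcidity = 0.0052214,
                       Chlorides       = 0.0128617)

# Dispersion parameter reported in the quasi-Poisson summary.
dispersion <- 0.4251092

# Quasi-Poisson standard errors are the Poisson standard errors
# multiplied by the square root of the dispersion parameter.
quasi_se_rescaled <- pois_se * sqrt(dispersion)

# Side-by-side comparison of the rescaled and the reported values.
round(cbind(pois_se, quasi_se_rescaled, quasi_se_reported), 7)
```

The rescaled column should agree with the reported Quasi-Poisson column up to rounding, which is why the two models share coefficients but not significance levels.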

Consider Zero-Inflation

Previously, we determined that 21% of the wines have 0 sales. Also, about 26% of the wines have not been rated by an expert, as indicated by a missing STARS value. Is there a relationship? For the predictors with missing values, we can visualize how their missing values relate to the number of sales. Recall that the number of sales amongst all the wines ranged from 0 to 8.

The bar at 0 in the STARS plot stands out as the largest. It indicates that about 2000 wines without experts’ ratings had no sales, far more than for any other predictor. There is a clear relationship between the absence of an expert’s rating and the number of sales: when there is no rating, the wine usually has no sales.

Since there is a large number of 0 sales that is likely related to the STARS predictor, we can build a zero-inflated negative binomial model. A zero-inflated model assumes that a zero outcome can arise from two different processes. The first process decides whether an observation is an excess zero; since it has only 2 outcomes, it is modeled with a binomial (logit) model. For this model, the expert rating is assumed to drive that process: a wine with no rating or a poor rating is more likely to produce a zero. Otherwise, the count portion of the model, a negative binomial model, generates the number of sales, which can itself be zero. In other words, the overall model is a combination of two models.

Here, the model will take the STARS predictor for the binomial (zero-inflation) portion, and all of the other predictors in the count portion.
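The two-part structure described above can be written out explicitly. The notation below is introduced here for illustration: \pi_i is the probability that wine i is an excess zero, f is the negative binomial probability mass function with mean \mu_i and dispersion \theta, x_i is the vector of count-portion predictors, and \gamma_0, \gamma_1, \beta are the coefficients to be estimated.

```latex
\begin{aligned}
P(Y_i = 0) &= \pi_i + (1 - \pi_i)\, f(0;\ \mu_i, \theta) \\
P(Y_i = y) &= (1 - \pi_i)\, f(y;\ \mu_i, \theta), \qquad y = 1, 2, \dots \\
\operatorname{logit}(\pi_i) &= \gamma_0 + \gamma_1\, \mathrm{STARS}_i \\
\log(\mu_i) &= x_i^{\top} \beta
\end{aligned}
```

The first line shows why a zero can come from either process: it is either an excess zero (probability \pi_i) or an ordinary zero drawn from the count distribution.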

Model #9: Full Zero-Inflated Negative Binomial

## 
## Call:
## zeroinfl(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) | 
##     STARS, data = wine.train, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.43095 -0.25615  0.05577  0.35670  2.45205 
## 
## Count model coefficients (negbin with log link):
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               9.520e-01  5.154e-01   1.847   0.0647 .  
## FixedAcidity              6.000e-04  1.075e-03   0.558   0.5768    
## VolatileAcidity          -1.691e-02  8.479e-03  -1.995   0.0461 *  
## CitricAcid               -1.945e-03  7.700e-03  -0.253   0.8005    
## ResidualSugar            -7.271e-05  1.962e-04  -0.371   0.7109    
## Chlorides                -2.722e-02  2.101e-02  -1.296   0.1951    
## FreeSulfurDioxide         7.817e-06  4.425e-05   0.177   0.8598    
## TotalSulfurDioxide       -2.343e-06  2.827e-05  -0.083   0.9340    
## Density                  -3.595e-01  2.516e-01  -1.429   0.1530    
## pH                        4.591e-03  9.780e-03   0.469   0.6388    
## Sulphates                -1.880e-03  7.167e-03  -0.262   0.7930    
## Alcohol                   8.381e-03  1.794e-03   4.673 2.97e-06 ***
## as.factor(LabelAppeal)-1  3.574e-01  5.428e-02   6.585 4.56e-11 ***
## as.factor(LabelAppeal)0   6.550e-01  5.271e-02  12.427  < 2e-16 ***
## as.factor(LabelAppeal)1   8.703e-01  5.314e-02  16.377  < 2e-16 ***
## as.factor(LabelAppeal)2   1.060e+00  5.859e-02  18.099  < 2e-16 ***
## as.factor(AcidIndex)5    -3.590e-02  4.549e-01  -0.079   0.9371    
## as.factor(AcidIndex)6     5.884e-02  4.482e-01   0.131   0.8956    
## as.factor(AcidIndex)7     1.884e-02  4.479e-01   0.042   0.9665    
## as.factor(AcidIndex)8    -1.553e-03  4.480e-01  -0.003   0.9972    
## as.factor(AcidIndex)9    -3.515e-02  4.484e-01  -0.078   0.9375    
## as.factor(AcidIndex)10   -1.195e-01  4.499e-01  -0.266   0.7904    
## as.factor(AcidIndex)11   -1.876e-01  4.550e-01  -0.412   0.6800    
## as.factor(AcidIndex)12   -1.279e-01  4.633e-01  -0.276   0.7825    
## as.factor(AcidIndex)13   -3.604e-02  4.668e-01  -0.077   0.9385    
## as.factor(AcidIndex)14   -4.569e-02  4.891e-01  -0.093   0.9256    
## as.factor(AcidIndex)15   -2.199e-02  5.747e-01  -0.038   0.9695    
## as.factor(AcidIndex)16    2.464e-01  6.638e-01   0.371   0.7104    
## as.factor(AcidIndex)17   -1.053e-01  6.345e-01  -0.166   0.8683    
## Log(theta)                1.811e+01  1.852e+00   9.779  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.3357     0.5346   4.369 1.25e-05 ***
## STARS        -3.8692     0.5248  -7.373 1.66e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 73631876.5559 
## Number of iterations in BFGS optimization: 40 
## Log-likelihood: -1.134e+04 on 32 Df

The STARS predictor is statistically significant in the zero-inflation portion, and VolatileAcidity, Alcohol, and LabelAppeal are statistically significant in the count portion. A simpler model with these predictors can be built.

Model #10: Zero-Inflated Negative Binomial with Selected Variables

## 
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + Alcohol + as.factor(LabelAppeal) | 
##     STARS, data = wine.train, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.44720 -0.29223  0.06532  0.35521  2.19805 
## 
## Count model coefficients (negbin with log link):
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.603290   0.047589  12.677  < 2e-16 ***
## VolatileAcidity          -0.014676   0.007191  -2.041   0.0413 *  
## Alcohol                   0.009111   0.001496   6.092 1.12e-09 ***
## as.factor(LabelAppeal)-1  0.362691   0.046620   7.780 7.26e-15 ***
## as.factor(LabelAppeal)0   0.657882   0.045361  14.503  < 2e-16 ***
## as.factor(LabelAppeal)1   0.879500   0.045672  19.257  < 2e-16 ***
## as.factor(LabelAppeal)2   1.065331   0.049974  21.318  < 2e-16 ***
## Log(theta)               17.274647        NaN     NaN      NaN    
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.2194     0.4315   5.143  2.7e-07 ***
## STARS        -3.7580     0.4227  -8.891  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 31789517.9144 
## Number of iterations in BFGS optimization: 22 
## Log-likelihood: -1.584e+04 on 10 Df

One takeaway from the model output is that the log odds of the number of sales, TARGET, being an excess zero decrease by about 3.76 for every one-unit increase in the expert rating. In other words, the higher the expert rating, the more likely that the wine had at least one sale.
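To make the log-odds statement more concrete, the zero-inflation coefficients can be converted to an odds ratio and to implied probabilities of an excess zero. This is a small sketch that uses only the two coefficients reported for Model #10; the variable names are illustrative, and the probabilities are what the fitted logit equation implies rather than additional results from the analysis.

```r
# Zero-inflation coefficients reported for Model #10 (logit link).
zero_intercept <- 2.2194
zero_stars     <- -3.7580

# Multiplicative change in the odds of an excess zero per one-star increase.
odds_ratio <- exp(zero_stars)

# Implied probability of an excess zero at each observed rating level.
stars <- 1:4
p_excess_zero <- plogis(zero_intercept + zero_stars * stars)

odds_ratio
data.frame(STARS = stars, p_excess_zero = round(p_excess_zero, 4))
```

Because the odds ratio is well below 1, each additional star sharply reduces the chance that a wine belongs to the excess-zero group.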

Predictions

Does the evaluation set have similar characteristics to the training set?

##                 IN             TARGET       FixedAcidity    VolatileAcidity 
##           0.000000         100.000000           0.000000           0.000000 
##         CitricAcid      ResidualSugar          Chlorides  FreeSulfurDioxide 
##           0.000000           5.037481           4.137931           4.557721 
## TotalSulfurDioxide            Density                 pH          Sulphates 
##           4.707646           0.000000           3.118441           9.295352 
##            Alcohol        LabelAppeal          AcidIndex              STARS 
##           5.547226           0.000000           0.000000          25.217391

As in the training set, over 25% of the wines don’t have a value for STARS. Now that we have revealed the relationship between the absence of an expert’s rating and 0 sales, we decide to use the simpler zero-inflated negative binomial model (Model #10).

We notice that TARGET is NA for every observation in the evaluation set, which is expected since it is the value to be predicted. How do these NAs line up with the NAs in STARS? The number of observations where TARGET is NA is 3,335, and about 25% of them are also missing STARS. The number of observations where STARS is NA and TARGET is not NA is 0.

Every wine that is missing STARS is therefore also missing TARGET. As seen in the training set, the number of sales for wines without an expert rating is overwhelmingly zero, so we can add zeroes in place of the NAs in TARGET for these wines. Finally, we make predictions on the evaluation set.
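The missingness check and the zero fill can be sketched as follows. The data frame below is hypothetical toy data with the same column names as the evaluation set, not the real wine.eval, and this is only one possible way to carry out the step.

```r
# Hypothetical toy evaluation data (not the real wine.eval).
toy_eval <- data.frame(
  STARS  = c(2, NA, 3, NA, 1),
  TARGET = NA_integer_
)

# Cross-tabulate missingness in STARS against missingness in TARGET.
table(stars_na = is.na(toy_eval$STARS), target_na = is.na(toy_eval$TARGET))

# Fill in zero sales where there is no expert rating; the other rows
# stay NA so they can be replaced by model predictions afterwards.
toy_eval$TARGET[is.na(toy_eval$STARS)] <- 0L
toy_eval
```

Keeping the rated wines as NA at this stage makes it easy to see which rows still need a prediction from the fitted model.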

Distribution of Predictions of Sales

Although the evaluation set is only about 25% of the size of the training set, the distribution of TARGET appears similar. We also note that the range of TARGET is smaller for the evaluation set, 0 to 6. Also, only 2 wines were predicted to have exactly one sale.

Conclusion

Based on diagnostic plots and visualizations of relationships between variables, we were able to determine that there is a strong connection between the lack of an expert’s rating and whether or not the wine was sold. In order to maximize sales, we propose prioritizing having the wines rated.

Appendix

knitr::opts_chunk$set(echo = F, warning = F, message = F)

library(corrplot)
library(reshape2)
library(MASS)
library(tidyverse)
library(ggplot2)
library(ggfortify)
library(ggthemes)
library(knitr)
library(broom)
library(caret)
library(leaps)
library(magrittr)
library(betareg)
library(pscl)
library(gtsummary)
library(nnet)
library(arm)
library(AER)
library(kableExtra)



wine.train <- read.csv('https://raw.githubusercontent.com/djunga/DATA621HW5/main/wine-training-data.csv')
wine.eval <- read.csv('https://raw.githubusercontent.com/djunga/DATA621HW5/main/wine-evaluation-data.csv')

summary(wine.train)
##     ï..INDEX         TARGET       FixedAcidity     VolatileAcidity  
##  Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
##  1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
##  Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
##  Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
##  3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
##  Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
##                                                                     
##    CitricAcid      ResidualSugar        Chlorides       FreeSulfurDioxide
##  Min.   :-3.2400   Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00  
##  1st Qu.: 0.0300   1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00  
##  Median : 0.3100   Median :   3.900   Median : 0.0460   Median :  30.00  
##  Mean   : 0.3084   Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85  
##  3rd Qu.: 0.5800   3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00  
##  Max.   : 3.8600   Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00  
##                    NA's   :616        NA's   :638       NA's   :647      
##  TotalSulfurDioxide    Density             pH          Sulphates      
##  Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
##  1st Qu.:  27.0     1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800  
##  Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
##  Mean   : 120.7     Mean   :0.9942   Mean   :3.208   Mean   : 0.5271  
##  3rd Qu.: 208.0     3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600  
##  Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
##  NA's   :682                         NA's   :395     NA's   :1210     
##     Alcohol       LabelAppeal          AcidIndex          STARS      
##  Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.: 9.00   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :10.49   Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.:12.40   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##  NA's   :653                                          NA's   :3359
str(wine.train)
## 'data.frame':    12795 obs. of  16 variables:
##  $ ï..INDEX          : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...
(colSums(is.na(wine.train)) / nrow(wine.train)) * 100
##           ï..INDEX             TARGET       FixedAcidity    VolatileAcidity 
##           0.000000           0.000000           0.000000           0.000000 
##         CitricAcid      ResidualSugar          Chlorides  FreeSulfurDioxide 
##           0.000000           4.814381           4.986323           5.056663 
## TotalSulfurDioxide            Density                 pH          Sulphates 
##           5.330207           0.000000           3.087143           9.456819 
##            Alcohol        LabelAppeal          AcidIndex              STARS 
##           5.103556           0.000000           0.000000          26.252442
missing_val <- data.frame(num_missing=colSums(is.na(wine.train)))
ggplot(wine.train, aes(STARS, TARGET)) + geom_jitter(width=0.5, height=0.5)

# The first column is read in as "ï..INDEX" because of a byte-order mark in the CSV; rename it.
colnames(wine.train)[1] <- "INDEX"
corrplot::corrplot(cor(wine.train, use = "complete.obs"), tl.col="black", tl.cex=0.6, order='AOE')

mlt.train = wine.train 
mlt.train = melt(mlt.train, id.vars = "INDEX")

ggplot(aes(value), data = mlt.train) + geom_histogram(stat = "bin", fill = "navyblue") + facet_wrap(~variable, scales = "free") + labs(title = "Distributions of Continuous Variables", x = "Variable", y = "Count") 

# Proportion of wines with zero sales
mean(wine.train$TARGET == 0)
## [1] 0.2136772
mlt.train <- melt(wine.train, id.vars = c("INDEX", "TARGET"))
ggplot(aes(value, TARGET), data = mlt.train) + geom_point() + facet_wrap(~variable, scales = "free") + labs(title = "Distributions of Continuous Variables", x = "Variable", y = "TARGET") 

mlt.train <- melt(select(wine.train, "pH", "AcidIndex", "LabelAppeal", "STARS", "TARGET"), id.vars = c("TARGET"))
ggplot(aes(value, TARGET), data = mlt.train) + geom_jitter(width = 0.5, height = 0.5) + facet_wrap(~variable, scales = "free") + labs(title = "Distributions of Continuous Variables Subset, Jittered", x = "Variable", y = "TARGET") 

mod1 <- lm(TARGET ~ ., data=select(wine.train, -"INDEX"))
summary(mod1)
## 
## Call:
## lm(formula = TARGET ~ ., data = select(wine.train, -"INDEX"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0614 -0.5143  0.1240  0.7170  3.2419 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.563e+00  5.530e-01   8.251  < 2e-16 ***
## FixedAcidity        1.685e-03  2.319e-03   0.727   0.4675    
## VolatileAcidity    -9.466e-02  1.846e-02  -5.129 3.00e-07 ***
## CitricAcid         -4.836e-03  1.675e-02  -0.289   0.7728    
## ResidualSugar      -2.513e-04  4.276e-04  -0.588   0.5567    
## Chlorides          -1.134e-01  4.546e-02  -2.494   0.0126 *  
## FreeSulfurDioxide   2.264e-04  9.711e-05   2.332   0.0198 *  
## TotalSulfurDioxide  7.810e-05  6.288e-05   1.242   0.2142    
## Density            -1.281e+00  5.435e-01  -2.357   0.0185 *  
## pH                 -9.441e-03  2.121e-02  -0.445   0.6563    
## Sulphates          -1.727e-02  1.558e-02  -1.109   0.2676    
## Alcohol             1.653e-02  3.887e-03   4.252 2.15e-05 ***
## LabelAppeal         6.442e-01  1.743e-02  36.947  < 2e-16 ***
## AcidIndex          -1.649e-01  1.235e-02 -13.346  < 2e-16 ***
## STARS               7.278e-01  1.710e-02  42.571  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.153 on 6421 degrees of freedom
##   (6359 observations deleted due to missingness)
## Multiple R-squared:  0.445,  Adjusted R-squared:  0.4438 
## F-statistic: 367.8 on 14 and 6421 DF,  p-value: < 2.2e-16
mod2 <- lm(TARGET ~ VolatileAcidity + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, data=wine.train)
summary(mod2)
## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + Sulphates + Alcohol + 
##     LabelAppeal + AcidIndex + STARS, data = wine.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0351 -0.5123  0.1289  0.7224  3.1485 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.282082   0.100168  32.766  < 2e-16 ***
## VolatileAcidity -0.090141   0.016389  -5.500 3.91e-08 ***
## Sulphates       -0.014904   0.013840  -1.077    0.282    
## Alcohol          0.020198   0.003436   5.879 4.29e-09 ***
## LabelAppeal      0.653326   0.015456  42.269  < 2e-16 ***
## AcidIndex       -0.167059   0.010937 -15.274  < 2e-16 ***
## STARS            0.719125   0.015136  47.511  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.152 on 8127 degrees of freedom
##   (4661 observations deleted due to missingness)
## Multiple R-squared:  0.4472, Adjusted R-squared:  0.4468 
## F-statistic:  1096 on 6 and 8127 DF,  p-value: < 2.2e-16
mod2 %>% glance()
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.447         0.447  1.15     1096.       0     6 -12687. 25391. 25447.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
car::vif(mod1)
##       FixedAcidity    VolatileAcidity         CitricAcid      ResidualSugar 
##           1.027295           1.003524           1.006280           1.002597 
##          Chlorides  FreeSulfurDioxide TotalSulfurDioxide            Density 
##           1.003084           1.004023           1.002931           1.004776 
##                 pH          Sulphates            Alcohol        LabelAppeal 
##           1.004116           1.004483           1.009935           1.117057 
##          AcidIndex              STARS 
##           1.048563           1.134199
autoplot(mod1)

car::vif(mod2)
## VolatileAcidity       Sulphates         Alcohol     LabelAppeal       AcidIndex 
##        1.001844        1.000869        1.006664        1.126009        1.012806 
##           STARS 
##        1.138001
autoplot(mod2)

Poisson_Model1<- glm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
              Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
              pH + Sulphates + Alcohol + 
              as.factor(LabelAppeal) +
              as.factor(AcidIndex) +
              as.factor(STARS),
              data=wine.train, 
              family=poisson
            )
summary(Poisson_Model1)
## 
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), family = poisson, 
##     data = wine.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2212  -0.2662   0.0461   0.3943   1.7274  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               7.306e-01  5.130e-01   1.424  0.15441    
## FixedAcidity              5.803e-04  1.055e-03   0.550  0.58216    
## VolatileAcidity          -2.371e-02  8.369e-03  -2.833  0.00461 ** 
## CitricAcid               -2.313e-03  7.581e-03  -0.305  0.76034    
## ResidualSugar            -7.068e-05  1.943e-04  -0.364  0.71598    
## Chlorides                -3.261e-02  2.056e-02  -1.587  0.11260    
## FreeSulfurDioxide         5.617e-05  4.399e-05   1.277  0.20173    
## TotalSulfurDioxide        1.985e-05  2.858e-05   0.695  0.48734    
## Density                  -3.803e-01  2.464e-01  -1.543  0.12274    
## pH                       -1.103e-03  9.614e-03  -0.115  0.90867    
## Sulphates                -5.343e-03  7.055e-03  -0.757  0.44883    
## Alcohol                   4.762e-03  1.774e-03   2.685  0.00726 ** 
## as.factor(LabelAppeal)-1  2.701e-01  5.337e-02   5.061 4.18e-07 ***
## as.factor(LabelAppeal)0   4.943e-01  5.205e-02   9.497  < 2e-16 ***
## as.factor(LabelAppeal)1   6.493e-01  5.277e-02  12.305  < 2e-16 ***
## as.factor(LabelAppeal)2   7.637e-01  5.840e-02  13.077  < 2e-16 ***
## as.factor(AcidIndex)5     1.186e-01  4.553e-01   0.260  0.79455    
## as.factor(AcidIndex)6     1.902e-01  4.487e-01   0.424  0.67166    
## as.factor(AcidIndex)7     1.536e-01  4.484e-01   0.343  0.73190    
## as.factor(AcidIndex)8     1.286e-01  4.485e-01   0.287  0.77430    
## as.factor(AcidIndex)9     7.794e-02  4.488e-01   0.174  0.86215    
## as.factor(AcidIndex)10   -1.910e-02  4.502e-01  -0.042  0.96617    
## as.factor(AcidIndex)11   -2.417e-01  4.540e-01  -0.532  0.59456    
## as.factor(AcidIndex)12   -2.259e-01  4.605e-01  -0.490  0.62379    
## as.factor(AcidIndex)13   -1.328e-01  4.634e-01  -0.287  0.77447    
## as.factor(AcidIndex)14   -1.998e-01  4.813e-01  -0.415  0.67806    
## as.factor(AcidIndex)15    2.322e-02  5.591e-01   0.042  0.96687    
## as.factor(AcidIndex)16   -2.005e-01  6.341e-01  -0.316  0.75185    
## as.factor(AcidIndex)17    7.303e-02  6.351e-01   0.115  0.90845    
## as.factor(STARS)2         3.175e-01  1.744e-02  18.198  < 2e-16 ***
## as.factor(STARS)3         4.320e-01  1.899e-02  22.747  < 2e-16 ***
## as.factor(STARS)4         5.519e-01  2.692e-02  20.504  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5844.1  on 6435  degrees of freedom
## Residual deviance: 3890.7  on 6404  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: 23087
## 
## Number of Fisher Scoring iterations: 5
Poisson_Model2 <- glm(TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Sulphates +Alcohol + 
              as.factor(LabelAppeal) + 
              as.factor(AcidIndex) + 
              as.factor(STARS),
              data=wine.train, 
              family=poisson
             )
summary(Poisson_Model2)
## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), family = poisson, 
##     data = wine.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2167  -0.2669   0.0479   0.3933   2.0507  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               3.750e-01  4.512e-01   0.831  0.40593    
## VolatileAcidity          -2.133e-02  8.008e-03  -2.663  0.00775 ** 
## Chlorides                -2.800e-02  1.973e-02  -1.420  0.15571    
## FreeSulfurDioxide         7.109e-05  4.206e-05   1.690  0.09101 .  
## TotalSulfurDioxide        2.420e-05  2.745e-05   0.881  0.37807    
## Sulphates                -4.931e-03  6.766e-03  -0.729  0.46614    
## Alcohol                   4.738e-03  1.703e-03   2.782  0.00541 ** 
## as.factor(LabelAppeal)-1  2.515e-01  5.140e-02   4.892 9.96e-07 ***
## as.factor(LabelAppeal)0   4.722e-01  5.016e-02   9.414  < 2e-16 ***
## as.factor(LabelAppeal)1   6.277e-01  5.086e-02  12.340  < 2e-16 ***
## as.factor(LabelAppeal)2   7.395e-01  5.605e-02  13.194  < 2e-16 ***
## as.factor(AcidIndex)5     1.177e-01  4.545e-01   0.259  0.79571    
## as.factor(AcidIndex)6     1.790e-01  4.482e-01   0.399  0.68958    
## as.factor(AcidIndex)7     1.493e-01  4.479e-01   0.333  0.73883    
## as.factor(AcidIndex)8     1.213e-01  4.480e-01   0.271  0.78663    
## as.factor(AcidIndex)9     6.778e-02  4.483e-01   0.151  0.87982    
## as.factor(AcidIndex)10   -1.896e-02  4.496e-01  -0.042  0.96636    
## as.factor(AcidIndex)11   -2.527e-01  4.533e-01  -0.557  0.57724    
## as.factor(AcidIndex)12   -2.125e-01  4.590e-01  -0.463  0.64335    
## as.factor(AcidIndex)13   -1.314e-01  4.622e-01  -0.284  0.77627    
## as.factor(AcidIndex)14   -3.701e-01  4.807e-01  -0.770  0.44131    
## as.factor(AcidIndex)15    1.800e-02  5.585e-01   0.032  0.97430    
## as.factor(AcidIndex)16   -1.916e-01  6.333e-01  -0.303  0.76226    
## as.factor(AcidIndex)17    3.635e-02  6.337e-01   0.057  0.95425    
## as.factor(STARS)2         3.226e-01  1.679e-02  19.210  < 2e-16 ***
## as.factor(STARS)3         4.381e-01  1.826e-02  23.999  < 2e-16 ***
## as.factor(STARS)4         5.566e-01  2.565e-02  21.701  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6339.3  on 6952  degrees of freedom
## Residual deviance: 4216.8  on 6926  degrees of freedom
##   (5842 observations deleted due to missingness)
## AIC: 24948
## 
## Number of Fisher Scoring iterations: 5
dispersiontest(Poisson_Model2, trafo = 1)
## 
##  Overdispersion test
## 
## data:  Poisson_Model2
## z = -48.265, p-value = 1
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##      alpha 
## -0.5757886
Negative_Bin_Model1 <- glm.nb(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
                Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
                pH + Sulphates + Alcohol + 
                as.factor(LabelAppeal) +
                as.factor(AcidIndex) +
                as.factor(STARS),
              data=wine.train)
summary(Negative_Bin_Model1)
## 
## Call:
## glm.nb(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), data = wine.train, 
##     init.theta = 134433.0376, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2212  -0.2662   0.0461   0.3943   1.7274  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               7.306e-01  5.131e-01   1.424  0.15442    
## FixedAcidity              5.803e-04  1.055e-03   0.550  0.58216    
## VolatileAcidity          -2.371e-02  8.370e-03  -2.833  0.00461 ** 
## CitricAcid               -2.313e-03  7.581e-03  -0.305  0.76034    
## ResidualSugar            -7.068e-05  1.943e-04  -0.364  0.71599    
## Chlorides                -3.261e-02  2.056e-02  -1.587  0.11260    
## FreeSulfurDioxide         5.617e-05  4.400e-05   1.277  0.20174    
## TotalSulfurDioxide        1.985e-05  2.858e-05   0.695  0.48734    
## Density                  -3.803e-01  2.464e-01  -1.543  0.12274    
## pH                       -1.103e-03  9.614e-03  -0.115  0.90867    
## Sulphates                -5.343e-03  7.055e-03  -0.757  0.44884    
## Alcohol                   4.762e-03  1.774e-03   2.685  0.00726 ** 
## as.factor(LabelAppeal)-1  2.701e-01  5.337e-02   5.060 4.18e-07 ***
## as.factor(LabelAppeal)0   4.943e-01  5.205e-02   9.497  < 2e-16 ***
## as.factor(LabelAppeal)1   6.493e-01  5.277e-02  12.305  < 2e-16 ***
## as.factor(LabelAppeal)2   7.637e-01  5.840e-02  13.077  < 2e-16 ***
## as.factor(AcidIndex)5     1.186e-01  4.553e-01   0.260  0.79455    
## as.factor(AcidIndex)6     1.902e-01  4.487e-01   0.424  0.67167    
## as.factor(AcidIndex)7     1.536e-01  4.484e-01   0.343  0.73190    
## as.factor(AcidIndex)8     1.286e-01  4.485e-01   0.287  0.77430    
## as.factor(AcidIndex)9     7.794e-02  4.489e-01   0.174  0.86215    
## as.factor(AcidIndex)10   -1.910e-02  4.502e-01  -0.042  0.96617    
## as.factor(AcidIndex)11   -2.417e-01  4.540e-01  -0.532  0.59456    
## as.factor(AcidIndex)12   -2.259e-01  4.605e-01  -0.490  0.62380    
## as.factor(AcidIndex)13   -1.328e-01  4.634e-01  -0.287  0.77447    
## as.factor(AcidIndex)14   -1.998e-01  4.813e-01  -0.415  0.67806    
## as.factor(AcidIndex)15    2.322e-02  5.591e-01   0.042  0.96687    
## as.factor(AcidIndex)16   -2.005e-01  6.341e-01  -0.316  0.75185    
## as.factor(AcidIndex)17    7.303e-02  6.351e-01   0.115  0.90845    
## as.factor(STARS)2         3.175e-01  1.744e-02  18.198  < 2e-16 ***
## as.factor(STARS)3         4.320e-01  1.899e-02  22.747  < 2e-16 ***
## as.factor(STARS)4         5.519e-01  2.692e-02  20.504  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(134433) family taken to be 1)
## 
##     Null deviance: 5843.9  on 6435  degrees of freedom
## Residual deviance: 3890.6  on 6404  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: 23089
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  134433 
##           Std. Err.:  217492 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -23023.41
Negative_Bin_Model2 <- glm.nb(TARGET~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Sulphates 
              + 
              Alcohol + 
              as.factor(LabelAppeal) + 
              as.factor(AcidIndex) + 
              as.factor(STARS),
              data=wine.train)
summary(Negative_Bin_Model2)
## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), data = wine.train, 
##     init.theta = 133561.251, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2167  -0.2669   0.0479   0.3933   2.0506  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               3.750e-01  4.512e-01   0.831  0.40593    
## VolatileAcidity          -2.133e-02  8.008e-03  -2.663  0.00775 ** 
## Chlorides                -2.800e-02  1.973e-02  -1.420  0.15572    
## FreeSulfurDioxide         7.109e-05  4.206e-05   1.690  0.09101 .  
## TotalSulfurDioxide        2.420e-05  2.745e-05   0.881  0.37807    
## Sulphates                -4.931e-03  6.766e-03  -0.729  0.46614    
## Alcohol                   4.738e-03  1.703e-03   2.782  0.00541 ** 
## as.factor(LabelAppeal)-1  2.515e-01  5.140e-02   4.892 9.96e-07 ***
## as.factor(LabelAppeal)0   4.722e-01  5.016e-02   9.414  < 2e-16 ***
## as.factor(LabelAppeal)1   6.277e-01  5.086e-02  12.340  < 2e-16 ***
## as.factor(LabelAppeal)2   7.395e-01  5.605e-02  13.194  < 2e-16 ***
## as.factor(AcidIndex)5     1.177e-01  4.545e-01   0.259  0.79572    
## as.factor(AcidIndex)6     1.790e-01  4.482e-01   0.399  0.68958    
## as.factor(AcidIndex)7     1.493e-01  4.479e-01   0.333  0.73883    
## as.factor(AcidIndex)8     1.213e-01  4.480e-01   0.271  0.78663    
## as.factor(AcidIndex)9     6.778e-02  4.483e-01   0.151  0.87982    
## as.factor(AcidIndex)10   -1.896e-02  4.496e-01  -0.042  0.96636    
## as.factor(AcidIndex)11   -2.527e-01  4.533e-01  -0.557  0.57725    
## as.factor(AcidIndex)12   -2.125e-01  4.590e-01  -0.463  0.64335    
## as.factor(AcidIndex)13   -1.314e-01  4.622e-01  -0.284  0.77628    
## as.factor(AcidIndex)14   -3.701e-01  4.807e-01  -0.770  0.44132    
## as.factor(AcidIndex)15    1.799e-02  5.585e-01   0.032  0.97430    
## as.factor(AcidIndex)16   -1.916e-01  6.333e-01  -0.303  0.76226    
## as.factor(AcidIndex)17    3.635e-02  6.337e-01   0.057  0.95425    
## as.factor(STARS)2         3.226e-01  1.679e-02  19.209  < 2e-16 ***
## as.factor(STARS)3         4.381e-01  1.826e-02  23.999  < 2e-16 ***
## as.factor(STARS)4         5.566e-01  2.565e-02  21.700  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(133561.3) family taken to be 1)
## 
##     Null deviance: 6339.2  on 6952  degrees of freedom
## Residual deviance: 4216.8  on 6926  degrees of freedom
##   (5842 observations deleted due to missingness)
## AIC: 24950
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  133561 
##           Std. Err.:  206991 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -24894.47
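The estimated theta above is very large (about 133561), and the fit stopped at the iteration limit. In the parameterization used by glm.nb, the variance is mu + mu^2/theta, so a theta of this size adds almost nothing to the Poisson variance. As a rough check, the sketch below compares negative binomial and Poisson probabilities at a mean of 3. That mean is an assumed illustrative value close to the mean of TARGET, not a fitted quantity.

```r
# theta is copied from the summary above; mu is an assumed, illustrative mean
theta <- 133561
mu <- 3
k <- 0:8  # range of TARGET seen in the data summary

# Variance implied by the negative binomial (the Poisson variance equals mu)
nb_var <- mu + mu^2 / theta

# Largest pointwise gap between the two probability mass functions
max_gap <- max(abs(dnbinom(k, size = theta, mu = mu) - dpois(k, lambda = mu)))

print(c(nb_variance = nb_var, poisson_variance = mu, max_pmf_gap = max_gap))
```

A gap this small suggests the negative binomial fit is effectively reproducing the Poisson model rather than capturing extra variation.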
Quasi_Poisson_Model1 <- glm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
              Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
              pH + Sulphates + Alcohol + 
              as.factor(LabelAppeal) +
              as.factor(AcidIndex) +
              as.factor(STARS),
              data=wine.train, 
              family=quasipoisson
            )
summary(Quasi_Poisson_Model1)
## 
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), family = quasipoisson, 
##     data = wine.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2212  -0.2662   0.0461   0.3943   1.7274  
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               7.306e-01  3.338e-01   2.189   0.0287 *  
## FixedAcidity              5.803e-04  6.862e-04   0.846   0.3978    
## VolatileAcidity          -2.371e-02  5.446e-03  -4.354 1.36e-05 ***
## CitricAcid               -2.313e-03  4.933e-03  -0.469   0.6392    
## ResidualSugar            -7.068e-05  1.264e-04  -0.559   0.5761    
## Chlorides                -3.261e-02  1.337e-02  -2.438   0.0148 *  
## FreeSulfurDioxide         5.617e-05  2.863e-05   1.962   0.0498 *  
## TotalSulfurDioxide        1.985e-05  1.860e-05   1.067   0.2858    
## Density                  -3.803e-01  1.603e-01  -2.372   0.0177 *  
## pH                       -1.103e-03  6.256e-03  -0.176   0.8601    
## Sulphates                -5.343e-03  4.590e-03  -1.164   0.2445    
## Alcohol                   4.762e-03  1.154e-03   4.126 3.73e-05 ***
## as.factor(LabelAppeal)-1  2.701e-01  3.473e-02   7.777 8.57e-15 ***
## as.factor(LabelAppeal)0   4.943e-01  3.387e-02  14.596  < 2e-16 ***
## as.factor(LabelAppeal)1   6.493e-01  3.433e-02  18.912  < 2e-16 ***
## as.factor(LabelAppeal)2   7.637e-01  3.800e-02  20.098  < 2e-16 ***
## as.factor(AcidIndex)5     1.186e-01  2.963e-01   0.400   0.6890    
## as.factor(AcidIndex)6     1.902e-01  2.919e-01   0.651   0.5148    
## as.factor(AcidIndex)7     1.536e-01  2.917e-01   0.527   0.5985    
## as.factor(AcidIndex)8     1.286e-01  2.918e-01   0.441   0.6594    
## as.factor(AcidIndex)9     7.794e-02  2.920e-01   0.267   0.7896    
## as.factor(AcidIndex)10   -1.910e-02  2.929e-01  -0.065   0.9480    
## as.factor(AcidIndex)11   -2.417e-01  2.954e-01  -0.818   0.4134    
## as.factor(AcidIndex)12   -2.259e-01  2.996e-01  -0.754   0.4510    
## as.factor(AcidIndex)13   -1.328e-01  3.015e-01  -0.440   0.6597    
## as.factor(AcidIndex)14   -1.998e-01  3.132e-01  -0.638   0.5235    
## as.factor(AcidIndex)15    2.322e-02  3.638e-01   0.064   0.9491    
## as.factor(AcidIndex)16   -2.005e-01  4.126e-01  -0.486   0.6270    
## as.factor(AcidIndex)17    7.303e-02  4.132e-01   0.177   0.8597    
## as.factor(STARS)2         3.175e-01  1.135e-02  27.968  < 2e-16 ***
## as.factor(STARS)3         4.320e-01  1.236e-02  34.960  < 2e-16 ***
## as.factor(STARS)4         5.519e-01  1.751e-02  31.513  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 0.4233659)
## 
##     Null deviance: 5844.1  on 6435  degrees of freedom
## Residual deviance: 3890.7  on 6404  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
Quasi_Poisson_Model2 <- glm(TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
              TotalSulfurDioxide + Sulphates + Alcohol + 
              as.factor(LabelAppeal) + 
              as.factor(AcidIndex) + 
              as.factor(STARS),
              data=wine.train, 
              family=quasipoisson
             )
summary(Quasi_Poisson_Model2)
## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), family = quasipoisson, 
##     data = wine.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2167  -0.2669   0.0479   0.3933   2.0507  
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.750e-01  2.942e-01   1.275  0.20247    
## VolatileAcidity          -2.133e-02  5.221e-03  -4.084 4.47e-05 ***
## Chlorides                -2.800e-02  1.286e-02  -2.177  0.02949 *  
## FreeSulfurDioxide         7.109e-05  2.743e-05   2.592  0.00956 ** 
## TotalSulfurDioxide        2.420e-05  1.790e-05   1.352  0.17645    
## Sulphates                -4.931e-03  4.411e-03  -1.118  0.26371    
## Alcohol                   4.738e-03  1.111e-03   4.267 2.01e-05 ***
## as.factor(LabelAppeal)-1  2.515e-01  3.351e-02   7.504 6.98e-14 ***
## as.factor(LabelAppeal)0   4.722e-01  3.271e-02  14.438  < 2e-16 ***
## as.factor(LabelAppeal)1   6.277e-01  3.316e-02  18.926  < 2e-16 ***
## as.factor(LabelAppeal)2   7.395e-01  3.654e-02  20.237  < 2e-16 ***
## as.factor(AcidIndex)5     1.177e-01  2.964e-01   0.397  0.69132    
## as.factor(AcidIndex)6     1.790e-01  2.922e-01   0.613  0.54015    
## as.factor(AcidIndex)7     1.493e-01  2.920e-01   0.511  0.60912    
## as.factor(AcidIndex)8     1.213e-01  2.921e-01   0.415  0.67803    
## as.factor(AcidIndex)9     6.778e-02  2.923e-01   0.232  0.81662    
## as.factor(AcidIndex)10   -1.896e-02  2.931e-01  -0.065  0.94843    
## as.factor(AcidIndex)11   -2.527e-01  2.955e-01  -0.855  0.39262    
## as.factor(AcidIndex)12   -2.125e-01  2.993e-01  -0.710  0.47764    
## as.factor(AcidIndex)13   -1.314e-01  3.014e-01  -0.436  0.66296    
## as.factor(AcidIndex)14   -3.701e-01  3.134e-01  -1.181  0.23766    
## as.factor(AcidIndex)15    1.800e-02  3.642e-01   0.049  0.96059    
## as.factor(AcidIndex)16   -1.916e-01  4.129e-01  -0.464  0.64268    
## as.factor(AcidIndex)17    3.635e-02  4.131e-01   0.088  0.92989    
## as.factor(STARS)2         3.226e-01  1.095e-02  29.462  < 2e-16 ***
## as.factor(STARS)3         4.381e-01  1.190e-02  36.808  < 2e-16 ***
## as.factor(STARS)4         5.566e-01  1.672e-02  33.283  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 0.4251092)
## 
##     Null deviance: 6339.3  on 6952  degrees of freedom
## Residual deviance: 4216.8  on 6926  degrees of freedom
##   (5842 observations deleted due to missingness)
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
# Collect coefficients and standard errors from the three reduced models
# so they can be compared side by side. se.coef() comes from the arm package.
pois.coef = coef(Poisson_Model2)
negbinom.coef = coef(Negative_Bin_Model2)
pois.stderr = se.coef(Poisson_Model2)
negbinom.stderr = summary(Negative_Bin_Model2)$coefficients[, 2]
pois.quasi.coef = coef(Quasi_Poisson_Model2)
pois.quasi.stderr = se.coef(Quasi_Poisson_Model2)
df.analysis = cbind(pois.coef,   negbinom.coef,   pois.quasi.coef, 
                    pois.stderr, negbinom.stderr, pois.quasi.stderr)
# Show the first ten terms as a styled table
head(df.analysis,10) %>% kable() %>% kable_styling(c("striped", "bordered"))
                           pois.coef  negbinom.coef  pois.quasi.coef  pois.stderr  negbinom.stderr  pois.quasi.stderr
(Intercept)                0.3749910      0.3749910        0.3749910    0.4512069        0.4512152          0.2941887
VolatileAcidity           -0.0213258     -0.0213259       -0.0213258    0.0080083        0.0080084          0.0052214
Chlorides                 -0.0280045     -0.0280046       -0.0280045    0.0197264        0.0197267          0.0128617
FreeSulfurDioxide          0.0000711      0.0000711        0.0000711    0.0000421        0.0000421          0.0000274
TotalSulfurDioxide         0.0000242      0.0000242        0.0000242    0.0000275        0.0000275          0.0000179
Sulphates                 -0.0049309     -0.0049309       -0.0049309    0.0067660        0.0067661          0.0044115
Alcohol                    0.0047382      0.0047382        0.0047382    0.0017033        0.0017033          0.0011106
as.factor(LabelAppeal)-1   0.2514805      0.2514805        0.2514805    0.0514025        0.0514029          0.0335146
as.factor(LabelAppeal)0    0.4722327      0.4722327        0.4722327    0.0501642        0.0501646          0.0327072
as.factor(LabelAppeal)1    0.6276530      0.6276530        0.6276530    0.0508641        0.0508645          0.0331636
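In the table above the three sets of coefficients agree, and the quasi-Poisson standard errors are smaller than the Poisson ones by a constant factor. That factor is the square root of the estimated dispersion, which is 0.4251092 for Quasi_Poisson_Model2. For the intercept, 0.4512069 * sqrt(0.4251092) is approximately 0.2942. The sketch below reproduces this relationship on simulated data rather than the wine data. All names in it are illustrative.

```r
set.seed(1)

# Simulated stand-in data (hypothetical; not the wine data)
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))

pois_fit  <- glm(y ~ x, family = poisson)
quasi_fit <- glm(y ~ x, family = quasipoisson)

# Pearson-based dispersion estimate, as reported by summary() for quasi families
phi <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)

se_pois  <- summary(pois_fit)$coefficients[, "Std. Error"]
se_quasi <- summary(quasi_fit)$coefficients[, "Std. Error"]

# Same point estimates; standard errors differ only by sqrt(phi)
print(all.equal(coef(pois_fit), coef(quasi_fit)))
print(all.equal(se_quasi, se_pois * sqrt(phi)))
```

Because the dispersion estimated for the wine models is below 1, the quasi-Poisson fit reports tighter standard errors than the Poisson fit, which is why more terms appear significant in its summary.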
# Predictors that contain missing values
predictor_names <- c("ResidualSugar", "Chlorides", "FreeSulfurDioxide",
                     "TotalSulfurDioxide", "pH", "Sulphates", "Alcohol", "STARS")

# For each predictor, count the rows where it is missing, grouped by TARGET (number of sales)
missing_val <- data.frame()

for (name in predictor_names) {
  missing_target_count <- wine.train %>% filter(is.na(wine.train[[name]])) %>% count(TARGET)
  new_missing <- data.frame(rep(name, nrow(missing_target_count)), missing_target_count)
  colnames(new_missing) <- c("Predictor", "TARGET", "Count")
  
  missing_val <- rbind(missing_val, new_missing)
}

# One panel per predictor: how many rows with that predictor missing fall at each sales count
ggplot(data=missing_val) + 
  geom_bar(mapping=aes(x=TARGET, y=Count), stat="identity") + 
  facet_wrap(~Predictor, scales = "fixed") + 
  labs(title = "Missing Predictors vs Sales", x = "Num sales", y = "Count")  

ZInflatedModel <- zeroinfl(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
    ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
    Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
    as.factor(AcidIndex)  | STARS,
  data = wine.train, dist = "negbin")
summary(ZInflatedModel)
## 
## Call:
## zeroinfl(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) | 
##     STARS, data = wine.train, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.43095 -0.25615  0.05577  0.35670  2.45205 
## 
## Count model coefficients (negbin with log link):
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               9.520e-01  5.154e-01   1.847   0.0647 .  
## FixedAcidity              6.000e-04  1.075e-03   0.558   0.5768    
## VolatileAcidity          -1.691e-02  8.479e-03  -1.995   0.0461 *  
## CitricAcid               -1.945e-03  7.700e-03  -0.253   0.8005    
## ResidualSugar            -7.271e-05  1.962e-04  -0.371   0.7109    
## Chlorides                -2.722e-02  2.101e-02  -1.296   0.1951    
## FreeSulfurDioxide         7.817e-06  4.425e-05   0.177   0.8598    
## TotalSulfurDioxide       -2.343e-06  2.827e-05  -0.083   0.9340    
## Density                  -3.595e-01  2.516e-01  -1.429   0.1530    
## pH                        4.591e-03  9.780e-03   0.469   0.6388    
## Sulphates                -1.880e-03  7.167e-03  -0.262   0.7930    
## Alcohol                   8.381e-03  1.794e-03   4.673 2.97e-06 ***
## as.factor(LabelAppeal)-1  3.574e-01  5.428e-02   6.585 4.56e-11 ***
## as.factor(LabelAppeal)0   6.550e-01  5.271e-02  12.427  < 2e-16 ***
## as.factor(LabelAppeal)1   8.703e-01  5.314e-02  16.377  < 2e-16 ***
## as.factor(LabelAppeal)2   1.060e+00  5.859e-02  18.099  < 2e-16 ***
## as.factor(AcidIndex)5    -3.590e-02  4.549e-01  -0.079   0.9371    
## as.factor(AcidIndex)6     5.884e-02  4.482e-01   0.131   0.8956    
## as.factor(AcidIndex)7     1.884e-02  4.479e-01   0.042   0.9665    
## as.factor(AcidIndex)8    -1.553e-03  4.480e-01  -0.003   0.9972    
## as.factor(AcidIndex)9    -3.515e-02  4.484e-01  -0.078   0.9375    
## as.factor(AcidIndex)10   -1.195e-01  4.499e-01  -0.266   0.7904    
## as.factor(AcidIndex)11   -1.876e-01  4.550e-01  -0.412   0.6800    
## as.factor(AcidIndex)12   -1.279e-01  4.633e-01  -0.276   0.7825    
## as.factor(AcidIndex)13   -3.604e-02  4.668e-01  -0.077   0.9385    
## as.factor(AcidIndex)14   -4.569e-02  4.891e-01  -0.093   0.9256    
## as.factor(AcidIndex)15   -2.199e-02  5.747e-01  -0.038   0.9695    
## as.factor(AcidIndex)16    2.464e-01  6.638e-01   0.371   0.7104    
## as.factor(AcidIndex)17   -1.053e-01  6.345e-01  -0.166   0.8683    
## Log(theta)                1.811e+01  1.852e+00   9.779  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.3357     0.5346   4.369 1.25e-05 ***
## STARS        -3.8692     0.5248  -7.373 1.66e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 73631876.5559 
## Number of iterations in BFGS optimization: 40 
## Log-likelihood: -1.134e+04 on 32 Df
ZInflatedModel2 <- zeroinfl(TARGET ~ VolatileAcidity + 
    Alcohol + as.factor(LabelAppeal)  | STARS,
  data = wine.train, dist = "negbin")
summary(ZInflatedModel2)
## 
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + Alcohol + as.factor(LabelAppeal) | 
##     STARS, data = wine.train, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.44720 -0.29223  0.06532  0.35521  2.19805 
## 
## Count model coefficients (negbin with log link):
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.603290   0.047589  12.677  < 2e-16 ***
## VolatileAcidity          -0.014676   0.007191  -2.041   0.0413 *  
## Alcohol                   0.009111   0.001496   6.092 1.12e-09 ***
## as.factor(LabelAppeal)-1  0.362691   0.046620   7.780 7.26e-15 ***
## as.factor(LabelAppeal)0   0.657882   0.045361  14.503  < 2e-16 ***
## as.factor(LabelAppeal)1   0.879500   0.045672  19.257  < 2e-16 ***
## as.factor(LabelAppeal)2   1.065331   0.049974  21.318  < 2e-16 ***
## Log(theta)               17.274647        NaN     NaN      NaN    
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.2194     0.4315   5.143  2.7e-07 ***
## STARS        -3.7580     0.4227  -8.891  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 31789517.9144 
## Number of iterations in BFGS optimization: 22 
## Log-likelihood: -1.584e+04 on 10 Df
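With type = "response", predict() on a zeroinfl fit returns the expected count (1 - p_zero) * mu, where p_zero comes from the zero-inflation (logit) part and mu from the count (log) part. The sketch below works through that calculation by hand, using the ZInflatedModel2 coefficients printed above and a hypothetical wine whose predictor values are made up for illustration.

```r
# Coefficients copied from the ZInflatedModel2 summary above
count_coef <- c(intercept = 0.603290, VolatileAcidity = -0.014676,
                Alcohol = 0.009111, LabelAppeal0 = 0.657882)
zero_coef  <- c(intercept = 2.2194, STARS = -3.7580)

# Hypothetical wine with LabelAppeal = 0; these values are assumptions, not a row from the data
wine <- list(VolatileAcidity = 0.28, Alcohol = 10.4, STARS = 2)

# Count part (log link): mean of the negative binomial component
mu <- exp(count_coef[["intercept"]] +
          count_coef[["VolatileAcidity"]] * wine$VolatileAcidity +
          count_coef[["Alcohol"]] * wine$Alcohol +
          count_coef[["LabelAppeal0"]])

# Zero part (logit link): probability of a structural zero
p_zero <- plogis(zero_coef[["intercept"]] + zero_coef[["STARS"]] * wine$STARS)

expected_sales <- (1 - p_zero) * mu
print(c(mu = mu, p_zero = p_zero, expected_sales = expected_sales))

# A missing STARS value propagates through the calculation as NA
p_zero_missing <- plogis(zero_coef[["intercept"]] + zero_coef[["STARS"]] * NA_real_)
print(is.na((1 - p_zero_missing) * mu))
```

The negative STARS coefficient in the zero part means that higher-rated wines are less likely to be structural zeros. A wine with no STARS rating cannot be scored by this model at all, so its prediction is NA.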
(colSums(is.na(wine.eval)) / nrow(wine.eval)) * 100
##                 IN             TARGET       FixedAcidity    VolatileAcidity 
##           0.000000         100.000000           0.000000           0.000000 
##         CitricAcid      ResidualSugar          Chlorides  FreeSulfurDioxide 
##           0.000000           5.037481           4.137931           4.557721 
## TotalSulfurDioxide            Density                 pH          Sulphates 
##           4.707646           0.000000           3.118441           9.295352 
##            Alcohol        LabelAppeal          AcidIndex              STARS 
##           5.547226           0.000000           0.000000          25.217391
colnames(wine.eval)[1] <- "INDEX"

wine.eval$TARGET <- round(predict(ZInflatedModel2, wine.eval %>% select(-c("INDEX", "TARGET")), type="response"))

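# filter() needs the vectorised `&`. The calls were originally written with the
# scalar `&&`, which only tests the first row, so the outputs recorded below
# (3335, i.e. every row of wine.eval, and 0) come from that form and should be
# regenerated after this fix.
wine.eval %>% filter(is.na(STARS) & is.na(TARGET)) %>% count()
##      n
## 1 3335
wine.eval %>% filter(is.na(STARS) & !is.na(TARGET)) %>% count()
##   n
## 1 0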
# Rows with a missing model predictor get an NA prediction; assign those wines 0 sales
wine.eval[is.na(wine.eval$TARGET), 'TARGET'] <- 0

#write_csv(wine.eval,"HW5_predictions.csv")

ggplot(data=wine.eval) + geom_histogram(mapping=aes(x=TARGET))