Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely a wine is to be sold at a high-end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust its wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Data exploration

I uploaded the CSV files to my GitHub account and loaded them from there. The first column (INDEX in the training set, IN in the evaluation set) is only a row identifier and has no significance for our analysis, so it will be dropped.

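The training dataset has 12,795 observations and 16 variables (including INDEX). The summary shows missing values in eight variables: ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol, and STARS. STARS is missing most often (3,359 rows). We will clean these up in the later sections.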

## [1] 12795    16
##      INDEX           TARGET       FixedAcidity     VolatileAcidity  
##  Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
##  1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
##  Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
##  Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
##  3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
##  Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
##                                                                     
##    CitricAcid      ResidualSugar        Chlorides       FreeSulfurDioxide
##  Min.   :-3.2400   Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00  
##  1st Qu.: 0.0300   1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00  
##  Median : 0.3100   Median :   3.900   Median : 0.0460   Median :  30.00  
##  Mean   : 0.3084   Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85  
##  3rd Qu.: 0.5800   3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00  
##  Max.   : 3.8600   Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00  
##                    NA's   :616        NA's   :638       NA's   :647      
##  TotalSulfurDioxide    Density             pH          Sulphates      
##  Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
##  1st Qu.:  27.0     1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800  
##  Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
##  Mean   : 120.7     Mean   :0.9942   Mean   :3.208   Mean   : 0.5271  
##  3rd Qu.: 208.0     3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600  
##  Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
##  NA's   :682                         NA's   :395     NA's   :1210     
##     Alcohol       LabelAppeal          AcidIndex          STARS      
##  Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.: 9.00   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :10.49   Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.:12.40   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##  NA's   :653                                          NA's   :3359

As noted above, the first column of each dataset is just an index number and does not help our analysis, so it is dropped.
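The two steps described so far (dropping the identifier column and counting missing values) can be sketched as follows. This is a minimal illustration on a made-up six-row data frame; `wine_train` and its values are assumptions, not the actual data.

```r
# Toy stand-in for the training data (hypothetical values).
wine_train <- data.frame(
  INDEX   = 1:6,
  TARGET  = c(3, 0, 4, 2, 5, 3),
  Alcohol = c(9.9, NA, 12.4, 10.1, NA, 11.0),
  STARS   = c(2, NA, 3, 1, 4, NA)
)

# Drop the identifier column by name rather than by position,
# so the code still works if the column order changes.
wine_train <- wine_train[, setdiff(names(wine_train), "INDEX")]

# Count the missing values in each remaining column.
na_counts <- colSums(is.na(wine_train))
print(na_counts)
```

Dropping by name also makes it easy to handle the evaluation set, where the identifier column has a different name.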

Data Visualization

The histograms show that many of the variables are approximately normally distributed, while AcidIndex and STARS are right-skewed.

The plots of each variable against the target variable show that TARGET increases with STARS and LabelAppeal, so both have a positive relationship with the target variable.

In the correlation matrix, STARS and LabelAppeal show a positive relationship, while the other variables are only loosely correlated and show no clear relationships.
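A correlation check of this kind can be sketched in base R as shown here. The data are simulated for illustration (the data frame `toy` and its coefficients are assumptions), and `use = "pairwise.complete.obs"` is one possible way to cope with missing values before imputation.

```r
set.seed(1)
n <- 500

# Simulated predictors: two ordinal ratings and one unrelated chemical measure.
stars <- sample(1:4, n, replace = TRUE)
label <- sample(-2:2, n, replace = TRUE)
ph    <- rnorm(n, mean = 3.2, sd = 0.7)

# Simulated count response that depends on the two ratings only.
target <- rpois(n, lambda = exp(0.2 + 0.3 * stars + 0.15 * label))

toy <- data.frame(TARGET = target, STARS = stars, LabelAppeal = label, pH = ph)
toy$pH[sample(n, 20)] <- NA  # inject some missing values

# Pairwise-complete correlations tolerate the missing entries.
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Correlation of each predictor with TARGET, strongest first.
target_cor <- sort(cor_mat["TARGET", colnames(cor_mat) != "TARGET"],
                   decreasing = TRUE)
print(round(target_cor, 2))
```

Pearson correlation treats the ratings as numeric; for ordinal variables such as STARS, `method = "spearman"` is a reasonable alternative.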

Data preparation

Our earlier analysis showed that several variables have missing values. For STARS, we replace NA with 0. For the remaining missing data we use caret::preProcess with method = "knnImpute". The same preProcess call also centers, scales, and applies a Box-Cox transformation to the features. df_clean is the cleaned data frame.
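The preparation steps could look roughly like the sketch below. It runs on simulated data, and the exact arguments used in the original analysis are not shown in this report, so the method list and the choice to leave TARGET out of the transformation are assumptions. The check for the RANN package is a precaution, since k-nearest-neighbour imputation in caret may depend on it.

```r
set.seed(42)
n <- 200

# Simulated stand-in for the wine data (hypothetical values).
df <- data.frame(
  TARGET  = rpois(n, 3),
  Alcohol = rnorm(n, 10.5, 3),
  pH      = rnorm(n, 3.2, 0.6),
  STARS   = sample(c(1:4, NA), n, replace = TRUE)
)
df$Alcohol[sample(n, 15)] <- NA

# Step 1: treat a missing STARS rating as its own value, 0 ("unrated").
df$STARS[is.na(df$STARS)] <- 0

# Step 2: impute and transform the predictors only, so TARGET stays a raw count.
predictors <- setdiff(names(df), "TARGET")
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("RANN", quietly = TRUE)) {
  pp <- caret::preProcess(df[predictors],
                          method = c("knnImpute", "center", "scale", "BoxCox"))
  df_clean <- cbind(TARGET = df$TARGET, predict(pp, df[predictors]))
} else {
  message("caret/RANN not installed; skipping the imputation step")
  df_clean <- NULL
}
```

Replacing a missing STARS value with 0 keeps the information that the wine was unrated, which follows the hint that missingness itself can be predictive.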

Model Building

We now start building the models.

First, we split df_clean into a training set (80%) and a testing set (20%).
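One simple way to make such a split is random sampling of row indices, sketched below in base R. The stand-in `df_clean` and the seed are assumptions; the report does not show which function was actually used.

```r
set.seed(123)  # fix the seed so the split is reproducible

# Stand-in for the cleaned data frame (hypothetical values).
df_clean <- data.frame(TARGET = rpois(1000, 3), x = rnorm(1000))

# Sample 80% of the row indices for training; the rest is held out for testing.
train_idx   <- sample(seq_len(nrow(df_clean)), size = floor(0.8 * nrow(df_clean)))
training_df <- df_clean[train_idx, ]
testing_df  <- df_clean[-train_idx, ]

dim(training_df)
dim(testing_df)
```

Applied to 12,795 rows, the same rule gives 10,236 training rows, which matches the 10,235 null-deviance degrees of freedom reported in the model summaries.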

Model 1

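In the first model we use a generalized linear model (glm) with the Poisson family and include all the features. LabelAppeal, AcidIndex, and STARS enter as factors. Because they were transformed during preprocessing, their levels appear in the output as scaled values rather than the original integers.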

## 
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), family = poisson, 
##     data = training_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2254  -0.6505  -0.0054   0.4448   3.6717  
## 
## Coefficients:
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               4.335e-01  4.496e-01   0.964 0.334905
## FixedAcidity                              7.682e-04  5.771e-03   0.133 0.894099
## VolatileAcidity                          -2.170e-02  5.738e-03  -3.782 0.000155
## CitricAcid                                2.702e-04  5.662e-03   0.048 0.961932
## ResidualSugar                             2.304e-03  5.820e-03   0.396 0.692234
## Chlorides                                -1.277e-02  5.871e-03  -2.175 0.029608
## FreeSulfurDioxide                         1.309e-02  5.775e-03   2.267 0.023378
## TotalSulfurDioxide                        1.792e-02  5.852e-03   3.063 0.002190
## Density                                  -1.038e-02  5.697e-03  -1.821 0.068587
## pH                                       -6.798e-03  5.829e-03  -1.166 0.243577
## Sulphates                                -1.048e-02  5.932e-03  -1.767 0.077181
## Alcohol                                   1.788e-02  5.903e-03   3.028 0.002459
## as.factor(LabelAppeal)-1.11204793733397   2.201e-01  4.277e-02   5.147 2.65e-07
## as.factor(LabelAppeal)0.0101741115806247  4.108e-01  4.172e-02   9.844  < 2e-16
## as.factor(LabelAppeal)1.13239616049522    5.502e-01  4.245e-02  12.960  < 2e-16
## as.factor(LabelAppeal)2.25461820940981    6.842e-01  4.770e-02  14.346  < 2e-16
## as.factor(AcidIndex)-3.59682937695875    -5.236e-01  4.532e-01  -1.155 0.247994
## as.factor(AcidIndex)-1.79176983045029    -5.106e-01  4.484e-01  -1.139 0.254765
## as.factor(AcidIndex)-0.545318540973785   -5.436e-01  4.481e-01  -1.213 0.225112
## as.factor(AcidIndex)0.362910765511677    -5.757e-01  4.481e-01  -1.285 0.198947
## as.factor(AcidIndex)1.05172974217783     -6.939e-01  4.484e-01  -1.547 0.121782
## as.factor(AcidIndex)1.59059728918163     -8.152e-01  4.493e-01  -1.814 0.069633
## as.factor(AcidIndex)2.02271372429848     -1.197e+00  4.526e-01  -2.644 0.008201
## as.factor(AcidIndex)2.37629509167962     -1.330e+00  4.591e-01  -2.898 0.003755
## as.factor(AcidIndex)2.67051656830802     -1.057e+00  4.610e-01  -2.292 0.021882
## as.factor(AcidIndex)2.9188445277671      -1.184e+00  4.728e-01  -2.505 0.012236
## as.factor(AcidIndex)3.13100139587667     -7.599e-01  5.330e-01  -1.426 0.153969
## as.factor(AcidIndex)3.31417429494859     -1.347e+01  1.626e+02  -0.083 0.933945
## as.factor(AcidIndex)3.47378568897179     -1.484e+00  6.335e-01  -2.342 0.019185
## as.factor(STARS)-0.42623524866846         7.568e-01  2.193e-02  34.514  < 2e-16
## as.factor(STARS)0.416552574962037         1.071e+00  2.050e-02  52.234  < 2e-16
## as.factor(STARS)1.25934039859254          1.186e+00  2.163e-02  54.804  < 2e-16
## as.factor(STARS)2.10212822222303          1.310e+00  2.724e-02  48.067  < 2e-16
##                                             
## (Intercept)                                 
## FixedAcidity                                
## VolatileAcidity                          ***
## CitricAcid                                  
## ResidualSugar                               
## Chlorides                                *  
## FreeSulfurDioxide                        *  
## TotalSulfurDioxide                       ** 
## Density                                  .  
## pH                                          
## Sulphates                                .  
## Alcohol                                  ** 
## as.factor(LabelAppeal)-1.11204793733397  ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522   ***
## as.factor(LabelAppeal)2.25461820940981   ***
## as.factor(AcidIndex)-3.59682937695875       
## as.factor(AcidIndex)-1.79176983045029       
## as.factor(AcidIndex)-0.545318540973785      
## as.factor(AcidIndex)0.362910765511677       
## as.factor(AcidIndex)1.05172974217783        
## as.factor(AcidIndex)1.59059728918163     .  
## as.factor(AcidIndex)2.02271372429848     ** 
## as.factor(AcidIndex)2.37629509167962     ** 
## as.factor(AcidIndex)2.67051656830802     *  
## as.factor(AcidIndex)2.9188445277671      *  
## as.factor(AcidIndex)3.13100139587667        
## as.factor(AcidIndex)3.31417429494859        
## as.factor(AcidIndex)3.47378568897179     *  
## as.factor(STARS)-0.42623524866846        ***
## as.factor(STARS)0.416552574962037        ***
## as.factor(STARS)1.25934039859254         ***
## as.factor(STARS)2.10212822222303         ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18300  on 10235  degrees of freedom
## Residual deviance: 10887  on 10203  degrees of freedom
## AIC: 36509
## 
## Number of Fisher Scoring iterations: 10
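To help read a summary like the one above, the sketch below fits a small Poisson model on simulated data, converts the coefficients to rate ratios, and computes a Pearson dispersion estimate. The data frame `toy` and its coefficients are assumptions for illustration. In the output above, a coefficient of 1.310 for the highest STARS level corresponds to roughly exp(1.310) ≈ 3.7 times the expected count of the reference level, and the residual deviance per degree of freedom (10887 / 10203 ≈ 1.07) is close to 1, which suggests little overdispersion.

```r
set.seed(7)
n <- 2000

# Simulated predictors and a Poisson response (hypothetical coefficients).
stars <- sample(0:4, n, replace = TRUE)
acid  <- rnorm(n)
toy <- data.frame(
  TARGET          = rpois(n, lambda = exp(0.3 + 0.25 * stars - 0.05 * acid)),
  STARS           = stars,
  VolatileAcidity = acid
)

# Poisson regression with a log link; STARS is treated as a factor.
fit_pois <- glm(TARGET ~ VolatileAcidity + as.factor(STARS),
                family = poisson, data = toy)

# Coefficients are on the log scale; exponentiate to get rate ratios.
rate_ratios <- exp(coef(fit_pois))
print(round(rate_ratios, 3))

# Pearson dispersion estimate: values near 1 are consistent with Poisson variance.
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
print(dispersion)
```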

Model 2

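For Model 2 we apply stepAIC to Model 1. stepAIC adds or drops terms one at a time until the AIC stops improving, so it selects by AIC rather than by significance. Here it removes CitricAcid, FixedAcidity, ResidualSugar, and pH.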

## Start:  AIC=36509.07
## TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
##     Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
##     pH + Sulphates + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) + 
##     as.factor(STARS)
## 
##                          Df Deviance   AIC
## - CitricAcid              1    10887 36507
## - FixedAcidity            1    10887 36507
## - ResidualSugar           1    10887 36507
## - pH                      1    10888 36508
## <none>                         10887 36509
## - Sulphates               1    10890 36510
## - Density                 1    10890 36510
## - Chlorides               1    10892 36512
## - FreeSulfurDioxide       1    10892 36512
## - Alcohol                 1    10896 36516
## - TotalSulfurDioxide      1    10896 36516
## - VolatileAcidity         1    10901 36521
## - as.factor(AcidIndex)   13    11234 36831
## - as.factor(LabelAppeal)  4    11448 37063
## - as.factor(STARS)        4    15384 40998
## 
## Step:  AIC=36507.07
## TARGET ~ FixedAcidity + VolatileAcidity + ResidualSugar + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) + 
##     as.factor(STARS)
## 
##                          Df Deviance   AIC
## - FixedAcidity            1    10887 36505
## - ResidualSugar           1    10887 36505
## - pH                      1    10888 36506
## <none>                         10887 36507
## - Sulphates               1    10890 36508
## - Density                 1    10890 36508
## + CitricAcid              1    10887 36509
## - Chlorides               1    10892 36510
## - FreeSulfurDioxide       1    10892 36510
## - Alcohol                 1    10896 36514
## - TotalSulfurDioxide      1    10896 36514
## - VolatileAcidity         1    10901 36519
## - as.factor(AcidIndex)   13    11235 36829
## - as.factor(LabelAppeal)  4    11449 37061
## - as.factor(STARS)        4    15386 40998
## 
## Step:  AIC=36505.09
## TARGET ~ VolatileAcidity + ResidualSugar + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
##     as.factor(LabelAppeal) + as.factor(AcidIndex) + as.factor(STARS)
## 
##                          Df Deviance   AIC
## - ResidualSugar           1    10887 36503
## - pH                      1    10888 36504
## <none>                         10887 36505
## - Sulphates               1    10890 36506
## - Density                 1    10890 36506
## + FixedAcidity            1    10887 36507
## + CitricAcid              1    10887 36507
## - Chlorides               1    10892 36508
## - FreeSulfurDioxide       1    10892 36508
## - Alcohol                 1    10896 36512
## - TotalSulfurDioxide      1    10896 36512
## - VolatileAcidity         1    10901 36517
## - as.factor(AcidIndex)   13    11241 36833
## - as.factor(LabelAppeal)  4    11449 37059
## - as.factor(STARS)        4    15386 40996
## 
## Step:  AIC=36503.24
## TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS)
## 
##                          Df Deviance   AIC
## - pH                      1    10888 36503
## <none>                         10887 36503
## - Sulphates               1    10890 36504
## - Density                 1    10890 36505
## + ResidualSugar           1    10887 36505
## + FixedAcidity            1    10887 36505
## + CitricAcid              1    10887 36505
## - Chlorides               1    10892 36506
## - FreeSulfurDioxide       1    10892 36506
## - Alcohol                 1    10896 36510
## - TotalSulfurDioxide      1    10897 36511
## - VolatileAcidity         1    10902 36516
## - as.factor(AcidIndex)   13    11242 36832
## - as.factor(LabelAppeal)  4    11449 37057
## - as.factor(STARS)        4    15387 40995
## 
## Step:  AIC=36502.59
## TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS)
## 
##                          Df Deviance   AIC
## <none>                         10888 36503
## + pH                      1    10887 36503
## - Sulphates               1    10892 36504
## - Density                 1    10892 36504
## + ResidualSugar           1    10888 36504
## + FixedAcidity            1    10888 36505
## + CitricAcid              1    10888 36505
## - Chlorides               1    10893 36505
## - FreeSulfurDioxide       1    10894 36506
## - Alcohol                 1    10898 36510
## - TotalSulfurDioxide      1    10898 36510
## - VolatileAcidity         1    10903 36515
## - as.factor(AcidIndex)   13    11242 36830
## - as.factor(LabelAppeal)  4    11450 37056
## - as.factor(STARS)        4    15394 41000
## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), family = poisson, 
##     data = training_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2383  -0.6493  -0.0062   0.4429   3.6689  
## 
## Coefficients:
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                                0.429534   0.449207   0.956 0.338968
## VolatileAcidity                           -0.021755   0.005736  -3.792 0.000149
## Chlorides                                 -0.012601   0.005868  -2.147 0.031765
## FreeSulfurDioxide                          0.013217   0.005774   2.289 0.022075
## TotalSulfurDioxide                         0.018007   0.005848   3.079 0.002077
## Density                                   -0.010443   0.005696  -1.834 0.066721
## Sulphates                                 -0.010458   0.005930  -1.764 0.077808
## Alcohol                                    0.017944   0.005900   3.042 0.002354
## as.factor(LabelAppeal)-1.11204793733397    0.220335   0.042769   5.152 2.58e-07
## as.factor(LabelAppeal)0.0101741115806247   0.410655   0.041722   9.843  < 2e-16
## as.factor(LabelAppeal)1.13239616049522     0.550258   0.042448  12.963  < 2e-16
## as.factor(LabelAppeal)2.25461820940981     0.683988   0.047692  14.342  < 2e-16
## as.factor(AcidIndex)-3.59682937695875     -0.521321   0.452873  -1.151 0.249674
## as.factor(AcidIndex)-1.79176983045029     -0.508192   0.448002  -1.134 0.256646
## as.factor(AcidIndex)-0.545318540973785    -0.540219   0.447745  -1.207 0.227612
## as.factor(AcidIndex)0.362910765511677     -0.571580   0.447757  -1.277 0.201765
## as.factor(AcidIndex)1.05172974217783      -0.689490   0.447998  -1.539 0.123793
## as.factor(AcidIndex)1.59059728918163      -0.810650   0.448852  -1.806 0.070910
## as.factor(AcidIndex)2.02271372429848      -1.192447   0.452154  -2.637 0.008358
## as.factor(AcidIndex)2.37629509167962      -1.326170   0.458580  -2.892 0.003829
## as.factor(AcidIndex)2.67051656830802      -1.051957   0.460499  -2.284 0.022349
## as.factor(AcidIndex)2.9188445277671       -1.180235   0.472275  -2.499 0.012453
## as.factor(AcidIndex)3.13100139587667      -0.754149   0.532630  -1.416 0.156805
## as.factor(AcidIndex)3.31417429494859     -13.466519 162.534380  -0.083 0.933968
## as.factor(AcidIndex)3.47378568897179      -1.478336   0.632950  -2.336 0.019511
## as.factor(STARS)-0.42623524866846          0.757022   0.021925  34.528  < 2e-16
## as.factor(STARS)0.416552574962037          1.071567   0.020496  52.281  < 2e-16
## as.factor(STARS)1.25934039859254           1.186170   0.021629  54.841  < 2e-16
## as.factor(STARS)2.10212822222303           1.309666   0.027240  48.078  < 2e-16
##                                             
## (Intercept)                                 
## VolatileAcidity                          ***
## Chlorides                                *  
## FreeSulfurDioxide                        *  
## TotalSulfurDioxide                       ** 
## Density                                  .  
## Sulphates                                .  
## Alcohol                                  ** 
## as.factor(LabelAppeal)-1.11204793733397  ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522   ***
## as.factor(LabelAppeal)2.25461820940981   ***
## as.factor(AcidIndex)-3.59682937695875       
## as.factor(AcidIndex)-1.79176983045029       
## as.factor(AcidIndex)-0.545318540973785      
## as.factor(AcidIndex)0.362910765511677       
## as.factor(AcidIndex)1.05172974217783        
## as.factor(AcidIndex)1.59059728918163     .  
## as.factor(AcidIndex)2.02271372429848     ** 
## as.factor(AcidIndex)2.37629509167962     ** 
## as.factor(AcidIndex)2.67051656830802     *  
## as.factor(AcidIndex)2.9188445277671      *  
## as.factor(AcidIndex)3.13100139587667        
## as.factor(AcidIndex)3.31417429494859        
## as.factor(AcidIndex)3.47378568897179     *  
## as.factor(STARS)-0.42623524866846        ***
## as.factor(STARS)0.416552574962037        ***
## as.factor(STARS)1.25934039859254         ***
## as.factor(STARS)2.10212822222303         ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18300  on 10235  degrees of freedom
## Residual deviance: 10889  on 10207  degrees of freedom
## AIC: 36503
## 
## Number of Fisher Scoring iterations: 10
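The stepwise search can be reproduced in miniature with MASS::stepAIC, as sketched below on simulated data with two informative and two pure-noise predictors. The variable names are assumptions. The `+` rows in the trace above indicate that terms could be re-added, which is what `direction = "both"` does.

```r
library(MASS)

set.seed(11)
n <- 1500

# Two predictors that matter and two that are pure noise (hypothetical data).
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
toy$TARGET <- rpois(n, lambda = exp(0.5 + 0.4 * toy$x1 - 0.3 * toy$x2))

# Start from the full model, then let stepAIC drop or re-add terms by AIC.
full_fit <- glm(TARGET ~ x1 + x2 + noise1 + noise2, family = poisson, data = toy)
step_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)

print(formula(step_fit))
print(c(full = AIC(full_fit), step = AIC(step_fit)))
```

Setting `trace = FALSE` suppresses the step-by-step tables; leaving it at the default prints a trace like the one above.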

Model 3

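For Model 3 we switch to a negative binomial model (glm.nb). As the call below shows, it is fitted on a reduced set of predictors: VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, and STARS. The estimated theta is very large (about 40,382) and comes with a warning that the alternation limit was reached, so the fit behaves almost like a Poisson model.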

## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) + 
##     as.factor(STARS), data = training_df, init.theta = 40382.20944, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2483  -0.6519  -0.0011   0.4398   3.6855  
## 
## Coefficients:
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               4.316e-01  4.492e-01   0.961 0.336574
## VolatileAcidity                          -2.200e-02  5.737e-03  -3.835 0.000126
## FreeSulfurDioxide                         1.324e-02  5.769e-03   2.296 0.021688
## TotalSulfurDioxide                        1.781e-02  5.846e-03   3.047 0.002313
## Alcohol                                   1.819e-02  5.898e-03   3.083 0.002046
## as.factor(LabelAppeal)-1.11204793733397   2.204e-01  4.277e-02   5.154 2.55e-07
## as.factor(LabelAppeal)0.0101741115806247  4.111e-01  4.172e-02   9.854  < 2e-16
## as.factor(LabelAppeal)1.13239616049522    5.501e-01  4.245e-02  12.960  < 2e-16
## as.factor(LabelAppeal)2.25461820940981    6.831e-01  4.769e-02  14.323  < 2e-16
## as.factor(AcidIndex)-3.59682937695875    -5.203e-01  4.528e-01  -1.149 0.250608
## as.factor(AcidIndex)-1.79176983045029    -5.102e-01  4.480e-01  -1.139 0.254789
## as.factor(AcidIndex)-0.545318540973785   -5.423e-01  4.477e-01  -1.211 0.225791
## as.factor(AcidIndex)0.362910765511677    -5.744e-01  4.477e-01  -1.283 0.199516
## as.factor(AcidIndex)1.05172974217783     -6.939e-01  4.480e-01  -1.549 0.121397
## as.factor(AcidIndex)1.59059728918163     -8.164e-01  4.488e-01  -1.819 0.068900
## as.factor(AcidIndex)2.02271372429848     -1.199e+00  4.521e-01  -2.652 0.008002
## as.factor(AcidIndex)2.37629509167962     -1.332e+00  4.585e-01  -2.905 0.003674
## as.factor(AcidIndex)2.67051656830802     -1.058e+00  4.605e-01  -2.298 0.021577
## as.factor(AcidIndex)2.9188445277671      -1.184e+00  4.722e-01  -2.508 0.012145
## as.factor(AcidIndex)3.13100139587667     -7.431e-01  5.326e-01  -1.395 0.162964
## as.factor(AcidIndex)3.31417429494859     -3.814e+01  3.773e+07   0.000 0.999999
## as.factor(AcidIndex)3.47378568897179     -1.510e+00  6.328e-01  -2.386 0.017039
## as.factor(STARS)-0.42623524866846         7.580e-01  2.192e-02  34.575  < 2e-16
## as.factor(STARS)0.416552574962037         1.073e+00  2.049e-02  52.332  < 2e-16
## as.factor(STARS)1.25934039859254          1.188e+00  2.162e-02  54.951  < 2e-16
## as.factor(STARS)2.10212822222303          1.311e+00  2.724e-02  48.142  < 2e-16
##                                             
## (Intercept)                                 
## VolatileAcidity                          ***
## FreeSulfurDioxide                        *  
## TotalSulfurDioxide                       ** 
## Alcohol                                  ** 
## as.factor(LabelAppeal)-1.11204793733397  ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522   ***
## as.factor(LabelAppeal)2.25461820940981   ***
## as.factor(AcidIndex)-3.59682937695875       
## as.factor(AcidIndex)-1.79176983045029       
## as.factor(AcidIndex)-0.545318540973785      
## as.factor(AcidIndex)0.362910765511677       
## as.factor(AcidIndex)1.05172974217783        
## as.factor(AcidIndex)1.59059728918163     .  
## as.factor(AcidIndex)2.02271372429848     ** 
## as.factor(AcidIndex)2.37629509167962     ** 
## as.factor(AcidIndex)2.67051656830802     *  
## as.factor(AcidIndex)2.9188445277671      *  
## as.factor(AcidIndex)3.13100139587667        
## as.factor(AcidIndex)3.31417429494859        
## as.factor(AcidIndex)3.47378568897179     *  
## as.factor(STARS)-0.42623524866846        ***
## as.factor(STARS)0.416552574962037        ***
## as.factor(STARS)1.25934039859254         ***
## as.factor(STARS)2.10212822222303         ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40382.19) family taken to be 1)
## 
##     Null deviance: 18299  on 10235  degrees of freedom
## Residual deviance: 10899  on 10210  degrees of freedom
## AIC: 36510
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40382 
##           Std. Err.:  37785 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -36456.02
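For contrast with the fit above, the sketch below shows what glm.nb returns when the counts really are overdispersed. In the glm.nb parameterization the variance is mu + mu^2 / theta, so a small theta means strong overdispersion and a very large theta approaches the Poisson case. The simulated data and the true theta of 2 are assumptions for illustration.

```r
library(MASS)

set.seed(21)
n <- 2000
x  <- rnorm(n)
mu <- exp(1 + 0.3 * x)

# Overdispersed counts: negative binomial with size (theta) = 2.
toy <- data.frame(x = x, y_over = rnbinom(n, mu = mu, size = 2))

# Fit both families to the same data.
fit_nb   <- glm.nb(y_over ~ x, data = toy)
fit_pois <- glm(y_over ~ x, family = poisson, data = toy)

# Estimated theta should land near the simulated value.
print(fit_nb$theta)

# With real overdispersion, the negative binomial model has the lower AIC.
print(c(poisson = AIC(fit_pois), negbin = AIC(fit_nb)))
```

When theta is estimated in the tens of thousands, as in the summary above, the extra variance term is negligible and the two families give nearly the same coefficients.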

Model 4

For Model 4 we run stepAIC on the negative binomial model. No term is dropped, so Model 4 is identical to Model 3.

## Start:  AIC=36508.02
## TARGET ~ VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) + 
##     as.factor(STARS)
## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) + 
##     as.factor(STARS), data = training_df, init.theta = 40382.20944, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2483  -0.6519  -0.0011   0.4398   3.6855  
## 
## Coefficients:
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               4.316e-01  4.492e-01   0.961 0.336574
## VolatileAcidity                          -2.200e-02  5.737e-03  -3.835 0.000126
## FreeSulfurDioxide                         1.324e-02  5.769e-03   2.296 0.021688
## TotalSulfurDioxide                        1.781e-02  5.846e-03   3.047 0.002313
## Alcohol                                   1.819e-02  5.898e-03   3.083 0.002046
## as.factor(LabelAppeal)-1.11204793733397   2.204e-01  4.277e-02   5.154 2.55e-07
## as.factor(LabelAppeal)0.0101741115806247  4.111e-01  4.172e-02   9.854  < 2e-16
## as.factor(LabelAppeal)1.13239616049522    5.501e-01  4.245e-02  12.960  < 2e-16
## as.factor(LabelAppeal)2.25461820940981    6.831e-01  4.769e-02  14.323  < 2e-16
## as.factor(AcidIndex)-3.59682937695875    -5.203e-01  4.528e-01  -1.149 0.250608
## as.factor(AcidIndex)-1.79176983045029    -5.102e-01  4.480e-01  -1.139 0.254789
## as.factor(AcidIndex)-0.545318540973785   -5.423e-01  4.477e-01  -1.211 0.225791
## as.factor(AcidIndex)0.362910765511677    -5.744e-01  4.477e-01  -1.283 0.199516
## as.factor(AcidIndex)1.05172974217783     -6.939e-01  4.480e-01  -1.549 0.121397
## as.factor(AcidIndex)1.59059728918163     -8.164e-01  4.488e-01  -1.819 0.068900
## as.factor(AcidIndex)2.02271372429848     -1.199e+00  4.521e-01  -2.652 0.008002
## as.factor(AcidIndex)2.37629509167962     -1.332e+00  4.585e-01  -2.905 0.003674
## as.factor(AcidIndex)2.67051656830802     -1.058e+00  4.605e-01  -2.298 0.021577
## as.factor(AcidIndex)2.9188445277671      -1.184e+00  4.722e-01  -2.508 0.012145
## as.factor(AcidIndex)3.13100139587667     -7.431e-01  5.326e-01  -1.395 0.162964
## as.factor(AcidIndex)3.31417429494859     -3.814e+01  3.773e+07   0.000 0.999999
## as.factor(AcidIndex)3.47378568897179     -1.510e+00  6.328e-01  -2.386 0.017039
## as.factor(STARS)-0.42623524866846         7.580e-01  2.192e-02  34.575  < 2e-16
## as.factor(STARS)0.416552574962037         1.073e+00  2.049e-02  52.332  < 2e-16
## as.factor(STARS)1.25934039859254          1.188e+00  2.162e-02  54.951  < 2e-16
## as.factor(STARS)2.10212822222303          1.311e+00  2.724e-02  48.142  < 2e-16
##                                             
## (Intercept)                                 
## VolatileAcidity                          ***
## FreeSulfurDioxide                        *  
## TotalSulfurDioxide                       ** 
## Alcohol                                  ** 
## as.factor(LabelAppeal)-1.11204793733397  ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522   ***
## as.factor(LabelAppeal)2.25461820940981   ***
## as.factor(AcidIndex)-3.59682937695875       
## as.factor(AcidIndex)-1.79176983045029       
## as.factor(AcidIndex)-0.545318540973785      
## as.factor(AcidIndex)0.362910765511677       
## as.factor(AcidIndex)1.05172974217783        
## as.factor(AcidIndex)1.59059728918163     .  
## as.factor(AcidIndex)2.02271372429848     ** 
## as.factor(AcidIndex)2.37629509167962     ** 
## as.factor(AcidIndex)2.67051656830802     *  
## as.factor(AcidIndex)2.9188445277671      *  
## as.factor(AcidIndex)3.13100139587667        
## as.factor(AcidIndex)3.31417429494859        
## as.factor(AcidIndex)3.47378568897179     *  
## as.factor(STARS)-0.42623524866846        ***
## as.factor(STARS)0.416552574962037        ***
## as.factor(STARS)1.25934039859254         ***
## as.factor(STARS)2.10212822222303         ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40382.19) family taken to be 1)
## 
##     Null deviance: 18299  on 10235  degrees of freedom
## Residual deviance: 10899  on 10210  degrees of freedom
## AIC: 36510
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40382 
##           Std. Err.:  37785 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -36456.02

Model selection

Among the four models, Poisson 2 fares best. It has the lowest AIC (36502.59), and its MSE (6.760) is practically the same as that of Poisson 1 and lower than that of both negative binomial models (7.118). We therefore choose Poisson Model 2 as our best model.

Metric          Poisson 1      Poisson 2  Neg. binomial 1  Neg. binomial 2
MSE              6.759896       6.760038         7.117969         7.117969
Predictors             33             29               26               26
AIC          36509.069767   36502.586471     36510.019636     36510.019636
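A comparison table of this kind can be assembled as in the sketch below, which uses two simulated Poisson models rather than the four models from this report. The helper `mse` and the model names are hypothetical. Note that predict() for a glm returns values on the link (log) scale by default, so `type = "response"` is needed to compare predictions with observed counts.

```r
set.seed(5)
n <- 2000

# Simulated data with one informative predictor (hypothetical coefficients).
toy <- data.frame(x = rnorm(n))
toy$TARGET <- rpois(n, lambda = exp(0.8 + 0.5 * toy$x))

# 80/20 split into training and held-out test rows.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Two candidate models fitted on the training rows only.
models <- list(
  intercept_only = glm(TARGET ~ 1, family = poisson, data = train),
  with_x         = glm(TARGET ~ x, family = poisson, data = train)
)

# Mean squared error on held-out data, with predictions on the count scale.
mse <- function(fit, newdata) {
  pred <- predict(fit, newdata = newdata, type = "response")
  mean((newdata$TARGET - pred)^2)
}

comparison <- data.frame(
  MSE          = sapply(models, mse, newdata = test),
  Coefficients = sapply(models, function(m) length(coef(m))),
  AIC          = sapply(models, AIC)
)
print(comparison)
```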