Overview

In this homework assignment, you will explore, analyze, and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely a wine is to be sold at a high-end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust its wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided).
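The hint about missingness can be made concrete by deriving a 0/1 indicator for each variable that has missing values. The following is a minimal sketch on made-up toy data; the helper name `add_missing_flags` and the `_missing` suffix are illustrative choices, not part of the assignment or the data set.

```r
# Toy data standing in for the wine data; the values are made up for illustration.
wine <- data.frame(
  TARGET = c(3, 0, 4, 0, 5),
  STARS  = c(2, NA, 3, NA, 4),
  pH     = c(3.3, 3.4, NA, 3.1, 3.2)
)

# Add a 0/1 indicator for every column that contains at least one NA.
# The "_missing" suffix is a naming choice made here, not a data set convention.
add_missing_flags <- function(df) {
  has_na <- names(df)[colSums(is.na(df)) > 0]
  for (col in has_na) {
    df[[paste0(col, "_missing")]] <- as.integer(is.na(df[[col]]))
  }
  df
}

wine_flagged <- add_missing_flags(wine)
print(wine_flagged)
```

The flags are derived only from the variables provided, so they stay within the rules of the assignment, and they can be kept as predictors even after the original columns are imputed.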

Data Exploration & Preparation

## 'data.frame':    12795 obs. of  15 variables:
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...
##      TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
##  Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
##  1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
##  Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
##  Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
##  3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
##  Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
##                                                                       
##  ResidualSugar        Chlorides       FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00   Min.   :-823.0    
##  1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00   1st Qu.:  27.0    
##  Median :   3.900   Median : 0.0460   Median :  30.00   Median : 123.0    
##  Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85   Mean   : 120.7    
##  3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00   3rd Qu.: 208.0    
##  Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00   Max.   :1057.0    
##  NA's   :616        NA's   :638       NA's   :647       NA's   :682       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
##  1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
##  Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
##  Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
##  Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
##                   NA's   :395     NA's   :1210      NA's   :653    
##   LabelAppeal          AcidIndex          STARS      
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##                                       NA's   :3359
##                    vars     n   mean     sd median trimmed    mad     min
## TARGET                1 12795   3.03   1.93   3.00    3.05   1.48    0.00
## FixedAcidity          2 12795   7.08   6.32   6.90    7.07   3.26  -18.10
## VolatileAcidity       3 12795   0.32   0.78   0.28    0.32   0.43   -2.79
## CitricAcid            4 12795   0.31   0.86   0.31    0.31   0.42   -3.24
## ResidualSugar         5 12179   5.42  33.75   3.90    5.58  15.72 -127.80
## Chlorides             6 12157   0.05   0.32   0.05    0.05   0.13   -1.17
## FreeSulfurDioxide     7 12148  30.85 148.71  30.00   30.93  56.34 -555.00
## TotalSulfurDioxide    8 12113 120.71 231.91 123.00  120.89 134.92 -823.00
## Density               9 12795   0.99   0.03   0.99    0.99   0.01    0.89
## pH                   10 12400   3.21   0.68   3.20    3.21   0.39    0.48
## Sulphates            11 11585   0.53   0.93   0.50    0.53   0.44   -3.13
## Alcohol              12 12142  10.49   3.73  10.40   10.50   2.37   -4.70
## LabelAppeal          13 12795  -0.01   0.89   0.00   -0.01   1.48   -2.00
## AcidIndex            14 12795   7.77   1.32   8.00    7.64   1.48    4.00
## STARS                15  9436   2.04   0.90   2.00    1.97   1.48    1.00
##                        max   range  skew kurtosis   se
## TARGET                8.00    8.00 -0.33    -0.88 0.02
## FixedAcidity         34.40   52.50 -0.02     1.67 0.06
## VolatileAcidity       3.68    6.47  0.02     1.83 0.01
## CitricAcid            3.86    7.10 -0.05     1.84 0.01
## ResidualSugar       141.15  268.95 -0.05     1.88 0.31
## Chlorides             1.35    2.52  0.03     1.79 0.00
## FreeSulfurDioxide   623.00 1178.00  0.01     1.84 1.35
## TotalSulfurDioxide 1057.00 1880.00 -0.01     1.67 2.11
## Density               1.10    0.21 -0.02     1.90 0.00
## pH                    6.13    5.65  0.04     1.65 0.01
## Sulphates             4.24    7.37  0.01     1.75 0.01
## Alcohol              26.50   31.20 -0.03     1.54 0.03
## LabelAppeal           2.00    4.00  0.01    -0.26 0.01
## AcidIndex            17.00   13.00  1.65     5.19 0.01
## STARS                 4.00    3.00  0.45    -0.69 0.01

The data set has 12,795 rows and 15 variables describing wine characteristics: the response variable TARGET (number of cases purchased) and 14 potential predictors.

Graphic Exploration

## Warning: Removed 616 rows containing non-finite values (stat_bin).
## Warning: Removed 647 rows containing non-finite values (stat_bin).
## Warning: Removed 682 rows containing non-finite values (stat_bin).
## Warning: Removed 395 rows containing non-finite values (stat_bin).
## Warning: Removed 653 rows containing non-finite values (stat_bin).

The following variables seem to have a strong correlation with the response variable TARGET:

- Chlorides, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

The following variables seem to have a mild correlation with the response variable TARGET:

- FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, FreeSulfurDioxide, TotalSulfurDioxide

Explore correlations between predictors

##                           TARGET FixedAcidity VolatileAcidity
## TARGET              1.0000000000 -0.012538100   -0.0759978765
## FixedAcidity       -0.0125380998  1.000000000    0.0190109733
## VolatileAcidity    -0.0759978765  0.019010973    1.0000000000
## CitricAcid          0.0023450490  0.014000376   -0.0234315631
## ResidualSugar       0.0035195999 -0.015429391    0.0015279517
## Chlorides          -0.0304301331 -0.006104447    0.0148489225
## FreeSulfurDioxide   0.0226398054  0.015438463   -0.0114408079
## TotalSulfurDioxide  0.0216020726 -0.023323485   -0.0007434083
## Density            -0.0475989086  0.011574241    0.0130977690
## pH                  0.0002198557 -0.004553886    0.0072030364
## Sulphates          -0.0212203783  0.042229181    0.0015161001
## Alcohol             0.0737771084 -0.013085026    0.0002603082
## LabelAppeal         0.4979464796  0.011375965   -0.0202419713
## AcidIndex          -0.1676430648  0.154167846    0.0250529742
## STARS               0.5546857223 -0.004937345   -0.0402432388
##                       CitricAcid ResidualSugar     Chlorides
## TARGET              0.0023450490   0.003519600 -0.0304301331
## FixedAcidity        0.0140003760  -0.015429391 -0.0061044471
## VolatileAcidity    -0.0234315631   0.001527952  0.0148489225
## CitricAcid          1.0000000000  -0.009843146 -0.0335608661
## ResidualSugar      -0.0098431456   1.000000000  0.0041215692
## Chlorides          -0.0335608661   0.004121569  1.0000000000
## FreeSulfurDioxide   0.0121132485   0.021959113 -0.0204924876
## TotalSulfurDioxide -0.0099174506   0.017030939  0.0004188605
## Density            -0.0169919691  -0.007120841  0.0206724860
## pH                 -0.0007581304   0.017563769 -0.0179702278
## Sulphates          -0.0144237270  -0.002705775  0.0026187777
## Alcohol             0.0169864284  -0.018943324 -0.0228849573
## LabelAppeal         0.0153315666  -0.004579308 -0.0063870237
## AcidIndex           0.0545838104  -0.020301890 -0.0017134096
## STARS               0.0071401699   0.019665541 -0.0063242568
##                    FreeSulfurDioxide TotalSulfurDioxide      Density
## TARGET                   0.022639805       0.0216020726 -0.047598909
## FixedAcidity             0.015438463      -0.0233234848  0.011574241
## VolatileAcidity         -0.011440808      -0.0007434083  0.013097769
## CitricAcid               0.012113248      -0.0099174506 -0.016991969
## ResidualSugar            0.021959113       0.0170309394 -0.007120841
## Chlorides               -0.020492488       0.0004188605  0.020672486
## FreeSulfurDioxide        1.000000000       0.0134616726 -0.008663509
## TotalSulfurDioxide       0.013461673       1.0000000000  0.023167955
## Density                 -0.008663509       0.0231679548  1.000000000
## pH                      -0.002008516      -0.0034227601 -0.002019229
## Sulphates                0.026829029       0.0025040509 -0.010609294
## Alcohol                 -0.023867458      -0.0168515467 -0.006128355
## LabelAppeal              0.014960087      -0.0027237419 -0.018094403
## AcidIndex               -0.014733717      -0.0221292631  0.047778830
## STARS                   -0.015390398       0.0220949002 -0.028492455
##                               pH    Sulphates       Alcohol   LabelAppeal
## TARGET              0.0002198557 -0.021220378  0.0737771084  0.4979464796
## FixedAcidity       -0.0045538857  0.042229181 -0.0130850260  0.0113759650
## VolatileAcidity     0.0072030364  0.001516100  0.0002603082 -0.0202419713
## CitricAcid         -0.0007581304 -0.014423727  0.0169864284  0.0153315666
## ResidualSugar       0.0175637691 -0.002705775 -0.0189433242 -0.0045793083
## Chlorides          -0.0179702278  0.002618778 -0.0228849573 -0.0063870237
## FreeSulfurDioxide  -0.0020085157  0.026829029 -0.0238674577  0.0149600871
## TotalSulfurDioxide -0.0034227601  0.002504051 -0.0168515467 -0.0027237419
## Density            -0.0020192285 -0.010609294 -0.0061283546 -0.0180944026
## pH                  1.0000000000  0.010449255 -0.0122034469  0.0002181758
## Sulphates           0.0104492547  1.000000000  0.0108443299  0.0037686996
## Alcohol            -0.0122034469  0.010844330  1.0000000000 -0.0006449123
## LabelAppeal         0.0002181758  0.003768700 -0.0006449123  1.0000000000
## AcidIndex          -0.0537128921  0.031071782 -0.0558919056  0.0103009840
## STARS              -0.0044002985 -0.023135130  0.0648544864  0.3188970216
##                      AcidIndex        STARS
## TARGET             -0.16764306  0.554685722
## FixedAcidity        0.15416785 -0.004937345
## VolatileAcidity     0.02505297 -0.040243239
## CitricAcid          0.05458381  0.007140170
## ResidualSugar      -0.02030189  0.019665541
## Chlorides          -0.00171341 -0.006324257
## FreeSulfurDioxide  -0.01473372 -0.015390398
## TotalSulfurDioxide -0.02212926  0.022094900
## Density             0.04777883 -0.028492455
## pH                 -0.05371289 -0.004400299
## Sulphates           0.03107178 -0.023135130
## Alcohol            -0.05589191  0.064854486
## LabelAppeal         0.01030098  0.318897022
## AcidIndex           1.00000000 -0.095482582
## STARS              -0.09548258  1.000000000

The correlation matrix above is preliminary: it was computed after dropping rows with missing values, and many predictors have missing values. On that basis, STARS (0.55), LabelAppeal (0.50), and AcidIndex (-0.17) show the strongest correlation with TARGET, followed by VolatileAcidity (-0.08) and Alcohol (0.07). We will proceed with the data preparation tasks to handle missing values more carefully.
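How missing values are dropped affects which rows enter each correlation. The sketch below contrasts the two common options in base R's `cor()` on simulated toy data; the data frame `toy` and its values are assumptions for illustration, and the source does not show which option was actually used.

```r
set.seed(1)
n <- 200
toy <- data.frame(
  TARGET      = rpois(n, 3),
  LabelAppeal = sample(-2:2, n, replace = TRUE),
  STARS       = sample(c(1:4, NA), n, replace = TRUE)
)

# Listwise deletion: any row with an NA is dropped before computing correlations.
cor_complete <- cor(toy, use = "complete.obs")

# Pairwise deletion: each pair of columns uses all rows where both are observed.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

# Correlation of each predictor with TARGET, strongest absolute value first.
target_cor <- cor_pairwise["TARGET", setdiff(colnames(cor_pairwise), "TARGET")]
print(sort(abs(target_cor), decreasing = TRUE))
```

With listwise deletion, a column like STARS that is missing in about a quarter of the rows removes those rows from every correlation, including pairs that do not involve STARS.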

Data Preparation

Variable            Non_NAs   NAs  NA_Percent
TARGET                12795     0   0.0000000
FixedAcidity          12795     0   0.0000000
VolatileAcidity       12795     0   0.0000000
CitricAcid            12795     0   0.0000000
ResidualSugar         12179   616   0.0481438
Chlorides             12157   638   0.0498632
FreeSulfurDioxide     12148   647   0.0505666
TotalSulfurDioxide    12113   682   0.0533021
Density               12795     0   0.0000000
pH                    12400   395   0.0308714
Sulphates             11585  1210   0.0945682
Alcohol               12142   653   0.0510356
LabelAppeal           12795     0   0.0000000
AcidIndex             12795     0   0.0000000
STARS                  9436  3359   0.2625244
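A table of this shape can be built with `colSums()` and `colMeans()` on `is.na()`. This is a sketch on a small made-up data frame, not the code that produced the table; note that `NA_Percent` here is a proportion between 0 and 1, which matches the magnitudes in the table.

```r
# Toy data frame with a few missing values (made-up values).
toy <- data.frame(
  TARGET = c(3, 0, 4, 2),
  pH     = c(3.3, NA, 3.1, 3.2),
  STARS  = c(2, NA, NA, 4)
)

# One row per variable: observed count, missing count, share missing.
na_summary <- data.frame(
  Non_NAs    = colSums(!is.na(toy)),
  NAs        = colSums(is.na(toy)),
  NA_Percent = colMeans(is.na(toy))  # proportion in [0, 1], not a percentage
)
print(na_summary)
```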

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
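The log above has the shape of output from the `mice` package, with five iterations and a single imputation per run. A minimal sketch of such a call is shown below on simulated data; the imputation method (`"pmm"`), the seeds, and the toy columns are assumptions, since the source does not show the actual call. The block is skipped if `mice` is not installed.

```r
if (requireNamespace("mice", quietly = TRUE)) {
  set.seed(42)
  n <- 300
  toy <- data.frame(
    TARGET  = rpois(n, 3),
    Alcohol = rnorm(n, 10.5, 3.7),
    STARS   = sample(1:4, n, replace = TRUE)
  )
  # Knock out some values so there is something to impute.
  toy$Alcohol[sample(n, 20)] <- NA
  toy$STARS[sample(n, 60)]   <- NA

  # m = 1 and maxit = 5 mirror the "iter 1..5, imp 1" log;
  # method = "pmm" (predictive mean matching) is an assumption.
  imp <- mice::mice(toy, m = 1, maxit = 5, method = "pmm",
                    seed = 123, printFlag = FALSE)

  # Extract the single completed data set.
  toy_imputed <- mice::complete(imp, 1)
  print(colSums(is.na(toy_imputed)))
}
```

A single imputation (m = 1) treats the imputed values as if they were observed, so standard errors from models fitted on the completed data will tend to be somewhat optimistic.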

Build Models & Model Evaluation

Model 1 - Count (Poisson) Regression Model (No Imputations)

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine.train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2128  -0.2757   0.0647   0.3766   1.6981  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.608e+00  2.796e-01   5.750 8.90e-09 ***
## FixedAcidity        6.705e-04  1.177e-03   0.570  0.56901    
## VolatileAcidity    -2.750e-02  9.283e-03  -2.963  0.00305 ** 
## CitricAcid         -3.835e-03  8.519e-03  -0.450  0.65259    
## ResidualSugar       1.828e-05  2.152e-04   0.085  0.93232    
## Chlorides          -3.764e-02  2.314e-02  -1.627  0.10377    
## FreeSulfurDioxide   5.671e-05  4.892e-05   1.159  0.24630    
## TotalSulfurDioxide  2.230e-05  3.177e-05   0.702  0.48274    
## Density            -4.025e-01  2.749e-01  -1.464  0.14326    
## pH                  2.307e-04  1.085e-02   0.021  0.98303    
## Sulphates          -5.984e-03  7.973e-03  -0.751  0.45293    
## Alcohol             3.262e-03  2.004e-03   1.628  0.10360    
## LabelAppeal         1.730e-01  8.858e-03  19.530  < 2e-16 ***
## AcidIndex          -4.967e-02  6.666e-03  -7.451 9.28e-14 ***
## STARS               1.929e-01  8.328e-03  23.160  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4720.5  on 5143  degrees of freedom
## Residual deviance: 3242.8  on 5129  degrees of freedom
##   (5093 observations deleted due to missingness)
## AIC: 18545
## 
## Number of Fisher Scoring iterations: 5
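Poisson regression coefficients are on the log scale, so they are easier to read after exponentiating. The sketch below fits a Poisson GLM to simulated counts and converts the coefficients to multiplicative effects; the simulated data and the true coefficients (0.5, 0.15, 0.2) are assumptions chosen for illustration.

```r
set.seed(7)
n <- 1000
toy <- data.frame(
  LabelAppeal = sample(-2:2, n, replace = TRUE),
  STARS       = sample(1:4, n, replace = TRUE)
)
# Simulate counts from a known log-linear model so the fit can be checked.
toy$TARGET <- rpois(n, lambda = exp(0.5 + 0.15 * toy$LabelAppeal + 0.2 * toy$STARS))

fit <- glm(TARGET ~ LabelAppeal + STARS, family = poisson, data = toy)

# exp(coefficient) is the multiplicative change in the expected count
# for a one-unit increase in the predictor (an incidence rate ratio).
irr <- exp(coef(fit))
print(round(irr, 3))
```

Applied to Model 1, the STARS coefficient of 0.1929 corresponds to exp(0.1929), or about 1.21: each additional star is associated with roughly 21% more expected cases, holding the other predictors fixed.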

Model 2 - Count (Poisson) Regression Model (No Imputations and removing non-significant variables)

## 
## Call:
## glm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - 
##     Chlorides - FreeSulfurDioxide - TotalSulfurDioxide - Density - 
##     pH - Sulphates - Alcohol, family = poisson, data = wine.train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1898  -0.2777   0.0622   0.3764   1.6086  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.251442   0.054724  22.868  < 2e-16 ***
## VolatileAcidity -0.027581   0.009278  -2.973  0.00295 ** 
## LabelAppeal      0.173177   0.008853  19.562  < 2e-16 ***
## AcidIndex       -0.050616   0.006553  -7.724 1.13e-14 ***
## STARS            0.194208   0.008292  23.421  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4720.5  on 5143  degrees of freedom
## Residual deviance: 3253.1  on 5139  degrees of freedom
##   (5093 observations deleted due to missingness)
## AIC: 18535
## 
## Number of Fisher Scoring iterations: 5

Model 3 - Count (Poisson) Regression Model (With Imputations and only significant variables)

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + LabelAppeal + AcidIndex + 
##     STARS, family = poisson, data = wine.train2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0381  -0.6778   0.1239   0.6394   2.6618  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.201889   0.042133  28.526  < 2e-16 ***
## VolatileAcidity -0.043501   0.007274  -5.981 2.22e-09 ***
## LabelAppeal      0.143130   0.006779  21.113  < 2e-16 ***
## AcidIndex       -0.102810   0.004986 -20.621  < 2e-16 ***
## STARS            0.340243   0.006238  54.545  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18291  on 10236  degrees of freedom
## Residual deviance: 12832  on 10232  degrees of freedom
## AIC: 38400
## 
## Number of Fisher Scoring iterations: 5

Model 4 - Count (Negative Binomial) Regression Model (No Imputations)

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = wine.train1, init.theta = 138898.9965, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2127  -0.2757   0.0647   0.3766   1.6981  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.608e+00  2.796e-01   5.750 8.91e-09 ***
## FixedAcidity        6.705e-04  1.177e-03   0.570  0.56900    
## VolatileAcidity    -2.750e-02  9.283e-03  -2.963  0.00305 ** 
## CitricAcid         -3.835e-03  8.519e-03  -0.450  0.65259    
## ResidualSugar       1.828e-05  2.152e-04   0.085  0.93231    
## Chlorides          -3.764e-02  2.314e-02  -1.627  0.10378    
## FreeSulfurDioxide   5.671e-05  4.892e-05   1.159  0.24630    
## TotalSulfurDioxide  2.230e-05  3.177e-05   0.702  0.48275    
## Density            -4.025e-01  2.750e-01  -1.464  0.14326    
## pH                  2.307e-04  1.085e-02   0.021  0.98303    
## Sulphates          -5.984e-03  7.973e-03  -0.751  0.45293    
## Alcohol             3.262e-03  2.004e-03   1.628  0.10360    
## LabelAppeal         1.730e-01  8.858e-03  19.529  < 2e-16 ***
## AcidIndex          -4.967e-02  6.666e-03  -7.451 9.28e-14 ***
## STARS               1.929e-01  8.328e-03  23.160  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(138899) family taken to be 1)
## 
##     Null deviance: 4720.4  on 5143  degrees of freedom
## Residual deviance: 3242.7  on 5129  degrees of freedom
##   (5093 observations deleted due to missingness)
## AIC: 18547
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  138899 
##           Std. Err.:  259921 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -18515.07
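The negative binomial fit reproduces the Poisson coefficients of Model 1 almost exactly, and its Theta estimate is very large with an even larger standard error. As Theta grows, the negative binomial approaches the Poisson, so this pattern suggests there is little or no overdispersion for the extra parameter to absorb. A rough check, sketched below, is the ratio of residual deviance to residual degrees of freedom, using the figures reported in the Poisson summaries; values well above 1 would point to overdispersion. The ratio is only a rule of thumb, not a formal test.

```r
# Residual deviance and residual degrees of freedom as printed
# in the Poisson model summaries (Models 1 to 3).
reported <- data.frame(
  model    = c("Model 1", "Model 2", "Model 3"),
  deviance = c(3242.8, 3253.1, 12832),
  df       = c(5129, 5139, 10232)
)

# Ratio near 1: consistent with Poisson; well above 1: overdispersion.
reported$ratio <- reported$deviance / reported$df
print(reported)
```

The ratios for Models 1 and 2 are below 1 and the ratio for Model 3 is only modestly above 1, which is consistent with the iteration-limit warnings when estimating Theta.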

Model 5 - Count (Negative Binomial) Regression Model (No Imputations and removing non-significant variables)

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## 
## Call:
## glm.nb(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - 
##     Chlorides - FreeSulfurDioxide - TotalSulfurDioxide - Density - 
##     pH - Sulphates - Alcohol, data = wine.train1, init.theta = 138402.1806, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1898  -0.2777   0.0622   0.3764   1.6086  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.251443   0.054725  22.868  < 2e-16 ***
## VolatileAcidity -0.027581   0.009279  -2.973  0.00295 ** 
## LabelAppeal      0.173177   0.008853  19.562  < 2e-16 ***
## AcidIndex       -0.050616   0.006553  -7.724 1.13e-14 ***
## STARS            0.194209   0.008292  23.421  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(138402.2) family taken to be 1)
## 
##     Null deviance: 4720.4  on 5143  degrees of freedom
## Residual deviance: 3253.0  on 5139  degrees of freedom
##   (5093 observations deleted due to missingness)
## AIC: 18537
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  138402 
##           Std. Err.:  258834 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -18525.37

Model 6 - Count (Negative Binomial) Regression Model (With Imputations and only significant variables)

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + LabelAppeal + AcidIndex + 
##     STARS, data = wine.train2, init.theta = 48614.35988, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0380  -0.6778   0.1239   0.6394   2.6617  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.201895   0.042134  28.526  < 2e-16 ***
## VolatileAcidity -0.043502   0.007274  -5.981 2.22e-09 ***
## LabelAppeal      0.143130   0.006780  21.112  < 2e-16 ***
## AcidIndex       -0.102812   0.004986 -20.621  < 2e-16 ***
## STARS            0.340248   0.006238  54.543  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(48614.36) family taken to be 1)
## 
##     Null deviance: 18290  on 10236  degrees of freedom
## Residual deviance: 12831  on 10232  degrees of freedom
## AIC: 38402
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  48614 
##           Std. Err.:  62794 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38389.98

Model 7 - Linear Regression Model (With Imputations)

## 
## Call:
## lm(formula = TARGET ~ ., data = wine.train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5119 -0.9973  0.1659  1.0271  4.2662 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.936e+00  5.334e-01   7.380 1.71e-13 ***
## FixedAcidity        2.874e-04  2.253e-03   0.128  0.89847    
## VolatileAcidity    -1.245e-01  1.790e-02  -6.960 3.62e-12 ***
## CitricAcid          2.889e-02  1.628e-02   1.775  0.07598 .  
## ResidualSugar       4.461e-04  4.131e-04   1.080  0.28023    
## Chlorides          -1.963e-01  4.391e-02  -4.471 7.88e-06 ***
## FreeSulfurDioxide   2.881e-04  9.384e-05   3.070  0.00214 ** 
## TotalSulfurDioxide  2.285e-04  5.997e-05   3.810  0.00014 ***
## Density            -1.101e+00  5.254e-01  -2.096  0.03614 *  
## pH                 -3.884e-02  2.067e-02  -1.879  0.06030 .  
## Sulphates          -3.503e-02  1.516e-02  -2.310  0.02089 *  
## Alcohol             1.167e-02  3.775e-03   3.090  0.00201 ** 
## LabelAppeal         4.400e-01  1.642e-02  26.799  < 2e-16 ***
## AcidIndex          -2.509e-01  1.099e-02 -22.825  < 2e-16 ***
## STARS               1.158e+00  1.664e-02  69.584  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.415 on 10222 degrees of freedom
## Multiple R-squared:  0.4615, Adjusted R-squared:  0.4608 
## F-statistic: 625.8 on 14 and 10222 DF,  p-value: < 2.2e-16

Model 8 - Linear Regression Model (With Imputations and only significant variables)

## 
## Call:
## lm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - 
##     Density - pH - Sulphates, data = wine.train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4936 -1.0062  0.1739  1.0227  4.3350 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.701e+00  1.035e-01  26.108  < 2e-16 ***
## VolatileAcidity    -1.259e-01  1.790e-02  -7.037 2.09e-12 ***
## Chlorides          -1.978e-01  4.391e-02  -4.503 6.76e-06 ***
## FreeSulfurDioxide   2.860e-04  9.387e-05   3.047 0.002315 ** 
## TotalSulfurDioxide  2.316e-04  5.997e-05   3.862 0.000113 ***
## Alcohol             1.181e-02  3.775e-03   3.129 0.001762 ** 
## LabelAppeal         4.399e-01  1.642e-02  26.780  < 2e-16 ***
## AcidIndex          -2.502e-01  1.076e-02 -23.248  < 2e-16 ***
## STARS               1.160e+00  1.664e-02  69.682  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.416 on 10228 degrees of freedom
## Multiple R-squared:  0.4606, Adjusted R-squared:  0.4602 
## F-statistic:  1092 on 8 and 10228 DF,  p-value: < 2.2e-16

RMSE for Models 1 through 8, in that order:

## [1] 2.528024
## [1] 2.527991
## [1] 2.618705
## [1] 2.528024
## [1] 2.527991
## [1] 2.618704
## [1] 1.420752
## [1] 1.422984
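The source does not show how these values were computed. When comparing count models with linear models by RMSE, one thing worth confirming is that the GLM predictions are on the response scale: `predict()` on a `glm` object returns the linear predictor (the log of the expected count for a Poisson model) unless `type = "response"` is given. The sketch below illustrates the difference on simulated data; the `rmse` helper, the toy data, and the train/test split are assumptions.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(11)
n <- 1000
toy <- data.frame(STARS = sample(1:4, n, replace = TRUE))
toy$TARGET <- rpois(n, lambda = exp(0.3 + 0.3 * toy$STARS))
train <- toy[1:800, ]
test  <- toy[801:1000, ]

pois_fit <- glm(TARGET ~ STARS, family = poisson, data = train)
lin_fit  <- lm(TARGET ~ STARS, data = train)

# type = "response" returns expected counts; the glm default is the log (link) scale.
rmse_pois_response <- rmse(test$TARGET, predict(pois_fit, newdata = test, type = "response"))
rmse_pois_link     <- rmse(test$TARGET, predict(pois_fit, newdata = test))
rmse_lin           <- rmse(test$TARGET, predict(lin_fit, newdata = test))

print(c(poisson_response = rmse_pois_response,
        poisson_link     = rmse_pois_link,
        linear           = rmse_lin))
```

In this toy setup the link-scale RMSE is much larger than the response-scale RMSE, because log-scale predictions are compared directly with raw counts.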

Model Selection

Model 8 - Linear Regression with Imputation and only significant variables is selected. The two linear models have the lowest RMSE (about 1.42, compared with roughly 2.53 to 2.62 for the count models), and Model 8 is the simpler of the two because it keeps only the most significant predictors.

Model Predictions

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
## Warning: Number of logged events: 1
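Because the selected model is linear, its raw predictions are not guaranteed to be whole numbers or to stay within the range of TARGET seen in training (0 to 8). If integer case counts are needed, one option is to round and clamp the predictions, as sketched below. The helper `to_case_count` and the choice of bounds are assumptions, not a step shown in the source.

```r
# Round predictions to whole cases and clamp them to a plausible range.
# The default bounds 0 and 8 are the observed min and max of TARGET.
to_case_count <- function(pred, lower = 0, upper = 8) {
  as.integer(pmin(pmax(round(pred), lower), upper))
}

# Made-up raw predictions, including values outside the valid range.
raw_pred <- c(-0.4, 1.49, 2.4, 3.51, 9.2)
print(to_case_count(raw_pred))
```

Rounding changes the RMSE slightly, so any comparison between models should use the same post-processing for all of them.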