Objective

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales. Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Data Exploration

Load Data

train_df <- read.csv("https://raw.githubusercontent.com/tylerbaker01/data621/main/wine-training-data.csv")
test_df <- read.csv("https://raw.githubusercontent.com/tylerbaker01/data621/main/wine-evaluation-data.csv")
head(train_df)
##   ï..INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## 1        1      3          3.2           1.160      -0.98          54.2
## 2        2      3          4.5           0.160      -0.81          26.1
## 3        4      5          7.1           2.640      -0.88          14.8
## 4        5      3          5.7           0.385       0.04          18.8
## 5        6      4          8.0           0.330      -1.26           9.4
## 6        7      0         11.3           0.320       0.59           2.2
##   Chlorides FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
## 1    -0.567                NA                268 0.99280 3.33     -0.59     9.9
## 2    -0.425                15               -327 1.02792 3.38      0.70      NA
## 3     0.037               214                142 0.99518 3.12      0.48    22.0
## 4    -0.425                22                115 0.99640 2.24      1.83     6.2
## 5        NA              -167                108 0.99457 3.12      1.77    13.7
## 6     0.556               -37                 15 0.99940 3.20      1.29    15.4
##   LabelAppeal AcidIndex STARS
## 1           0         8     2
## 2          -1         7     3
## 3          -1         8     3
## 4          -1         6     1
## 5           0         9     2
## 6           0        11    NA

Stats

describe(train_df)
## train_df 
## 
##  16  Variables      12795  Observations
## --------------------------------------------------------------------------------
## ï..INDEX 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0    12795        1     8070     5378    804.7   1610.4 
##      .25      .50      .75      .90      .95 
##   4037.5   8110.0  12106.5  14515.6  15309.3 
## 
## lowest :     1     2     4     5     6, highest: 16120 16123 16127 16128 16129
## --------------------------------------------------------------------------------
## TARGET 
##        n  missing distinct     Info     Mean      Gmd 
##    12795        0        9    0.962    3.029    2.141 
## 
## lowest : 0 1 2 3 4, highest: 4 5 6 7 8
##                                                                 
## Value          0     1     2     3     4     5     6     7     8
## Frequency   2734   244  1091  2611  3177  2014   765   142    17
## Proportion 0.214 0.019 0.085 0.204 0.248 0.157 0.060 0.011 0.001
## --------------------------------------------------------------------------------
## FixedAcidity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0      470        1    7.076    6.688     -3.6     -1.2 
##      .25      .50      .75      .90      .95 
##      5.2      6.9      9.5     15.6     17.8 
## 
## lowest : -18.1 -18.0 -17.7 -17.5 -17.4, highest:  32.4  32.5  32.6  34.1  34.4
## --------------------------------------------------------------------------------
## VolatileAcidity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0      815        1   0.3241   0.8262   -1.023   -0.720 
##      .25      .50      .75      .90      .95 
##    0.130    0.280    0.640    1.350    1.640 
## 
## lowest : -2.790 -2.750 -2.745 -2.730 -2.720, highest:  3.500  3.550  3.565  3.590  3.680
## --------------------------------------------------------------------------------
## CitricAcid 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0      602        1   0.3084   0.9057    -1.16    -0.84 
##      .25      .50      .75      .90      .95 
##     0.03     0.31     0.58     1.43     1.79 
## 
## lowest : -3.24 -3.16 -3.10 -3.08 -3.06, highest:  3.63  3.68  3.70  3.77  3.86
## --------------------------------------------------------------------------------
## ResidualSugar 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12179      616     2077        1    5.419    35.31   -52.70   -39.66 
##      .25      .50      .75      .90      .95 
##    -2.00     3.90    15.90    49.72    62.70 
## 
## lowest : -127.80 -127.10 -126.20 -126.10 -125.70
## highest:  136.50  137.60  138.00  140.65  141.15
## --------------------------------------------------------------------------------
## Chlorides 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12157      638     1663        1  0.05482   0.3311   -0.489   -0.372 
##      .25      .50      .75      .90      .95 
##   -0.031    0.046    0.153    0.481    0.598 
## 
## lowest : -1.171 -1.170 -1.158 -1.156 -1.155, highest:  1.260  1.261  1.270  1.275  1.351
## --------------------------------------------------------------------------------
## FreeSulfurDioxide 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12148      647      999        1    30.85    155.2     -224     -171 
##      .25      .50      .75      .90      .95 
##        0       30       70      230      284 
## 
## lowest : -555 -546 -536 -535 -532, highest:  613  617  618  622  623
## --------------------------------------------------------------------------------
## TotalSulfurDioxide 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12113      682     1370        1    120.7    246.9   -273.0   -185.0 
##      .25      .50      .75      .90      .95 
##     27.0    123.0    208.0    421.8    513.4 
## 
## lowest : -823 -816 -793 -781 -779, highest: 1032 1041 1048 1054 1057
## --------------------------------------------------------------------------------
## Density 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0     5933        1   0.9942  0.02769   0.9488   0.9587 
##      .25      .50      .75      .90      .95 
##   0.9877   0.9945   1.0005   1.0295   1.0398 
## 
## lowest : 0.88809 0.88949 0.88978 0.88983 0.89167
## highest: 1.09658 1.09679 1.09695 1.09791 1.09924
## --------------------------------------------------------------------------------
## pH 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12400      395      497        1    3.208   0.7242     2.06     2.31 
##      .25      .50      .75      .90      .95 
##     2.96     3.20     3.47     4.10     4.37 
## 
## lowest : 0.48 0.53 0.54 0.58 0.59, highest: 5.91 5.94 6.02 6.05 6.13
## --------------------------------------------------------------------------------
## Sulphates 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    11585     1210      630        1   0.5271   0.9827    -1.05    -0.70 
##      .25      .50      .75      .90      .95 
##     0.28     0.50     0.86     1.77     2.09 
## 
## lowest : -3.13 -3.12 -3.10 -3.07 -3.03, highest:  4.11  4.16  4.19  4.21  4.24
## --------------------------------------------------------------------------------
## Alcohol 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12142      653      401        1    10.49    4.015      4.1      5.7 
##      .25      .50      .75      .90      .95 
##      9.0     10.4     12.4     15.2     16.7 
## 
## lowest : -4.7 -4.5 -4.4 -4.3 -4.1, highest: 25.4 25.6 26.0 26.1 26.5
## --------------------------------------------------------------------------------
## LabelAppeal 
##         n   missing  distinct      Info      Mean       Gmd 
##     12795         0         5     0.887 -0.009066    0.9566 
## 
## lowest : -2 -1  0  1  2, highest: -2 -1  0  1  2
##                                         
## Value         -2    -1     0     1     2
## Frequency    504  3136  5617  3048   490
## Proportion 0.039 0.245 0.439 0.238 0.038
## --------------------------------------------------------------------------------
## AcidIndex 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0       14    0.908    7.773    1.316        6        7 
##      .25      .50      .75      .90      .95 
##        7        8        8        9       10 
## 
## lowest :  4  5  6  7  8, highest: 13 14 15 16 17
##                                                                             
## Value          4     5     6     7     8     9    10    11    12    13    14
## Frequency      3    75  1197  4878  4142  1427   551   258   128    69    47
## Proportion 0.000 0.006 0.094 0.381 0.324 0.112 0.043 0.020 0.010 0.005 0.004
##                             
## Value         15    16    17
## Frequency      8     5     7
## Proportion 0.001 0.000 0.001
## --------------------------------------------------------------------------------
## STARS 
##        n  missing distinct     Info     Mean      Gmd 
##     9436     3359        4    0.899    2.042   0.9777 
##                                   
## Value          1     2     3     4
## Frequency   3042  3570  2212   612
## Proportion 0.322 0.378 0.234 0.065
## --------------------------------------------------------------------------------

Basic Plots

Looking at the histogram for each variable

hist.data.frame(train_df)

Everything looks pretty close to a normal distribution. Also, LabelAppeal, AcidIndex, and STARS are actually categorical variables.

NAs

Next we want to count the number of NAs.

colSums(is.na(train_df))
##           ï..INDEX             TARGET       FixedAcidity    VolatileAcidity 
##                  0                  0                  0                  0 
##         CitricAcid      ResidualSugar          Chlorides  FreeSulfurDioxide 
##                  0                616                638                647 
## TotalSulfurDioxide            Density                 pH          Sulphates 
##                682                  0                395               1210 
##            Alcohol        LabelAppeal          AcidIndex              STARS 
##                653                  0                  0               3359

Well, since we have almost 13,000 observations, we have decided that we can simply remove the NAs from the data set. Except for the STARS variable. Since it takes factors, we will simply turn NA into a factor.

Data Tidying

Data Types

We need to change the datatypes of a few variables.

train_df$LabelAppeal <- as.factor(train_df$LabelAppeal)
train_df$AcidIndex <- as.factor(train_df$AcidIndex)
train_df$STARS <- as.factor(train_df$STARS)
test_df$LabelAppeal <- as.factor(test_df$LabelAppeal)
test_df$AcidIndex <- as.factor(test_df$AcidIndex)
test_df$STARS <- as.factor(test_df$STARS)

NA handling

train_df$STARS <- addNA(train_df$STARS)
train_df <- na.omit(train_df)
test_df$STARS <- addNA(test_df$STARS)

Building Models

We will build a poisson model, and a negative binomial regression model.

Splitting the training data.

set.seed(123)
partition <- sample(1:nrow(train_df), size=nrow(train_df)*0.7,replace=FALSE)

train.data <-train_df[partition, ]
test.data <- train_df[-partition, ]

target <- test.data$`TARGET`
colnames(test.data) <- c("index", "TARGET", "FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar", "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "Density", "pH", "Sulphates", "Alcohol", "LabelAppeal", "AcidIndex", "STARS")
colnames(train.data) <- c("index", "TARGET", "FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar", "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "Density", "pH", "Sulphates", "Alcohol", "LabelAppeal", "AcidIndex", "STARS")

Poisson W/ All Variables

model1.a <- glm(TARGET ~., data=train.data, family = poisson)
summary(model1.a)
## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = train.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2220  -0.6228  -0.0136   0.4347   3.4509  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.407e+00  5.325e-01   4.519 6.21e-06 ***
## index              -6.786e-08  1.590e-06  -0.043 0.965952    
## FixedAcidity        5.335e-04  1.175e-03   0.454 0.649915    
## VolatileAcidity    -2.790e-02  9.421e-03  -2.962 0.003057 ** 
## CitricAcid         -6.226e-04  8.462e-03  -0.074 0.941349    
## ResidualSugar      -2.643e-05  2.181e-04  -0.121 0.903562    
## Chlorides          -6.161e-02  2.291e-02  -2.690 0.007155 ** 
## FreeSulfurDioxide   9.513e-05  4.954e-05   1.920 0.054833 .  
## TotalSulfurDioxide  8.573e-05  3.242e-05   2.645 0.008176 ** 
## Density            -6.018e-01  2.776e-01  -2.168 0.030174 *  
## pH                 -9.465e-03  1.091e-02  -0.867 0.385743    
## Sulphates          -1.157e-02  7.932e-03  -1.458 0.144718    
## Alcohol             4.095e-03  1.971e-03   2.077 0.037780 *  
## LabelAppeal-1       2.342e-01  5.464e-02   4.286 1.82e-05 ***
## LabelAppeal0        4.346e-01  5.311e-02   8.184 2.75e-16 ***
## LabelAppeal1        5.717e-01  5.403e-02  10.580  < 2e-16 ***
## LabelAppeal2        6.958e-01  6.110e-02  11.388  < 2e-16 ***
## AcidIndex5         -1.248e+00  4.573e-01  -2.730 0.006341 ** 
## AcidIndex6         -1.158e+00  4.491e-01  -2.578 0.009938 ** 
## AcidIndex7         -1.197e+00  4.487e-01  -2.667 0.007652 ** 
## AcidIndex8         -1.216e+00  4.487e-01  -2.710 0.006730 ** 
## AcidIndex9         -1.314e+00  4.492e-01  -2.924 0.003450 ** 
## AcidIndex10        -1.483e+00  4.507e-01  -3.291 0.001000 ***
## AcidIndex11        -1.909e+00  4.572e-01  -4.175 2.98e-05 ***
## AcidIndex12        -1.749e+00  4.607e-01  -3.797 0.000147 ***
## AcidIndex13        -1.707e+00  4.691e-01  -3.639 0.000273 ***
## AcidIndex14        -2.015e+00  4.966e-01  -4.058 4.94e-05 ***
## AcidIndex15        -1.165e+00  5.336e-01  -2.184 0.028967 *  
## AcidIndex16        -1.455e+01  2.847e+02  -0.051 0.959229    
## AcidIndex17        -1.377e+01  2.013e+02  -0.068 0.945472    
## STARS2              3.150e-01  2.085e-02  15.106  < 2e-16 ***
## STARS3              4.253e-01  2.278e-02  18.673  < 2e-16 ***
## STARS4              5.382e-01  3.230e-02  16.664  < 2e-16 ***
## STARSNA            -7.818e-01  2.885e-02 -27.100  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 10739.9  on 6071  degrees of freedom
## Residual deviance:  6297.7  on 6038  degrees of freedom
## AIC: 21589
## 
## Number of Fisher Scoring iterations: 10

Poisson W/ only the Significant Variables

model1.b<- glm(TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + Alcohol + LabelAppeal + AcidIndex + STARS ,data=train.data, family = poisson)
summary(model1.b)
## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + Alcohol + LabelAppeal + AcidIndex + 
##     STARS, family = poisson, data = train.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2506  -0.6271  -0.0154   0.4309   3.4213  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.357e+00  5.298e-01   4.449 8.61e-06 ***
## VolatileAcidity    -2.792e-02  9.417e-03  -2.965 0.003028 ** 
## Chlorides          -6.117e-02  2.288e-02  -2.674 0.007501 ** 
## FreeSulfurDioxide   9.260e-05  4.949e-05   1.871 0.061332 .  
## TotalSulfurDioxide  8.555e-05  3.241e-05   2.640 0.008298 ** 
## Density            -5.996e-01  2.776e-01  -2.160 0.030762 *  
## Alcohol             4.100e-03  1.970e-03   2.081 0.037423 *  
## LabelAppeal-1       2.355e-01  5.463e-02   4.311 1.63e-05 ***
## LabelAppeal0        4.360e-01  5.310e-02   8.211  < 2e-16 ***
## LabelAppeal1        5.723e-01  5.403e-02  10.593  < 2e-16 ***
## LabelAppeal2        6.964e-01  6.109e-02  11.400  < 2e-16 ***
## AcidIndex5         -1.236e+00  4.570e-01  -2.705 0.006826 ** 
## AcidIndex6         -1.147e+00  4.488e-01  -2.556 0.010581 *  
## AcidIndex7         -1.184e+00  4.483e-01  -2.641 0.008269 ** 
## AcidIndex8         -1.203e+00  4.484e-01  -2.683 0.007305 ** 
## AcidIndex9         -1.299e+00  4.487e-01  -2.895 0.003786 ** 
## AcidIndex10        -1.469e+00  4.502e-01  -3.262 0.001106 ** 
## AcidIndex11        -1.898e+00  4.568e-01  -4.156 3.24e-05 ***
## AcidIndex12        -1.736e+00  4.602e-01  -3.773 0.000161 ***
## AcidIndex13        -1.695e+00  4.687e-01  -3.617 0.000298 ***
## AcidIndex14        -2.000e+00  4.961e-01  -4.032 5.53e-05 ***
## AcidIndex15        -1.145e+00  5.332e-01  -2.148 0.031734 *  
## AcidIndex16        -1.452e+01  2.847e+02  -0.051 0.959315    
## AcidIndex17        -1.374e+01  2.013e+02  -0.068 0.945563    
## STARS2              3.147e-01  2.083e-02  15.106  < 2e-16 ***
## STARS3              4.264e-01  2.275e-02  18.744  < 2e-16 ***
## STARS4              5.388e-01  3.226e-02  16.702  < 2e-16 ***
## STARSNA            -7.825e-01  2.884e-02 -27.132  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 10739.9  on 6071  degrees of freedom
## Residual deviance:  6300.8  on 6044  degrees of freedom
## AIC: 21580
## 
## Number of Fisher Scoring iterations: 10

Negative Binomial Regression W/ All Variables

model2.a <- glm.nb(TARGET ~., data=train.data)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in glm.nb(TARGET ~ ., data = train.data): alternation limit reached
summary(model2.a)
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = train.data, init.theta = 41041.20828, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2219  -0.6228  -0.0136   0.4347   3.4508  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.407e+00  5.326e-01   4.519 6.21e-06 ***
## index              -6.787e-08  1.590e-06  -0.043 0.965949    
## FixedAcidity        5.335e-04  1.175e-03   0.454 0.649909    
## VolatileAcidity    -2.791e-02  9.421e-03  -2.962 0.003057 ** 
## CitricAcid         -6.224e-04  8.462e-03  -0.074 0.941372    
## ResidualSugar      -2.642e-05  2.182e-04  -0.121 0.903590    
## Chlorides          -6.161e-02  2.291e-02  -2.690 0.007156 ** 
## FreeSulfurDioxide   9.513e-05  4.954e-05   1.920 0.054838 .  
## TotalSulfurDioxide  8.574e-05  3.242e-05   2.645 0.008175 ** 
## Density            -6.018e-01  2.776e-01  -2.168 0.030179 *  
## pH                 -9.466e-03  1.091e-02  -0.867 0.385716    
## Sulphates          -1.157e-02  7.932e-03  -1.458 0.144719    
## Alcohol             4.095e-03  1.972e-03   2.077 0.037792 *  
## LabelAppeal-1       2.342e-01  5.464e-02   4.286 1.82e-05 ***
## LabelAppeal0        4.346e-01  5.311e-02   8.184 2.75e-16 ***
## LabelAppeal1        5.717e-01  5.403e-02  10.580  < 2e-16 ***
## LabelAppeal2        6.958e-01  6.110e-02  11.387  < 2e-16 ***
## AcidIndex5         -1.248e+00  4.573e-01  -2.729 0.006345 ** 
## AcidIndex6         -1.158e+00  4.491e-01  -2.578 0.009943 ** 
## AcidIndex7         -1.197e+00  4.487e-01  -2.667 0.007656 ** 
## AcidIndex8         -1.216e+00  4.488e-01  -2.710 0.006733 ** 
## AcidIndex9         -1.314e+00  4.492e-01  -2.924 0.003452 ** 
## AcidIndex10        -1.483e+00  4.507e-01  -3.290 0.001001 ** 
## AcidIndex11        -1.909e+00  4.573e-01  -4.174 2.99e-05 ***
## AcidIndex12        -1.749e+00  4.607e-01  -3.797 0.000147 ***
## AcidIndex13        -1.707e+00  4.692e-01  -3.639 0.000274 ***
## AcidIndex14        -2.015e+00  4.966e-01  -4.058 4.95e-05 ***
## AcidIndex15        -1.165e+00  5.337e-01  -2.184 0.028976 *  
## AcidIndex16        -3.929e+01  6.711e+07   0.000 1.000000    
## AcidIndex17        -3.851e+01  4.745e+07   0.000 0.999999    
## STARS2              3.150e-01  2.085e-02  15.106  < 2e-16 ***
## STARS3              4.253e-01  2.278e-02  18.672  < 2e-16 ***
## STARS4              5.382e-01  3.230e-02  16.663  < 2e-16 ***
## STARSNA            -7.818e-01  2.885e-02 -27.099  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(41041.22) family taken to be 1)
## 
##     Null deviance: 10739.4  on 6071  degrees of freedom
## Residual deviance:  6297.4  on 6038  degrees of freedom
## AIC: 21591
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  41041 
##           Std. Err.:  49314 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -21520.76

Negative Binomial Regression W/ Only Significant Variables

model2.b <- glm.nb(TARGET~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + Alcohol + LabelAppeal + AcidIndex + STARS , data=train.data)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in glm.nb(TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + :
## alternation limit reached
summary(model2.b)
## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + Alcohol + LabelAppeal + AcidIndex + 
##     STARS, data = train.data, init.theta = 41023.47334, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2505  -0.6271  -0.0154   0.4308   3.4212  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.357e+00  5.298e-01   4.449 8.62e-06 ***
## VolatileAcidity    -2.792e-02  9.418e-03  -2.965 0.003028 ** 
## Chlorides          -6.117e-02  2.288e-02  -2.674 0.007502 ** 
## FreeSulfurDioxide   9.260e-05  4.949e-05   1.871 0.061336 .  
## TotalSulfurDioxide  8.556e-05  3.241e-05   2.640 0.008298 ** 
## Density            -5.996e-01  2.776e-01  -2.160 0.030767 *  
## Alcohol             4.099e-03  1.970e-03   2.081 0.037436 *  
## LabelAppeal-1       2.355e-01  5.463e-02   4.311 1.63e-05 ***
## LabelAppeal0        4.360e-01  5.310e-02   8.211  < 2e-16 ***
## LabelAppeal1        5.723e-01  5.403e-02  10.593  < 2e-16 ***
## LabelAppeal2        6.964e-01  6.109e-02  11.400  < 2e-16 ***
## AcidIndex5         -1.236e+00  4.570e-01  -2.705 0.006829 ** 
## AcidIndex6         -1.147e+00  4.488e-01  -2.556 0.010586 *  
## AcidIndex7         -1.184e+00  4.483e-01  -2.641 0.008273 ** 
## AcidIndex8         -1.203e+00  4.484e-01  -2.682 0.007309 ** 
## AcidIndex9         -1.299e+00  4.488e-01  -2.895 0.003788 ** 
## AcidIndex10        -1.469e+00  4.502e-01  -3.262 0.001106 ** 
## AcidIndex11        -1.898e+00  4.568e-01  -4.156 3.24e-05 ***
## AcidIndex12        -1.736e+00  4.602e-01  -3.773 0.000161 ***
## AcidIndex13        -1.695e+00  4.687e-01  -3.617 0.000298 ***
## AcidIndex14        -2.000e+00  4.961e-01  -4.032 5.53e-05 ***
## AcidIndex15        -1.145e+00  5.332e-01  -2.148 0.031744 *  
## AcidIndex16        -3.926e+01  6.711e+07   0.000 1.000000    
## AcidIndex17        -3.848e+01  4.745e+07   0.000 0.999999    
## STARS2              3.147e-01  2.083e-02  15.106  < 2e-16 ***
## STARS3              4.264e-01  2.275e-02  18.744  < 2e-16 ***
## STARS4              5.388e-01  3.226e-02  16.701  < 2e-16 ***
## STARSNA            -7.825e-01  2.884e-02 -27.131  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(41023.49) family taken to be 1)
## 
##     Null deviance: 10739.4  on 6071  degrees of freedom
## Residual deviance:  6300.5  on 6044  degrees of freedom
## AIC: 21582
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  41023 
##           Std. Err.:  49274 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -21523.86

Model Selection

We will compare the models by they’re accuracy scores.

Model 1.A

predictions <- as.data.frame(predict(model1.a, newdata = test.data))

accuracy_df <- target

accuracy_df <- merge(accuracy_df, predictions)
accuracy_df$error <- abs(accuracy_df$x - accuracy_df$`predict(model1.a, newdata = test.data)`)
accuracy_df$error_percentage <- ((accuracy_df$`predict(model1.a, newdata = test.data)` - accuracy_df$x)/accuracy_df$x) * 100
avg_error <- mean(accuracy_df$error)
avg_percentage_error <- mean(accuracy_df$error_percentage)
print(avg_error)
## [1] 2.500863
print(avg_percentage_error)
## [1] NaN

Model 1.B

predictions <- as.data.frame(predict(model1.b, newdata = test.data))

accuracy_df <- target

accuracy_df <- merge(accuracy_df, predictions)
accuracy_df$error <- abs(accuracy_df$x - accuracy_df$`predict(model1.b, newdata = test.data)`)
accuracy_df$error_percentage <- ((accuracy_df$`predict(model1.b, newdata = test.data)` - accuracy_df$x)/accuracy_df$x) * 100
avg_error <- mean(accuracy_df$error)
avg_percentage_error <- mean(accuracy_df$error_percentage)
print(avg_error)
## [1] 2.500699
print(avg_percentage_error)
## [1] NaN

Model 2.A

predictions <- as.data.frame(predict(model2.a, newdata = test.data))

accuracy_df <- target

accuracy_df <- merge(accuracy_df, predictions)
accuracy_df$error <- abs(accuracy_df$x - accuracy_df$`predict(model2.a, newdata = test.data)`)
accuracy_df$error_percentage <- ((accuracy_df$`predict(model2.a, newdata = test.data)` - accuracy_df$x)/accuracy_df$x) * 100
avg_error <- mean(accuracy_df$error)
avg_percentage_error <- mean(accuracy_df$error_percentage)
print(avg_error)
## [1] 2.557892
print(avg_percentage_error)
## [1] NaN

Model 2.B

predictions <- as.data.frame(predict(model2.b, newdata = test.data))

accuracy_df <- target

accuracy_df <- merge(accuracy_df, predictions)
accuracy_df$error <- abs(accuracy_df$x - accuracy_df$`predict(model2.b, newdata = test.data)`)
accuracy_df$error_percentage <- ((accuracy_df$`predict(model2.b, newdata = test.data)` - accuracy_df$x)/accuracy_df$x) * 100
avg_error <- mean(accuracy_df$error)
avg_percentage_error <- mean(accuracy_df$error_percentage)
print(avg_error)
## [1] 2.557729
print(avg_percentage_error)
## [1] NaN

The NaNs are due to the fact that some of the wines simply didn’t sell a case. So that would make a 0 in the denomenator. Thus, it won’t work. We can still check the average rate however.

Our best model was model1.b with an average error rate of 2.500699

Predictions

predictions <- as.data.frame(predict(model1.b, newdata = test_df))
test_df$predictions <- predictions
results <- test_df[c("IN", "predictions")]