In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales. Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
train_df <- read.csv("https://raw.githubusercontent.com/tylerbaker01/data621/main/wine-training-data.csv")
test_df <- read.csv("https://raw.githubusercontent.com/tylerbaker01/data621/main/wine-evaluation-data.csv")
head(train_df)
## ï..INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## 1 1 3 3.2 1.160 -0.98 54.2
## 2 2 3 4.5 0.160 -0.81 26.1
## 3 4 5 7.1 2.640 -0.88 14.8
## 4 5 3 5.7 0.385 0.04 18.8
## 5 6 4 8.0 0.330 -1.26 9.4
## 6 7 0 11.3 0.320 0.59 2.2
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol
## 1 -0.567 NA 268 0.99280 3.33 -0.59 9.9
## 2 -0.425 15 -327 1.02792 3.38 0.70 NA
## 3 0.037 214 142 0.99518 3.12 0.48 22.0
## 4 -0.425 22 115 0.99640 2.24 1.83 6.2
## 5 NA -167 108 0.99457 3.12 1.77 13.7
## 6 0.556 -37 15 0.99940 3.20 1.29 15.4
## LabelAppeal AcidIndex STARS
## 1 0 8 2
## 2 -1 7 3
## 3 -1 8 3
## 4 -1 6 1
## 5 0 9 2
## 6 0 11 NA
describe(train_df)
## train_df
##
## 16 Variables 12795 Observations
## --------------------------------------------------------------------------------
## ï..INDEX
## n missing distinct Info Mean Gmd .05 .10
## 12795 0 12795 1 8070 5378 804.7 1610.4
## .25 .50 .75 .90 .95
## 4037.5 8110.0 12106.5 14515.6 15309.3
##
## lowest : 1 2 4 5 6, highest: 16120 16123 16127 16128 16129
## --------------------------------------------------------------------------------
## TARGET
## n missing distinct Info Mean Gmd
## 12795 0 9 0.962 3.029 2.141
##
## lowest : 0 1 2 3 4, highest: 4 5 6 7 8
##
## Value 0 1 2 3 4 5 6 7 8
## Frequency 2734 244 1091 2611 3177 2014 765 142 17
## Proportion 0.214 0.019 0.085 0.204 0.248 0.157 0.060 0.011 0.001
## --------------------------------------------------------------------------------
## FixedAcidity
## n missing distinct Info Mean Gmd .05 .10
## 12795 0 470 1 7.076 6.688 -3.6 -1.2
## .25 .50 .75 .90 .95
## 5.2 6.9 9.5 15.6 17.8
##
## lowest : -18.1 -18.0 -17.7 -17.5 -17.4, highest: 32.4 32.5 32.6 34.1 34.4
## --------------------------------------------------------------------------------
## VolatileAcidity
## n missing distinct Info Mean Gmd .05 .10
## 12795 0 815 1 0.3241 0.8262 -1.023 -0.720
## .25 .50 .75 .90 .95
## 0.130 0.280 0.640 1.350 1.640
##
## lowest : -2.790 -2.750 -2.745 -2.730 -2.720, highest: 3.500 3.550 3.565 3.590 3.680
## --------------------------------------------------------------------------------
## CitricAcid
## n missing distinct Info Mean Gmd .05 .10
## 12795 0 602 1 0.3084 0.9057 -1.16 -0.84
## .25 .50 .75 .90 .95
## 0.03 0.31 0.58 1.43 1.79
##
## lowest : -3.24 -3.16 -3.10 -3.08 -3.06, highest: 3.63 3.68 3.70 3.77 3.86
## --------------------------------------------------------------------------------
## ResidualSugar
## n missing distinct Info Mean Gmd .05 .10
## 12179 616 2077 1 5.419 35.31 -52.70 -39.66
## .25 .50 .75 .90 .95
## -2.00 3.90 15.90 49.72 62.70
##
## lowest : -127.80 -127.10 -126.20 -126.10 -125.70
## highest: 136.50 137.60 138.00 140.65 141.15
## --------------------------------------------------------------------------------
## Chlorides
## n missing distinct Info Mean Gmd .05 .10
## 12157 638 1663 1 0.05482 0.3311 -0.489 -0.372
## .25 .50 .75 .90 .95
## -0.031 0.046 0.153 0.481 0.598
##
## lowest : -1.171 -1.170 -1.158 -1.156 -1.155, highest: 1.260 1.261 1.270 1.275 1.351
## --------------------------------------------------------------------------------
## FreeSulfurDioxide
## n missing distinct Info Mean Gmd .05 .10
## 12148 647 999 1 30.85 155.2 -224 -171
## .25 .50 .75 .90 .95
## 0 30 70 230 284
##
## lowest : -555 -546 -536 -535 -532, highest: 613 617 618 622 623
## --------------------------------------------------------------------------------
## TotalSulfurDioxide
## n missing distinct Info Mean Gmd .05 .10
## 12113 682 1370 1 120.7 246.9 -273.0 -185.0
## .25 .50 .75 .90 .95
## 27.0 123.0 208.0 421.8 513.4
##
## lowest : -823 -816 -793 -781 -779, highest: 1032 1041 1048 1054 1057
## --------------------------------------------------------------------------------
## Density
## n missing distinct Info Mean Gmd .05 .10
## 12795 0 5933 1 0.9942 0.02769 0.9488 0.9587
## .25 .50 .75 .90 .95
## 0.9877 0.9945 1.0005 1.0295 1.0398
##
## lowest : 0.88809 0.88949 0.88978 0.88983 0.89167
## highest: 1.09658 1.09679 1.09695 1.09791 1.09924
## --------------------------------------------------------------------------------
## pH
## n missing distinct Info Mean Gmd .05 .10
## 12400 395 497 1 3.208 0.7242 2.06 2.31
## .25 .50 .75 .90 .95
## 2.96 3.20 3.47 4.10 4.37
##
## lowest : 0.48 0.53 0.54 0.58 0.59, highest: 5.91 5.94 6.02 6.05 6.13
## --------------------------------------------------------------------------------
## Sulphates
## n missing distinct Info Mean Gmd .05 .10
## 11585 1210 630 1 0.5271 0.9827 -1.05 -0.70
## .25 .50 .75 .90 .95
## 0.28 0.50 0.86 1.77 2.09
##
## lowest : -3.13 -3.12 -3.10 -3.07 -3.03, highest: 4.11 4.16 4.19 4.21 4.24
## --------------------------------------------------------------------------------
## Alcohol
## n missing distinct Info Mean Gmd .05 .10
## 12142 653 401 1 10.49 4.015 4.1 5.7
## .25 .50 .75 .90 .95
## 9.0 10.4 12.4 15.2 16.7
##
## lowest : -4.7 -4.5 -4.4 -4.3 -4.1, highest: 25.4 25.6 26.0 26.1 26.5
## --------------------------------------------------------------------------------
## LabelAppeal
## n missing distinct Info Mean Gmd
## 12795 0 5 0.887 -0.009066 0.9566
##
## lowest : -2 -1 0 1 2, highest: -2 -1 0 1 2
##
## Value -2 -1 0 1 2
## Frequency 504 3136 5617 3048 490
## Proportion 0.039 0.245 0.439 0.238 0.038
## --------------------------------------------------------------------------------
## AcidIndex
## n missing distinct Info Mean Gmd .05 .10
## 12795 0 14 0.908 7.773 1.316 6 7
## .25 .50 .75 .90 .95
## 7 8 8 9 10
##
## lowest : 4 5 6 7 8, highest: 13 14 15 16 17
##
## Value 4 5 6 7 8 9 10 11 12 13 14
## Frequency 3 75 1197 4878 4142 1427 551 258 128 69 47
## Proportion 0.000 0.006 0.094 0.381 0.324 0.112 0.043 0.020 0.010 0.005 0.004
##
## Value 15 16 17
## Frequency 8 5 7
## Proportion 0.001 0.000 0.001
## --------------------------------------------------------------------------------
## STARS
## n missing distinct Info Mean Gmd
## 9436 3359 4 0.899 2.042 0.9777
##
## Value 1 2 3 4
## Frequency 3042 3570 2212 612
## Proportion 0.322 0.378 0.234 0.065
## --------------------------------------------------------------------------------
Looking at the histogram for each variable
hist.data.frame(train_df)
Everything looks pretty close to a normal distribution. Also, LabelAppeal, AcidIndex, and STARS are actually categorical variables.
Next we want to count the number of NAs.
colSums(is.na(train_df))
## ï..INDEX TARGET FixedAcidity VolatileAcidity
## 0 0 0 0
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## 0 616 638 647
## TotalSulfurDioxide Density pH Sulphates
## 682 0 395 1210
## Alcohol LabelAppeal AcidIndex STARS
## 653 0 0 3359
Well, since we have almost 13,000 observations, we have decided that we can simply remove the NAs from the data set. Except for the STARS variable. Since it takes factors, we will simply turn NA into a factor.
We need to change the datatypes of a few variables.
train_df$LabelAppeal <- as.factor(train_df$LabelAppeal)
train_df$AcidIndex <- as.factor(train_df$AcidIndex)
train_df$STARS <- as.factor(train_df$STARS)
test_df$LabelAppeal <- as.factor(test_df$LabelAppeal)
test_df$AcidIndex <- as.factor(test_df$AcidIndex)
test_df$STARS <- as.factor(test_df$STARS)
train_df$STARS <- addNA(train_df$STARS)
train_df <- na.omit(train_df)
test_df$STARS <- addNA(test_df$STARS)
We will build a poisson model, and a negative binomial regression model.
set.seed(123)
partition <- sample(1:nrow(train_df), size=nrow(train_df)*0.7,replace=FALSE)
train.data <-train_df[partition, ]
test.data <- train_df[-partition, ]
target <- test.data$`TARGET`
colnames(test.data) <- c("index", "TARGET", "FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar", "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "Density", "pH", "Sulphates", "Alcohol", "LabelAppeal", "AcidIndex", "STARS")
colnames(train.data) <- c("index", "TARGET", "FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar", "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "Density", "pH", "Sulphates", "Alcohol", "LabelAppeal", "AcidIndex", "STARS")
model1.a <- glm(TARGET ~., data=train.data, family = poisson)
summary(model1.a)
##
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2220 -0.6228 -0.0136 0.4347 3.4509
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.407e+00 5.325e-01 4.519 6.21e-06 ***
## index -6.786e-08 1.590e-06 -0.043 0.965952
## FixedAcidity 5.335e-04 1.175e-03 0.454 0.649915
## VolatileAcidity -2.790e-02 9.421e-03 -2.962 0.003057 **
## CitricAcid -6.226e-04 8.462e-03 -0.074 0.941349
## ResidualSugar -2.643e-05 2.181e-04 -0.121 0.903562
## Chlorides -6.161e-02 2.291e-02 -2.690 0.007155 **
## FreeSulfurDioxide 9.513e-05 4.954e-05 1.920 0.054833 .
## TotalSulfurDioxide 8.573e-05 3.242e-05 2.645 0.008176 **
## Density -6.018e-01 2.776e-01 -2.168 0.030174 *
## pH -9.465e-03 1.091e-02 -0.867 0.385743
## Sulphates -1.157e-02 7.932e-03 -1.458 0.144718
## Alcohol 4.095e-03 1.971e-03 2.077 0.037780 *
## LabelAppeal-1 2.342e-01 5.464e-02 4.286 1.82e-05 ***
## LabelAppeal0 4.346e-01 5.311e-02 8.184 2.75e-16 ***
## LabelAppeal1 5.717e-01 5.403e-02 10.580 < 2e-16 ***
## LabelAppeal2 6.958e-01 6.110e-02 11.388 < 2e-16 ***
## AcidIndex5 -1.248e+00 4.573e-01 -2.730 0.006341 **
## AcidIndex6 -1.158e+00 4.491e-01 -2.578 0.009938 **
## AcidIndex7 -1.197e+00 4.487e-01 -2.667 0.007652 **
## AcidIndex8 -1.216e+00 4.487e-01 -2.710 0.006730 **
## AcidIndex9 -1.314e+00 4.492e-01 -2.924 0.003450 **
## AcidIndex10 -1.483e+00 4.507e-01 -3.291 0.001000 ***
## AcidIndex11 -1.909e+00 4.572e-01 -4.175 2.98e-05 ***
## AcidIndex12 -1.749e+00 4.607e-01 -3.797 0.000147 ***
## AcidIndex13 -1.707e+00 4.691e-01 -3.639 0.000273 ***
## AcidIndex14 -2.015e+00 4.966e-01 -4.058 4.94e-05 ***
## AcidIndex15 -1.165e+00 5.336e-01 -2.184 0.028967 *
## AcidIndex16 -1.455e+01 2.847e+02 -0.051 0.959229
## AcidIndex17 -1.377e+01 2.013e+02 -0.068 0.945472
## STARS2 3.150e-01 2.085e-02 15.106 < 2e-16 ***
## STARS3 4.253e-01 2.278e-02 18.673 < 2e-16 ***
## STARS4 5.382e-01 3.230e-02 16.664 < 2e-16 ***
## STARSNA -7.818e-01 2.885e-02 -27.100 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 10739.9 on 6071 degrees of freedom
## Residual deviance: 6297.7 on 6038 degrees of freedom
## AIC: 21589
##
## Number of Fisher Scoring iterations: 10
model1.b<- glm(TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + Alcohol + LabelAppeal + AcidIndex + STARS ,data=train.data, family = poisson)
summary(model1.b)
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + Alcohol + LabelAppeal + AcidIndex +
## STARS, family = poisson, data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2506 -0.6271 -0.0154 0.4309 3.4213
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.357e+00 5.298e-01 4.449 8.61e-06 ***
## VolatileAcidity -2.792e-02 9.417e-03 -2.965 0.003028 **
## Chlorides -6.117e-02 2.288e-02 -2.674 0.007501 **
## FreeSulfurDioxide 9.260e-05 4.949e-05 1.871 0.061332 .
## TotalSulfurDioxide 8.555e-05 3.241e-05 2.640 0.008298 **
## Density -5.996e-01 2.776e-01 -2.160 0.030762 *
## Alcohol 4.100e-03 1.970e-03 2.081 0.037423 *
## LabelAppeal-1 2.355e-01 5.463e-02 4.311 1.63e-05 ***
## LabelAppeal0 4.360e-01 5.310e-02 8.211 < 2e-16 ***
## LabelAppeal1 5.723e-01 5.403e-02 10.593 < 2e-16 ***
## LabelAppeal2 6.964e-01 6.109e-02 11.400 < 2e-16 ***
## AcidIndex5 -1.236e+00 4.570e-01 -2.705 0.006826 **
## AcidIndex6 -1.147e+00 4.488e-01 -2.556 0.010581 *
## AcidIndex7 -1.184e+00 4.483e-01 -2.641 0.008269 **
## AcidIndex8 -1.203e+00 4.484e-01 -2.683 0.007305 **
## AcidIndex9 -1.299e+00 4.487e-01 -2.895 0.003786 **
## AcidIndex10 -1.469e+00 4.502e-01 -3.262 0.001106 **
## AcidIndex11 -1.898e+00 4.568e-01 -4.156 3.24e-05 ***
## AcidIndex12 -1.736e+00 4.602e-01 -3.773 0.000161 ***
## AcidIndex13 -1.695e+00 4.687e-01 -3.617 0.000298 ***
## AcidIndex14 -2.000e+00 4.961e-01 -4.032 5.53e-05 ***
## AcidIndex15 -1.145e+00 5.332e-01 -2.148 0.031734 *
## AcidIndex16 -1.452e+01 2.847e+02 -0.051 0.959315
## AcidIndex17 -1.374e+01 2.013e+02 -0.068 0.945563
## STARS2 3.147e-01 2.083e-02 15.106 < 2e-16 ***
## STARS3 4.264e-01 2.275e-02 18.744 < 2e-16 ***
## STARS4 5.388e-01 3.226e-02 16.702 < 2e-16 ***
## STARSNA -7.825e-01 2.884e-02 -27.132 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 10739.9 on 6071 degrees of freedom
## Residual deviance: 6300.8 on 6044 degrees of freedom
## AIC: 21580
##
## Number of Fisher Scoring iterations: 10
model2.a <- glm.nb(TARGET ~., data=train.data)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in glm.nb(TARGET ~ ., data = train.data): alternation limit reached
summary(model2.a)
##
## Call:
## glm.nb(formula = TARGET ~ ., data = train.data, init.theta = 41041.20828,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2219 -0.6228 -0.0136 0.4347 3.4508
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.407e+00 5.326e-01 4.519 6.21e-06 ***
## index -6.787e-08 1.590e-06 -0.043 0.965949
## FixedAcidity 5.335e-04 1.175e-03 0.454 0.649909
## VolatileAcidity -2.791e-02 9.421e-03 -2.962 0.003057 **
## CitricAcid -6.224e-04 8.462e-03 -0.074 0.941372
## ResidualSugar -2.642e-05 2.182e-04 -0.121 0.903590
## Chlorides -6.161e-02 2.291e-02 -2.690 0.007156 **
## FreeSulfurDioxide 9.513e-05 4.954e-05 1.920 0.054838 .
## TotalSulfurDioxide 8.574e-05 3.242e-05 2.645 0.008175 **
## Density -6.018e-01 2.776e-01 -2.168 0.030179 *
## pH -9.466e-03 1.091e-02 -0.867 0.385716
## Sulphates -1.157e-02 7.932e-03 -1.458 0.144719
## Alcohol 4.095e-03 1.972e-03 2.077 0.037792 *
## LabelAppeal-1 2.342e-01 5.464e-02 4.286 1.82e-05 ***
## LabelAppeal0 4.346e-01 5.311e-02 8.184 2.75e-16 ***
## LabelAppeal1 5.717e-01 5.403e-02 10.580 < 2e-16 ***
## LabelAppeal2 6.958e-01 6.110e-02 11.387 < 2e-16 ***
## AcidIndex5 -1.248e+00 4.573e-01 -2.729 0.006345 **
## AcidIndex6 -1.158e+00 4.491e-01 -2.578 0.009943 **
## AcidIndex7 -1.197e+00 4.487e-01 -2.667 0.007656 **
## AcidIndex8 -1.216e+00 4.488e-01 -2.710 0.006733 **
## AcidIndex9 -1.314e+00 4.492e-01 -2.924 0.003452 **
## AcidIndex10 -1.483e+00 4.507e-01 -3.290 0.001001 **
## AcidIndex11 -1.909e+00 4.573e-01 -4.174 2.99e-05 ***
## AcidIndex12 -1.749e+00 4.607e-01 -3.797 0.000147 ***
## AcidIndex13 -1.707e+00 4.692e-01 -3.639 0.000274 ***
## AcidIndex14 -2.015e+00 4.966e-01 -4.058 4.95e-05 ***
## AcidIndex15 -1.165e+00 5.337e-01 -2.184 0.028976 *
## AcidIndex16 -3.929e+01 6.711e+07 0.000 1.000000
## AcidIndex17 -3.851e+01 4.745e+07 0.000 0.999999
## STARS2 3.150e-01 2.085e-02 15.106 < 2e-16 ***
## STARS3 4.253e-01 2.278e-02 18.672 < 2e-16 ***
## STARS4 5.382e-01 3.230e-02 16.663 < 2e-16 ***
## STARSNA -7.818e-01 2.885e-02 -27.099 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(41041.22) family taken to be 1)
##
## Null deviance: 10739.4 on 6071 degrees of freedom
## Residual deviance: 6297.4 on 6038 degrees of freedom
## AIC: 21591
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 41041
## Std. Err.: 49314
## Warning while fitting theta: alternation limit reached
##
## 2 x log-likelihood: -21520.76
model2.b <- glm.nb(TARGET~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + Alcohol + LabelAppeal + AcidIndex + STARS , data=train.data)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in glm.nb(TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + :
## alternation limit reached
summary(model2.b)
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + Alcohol + LabelAppeal + AcidIndex +
## STARS, data = train.data, init.theta = 41023.47334, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2505 -0.6271 -0.0154 0.4308 3.4212
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.357e+00 5.298e-01 4.449 8.62e-06 ***
## VolatileAcidity -2.792e-02 9.418e-03 -2.965 0.003028 **
## Chlorides -6.117e-02 2.288e-02 -2.674 0.007502 **
## FreeSulfurDioxide 9.260e-05 4.949e-05 1.871 0.061336 .
## TotalSulfurDioxide 8.556e-05 3.241e-05 2.640 0.008298 **
## Density -5.996e-01 2.776e-01 -2.160 0.030767 *
## Alcohol 4.099e-03 1.970e-03 2.081 0.037436 *
## LabelAppeal-1 2.355e-01 5.463e-02 4.311 1.63e-05 ***
## LabelAppeal0 4.360e-01 5.310e-02 8.211 < 2e-16 ***
## LabelAppeal1 5.723e-01 5.403e-02 10.593 < 2e-16 ***
## LabelAppeal2 6.964e-01 6.109e-02 11.400 < 2e-16 ***
## AcidIndex5 -1.236e+00 4.570e-01 -2.705 0.006829 **
## AcidIndex6 -1.147e+00 4.488e-01 -2.556 0.010586 *
## AcidIndex7 -1.184e+00 4.483e-01 -2.641 0.008273 **
## AcidIndex8 -1.203e+00 4.484e-01 -2.682 0.007309 **
## AcidIndex9 -1.299e+00 4.488e-01 -2.895 0.003788 **
## AcidIndex10 -1.469e+00 4.502e-01 -3.262 0.001106 **
## AcidIndex11 -1.898e+00 4.568e-01 -4.156 3.24e-05 ***
## AcidIndex12 -1.736e+00 4.602e-01 -3.773 0.000161 ***
## AcidIndex13 -1.695e+00 4.687e-01 -3.617 0.000298 ***
## AcidIndex14 -2.000e+00 4.961e-01 -4.032 5.53e-05 ***
## AcidIndex15 -1.145e+00 5.332e-01 -2.148 0.031744 *
## AcidIndex16 -3.926e+01 6.711e+07 0.000 1.000000
## AcidIndex17 -3.848e+01 4.745e+07 0.000 0.999999
## STARS2 3.147e-01 2.083e-02 15.106 < 2e-16 ***
## STARS3 4.264e-01 2.275e-02 18.744 < 2e-16 ***
## STARS4 5.388e-01 3.226e-02 16.701 < 2e-16 ***
## STARSNA -7.825e-01 2.884e-02 -27.131 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(41023.49) family taken to be 1)
##
## Null deviance: 10739.4 on 6071 degrees of freedom
## Residual deviance: 6300.5 on 6044 degrees of freedom
## AIC: 21582
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 41023
## Std. Err.: 49274
## Warning while fitting theta: alternation limit reached
##
## 2 x log-likelihood: -21523.86
We will compare the models by they’re accuracy scores.
predictions <- as.data.frame(predict(model1.a, newdata = test.data))
accuracy_df <- target
accuracy_df <- merge(accuracy_df, predictions)
accuracy_df$error <- abs(accuracy_df$x - accuracy_df$`predict(model1.a, newdata = test.data)`)
accuracy_df$error_percentage <- ((accuracy_df$`predict(model1.a, newdata = test.data)` - accuracy_df$x)/accuracy_df$x) * 100
avg_error <- mean(accuracy_df$error)
avg_percentage_error <- mean(accuracy_df$error_percentage)
print(avg_error)
## [1] 2.500863
print(avg_percentage_error)
## [1] NaN
predictions <- as.data.frame(predict(model1.b, newdata = test.data))
accuracy_df <- target
accuracy_df <- merge(accuracy_df, predictions)
accuracy_df$error <- abs(accuracy_df$x - accuracy_df$`predict(model1.b, newdata = test.data)`)
accuracy_df$error_percentage <- ((accuracy_df$`predict(model1.b, newdata = test.data)` - accuracy_df$x)/accuracy_df$x) * 100
avg_error <- mean(accuracy_df$error)
avg_percentage_error <- mean(accuracy_df$error_percentage)
print(avg_error)
## [1] 2.500699
print(avg_percentage_error)
## [1] NaN
predictions <- as.data.frame(predict(model2.a, newdata = test.data))
accuracy_df <- target
accuracy_df <- merge(accuracy_df, predictions)
accuracy_df$error <- abs(accuracy_df$x - accuracy_df$`predict(model2.a, newdata = test.data)`)
accuracy_df$error_percentage <- ((accuracy_df$`predict(model2.a, newdata = test.data)` - accuracy_df$x)/accuracy_df$x) * 100
avg_error <- mean(accuracy_df$error)
avg_percentage_error <- mean(accuracy_df$error_percentage)
print(avg_error)
## [1] 2.557892
print(avg_percentage_error)
## [1] NaN
predictions <- as.data.frame(predict(model2.b, newdata = test.data))
accuracy_df <- target
accuracy_df <- merge(accuracy_df, predictions)
accuracy_df$error <- abs(accuracy_df$x - accuracy_df$`predict(model2.b, newdata = test.data)`)
accuracy_df$error_percentage <- ((accuracy_df$`predict(model2.b, newdata = test.data)` - accuracy_df$x)/accuracy_df$x) * 100
avg_error <- mean(accuracy_df$error)
avg_percentage_error <- mean(accuracy_df$error_percentage)
print(avg_error)
## [1] 2.557729
print(avg_percentage_error)
## [1] NaN
The NaNs are due to the fact that some of the wines simply didn’t sell a case. So that would make a 0 in the denomenator. Thus, it won’t work. We can still check the average rate however.
Our best model was model1.b with an average error rate of 2.500699
predictions <- as.data.frame(predict(model1.b, newdata = test_df))
test_df$predictions <- predictions
results <- test_df[c("IN", "predictions")]