In this homework assignment, you will explore, analyze, and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely a wine is to be sold at a high-end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based on the wine characteristics. If the wine manufacturer can predict the number of cases, then the manufacturer will be able to adjust its wine offering to maximize sales. The objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target.
packages <- c('tidyverse', 'broom', 'caret', 'kableExtra', 'janitor', 'Hmisc', 'MASS', 'corrplot', 'Metrics', 'purrr', 'ggmosaic', 'pscl')
pacman::p_load(char = packages)
# read data
dfTrain <- read.csv("wine-training-data.csv", header = TRUE)
dfEval <- read.csv("wine-evaluation-data.csv", header = TRUE)
# dim
lapply(list(`wine-training` = dfTrain, `wine-evaluation` = dfEval), function(x) dim(x))
## $`wine-training`
## [1] 12795 16
##
## $`wine-evaluation`
## [1] 3335 16
# head
head(dfTrain)
## ï..INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## 1 1 3 3.2 1.160 -0.98 54.2
## 2 2 3 4.5 0.160 -0.81 26.1
## 3 4 5 7.1 2.640 -0.88 14.8
## 4 5 3 5.7 0.385 0.04 18.8
## 5 6 4 8.0 0.330 -1.26 9.4
## 6 7 0 11.3 0.320 0.59 2.2
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol
## 1 -0.567 NA 268 0.99280 3.33 -0.59 9.9
## 2 -0.425 15 -327 1.02792 3.38 0.70 NA
## 3 0.037 214 142 0.99518 3.12 0.48 22.0
## 4 -0.425 22 115 0.99640 2.24 1.83 6.2
## 5 NA -167 108 0.99457 3.12 1.77 13.7
## 6 0.556 -37 15 0.99940 3.20 1.29 15.4
## LabelAppeal AcidIndex STARS
## 1 0 8 2
## 2 -1 7 3
## 3 -1 8 3
## 4 -1 6 1
## 5 0 9 2
## 6 0 11 NA
# let's clean up variable names and remove the index column
dfTrain <- janitor::clean_names(dfTrain) %>% dplyr::select(-i_index)
dfEval <- janitor::clean_names(dfEval) %>% dplyr::select(-`in`)
# meta data
metaDataTrain <- data.frame(fields = names(dfTrain),
missing = colSums(is.na(dfTrain)),
unique = sapply(dfTrain, function(x) length(unique(x))),
type = sapply(dfTrain, class))
metaDataEval <- data.frame(fields = names(dfEval),
missing = colSums(is.na(dfEval)),
unique = sapply(dfEval, function(x) length(unique(x))),
type = sapply(dfEval, class))
metaDataTrain
## fields missing unique type
## target target 0 9 integer
## fixed_acidity fixed_acidity 0 470 numeric
## volatile_acidity volatile_acidity 0 815 numeric
## citric_acid citric_acid 0 602 numeric
## residual_sugar residual_sugar 616 2078 numeric
## chlorides chlorides 638 1664 numeric
## free_sulfur_dioxide free_sulfur_dioxide 647 1000 numeric
## total_sulfur_dioxide total_sulfur_dioxide 682 1371 numeric
## density density 0 5933 numeric
## p_h p_h 395 498 numeric
## sulphates sulphates 1210 631 numeric
## alcohol alcohol 653 402 numeric
## label_appeal label_appeal 0 5 integer
## acid_index acid_index 0 14 integer
## stars stars 3359 5 integer
metaDataEval
## fields missing unique type
## target target 3335 1 logical
## fixed_acidity fixed_acidity 0 366 numeric
## volatile_acidity volatile_acidity 0 533 numeric
## citric_acid citric_acid 0 448 numeric
## residual_sugar residual_sugar 168 1224 numeric
## chlorides chlorides 138 1051 numeric
## free_sulfur_dioxide free_sulfur_dioxide 152 700 numeric
## total_sulfur_dioxide total_sulfur_dioxide 157 940 numeric
## density density 0 2179 numeric
## p_h p_h 104 379 numeric
## sulphates sulphates 310 466 numeric
## alcohol alcohol 185 288 numeric
## label_appeal label_appeal 0 5 integer
## acid_index acid_index 0 13 integer
## stars stars 841 5 integer
# barplot
barplot(table(dfTrain$target), main = "Train Set : Target")
# corrplot
cor(dfTrain, method = "pearson", use = "complete.obs") %>%
corrplot::corrplot(., method = "shade", type = "upper")
# density curve
dfGather <- dfTrain %>%
dplyr::select(-label_appeal, -acid_index, -stars) %>%
dplyr::mutate(target = as.factor(target)) %>%
tidyr::gather(key, value, -target)
ggplot(dfGather, aes(value, color = target)) +
geom_density() +
geom_vline(data = aggregate(value ~ target + key, dfGather, median),
aes(xintercept = value,
color = target),
linetype = "dashed") +
facet_wrap(~ key, nrow = 5, scales = "free")
# mosaic plot
dfTrain %>%
ggplot(.) +
geom_mosaic(aes(x = product(target, label_appeal), fill = as.factor(target)), na.rm = TRUE) +
labs(x = "label_appeal", fill = "Target") +
ggtitle("Mosaic Plot : Label Appeal")
dfTrain %>%
ggplot(.) +
geom_mosaic(aes(x = product(target, acid_index), fill = as.factor(target)), na.rm = TRUE) +
labs(x = "acid_index", fill = "Target") +
ggtitle("Mosaic Plot : Acid Index")
dfTrain %>%
ggplot(.) +
geom_mosaic(aes(x = product(target, stars), fill = as.factor(target)), na.rm = TRUE) +
labs(x = "stars", fill = "Target") +
ggtitle("Mosaic Plot : Stars")
Most of the numeric variables look roughly normally distributed. None of the predictors correlates strongly with the target, and in fact the majority of the predictors do not correlate much with one another, except stars and label_appeal (not surprising). We also have many missing values that need to be handled first.
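The assignment hints that missingness itself can be predictive, and stars is missing for roughly a quarter of the training records. A minimal sketch of one way to capture that signal before imputing (the stars_missing flag is an illustration and is not used in the models below):
# flag records where stars was not rated, before median imputation fills the gap
dfTrainFlagged <- dfTrain %>%
  dplyr::mutate(stars_missing = as.integer(is.na(stars)))
dfEvalFlagged <- dfEval %>%
  dplyr::mutate(stars_missing = as.integer(is.na(stars)))
# the indicator could then be carried through the imputation and modeling steps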
# impute - dfTrain
preProcValuesTrain <- preProcess(dfTrain, method = c("medianImpute"))
dfTrain <- predict(preProcValuesTrain, dfTrain)
# impute - dfEval
preProcValuesEval <- preProcess(dfEval, method = c("medianImpute"))
dfEval <- predict(preProcValuesEval, dfEval)
# check again
sapply(list(dfTrain, dfEval), function(x) colSums(is.na(x)))
## [,1] [,2]
## target 0 3335
## fixed_acidity 0 0
## volatile_acidity 0 0
## citric_acid 0 0
## residual_sugar 0 0
## chlorides 0 0
## free_sulfur_dioxide 0 0
## total_sulfur_dioxide 0 0
## density 0 0
## p_h 0 0
## sulphates 0 0
## alcohol 0 0
## label_appeal 0 0
## acid_index 0 0
## stars 0 0
# split data
set.seed(1234)
index = sample(1:nrow(dfTrain), size = nrow(dfTrain) * .7, replace = FALSE)
trainSet <- dfTrain[index, ]
testSet <- dfTrain[-index, ]
# poisson models
poisson_model_1 <- glm(target ~ ., family = "poisson", data = trainSet)
poisson_model_1 <- step(poisson_model_1, direction = "backward")
## Start: AIC=35197.29
## target ~ fixed_acidity + volatile_acidity + citric_acid + residual_sugar +
## chlorides + free_sulfur_dioxide + total_sulfur_dioxide +
## density + p_h + sulphates + alcohol + label_appeal + acid_index +
## stars
##
## Df Deviance AIC
## - fixed_acidity 1 12808 35196
## - residual_sugar 1 12809 35196
## <none> 12808 35197
## - density 1 12811 35199
## - p_h 1 12813 35200
## - citric_acid 1 12814 35201
## - sulphates 1 12814 35202
## - alcohol 1 12816 35203
## - total_sulfur_dioxide 1 12816 35204
## - free_sulfur_dioxide 1 12820 35207
## - chlorides 1 12821 35208
## - volatile_acidity 1 12852 35239
## - acid_index 1 13393 35781
## - label_appeal 1 13573 35960
## - stars 1 13593 35980
##
## Step: AIC=35195.58
## target ~ volatile_acidity + citric_acid + residual_sugar + chlorides +
## free_sulfur_dioxide + total_sulfur_dioxide + density + p_h +
## sulphates + alcohol + label_appeal + acid_index + stars
##
## Df Deviance AIC
## - residual_sugar 1 12809 35195
## <none> 12808 35196
## - density 1 12812 35197
## - p_h 1 12813 35199
## - citric_acid 1 12814 35199
## - sulphates 1 12815 35200
## - alcohol 1 12816 35201
## - total_sulfur_dioxide 1 12817 35202
## - free_sulfur_dioxide 1 12820 35206
## - chlorides 1 12821 35207
## - volatile_acidity 1 12852 35237
## - acid_index 1 13413 35799
## - label_appeal 1 13574 35960
## - stars 1 13593 35979
##
## Step: AIC=35194.59
## target ~ volatile_acidity + citric_acid + chlorides + free_sulfur_dioxide +
## total_sulfur_dioxide + density + p_h + sulphates + alcohol +
## label_appeal + acid_index + stars
##
## Df Deviance AIC
## <none> 12809 35195
## - density 1 12813 35196
## - p_h 1 12814 35198
## - citric_acid 1 12815 35198
## - sulphates 1 12816 35199
## - alcohol 1 12817 35200
## - total_sulfur_dioxide 1 12818 35201
## - free_sulfur_dioxide 1 12821 35205
## - chlorides 1 12822 35206
## - volatile_acidity 1 12853 35236
## - acid_index 1 13414 35798
## - label_appeal 1 13575 35959
## - stars 1 13596 35979
summary(poisson_model_1)
##
## Call:
## glm(formula = target ~ volatile_acidity + citric_acid + chlorides +
## free_sulfur_dioxide + total_sulfur_dioxide + density + p_h +
## sulphates + alcohol + label_appeal + acid_index + stars,
## family = "poisson", data = trainSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4689 -0.5213 0.2005 0.6340 2.5391
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.025e+00 2.335e-01 8.674 < 2e-16 ***
## volatile_acidity -5.181e-02 7.831e-03 -6.616 3.68e-11 ***
## citric_acid 1.656e-02 6.989e-03 2.370 0.017812 *
## chlorides -7.126e-02 1.968e-02 -3.621 0.000293 ***
## free_sulfur_dioxide 1.452e-04 4.150e-05 3.497 0.000470 ***
## total_sulfur_dioxide 7.928e-05 2.698e-05 2.939 0.003297 **
## density -4.238e-01 2.284e-01 -1.855 0.063561 .
## p_h -2.068e-02 9.172e-03 -2.255 0.024140 *
## sulphates -1.759e-02 6.871e-03 -2.560 0.010467 *
## alcohol 4.627e-03 1.696e-03 2.729 0.006356 **
## label_appeal 1.996e-01 7.213e-03 27.677 < 2e-16 ***
## acid_index -1.249e-01 5.266e-03 -23.711 < 2e-16 ***
## stars 2.204e-01 7.791e-03 28.285 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 15963 on 8955 degrees of freedom
## Residual deviance: 12809 on 8943 degrees of freedom
## AIC: 35195
##
## Number of Fisher Scoring iterations: 5
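Before fitting a zero-inflated model, it is worth checking whether the plain Poisson fit under-predicts zeros. A minimal check along these lines, using the stepwise model above, could look like:
# observed zero counts in the training split
obs_zeros <- sum(trainSet$target == 0)
# expected zeros under the fitted Poisson model: sum over i of P(Y_i = 0 | mu_i)
exp_zeros <- sum(dpois(0, lambda = fitted(poisson_model_1)))
c(observed = obs_zeros, expected = round(exp_zeros))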
poisson_model_2 <- pscl::zeroinfl(target ~ ., data = trainSet)
summary(poisson_model_2)
##
## Call:
## pscl::zeroinfl(formula = target ~ ., data = trainSet)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.0030 -0.3625 0.1578 0.5023 4.2839
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.551e+00 2.474e-01 6.271 3.60e-10 ***
## fixed_acidity 4.890e-04 1.027e-03 0.476 0.63391
## volatile_acidity -1.526e-02 8.326e-03 -1.833 0.06679 .
## citric_acid -5.891e-05 7.294e-03 -0.008 0.99356
## residual_sugar -9.120e-05 1.953e-04 -0.467 0.64043
## chlorides -1.701e-02 2.071e-02 -0.822 0.41133
## free_sulfur_dioxide 4.438e-05 4.238e-05 1.047 0.29496
## total_sulfur_dioxide -3.596e-05 2.741e-05 -1.312 0.18961
## density -4.440e-01 2.423e-01 -1.832 0.06693 .
## p_h 6.828e-03 9.697e-03 0.704 0.48138
## sulphates 2.226e-03 7.265e-03 0.306 0.75930
## alcohol 7.806e-03 1.765e-03 4.422 9.77e-06 ***
## label_appeal 2.507e-01 7.618e-03 32.911 < 2e-16 ***
## acid_index -1.799e-02 5.965e-03 -3.017 0.00256 **
## stars 1.010e-01 7.778e-03 12.979 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.7934085 1.2454084 -3.849 0.000119 ***
## fixed_acidity 0.0064796 0.0052037 1.245 0.213060
## volatile_acidity 0.2376106 0.0419756 5.661 1.51e-08 ***
## citric_acid -0.0993046 0.0378625 -2.623 0.008722 **
## residual_sugar -0.0015738 0.0009954 -1.581 0.113839
## chlorides 0.3491774 0.1049534 3.327 0.000878 ***
## free_sulfur_dioxide -0.0006455 0.0002220 -2.907 0.003647 **
## total_sulfur_dioxide -0.0007660 0.0001437 -5.330 9.85e-08 ***
## density -0.1038197 1.2239272 -0.085 0.932400
## p_h 0.1707674 0.0490124 3.484 0.000494 ***
## sulphates 0.1254033 0.0367406 3.413 0.000642 ***
## alcohol 0.0156930 0.0090488 1.734 0.082872 .
## label_appeal 0.3154615 0.0397371 7.939 2.04e-15 ***
## acid_index 0.4904284 0.0237354 20.662 < 2e-16 ***
## stars -0.6951840 0.0463883 -14.986 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 35
## Log-likelihood: -1.575e+04 on 30 Df
poisson_model_2$coefficients
## $count
## (Intercept) fixed_acidity volatile_acidity
## 1.551071e+00 4.890169e-04 -1.526259e-02
## citric_acid residual_sugar chlorides
## -5.890782e-05 -9.120066e-05 -1.701147e-02
## free_sulfur_dioxide total_sulfur_dioxide density
## 4.438266e-05 -3.595520e-05 -4.440066e-01
## p_h sulphates alcohol
## 6.827573e-03 2.226029e-03 7.806270e-03
## label_appeal acid_index stars
## 2.507153e-01 -1.799343e-02 1.009544e-01
##
## $zero
## (Intercept) fixed_acidity volatile_acidity
## -4.7934084628 0.0064795831 0.2376106142
## citric_acid residual_sugar chlorides
## -0.0993046359 -0.0015738373 0.3491773765
## free_sulfur_dioxide total_sulfur_dioxide density
## -0.0006454633 -0.0007660401 -0.1038197157
## p_h sulphates alcohol
## 0.1707674358 0.1254033497 0.0156929524
## label_appeal acid_index stars
## 0.3154615094 0.4904283874 -0.6951839602
# Vuong comparison
pscl::vuong(poisson_model_1, poisson_model_2)
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -32.10931 model2 > model1 < 2.22e-16
## AIC-corrected -31.81212 model2 > model1 < 2.22e-16
## BIC-corrected -30.75709 model2 > model1 < 2.22e-16
Overall, stronger sales are associated with higher citric acid, higher free and total sulfur dioxide, higher alcohol content, better label appeal, and more stars. On the other hand, higher density, more sulphates, more chlorides, and a higher acid index are associated with weaker sales. We then run the Vuong test to compare the first model with the zero-inflated model; as the output shows, the zero-inflated model outperforms the plain Poisson model.
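Because the Poisson model uses a log link, exponentiating the coefficients gives multiplicative effects on the expected number of cases, which is an easier way to read the directions discussed above. A quick sketch, along with an AIC comparison of the two Poisson-based models:
# rate ratios for the backward-selected Poisson model: values above 1 increase expected cases
round(exp(coef(poisson_model_1)), 3)
# compare the plain Poisson and zero-inflated Poisson fits by AIC (lower is better)
AIC(poisson_model_1, poisson_model_2)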
# negative binomial models
negative_binomial_3 <- glm.nb(target ~ ., data = trainSet)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(negative_binomial_3)
##
## Call:
## glm.nb(formula = target ~ ., data = trainSet, init.theta = 40280.1003,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4671 -0.5236 0.1992 0.6334 2.5376
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.025e+00 2.335e-01 8.675 < 2e-16 ***
## fixed_acidity -5.329e-04 9.818e-04 -0.543 0.587304
## volatile_acidity -5.178e-02 7.832e-03 -6.611 3.82e-11 ***
## citric_acid 1.667e-02 6.990e-03 2.384 0.017107 *
## residual_sugar 1.846e-04 1.857e-04 0.994 0.320274
## chlorides -7.117e-02 1.968e-02 -3.617 0.000298 ***
## free_sulfur_dioxide 1.448e-04 4.151e-05 3.488 0.000486 ***
## total_sulfur_dioxide 7.863e-05 2.699e-05 2.914 0.003574 **
## density -4.242e-01 2.284e-01 -1.857 0.063340 .
## p_h -2.090e-02 9.176e-03 -2.278 0.022728 *
## sulphates -1.744e-02 6.873e-03 -2.538 0.011164 *
## alcohol 4.668e-03 1.696e-03 2.752 0.005928 **
## label_appeal 1.995e-01 7.214e-03 27.661 < 2e-16 ***
## acid_index -1.244e-01 5.325e-03 -23.366 < 2e-16 ***
## stars 2.202e-01 7.793e-03 28.261 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40280.1) family taken to be 1)
##
## Null deviance: 15962 on 8955 degrees of freedom
## Residual deviance: 12807 on 8941 degrees of freedom
## AIC: 35199
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40280
## Std. Err.: 71902
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -35167.38
# let's remove two variables from above model that are not significant and try again
negative_binomial_4 <- glm.nb(target ~ ., data = trainSet %>% dplyr::select(-fixed_acidity, -residual_sugar))
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(negative_binomial_4)
##
## Call:
## glm.nb(formula = target ~ ., data = trainSet %>% dplyr::select(-fixed_acidity,
## -residual_sugar), init.theta = 40292.42361, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4688 -0.5212 0.2005 0.6339 2.5390
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.025e+00 2.335e-01 8.674 < 2e-16 ***
## volatile_acidity -5.181e-02 7.831e-03 -6.616 3.68e-11 ***
## citric_acid 1.656e-02 6.989e-03 2.369 0.017817 *
## chlorides -7.126e-02 1.968e-02 -3.621 0.000293 ***
## free_sulfur_dioxide 1.452e-04 4.151e-05 3.497 0.000470 ***
## total_sulfur_dioxide 7.928e-05 2.698e-05 2.939 0.003298 **
## density -4.238e-01 2.285e-01 -1.855 0.063570 .
## p_h -2.068e-02 9.172e-03 -2.255 0.024143 *
## sulphates -1.759e-02 6.871e-03 -2.560 0.010468 *
## alcohol 4.627e-03 1.696e-03 2.729 0.006358 **
## label_appeal 1.996e-01 7.213e-03 27.676 < 2e-16 ***
## acid_index -1.249e-01 5.266e-03 -23.711 < 2e-16 ***
## stars 2.204e-01 7.791e-03 28.284 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40292.42) family taken to be 1)
##
## Null deviance: 15962 on 8955 degrees of freedom
## Residual deviance: 12809 on 8943 degrees of freedom
## AIC: 35197
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40292
## Std. Err.: 71929
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -35168.68
The two negative binomial models perform almost identically, so we prefer the one with fewer variables, i.e. the more parsimonious model.
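As a sanity check on that choice, the two fits can also be compared directly on AIC and log-likelihood; a brief sketch:
# the reduced model drops two terms at essentially no cost in log-likelihood
AIC(negative_binomial_3, negative_binomial_4)
sapply(list(nb_3 = negative_binomial_3, nb_4 = negative_binomial_4), logLik)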
# multi-linear model (backward stepwise selection; trace suppressed for brevity)
multi_linear_5 <- lm(target ~ ., trainSet)
multi_linear_5 <- step(multi_linear_5, direction = "backward", trace = 0)
summary(multi_linear_5)
##
## Call:
## lm(formula = target ~ volatile_acidity + citric_acid + chlorides +
## free_sulfur_dioxide + total_sulfur_dioxide + density + p_h +
## sulphates + alcohol + label_appeal + acid_index + stars,
## data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1942 -0.7512 0.3593 1.1162 4.3042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.347e+00 6.533e-01 8.184 3.11e-16 ***
## volatile_acidity -1.548e-01 2.202e-02 -7.030 2.22e-12 ***
## citric_acid 5.099e-02 1.975e-02 2.582 0.009833 **
## chlorides -2.260e-01 5.508e-02 -4.104 4.10e-05 ***
## free_sulfur_dioxide 4.422e-04 1.170e-04 3.780 0.000158 ***
## total_sulfur_dioxide 2.305e-04 7.561e-05 3.049 0.002306 **
## density -1.240e+00 6.411e-01 -1.935 0.053080 .
## p_h -5.085e-02 2.569e-02 -1.980 0.047764 *
## sulphates -4.925e-02 1.926e-02 -2.557 0.010564 *
## alcohol 1.692e-02 4.750e-03 3.561 0.000371 ***
## label_appeal 6.034e-01 2.015e-02 29.952 < 2e-16 ***
## acid_index -3.322e-01 1.304e-02 -25.469 < 2e-16 ***
## stars 7.481e-01 2.333e-02 32.067 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.62 on 8943 degrees of freedom
## Multiple R-squared: 0.292, Adjusted R-squared: 0.2911
## F-statistic: 307.4 on 12 and 8943 DF, p-value: < 2.2e-16
# try two-way interactions (backward stepwise selection; trace suppressed for brevity)
multi_linear_6 <- lm(target ~ .^2, trainSet)
multi_linear_6 <- step(multi_linear_6, direction = "backward", trace = 0)
summary(multi_linear_6)
##
## Call:
## lm(formula = target ~ fixed_acidity + volatile_acidity + citric_acid +
## chlorides + free_sulfur_dioxide + total_sulfur_dioxide +
## density + p_h + sulphates + alcohol + label_appeal + acid_index +
## stars + fixed_acidity:free_sulfur_dioxide + fixed_acidity:density +
## fixed_acidity:acid_index + volatile_acidity:density + volatile_acidity:p_h +
## volatile_acidity:stars + citric_acid:free_sulfur_dioxide +
## citric_acid:total_sulfur_dioxide + citric_acid:sulphates +
## citric_acid:label_appeal + chlorides:acid_index + chlorides:stars +
## free_sulfur_dioxide:density + free_sulfur_dioxide:p_h + free_sulfur_dioxide:acid_index +
## total_sulfur_dioxide:density + total_sulfur_dioxide:acid_index +
## density:alcohol + sulphates:stars + alcohol:label_appeal +
## alcohol:acid_index + alcohol:stars + label_appeal:acid_index +
## label_appeal:stars + acid_index:stars, data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2526 -0.7560 0.3177 1.1023 5.0959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.162e+01 2.189e+00 5.309 1.13e-07 ***
## fixed_acidity -2.708e-01 9.910e-02 -2.733 0.006290 **
## volatile_acidity -1.485e+00 7.972e-01 -1.862 0.062610 .
## citric_acid 3.550e-02 2.574e-02 1.379 0.167843
## chlorides 6.291e-01 3.522e-01 1.786 0.074087 .
## free_sulfur_dioxide -8.892e-03 4.350e-03 -2.044 0.040948 *
## total_sulfur_dioxide 3.769e-03 2.770e-03 1.361 0.173612
## density -6.630e+00 2.156e+00 -3.075 0.002114 **
## p_h -3.500e-02 2.766e-02 -1.265 0.205821
## sulphates -1.642e-01 5.458e-02 -3.009 0.002625 **
## alcohol -2.432e-01 1.806e-01 -1.347 0.178136
## label_appeal 6.317e-01 1.442e-01 4.381 1.19e-05 ***
## acid_index -3.767e-01 5.827e-02 -6.465 1.07e-10 ***
## stars -2.632e-01 1.696e-01 -1.552 0.120620
## fixed_acidity:free_sulfur_dioxide -4.458e-05 1.872e-05 -2.381 0.017292 *
## fixed_acidity:density 3.043e-01 9.886e-02 3.079 0.002086 **
## fixed_acidity:acid_index -4.135e-03 1.918e-03 -2.155 0.031153 *
## volatile_acidity:density 1.381e+00 7.928e-01 1.742 0.081478 .
## volatile_acidity:p_h -4.838e-02 3.232e-02 -1.497 0.134390
## volatile_acidity:stars 5.831e-02 2.836e-02 2.056 0.039788 *
## citric_acid:free_sulfur_dioxide 3.082e-04 1.327e-04 2.322 0.020254 *
## citric_acid:total_sulfur_dioxide -2.010e-04 8.593e-05 -2.339 0.019339 *
## citric_acid:sulphates 5.731e-02 2.171e-02 2.640 0.008309 **
## citric_acid:label_appeal 4.534e-02 2.147e-02 2.112 0.034718 *
## chlorides:acid_index -1.354e-01 3.953e-02 -3.425 0.000617 ***
## chlorides:stars 1.023e-01 7.139e-02 1.433 0.151797
## free_sulfur_dioxide:density 7.817e-03 4.296e-03 1.820 0.068811 .
## free_sulfur_dioxide:p_h -2.620e-04 1.673e-04 -1.566 0.117342
## free_sulfur_dioxide:acid_index 3.347e-04 9.271e-05 3.610 0.000308 ***
## total_sulfur_dioxide:density -4.461e-03 2.752e-03 -1.621 0.105034
## total_sulfur_dioxide:acid_index 1.235e-04 5.775e-05 2.139 0.032478 *
## density:alcohol 2.894e-01 1.793e-01 1.613 0.106678
## sulphates:stars 4.727e-02 2.516e-02 1.879 0.060266 .
## alcohol:label_appeal -8.775e-03 5.677e-03 -1.546 0.122238
## alcohol:acid_index -9.119e-03 3.492e-03 -2.611 0.009032 **
## alcohol:stars 2.064e-02 6.357e-03 3.247 0.001172 **
## label_appeal:acid_index -7.326e-02 1.502e-02 -4.879 1.09e-06 ***
## label_appeal:stars 3.099e-01 2.468e-02 12.556 < 2e-16 ***
## acid_index:stars 8.903e-02 1.953e-02 4.559 5.20e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.595 on 8917 degrees of freedom
## Multiple R-squared: 0.3162, Adjusted R-squared: 0.3133
## F-statistic: 108.5 on 38 and 8917 DF, p-value: < 2.2e-16
The second multiple linear regression model (multi_linear_6) fits better than the first one (multi_linear_5), e.g. an adjusted R-squared of 0.3133 versus 0.2911; however, it relies on two-way interactions, which makes it more complicated and harder to explain to stakeholders, and it may be prone to overfitting.
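One way to check whether the interaction model is overfitting, since caret is already loaded, would be a quick k-fold cross-validation of both linear specifications; a sketch, not part of the original workflow:
# 5-fold cross-validated RMSE for the main-effects and two-way interaction specifications
cvCtrl <- caret::trainControl(method = "cv", number = 5)
cv_lm_5 <- caret::train(target ~ ., data = trainSet, method = "lm", trControl = cvCtrl)
cv_lm_6 <- caret::train(target ~ .^2, data = trainSet, method = "lm", trControl = cvCtrl)
# similar cross-validated RMSE values would suggest the interactions add little out of sample
c(main_effects = cv_lm_5$results$RMSE, interactions = cv_lm_6$results$RMSE)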
# check the poisson assumption that the mean and variance of the target are equal
targetMean <- round(mean(trainSet$target), 3)
targetVar <- round(var(trainSet$target), 3)
print(paste0("the mean of the target is approximately ", targetMean))
## [1] "the mean of the target is approximately 3.025"
print(paste0("the variance of the target is approximately ", targetVar))
## [1] "the variance of the target is approximately 3.703"
Since the mean and variance of the target are not equal (the variance is noticeably larger than the mean), we suspect overdispersion, which makes a plain Poisson model less than ideal for this data.
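A more formal check is the Pearson dispersion statistic of the fitted Poisson model, where values well above 1 indicate overdispersion; a minimal sketch:
# Pearson chi-square statistic divided by the residual degrees of freedom
dispersion <- sum(residuals(poisson_model_1, type = "pearson")^2) / df.residual(poisson_model_1)
dispersion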
# prediction accuracy
testSet <- testSet %>%
dplyr::mutate(pred_nb_3 = round(predict(negative_binomial_3, newdata = testSet, type = "response"), 0),
pred_nb_4 = round(predict(negative_binomial_4, newdata = testSet, type = "response"), 0),
pred_lm_5 = round(predict(multi_linear_5, newdata = testSet, type = "response"), 0),
pred_lm_6 = round(predict(multi_linear_6, newdata = testSet, type = "response"), 0),
pred_nb_3_accuracy = dplyr::case_when(pred_nb_3 == target ~ 1, TRUE ~ 0),
pred_nb_4_accuracy = dplyr::case_when(pred_nb_4 == target ~ 1, TRUE ~ 0),
pred_lm_5_accuracy = dplyr::case_when(pred_lm_5 == target ~ 1, TRUE ~ 0),
pred_lm_6_accuracy = dplyr::case_when(pred_lm_6 == target ~ 1, TRUE ~ 0))
accuracy <- lapply(testSet %>% dplyr::select(pred_nb_3_accuracy, pred_nb_4_accuracy, pred_lm_5_accuracy, pred_lm_6_accuracy),
sum) %>% unlist
print(round(accuracy / nrow(testSet), 3)) %>% kable()
## pred_nb_3_accuracy pred_nb_4_accuracy pred_lm_5_accuracy pred_lm_6_accuracy
## 0.238 0.238 0.258 0.255
| model | accuracy |
|---|---|
| pred_nb_3_accuracy | 0.238 |
| pred_nb_4_accuracy | 0.238 |
| pred_lm_5_accuracy | 0.258 |
| pred_lm_6_accuracy | 0.255 |
# mean squared error
mse <- lapply(list(negative_binomial_3, negative_binomial_4, multi_linear_5, multi_linear_6),
function(x) mean(x$residuals ^2))
names(mse) <- c("negative_binomial_3: mse", "negative_binomial_4: mse", "multi_linear_5: mse", "multi_linear_6: mse")
print(sapply(mse, function(x) round(x, 3))) %>% kable()
## negative_binomial_3: mse negative_binomial_4: mse multi_linear_5: mse
## 0.382 0.382 2.621
## multi_linear_6: mse
## 2.532
| model | mse |
|---|---|
| negative_binomial_3: mse | 0.382 |
| negative_binomial_4: mse | 0.382 |
| multi_linear_5: mse | 2.621 |
| multi_linear_6: mse | 2.532 |
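Note that the values above come from each model's internal residuals: for the glm.nb fits, $residuals are working residuals on the link scale, while for lm they are on the response scale, so the two families are not strictly comparable. A fairer comparison uses the test-set predictions already stored above; a sketch:
# test-set RMSE on the response (case-count) scale, using the predictions computed earlier
rmse_test <- sapply(testSet[c("pred_nb_3", "pred_nb_4", "pred_lm_5", "pred_lm_6")],
                    function(p) sqrt(mean((testSet$target - p)^2)))
round(rmse_test, 3)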
# evaluation set
dfEval <- dfEval %>%
dplyr::mutate(pred = round(predict(negative_binomial_4, newdata = dfEval, type = "response"), 0))
# barplot for evaluation set (side by side with the train set)
par(mfrow = c(1, 2))
barplot(table(dfTrain$target), main = "Train Set : Target")
barplot(table(dfEval$pred), main = "Evaluation Set : Target Prediction")
Since the negative binomial models report their results with an AIC and log-likelihood, whereas the multiple linear regression models report an adjusted R-squared, we cannot compare their outputs directly. Instead, we look at the accuracy of their predictions and at the mean squared error. Overall accuracy does not differ much across the models; the linear models do slightly better than the negative binomial models (roughly 26% versus 24%). The mean squared errors differ far more, although, as noted above, the negative binomial residuals are on the link scale, so that gap is overstated. Because the target is a count, we choose negative_binomial_4, which is essentially model 3 with the two variables that showed no predictive power removed, i.e. fixed_acidity and residual_sugar.
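Finally, the scored evaluation set would typically be written out for submission; a minimal sketch (the file name is an assumption):
# write the evaluation-set predictions to disk (file name is illustrative)
write.csv(dfEval, "wine-evaluation-predictions.csv", row.names = FALSE)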